Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Microsoft is giving away 50,000 FREE Microsoft Certification exam vouchers. Get Fabric certified for FREE! Learn more

ibarrau

Data Wrangler - The python transformation method similar to PowerBi Query Editor

To begin, let’s look at the definition of Data Wrangler.

 

“It’s a data cleaning tool centered around code, integrated into Visual Studio Code. It aims to increase the productivity of data experts by automating core functions of the Pandas library and providing useful insights on columns.”

 

This, roughly translated, is the definition given by the creators. While at the time of writing this post, it only supported Pandas, it now also works with PySpark Frames.

To find the tool, simply search for it in VS Code, or you can learn more about the open-source project on its GitHub repository. There, you will also find download options.

Once installed, opening it is as simple as performing a “head” for a Pandas DataFrame:

ibarrau_0-1742487116001.png

It suggest a button to launch Data Wrangler extension.

The UI will fastly show usefull information. Let's see the sections one by one:

ibarrau_1-1742487185625.png

  1. Columns with distribution and statistics that allow us to understand them more quickly.
  2. Transformation actions available to execute, separated by categories.
  3. List of transformation steps executed to review the history.
  4. Generated code. This section is ideal for learning more about the language, as it shows us how an action is executed and also allows us to modify it as we wish.

On the left-hand side, we see the transformations we can choose by selecting a column beforehand. For example, let’s change the format of our "Order Date" column. We select the "DateTime formatting example" from the format options to display it exactly how we want the format to be. This generates a new column that, by defining the example, will be completed and created after applying it:

ibarrau_2-1742487245375.png

If we go to the previous step, we will be able to see the executed code as well as modify it. If we wanted to change the name of the new column derivedCol to "Fecha de Orden," we would simply go to the underlined code where the name appears and change it.

 

NOTE: We can only change the code from the previous step. Older steps are kept in preview to avoid overloading the memory.

 

There are traditional text operations like split, where we can simply use the delimiter to adjust the columns.

ibarrau_3-1742487327620.png

In case we know more about the language, we also have a space to write a Python formula for a new or existing column. For example, a simple mathematical operation:

ibarrau_4-1742487407298.png

Some other examples I’ve been testing include replacing text strings, deleting columns, adding columns, for example, and changing data types.

If at any point we want to go back, we have a "Back to Notebook" button that would allow us to reorder code and/or manually adjust everything that was generated.

Keep in mind that if we go back to the notebook, we won’t be able to return to the wrangler exactly as it was with its steps to view and edit. Instead, we would need to generate a new “head()” from the most recent cleaning step's dataframe to restart it and execute the missing steps.

 

Integration with Fabric

In addition to the local use mentioned earlier, the tool was integrated with Microsoft Build 2023 releases. Fabric Notebooks can not only be opened in Visual Studio Code for use from there, but they also have a "data" tab that detects pandas frames to open Data Wrangler.

ibarrau_5-1742487414173.png

That way we will open something similar to what we have seen at VScode.

ibarrau_6-1742487472109.png

 

Although it looks identical but in a light version, it has some differences. The integration with Fabric doesn’t have all the local options, such as "Create column from formula." We don’t have the option to write the code for a column. The rest is quite similar to what we mentioned earlier.

 

Conclusion

This is an excellent tool that can strengthen our beginnings with Python to improve the user experience and learn about the code by seeing what it generates after applying the operations.

If we compare it to a Power BI query editor, it feels a bit rough. However, it has a lot of potential and room for growth to improve.

A great alternative for those who haven’t quite gotten comfortable with Python and need to perform some transformations between layers of a Lake or data ingestion.