Microsoft is giving away 50,000 FREE Microsoft Certification exam vouchers. Get Fabric certified for FREE! Learn more
To begin, let’s look at the definition of Data Wrangler.
“It’s a data cleaning tool centered around code, integrated into Visual Studio Code. It aims to increase the productivity of data experts by automating core functions of the Pandas library and providing useful insights on columns.”
This, roughly translated, is the definition given by the creators. While at the time of writing this post, it only supported Pandas, it now also works with PySpark Frames.
To find the tool, simply search for it in VS Code, or you can learn more about the open-source project on its GitHub repository. There, you will also find download options.
Once installed, opening it is as simple as performing a “head” for a Pandas DataFrame:
It suggest a button to launch Data Wrangler extension.
The UI will fastly show usefull information. Let's see the sections one by one:
On the left-hand side, we see the transformations we can choose by selecting a column beforehand. For example, let’s change the format of our "Order Date" column. We select the "DateTime formatting example" from the format options to display it exactly how we want the format to be. This generates a new column that, by defining the example, will be completed and created after applying it:
If we go to the previous step, we will be able to see the executed code as well as modify it. If we wanted to change the name of the new column derivedCol to "Fecha de Orden," we would simply go to the underlined code where the name appears and change it.
NOTE: We can only change the code from the previous step. Older steps are kept in preview to avoid overloading the memory.
There are traditional text operations like split, where we can simply use the delimiter to adjust the columns.
In case we know more about the language, we also have a space to write a Python formula for a new or existing column. For example, a simple mathematical operation:
Some other examples I’ve been testing include replacing text strings, deleting columns, adding columns, for example, and changing data types.
If at any point we want to go back, we have a "Back to Notebook" button that would allow us to reorder code and/or manually adjust everything that was generated.
Keep in mind that if we go back to the notebook, we won’t be able to return to the wrangler exactly as it was with its steps to view and edit. Instead, we would need to generate a new “head()” from the most recent cleaning step's dataframe to restart it and execute the missing steps.
Integration with Fabric
In addition to the local use mentioned earlier, the tool was integrated with Microsoft Build 2023 releases. Fabric Notebooks can not only be opened in Visual Studio Code for use from there, but they also have a "data" tab that detects pandas frames to open Data Wrangler.
That way we will open something similar to what we have seen at VScode.
Although it looks identical but in a light version, it has some differences. The integration with Fabric doesn’t have all the local options, such as "Create column from formula." We don’t have the option to write the code for a column. The rest is quite similar to what we mentioned earlier.
Conclusion
This is an excellent tool that can strengthen our beginnings with Python to improve the user experience and learn about the code by seeing what it generates after applying the operations.
If we compare it to a Power BI query editor, it feels a bit rough. However, it has a lot of potential and room for growth to improve.
A great alternative for those who haven’t quite gotten comfortable with Python and need to perform some transformations between layers of a Lake or data ingestion.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.