This is an original article from the LaDataWeb blog that I wanted to share with the community in English.
Many times I have found people working in data engineering or as data scientists with Python who, due to a company requirement, need to use a Microsoft data movement tool (Data Factory, Logic Apps, etc.) or a reporting tool (Power BI, Excel, etc.). The truth is that those people hate working without writing Python code; they don't want to manage a point-and-click environment for transformations.
Today, Microsoft is pushing Power Query to its limits, integrating the language into many services and tools for data transformations. It can now be used with SSIS, SSAS, Power Automate, Logic Apps, Power BI and Dataflows (in Power Platform and Data Factory). Adoption of Power Query keeps growing, embracing not only BI roles but also data engineers and data scientists.
That's why I have created this post: to show how to write the basic SQL and Pandas (the Python library) transformations in Power Query.
First, they need to understand that in the most common scenarios (dataflows and Power BI) Power Query works in steps: each step takes the result of the previous one as input, building the transformation in order. So whenever you see #"Last Step", we are referring to the table produced by the previous step.
Let's check the examples that will be listed in the following order: SQL, Python Pandas, Power Query.
Show first 5 lines of the dataset
In SQL:
SELECT TOP 5 * FROM table
In Pandas:
df.head()
In Power Query:
Table.FirstN(#"Last Step",5)
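A minimal runnable sketch of the Pandas side, using a made-up sample table:

```python
import pandas as pd

# Hypothetical sample data; any DataFrame works the same way
df = pd.DataFrame({"column1": range(10)})

# Equivalent of SELECT TOP 5 * FROM table / Table.FirstN(#"Last Step", 5)
top5 = df.head(5)
```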
Count rows
SELECT COUNT(*) FROM table1
df.shape[0]  # shape is a property, not a method; len(df) also works
Table.RowCount(#"Last Step")
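A quick check with made-up data, since shape trips up people coming from SQL:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"column1": [10, 20, 30]})

# df.shape is a (rows, columns) tuple; the row count is its first element
row_count = df.shape[0]  # equivalently: len(df)
```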
Select the columns of a table
SELECT column1, column2 FROM table1
df[["column1", "column2"]]
#"Last Step"[[column1],[column2]]
//Or it might be:
Table.SelectColumns(#"Last Step", {"column1", "column2"} )
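The column projection on the Pandas side, sketched with hypothetical data:

```python
import pandas as pd

# Hypothetical sample data with an extra column to project away
df = pd.DataFrame({"column1": [1, 2], "column2": ["a", "b"], "column3": [True, False]})

# Equivalent of SELECT column1, column2 FROM table1
projected = df[["column1", "column2"]]
```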
Filter rows
SELECT column1, column2 FROM table1 WHERE column1 = 2
df[['column1', 'column2']].loc[df['column1'] == 2]
Table.SelectRows(#"Last Step", each [column1] = 2 )
Filtering with more than 1 column
SELECT * FROM table1 WHERE column1 > 1 AND column2 < 25
df.loc[(df['column1'] > 1) & (df['column2'] < 25)]
# Or using the OR and NOT operators
df.loc[(df['column1'] > 1) | ~(df['column2'] < 25)]
Table.SelectRows(#"Last Step", each [column1] > 1 and [column2] < 25 )
//Or using the OR and NOT operators
Table.SelectRows(#"Last Step", each [column1] > 1 or not ([column2] < 25 ) )
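A runnable version of the AND filter with invented sample rows, to show which rows survive:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"column1": [0, 2, 3], "column2": [10, 20, 30]})

# WHERE column1 > 1 AND column2 < 25: only the (2, 20) row passes both conditions
filtered = df.loc[(df["column1"] > 1) & (df["column2"] < 25)]
```

Note the parentheses around each condition; `&` and `|` bind tighter than comparisons in Python.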
Filters with complex operators
SELECT * FROM table1 WHERE column1 BETWEEN 1 and 5 AND column2 IN (20,30,40,50) AND column3 LIKE '%arcelona%'
df.loc[(df['column1'].between(1,5)) & (df['column2'].isin([20,30,40,50])) & (df['column3'].str.contains('arcelona'))]
Table.SelectRows(#"Last Step", each ([column1] >= 1 and [column1] <= 5) and List.Contains({20,30,40,50}, [column2]) and Text.Contains([column3], "arcelona") )
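A sketch of the complex filter with made-up rows. Note that both SQL's BETWEEN and pandas' between are inclusive of their bounds:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "column1": [0, 3, 4, 7],
    "column2": [20, 30, 35, 40],
    "column3": ["Barcelona", "Barcelona", "Madrid", "Valencia"],
})

# between mirrors BETWEEN, isin mirrors IN, str.contains mirrors LIKE '%arcelona%'
mask = (
    df["column1"].between(1, 5)
    & df["column2"].isin([20, 30, 40, 50])
    & df["column3"].str.contains("arcelona")
)
filtered = df.loc[mask]  # only the (3, 30, "Barcelona") row satisfies all three
```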
Join tables
SELECT t1.column1, t2.column1 FROM table1 t1 LEFT JOIN table2 t2 ON t1.column_id = t2.column_id
There are two functions that can help us here: merge and join.
df_joined = df1.merge(df2, left_on='lkey', right_on='rkey', how='left')
df_joined = df1.join(df2, on='column_id', how='left')
# Then we select the two columns
df_joined[['column1_df1', 'column1_df2']]
In Power Query, we first set the join conditions and then select the columns with the ExpandTableColumn function.
#"Source" = #"Last Step"[[column1_t1]]
#"Join Step" = Table.NestedJoin(#"Source", {"column_t1_id"}, table2, {"column_t2_id"}, "Prefix", JoinKind.LeftOuter)
#"Expansion" = Table.ExpandTableColumn(#"Join Step", "Prefix", {"column1_t2"}, {"Prefix_column1_t2"})
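A runnable left-join sketch with two invented tables, showing the usual behaviors (a left row matching two right rows is duplicated; an unmatched left row gets NaN):

```python
import pandas as pd

# Hypothetical sample tables
customers = pd.DataFrame({"column_id": [1, 2, 3], "name": ["Ana", "Bob", "Eve"]})
orders = pd.DataFrame({"column_id": [1, 1, 3], "amount": [10, 20, 30]})

# LEFT JOIN on column_id; Bob has no orders, so his amount is NaN,
# and Ana appears twice because she matches two order rows
joined = customers.merge(orders, on="column_id", how="left")
```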
Group By
SELECT column1, count(*) FROM table1 GROUP BY column1
df.groupby('column1')['column1'].count()
Table.Group(#"Last Step", {"column1"}, {{"Count alias", each Table.RowCount(_), type number}})
Filtering a table grouped by
SELECT store, sum(sales) FROM table1 GROUP BY store HAVING sum(sales) > 1000
df_grouped = df.groupby('store')['sales'].sum()
df_grouped.loc[df_grouped > 1000]
#"Grouping" = Table.Group(#"Last Step", {"store"}, {{"Sum alias", each List.Sum([sales]), type number}})
#"Final" = Table.SelectRows( #"Grouping" , each [Sum alias] > 1000 )
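The Pandas HAVING pattern, sketched with invented sales data:

```python
import pandas as pd

# Hypothetical sample data: store A sells 1100 in total, store B only 500
df = pd.DataFrame({"store": ["A", "A", "B", "B"], "sales": [600, 500, 300, 200]})

# GROUP BY store, then filter the aggregated result (SQL's HAVING)
grouped = df.groupby("store")["sales"].sum()
result = grouped.loc[grouped > 1000]
```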
Sort descending by column
SELECT * FROM table1 ORDER BY column1 DESC
df.sort_values(by=['column1'], ascending=False)
Table.Sort(#"Last Step",{{"column1", Order.Descending}})
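The descending sort on the Pandas side, with a made-up column:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"column1": [2, 9, 5]})

# ORDER BY column1 DESC
ordered = df.sort_values(by=["column1"], ascending=False)
```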
UNION two tables with the same characteristics
SELECT * FROM table1 UNION SELECT * FROM table2
In Pandas we have two well-known options, append and concat; note that append was deprecated and removed in pandas 2.0, so concat is the one to prefer.
df.append(df2)  # deprecated; removed in pandas 2.0
pd.concat([df1, df2])
Table.Combine({table1, table2})
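A runnable concat sketch with two invented tables. Strictly speaking, concat behaves like SQL's UNION ALL; chain drop_duplicates afterwards if you need UNION semantics:

```python
import pandas as pd

# Hypothetical sample tables with the same columns
df1 = pd.DataFrame({"column1": [1, 2]})
df2 = pd.DataFrame({"column1": [3, 4]})

# Stack the two tables; ignore_index renumbers the rows 0..n-1
combined = pd.concat([df1, df2], ignore_index=True)
```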
Transformations
The following transformations are shown only for Pandas and Power Query because they are not as common in query languages like SQL.
Analyze table content
df.describe()
Table.Profile(#"Last Step")
Check for unique values in columns
df.nunique()  # distinct count per column; use df['column1'].value_counts() for per-value frequencies
Table.Profile(#"Last Step")[[Column],[DistinctCount]]
Generate a test table with manual entry data.
df = pd.DataFrame([[1, "Boris Yeltsin"], [2, "Mikhail Gorbachev"]], columns=["CustomerID", "Name"])
Table.FromRecords({[CustomerID = 1, Name = "Bob", Phone = "123-4567"]})
Remove column from dataset
There are two ways
df.drop(columns=['column1'])
df.drop(['column1'], axis=1)
Table.RemoveColumns(#"Last Step",{"column1"})
Apply transformations in a column
df['column1'] = df.apply(lambda x: x['column1'] + 1, axis=1)
Table.TransformColumns(#"Last Step", {{"column1", each _ + 1, type number}})
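A runnable sketch with invented data; applying directly to the column is simpler than the row-wise (axis=1) form when only one column is involved:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"column1": [1, 2, 3]})

# Add 1 to every value of column1, like Table.TransformColumns above
df["column1"] = df["column1"].apply(lambda x: x + 1)
```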
Alright, that's all folks! It was a long journey, and it could be longer, but these might be the basics we need. I hope this helps some people find Power Query more user-friendly, knowing they can simply write code directly.