This is an original article from the LaDataWeb blog that I wanted to share with the community in English.
Many times I have found people working in data engineering or as data scientists with Python who, due to a company requirement, need to use a Microsoft data movement tool (Data Factory, Logic Apps, etc.) or a reporting tool (Power BI, Excel, etc.). The truth is that those people hate working without writing Python code; they don't want to manage a point-and-click environment for transformations.
Today, Microsoft is pushing Power Query to its limits, integrating the language into many services and tools for data transformations. It can now be used with SSIS, SSAS, Power Automate, Logic Apps, Power BI and Dataflows (in Power Platform and Data Factory). Adoption of Power Query keeps growing, embracing not only BI roles but also data engineers and data scientists.
That's why I have created this post: to show how to write the basic SQL and Pandas (the Python library) transformations in Power Query.
First, they need to understand that in the most common scenarios (dataflows and Power BI) Power Query works in steps: each step takes the result of the previous one as input, building the transformation in order. So whenever you see #"Last Step", we are referring to the table produced by the previous step.
Let's check the examples that will be listed in the following order: SQL, Python Pandas, Power Query.
Show first 5 lines of the dataset
In SQL:
SELECT TOP 5 * FROM table
In Pandas:
df.head()
In Power Query:
Table.FirstN(#"Last Step",5)
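A minimal runnable sketch of the Pandas side, using a made-up sample table:

```python
import pandas as pd

# Hypothetical sample data; any DataFrame works the same way
df = pd.DataFrame({"column1": range(10)})

# Equivalent of SELECT TOP 5 * FROM table / Table.FirstN(#"Last Step", 5)
top5 = df.head(5)
```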
Count rows
SELECT COUNT(*) FROM table1
df.shape[0]  # shape is a property, not a method; len(df) also works
Table.RowCount(#"Last Step")
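A quick check with made-up data, since shape trips up people coming from SQL:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"column1": [10, 20, 30]})

# df.shape is a (rows, columns) tuple; the row count is its first element
row_count = df.shape[0]  # equivalently: len(df)
```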
Select the columns of a table
SELECT column1, column2 FROM table1
df[["column1", "column2"]]
#"Last Step"[[column1],[column2]]
//Or it might be:
Table.SelectColumns(#"Last Step", {"column1", "column2"} )
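The column projection on the Pandas side, sketched with hypothetical data:

```python
import pandas as pd

# Hypothetical sample data with an extra column to project away
df = pd.DataFrame({"column1": [1, 2], "column2": ["a", "b"], "column3": [True, False]})

# Equivalent of SELECT column1, column2 FROM table1
projected = df[["column1", "column2"]]
```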
Filter rows
SELECT column1, column2 FROM table1 WHERE column1 = 2
df[['column1', 'column2']].loc[df['column1'] == 2]
Table.SelectRows(#"Last Step", each [column1] = 2 )
Filtering with more than 1 column
SELECT * FROM table1 WHERE column1 > 1 AND column2 < 25
df.loc[(df['column1'] > 1) & (df['column2'] < 25)]
# Or using the OR and NOT operators
df.loc[(df['column1'] > 1) | ~(df['column2'] < 25)]
Table.SelectRows(#"Last Step", each [column1] > 1 and [column2] < 25 )
//Or using the OR and NOT operators
Table.SelectRows(#"Last Step", each [column1] > 1 or not ([column2] < 25 ) )
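A runnable version of the AND filter with invented sample rows, to show which rows survive:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"column1": [0, 2, 3], "column2": [10, 20, 30]})

# WHERE column1 > 1 AND column2 < 25: only the (2, 20) row passes both conditions
filtered = df.loc[(df["column1"] > 1) & (df["column2"] < 25)]
```

Note the parentheses around each condition; `&` and `|` bind tighter than comparisons in Python.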
Filters with complex operators
SELECT * FROM table1 WHERE column1 BETWEEN 1 and 5 AND column2 IN (20,30,40,50) AND column3 LIKE '%arcelona%'
df.loc[(df['column1'].between(1,5)) & (df['column2'].isin([20,30,40,50])) & (df['column3'].str.contains('arcelona'))]
Table.SelectRows(#"Last Step", each ([column1] >= 1 and [column1] <= 5) and List.Contains({20,30,40,50}, [column2]) and Text.Contains([column3], "arcelona") )
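A sketch of the complex filter with made-up rows. Note that both SQL's BETWEEN and pandas' between are inclusive of their bounds:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "column1": [0, 3, 4, 7],
    "column2": [20, 30, 35, 40],
    "column3": ["Barcelona", "Barcelona", "Madrid", "Valencia"],
})

# between mirrors BETWEEN, isin mirrors IN, str.contains mirrors LIKE '%arcelona%'
mask = (
    df["column1"].between(1, 5)
    & df["column2"].isin([20, 30, 40, 50])
    & df["column3"].str.contains("arcelona")
)
filtered = df.loc[mask]  # only the (3, 30, "Barcelona") row satisfies all three
```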
Join tables
SELECT t1.column1, t2.column1 FROM table1 t1 LEFT JOIN table2 t2 ON t1.column_id = t2.column_id
There are two functions that can help us here: merge and join.
df_joined = df1.merge(df2, left_on='lkey', right_on='rkey', how='left')
df_joined = df1.join(df2, on='column_id', how='left')
# Then we select the two columns
df_joined[['column1_df1', 'column1_df2']]
In Power Query, we first set the join conditions and then select the columns with the ExpandTableColumn function.
#"Source" = #"Last Step"[[column1_t1]]
#"Join Step" = Table.NestedJoin(#"Source", {"column_t1_id"}, table2, {"column_t2_id"}, "Prefix", JoinKind.LeftOuter)
#"Expansion" = Table.ExpandTableColumn(#"Join Step", "Prefix", {"column1_t2"}, {"Prefix_column1_t2"})
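A runnable left-join sketch with two invented tables, showing the usual behaviors (a left row matching two right rows is duplicated; an unmatched left row gets NaN):

```python
import pandas as pd

# Hypothetical sample tables
customers = pd.DataFrame({"column_id": [1, 2, 3], "name": ["Ana", "Bob", "Eve"]})
orders = pd.DataFrame({"column_id": [1, 1, 3], "amount": [10, 20, 30]})

# LEFT JOIN on column_id; Bob has no orders, so his amount is NaN,
# and Ana appears twice because she matches two order rows
joined = customers.merge(orders, on="column_id", how="left")
```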
Group By
SELECT column1, count(*) FROM table1 GROUP BY column1
df.groupby('column1')['column1'].count()
Table.Group(#"Last Step", {"column1"}, {{"Count alias", each Table.RowCount(_), type number}})
Filtering a table grouped by
SELECT store, sum(sales) FROM table1 GROUP BY store HAVING sum(sales) > 1000
df_grouped = df.groupby('store')['sales'].sum()
df_grouped.loc[df_grouped > 1000]
#"Grouping" = Table.Group(#"Last Step", {"store"}, {{"Sum alias", each List.Sum([sales]), type number}})
#"Final" = Table.SelectRows( #"Grouping" , each [Sum alias] > 1000 )
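The Pandas HAVING pattern, sketched with invented sales data:

```python
import pandas as pd

# Hypothetical sample data: store A sells 1100 in total, store B only 500
df = pd.DataFrame({"store": ["A", "A", "B", "B"], "sales": [600, 500, 300, 200]})

# GROUP BY store, then filter the aggregated result (SQL's HAVING)
grouped = df.groupby("store")["sales"].sum()
result = grouped.loc[grouped > 1000]
```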
Sort descending by column
SELECT * FROM table1 ORDER BY column1 DESC
df.sort_values(by=['column1'], ascending=False)
Table.Sort(#"Last Step",{{"column1", Order.Descending}})
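The descending sort on the Pandas side, with a made-up column:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"column1": [2, 9, 5]})

# ORDER BY column1 DESC
ordered = df.sort_values(by=["column1"], ascending=False)
```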
UNION two tables with the same characteristics
SELECT * FROM table1 UNION SELECT * FROM table2
In Pandas we have two well-known options, append and concat; note that append was deprecated and removed in pandas 2.0, so concat is the one to prefer.
df.append(df2)  # deprecated; removed in pandas 2.0
pd.concat([df1, df2])
Table.Combine({table1, table2})
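A runnable concat sketch with two invented tables. Strictly speaking, concat behaves like SQL's UNION ALL; chain drop_duplicates afterwards if you need UNION semantics:

```python
import pandas as pd

# Hypothetical sample tables with the same columns
df1 = pd.DataFrame({"column1": [1, 2]})
df2 = pd.DataFrame({"column1": [3, 4]})

# Stack the two tables; ignore_index renumbers the rows 0..n-1
combined = pd.concat([df1, df2], ignore_index=True)
```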
Transformations
The following transformations are shown only for Pandas and Power Query because they are not as common in query languages like SQL.
Analyze table content
df.describe()
Table.Profile(#"Last Step")
Check for unique values in columns
df.nunique()  # distinct count per column; use df['column1'].value_counts() for per-value frequencies
Table.Profile(#"Last Step")[[Column],[DistinctCount]]
Generate a test table with manual entry data.
df = pd.DataFrame([[1, "Boris Yeltsin"], [2, "Mikhail Gorbachev"]], columns=["CustomerID", "Name"])
Table.FromRecords({[CustomerID = 1, Name = "Bob", Phone = "123-4567"]})
Remove column from dataset
There are two ways
df.drop(columns=['column1'])
df.drop(['column1'], axis=1)
Table.RemoveColumns(#"Last Step",{"column1"})
Apply transformations in a column
df['column1'] = df.apply(lambda x: x['column1'] + 1, axis=1)
Table.TransformColumns(#"Last Step", {{"column1", each _ + 1, type number}})
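A runnable sketch with invented data; applying directly to the column is simpler than the row-wise (axis=1) form when only one column is involved:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"column1": [1, 2, 3]})

# Add 1 to every value of column1, like Table.TransformColumns above
df["column1"] = df["column1"].apply(lambda x: x + 1)
```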
Alright, that's all folks! It was a long journey, and it could be longer, but these might be the basics we need. I hope this helps some people find Power Query more user-friendly, knowing they can simply write code directly.