
rsaprano

Navigating the (new) DP-600 exam: Part 2 - Prepare Data

Overview of Module 2: Prepare data (40–45%)

 

Module 2 dives into the heart of data preparation, focusing on turning raw data into clean, structured, and insightful information ready for analysis.


With Fabric you can bring data in from multiple sources such as Excel files, SQL databases, and files from cloud storage. This can be done using low/no-code tools such as Dataflows Gen 2, where data from multiple sources can be automatically pulled, cleaned, and loaded into a central Lakehouse, Warehouse, or KQL Database, ensuring a consistent data pipeline.

Choosing the right artifact in Fabric to store data and apply transformations depends on the type of data and the skillsets of the personas working with it. The table below summarises some of the key differences between them:

 

 

|  | Warehouse | Lakehouse | Eventhouse |
|---|---|---|---|
| Data volume | Unlimited | Unlimited | Unlimited |
| Type of data | Structured | Unstructured, semi-structured, structured | Unstructured, semi-structured, structured |
| Primary developer persona | Data warehouse developer, SQL engineer | Data engineer, data scientist | Citizen data scientist, data engineer, data scientist, SQL engineer |
| Primary dev skill | SQL | Spark (Scala, PySpark, Spark SQL, R) | No code, KQL, SQL |
| Data organized by | Databases, schemas, and tables | Folders and files, databases, and tables | Databases, schemas, and tables |
| Read operations | T-SQL, Spark | Spark, T-SQL | KQL, T-SQL, Spark, Power BI |
| Write operations | T-SQL | Spark (Scala, PySpark, Spark SQL, R) | KQL, Spark, connector ecosystem |
| Multi-table transactions | Yes | No | Yes, for multi-table ingestion |
| Primary development interface | SQL scripts | Spark notebooks, Spark job definitions | KQL Queryset, KQL Database |
| Security | Object level, RLS, CLS, DDL/DML, dynamic data masking | RLS, CLS, table level (T-SQL), none for Spark | RLS |
| Access data via shortcuts | Yes | Yes | Yes |
| Can be a source for shortcuts | Yes (tables) | Yes (files and tables) | Yes |
| Query across items | Yes | Yes | Yes |
| Advanced analytics | Interface for large-scale data processing, built-in data parallelism and fault tolerance | Interface for large-scale data processing, built-in data parallelism and fault tolerance | Time Series native elements, full geo-spatial storing and query capabilities |
| Advanced formatting support | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Full indexing for free text and semi-structured data like JSON |
| Ingestion latency | Available instantly for querying | Available instantly for querying | Queued ingestion, streaming ingestion has a couple of seconds latency |


Eventhouses have a setting called OneLake availability which, once turned on, means that you can query the data in your KQL database in Delta Lake format via other Fabric engines such as Direct Lake mode in Power BI, Warehouse, Lakehouse, Notebooks, and more.
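As a minimal sketch of what this enables (assuming OneLake availability has been turned on for the KQL database and a OneLake shortcut to the resulting Delta table has been created in a Lakehouse; every object name below is hypothetical), the same event data could then be queried with plain T-SQL from the Lakehouse's SQL analytics endpoint:

```sql
-- Hypothetical names: SensorLakehouse is a Lakehouse, SensorReadings is a shortcut
-- pointing at the KQL database table exposed to OneLake in Delta format.
SELECT TOP 10
       DeviceId,
       AVG(Temperature) AS AvgTemperature
FROM   SensorLakehouse.dbo.SensorReadings
GROUP BY DeviceId
ORDER BY AvgTemperature DESC;
```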

With Data Pipelines, you can use a UI to copy data from a source into a Lakehouse or Warehouse, with parameterised values and loops controlling destinations or logic as data is copied. You can also use a pipeline to run notebooks or SQL stored procedures on a schedule or in response to triggers. With shortcuts, users can link external data (like files from Azure Data Lake or AWS S3) directly to Fabric, creating live connections without duplicating storage and making updates seamless.
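Returning to the stored-procedure option mentioned above, here is a minimal, hypothetical example of the kind of logic a pipeline activity might call on a schedule (table and column names are assumptions, not part of the exam content): it appends only the staging rows that arrived since the last load.

```sql
-- Hypothetical incremental-load procedure that a pipeline stored-procedure activity
-- could invoke on a schedule or trigger.
CREATE PROCEDURE dbo.LoadNewOrders
AS
BEGIN
    INSERT INTO dbo.SalesOrders (OrderId, OrderDate, CustomerId, Amount)
    SELECT s.OrderId, s.OrderDate, s.CustomerId, s.Amount
    FROM   staging.Orders AS s
    WHERE  s.OrderDate > (SELECT COALESCE(MAX(OrderDate), '1900-01-01') FROM dbo.SalesOrders);
END;
```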

 

The module also emphasizes the importance of structuring data using star schemas, where data is organised in an optimal format to support efficient analysis. For example, for a retail company a central fact table may hold sales transactions, while dimension tables provide context with details like Product, Customer, and Store.
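As a rough sketch of what such a model might look like in a Warehouse (all table and column names are illustrative assumptions), the fact table references the surrounding dimensions via surrogate keys:

```sql
-- Illustrative star schema: one fact table plus a product dimension.
-- The StartDate/EndDate/IsCurrent columns are included to support the
-- Type 2 slowly changing dimension pattern discussed below.
CREATE TABLE dbo.DimProduct (
    ProductKey  INT            NOT NULL,   -- surrogate key
    ProductId   INT            NOT NULL,   -- business key
    ProductName VARCHAR(100)   NOT NULL,
    Category    VARCHAR(50)    NOT NULL,
    StartDate   DATE           NOT NULL,
    EndDate     DATE           NULL,
    IsCurrent   BIT            NOT NULL
);

CREATE TABLE dbo.FactSales (
    ProductKey  INT            NOT NULL,   -- references DimProduct
    CustomerKey INT            NOT NULL,   -- references DimCustomer (not shown)
    StoreKey    INT            NOT NULL,   -- references DimStore (not shown)
    OrderDate   DATE           NOT NULL,
    Quantity    INT            NOT NULL,
    SalesAmount DECIMAL(18, 2) NOT NULL
);
```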

 

This arrangement optimizes querying for insights like sales trends per region or best-selling products. Mastering Slowly Changing Dimensions (SCDs) is key here. Type 1 SCDs (which overwrite old data) and Type 2 SCDs (which track historical changes) are crucial for handling updates to customer details or product changes over time. For instance, if a product’s category changes, a Type 2 SCD allows tracking both the old and new categories without losing historical accuracy, enabling deeper trend analysis.
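Sticking with the hypothetical DimProduct sketch above, a Type 2 change could be handled with an expire-then-insert pattern along these lines (a simplified illustration, not the only way to implement it):

```sql
-- 1. Expire the current row for the product whose category has changed.
UPDATE dbo.DimProduct
SET    EndDate   = CAST(GETDATE() AS DATE),
       IsCurrent = 0
WHERE  ProductId = 101        -- hypothetical business key
  AND  IsCurrent = 1;

-- 2. Insert a new row carrying the new category, valid from today onwards,
--    so both the old and new categories remain available for trend analysis.
INSERT INTO dbo.DimProduct (ProductKey, ProductId, ProductName, Category, StartDate, EndDate, IsCurrent)
VALUES (2001, 101, 'Trail Bike', 'Outdoor Sports', CAST(GETDATE() AS DATE), NULL, 1);
```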

 

Ensuring data quality is another pillar of Module 2. Removing duplicates, filling null values, and handling type mismatches are essential for accurate analytics. A data engineer might use SQL or Python scripts within Fabric Notebooks to standardize formats and clean datasets. For example, converting currency data into a consistent format or eliminating redundant records across multiple sources would make aggregated sales reports accurate and reliable. By mastering these techniques, data professionals ensure that they’re working with clean, dependable data, laying a solid foundation for any subsequent analysis or reporting.
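As a small, hypothetical example of the kind of clean-up step this involves (the staging table and column names are assumptions), the following T-SQL removes duplicate order lines, fills missing currency codes, and casts the amount to a consistent type:

```sql
-- Keep only the most recently loaded copy of each order line, standardise the
-- currency code, and enforce a consistent numeric type for the amount.
WITH Deduped AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY OrderId, ProductId
                              ORDER BY LoadDate DESC) AS rn
    FROM   staging.Sales
)
SELECT OrderId,
       ProductId,
       COALESCE(CurrencyCode, 'GBP')       AS CurrencyCode,   -- fill null values
       CAST(SalesAmount AS DECIMAL(18, 2)) AS SalesAmount     -- fix type mismatches
FROM   Deduped
WHERE  rn = 1;                                                -- drop duplicate rows
```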

Querying and analysing data using SQL or KQL is also a key component of this module.  Many data analysts are already familiar with SQL but potentially less so with Kusto Query Language (KQL); luckily in Microsoft Fabric you can convert from SQL to KQL within the KQL editor by prefacing a SQL query with the keyword explain.
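For example, running the following in a KQL queryset returns the equivalent KQL for the SQL query placed after the explain keyword (dependencies is simply the sample table used in the comparison below):

```sql
explain
SELECT name, COUNT(*) AS Count
FROM dependencies
GROUP BY name
```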

 

The table below shows equivalent SQL and KQL for some common query patterns.

 

| Category | SQL Query | Kusto Query |
|---|---|---|
| -- | `SELECT name, COUNT(DISTINCT type) FROM dependencies GROUP BY name` | `dependencies \| summarize by name, type \| summarize count() by name` or, approximate for large sets, `dependencies \| summarize dcount(type) by name` |
| Column aliases, Extending | `SELECT operationName as Name, AVG(duration) as AvgD FROM dependencies` | `dependencies \| summarize AvgD = avg(duration) by Name=operationName` |
| Ordering | `SELECT name, timestamp FROM dependencies ORDER BY timestamp ASC` | `dependencies \| project name, timestamp \| sort by timestamp asc nulls last` |
| Top n by measure | `SELECT TOP 100 name, COUNT(*) as Count FROM dependencies GROUP BY name ORDER BY Count DESC` | `dependencies \| summarize Count = count() by name \| top 100 by Count desc` |
| Nested queries | `SELECT * FROM dependencies WHERE resultCode = (SELECT TOP 1 resultCode FROM dependencies WHERE resultId = 7 ORDER BY timestamp DESC)` | `dependencies \| where resultCode == toscalar(dependencies \| where resultId == 7 \| top 1 by timestamp desc \| project resultCode)` |
| Having | `SELECT COUNT(*) FROM dependencies GROUP BY name HAVING COUNT(*) > 3` | `dependencies \| summarize Count = count() by name \| where Count > 3` |

 

Module 2 Worked Example - Creating an Eventhouse

 

In the attached PDF, there is a guided walkthrough of creating an Eventhouse and Eventstream based on one of the sample datasets (N.B. this is based on the Fabric RTI Tutorial).

This example is available as an interactive "practice the clicks" exercise in the DP-600 Study Guide...