
rsaprano

Navigating the (new) DP-600 exam: Part 2 - Prepare Data

Overview of Module 2: Prepare data (40–45%)

 

Module 2 dives into the heart of data preparation, focusing on turning raw data into clean, structured, and insightful information ready for analysis.


With Fabric you can bring data in from multiple sources such as Excel files, SQL databases, and files from cloud storage. This can be done using low/no-code tools such as Dataflows Gen 2, where data from multiple sources can be automatically pulled, cleaned, and loaded into a central Lakehouse, Warehouse, or KQL Database, ensuring a consistent data pipeline.

Choosing the right artifact in Fabric to store data and apply transformations depends on the type of data and the skillsets of the personas working with it. The table below summarises some of the key differences between them:

 

 

|  | Warehouse | Lakehouse | Eventhouse |
|---|---|---|---|
| Data volume | Unlimited | Unlimited | Unlimited |
| Type of data | Structured | Unstructured, semi-structured, structured | Unstructured, semi-structured, structured |
| Primary developer persona | Data warehouse developer, SQL engineer | Data engineer, data scientist | Citizen data scientist, data engineer, data scientist, SQL engineer |
| Primary dev skill | SQL | Spark (Scala, PySpark, Spark SQL, R) | No code, KQL, SQL |
| Data organized by | Databases, schemas, and tables | Folders and files, databases, and tables | Databases, schemas, and tables |
| Read operations | T-SQL, Spark | Spark, T-SQL | KQL, T-SQL, Spark, Power BI |
| Write operations | T-SQL | Spark (Scala, PySpark, Spark SQL, R) | KQL, Spark, connector ecosystem |
| Multi-table transactions | Yes | No | Yes, for multi-table ingestion |
| Primary development interface | SQL scripts | Spark notebooks, Spark job definitions | KQL Queryset, KQL Database |
| Security | Object level, RLS, CLS, DDL/DML, dynamic data masking | RLS, CLS, table level (T-SQL), none for Spark | RLS |
| Access data via shortcuts | Yes | Yes | Yes |
| Can be a source for shortcuts | Yes (tables) | Yes (files and tables) | Yes |
| Query across items | Yes | Yes | Yes |
| Advanced analytics | Interface for large-scale data processing, built-in data parallelism and fault tolerance | Interface for large-scale data processing, built-in data parallelism and fault tolerance | Time Series native elements, full geo-spatial storing and query capabilities |
| Advanced formatting support | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Full indexing for free text and semi-structured data like JSON |
| Ingestion latency | Available instantly for querying | Available instantly for querying | Queued ingestion, streaming ingestion has a couple of seconds latency |


Eventhouses have a setting called OneLake availability which, once turned on, means that you can query the data in your KQL database in Delta Lake format via other Fabric engines such as Direct Lake mode in Power BI, Warehouse, Lakehouse, Notebooks, and more.
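As a minimal sketch of what this enables (assuming OneLake availability has been turned on for the KQL database and a OneLake shortcut to the resulting Delta table has been created in a Lakehouse; every object name below is hypothetical), the same event data could then be queried with plain T-SQL from the Lakehouse's SQL analytics endpoint:

```sql
-- Hypothetical names: SensorLakehouse is a Lakehouse, SensorReadings is a shortcut
-- pointing at the KQL database table exposed to OneLake in Delta format.
SELECT TOP 10
       DeviceId,
       AVG(Temperature) AS AvgTemperature
FROM   SensorLakehouse.dbo.SensorReadings
GROUP BY DeviceId
ORDER BY AvgTemperature DESC;
```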

With Data Pipelines, you can use a UI to copy data from a source into a Lakehouse or Warehouse, with parameterised values and loops controlling destinations or logic as data is copied. You can also use a pipeline to run notebooks or SQL stored procedures on a schedule or in response to triggers. With shortcuts, users can link external data (like files from Azure Data Lake or AWS S3) directly to Fabric, creating live connections without duplicating storage and making updates seamless.
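Returning to the stored-procedure option mentioned above, here is a minimal, hypothetical example of the kind of logic a pipeline activity might call on a schedule (table and column names are assumptions, not part of the exam content): it appends only the staging rows that arrived since the last load.

```sql
-- Hypothetical incremental-load procedure that a pipeline stored-procedure activity
-- could invoke on a schedule or trigger.
CREATE PROCEDURE dbo.LoadNewOrders
AS
BEGIN
    INSERT INTO dbo.SalesOrders (OrderId, OrderDate, CustomerId, Amount)
    SELECT s.OrderId, s.OrderDate, s.CustomerId, s.Amount
    FROM   staging.Orders AS s
    WHERE  s.OrderDate > (SELECT COALESCE(MAX(OrderDate), '1900-01-01') FROM dbo.SalesOrders);
END;
```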

 

The module also emphasizes the importance of structuring data using star schemas, where data is organised in an optimal format to support efficient analysis. For example, for a retail company a central fact table may hold sales transactions, while dimension tables provide context with details like Product, Customer, and Store.
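As a rough sketch of what such a model might look like in a Warehouse (all table and column names are illustrative assumptions), the fact table references the surrounding dimensions via surrogate keys:

```sql
-- Illustrative star schema: one fact table plus a product dimension.
-- The StartDate/EndDate/IsCurrent columns are included to support the
-- Type 2 slowly changing dimension pattern discussed below.
CREATE TABLE dbo.DimProduct (
    ProductKey  INT            NOT NULL,   -- surrogate key
    ProductId   INT            NOT NULL,   -- business key
    ProductName VARCHAR(100)   NOT NULL,
    Category    VARCHAR(50)    NOT NULL,
    StartDate   DATE           NOT NULL,
    EndDate     DATE           NULL,
    IsCurrent   BIT            NOT NULL
);

CREATE TABLE dbo.FactSales (
    ProductKey  INT            NOT NULL,   -- references DimProduct
    CustomerKey INT            NOT NULL,   -- references DimCustomer (not shown)
    StoreKey    INT            NOT NULL,   -- references DimStore (not shown)
    OrderDate   DATE           NOT NULL,
    Quantity    INT            NOT NULL,
    SalesAmount DECIMAL(18, 2) NOT NULL
);
```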

 

This arrangement optimizes querying for insights like sales trends per region or best-selling products. Mastering Slowly Changing Dimensions (SCDs) is key here. Type 1 SCDs (which overwrite old data) and Type 2 SCDs (which track historical changes) are crucial for handling updates to customer details or product changes over time. For instance, if a product’s category changes, a Type 2 SCD allows tracking both the old and new categories without losing historical accuracy, enabling deeper trend analysis.
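Sticking with the hypothetical DimProduct sketch above, a Type 2 change could be handled with an expire-then-insert pattern along these lines (a simplified illustration, not the only way to implement it):

```sql
-- 1. Expire the current row for the product whose category has changed.
UPDATE dbo.DimProduct
SET    EndDate   = CAST(GETDATE() AS DATE),
       IsCurrent = 0
WHERE  ProductId = 101        -- hypothetical business key
  AND  IsCurrent = 1;

-- 2. Insert a new row carrying the new category, valid from today onwards,
--    so both the old and new categories remain available for trend analysis.
INSERT INTO dbo.DimProduct (ProductKey, ProductId, ProductName, Category, StartDate, EndDate, IsCurrent)
VALUES (2001, 101, 'Trail Bike', 'Outdoor Sports', CAST(GETDATE() AS DATE), NULL, 1);
```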

 

Ensuring data quality is another pillar of Module 2. Removing duplicates, filling null values, and handling type mismatches are essential for accurate analytics. A data engineer might use SQL or Python scripts within Fabric Notebooks to standardize formats and clean datasets. For example, converting currency data into a consistent format or eliminating redundant records across multiple sources would make aggregated sales reports accurate and reliable. By mastering these techniques, data professionals ensure that they’re working with clean, dependable data, laying a solid foundation for any subsequent analysis or reporting.
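As a small, hypothetical example of the kind of clean-up step this involves (the staging table and column names are assumptions), the following T-SQL removes duplicate order lines, fills missing currency codes, and casts the amount to a consistent type:

```sql
-- Keep only the most recently loaded copy of each order line, standardise the
-- currency code, and enforce a consistent numeric type for the amount.
WITH Deduped AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY OrderId, ProductId
                              ORDER BY LoadDate DESC) AS rn
    FROM   staging.Sales
)
SELECT OrderId,
       ProductId,
       COALESCE(CurrencyCode, 'GBP')       AS CurrencyCode,   -- fill null values
       CAST(SalesAmount AS DECIMAL(18, 2)) AS SalesAmount     -- fix type mismatches
FROM   Deduped
WHERE  rn = 1;                                                -- drop duplicate rows
```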

Querying and analysing data using SQL or KQL is also a key component of this module.  Many data analysts are already familiar with SQL but potentially less so with Kusto Query Language (KQL); luckily in Microsoft Fabric you can convert from SQL to KQL within the KQL editor by prefacing a SQL query with the keyword explain.
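For example, running the following in a KQL queryset returns the equivalent KQL for the SQL query placed after the explain keyword (dependencies is simply the sample table used in the comparison below):

```sql
explain
SELECT name, COUNT(*) AS Count
FROM dependencies
GROUP BY name
```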

 

The table below shows equivalent SQL and KQL for some common query patterns.

 

| Category | SQL Query | Kusto Query |
|---|---|---|
| -- | `SELECT name, COUNT(DISTINCT type) FROM dependencies GROUP BY name` | `dependencies \| summarize by name, type \| summarize count() by name` or, approximate for large sets, `dependencies \| summarize dcount(type) by name` |
| Column aliases, Extending | `SELECT operationName as Name, AVG(duration) as AvgD FROM dependencies` | `dependencies \| summarize AvgD = avg(duration) by Name=operationName` |
| Ordering | `SELECT name, timestamp FROM dependencies ORDER BY timestamp ASC` | `dependencies \| project name, timestamp \| sort by timestamp asc nulls last` |
| Top n by measure | `SELECT TOP 100 name, COUNT(*) as Count FROM dependencies GROUP BY name ORDER BY Count DESC` | `dependencies \| summarize Count = count() by name \| top 100 by Count desc` |
| Nested queries | `SELECT * FROM dependencies WHERE resultCode = (SELECT TOP 1 resultCode FROM dependencies WHERE resultId = 7 ORDER BY timestamp DESC)` | `dependencies \| where resultCode == toscalar(dependencies \| where resultId == 7 \| top 1 by timestamp desc \| project resultCode)` |
| Having | `SELECT COUNT(*) FROM dependencies GROUP BY name HAVING COUNT(*) > 3` | `dependencies \| summarize Count = count() by name \| where Count > 3` |

 

Module 2 Worked Example - Creating an Eventhouse

 

In the attached PDF, there is a guided walkthrough of creating an Eventhouse and Eventstream based on one of the sample datasets (N.B. this is based on the Fabric RTI Tutorial).

This example is available as an interactive "practice the clicks" exercise in the DP-600 Study Guide...