Module 2 dives into the heart of data preparation, focusing on turning raw data into clean, structured, and insightful information ready for analysis.
With Fabric you can bring in data from multiple sources such as Excel files, SQL databases, and files in cloud storage. This can be done using low/no-code tools such as Dataflows Gen 2, where data from multiple sources can be automatically pulled, cleaned, and loaded into a central Lakehouse, Warehouse, or KQL Database, ensuring a consistent data pipeline.
Choosing the right artifact in Fabric to store data and apply transformations depends on the type of data and the skillsets of the personas working with it. The table below summarises some of the key differences between them:
| | Warehouse | Lakehouse | Eventhouse |
| --- | --- | --- | --- |
| Data volume | Unlimited | Unlimited | Unlimited |
| Type of data | Structured | Unstructured, semi-structured, structured | Unstructured, semi-structured, structured |
| Primary developer persona | Data warehouse developer, SQL engineer | Data engineer, data scientist | Citizen data scientist, data engineer, data scientist, SQL engineer |
| Primary dev skill | SQL | Spark (Scala, PySpark, Spark SQL, R) | No code, KQL, SQL |
| Data organized by | Databases, schemas, and tables | Folders and files, databases, and tables | Databases, schemas, and tables |
| Read operations | T-SQL, Spark* | Spark, T-SQL | KQL, T-SQL, Spark, Power BI |
| Write operations | T-SQL | Spark (Scala, PySpark, Spark SQL, R) | KQL, Spark, connector ecosystem |
| Multi-table transactions | Yes | No | Yes, for multi-table ingestion |
| Primary development interface | SQL scripts | Spark notebooks, Spark job definitions | KQL Queryset, KQL Database |
| Security | DDL/DML, dynamic data masking | RLS, CLS**, table level (T-SQL), none for Spark | RLS |
| Access data via shortcuts | Yes | Yes | Yes |
| Can be a source for shortcuts | Yes (tables) | Yes (files and tables) | Yes |
| Query across items | Yes | Yes | Yes |
| Advanced analytics | Interface for large-scale data processing, built-in data parallelism and fault tolerance | Interface for large-scale data processing, built-in data parallelism and fault tolerance | Time Series native elements, full geo-spatial storing and query capabilities |
| Advanced formatting support | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Full indexing for free text and semi-structured data like JSON |
| Ingestion latency | Available instantly for querying | Available instantly for querying | Queued ingestion, streaming ingestion has a couple of seconds latency |
Eventhouses have a setting called OneLake availability which, once turned on, means that you can query the data in your KQL database in Delta Lake format via other Fabric engines such as Direct Lake mode in Power BI, the Warehouse, the Lakehouse, Notebooks, and more.
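As a minimal sketch of what that enables, assuming OneLake availability has been switched on and a shortcut has been created in a lakehouse over the exposed Delta table (the names BikeTelemetry and Timestamp below are hypothetical), the same data becomes queryable from the lakehouse's SQL analytics endpoint:

```sql
-- BikeTelemetry is a hypothetical shortcut over the KQL table's Delta data;
-- this runs against the lakehouse SQL analytics endpoint, not the Eventhouse.
-- [Timestamp] is likewise an illustrative column name.
SELECT TOP 10 *
FROM dbo.BikeTelemetry
ORDER BY [Timestamp] DESC;
```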
Using Data Pipelines, you can use a UI to copy data from a source into a Lakehouse or Warehouse, using parameterised values and loops to control destinations or logic as data is copied. You can also use a pipeline to run notebooks or SQL stored procedures on a schedule or based on triggers. With shortcuts, users can link external data (like files from Azure Data Lake or AWS S3) directly to Fabric, creating live connections without duplicating storage and making updates seamless.
The module also emphasizes the importance of structuring data using star schemas, organising it in an optimal format to support efficient analysis. For example, with a retail company, a central fact table may hold sales transactions, while dimension tables provide context with details like Product, Customer, and Store.
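To make that concrete, here is a hedged T-SQL sketch of the kind of query a star schema enables; the table and column names (FactSales, DimStore, DimProduct, and so on) are illustrative, not from the module itself:

```sql
-- Sales by region and product category, joining the fact table to its dimensions.
SELECT
    st.Region,
    p.Category,
    SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
JOIN DimStore   AS st ON f.StoreKey   = st.StoreKey
JOIN DimProduct AS p  ON f.ProductKey = p.ProductKey
GROUP BY st.Region, p.Category
ORDER BY TotalSales DESC;
```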
This arrangement optimizes querying for insights like sales trends per region or best-selling products. Mastering Slowly Changing Dimensions (SCDs) is key here. Type 1 SCDs (which overwrite old data) and Type 2 SCDs (which track historical changes) are crucial for handling updates to customer details or product changes over time. For instance, if a product’s category changes, a Type 2 SCD allows tracking both the old and new categories without losing historical accuracy, enabling deeper trend analysis.
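As an illustration of the Type 2 pattern, the hedged T-SQL sketch below expires the current row and inserts a new one when a product's category changes. DimProduct, stg.Product, and columns such as IsCurrent, ValidFrom, and ValidTo are all hypothetical; production implementations often use MERGE, a dataflow, or a notebook instead:

```sql
-- Assumes a staging table stg.Product holding the latest source values.

-- Step 1: close off the current version of any product whose category changed.
UPDATE d
SET d.IsCurrent = 0,
    d.ValidTo   = SYSUTCDATETIME()
FROM DimProduct AS d
JOIN stg.Product AS s
  ON d.ProductID = s.ProductID
WHERE d.IsCurrent = 1
  AND d.Category <> s.Category;

-- Step 2: insert a new current row for every staged product that no longer
-- has one (changed products, plus any brand-new products).
INSERT INTO DimProduct (ProductID, ProductName, Category, IsCurrent, ValidFrom, ValidTo)
SELECT s.ProductID, s.ProductName, s.Category, 1, SYSUTCDATETIME(), NULL
FROM stg.Product AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM DimProduct AS d
    WHERE d.ProductID = s.ProductID
      AND d.IsCurrent = 1
);
```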
Ensuring data quality is another pillar of Module 2. Removing duplicates, filling null values, and handling type mismatches are essential for accurate analytics. A data engineer might use SQL or Python scripts within Fabric Notebooks to standardize formats and clean datasets. For example, converting currency data into a consistent format or eliminating redundant records across multiple sources would make aggregated sales reports accurate and reliable. By mastering these techniques, data professionals ensure that they’re working with clean, dependable data, laying a solid foundation for any subsequent analysis or reporting.
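As a hedged sketch of those cleaning steps in T-SQL (the stg.Sales table and its columns are hypothetical), duplicates can be removed with ROW_NUMBER, nulls filled with COALESCE, and type mismatches handled with TRY_CAST:

```sql
-- Deduplicate on the business key, keeping the most recent record,
-- while filling null quantities and coercing the amount to a consistent type.
WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY OrderID       -- business key
               ORDER BY LoadDate DESC     -- keep the latest copy
           ) AS rn
    FROM stg.Sales
)
SELECT
    OrderID,
    COALESCE(Quantity, 0)                  AS Quantity,    -- fill nulls
    TRY_CAST(SalesAmount AS DECIMAL(18,2)) AS SalesAmount, -- fix type mismatches
    CurrencyCode
FROM Ranked
WHERE rn = 1;  -- drop duplicate rows
```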
Querying and analysing data using SQL or KQL is also a key component of this module. Many data analysts are already familiar with SQL but potentially less so with Kusto Query Language (KQL); luckily, in Microsoft Fabric you can convert SQL to KQL within the KQL editor by prefacing a SQL query with the keyword explain.
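For example (a hedged sketch against the hypothetical dependencies table used in the table below), running the following in a KQL queryset returns the translated KQL rather than executing the SQL:

```sql
-- Prefacing a SQL query with 'explain' asks the KQL engine to translate it.
explain
SELECT TOP 10 name, COUNT(*) AS Count
FROM dependencies
GROUP BY name
ORDER BY Count DESC;

-- Approximate expected output (the equivalent KQL):
-- dependencies
-- | summarize Count = count() by name
-- | top 10 by Count desc
```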
The table below shows equivalent SQL and KQL queries for some common operations.
| Category | SQL Query | Kusto Query |
| --- | --- | --- |
| Grouping, aggregation | SELECT name, COUNT(DISTINCT type) FROM dependencies GROUP BY name | dependencies \| summarize by name, type \| summarize count() by name // or, approximate for large sets: dependencies \| summarize dcount(type) by name |
| Column aliases, Extending | SELECT operationName as Name, AVG(duration) as AvgD FROM dependencies GROUP BY operationName | dependencies \| summarize AvgD = avg(duration) by Name = operationName |
| Ordering | SELECT name, timestamp FROM dependencies ORDER BY timestamp ASC | dependencies \| project name, timestamp \| sort by timestamp asc nulls last |
| Top n by measure | SELECT TOP 100 name, COUNT(*) as Count FROM dependencies GROUP BY name ORDER BY Count DESC | dependencies \| summarize Count = count() by name \| top 100 by Count desc |
| Nested queries | SELECT * FROM dependencies WHERE resultCode = (SELECT TOP 1 resultCode FROM dependencies WHERE resultId = 7 ORDER BY timestamp DESC) | dependencies \| where resultCode == toscalar(dependencies \| where resultId == 7 \| top 1 by timestamp desc \| project resultCode) |
| Having | SELECT COUNT(*) FROM dependencies GROUP BY name HAVING COUNT(*) > 3 | dependencies \| summarize Count = count() by name \| where Count > 3 |
Module 2 Worked Example - Creating an Eventhouse
In the attached PDF, there is a guided walkthrough of creating an Eventhouse and Eventstream based on one of the sample datasets (N.B. this is based on the Fabric RTI Tutorial).
This example is available as an interactive "practice the clicks" exercise in the DP600 Study guide...