As a data professional, you've likely encountered the challenge of effectively managing and transforming large datasets. The growth of big data technologies has made traditional methods insufficient, and now more than ever, tools like Microsoft Fabric and Apache Spark have become essential. Consider the situation where you have lots of raw data, and you need to transform it into something meaningful - something that powers insights and drives decisions. This is where a lakehouse architecture combined with the power of Apache Spark can be transformative. But how exactly can you achieve data ingestion into a Microsoft Fabric lakehouse? That’s what we're here to explore today.
What you will learn: In this edition, I’ll walk you through why ingesting data into a Microsoft Fabric lakehouse matters, how Apache Spark plays a pivotal role in this process, and ultimately, how you can do it yourself. Whether you’re an experienced data engineer or a data analyst wanting to expand your toolkit, this guide is for you.
Read Time: 5 minutes
If you're familiar with data lakes and data warehouses, you may already understand the pros and cons of each. Data lakes offer flexibility, while data warehouses provide structure and fast querying. But what if you could have both? That's where the lakehouse comes in - a hybrid that brings the best of both worlds. The Microsoft Fabric lakehouse takes this concept even further by providing an integrated experience that merges data storage, analytics, and collaboration. It allows data professionals to work cohesively, removing the barriers between engineers, analysts, and scientists.
Source: Microsoft Learn
Now, why use Apache Spark? Spark is like the engine that drives the lakehouse, a distributed processing powerhouse that can handle large volumes of data with ease. Apache Spark is open-source, versatile, and highly scalable, which makes it ideal for handling transformations and computations on the data you store in your lakehouse. Imagine you have millions of customer records, and you want to apply some complex analytics or clean the data.
Spark’s parallel processing capabilities ensure that these operations happen not only accurately but quickly, and that speed is critical in a world that demands rapid insights. With Microsoft Fabric lakehouse providing the architecture and Apache Spark driving the compute, you have a dynamic duo capable of tackling a wide range of data challenges, from simple transformations to complex machine learning workloads.
Source: Microsoft Learn
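To make that concrete, here is a minimal sketch of the kind of work Spark parallelizes across its executors. It assumes you're in a notebook and that a hypothetical customers table exists with customer_id, region, and total_spend columns - none of these names come from the walkthrough below; they're purely illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical table and columns, used only to illustrate a parallel aggregation.
customers = spark.read.table("customers")

revenue_by_region = (
    customers
    .filter(F.col("total_spend").isNotNull())   # drop incomplete records
    .groupBy("region")                          # shuffled and aggregated in parallel
    .agg(F.sum("total_spend").alias("total_revenue"),
         F.countDistinct("customer_id").alias("unique_customers"))
)
revenue_by_region.show()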
To ingest data effectively, we need to set up the right environment in Microsoft Fabric. If you've ever set up a workspace in Power BI, this is similar but designed specifically for dealing with big data. Think of the Fabric lakehouse as a workspace that brings all the necessary components - storage, compute, and collaboration - under one roof. First things first, you'll need access to the Microsoft Fabric portal and permissions to create a lakehouse. If you're unsure about permissions, it's best to check with your organization's administrator to get the right level of access.
In addition, you'll need the Apache Spark runtime available. In Microsoft Fabric, this is straightforward because a Spark environment is built in, so there's no need to manage clusters or configurations manually. This Spark environment is used to perform transformations and ingestion, making it ideal for large-scale data processing without the administrative overhead.
Now, let’s get practical: here’s how to ingest data into a Microsoft Fabric lakehouse using Apache Spark.
1. Once you're logged in to Microsoft Fabric, click Data Engineering in the left-hand menu.
Source: Sahir Maharaj
2. Within the Recommended items to create section, select the Lakehouse tile.
Source: Sahir Maharaj
Give the lakehouse a descriptive name, such as Sales_Lakehouse_2024. Naming it appropriately helps you and others easily recognize its purpose.
Source: Sahir Maharaj
3. Once your lakehouse is created, you’ll be taken to its workspace. Here, click Get Data and select your source type - this could be a data pipeline, an Eventstream, or even a local file.
Source: Sahir Maharaj
4. When uploading a local file, select the Upload Files option and browse to the CSV file on your local machine. After selecting it, click Upload to bring the file into the lakehouse (the sketch after this step shows how you could read it back with Spark later).
Source: Sahir Maharaj
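Once the notebook from step 6 is open, reading that uploaded file back with Spark could look like the sketch below. The file name sales.csv and the header/schema options are assumptions about your upload - adjust the path under Files to match the file you actually brought into the lakehouse attached to the notebook.
# Hypothetical example: assumes a file named sales.csv in the Files section of
# the lakehouse attached to the notebook.
df_sales = (
    spark.read
    .option("header", "true")        # first row holds the column names
    .option("inferSchema", "true")   # let Spark infer column types
    .csv("Files/sales.csv")
)
display(df_sales)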
5. For instance, let's say you don't have any data prepared and wish to use fictional data for exploratory purposes. Select Start with sample data to automatically import tables filled with sample data.
Source: Sahir Maharaj
6. Now that the data is in your lakehouse, it’s time to make it meaningful. To do this, select New Notebook in the lakehouse.
Source: Sahir Maharaj
7. A notebook is like your playground for running Spark commands. In your newly created notebook, start by setting up a SparkSession. You can use Python, Scala, or SQL, but for simplicity, let’s use PySpark (the Python API for Spark).
# Import SparkSession and create (or retrieve) one - the entry point for Spark operations.
# Fabric notebooks already provide a session named spark, so getOrCreate() simply returns it.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataIngestion").getOrCreate()
Source: Sahir Maharaj
8. Use Spark to read the sample data that was just created, so it's available as a DataFrame for any transformations.
# Query the sample table with Spark SQL and preview the first 1,000 rows.
df = spark.sql("SELECT * FROM Sales_Lakehouse_2024.publicholidays LIMIT 1000")
display(df)
Source: Sahir Maharaj
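Before cleaning anything, it's worth taking a quick look at what Spark actually loaded. This optional check isn't part of the original steps, but it only takes two lines:
# Optional sanity check before applying any transformations.
df.printSchema()     # column names and types Spark detected
print(df.count())    # number of rows in the DataFrame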
9. Often, the data you receive isn’t quite clean. Use Spark to apply transformations, such as dropping null values or casting data types.
# Remove rows containing nulls and ensure holidayName is stored as a string.
df_cleaned = df.dropna().withColumn("holidayName", df["holidayName"].cast("string"))
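Depending on your data, you might chain a couple of extra cleaning steps onto the same DataFrame. The ones below are purely illustrative and assume the publicholidays sample schema (a countryOrRegion column in particular):
from pyspark.sql import functions as F

# Illustrative extras: remove exact duplicate rows and trim stray whitespace.
df_cleaned = (
    df_cleaned
    .dropDuplicates()
    .withColumn("countryOrRegion", F.trim(F.col("countryOrRegion")))
)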
Finally, write the cleaned DataFrame back to the lakehouse. This saves your processed data to lakehouse storage, ready to be used for reporting.
# Persist the cleaned data, overwriting any previous output at this path.
df_cleaned.write.mode("overwrite").save("/lakehouse/Sales_Lakehouse_2024/CleanedData")
Source: Sahir Maharaj
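If you'd prefer the cleaned data to appear under the Tables section of the lakehouse, where it can be queried by name, one option is to write it as a managed Delta table instead of to a file path. The table name cleaned_publicholidays below is just an example:
# Alternative: save as a managed Delta table so it shows up under Tables.
df_cleaned.write.mode("overwrite").format("delta").saveAsTable("cleaned_publicholidays")

# Quick check: read it back by name and preview a few rows.
display(spark.read.table("cleaned_publicholidays").limit(10))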
Congratulations - you've just ingested data into a Microsoft Fabric lakehouse using Apache Spark! By now, you should understand why a lakehouse architecture with Apache Spark can be so powerful. You’ve not only learned about the theory behind it, but you've also put that knowledge into action by setting up your environment and ingesting your own data.
My call to action for you is simple: Don't stop here. Data ingestion is just the first step. With this data now in your lakehouse, think about what kind of analytics or machine learning projects you could implement. If you haven't explored Microsoft Fabric’s integration with Power BI or its machine learning capabilities, now's the time to do so. Data is only as powerful as the insights it can generate - so get out there and start generating some real value from your newly ingested data!