Article was originally published here
Let's say you have one or more CSV files that you want to convert to Parquet format and upload to a Lakehouse table. The available options for this in the Fabric environment are either a notebook or a data pipeline, but there aren't any pre-built, out-of-the-box solutions.
For instance, you might have an application that generates CSV files, and you want to upload the CSV data directly to the Lakehouse at that moment. One approach could be to make the application store the CSV files in ADLS Gen2 storage and use an event-based pipeline triggered by storage events to upload the data to the Lakehouse. However, what if storing the files on cloud storage isn't an option and the files will always be stored on on-premises storage?
In this post, I've tried to provide an out-of-the-box solution that automates the conversion of CSV files to Parquet and uploads them to a Lakehouse table. You might then ask, "Why convert to Parquet?" There's no strict requirement to do so, but if you need the Parquet files to integrate the data with other applications, it can be useful.
In this solution, I used the following components/libraries:
CsvHelper : An amazing library if you are dealing with CSV files. It reads CSV records in no time. All you have to do is open the file through a file stream and pass the stream to the library's CsvReader class to get the records back.
Example:
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;

void Main()
{
    using (var reader = new StreamReader("path\\to\\file.csv"))
    using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
    {
        // GetRecords<T> is lazily evaluated; materialize it before the reader is disposed.
        List<MyClass> records = csv.GetRecords<MyClass>().ToList();
    }
}

public class MyClass
{
    public int Id { get; set; }
    public string Name { get; set; }
}
Parquet.Net : A very versatile .NET library that provides out-of-the-box capabilities for working with the Parquet file format, with extremely fast serialization and deserialization. I have used this library extensively in my previous articles here and here.
ADLS GEN2 API : Provides the ability to interact with Azure Blob Storage through a file system interface.
Fabric REST API : An extensive set of APIs that can be used to interact with, manipulate, and automate Fabric objects and processes.
The approach used here is to first read the CSV data through CsvHelper and convert it to a List<T>. The List<T> is then handed to Parquet.Net, which serializes the data into Parquet format and returns it as a stream. I used a MemoryStream to hold the serialized Parquet data; you can use a FileStream instead if you need a physical Parquet file generated.
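As a rough sketch of that conversion step (assuming Parquet.Net v4's ParquetSerializer and reusing the MyClass type from the CsvHelper example above; the method name is purely illustrative):

using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Parquet.Serialization;

// Serialize a List<T> to Parquet in memory (a minimal sketch; error handling omitted).
static async Task<MemoryStream> ToParquetStreamAsync(List<MyClass> records)
{
    var stream = new MemoryStream();
    // ParquetSerializer infers the Parquet schema from the public properties of MyClass.
    await ParquetSerializer.SerializeAsync(records, stream);
    stream.Position = 0; // rewind so the stream can be uploaded as-is
    return stream;
}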
I then used the ADLS Gen2 API to push this stream as a file to the Files folder of the lakehouse, similar to what I did in my previous article. However, in that article I had a physical file available, whereas here I used the MemoryStream to patch and flush the file to the lakehouse folder.
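Here is a sketch of that upload, assuming the Azure.Storage.Files.DataLake SDK pointed at the OneLake DFS endpoint; it wraps the same create/append (patch)/flush operations as the raw ADLS Gen2 REST API. The workspace and lakehouse names, the file name, and the parquetStream variable are placeholders:

using System;
using Azure.Identity;
using Azure.Storage.Files.DataLake;

// Upload the in-memory Parquet stream to the lakehouse Files folder via the ADLS Gen2 surface of OneLake.
// "MyWorkspace" and "MyLakehouse.Lakehouse" are placeholders for your workspace and lakehouse item.
var credential = new DefaultAzureCredential();
var fileSystem = new DataLakeFileSystemClient(
    new Uri("https://onelake.dfs.fabric.microsoft.com/MyWorkspace"), credential);
var fileClient = fileSystem.GetFileClient("MyLakehouse.Lakehouse/Files/sales.parquet");

await fileClient.CreateAsync();                              // create the file path
await fileClient.AppendAsync(parquetStream, offset: 0);      // "patch" the bytes
await fileClient.FlushAsync(position: parquetStream.Length); // "flush" to commit the upload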
Once the file is available in the lakehouse, I used the Fabric API Load Table method to move the file into a table. I had penned down that process in my earlier article on the Fabric REST API; you can find the details here.
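A minimal sketch of that call, assuming the Lakehouse "Load Table" endpoint of the Fabric REST API; the workspace/lakehouse IDs, table name, relative path, and the fabricToken variable are placeholders:

using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

// POST the Load Table request; Fabric ingests Files/sales.parquet into the "sales" table.
var http = new HttpClient();
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", fabricToken);

var url = "https://api.fabric.microsoft.com/v1/workspaces/<workspaceId>/lakehouses/<lakehouseId>/tables/sales/load";
var body = @"{
  ""relativePath"": ""Files/sales.parquet"",
  ""pathType"": ""File"",
  ""mode"": ""Overwrite"",
  ""formatOptions"": { ""format"": ""Parquet"" }
}";

var response = await http.PostAsync(url, new StringContent(body, Encoding.UTF8, "application/json"));
response.EnsureSuccessStatusCode(); // the load itself runs as an asynchronous operation in Fabric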
Note : We require two different scopes for authentication, one for the ADLS Gen2 API and another for the Fabric REST API.
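For illustration, both tokens can be requested through Azure.Identity with the respective resource scopes (a sketch; your credential type and tenant setup may differ, and the DataLake SDK shown earlier requests the storage scope for you automatically):

using Azure.Core;
using Azure.Identity;

var credential = new DefaultAzureCredential();

// Token for the ADLS Gen2 / OneLake storage surface.
AccessToken storageToken = await credential.GetTokenAsync(
    new TokenRequestContext(new[] { "https://storage.azure.com/.default" }), default);

// Token for the Fabric REST API; fabricToken.Token is the bearer string used in the Load Table call above.
AccessToken fabricToken = await credential.GetTokenAsync(
    new TokenRequestContext(new[] { "https://api.fabric.microsoft.com/.default" }), default);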
Code
I have uploaded the source code to GitHub. You can find the repository here.
The aim of this post was to highlight the out-of-the-box capabilities for uploading data and automating table generation from the newly uploaded data through the ADLS Gen2 and Fabric REST APIs.
Do let me know if you have any feedback or comments. Thanks for reading!