Hey all, I'd appreciate some help because I'm stuck on how to delete duplicate records from my production table before appending the records from my staging table.
Scenario: I have a table in an on-prem data source. I'm able to copy data using a Data Pipeline (this was a fantastic recent update). I can pass a query that gets the latest records based on an updated_at datetime column and lands those records in a staging table. The issue is that records can be updated, so an order that was created can later be updated to include a shipping date when it actually ships out the door. So I need a way to delete the old version of the record in the production table before appending the staging data.
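For illustration, the source query I pass to the copy step looks roughly like this (the table, column, and watermark value are just placeholders):

```sql
-- Rough shape of the incremental source query (placeholder names)
SELECT *
FROM dbo.orders
WHERE updated_at > '2024-01-01T00:00:00'   -- high-water mark from the last refresh
ORDER BY updated_at;
```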
Here's what I've got: (This is a simplified version - I'm actually iterating over a handful of tables but the steps are the same)
So I know the brute-force way is to build a notebook that loads the production table, filters it with a left anti join against the staging table, appends the staging data, and overwrites the production table in the lakehouse. It seems like a very compute-heavy approach. It's also relatively time-consuming, which means less frequent refreshes. But it does work.
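In Spark SQL terms, the brute-force pattern is roughly this (table and key names are placeholders; it assumes the production table is a Delta table, where overwriting a table you also read from in the same statement is allowed):

```sql
-- Keep production rows that have no newer version in staging, then add the staging rows
CREATE OR REPLACE TEMPORARY VIEW orders_rebuilt AS
SELECT p.*
FROM orders AS p
LEFT ANTI JOIN orders_staging AS s
  ON p.order_id = s.order_id
UNION ALL
SELECT * FROM orders_staging;

-- Overwrite the production table with the rebuilt result
INSERT OVERWRITE TABLE orders
SELECT * FROM orders_rebuilt;
```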
So I'm thinking there has to be a better way that I'm just not thinking about. I've tried:
Prior to choosing to use a Dataflow Gen2 to build the table with the latest updated_at date, I did try building a SQL script to create a table, but that generated an error. I saw the structure in this YouTube video. I could create a schema but couldn't create a table (I did talk to tech support and they said table creation isn't supported in the Lakehouse since we're accessing parquet files via a SQL endpoint).
Most of the incremental refresh documentation doesn't cover deleting duplicate rows from the production table before appending the staging table, but the docs were very helpful in getting this far.
So I'm at this point where I have a functional solution, but I feel like it's unnecessarily resource-intensive. I'm hoping to tap into the community to see if there's a better way. I'm a citizen developer, so not super technical, but I have done a lot with Power BI and have already built out a lot in Fabric. This one has me stumped, and I figure it's because I haven't learned it yet or I'm looking for a solution that doesn't exist.
Please let me know if anyone has figured this out or has any ideas or if I'm just nuts and looking to do the impossible.
Hello Everyone, thanks for the ideas on this. I've been able to successfully build out an Incremental refresh structure without using a notebook. Some notes on my use case:
Here are some details on how I've set this up. I'll post additional details.
What I set up:
Dataflow Gen2: Admin Table – I start with a Dataflow Gen2 that is just a manually maintained set of tables I iterate over for pulling in data. It includes these tables and the columns listed:
*This is 1 if the table needs to use a dataflow to copy into the data warehouse instead of a straight copy in Data Factory. I’ve found that long text fields that exceed varchar(8000) need to be copied using a dataflow or custom scripting. Since I’m trying to stay low-code, I’m going with the dataflow gen2 for that. But you could add a column to pull in a script for that specific table if preferred.
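As a rough illustration only (these column names are my own placeholders, not necessarily the exact columns listed above), the admin table driving the per-table iteration might look like:

```sql
-- Hypothetical shape of the admin/control table used for the iteration
CREATE TABLE admin.TableConfig
(
    SourceSchema    varchar(128),  -- schema of the on-prem source table
    SourceTable     varchar(128),  -- table to copy incrementally
    KeyColumn       varchar(128),  -- business key used to delete/replace duplicates
    WatermarkColumn varchar(128),  -- e.g. updated_at
    UseDataflow     int            -- 1 = copy via Dataflow Gen2 (long text fields), 0 = Copy activity
);
```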
A data pipeline manages everything. Here’s what’s in it:
There are many nuances, so let me know if you have questions. The pattern is pretty simple and is basically repeated for each step in the medallion architecture. It would be easy to simplify cleaning and refining the data, and most of the Dataflow Gen2 entries could be replaced by SQL scripts if needed/preferred.
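For example, the dedupe-and-append step that a SQL script would perform for one table in the warehouse is essentially this (schema, table, and key names are placeholders):

```sql
-- Remove production rows that are about to be replaced by newer staging rows
DELETE FROM dbo.orders
WHERE order_id IN (SELECT order_id FROM stg.orders);

-- Append the fresh rows from staging
INSERT INTO dbo.orders
SELECT * FROM stg.orders;
```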
Notebooks are easier than all of this. But the limitation on number of notebooks running at one time could cause conflicts with other scheduled notebooks/pipelines (with notebooks) or even with your users who are running ad hoc notebooks.
Is it possible to do the same without using a data warehouse?
Yes. But you would need to use a Notebook to perform CRUD operations against a lakehouse, for example using PySpark or Spark SQL. So now you would need to learn best practices for organizing and managing Spark Notebooks, and figure out whether the additional CUs incurred by running Notebooks from a pipeline fit within your SKU's budget.
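For example (table and key names are placeholders), a single Spark SQL cell can upsert the staging rows into the production Delta table, which covers both removing the stale rows and appending the new ones in one statement:

```sql
-- Spark SQL sketch, run from a Fabric notebook against the Lakehouse
MERGE INTO orders AS p
USING orders_staging AS s
  ON p.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *    -- replace the stale production row
WHEN NOT MATCHED THEN INSERT *;   -- append brand-new orders
```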
@simarsokhi I know Dataflow Gen2 does support incremental refresh now. Last time I tried (it's been months), Dataflow Gen2 required that the table have a created-at timestamp and an updated-at timestamp as two separate columns. I believe they were working to improve that experience so you can just use an updated-at timestamp, but I don't know if Microsoft has launched that feature yet. I also found that Dataflow Gen2 tended to use more capacity to do the incremental refresh.
Another idea... why not do an UPDATE of the records instead of DELETE using a Notebook?
Also, you don't need to store the updated_at in a LH. From within a pipeline, using a Script activity, you query the LH table and grab the latest date and compare to the latest on-prem date (or you can add an integer index column to both the on-prem and the LH table and then just grab the MAX index from the LH table and then compare that to the index value from the on-prem table).
Either way, you save this value in a pipeline variable. Then use it in a SQL query in the Copy data settings to pull only the on-prem data that hasn't been persisted in the LH yet.
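In rough sketch form (table and column names are placeholders):

```sql
-- Script activity: grab the high-water mark from the Lakehouse SQL endpoint
SELECT MAX(updated_at) AS last_updated
FROM dbo.orders;

-- The result goes into a pipeline variable, which the Copy data source query
-- then uses to pull only the rows newer than what's already in the LH, e.g.:
-- SELECT * FROM dbo.orders WHERE updated_at > '@{variables('last_updated')}'
```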
I am in the middle of testing this myself.
I've not tried an update statement, I'll explore that!
Hi @GregMarbais,
We haven't heard from you since the last response and were just checking back to see if you have a resolution yet.
If you have found a resolution, please share it with the community as it can help others.
Otherwise, reply back with more details and we will try to help.
@Anonymous I think I have a solution but I'm testing it to make sure it does everything I need. The trick (I think) is not using a lakehouse but instead using a Data Warehouse. I'm working through it now and will post it as soon as I have a definitive answer. I'm very hopeful!
Unfortunately, I think you hit the nail on the head, meaning your only option is a Notebook using PySpark (unless the Notebook route also allows T-SQL DELETE statements). I say only option because a DFg2 doesn't allow modifying a LH, nor does a Script activity in a pipeline. Best of luck.