Re: Append Table action in Data pipeline duplicate...

KruthikaPillay1 · ‎04-10-2025

When a pipeline which uses copy data assistant is set to Append mode, every consequest refresh doules the data in Lakehouse table.

Example:
If there are initially 10 records in the table, performing a refresh will append another 10 records, resulting in 20 rows (10 existing + 10 new).

Now, if the source has 2 additional records i.e total 12 records, and another refresh is triggered, the current 20 rows in the Lakehouse table will be appended with the new 12 records, making the total 32 rows (20 + 12).

V-yubandi-msft

Hi @KruthikaPillay1 ,

May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will be helpful for other community members who have similar problems to solve it faster.

Thank you.

V-yubandi-msft

Hi @KruthikaPillay1 ,

Has your issue been resolved, or do you require any further information? Your feedback is valuable to us. If the solution was effective, please mark it as 'Accepted Solution' to assist other community members experiencing the same issue.

V-yubandi-msft · ‎04-10-2025

Hi @KruthikaPillay1 ,

Thank you for engaging with the Microsoft Fabric Communit. and @spencer_sa explained it well.

To summarize.

Append mode in Microsoft Fabric’s Copy Data Assistant will re-add all source records on each refresh, even if they already exist in the destination.
This is expected behavior, as Append mode does not perform deduplication or check for changes.
Overwrite mode is your best option if you want to always keep only the latest version of the source data.
For more advanced control, you’ll need to use a Notebook with Spark SQL or PySpark.

If there is any post helps, then please consider Accept it as the solution to help the other members find it more quickly.

KruthikaPillay1 · ‎04-11-2025

Hi @V-yubandi-msft ,

I would like to raise a few concerns regarding the behavior of the Append mode in Data Pipelines within Microsoft Fabric:

Data Duplication in Append Mode
Data Pipelines are recommended by Microsoft for uploading large datasets into OneLake. However, if the Append mode re-adds all records during each refresh, even when they already exist in the destination—it leads to exponential growth in dataset size. In that case, what is the benefit of using pipelines for large datasets if the result is duplicated data with every refresh?
Impact on Data Integrity
When the primary key is duplicated with each refresh, the resulting dataset becomes unusable for reporting purposes. This requires additional logic in the pipeline to clean or de-duplicate data, which adds unnecessary complexity.
Limitations of Overwrite Mode
I understand that the Overwrite option is available, it drops and reloads the entire dataset from the source during each refresh. However, for large datasets—especially where pipelines are recommended—reloading the full dataset each time is highly resource-intensive and time-consuming, contradicting the purpose of using Data Pipelines for large-scale data operations.
Inconsistent Behavior Across Platforms
Dataflows created within a Fabric workspace also duplicate data in Append mode. However, Dataflows created via Power Platform behave differently: they append only new and updated records rather than duplicating the entire dataset. This inconsistency across Fabric Data Pipelines, Fabric Dataflows, and Power Platform Dataflows raises confusion and makes it difficult to standardize implementation.

Could you please confirm if there are any plans to improve Append mode in Fabric pipelines to support delta loads or de-duplication?

V-yubandi-msft · ‎04-14-2025

Hi @KruthikaPillay1 ,

1. Append mode in Microsoft Fabric’s Copy Data pipelines does not check for existing records, it simply adds all data from the source to the destination each time it runs. Although this method is efficient for data ingestion, it does not prevent duplicates or track changes, which can result in data duplication if the source includes both old and new records.

2. Additionally, duplicated primary keys can disrupt reporting and cause inconsistencies. Since Append mode does not enforce uniqueness or support upserts (updates/inserts), you will need to implement additional logic using Spark Notebooks or SQL MERGE to maintain data integrity in your lakehouse.

3. Conversely, Overwrite mode replaces the entire dataset with each run. While suitable for full refreshes, it is inefficient for frequently updated large datasets. For substantial data updates, utilizing SparkSQL or PySpark in Notebooks is recommended in Fabric.

4. A key distinction between Fabric Pipelines and Power Platform Dataflows is that Dataflows support incremental refresh and change tracking, whereas Fabric Pipelines do not. This can be confusing when working across platforms, so the best choice depends on your specific requirements.

In Simply:

Append mode = fast but duplicates everything.

Overwrite mode = clean slate every run, but not scalable for big data.

Upserts/deduplication = use Notebooks (Spark/PySpark) or SQL MERGE logic.

Cross-platform behavior = yes, inconsistent; choose based on capabilities needed.

Did I answer your question? Mark my post as a solution.

spencer_sa · ‎04-11-2025

Hi @KruthikaPillay1

The limitations of Append/Overwrite are not actually Microsoft's but are limitations of the Delta Table file structure that it sits on.

Copy Activities are deliberately a very lightweight, fast way of moving data from one place to another with the minimum of compute - hence using the native two methods of append and overwrite which add new files to the table and then either keep the old ones (for append) or 'remove' them (for overwrite)

The capabilities you're looking for require a degree of compute that a straight Copy Activity does not have - specifically you're asking it to check if a record already exists in the table and to not reload it if it does - effectively a SQL MERGE statement. (If I were to do that on my biggest dataset, I'd be comparing 90M new records against 12B existing ones.)

The compute requirements for Dataflows are *significantly* higher than Pipelines, which is why they have this capability.

You have a number of options depending on your requirements - all work around the Delta Table limitation invisibly.

Dataflows / Dataflow Gen IIs: High compute cost and quite slow, low/no code, many connectors
Warehouse and SQL MERGEs: Will work, has some peculiarities
SparkSQL MERGE notebooks: Fast, limited to what you can do in SQL, limited parameterisation capabilities
pySpark notebooks: Fast, full power of Python *and* Spark, can also execute Spark SQL, lots of parameterisation, definitely not low code

For our application we're using pySpark to Upsert (aka merge) data - I'm trying to wean people off of dynamic SQL, but when you work with a bunch of SQL developers, I'm not trying that hard.

There are some tricks you can do with partitioning, deletes, and appends to speed things up if you know the data you're loading is a complete slice (say all of March's data)

If this helps, please consider Accepting as a Solution to helps others find it more easily.

spencer_sa · ‎04-10-2025

As I understand it, append works in exactly that way. It takes whatever's in the source and just adds the records on the end - so as you've discovered; start with 10 records, reload the same 10, bingo you now have 20.

You get two options with the copy data activity - append (just add source records to the end) and overwrite (replaces what's there with the source records). So start with 10 records, overwrite with a source of 12 records, you now have 12 records.

If you want to do anything more fancy with a lakehouse you're heading into Notebook territory and either SparkSQL MERGE statements or pyspark goodness.

If this helps, please consider Accepting as a Solution to helps others find it more easily.

Append Table action in Data pipeline duplicates all data rows instead of appending the new rows

Helpful resources

Fabric Monthly Update - March 2025

NEW! Community Notebooks Gallery

Fabric Community Update - April 2025