Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Microsoft is giving away 50,000 FREE Microsoft Certification exam vouchers. Get Fabric certified for FREE! Learn more

Reply
Teskay
Regular Visitor

Dataflow Gen 2 failure with fuzzy matching

I am trying to do a left join fuzzy merge and DF Gen2 just too slow and takes hours before failing. 

 

Started with data of over 12 million rows as the left table and right table about 1000 rows, then scaled down the left table to 1000 rows too, thinking it was the size, but I still got the same issue.
Tried to use the notebook and script option, but the large left table is not even showing the data in the lakehouse delta table.

 

I understand fuzzy matching can be computationally intensive, but it works fine but slow with smaller data in PowerBI power query, thought DF Gen2 will be better and able to handle more data. Staging and Fastcopy did not make any difference.

 

Kindly help and advice. Thank you.

@nilendraFabric  @spencer_sa @all

1 ACCEPTED SOLUTION
v-mdharahman
Community Support
Community Support

Hi @Teskay,

Thanks for reaching out to the Microsoft fabric community forum.

 

It looks like you're facing performance issues with fuzzy merging in Dataflow Gen2, especially when dealing with large datasets. Fuzzy matching is inherently resource-intensive, and DF Gen2 may not be the best tool for handling such operations at scale. You can use Fabric Notebooks with PySpark which is best suited for large datasets and is optimized for distributed processing, instead of performing fuzzy matching in DF Gen2. You can also use a SQL-based approach with Soundex or Levenshtein distance as it runs faster in Lakehouse compared to DF Gen2.

If you still want to go with fuzzy matching then, reduce the number of comparisons by preprocessing. You can apply Standardization like convert all to lowercase or remove spaces and special characters before matching or use blocking techniques for example: compare only rows starting with the same letter.

 

If I misunderstand your needs or you still have problems on it, please feel free to let us know.  

Best Regards,
Hammad.
Community Support Team

 

If this post helps then please mark it as a solution, so that other members find it more quickly.

Thank you.

View solution in original post

2 REPLIES 2
spencer_sa
Super User
Super User

As @v-mdharahman says, fuzzy matching is computationally expensive - you effectively have to do a cross-join between the two tables first.
There are a few optimisations you can do - as Hammad mentions you can reduce the number of comparisons.  You can also remove exact matches (as equality is much quicker than Levenshtein) and you can also consider broadcasting the smaller table to all nodes (8GB limit).
Pyspark has both Soundex and Levenshtein built into it.

I did a basic proof of concept that took a table of 1 million random 8 character strings and another of 1000 strings and ran a filter on distances of 1 - on an F64 it took around 8 minutes (which was far faster than I had anticipated) and returned a single row (which was not what I was expecting so I'll need to convince myself it did what I wanted.)

v-mdharahman
Community Support
Community Support

Hi @Teskay,

Thanks for reaching out to the Microsoft fabric community forum.

 

It looks like you're facing performance issues with fuzzy merging in Dataflow Gen2, especially when dealing with large datasets. Fuzzy matching is inherently resource-intensive, and DF Gen2 may not be the best tool for handling such operations at scale. You can use Fabric Notebooks with PySpark which is best suited for large datasets and is optimized for distributed processing, instead of performing fuzzy matching in DF Gen2. You can also use a SQL-based approach with Soundex or Levenshtein distance as it runs faster in Lakehouse compared to DF Gen2.

If you still want to go with fuzzy matching then, reduce the number of comparisons by preprocessing. You can apply Standardization like convert all to lowercase or remove spaces and special characters before matching or use blocking techniques for example: compare only rows starting with the same letter.

 

If I misunderstand your needs or you still have problems on it, please feel free to let us know.  

Best Regards,
Hammad.
Community Support Team

 

If this post helps then please mark it as a solution, so that other members find it more quickly.

Thank you.

Helpful resources

Announcements
MarchFBCvideo - carousel

Fabric Monthly Update - March 2025

Check out the March 2025 Fabric update to learn about new features.

Notebook Gallery Carousel1

NEW! Community Notebooks Gallery

Explore and share Fabric Notebooks to boost Power BI insights in the new community notebooks gallery.

April2025 Carousel

Fabric Community Update - April 2025

Find out what's new and trending in the Fabric community.

"); $(".slidesjs-pagination" ).prependTo(".pagination_sec"); $(".slidesjs-pagination" ).append("
"); $(".slidesjs-play.slidesjs-navigation").appendTo(".playpause_sec"); $(".slidesjs-stop.slidesjs-navigation").appendTo(".playpause_sec"); $(".slidesjs-pagination" ).append(""); $(".slidesjs-pagination" ).append(""); } catch(e){ } /* End: This code is added by iTalent as part of iTrack COMPL-455 */ $(".slidesjs-previous.slidesjs-navigation").attr('tabindex', '0'); $(".slidesjs-next.slidesjs-navigation").attr('tabindex', '0'); /* start: This code is added by iTalent as part of iTrack 1859082 */ $('.slidesjs-play.slidesjs-navigation').attr('id','playtitle'); $('.slidesjs-stop.slidesjs-navigation').attr('id','stoptitle'); $('.slidesjs-play.slidesjs-navigation').attr('role','tab'); $('.slidesjs-stop.slidesjs-navigation').attr('role','tab'); $('.slidesjs-play.slidesjs-navigation').attr('aria-describedby','tip1'); $('.slidesjs-stop.slidesjs-navigation').attr('aria-describedby','tip2'); /* End: This code is added by iTalent as part of iTrack 1859082 */ }); $(document).ready(function() { if($("#slides .item").length < 2 ) { /* Fixing Single Slide click issue (commented following code)*/ // $(".item").css("left","0px"); $(".item.slidesjs-slide").attr('style', 'left:0px !important'); $(".slidesjs-stop.slidesjs-navigation").trigger('click'); $(".slidesjs-previous").css("display", "none"); $(".slidesjs-next").css("display", "none"); } var items_length = $(".item.slidesjs-slide").length; $(".slidesjs-pagination-item > button").attr("aria-setsize",items_length); $(".slidesjs-next, .slidesjs-pagination-item button").attr("tabindex","-1"); $(".slidesjs-pagination-item button").attr("role", "tab"); $(".slidesjs-previous").attr("tabindex","-1"); $(".slidesjs-next").attr("aria-hidden","true"); $(".slidesjs-previous").attr("aria-hidden","true"); $(".slidesjs-next").attr("aria-label","Next"); $(".slidesjs-previous").attr("aria-label","Previous"); //$(".slidesjs-stop.slidesjs-navigation").attr("role","button"); //$(".slidesjs-play.slidesjs-navigation").attr("role","button"); $(".slidesjs-pagination").attr("role","tablist").attr("aria-busy","true"); $("li.slidesjs-pagination-item").attr("role","list"); $(".item.slidesjs-slide").attr("tabindex","-1"); $(".item.slidesjs-slide").attr("aria-label","item"); /*$(".slidesjs-stop.slidesjs-navigation").on('click', function() { var itemNumber = parseInt($('.slidesjs-pagination-item > a.active').attr('data-slidesjs-item')); $($('.item.slidesjs-slide')[itemNumber]).find('.c-call-to-action').attr('tabindex', '0'); });*/ $(".slidesjs-stop.slidesjs-navigation, .slidesjs-pagination-item > button").on('click keydown', function() { $.each($('.item.slidesjs-slide'),function(i,el){ $(el).find('.c-call-to-action').attr('tabindex', '-1'); }); var itemNumber = parseInt($('.slidesjs-pagination-item > button.active').attr('data-slidesjs-item')); $($('.item.slidesjs-slide')[itemNumber]).find('.c-call-to-action').attr('tabindex', '0'); }); $(".slidesjs-play.slidesjs-navigation").on('click', function() { $.each($('.item.slidesjs-slide'),function(i,el){ $(el).find('.c-call-to-action').attr('tabindex', '-1'); }); }); $(".slidesjs-pagination-item button").keyup(function(e){ var keyCode = e.keyCode || e.which; if (keyCode == 9) { e.preventDefault(); $(".slidesjs-stop.slidesjs-navigation").trigger('click').blur(); $("button.active").focus(); } }); $(".slidesjs-play").on("click",function (event) { if (event.handleObj.type === "click") { $(".slidesjs-stop").focus(); } else if(event.handleObj.type === "keydown"){ if (event.which === 13 && $(event.target).hasClass("slidesjs-play")) { $(".slidesjs-stop").focus(); } } }); $(".slidesjs-stop").on("click",function (event) { if (event.handleObj.type === "click") { $(".slidesjs-play").focus(); } else if(event.handleObj.type === "keydown"){ if (event.which === 13 && $(event.target).hasClass("slidesjs-stop")) { $(".slidesjs-play").focus(); } } }); $(".slidesjs-pagination-item").keydown(function(e){ switch (e.which){ case 37: //left arrow key $(".slidesjs-previous.slidesjs-navigation").trigger('click'); e.preventDefault(); break; case 39: //right arrow key $(".slidesjs-next.slidesjs-navigation").trigger('click'); e.preventDefault(); break; default: return; } $(".slidesjs-pagination-item button.active").focus(); }); }); // Start This code is added by iTalent as part of iTrack 1859082 $(document).ready(function(){ $("#tip1").attr("aria-hidden","true").addClass("hidden"); $("#tip2").attr("aria-hidden","true").addClass("hidden"); $(".slidesjs-stop.slidesjs-navigation, .slidesjs-play.slidesjs-navigation").attr('title', ''); $("a#playtitle").focus(function(){ $("#tip1").attr("aria-hidden","false").removeClass("hidden"); }); $("a#playtitle").mouseover(function(){ $("#tip1").attr("aria-hidden","false").removeClass("hidden"); }); $("a#playtitle").blur(function(){ $("#tip1").attr("aria-hidden","true").addClass("hidden"); }); $("a#playtitle").mouseleave(function(){ $("#tip1").attr("aria-hidden","true").addClass("hidden"); }); $("a#play").keydown(function(ev){ if (ev.which ==27) { $("#tip1").attr("aria-hidden","true").addClass("hidden"); ev.preventDefault(); return false; } }); $("a#stoptitle").focus(function(){ $("#tip2").attr("aria-hidden","false").removeClass("hidden"); }); $("a#stoptitle").mouseover(function(){ $("#tip2").attr("aria-hidden","false").removeClass("hidden"); }); $("a#stoptitle").blur(function(){ $("#tip2").attr("aria-hidden","true").addClass("hidden"); }); $("a#stoptitle").mouseleave(function(){ $("#tip2").attr("aria-hidden","true").addClass("hidden"); }); $("a#stoptitle").keydown(function(ev){ if (ev.which ==27) { $("#tip2").attr("aria-hidden","true").addClass("hidden"); ev.preventDefault(); return false; } }); }); // End This code is added by iTalent as part of iTrack 1859082