Marco117
Frequent Visitor

MS Fabric Spark notebook with poor performance

I have a notebook that was running on a Databricks cluster with only 4 executor cores and 4 driver cores, and with Autoscale and dynamic executor allocation disabled. There, the notebook executed in approximately 3 minutes.

Now, in Fabric, the same notebook with the same inputs (which I read from Databricks) takes 14 minutes to execute (see the attached image; the notebook is invoked from a pipeline). What I see different in Fabric is that the workspace has Autoscale and Dynamically allocate executors enabled, and there are moments when the notebook scales up to 72 cores even though this is really not necessary.
[Attached image: Marco117_0-1727708906047.png]

What could I do to improve this?

4 REPLIES
prasbharat
Frequent Visitor

@Marco117 

It sounds like you’re experiencing a significant difference in execution time when running your Spark notebook in Microsoft Fabric compared to Databricks. Based on your description, here are some suggestions that could help optimize performance:

1. Validate Resource Allocation and Execution Overhead

Use the Monitor run series view to check where the additional time is being spent. Running the notebook directly, outside the pipeline, might reveal whether the pipeline execution is introducing delays.

  • If there are multiple notebooks in the pipeline, consider enabling High Concurrency Mode to minimize sequential bottlenecks.
  • If concurrency doesn’t apply, you can try running the notebooks independently for a direct comparison.

2. Since Autoscale and Dynamic Allocation are enabled, they might be overprovisioning resources unnecessarily. If your workload is stable and doesn’t require frequent scaling:

  • Disable Autoscale and Dynamic Allocation temporarily.
  • Configure fixed resources for Spark using settings similar to your Databricks cluster:

spark.conf.set("spark.executor.instances", "2")
spark.conf.set("spark.executor.cores", "4")
spark.conf.set("spark.driver.cores", "4")
spark.conf.set("spark.sql.shuffle.partitions", "8")

 

3. Ensure your input data is stored in an optimized format like Parquet or Delta for better read performance. If the current format is CSV or JSON, consider converting it:

df = spark.read.csv("path_to_data", header=True, inferSchema=True)
df.repartition(4).write.mode("overwrite").parquet("optimized_data_path")

Partition your dataset based on the number of executor cores to reduce shuffle overhead:

df = df.repartition(4)  # e.g., 4 partitions to match the 4 executor cores

 

4. Monitor job performance:

  • Use the Spark UI (reachable from the Monitoring hub in Fabric) to analyze job execution. Look for long-running stages or inefficient shuffling during transformations.
  • Use the explain() method to analyze your query execution plans and optimize joins, filters, or other transformations, as in the sketch below.
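
To make the explain() suggestion concrete, here is a minimal, hypothetical sketch; the paths and column names below are placeholders, not taken from the original post:

# Hypothetical example: inspect the physical plan of a filter + join
orders = spark.read.parquet("orders_path")          # placeholder path
customers = spark.read.parquet("customers_path")    # placeholder path
result = orders.filter(orders["amount"] > 100).join(customers, "customer_id")
result.explain()  # prints the physical plan to the cell output

In the printed plan, repeated Exchange operators indicate shuffles; if a small dimension table ends up in a SortMergeJoin instead of a BroadcastHashJoin, broadcasting it explicitly may help.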

If this post helps you resolve the issue, please consider accepting it as the solution to help other members find it more quickly. If you still have additional questions, feel free to let me know, and I’ll be happy to assist further. Thanks a lot!

 

abuislam
Frequent Visitor

It seems like the Autoscale and Dynamic Allocation settings in Fabric might be over-allocating resources, which could be slowing things down. Try adjusting those settings to better match your workload, or consider disabling Autoscale and Dynamic Executor Allocation to match the Databricks setup. You could also experiment with manually setting the number of cores to see if that improves performance.

lbendlin
Super User

I have been told that bucket sizes play an outsized role (sorry for the pun). Maybe that's something you can compare between your two setups.

v-cgao-msft
Community Support

Hi @Marco117 ,

 

1. I tested running notebooks from a pipeline versus running them directly, and the pipeline run took only a few seconds longer than the direct run (my query is simple). You can further confirm where the time is being spent in the Monitor run series view.

Monitor Apache Spark run series
I'm not sure how many notebooks are in the pipeline; if there is more than one, consider using High Concurrency Mode.

Introducing High Concurrency Mode for Notebooks in Pipelines for Fabric Spark

If no exceptions are found above, then it is time to look at Autoscale and dynamic executor allocation.

If your workload is relatively stable and you don't need additional cores to speed up execution, you might consider disabling these two features.

 

2. Why the notebook uses 72 cores:
When you enable autoscale for Spark pools, jobs execute with their minimum node configuration. During runtime, scaling may occur. These scale-up requests go through job admission control. Approved requests scale up to the maximum limits based on total available cores. Rejected requests don't affect active jobs; they continue to run with their current configuration until cores become available.
Job admission in Apache Spark for Fabric
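
As a quick sanity check (a minimal sketch; some of these properties may be unset depending on how the pool is configured), you can print the allocation settings the running session actually received:

# Print the effective allocation settings of the current Spark session
for key in [
    "spark.dynamicAllocation.enabled",
    "spark.dynamicAllocation.maxExecutors",
    "spark.executor.cores",
    "spark.executor.instances",
]:
    print(key, "=", spark.conf.get(key, "<not set>"))

This makes it easy to confirm whether dynamic allocation is actually enabled for the session before attributing the 72-core spikes to it.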

 

3. Other possible causes:
If you change the default pool from the Starter Pool to a custom Spark pool, you may see a longer session start time (~3 minutes).
In my test, both the session start and the first command execution took longer the first time (it only took about 20 seconds before).

[Attached image: vcgaomsft_0-1727766642335.png]

Workspace administration settings in Microsoft Fabric

 

Best Regards,
Gao

Community Support Team

 

If any post helps, please consider accepting it as the solution to help the other members find it more quickly.
If I misunderstand your needs or you still have problems on it, please feel free to let us know. Thanks a lot!

How to get your questions answered quickly --  How to provide sample data in the Power BI Forum
