Benchmarking PySpark shuffle: what the metrics actually tell you
Building a benchmarking utility for shuffle and network transfer metrics in Databricks clusters.
Shuffle is where PySpark jobs go to die. A query that processes 500GB of data might shuffle 2TB across the cluster, turning a 10-minute job into a 45-minute crawl. I built a benchmarking utility to make shuffle costs visible before they become production incidents.
Why shuffle metrics matter
Every wide transformation — groupBy, join, repartition, window — triggers a shuffle (broadcast joins being the notable exception). Data is serialised, written to local disk, transferred across the network, deserialised, and reassembled. Each step has a cost, and those costs compound.
The problem is visibility. Spark’s UI shows shuffle read and write bytes, but not in a way that’s easy to compare across job versions or cluster configurations. You need structured metrics.
The benchmarking approach
```python
from dataclasses import dataclass
from time import perf_counter

from pyspark.sql import SparkSession


@dataclass
class ShuffleMetrics:
    job_name: str
    shuffle_read_bytes: int
    shuffle_write_bytes: int
    network_bytes_sent: int
    executor_run_time_ms: int
    gc_time_ms: int
    peak_memory_bytes: int
    wall_clock_seconds: float
```
The utility wraps a Spark action, captures metrics from the Spark listener API, and logs them to a structured format. Running the same job with different configurations — partition counts, join strategies, broadcast thresholds — produces comparable data points.
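A minimal sketch of that wrapper, reusing the `ShuffleMetrics` dataclass above. The `collect_metrics` hook is a hypothetical stand-in for however the metrics actually arrive — for example a SparkListener registered through py4j, or a scrape of the Spark monitoring REST API — since that wiring is deployment-specific:

```python
from dataclasses import dataclass
from time import perf_counter
from typing import Callable, Mapping


@dataclass
class ShuffleMetrics:
    job_name: str
    shuffle_read_bytes: int
    shuffle_write_bytes: int
    network_bytes_sent: int
    executor_run_time_ms: int
    gc_time_ms: int
    peak_memory_bytes: int
    wall_clock_seconds: float


def benchmark(job_name: str,
              action: Callable[[], object],
              collect_metrics: Callable[[], Mapping[str, int]]) -> ShuffleMetrics:
    """Run a Spark action, time it, and package the collected metrics.

    `collect_metrics` is a hypothetical hook that returns aggregated
    per-job totals gathered elsewhere (listener or REST scrape).
    """
    start = perf_counter()
    action()  # force execution, e.g. lambda: df.count()
    wall = perf_counter() - start
    m = collect_metrics()
    return ShuffleMetrics(
        job_name=job_name,
        shuffle_read_bytes=m.get("shuffle_read_bytes", 0),
        shuffle_write_bytes=m.get("shuffle_write_bytes", 0),
        network_bytes_sent=m.get("network_bytes_sent", 0),
        executor_run_time_ms=m.get("executor_run_time_ms", 0),
        gc_time_ms=m.get("gc_time_ms", 0),
        peak_memory_bytes=m.get("peak_memory_bytes", 0),
        wall_clock_seconds=wall,
    )
```

Because the action and the metric source are both injected, the same wrapper runs unchanged across cluster configurations, which is what makes the runs comparable.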
Key findings from production workloads
After benchmarking several Databricks production jobs:
- Partition count is the biggest lever. A job shuffling 1TB into 200 partitions produces 200 shuffle blocks per mapper; with 50 mappers, that’s 10,000 blocks to fetch per stage. Reducing to 100 partitions cut shuffle time by 35% on our workloads.
- Broadcast joins eliminate shuffle entirely for small tables, but “small” depends on executor memory. The default 10MB threshold is too conservative for modern clusters.
- GC time correlates with shuffle spill. When shuffle data exceeds available memory and spills to disk, GC pressure spikes. Monitoring GC time is an early warning for under-provisioned jobs.
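The first two findings translate directly into configuration. `spark.sql.autoBroadcastJoinThreshold` and `spark.sql.shuffle.partitions` are real Spark settings; the `broadcast_threshold` helper, its 5% fraction, and the 512MB cap are illustrative assumptions, not values from our benchmarks:

```python
# Hypothetical helper: size the broadcast threshold from executor memory
# instead of Spark's conservative 10MB (10485760-byte) default.
def broadcast_threshold(executor_memory_bytes: int, fraction: float = 0.05) -> int:
    # Take 5% of executor memory, capped at 512MB (both assumed numbers).
    return min(int(executor_memory_bytes * fraction), 512 * 1024 * 1024)


tuned_conf = {
    # Raise the broadcast cutoff for a cluster with 16GB executors.
    "spark.sql.autoBroadcastJoinThreshold": str(broadcast_threshold(16 * 1024**3)),
    # Halve the partition count, per the 35% shuffle-time finding above.
    "spark.sql.shuffle.partitions": "100",
}
```

These would be applied with `spark.conf.set(key, value)` before the job runs, so each benchmark configuration is just a different dict.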
The utility now runs as part of our CI pipeline. Every PR that modifies a PySpark job triggers a benchmark run against a sample dataset, and the metrics are compared against the baseline. Regressions in shuffle volume or execution time block the merge until reviewed.
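The baseline comparison itself can be very small. A sketch of the idea, with illustrative names and a 10% tolerance that is an assumption rather than our actual CI setting:

```python
# Flag any metric that grew by more than `tolerance` relative to baseline.
def regressions(baseline: dict, candidate: dict, tolerance: float = 0.10) -> list:
    return sorted(
        name for name, base in baseline.items()
        if base > 0 and (candidate.get(name, 0) - base) / base > tolerance
    )


baseline = {"shuffle_write_bytes": 2_000_000_000, "wall_clock_seconds": 600}
candidate = {"shuffle_write_bytes": 2_600_000_000, "wall_clock_seconds": 612}

print(regressions(baseline, candidate))  # → ['shuffle_write_bytes']
```

A non-empty result fails the CI step, which is what blocks the merge until a reviewer signs off on the regression.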