Databricks · Python · Infrastructure

Benchmarking PySpark shuffle: what the metrics actually tell you

Building a benchmarking utility for shuffle and network transfer metrics in Databricks clusters.

20 February 2026 · 7 min read

Shuffle is where PySpark jobs go to die. A query that processes 500GB of data might shuffle 2TB across the cluster, turning a 10-minute job into a 45-minute crawl. I built a benchmarking utility to make shuffle costs visible before they become production incidents.

Why shuffle metrics matter

Every wide transformation — every groupBy, non-broadcast join, repartition, and window operation — triggers a shuffle. Data is serialised, written to local disk, transferred across the network, deserialised, and reassembled. Each step has a cost, and those costs compound.
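What forces the network transfer is partition assignment: every row with the same key must end up in the same target partition, wherever it started. A minimal illustration of the idea (Spark uses its own Murmur3-based hash, not Python's built-in `hash`; this is just the modulo principle):

```python
def target_partition(key, num_partitions: int) -> int:
    """Illustrative hash partitioning: rows sharing a key always map
    to the same target partition, so every mapper that holds such a
    row must ship it to that partition's reducer over the network."""
    return hash(key) % num_partitions

# Every row keyed "user_42" lands in the same partition,
# regardless of which mapper produced it.
p1 = target_partition("user_42", 200)
p2 = target_partition("user_42", 200)
assert p1 == p2 and 0 <= p1 < 200
```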

The problem is visibility. Spark’s UI shows shuffle read and write bytes, but not in a way that’s easy to compare across job versions or cluster configurations. You need structured metrics.

The benchmarking approach

from dataclasses import dataclass

@dataclass
class ShuffleMetrics:
    """Metrics captured for one benchmarked Spark action."""
    job_name: str
    shuffle_read_bytes: int
    shuffle_write_bytes: int
    network_bytes_sent: int
    executor_run_time_ms: int
    gc_time_ms: int
    peak_memory_bytes: int
    wall_clock_seconds: float

The utility wraps a Spark action, captures metrics from the Spark listener API, and logs them to a structured format. Running the same job with different configurations — partition counts, join strategies, broadcast thresholds — produces comparable data points.
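One way to sketch that wrapper without a custom JVM listener is to poll the Spark UI's REST API after the action completes. The endpoint shape, the default port 4040, and the stage fields `shuffleReadBytes`, `shuffleWriteBytes`, and `executorRunTime` come from Spark's monitoring API; the `app_id` plumbing and the split into fetch/aggregate steps are illustrative choices, not the actual utility:

```python
import json
import time
import urllib.request

def fetch_stage_metrics(app_id: str, ui_url: str = "http://localhost:4040") -> list:
    """Pull per-stage metrics from the Spark UI's monitoring REST API."""
    url = f"{ui_url}/api/v1/applications/{app_id}/stages"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def aggregate_shuffle(stages: list) -> dict:
    """Sum shuffle and runtime metrics across all stages of a job."""
    return {
        "shuffle_read_bytes": sum(s.get("shuffleReadBytes", 0) for s in stages),
        "shuffle_write_bytes": sum(s.get("shuffleWriteBytes", 0) for s in stages),
        "executor_run_time_ms": sum(s.get("executorRunTime", 0) for s in stages),
    }

def benchmark(job_name: str, action, app_id: str) -> dict:
    """Time a Spark action and attach the aggregated shuffle totals."""
    start = time.perf_counter()
    action()  # e.g. lambda: df.groupBy("key").count().collect()
    wall = time.perf_counter() - start
    totals = aggregate_shuffle(fetch_stage_metrics(app_id))
    return {"job_name": job_name, "wall_clock_seconds": wall, **totals}
```

Running `benchmark` for each configuration variant yields dicts that serialise straight to the structured log, so runs are directly diffable.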

Key findings from production workloads

After benchmarking several Databricks production jobs:

  • Partition count is the biggest lever. A job with 200 shuffle partitions produces 200 shuffle blocks per mapper output. With 50 mappers, that’s 10,000 blocks for reducers to fetch per stage. Reducing to 100 partitions cut shuffle time by 35% on our workloads.
  • Broadcast joins eliminate shuffle entirely for small tables, but “small” depends on executor memory. The default 10MB threshold is too conservative for modern clusters.
  • GC time correlates with shuffle spill. When shuffle data exceeds available memory and spills to disk, GC pressure spikes, so GC time serves as an early warning for under-provisioned jobs.
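The block arithmetic in the first bullet generalises: each reducer fetches one block from every mapper, so fetch count grows multiplicatively. A back-of-the-envelope helper makes the lever visible:

```python
def shuffle_block_count(num_mappers: int, num_partitions: int) -> int:
    """Each of the num_partitions reducers fetches one block from
    each of the num_mappers map outputs, so fetches scale as the
    product of the two. Halving partitions halves the fetch count."""
    return num_mappers * num_partitions

assert shuffle_block_count(50, 200) == 10_000  # the case from the first bullet
assert shuffle_block_count(50, 100) == 5_000   # after reducing partitions
```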

The utility now runs as part of our CI pipeline. Every PR that modifies a PySpark job triggers a benchmark run against a sample dataset, and the metrics are compared against the baseline. Regressions in shuffle volume or execution time block the merge until reviewed.
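The comparison step itself can be a simple relative-threshold check. This sketch assumes metrics arrive as flat dicts and picks an illustrative 10% tolerance; the real pipeline's thresholds and metric names may differ:

```python
def find_regressions(baseline: dict, candidate: dict,
                     tolerance: float = 0.10) -> list:
    """Flag any metric that grew more than `tolerance` over baseline.
    Returns (metric, baseline_value, candidate_value) tuples;
    an empty list means the PR is clear to merge."""
    regressions = []
    for metric, base_value in baseline.items():
        cand_value = candidate.get(metric, base_value)
        if base_value > 0 and (cand_value - base_value) / base_value > tolerance:
            regressions.append((metric, base_value, cand_value))
    return regressions

base = {"shuffle_write_bytes": 1_000, "wall_clock_seconds": 60}
# Within tolerance: nothing flagged.
assert find_regressions(base, {"shuffle_write_bytes": 1_050,
                               "wall_clock_seconds": 60}) == []
# 50% growth in shuffle volume: flagged, merge blocked for review.
assert find_regressions(base, {"shuffle_write_bytes": 1_500,
                               "wall_clock_seconds": 60}) == [
    ("shuffle_write_bytes", 1_000, 1_500)]
```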