Benchmarking PySpark shuffle: what the metrics actually tell you
Building a benchmarking utility for shuffle and network transfer metrics in Databricks clusters.
Shuffle is where PySpark jobs go to die. A query that processes 500GB of data might shuffle 2TB across the cluster, turning a 10-minute job into a 45-minute crawl. I built a benchmarking utility to make shuffle costs visible before they become production incidents.
Why shuffle metrics matter
Every wide transformation — groupBy, join, repartition, window — triggers a shuffle (broadcast joins being the notable exception). Data is serialised, written to local disk, transferred across the network, deserialised, and reassembled. Each step has a cost, and those costs compound.
The problem is visibility. Spark’s UI shows shuffle read and write bytes, but not in a way that’s easy to compare across job versions or cluster configurations. You need structured metrics.
The benchmarking approach
```python
from dataclasses import dataclass
from time import perf_counter

from pyspark.sql import SparkSession


@dataclass
class ShuffleMetrics:
    job_name: str
    shuffle_read_bytes: int
    shuffle_write_bytes: int
    network_bytes_sent: int
    executor_run_time_ms: int
    gc_time_ms: int
    peak_memory_bytes: int
    wall_clock_seconds: float
```
The utility wraps a Spark action, captures metrics from the Spark listener API, and logs them to a structured format. Running the same job with different configurations — partition counts, join strategies, broadcast thresholds — produces comparable data points.
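A minimal sketch of that wrapper, reusing the `ShuffleMetrics` dataclass above. The `collect_metrics` hook is a hypothetical stand-in for however the metrics actually arrive — for example a SparkListener registered through py4j, or a scrape of the Spark monitoring REST API — since that wiring is deployment-specific:

```python
from dataclasses import dataclass
from time import perf_counter
from typing import Callable, Mapping


@dataclass
class ShuffleMetrics:
    job_name: str
    shuffle_read_bytes: int
    shuffle_write_bytes: int
    network_bytes_sent: int
    executor_run_time_ms: int
    gc_time_ms: int
    peak_memory_bytes: int
    wall_clock_seconds: float


def benchmark(job_name: str,
              action: Callable[[], object],
              collect_metrics: Callable[[], Mapping[str, int]]) -> ShuffleMetrics:
    """Run a Spark action, time it, and package the collected metrics.

    `collect_metrics` is a hypothetical hook that returns aggregated
    per-job totals gathered elsewhere (listener or REST scrape).
    """
    start = perf_counter()
    action()  # force execution, e.g. lambda: df.count()
    wall = perf_counter() - start
    m = collect_metrics()
    return ShuffleMetrics(
        job_name=job_name,
        shuffle_read_bytes=m.get("shuffle_read_bytes", 0),
        shuffle_write_bytes=m.get("shuffle_write_bytes", 0),
        network_bytes_sent=m.get("network_bytes_sent", 0),
        executor_run_time_ms=m.get("executor_run_time_ms", 0),
        gc_time_ms=m.get("gc_time_ms", 0),
        peak_memory_bytes=m.get("peak_memory_bytes", 0),
        wall_clock_seconds=wall,
    )
```

Because the action and the metric source are both injected, the same wrapper runs unchanged across cluster configurations, which is what makes the runs comparable.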
Key findings from production workloads
After benchmarking several Databricks production jobs:
- Partition count is the biggest lever. A job shuffling 1TB into 200 partitions produces 200 shuffle blocks per mapper; with 50 mappers, that’s 10,000 blocks to fetch per stage. Reducing to 100 partitions cut shuffle time by 35% on our workloads.
- Broadcast joins eliminate shuffle entirely for small tables, but “small” depends on executor memory. The default 10MB threshold is too conservative for modern clusters.
- GC time correlates with shuffle spill. When shuffle data exceeds available memory and spills to disk, GC pressure spikes. Monitoring GC time is an early warning for under-provisioned jobs.
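The first two findings translate directly into configuration. `spark.sql.autoBroadcastJoinThreshold` and `spark.sql.shuffle.partitions` are real Spark settings; the `broadcast_threshold` helper, its 5% fraction, and the 512MB cap are illustrative assumptions, not values from our benchmarks:

```python
# Hypothetical helper: size the broadcast threshold from executor memory
# instead of Spark's conservative 10MB (10485760-byte) default.
def broadcast_threshold(executor_memory_bytes: int, fraction: float = 0.05) -> int:
    # Take 5% of executor memory, capped at 512MB (both assumed numbers).
    return min(int(executor_memory_bytes * fraction), 512 * 1024 * 1024)


tuned_conf = {
    # Raise the broadcast cutoff for a cluster with 16GB executors.
    "spark.sql.autoBroadcastJoinThreshold": str(broadcast_threshold(16 * 1024**3)),
    # Halve the partition count, per the 35% shuffle-time finding above.
    "spark.sql.shuffle.partitions": "100",
}
```

These would be applied with `spark.conf.set(key, value)` before the job runs, so each benchmark configuration is just a different dict.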
The utility now runs as part of our CI pipeline. Every PR that modifies a PySpark job triggers a benchmark run against a sample dataset, and the metrics are compared against the baseline. Regressions in shuffle volume or execution time block the merge until reviewed.
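The baseline comparison itself can be very small. A sketch of the idea, with illustrative names and a 10% tolerance that is an assumption rather than our actual CI setting:

```python
# Flag any metric that grew by more than `tolerance` relative to baseline.
def regressions(baseline: dict, candidate: dict, tolerance: float = 0.10) -> list:
    return sorted(
        name for name, base in baseline.items()
        if base > 0 and (candidate.get(name, 0) - base) / base > tolerance
    )


baseline = {"shuffle_write_bytes": 2_000_000_000, "wall_clock_seconds": 600}
candidate = {"shuffle_write_bytes": 2_600_000_000, "wall_clock_seconds": 612}

print(regressions(baseline, candidate))  # → ['shuffle_write_bytes']
```

A non-empty result fails the CI step, which is what blocks the merge until a reviewer signs off on the regression.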