Databricks · Python
Benchmarking PySpark shuffle: what the metrics actually tell you
Building a benchmarking utility for shuffle and network transfer metrics in Databricks clusters.
20 Feb 2026 · 7 min read
3 posts
Building a benchmarking utility for shuffle and network transfer metrics in Databricks clusters.
How we built an end-to-end pipeline using Spark and H2O Sparkling Water to get machine learning models from source control into production with minimal friction.
A concise overview of why Apache Spark was created, how RDDs enable in-memory processing for iterative and interactive workloads, and its key programming abstractions.