Optimizing Apache Spark™ on Databricks
In this course, you will explore the five key problems that represent the vast majority of performance issues in an Apache Spark application: skew, spill, shuffle, storage, and serialization. With examples based on 100 GB to 1+ TB datasets, you will investigate and diagnose sources of bottlenecks with the Spark UI and learn effective mitigation strategies. You will also discover new features introduced in Spark 3 that can automatically address common performance problems. Lastly, you will learn how to design and configure clusters for optimal performance based on specific team needs and concerns.
2 full days
- Articulate how to mitigate the five most common performance problems in a Spark application
- Summarize the most common performance problems associated with data ingestion and how to mitigate them
- Articulate how new features in Spark 3.x can be employed to mitigate performance problems in your Spark applications
- Configure a Spark cluster for maximum performance given specific job requirements
- Hands-on experience developing Apache Spark applications (6+ months). We recommend the Apache Spark Programming course to get started working with Spark.
- Intermediate experience in Python or Scala
- Review of Spark architecture and Spark UI
- Ingestion basics
- Predicate pushdowns
- Disk partitioning
- Optimization with Adaptive Query Execution (AQE)
- Designing and configuring clusters for high performance
Online – Virtual