Optimizing Apache Spark™ on Databricks
Description
In this course, you will explore the five key problems that represent the vast majority of performance issues in an Apache Spark application: skew, spill, shuffle, storage, and serialization. With examples based on 100 GB to 1+ TB datasets, you will investigate and diagnose sources of bottlenecks with the Spark UI and learn effective mitigation strategies. You will also discover new features introduced in Spark 3 that can automatically address common performance problems. Lastly, you will learn how to design and configure clusters for optimal performance based on specific team needs and concerns.
Duration
2 full days
Objectives
- Articulate how the five most common performance problems in a Spark application can be mitigated to achieve better application performance
- Summarize the most common performance problems associated with data ingestion and how to mitigate them
- Articulate how new features in Spark 3.x can be employed to mitigate performance problems in your Spark applications
- Configure a Spark cluster for maximum performance given specific job requirements
Prerequisites
- Hands-on experience developing Apache Spark applications (6+ months). We recommend the Apache Spark Programming course to get started working with Spark.
- Intermediate experience in Python or Scala
Outline
Day 1
- Review of Spark architecture and Spark UI
- Skew
- Spill
- Shuffle
- Storage
- Serialization
Day 2
- Ingestion basics
- Predicate pushdown
- Disk partitioning
- Z-ordering
- Bucketing
- Optimization with Adaptive Query Execution (AQE)
- Designing and configuring clusters for high performance
Location
Online – Virtual
Price
$2,000 USD