Optimizing Apache Spark™ on Databricks

Description

In this course, you will explore the five key problems that represent the vast majority of performance issues in an Apache Spark application: skew, spill, shuffle, storage, and serialization. With examples based on 100 GB to 1+ TB datasets, you will investigate and diagnose sources of bottlenecks with the Spark UI and learn effective mitigation strategies. You will also discover new features introduced in Spark 3 that can automatically address common performance problems. Lastly, you will learn how to design and configure clusters for optimal performance based on specific team needs and concerns.

Duration

2 full days

Objectives

  • Articulate how the five most common performance problems in a Spark application can be mitigated to achieve better application performance
  • Summarize the most common performance problems associated with data ingestion and how to mitigate them
  • Articulate how new features in Spark 3.x can be employed to mitigate performance problems in your Spark applications
  • Configure a Spark cluster for maximum performance given specific job requirements

Prerequisites

  • Hands-on experience developing Apache Spark applications (6+ months). We recommend the Apache Spark Programming course to get started working with Spark.
  • Intermediate experience in Python or Scala

Outline

Day 1

  • Review of Spark architecture and Spark UI
  • Skew
  • Spill
  • Shuffle
  • Storage
  • Serialization

Day 2

  • Ingestion basics
  • Predicate pushdowns
  • Disk partitioning
  • Z-ordering
  • Bucketing
  • Optimization with Adaptive Query Execution (AQE)
  • Designing and configuring clusters for high performance

Location

Online – Virtual

Price

$2,000 USD