Cloudera Data Engineering: Developing Applications with Apache Spark

28 hours
€1,840.00
Classroom or Live Virtual Class


This four-day hands-on training course delivers the key concepts and knowledge developers need to use Apache Spark to develop high-performance, parallel applications on the Cloudera Data Platform (CDP).

Hands-on exercises allow students to practice writing Spark applications that integrate with CDP core components, such as Hive and Kafka. Participants will learn how to use Spark SQL to query structured data, how to use Spark Streaming to perform real-time processing on streaming data, and how to work with “big data” stored in a distributed file system.

After taking this course, participants will be prepared to face real-world challenges and build applications that enable faster and better decisions and interactive analysis, applied to a wide variety of use cases, architectures, and industries.

PUE, a Cloudera Strategic Partner, is authorized by Cloudera to deliver official training in Cloudera technologies.

PUE is also accredited to provide consulting and mentoring services for the implementation of Cloudera solutions in the enterprise, bringing to its official courses the added value of a practical, business-oriented approach.

Audience and prerequisites

This course is designed for developers and data engineers who want to develop high-performance parallel applications on the Cloudera Data Platform (CDP) using Apache Spark.


  • Basic Linux experience and a basic command of a programming language such as Python or Scala
  • Basic knowledge of SQL will also be helpful
  • No prior knowledge of Spark or Hadoop is required


Students who successfully complete this course will be able to:

  • Distribute, store, and process data in a CDP cluster
  • Write, configure, and deploy Apache Spark applications
  • Use the Spark interpreters and Spark applications to explore, process, and analyze distributed data
  • Query data using Spark SQL, DataFrames, and Hive tables
  • Use Spark Streaming together with Kafka to process a data stream


Module 1: Introduction to Zeppelin

  • Why Notebooks?
  • Zeppelin Notes
  • Demo: Apache Spark In 5 Minutes

Module 2: HDFS Introduction

  • HDFS Overview
  • HDFS Components and Interactions
  • Additional HDFS Interactions
  • Ozone Overview
  • Exercise: Working with HDFS

Module 3: YARN Introduction

  • YARN Overview
  • YARN Components and Interaction
  • Working with YARN
  • Exercise: Working with YARN

Module 4: Distributed Processing History

  • The Disk Years: 2000–2010
  • The Memory Years: 2010–2020
  • The GPU Years: 2020–present

Module 5: Working with RDDs

  • Resilient Distributed Datasets (RDDs)
  • Exercise: Working with RDDs

Module 6: Working with DataFrames

  • Introduction to DataFrames
  • Exercise: Introducing DataFrames
  • Exercise: Reading and Writing DataFrames
  • Exercise: Working with Columns
  • Exercise: Working with Complex Types
  • Exercise: Combining and Splitting DataFrames
  • Exercise: Summarizing and Grouping DataFrames
  • Exercise: Working with UDFs
  • Exercise: Working with Windows

Module 7: Introduction to Apache Hive

  • About Hive

Module 8: Hive and Spark Integration

  • Hive and Spark Integration
  • Exercise: Spark Integration with Hive

Module 9: Data Visualization with Zeppelin

  • Introduction to Data Visualization with Zeppelin
  • Zeppelin Analytics
  • Zeppelin Collaboration
  • Exercise: AdventureWorks

Module 10: Distributed Processing Challenges

  • Shuffle
  • Skew
  • Order

Module 11: Spark Distributed Processing

  • Spark Distributed Processing
  • Exercise: Explore Query Execution Order

Module 12: Spark Distributed Persistence

  • DataFrame and Dataset Persistence
  • Persistence Storage Levels
  • Viewing Persisted RDDs
  • Exercise: Persisting DataFrames

Module 13: Writing, Configuring, and Running Spark Applications

  • Writing a Spark Application
  • Building and Running an Application
  • Application Deployment Mode
  • The Spark Application Web UI
  • Configuring Application Properties
  • Exercise: Writing, Configuring, and Running a Spark Application

Module 14: Introduction to Structured Streaming

  • Introduction to Structured Streaming
  • Exercise: Processing Streaming Data

Module 15: Message Processing with Apache Kafka

  • What is Apache Kafka?
  • Apache Kafka Overview
  • Scaling Apache Kafka
  • Apache Kafka Cluster Architecture
  • Apache Kafka Command Line Tools

Module 16: Structured Streaming with Apache Kafka

  • Receiving Kafka Messages
  • Sending Kafka Messages
  • Exercise: Working with Kafka Streaming Messages

Module 17: Aggregating and Joining Streaming DataFrames

  • Streaming Aggregation
  • Joining Streaming DataFrames
  • Exercise: Aggregating and Joining Streaming DataFrames

Appendix: Working with Datasets in Scala

  • Working with Datasets in Scala
  • Exercise: Using Datasets in Scala
