Cloudera Data Engineering: Developing Applications with Apache Spark - Virtual English

28 hours

2970,00 €

Live Virtual Class

There are currently no classes scheduled for this course. You can contact us to request private training on other dates and in other cities.

Description

This four-day hands-on training course delivers the key concepts and knowledge developers need to use Apache Spark to develop high-performance, parallel applications on the Cloudera Data Platform (CDP).

Hands-on exercises allow students to practice writing Spark applications that integrate with CDP core components, such as Hive and Kafka. Participants will learn how to use Spark SQL to query structured data, how to use Spark Streaming to perform real-time processing on streaming data, and how to work with “big data” stored in a distributed file system.

After taking this course, participants will be prepared to face real-world challenges and build applications that support faster decisions, better decisions, and interactive analysis across a wide variety of use cases, architectures, and industries.

PUE, a Cloudera Strategic Partner, is authorized by Cloudera to provide official training in Cloudera technologies.

PUE is also accredited and recognized to provide consulting and mentoring services for implementing Cloudera solutions in business settings. Its official courses carry the added value of that practical, business-oriented approach.

Audience and prerequisites

This course is designed for developers and data engineers who want to improve their development of high-performance parallel applications on Cloudera Data Platform (CDP) using Apache Spark.

Prerequisites

Basic experience with Linux and basic proficiency in a programming language such as Python or Scala
Basic knowledge of SQL will also be helpful
No prior knowledge of Spark and Hadoop is required

Objectives

Students who successfully complete this course will be able to:

Distribute, store, and process data in a CDP cluster
Write, configure, and deploy Apache Spark applications
Use the Spark interpreters and Spark applications to explore, process, and analyze distributed data
Query data using Spark SQL, DataFrames, and Hive tables
Use Spark Streaming together with Kafka to process a data stream

Topics

Module 1: Introduction to Zeppelin

Why Notebooks?
Zeppelin Notes
Demo: Apache Spark In 5 Minutes

Module 2: HDFS Introduction

HDFS Overview
HDFS Components and Interactions
Additional HDFS Interactions
Ozone Overview
Exercise: Working with HDFS

Module 3: YARN Introduction

YARN Overview
YARN Components and Interaction
Working with YARN
Exercise: Working with YARN

Module 4: Distributed Processing History

The Disk Years: 2000 -> 2010
The Memory Years: 2010 -> 2020
The GPU Years: 2020 ->

Module 5: Working with RDDs

Resilient Distributed Datasets (RDDs)
Exercise: Working with RDDs

Module 6: Working with DataFrames

Introduction to DataFrames
Exercise: Introducing DataFrames
Exercise: Reading and Writing DataFrames
Exercise: Working with Columns
Exercise: Working with Complex Types
Exercise: Combining and Splitting DataFrames
Exercise: Summarizing and Grouping DataFrames
Exercise: Working with UDFs
Exercise: Working with Windows

Module 7: Introduction to Apache Hive

About Hive

Module 8: Hive and Spark Integration

Hive and Spark Integration
Exercise: Spark Integration with Hive

Module 9: Data Visualization with Zeppelin

Introduction to Data Visualization with Zeppelin
Zeppelin Analytics
Zeppelin Collaboration
Exercise: AdventureWorks

Module 10: Distributed Processing Challenges

Shuffle
Skew
Order

Module 11: Spark Distributed Processing

Spark Distributed Processing
Exercise: Explore Query Execution Order

Module 12: Spark Distributed Persistence

DataFrame and Dataset Persistence
Persistence Storage Levels
Viewing Persisted RDDs
Exercise: Persisting DataFrames

Module 13: Writing, Configuring, and Running Spark Applications

Writing a Spark Application
Building and Running an Application
Application Deployment Mode
The Spark Application Web UI
Configuring Application Properties
Exercise: Writing, Configuring, and Running a Spark Application

Module 14: Introduction to Structured Streaming

Introduction to Structured Streaming
Exercise: Processing Streaming Data

Module 15: Message Processing with Apache Kafka

What is Apache Kafka?
Apache Kafka Overview
Scaling Apache Kafka
Apache Kafka Cluster Architecture
Apache Kafka Command Line Tools

Module 16: Structured Streaming with Apache Kafka

Receiving Kafka Messages
Sending Kafka Messages
Exercise: Working with Kafka Streaming Messages

Module 17: Aggregating and Joining Streaming DataFrames

Streaming Aggregation
Joining Streaming DataFrames
Exercise: Aggregating and Joining Streaming DataFrames

Appendix: Working with Datasets in Scala

Working with Datasets in Scala
Exercise: Using Datasets in Scala
