Cloudera

Cloudera Data Scientist - Virtual English

28 hours
2970,00 €
Live Virtual Class

Description

ABOUT THE TRAINING

Data scientists build information platforms that deliver deep insight and answer previously unimaginable questions. Spark and Hadoop are already changing the way data scientists work by allowing interactive and iterative data analysis at scale.

This four-day course covers enterprise data science and machine learning using Apache Spark in Cloudera Data Science Workbench (CDSW). Participants use Spark SQL to load, explore, cleanse, join, and analyze data and Spark MLlib to specify, train, evaluate, tune, and deploy machine learning pipelines. They dive into the foundations of the Spark architecture and execution model necessary to effectively configure, monitor, and tune their Spark applications. Participants also learn how Spark integrates with key components of the Cloudera platform such as HDFS, YARN, Hive, Impala, and Hue as well as their favourite Python packages.
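
To give a taste of that workflow, the sketch below loads data with Spark SQL and trains a simple MLlib model in PySpark. The file path and the column names (distance, fare) are illustrative assumptions, not actual course materials.

  from pyspark.sql import SparkSession
  from pyspark.ml.feature import VectorAssembler
  from pyspark.ml.regression import LinearRegression

  spark = SparkSession.builder.appName("taste").getOrCreate()

  # Load and explore a dataset with Spark SQL (hypothetical path and columns)
  rides = spark.read.csv("data/rides.csv", header=True, inferSchema=True)
  rides.printSchema()

  # Assemble a feature vector and train a simple MLlib regression model
  data = VectorAssembler(inputCols=["distance"], outputCol="features").transform(rides)
  model = LinearRegression(featuresCol="features", labelCol="fare").fit(data)
  print(model.intercept, model.coefficients)

  spark.stop()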

PUE, a Cloudera Strategic Partner, is authorized by Cloudera to deliver official training in Cloudera technologies.

PUE is also accredited and recognized to provide consulting and mentoring services for implementing Cloudera solutions in business environments, and it brings that practical, business-centred experience into the official courses as added value.

Audience and prerequisites

This course is designed for data scientists who use Python or R to work with small datasets on a single machine and who need to scale up their data science and machine learning workflows to large datasets on distributed clusters. Data engineers, data analysts, developers, and solution architects who collaborate with data scientists will also find this training valuable.

Participants should have a basic understanding of Python or R and some experience exploring and analyzing data and developing statistical or machine learning models. Knowledge of Spark, Hadoop, or the Cloudera platform is not required.

Objectives

Participants work through an end-to-end data science and machine learning workflow based on realistic scenarios and datasets from a fictitious technology company. The material is presented through a sequence of brief lectures, interactive demonstrations, extensive hands-on exercises, and lively discussions. The demonstrations and exercises are conducted in Python (with PySpark) using Cloudera Data Science Workbench (CDSW). Supplemental examples using R (with sparklyr) are provided.

Topics

Module 1. Introduction

Module 2. Data Science Overview

  • What Data Scientists Do
  • What Process Data Scientists Use
  • What Tools Data Scientists Use

Module 3. Cloudera Data Science Workbench (CDSW)

  • Introduction to Cloudera Data Science Workbench
  • How Cloudera Data Science Workbench Works
  • How to Use Cloudera Data Science Workbench
  • Entering Code
  • Getting Help
  • Accessing the Linux Command Line
  • Working with Python Packages
  • Formatting Session Output

Module 4. Case Study

  • DuoCar
  • How DuoCar Works
  • DuoCar Datasets
  • DuoCar Business Goals
  • DuoCar Data Science Platform
  • DuoCar Cloudera EDH Cluster
  • HDFS
  • Apache Spark
  • Apache Hive
  • Apache Impala
  • Hue
  • YARN
  • DuoCar Cluster Architecture

Module 5. Apache Spark

  • Apache Spark
  • How Spark Works
  • The Spark Stack
  • Spark SQL
  • DataFrames
  • File Formats in Apache Spark
  • Text File Formats
  • Parquet File Format
  • Spark Interface Languages
  • PySpark
  • Data Science with PySpark
  • sparklyr
  • dplyr and sparklyr
  • Comparison of PySpark and sparklyr
  • How sparklyr Works with dplyr
  • sparklyr DataFrame and MLlib Functions
  • When to Use PySpark and sparklyr

Module 6. Running a Spark Application from CDSW

  • Overview
  • Starting a Spark Application
  • Reading Data into a Spark SQL DataFrame
  • Examining the Schema of a DataFrame
  • Computing the Number of Rows and Columns of a DataFrame
  • Examining Rows of a DataFrame
  • Stopping a Spark Application
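
A minimal PySpark sketch of the steps in this module; the HDFS path and dataset are hypothetical stand-ins for the course materials.

  from pyspark.sql import SparkSession

  # Start a Spark application from a CDSW session
  spark = SparkSession.builder.appName("cdsw-example").getOrCreate()

  # Read data into a Spark SQL DataFrame (hypothetical path)
  riders = spark.read.csv("/duocar/raw/riders/", header=True, inferSchema=True)

  riders.printSchema()                        # examine the schema
  print(riders.count(), len(riders.columns))  # number of rows and columns
  riders.show(5)                              # examine a few rows

  spark.stop()                                # stop the Spark application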

Module 7. Inspecting a Spark SQL DataFrame

  • Overview
  • Inspecting a DataFrame
  • Inspecting a DataFrame Column
  • Inspecting a Primary Key Variable
  • Inspecting a Categorical Variable
  • Inspecting a Numerical Variable
  • Inspecting a Date and Time Variable
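
Typical inspection calls, continuing the sketch above; the column names (id, gender, age, birth_date) are assumptions:

  riders.describe("age").show()              # summary statistics for a numerical variable
  riders.groupBy("gender").count().show()    # level counts for a categorical variable
  riders.select("birth_date").show(5)        # peek at a date variable

  # A primary key should have as many distinct values as rows
  print(riders.select("id").distinct().count() == riders.count())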

Module 8. Transforming DataFrames

  • Spark SQL DataFrames
  • Working with Columns
  • Selecting Columns
  • Dropping Columns
  • Specifying Columns
  • Adding Columns
  • Changing the Column Name
  • Changing the Column Type
  • Working with Rows
  • Ordering Rows
  • Selecting a Fixed Number of Rows
  • Selecting Distinct Rows
  • Filtering Rows
  • Sampling Rows
  • Working with Missing Values
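
A sketch of the column and row operations above, assuming a rides DataFrame with hypothetical columns:

  from pyspark.sql import functions as F

  df = rides.select("driver_id", "distance", "duration")              # select columns
  df = df.drop("duration")                                            # drop a column
  df = df.withColumn("distance_km", F.col("distance") / 1000.0)       # add a column
  df = df.withColumnRenamed("distance", "distance_m")                 # rename a column
  df = df.withColumn("driver_id", F.col("driver_id").cast("string"))  # change a type

  df.orderBy("distance_km", ascending=False)   # order rows
  df.limit(10)                                 # select a fixed number of rows
  df.distinct()                                # select distinct rows
  df.filter(F.col("distance_km") > 5.0)        # filter rows
  df.sample(False, 0.1, 12345)                 # sample rows
  df.dropna(subset=["distance_km"])            # drop rows with missing values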

Module 9. Transforming DataFrame Columns

  • Spark SQL Data Types
  • Working with Numerical Columns
  • Working with String Columns
  • Working with Date and Timestamp Columns
  • Working with Boolean Columns
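
One illustrative transformation per data type, with assumed column names (fare, rider_name, date_time, distance); to_timestamp requires Spark 2.2 or later:

  from pyspark.sql import functions as F

  rides_t = (rides
      .withColumn("fare_rounded", F.round("fare", 0))                       # numerical
      .withColumn("name_clean", F.upper(F.trim("rider_name")))              # string
      .withColumn("ts", F.to_timestamp("date_time", "yyyy-MM-dd HH:mm"))    # timestamp
      .withColumn("hour", F.hour("ts"))                                     # timestamp part
      .withColumn("is_long", F.col("distance") > 10000))                    # boolean flag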

Module 10. Complex Types (optional)

  • Complex Collection Data Types
  • Arrays
  • Maps
  • Structs
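
A short sketch of arrays and structs (maps work analogously), self-contained apart from the spark session:

  from pyspark.sql import functions as F

  df = spark.createDataFrame([("a b c",)], ["s"])
  df = df.withColumn("words", F.split("s", " "))          # array column
  df.select(F.col("words")[0]).show()                     # index into an array
  df.select(F.explode("words").alias("word")).show()      # one row per array element
  df.select(F.struct("s", "words").alias("rec")) \
    .select("rec.s").show()                               # access a struct field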

Module 11. User-Defined Functions (optional)

  • User-Defined Functions
  • Defining a Python Function
  • Registering a Python Function as a User-Defined Function
  • Applying a User-Defined Function
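
A sketch of the three steps, assuming a riders DataFrame with a full_name column:

  from pyspark.sql.functions import udf
  from pyspark.sql.types import StringType

  # 1. Define an ordinary Python function
  def first_word(s):
      return None if s is None else s.split(" ")[0]

  # 2. Register it as a user-defined function with an explicit return type
  first_word_udf = udf(first_word, StringType())

  # 3. Apply it to a DataFrame column
  riders.withColumn("first_name", first_word_udf("full_name"))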

Module 12. Reading and Writing Data

  • Reading and Writing Data
  • Working with Delimited Text Files
  • Working with Text Files
  • Working with Parquet Files
  • Working with Hive Tables
  • Working with Object Stores
  • Working with pandas DataFrames
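
Representative read and write calls; the paths, database, and table names are assumptions. Object stores work the same way through URIs such as s3a://.

  # Delimited text in, Parquet out
  rides = spark.read.csv("/duocar/raw/rides/", sep=",", header=True, inferSchema=True)
  rides.write.mode("overwrite").parquet("/duocar/clean/rides/")

  # Hive tables
  spark.sql("SELECT * FROM duocar.rides LIMIT 10").show()
  rides.write.mode("overwrite").saveAsTable("duocar.rides_clean")

  # pandas interoperability (collects to the driver, so use on small data only)
  pdf = rides.limit(1000).toPandas()
  sdf = spark.createDataFrame(pdf)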

Module 13. Combining and Splitting DataFrames

  • Joining DataFrames
  • Cross Join
  • Inner Join
  • Left Semi Join
  • Left Anti Join
  • Left Outer Join
  • Right Outer Join
  • Full Outer Join
  • Applying Set Operations to DataFrames
  • Splitting a DataFrame
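
The join types and splits above in PySpark form, assuming rides and drivers DataFrames sharing a driver key (old_rides and new_rides are likewise assumed):

  # Joins on a common key
  rides.join(drivers, rides.driver_id == drivers.id, "inner")
  rides.join(drivers, rides.driver_id == drivers.id, "left_outer")
  rides.join(drivers, rides.driver_id == drivers.id, "left_semi")
  rides.join(drivers, rides.driver_id == drivers.id, "left_anti")
  rides.crossJoin(drivers)

  # Set operations require identically structured DataFrames
  all_rides = old_rides.union(new_rides)

  # Splitting a DataFrame, e.g. into train and test sets
  train, test = rides.randomSplit([0.7, 0.3], seed=12345)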

Module 14. Summarizing and Grouping DataFrames

  • Summarizing Data with Aggregate functions
  • Grouping Data
  • Pivoting Data
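
A sketch with assumed columns (service, payment_type, distance, fare):

  from pyspark.sql import functions as F

  # Aggregating over the whole DataFrame
  rides.agg(F.count("*").alias("n"), F.mean("distance"), F.max("fare")).show()

  # Grouping
  rides.groupBy("service").agg(F.mean("fare").alias("avg_fare")).show()

  # Pivoting: one column per value of a second grouping variable
  rides.groupBy("service").pivot("payment_type").count().show()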

Module 15. Window Functions (optional)

  • Introduction to Window Functions
  • Creating a Window Specification
  • Aggregating over a Window Specification
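
A sketch, assuming rides has driver_id, date, and fare columns:

  from pyspark.sql import functions as F
  from pyspark.sql.window import Window

  # Create a window specification: partition by driver, order by date
  w = Window.partitionBy("driver_id").orderBy("date")

  # Aggregate over the window, e.g. a running total and a row number
  rides.withColumn("running_fare", F.sum("fare").over(w)) \
       .withColumn("ride_number", F.row_number().over(w))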

Module 16. Exploring DataFrames

  • Possible Workflows for Big Data
  • Exploring a Single Variable
  • Exploring a Categorical Variable
  • Exploring a Continuous Variable
  • Exploring a Pair of Variables
  • Categorical-Categorical Pair
  • Categorical-Continuous Pair
  • Continuous-Continuous Pair

Module 17. Apache Spark Job Execution

  • DataFrame Operations
  • Input Splits
  • Narrow Operations
  • Wide Operations
  • Stages and Tasks
  • Shuffle
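
These concepts can be observed directly: a wide operation such as groupBy introduces an exchange (shuffle) and a stage boundary, which explain() makes visible in the physical plan. The rides DataFrame is assumed from earlier sketches.

  # The plan printed by explain() shows an Exchange operator at the shuffle boundary
  rides.groupBy("driver_id").count().explain()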

Module 18. Monitoring, Tuning, and Configuring Spark Applications

  • Monitoring Spark Applications
  • Persisting DataFrames
  • Partitioning DataFrames
  • Configuring the Spark Environment
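
Monitoring itself is done through the Spark web UI rather than code; the rest of the module condenses to a sketch like the following, where the configuration values and path are illustrative, not recommendations:

  from pyspark import StorageLevel
  from pyspark.sql import SparkSession

  # Configure the Spark environment when building the session
  spark = SparkSession.builder \
      .appName("tuned-app") \
      .config("spark.executor.memory", "2g") \
      .config("spark.sql.shuffle.partitions", "50") \
      .getOrCreate()

  rides = spark.read.parquet("/duocar/clean/rides/")

  rides.persist(StorageLevel.MEMORY_AND_DISK)   # persist a reused DataFrame
  print(rides.rdd.getNumPartitions())           # inspect the current partitioning
  rides = rides.repartition(16)                 # change the number of partitions
  rides.unpersist()                             # release the persisted data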

Module 19. Machine Learning Overview

  • Machine Learning
  • Underfitting and Overfitting
  • Model Validation
  • Hyperparameters
  • Supervised and Unsupervised Learning
  • Machine Learning Algorithms
  • Machine Learning Libraries
  • Apache Spark MLlib

Module 20. Training and Evaluating Regression Models

  • Introduction to Regression Models
  • Scenario
  • Preparing the Regression Data
  • Assembling the Feature Vector
  • Creating a Train and Test Set
  • Specifying a Linear Regression Model
  • Training a Linear Regression Model
  • Examining the Model Parameters
  • Examining Various Model Performance Measures
  • Examining Various Model Diagnostics
  • Applying the Linear Regression Model to the Test Data
  • Evaluating the Linear Regression Model on the Test Data
  • Plotting the Linear Regression Model
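
The module's steps condensed into a PySpark sketch, with assumed distance and fare columns:

  from pyspark.ml.feature import VectorAssembler
  from pyspark.ml.regression import LinearRegression
  from pyspark.ml.evaluation import RegressionEvaluator

  # Prepare the data and assemble the feature vector
  assembler = VectorAssembler(inputCols=["distance"], outputCol="features")
  data = assembler.transform(rides)

  # Create a train and test set
  train, test = data.randomSplit([0.7, 0.3], seed=12345)

  # Specify and train a linear regression model
  lr = LinearRegression(featuresCol="features", labelCol="fare")
  model = lr.fit(train)

  # Examine the model parameters and performance measures
  print(model.intercept, model.coefficients)
  print(model.summary.r2, model.summary.rootMeanSquaredError)

  # Apply the model to the test data and evaluate it
  predictions = model.transform(test)
  evaluator = RegressionEvaluator(labelCol="fare", metricName="r2")
  print(evaluator.evaluate(predictions))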

Module 21. Training and Evaluating Classification Models

  • Introduction to Classification Models
  • Scenario
  • Preprocessing the Modeling Data
  • Generating a Label
  • Extracting, Transforming, and Selecting Features
  • Creating Train and Test Sets
  • Specifying a Logistic Regression Model
  • Training the Logistic Regression Model
  • Examining the Logistic Regression Model
  • Evaluating Model Performance on the Test Set

Module 22. Tuning Algorithm Hyperparameters Using Grid Search

  • Requirements for Hyperparameter Tuning
  • Specifying the Estimator
  • Specifying the Hyperparameter Grid
  • Specifying the Evaluator
  • Tuning Hyperparameters Using Holdout Cross-Validation
  • Tuning Hyperparameters Using K-Fold Cross-Validation
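
A sketch of both tuning approaches around a single estimator; the grid values are illustrative, and train is assumed from the previous module:

  from pyspark.ml.regression import LinearRegression
  from pyspark.ml.evaluation import RegressionEvaluator
  from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator

  # Estimator, hyperparameter grid, and evaluator
  lr = LinearRegression(featuresCol="features", labelCol="fare")
  grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1, 1.0]).build()
  evaluator = RegressionEvaluator(labelCol="fare", metricName="rmse")

  # Holdout cross-validation
  tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid,
                             evaluator=evaluator, trainRatio=0.75)

  # K-fold cross-validation
  cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                      evaluator=evaluator, numFolds=3)

  best_model = cv.fit(train).bestModel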

Module 23. Training and Evaluating Clustering Models

  • Introduction to Clustering
  • Scenario
  • Preprocessing the Data
  • Extracting, Transforming, and Selecting Features
  • Specifying a Gaussian Mixture Model
  • Training a Gaussian Mixture Model
  • Examining the Gaussian Mixture Model
  • Plotting the Clusters
  • Exploring the Cluster Profiles
  • Saving and Loading the Gaussian Mixture Model
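
A sketch assuming a data DataFrame with an assembled features column; the save path is an assumption:

  from pyspark.ml.clustering import GaussianMixture, GaussianMixtureModel

  # Specify and train a Gaussian mixture model
  gmm = GaussianMixture(featuresCol="features", k=3, seed=12345)
  model = gmm.fit(data)

  # Examine the model and explore the cluster profiles
  print(model.weights)                 # mixture weights
  model.gaussiansDF.show()             # per-cluster means and covariances
  clustered = model.transform(data)    # adds a cluster prediction column

  # Save and reload the model
  model.write().overwrite().save("models/gmm")
  reloaded = GaussianMixtureModel.load("models/gmm")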

Module 24. Processing Text and Training and Evaluating Topic Models (optional)

  • Introduction to Topic Models
  • Scenario
  • Extracting and Transforming Features
  • Parsing Text Data
  • Removing Common (Stop) Words
  • Counting the Frequency of Words
  • Specifying a Topic Model
  • Training a Topic Model Using Latent Dirichlet Allocation (LDA)
  • Assessing the Topic Model Fit
  • Examining a Topic Model
  • Applying a Topic Model
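
A sketch of the text-processing chain feeding an LDA model; the reviews DataFrame and its review column are assumptions:

  from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
  from pyspark.ml.clustering import LDA

  # Parse the text into words, remove stop words, and count word frequencies
  words = RegexTokenizer(inputCol="review", outputCol="words",
                         pattern="\\W+").transform(reviews)
  filtered = StopWordsRemover(inputCol="words", outputCol="filtered").transform(words)
  counter = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=1000)
  counts = counter.fit(filtered).transform(filtered)

  # Specify, train, assess, examine, and apply a topic model
  lda = LDA(featuresCol="features", k=5, seed=12345)
  topic_model = lda.fit(counts)
  print(topic_model.logPerplexity(counts))   # model fit
  topic_model.describeTopics().show()        # top terms per topic
  topics = topic_model.transform(counts)     # per-document topic mixtures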

Module 25. Training and Evaluating Recommender Models (optional)

  • Introduction to Recommender Models
  • Scenario
  • Preparing Data for a Recommender Model
  • Specifying a Recommender Model
  • Training a Recommender Model Using Alternating Least Squares
  • Examining a Recommender Model
  • Applying a Recommender Model
  • Evaluating a Recommender Model
  • Generating Recommendations
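
A sketch assuming a ratings DataFrame with rider, driver, and rating columns, already split into train and test sets; recommendForAllUsers requires Spark 2.2 or later:

  from pyspark.ml.recommendation import ALS
  from pyspark.ml.evaluation import RegressionEvaluator

  # Specify and train a recommender using Alternating Least Squares;
  # coldStartStrategy="drop" removes unrateable test rows before evaluation
  als = ALS(userCol="rider_id", itemCol="driver_id", ratingCol="rating",
            coldStartStrategy="drop", seed=12345)
  model = als.fit(train)

  # Apply and evaluate the model
  predictions = model.transform(test)
  evaluator = RegressionEvaluator(labelCol="rating", metricName="rmse")
  print(evaluator.evaluate(predictions))

  # Generate top-5 recommendations for every user
  model.recommendForAllUsers(5).show()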

Module 26. Working with Machine Learning Pipelines

  • Specifying Pipeline Stages
  • Specifying a Pipeline
  • Training a Pipeline Model
  • Querying a Pipeline Model
  • Applying a Pipeline Model
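
A sketch that chains feature preparation and a classifier into one pipeline; the column names are assumptions:

  from pyspark.ml import Pipeline
  from pyspark.ml.feature import StringIndexer, VectorAssembler
  from pyspark.ml.classification import LogisticRegression

  # Specify the stages, then the pipeline
  indexer = StringIndexer(inputCol="service", outputCol="service_ix")
  assembler = VectorAssembler(inputCols=["service_ix", "distance"],
                              outputCol="features")
  classifier = LogisticRegression(featuresCol="features", labelCol="label")
  pipeline = Pipeline(stages=[indexer, assembler, classifier])

  # Train, query, and apply the pipeline model
  pipeline_model = pipeline.fit(train)
  print(pipeline_model.stages)                   # the fitted stages
  predictions = pipeline_model.transform(test)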

Module 27. Deploying Machine Learning Pipelines

  • Saving and Loading Pipelines and Pipeline Models in Python
  • Loading Pipelines and Pipeline Models in Scala
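
A sketch of saving and loading; the paths are assumptions. A pipeline model saved this way from Python can be loaded in Scala with org.apache.spark.ml.PipelineModel.load.

  from pyspark.ml import Pipeline, PipelineModel

  # Save the (unfitted) pipeline and the fitted pipeline model
  pipeline.write().overwrite().save("models/pipeline")
  pipeline_model.write().overwrite().save("models/pipeline_model")

  # Load them back in Python
  pipeline_reloaded = Pipeline.load("models/pipeline")
  model_reloaded = PipelineModel.load("models/pipeline_model")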

Module 28. Overview of sparklyr (optional)

  • Connecting to Spark
  • Reading Data
  • Inspecting Data
  • Transforming Data Using dplyr Verbs
  • Using SQL Queries
  • Spark DataFrames Functions
  • Visualizing Data from Spark
  • Machine Learning with MLlib

Module 29. Introduction to Additional CDSW Features (optional)

  • Collaboration
  • Jobs
  • Experiments
  • Models
  • Applications

Module 30. Conclusion
