Cloudera

Cloudera Data Scientist - Virtual English

28 hours
2695 €
Live Virtual Class

06 Sep 2021 - 09 Sep 2021

Cloudera Data Scientist - Virtual English

28 h | 2695 € | Live Virtual Class | English
from Monday to Thursday (09:00h - 17:00h)
Session calendar

13 Dec 2021 - 16 Dec 2021

Cloudera Data Scientist - Virtual English

28 h | 2695 € | Live Virtual Class | English
from Monday to Thursday (09:00h - 17:00h)
Session calendar

Description

Data scientists build information platforms to deliver deep insight and to answer previously unimaginable questions. Spark and Hadoop are already changing how data scientists work by allowing interactive and iterative data analysis at scale.

This four-day course covers enterprise data science and machine learning using Apache Spark in Cloudera Data Science Workbench (CDSW). Participants use Spark SQL to load, explore, cleanse, join, and analyze data and Spark MLlib to specify, train, evaluate, tune, and deploy machine learning pipelines. They dive into the foundations of the Spark architecture and execution model necessary to effectively configure, monitor, and tune their Spark applications. Participants also learn how Spark integrates with key components of the Cloudera platform such as HDFS, YARN, Hive, Impala, and Hue as well as their favourite Python packages.

PUE is an Official Training Partner of Cloudera, authorized by this multinational company to deliver official training in Cloudera technologies.

PUE is accredited and recognized to provide consulting and mentoring services on the implementation of Cloudera solutions in business environments, and brings to its official courses the added value of a practical, business-centred approach to the knowledge it transfers.

Audience and prerequisites

This course is designed for data scientists who use Python or R to work with small datasets on a single machine and who need to scale their data science and machine learning workflows to large datasets on distributed clusters.

Data engineers, data analysts, developers, and solution architects who collaborate with data scientists will also find this training valuable.

Prerequisites

Participants should have a basic understanding of Python or R and some experience exploring and analyzing data and developing statistical or machine learning models. No knowledge of Spark, Hadoop, or the Cloudera platform is required.

Objectives

Participants work through an end-to-end data science and machine learning workflow based on realistic scenarios and datasets from a fictitious technology company. The material is presented through a sequence of brief lectures, interactive demonstrations, extensive hands-on exercises, and lively discussions. The demonstrations and exercises are conducted in Python (with PySpark) using Cloudera Data Science Workbench (CDSW). Supplemental examples using R (with sparklyr) are provided.

Topics

Module 1. Introduction

Module 2. Data Science Overview

  • What Data Scientists Do
  • What Process Data Scientists Use
  • What Tools Data Scientists Use

Module 3. Cloudera Data Science Workbench (CDSW)

  • Introduction to Cloudera Data Science Workbench
  • How Cloudera Data Science Workbench Works
  • How to Use Cloudera Data Science Workbench
  • Entering Code
  • Getting Help
  • Accessing the Linux Command Line
  • Working with Python Packages
  • Formatting Session Output

Module 4. Case Study

  • DuoCar
  • How DuoCar Works
  • DuoCar Datasets
  • DuoCar Business Goals
  • DuoCar Data Science Platform
  • DuoCar Cloudera EDH Cluster
  • HDFS
  • Apache Spark
  • Apache Hive
  • Apache Impala
  • Hue
  • YARN
  • DuoCar Cluster Architecture

Module 5. Apache Spark

  • Apache Spark
  • How Spark Works
  • The Spark Stack
  • Spark SQL
  • DataFrames
  • File Formats in Apache Spark
  • Text File Formats
  • Parquet File Format
  • Spark Interface Languages
  • PySpark
  • Data Science with PySpark
  • sparklyr
  • dplyr and sparklyr
  • Comparison of PySpark and sparklyr
  • How sparklyr Works with dplyr
  • sparklyr DataFrame and MLlib Functions
  • When to Use PySpark and sparklyr

Module 6. Running a Spark Application from CDSW

  • Overview
  • Starting a Spark Application
  • Reading Data into A Spark SQL DataFrame
  • Examining the Schema of a DataFrame
  • Computing the Number of Rows and Columns of a DataFrame
  • Examining Rows of a DataFrame
  • Stopping a Spark Application

Module 7. Inspecting a Spark SQL DataFrame

  • Overview
  • Inspecting a DataFrame
  • Inspecting a DataFrame Column
  • Inspecting a Primary Key Variable
  • Inspecting a Categorical Variable
  • Inspecting a Numerical Variable
  • Inspecting a Date and Time Variable

Module 8. Transforming DataFrames

  • Spark SQL DataFrames
  • Working with Columns
  • Selecting Columns
  • Dropping Columns
  • Specifying Columns
  • Adding Columns
  • Changing the Column Name
  • Changing the Column Type
  • Working with Rows
  • Ordering Rows
  • Selecting a Fixed Number of Rows
  • Selecting Distinct Rows
  • Filtering Rows
  • Sampling Rows
  • Working with Missing Values

Module 9. Transforming DataFrame Columns

  • Spark SQL Data Types
  • Working with Numerical Columns
  • Working with String Columns
  • Working with Date and Timestamp Columns
  • Working with Boolean Columns

Module 10. Complex Types (optional)

  • Complex Collection Data Types
  • Arrays
  • Maps
  • Structs

Module 11. User-Defined Functions (optional)

  • User-Defined Functions
  • Defining a Python Function
  • Registering a Python Function as a User-Defined Function
  • Applying a User-Defined Function

Module 12. Reading and Writing Data

  • Reading and Writing Data
  • Working with Delimited Text Files
  • Working with Text Files
  • Working with Parquet Files
  • Working with Hive Tables
  • Working with Object Stores
  • Working with pandas DataFrames

Module 13. Combining and Splitting DataFrames

  • Joining DataFrames
  • Cross Join
  • Inner Join
  • Left Semi Join
  • Left Anti Join
  • Left Outer Join
  • Right Outer Join
  • Full Outer Join
  • Applying Set Operations to DataFrames
  • Splitting a DataFrame

Module 14. Summarizing and Grouping DataFrames

  • Summarizing Data with Aggregate functions
  • Grouping Data
  • Pivoting Data

Module 15. Window Functions (optional)

  • Introduction to Window Functions
  • Creating a Window Specification
  • Aggregating over a Window Specification

Module 16. Exploring DataFrames

  • Possible Workflows for Big Data
  • Exploring a Single Variable
  • Exploring a Categorical Variable
  • Exploring a Continuous Variable
  • Exploring a Pair of Variables
  • Categorical-Categorical Pair
  • Categorical-Continuous Pair
  • Continuous-Continuous Pair

Module 17. Apache Spark Job Execution

  • DataFrame Operations
  • Input Splits
  • Narrow Operations
  • Wide Operations
  • Stages and Tasks
  • Shuffle

Module 18. Monitoring, Tuning, and Configuring Spark Applications

  • Monitoring Spark Applications
  • Persisting DataFrames
  • Partitioning DataFrames
  • Configuring the Spark Environment

Module 19. Machine Learning Overview

  • Machine Learning
  • Underfitting and Overfitting
  • Model Validation
  • Hyperparameters
  • Supervised and Unsupervised Learning
  • Machine Learning Algorithms
  • Machine Learning Libraries
  • Apache Spark MLlib

Module 20. Training and Evaluating Regression Models

  • Introduction to Regression Models
  • Scenario
  • Preparing the Regression Data
  • Assembling the Feature Vector
  • Creating a Train and Test Set
  • Specifying a Linear Regression Model
  • Training a Linear Regression Model
  • Examining the Model Parameters
  • Examining Various Model Performance Measures
  • Examining Various Model Diagnostics
  • Applying the Linear Regression Model to the Test Data
  • Evaluating the Linear Regression Model on the Test Data
  • Plotting the Linear Regression Model

Module 21. Training and Evaluating Classification Models

  • Introduction to Classification Models
  • Scenario
  • Preprocessing the Modeling Data
  • Generate a Label
  • Extract, Transform, And Select Features
  • Create Train and Test Sets
  • Specify A Logistic Regression Model
  • Train the Logistic Regression Model
  • Examine the Logistic Regression Model
  • Evaluate Model Performance on the Test Set

Module 22. Tuning Algorithm Hyperparameters Using Grid Search

  • Requirements for Hyperparameter Tuning
  • Specifying the Estimator
  • Specifying the Hyperparameter Grid
  • Specifying the Evaluator
  • Tuning Hyperparameters using Holdout Cross-validation
  • Tuning Hyperparameters using K-fold Cross-validation

Module 23. Training and Evaluating Clustering Models

  • Introduction to Clustering
  • Scenario
  • Preprocessing the Data
  • Extracting, Transforming, and Selecting Features
  • Specifying a Gaussian Mixture Model
  • Training a Gaussian Mixture Model
  • Examining the Gaussian Mixture Model
  • Plotting the Clusters
  • Exploring the Cluster Profiles
  • Saving and Loading the Gaussian Mixture Model

Module 24. Processing Text and Training and Evaluating Topic Models (optional)

  • Introduction to Topic Models
  • Scenario
  • Extracting and Transforming Features
  • Parsing Text Data
  • Removing Common (Stop) Words
  • Counting the Frequency of Words
  • Specifying a Topic Model
  • Training a topic model using Latent Dirichlet Allocation (LDA)
  • Assessing the Topic Model Fit
  • Examining a Topic Model
  • Applying a Topic Model

Module 25. Training and Evaluating Recommender Models (optional)

  • Introduction to Recommender Models
  • Scenario
  • Preparing Data for a Recommender Model
  • Specifying a Recommender Model
  • Training a Recommender Model using Alternating Least Squares
  • Examining a Recommender Model
  • Applying a Recommender Model
  • Evaluating a Recommender Model
  • Generating Recommendations

Module 26. Working with Machine Learning Pipelines

  • Specifying Pipeline Stages
  • Specifying a Pipeline
  • Training a Pipeline Model
  • Querying a Pipeline Model
  • Applying a Pipeline Model

Module 27. Deploying Machine Learning Pipelines

  • Saving and Loading Pipelines and Pipeline Models in Python
  • Loading Pipelines and Pipeline Models in Scala

Module 28. Overview of sparklyr (optional)

  • Connecting to Spark
  • Reading Data
  • Inspecting Data
  • Transforming Data Using dplyr Verbs
  • Using SQL Queries
  • Spark DataFrames Functions
  • Visualizing Data from Spark
  • Machine Learning with MLlib

Module 29. Introduction to Additional CDSW Features (optional)

  • Collaboration
  • Jobs
  • Experiments
  • Models
  • Applications

Module 30. Conclusion
