Cloudera Data Scientist

28 hours
1995 €
Classroom or Live Virtual Class



Data scientists build information platforms that provide deep insight and answer previously unimaginable questions. Spark and Hadoop are already changing the way data scientists work by allowing interactive and iterative data analysis at scale.

Learn how Spark and Hadoop enable data scientists to help businesses reduce costs, increase profits, improve products, retain customers, and identify new opportunities.

This course helps students understand how data scientists work, the problems they solve, and the tools and techniques they use. Through simulations, participants apply data science methods to real-world challenges in different industries and, ultimately, prepare to take on data scientist roles.

PUE is an Official Training Partner of Cloudera, authorized by this multinational company to deliver official training in Cloudera technologies.

PUE is accredited and recognized to provide consulting and mentoring services for implementing Cloudera solutions in business environments, with the added value of a practical, business-centred focus on the knowledge that is transferred to the official courses.

Audience and prerequisites

This training course is aimed at data scientists who currently use Python or R to work with smaller datasets on a single machine and who need to scale up their analyses and machine learning models to large datasets on distributed clusters. Data engineers and developers with some knowledge of data science and machine learning may also find this workshop useful.

Students should have a basic understanding of Python or R and some experience exploring and analyzing data and developing statistical or machine learning models. Knowledge of Hadoop or Spark is not required.


By the end of the training, students will have gained skills in the following areas:

  • Overview of data science and machine learning at scale
  • Overview of the Hadoop ecosystem
  • Working with HDFS data and Hive tables using Hue
  • Introduction to Cloudera Data Science Workbench
  • Overview of Apache Spark 2
  • Reading and writing data
  • Inspecting data quality
  • Cleansing and transforming data
  • Summarizing and grouping data
  • Combining, splitting, and reshaping data
  • Exploring data
  • Configuring, monitoring, and troubleshooting Spark applications
  • Overview of machine learning in Spark MLlib
  • Extracting, transforming, and selecting features
  • Building and evaluating regression models
  • Building and evaluating classification models
  • Building and evaluating clustering models
  • Cross-validating models and tuning hyperparameters
  • Building machine learning pipelines
  • Deploying machine learning models



Data Science Overview

  • What data scientists do
  • What process they use
  • What tools they use

Cloudera Data Science Workbench

  • Introduction to Cloudera Data Science Workbench
  • How to Use Cloudera Data Science Workbench
  • Demonstration and Exercises: Using Cloudera Data Science Workbench

Case Study

  • Case Scenario Explanation
  • Case Data Science Platform
  • Demonstration and Exercises: Using Hue

Apache Spark

  • How Spark Works
  • The Spark Stack
  • File Formats in Spark
  • Spark Interface Languages
  • Introduction to PySpark
  • Demonstration and Exercises: Connecting to Spark using PySpark
  • Introduction to sparklyr
  • Demonstration and Exercises: Connecting to Spark using sparklyr
  • When to Use PySpark and sparklyr

Lectures, Demonstrations, and Exercises using CDSW

Apache Spark Job Execution

  • How DataFrame Operations Become Spark Jobs
  • How Spark Executes a Job

Apache Spark MLlib

  • Introduction to Apache Spark MLlib
  • Demonstrations and Exercises: Using MLlib

