Data scientists build information platforms that deliver deep insight and answer previously unimaginable questions. Spark and Hadoop are already changing how data scientists work by allowing interactive and iterative data analysis at scale.
Learn how Spark and Hadoop enable data scientists to help businesses reduce costs, increase profits, improve products, retain customers, and identify new opportunities.
This course helps students understand what data scientists do, the problems they solve, and the tools and techniques they use. Through in-class simulations, participants apply data science methods to real-world challenges in different industries and, finally, prepare to take on data scientist roles.
PUE is an official Cloudera Training Partner, authorized by the company to deliver official training in Cloudera technologies.
PUE is also accredited and recognized to provide consulting and mentoring services for implementing Cloudera solutions in business environments, adding a practical, business-centred focus to the knowledge transferred in the official courses.
Audience and prerequisites
The course is aimed at data engineers and developers with some knowledge of data science and machine learning, as well as data scientists who currently use Python or R to work with smaller datasets on a single machine and who need to scale their analyses and machine learning models to large datasets on distributed clusters.
Students should have a basic understanding of Python or R and some experience exploring and analyzing data and developing statistical or machine learning models. Knowledge of Hadoop or Spark is not required.
By the end of the training, students will have learned:
- How to use Apache Spark 2 to run data science and machine learning workflows at scale
- How to use Spark SQL and DataFrames to work with structured data
- How to use MLlib, Spark’s machine learning library
- How to use PySpark, Spark’s Python API
- How to use sparklyr, a dplyr-compatible R interface to Spark
- How to use Cloudera Data Science Workbench (CDSW)
- How to use other Hadoop ecosystem components including HDFS, Hive, Impala, and Hue
Data Science Overview
- What data scientists do
- What process they use
- What tools they use
Cloudera Data Science Workbench
- Introduction to Cloudera Data Science Workbench
- How to Use Cloudera Data Science Workbench
- Demonstration and Exercises: Using Cloudera Data Science Workbench
- Case Scenario Explanation
- Case Data Science Platform
- Demonstration and Exercises: Using Hue
Apache Spark
- How Apache Spark works and what capabilities it offers
- Which popular file formats Spark can use for data storage
- Which programming languages you can use to work with Spark
- How to get started using PySpark and Sparklyr
- How PySpark and Sparklyr compare
Machine Learning Overview
- What machine learning is
- Some important terms and concepts in machine learning
- Different types of machine learning algorithms
- Which libraries are used for machine learning
Apache Spark MLlib
- Which machine learning capabilities MLlib provides
- How to build, validate, and use machine learning models with MLlib
Apache Spark Job Execution
- How a Spark job is made up of a sequence of transformations followed by an action
- How Spark uses lazy execution
- How Spark splits input data into partitions
- How Spark executes narrow and wide operations
- How Spark executes a job in tasks and stages