Cloud

Succeed in your cloud computing migration.

Infrastructure Modernization
Reliable and profitable

Reduce costs by optimizing resources: choose the storage type that best fits your data and budget, and scale up resources whenever your business needs them.

We minimize the impact of cloud migration so you can get all the benefits of infrastructure modernization:

  • Migration from a Data Warehouse to BigQuery
  • Migration of workloads and/or data from an on-premises environment to the Cloud
  • SQL database migration

High Performance Computing
Automation and resources optimization

The cloud delivers infrastructure as a service, reducing costs and routine operational tasks. At PUE, we are committed to full-stack architectures with CI/CD applied to infrastructure and development automation, as well as container-based microservice architectures that optimize resource consumption and automatically scale services based on workload demand.
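As an illustration of the demand-based scaling mentioned above, the following sketch implements the proportional rule that Kubernetes' Horizontal Pod Autoscaler documents (desired = ceil(current × currentMetric / targetMetric)). It is a minimal stand-alone example, not PUE's implementation; the metric values are hypothetical.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Compute a replica count following the HPA proportional rule:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to the configured min/max bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# A service at 80% average CPU with a 50% target scales from 4 to 7 pods.
print(desired_replicas(4, 80.0, 50.0))  # 7
```

The clamp to `min_replicas`/`max_replicas` is what keeps autoscaling from either shutting a service down entirely or consuming unbounded resources.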

Data Analytics
Data-driven decisions

We design and build Data Lakes that connect data from different sources, making it possible to process information, analyze data in real time and visualize up-to-date results that provide valuable input for decision making.
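The core of "connecting data from different sources" is normalizing heterogeneous records onto one common schema so they can be analyzed together. A minimal sketch of that idea follows; the source names (`crm`, `web`) and field names are hypothetical examples, not a real schema.

```python
def normalize(source: str, record: dict) -> dict:
    """Map a source-specific record onto a common data-lake schema.
    Source and field names here are hypothetical examples."""
    if source == "crm":   # e.g. a CRM export keyed by 'customer_id'
        return {"id": record["customer_id"], "amount": record["value"],
                "source": source}
    if source == "web":   # e.g. clickstream events keyed by 'uid'
        return {"id": record["uid"], "amount": record.get("spend", 0.0),
                "source": source}
    raise ValueError(f"unknown source: {source}")

lake = [normalize("crm", {"customer_id": "c1", "value": 120.0}),
        normalize("web", {"uid": "c1", "spend": 30.0})]

# With a single schema, cross-source analysis is a simple aggregation.
total_by_id = {}
for row in lake:
    total_by_id[row["id"]] = total_by_id.get(row["id"], 0.0) + row["amount"]
print(total_by_id)  # {'c1': 150.0}
```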

Machine Learning & AI
Data insights

We use your data to implement ML & AI models with the Google Cloud suite of services (Vertex AI), helping companies optimize their business, better understand their customers, prevent risks and save time with disruptive, trained models.
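In practice the models would be trained and served in Vertex AI; purely as an illustration of what "preventing risks with a trained model" means at inference time, here is a hypothetical logistic-regression scorer. The feature names and weights are made up for the example.

```python
import math

# Hypothetical coefficients a trained risk model might produce; in a real
# project these would come from a model trained and served in Vertex AI.
WEIGHTS = {"days_since_last_order": 0.04, "support_tickets": 0.3}
BIAS = -2.0

def risk_score(features: dict) -> float:
    """Logistic-regression inference: sigmoid(w . x + b), in [0, 1]."""
    z = BIAS + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

score = risk_score({"days_since_last_order": 60, "support_tickets": 4})
print(round(score, 3))  # 0.832
```

A score near 1 flags a customer for retention action; the business value comes from acting on the score, not from the model alone.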

Data analytics project on Google Cloud microservices infrastructure

Objectives

  • Evolve a platform from B2C to B2B.
  • Run two clusters, one production and one pre-production, with contingency capabilities in a public cloud environment based on Google Cloud.
  • Develop the functionalities of the project.
  • Support a solution covering the entire Big Data lifecycle.

Google Cloud environment

The solution is built on Google Cloud project infrastructure, primarily using the following components:

  • Compute Engine
  • Google Kubernetes Engine
  • Cloud Storage
  • Cloud SQL
  • VPC Network

Instance types

To provide two fully interchangeable environments (pre-production and production), two independent environments with identical infrastructure are proposed to fulfil contingency functions. Each environment consists of the following components:

  • 3 Master Nodes
  • 6 Worker nodes

The recommended distribution of services and instance types is:

  Services                       Master Nodes    Worker Nodes
  MapReduce, YARN, Spark, Hive   n1-highmem-8    n1-highmem-8
  HBase, SolR, Impala            n1-highmem-8    n1-highmem-8 / n1-highmem-16
  All CDH services               n1-highmem-8    n1-highmem-8 / n1-highmem-16
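The layout above can also be captured as configuration data, which makes it easy to validate or hand to provisioning tooling. The sketch below encodes the table and the two-environment contingency setup; it is an illustration, not PUE's provisioning code.

```python
# The recommended distribution from the table above, encoded as data.
DISTRIBUTION = {
    "MapReduce/YARN/Spark/Hive": {"master": ["n1-highmem-8"],
                                  "worker": ["n1-highmem-8"]},
    "HBase/SolR/Impala":         {"master": ["n1-highmem-8"],
                                  "worker": ["n1-highmem-8", "n1-highmem-16"]},
    "All CDH services":          {"master": ["n1-highmem-8"],
                                  "worker": ["n1-highmem-8", "n1-highmem-16"]},
}

def build_environment(name: str, masters: int = 3, workers: int = 6) -> dict:
    """Describe one of the two identical environments (pre and pro)."""
    return {"name": name, "masters": masters, "workers": workers}

envs = [build_environment("pre"), build_environment("pro")]
# Contingency requires the two environments to be infrastructure-identical.
assert all(e["masters"] == 3 and e["workers"] == 6 for e in envs)
print(sum(e["masters"] + e["workers"] for e in envs))  # 18 nodes in total
```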

Google Kubernetes Engine

For client processes, GKE microservices are used with the following features:

  • 2 clusters: pre-production and production.
  • 2 instance templates, one for each cluster.
  • Types of nodes running in the clusters: n1-standard-1.

Service architecture

Data analysis phase

Storage layer

  Type      Component  Description
  Main      HDFS       Distributed storage system; the core on which all other storage and processing rely
  Main      HBase      Columnar database for low-latency, scalable random access to data
  Optional  Kudu       Storage engine with performance characteristics between HBase and HDFS
  Main      Parquet    Columnar storage format on HDFS, with optional compression
  Main      Avro       Row-oriented HDFS container format that supports compression and stores the data schema as metadata
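Avro's key property in the table above is that each file is self-describing: it carries its own schema and supports compression. The sketch below mimics that idea with stdlib `json` + `gzip` as a stand-in; it is not the real Avro binary format.

```python
import gzip
import json
import os
import tempfile

def write_container(path: str, schema: dict, records: list) -> None:
    """Write a compressed file that embeds its own schema, mimicking the
    Avro idea (self-describing container + compression). JSON/gzip is a
    stand-in here, not actual Avro encoding."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump({"schema": schema, "records": records}, f)

def read_container(path: str) -> tuple:
    """Read back both the schema and the records from one file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        payload = json.load(f)
    return payload["schema"], payload["records"]

schema = {"name": "event", "fields": [{"name": "id", "type": "string"}]}
path = os.path.join(tempfile.gettempdir(), "events.json.gz")
write_container(path, schema, [{"id": "a"}, {"id": "b"}])
s, recs = read_container(path)
print(s["name"], len(recs))  # event 2
```

Because the schema travels with the data, a consumer can interpret the file without out-of-band coordination, which is what makes the format practical for long-lived data-lake storage.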

Data entry phase

  Input Element                        Access Elements                                       Description
  Syslog TCP                           Syslog TCP source in Flume, and Kafka                 Flume receives information through a TCP listening source
  Files in local/network directories   Spooling-directory source in Flume, and Kafka         Flume polls parameterised directories for new files
  Relational databases                 Sqoop                                                 Sqoop handles both full and incremental ingestion of structured SQL sources
  Web Services                         HTTP source in Flume with a JSON handler, and Kafka   A connector to web services with JSON content handling
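The "incremental ingestion" row deserves a concrete picture: Sqoop's incremental append mode imports only rows whose check column exceeds a stored last value, then advances that watermark. A minimal pure-Python sketch of the pattern over an in-memory table follows; the table and column names are hypothetical.

```python
def incremental_ingest(source_rows: list, last_value: int) -> tuple:
    """Return rows with id > last_value plus the new watermark, mirroring
    Sqoop's --incremental append --check-column id --last-value N."""
    new_rows = [r for r in source_rows if r["id"] > last_value]
    new_last = max((r["id"] for r in new_rows), default=last_value)
    return new_rows, new_last

table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
batch, watermark = incremental_ingest(table, last_value=1)
print(len(batch), watermark)  # 2 3
```

Storing the watermark between runs is what lets repeated ingestions pick up only new rows instead of re-importing the whole source table.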

Data processing phase

  Requirement                                      Component                          Description
  Batch processing                                 Spark                              Highly effective in-memory processing of complete datasets
  Real-time processing                             Spark Streaming                    Micro-batch processing within time windows
  Filtering, transformation, bundling, enrichment  Spark Streaming, Spark SQL, HBase  Spark Streaming together with Spark SQL (which simplifies process coding) for filtering; HBase for low-latency access during transformation and enrichment
  Analytics and machine learning                   MLlib, GraphX                      Spark processes that use the machine learning and graph libraries
  Text indexing, aggregation and time series       SolR                               SolR indexes documents for fast subsequent search, including from visual environments (integration with Hue for exploitation)
  Near-real-time decision making                   Spark                              A rule engine for low-latency decision making over micro-batches
  Data compression                                 Avro and Parquet                   Native reading, writing and processing of compressed data
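Two rows of the table above, micro-batch processing within time windows and a rule engine over those batches, can be illustrated together. The sketch below shows in pure Python what Spark Streaming does conceptually: discretize a stream into fixed windows, then apply rules per batch. The events and the threshold rule are hypothetical.

```python
from collections import defaultdict

def micro_batches(events: list, window_s: int) -> dict:
    """Group (timestamp, value) events into fixed time windows, the way
    Spark Streaming discretizes a stream into micro-batches."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[ts - ts % window_s].append(value)
    return dict(batches)

def apply_rules(batch: list, threshold: float) -> list:
    """A toy rule engine: flag any value above a configured threshold."""
    return [v for v in batch if v > threshold]

events = [(0, 1.0), (2, 9.5), (5, 3.0), (6, 12.0)]
batches = micro_batches(events, window_s=5)
for window in sorted(batches):
    print(window, batches[window], apply_rules(batches[window], 8.0))
```

Keeping the windows small is the trade-off behind "near real-time": decisions lag the stream by at most one window length.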

Data exploitation phase

  Requirement              Module                    Description
  Visual environment       Hue                       Hue is the unified data exploitation environment; it connects visually to all cluster components and provides an intuitive interface for carrying out the necessary processes
  SQL-like queries         Hive/Impala               Hive is the standard query engine; Impala is included as a low-latency engine to improve query response times
  Third-party integration  ODBC/Web services/Sqoop   Data can be read via JDBC/ODBC connections to Hive, via web services such as the HBase REST interface, or exported to external databases with Sqoop
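To show what the SQL-like exploitation layer looks like from a consumer's point of view, the sketch below uses stdlib `sqlite3` as a stand-in engine; against the real cluster the same query would go to Hive or Impala over JDBC/ODBC. The table and column names are hypothetical.

```python
import sqlite3

# sqlite3 stands in for a Hive/Impala endpoint; the connection mechanism
# differs (JDBC/ODBC on the cluster), but the SQL exploitation is the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (source TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("crm", 120.0), ("web", 30.0), ("web", 10.0)])

rows = conn.execute(
    "SELECT source, SUM(amount) FROM events GROUP BY source ORDER BY source"
).fetchall()
print(rows)  # [('crm', 120.0), ('web', 40.0)]
```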

Share your challenges; we will help you find a solution.
Let’s talk?