
TRAINING STRUCTURE

DAY 1
Session 1: Starting with Hadoop
Fundamental and Core concepts of Hadoop
Infrastructure and Architecture of HDFS
HDFS command line and Web interface
Lab: Going through the Hadoop VM
Concepts of MapReduce function
Architectural overview of MapReduce
MapReduce Types and Formats
Managing and Scheduling Jobs
Concepts of MapReduce Version 2
Lab: Running MapReduce functions
Session 2: MapReduce Programs
MapReduce Flow
Examining a Sample MapReduce Program
Basic MapReduce API Concepts
Driver Code
Mapper
Reducer
Streaming API
Using Eclipse for Rapid Development
New MapReduce API
Lab: Writing a MapReduce Program
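Since this session walks through Driver, Mapper, and Reducer code, a mental model of the map → shuffle/sort → reduce flow helps before the lab. The sketch below is plain Python illustrating the data flow only, not the Hadoop Java API; the function and variable names are my own:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum all counts that arrived for one key
    return (word, sum(counts))

def run_job(lines):
    # Map: apply the mapper to every input record
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle & sort: group intermediate pairs by key, as the
    # framework does between the map and reduce phases
    pairs.sort(key=itemgetter(0))
    # Reduce: one reducer call per distinct key
    return [reducer(k, (v for _, v in group))
            for k, group in groupby(pairs, key=itemgetter(0))]

print(run_job(["the quick fox", "the lazy dog"]))
```

In real Hadoop the framework performs the sort-by-key between the phases; the `sort` plus `groupby` pair above stands in for that guarantee.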
DAY 2
Session 1: Hadoop APIs in depth
ToolRunner
Testing with MRUnit
Reducing Intermediate Data with Combiners
Configuration and Close Methods for Map/Reduce Setup and Teardown
Writing Partitioners for Better Load Balancing
Directly Accessing HDFS
Using the Distributed Cache
Lab: Implementing Combiner
Lab: Writing a Partitioner
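Before the two labs, it is worth seeing the two ideas side by side: a combiner pre-aggregates map output locally, and a partitioner decides which reducer receives each key. This is an illustrative Python sketch, not Hadoop's `Combiner`/`Partitioner` interfaces; `zlib.crc32` stands in for the stable hash a `HashPartitioner` would use:

```python
import zlib
from collections import Counter

def combine(pairs):
    # Combiner: pre-aggregate (word, count) pairs on the map side
    # so that less intermediate data crosses the network.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def partition(key, num_reducers):
    # Partitioner: a stable hash routes every occurrence of a key
    # to the same reducer, spreading distinct keys across reducers.
    return zlib.crc32(key.encode()) % num_reducers

map_output = [("hadoop", 1), ("pig", 1), ("hadoop", 1)]
combined = combine(map_output)   # [('hadoop', 2), ('pig', 1)]
buckets = {k: partition(k, 4) for k, _ in combined}
print(combined, buckets)
```

Note the contract a combiner must obey: its operation has to be associative and commutative (like summing), because the framework may apply it zero, one, or many times.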
Session 2: Practical Development Tips and Techniques
Debugging MapReduce Code
Using LocalJobRunner Mode for Easier Debugging
Retrieving Job Information with Counters
Logging
Splittable File Formats
Determining the Optimal Number of Reducers
Map-Only MapReduce Jobs

Lab: Using Map-Only MapReduce Jobs
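Two of the topics above combine naturally: counters report job statistics, and a map-only job skips the shuffle and reduce phases entirely (typical for filtering or format conversion). A hedged Python sketch of both ideas, with a plain `Counter` standing in for Hadoop's job counters:

```python
from collections import Counter

counters = Counter()  # stands in for Hadoop's per-job counters

def map_only_job(records):
    # A map-only job: no sort/shuffle, no reducers -- output is
    # written straight from the mappers (set numReduceTasks to 0
    # in real Hadoop). Here: drop blank records, uppercase the rest.
    output = []
    for rec in records:
        counters["records_read"] += 1
        if rec.strip():
            output.append(rec.upper())
        else:
            counters["records_skipped"] += 1
    return output

print(map_only_job(["alpha", "", "beta"]))  # ['ALPHA', 'BETA']
print(dict(counters))
```

Counters like these are how a real job reports "how many records were bad" back to the driver without writing a side file.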


Common MapReduce Algorithms
Sorting and Searching
Indexing
Concepts of Machine Learning
Machine Learning with Mahout
Term Frequency
Inverse Document Frequency
Word Co-Occurrence
Lab: Creating an Inverted Index
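The inverted index built in this lab is the core structure behind search and behind the TF/IDF topics above: it maps each term to the documents containing it. A minimal Python sketch of the idea (the MapReduce version emits `(term, doc_id)` pairs from the mappers and collects them per term in the reducers):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Inverted index: term -> sorted list of document ids that
    # contain the term.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "hadoop stores data", 2: "hive queries data"}
index = build_inverted_index(docs)
print(index["data"])  # [1, 2]
```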
Advanced MapReduce Programming
Custom Writables and Writable Comparables
Saving Binary Data Using Sequence Files and Avro Files
Creating Input Formats and Output Formats
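SequenceFiles and Avro files store binary key/value records rather than text. To make the idea concrete, here is a sketch of length-prefixed binary records in Python; this is NOT the real SequenceFile or Avro layout (both add headers, sync markers, and compression), just the underlying "length then bytes" pattern:

```python
import struct

def write_records(records):
    # Each record: 4-byte key length, 4-byte value length (big-endian),
    # then the raw key and value bytes.
    out = bytearray()
    for key, value in records:
        k, v = key.encode(), value.encode()
        out += struct.pack(">II", len(k), len(v)) + k + v
    return bytes(out)

def read_records(data):
    # Walk the buffer, reading each length prefix and then the payload.
    records, pos = [], 0
    while pos < len(data):
        klen, vlen = struct.unpack_from(">II", data, pos)
        pos += 8
        key = data[pos:pos + klen].decode(); pos += klen
        value = data[pos:pos + vlen].decode(); pos += vlen
        records.append((key, value))
    return records

blob = write_records([("k1", "hello"), ("k2", "world")])
print(read_records(blob))  # [('k1', 'hello'), ('k2', 'world')]
```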
Pig (hands on)
Session 1: 100% Hands On
Pig Latin Syntax
Loading Data
Simple Data Types
Field Definitions
Data Output
Viewing the Schema
Filtering and Sorting Data
Commonly-Used Functions
Hands-On Exercise: Using Pig for ETL Processing
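The load → filter → sort pipeline practiced in this exercise can be pictured in a few lines. The sketch below is Python mirroring the Pig dataflow (the corresponding Pig operators are named in the comments); the data and field names are made up for illustration:

```python
import csv
import io

raw = "alice,30\nbob,17\ncarol,25\n"

# LOAD: parse raw input into (name, age) tuples with typed fields
rows = [(name, int(age)) for name, age in csv.reader(io.StringIO(raw))]

# FILTER: keep adults only (like `FILTER rows BY age >= 18` in Pig)
adults = [r for r in rows if r[1] >= 18]

# ORDER: sort by age descending (like `ORDER adults BY age DESC`)
adults.sort(key=lambda r: r[1], reverse=True)

print(adults)  # [('alice', 30), ('carol', 25)]
```

In Pig each of these steps is one relational statement, and the planner compiles the whole script into MapReduce jobs for you.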
Session 2: 100% Hands On
Storage Formats
Complex/Nested Data Types
Grouping
Built-in Functions for Complex Data
Iterating Grouped Data
Session 3: 100% Hands On
Adding Flexibility with Parameters
Macros and Imports, UDFs, Contributed Functions
Using Other Languages to Process Data with Pig
Hands-On Exercise: Extending Pig with
Streaming and UDFs

DAY 3
Hive (hands on)
Session 1:
What Is Hive?
Hive Schema and Data Storage
Hive Use Cases, Interacting with Hive
Relational Data Analysis with Hive
Hive Databases and Tables
Basic HiveQL Syntax
Data Types, Joining Data Sets
Common Built-in Functions
Hands-On Exercise: Running Hive Queries
Session 2:
Hive Data Management
Hive Data Formats
Creating Databases and Hive-Managed Tables
Loading Data into Hive
Altering Databases and Tables
Self-Managed Tables
Simplifying Queries with Views
Storing Query Results
Hands-On Exercise: Data Management with Hive
Session 3:
Text Processing with Hive
Sentiment Analysis and N-Grams
Hands-On Exercise: Gaining Insight with Sentiment Analysis
Hive Optimization
Partitioning
Bucketing
Indexing Data
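Partitioning and bucketing are the two main layout tools here: partitioning physically separates rows by a column's value (one directory per value, so a filtered query scans only its directory), while bucketing hashes a column into a fixed number of files for sampling and joins. A conceptual Python sketch; Hive's actual bucketing hash function differs, `zlib.crc32` is only a stand-in:

```python
import zlib

def bucket_for(key, num_buckets):
    # Bucketing: row goes to hash(column) mod num_buckets, so equal
    # keys land in the same bucket file -- useful for bucketed joins
    # and for sampling a fixed fraction of the data.
    return zlib.crc32(str(key).encode()) % num_buckets

# Partitioning: rows grouped by a column value, one "directory" each;
# a query with WHERE day = '2024-01-01' reads only that partition.
rows = [("2024-01-01", "click"), ("2024-01-02", "view")]
partitions = {}
for day, event in rows:
    partitions.setdefault(day, []).append(event)

print(sorted(partitions))  # ['2024-01-01', '2024-01-02']
```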
DAY 4
Scala Basics
Values, functions, classes, methods, inheritance, try-catch-finally
Expression-oriented programming
Case classes, objects, packages, apply, update
Functions are objects (uniform access principle), pattern matching
Collections
Lists, Maps, functional combinators (map, foreach, filter, zip, folds)
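To preview what these combinators do before writing them in Scala, here are Python analogues of each one, with the rough Scala equivalent in the comments (folds shown via `functools.reduce`):

```python
from functools import reduce

nums = [1, 2, 3, 4]

doubled = list(map(lambda x: x * 2, nums))          # Scala: nums.map(_ * 2)
evens = list(filter(lambda x: x % 2 == 0, nums))    # Scala: nums.filter(_ % 2 == 0)
pairs = list(zip(nums, doubled))                    # Scala: nums.zip(doubled)
total = reduce(lambda acc, x: acc + x, nums, 0)     # Scala: nums.foldLeft(0)(_ + _)

print(doubled, evens, pairs, total)
```

The shared idea is passing a function as a value into a collection operation, which is exactly the style Spark's RDD API builds on in the next section.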

Why Spark?
Problems with Traditional Large-Scale Systems
Introducing Spark
Spark Basics
What is Apache Spark?
Using the Spark Shell
Resilient Distributed Datasets (RDDs)
Functional Programming with Spark
Working with RDDs, RDD Operations
Key-Value Pair RDD
MapReduce and Pair RDD Operations
Passing Functions to Spark
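The pair-RDD operations listed above can be previewed with plain Python collections. This sketch mimics the classic Spark word count; `reduce_by_key` below is a pure-Python stand-in for `reduceByKey`, not the Spark API itself:

```python
def reduce_by_key(pairs, fn):
    # Stand-in for pairRDD.reduceByKey(fn): merge values per key.
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return sorted(acc.items())

lines = ["spark is fast", "spark scales"]
words = [w for line in lines for w in line.split()]   # like flatMap
pairs = [(w, 1) for w in words]                       # like map to (K, V)
counts = reduce_by_key(pairs, lambda a, b: a + b)     # like reduceByKey(_ + _)
print(counts)
```

The real difference in Spark is that each step builds a lazy RDD lineage and nothing executes until an action (such as `collect`) runs; the data flow, however, is the same.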
DAY 5
Storm Using Java
Features of Storm
Storm components: Nimbus and Supervisor nodes
The ZooKeeper cluster
The Storm data model
Definition of a Storm topology
Operation modes
Setting Up a Storm Cluster
Setting up a distributed Storm cluster
Deploying a topology on a remote Storm cluster
Deploying the sample topology on the remote cluster
Configuring the parallelism of a topology
The worker process
The executor
Tasks
Configuring parallelism at the code level
Distributing worker processes, executors, and tasks in the sample topology
Rebalancing the parallelism of a topology
Rebalancing the parallelism of the sample topology
Stream grouping, Shuffle grouping, Fields grouping
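The two groupings above differ in how a tuple is routed to a bolt task. Modeled as task-index functions in Python (Storm's real implementation differs; `zlib.crc32` stands in for its field hash):

```python
import random
import zlib

def shuffle_grouping(num_tasks, rng=random.Random(0)):
    # Shuffle grouping: tuples are spread across tasks at random,
    # balancing load but giving no key locality.
    return rng.randrange(num_tasks)

def fields_grouping(field_value, num_tasks):
    # Fields grouping: a stable hash of the chosen field sends the
    # same value to the same task every time, so a bolt can keep
    # per-key state such as running word counts.
    return zlib.crc32(str(field_value).encode()) % num_tasks

print(shuffle_grouping(4), fields_grouping("storm", 4))
```

This is why a word-count bolt must use fields grouping on the word: with shuffle grouping, counts for one word would be scattered across tasks.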
Storm and Kafka Integration
The Kafka architecture
The producer
Replication
Consumers
Brokers
Data retention
Setting up Kafka
Setting up a single-node Kafka cluster
A sample Kafka producer
Integrating Kafka with Storm

Exploring High-level Abstraction in Storm with Trident


Introducing Trident
Understanding Trident's data model
Writing Trident functions, filters, and projections
Trident functions
Trident filters
Trident projections
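The three Trident operations above have simple semantics: a function adds fields computed from a tuple, a filter drops tuples, and a projection keeps only named fields. A hedged sketch modeling a Trident stream as a list of dicts (the helper names are mine, not Trident's API):

```python
def trident_function(stream, fn):
    # Function: emits extra fields computed from each tuple's fields.
    return [{**t, **fn(t)} for t in stream]

def trident_filter(stream, keep):
    # Filter: drops tuples for which keep() returns False.
    return [t for t in stream if keep(t)]

def trident_projection(stream, fields):
    # Projection: keeps only the named fields of each tuple.
    return [{f: t[f] for f in fields} for t in stream]

stream = [{"x": 1, "y": 2}, {"x": 5, "y": 1}]
stream = trident_function(stream, lambda t: {"sum": t["x"] + t["y"]})
stream = trident_filter(stream, lambda t: t["sum"] > 3)
stream = trident_projection(stream, ["sum"])
print(stream)  # [{'sum': 6}]
```

Note that, as in Trident itself, a function's output fields are appended to the tuple rather than replacing it; projection is what narrows the tuple down.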

*****
