INTRODUCTION

What is Hadoop?

Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

OBJECTIVES
  • Understand the various components of the Hadoop ecosystem, such as Hadoop 2.7, Impala, YARN, MapReduce, Pig, Hive, HBase, Sqoop, Flume, and Apache Spark.
  • Learn automated source code management using Git and continuous integration using Jenkins.
  • Understand MapReduce and its characteristics, and grasp advanced MapReduce concepts.
  • Gain a working knowledge of Pig and its components.
TRAINING
  • Complete Hadoop Training - Learn Hadoop from beginner to advanced level.
  • Customized Hadoop Training - Customize the syllabus as per your requirements.
  • Hadoop Project-based Training - Choose any project and get training based on that project.
  • Hadoop Application Training - Get our experts' assistance with your existing project.
SYLLABUS

Hadoop Syllabus

MapReduce

  • Why MapReduce
  • How MapReduce works
  • Hadoop data types
  • Difference between Hadoop 1 & Hadoop 2
  • Main class
  • Mapper & Reducer Classes
  • The Job class
  • JobContext interface
  • Partitioner & Reporter Interfaces
  • The Map & Reduce phases to process data
  • Identity mapper & reducer
  • Data flow in MapReduce
  • Input Splits
  • Relation Between Input Splits and HDFS Blocks
  • Flow of Job Submission in MapReduce
  • Combiners & Partitioners
  • Job submission & Monitoring
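The Map and Reduce phases and the shuffle step between them, listed above, can be sketched as a word count in plain Python. This is a conceptual simulation of the data flow, not Hadoop API code; all function names here are illustrative.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input split
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Shuffle/sort: group all values by key, as Hadoop does between phases
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: aggregate the grouped values for one key
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
mapped = (pair for line in lines for pair in mapper(line))
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

A combiner is the same idea as the reducer, run locally on each mapper's output before the shuffle to cut network traffic.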

Yarn

  • Introduction to Yarn
  • Traditional MapReduce v/s Yarn
  • Yarn Architecture
    • Resource Manager
    • Node Manager
    • Application Master
  • Application submission in YARN
  • Node Manager containers
  • Resource Manager components
  • Yarn applications
  • Scheduling in Yarn
    • Fair Scheduler
    • Capacity Scheduler
  • Fault tolerance

Hadoop Ecosystem

Pig

  • What is Apache Pig
  • Why Apache Pig
  • Pig features
  • Where should Pig be used
  • Where not to use Pig
  • The Pig Architecture
  • Pig components
  • Pig v/s MapReduce
  • Pig v/s SQL
  • Pig v/s Hive
  • Pig Installation
  • Pig Execution Modes & Mechanisms
  • Grunt Shell Commands
  • Pig Latin - Data Model
  • Pig data types
  • Pig Latin operators
  • Case Sensitivity
  • Grouping & Co-Grouping in Pig Latin
  • Sorting & Filtering
  • Joins in Pig Latin
  • Built-in Functions
  • Writing UDFs
  • Macros in Pig
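The GROUP and COGROUP operations central to Pig Latin's data model can be approximated in Python to show the idea: GROUP collects tuples sharing a key into one "bag" per key, and COGROUP pairs such bags from two relations. This is an illustrative sketch only; Pig itself uses Pig Latin syntax, and the sample relations are made up.

```python
from collections import defaultdict

def group_by(relation, key_index):
    # Pig's GROUP: collect tuples that share a key into one bag per key
    bags = defaultdict(list)
    for t in relation:
        bags[t[key_index]].append(t)
    return dict(bags)

students = [("alice", "cs"), ("bob", "math"), ("carol", "cs")]
teachers = [("dan", "cs"), ("erin", "math")]

by_dept = group_by(students, 1)
# COGROUP extends GROUP across two relations, pairing the bags per key
cogrouped = {k: (group_by(students, 1).get(k, []),
                 group_by(teachers, 1).get(k, []))
             for k in {t[1] for t in students + teachers}}
```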

Hive

  • What is Hive
  • Features of Hive
  • The Hive Architecture
  • Components of Hive
  • Installation & configuration
  • Primitive types
  • Complex types
  • Built in functions
  • Hive UDFs
  • Views & Indexes
  • Hive Data Models
  • Hive vs Pig
  • Co-groups
  • Importing data
  • Hive DDL statements
  • Hive Query Language
  • Data types & Operators
  • Type conversions
  • Joins
  • Sorting & controlling data flow
  • Local vs MapReduce mode
  • Partitions
  • Buckets
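Partitions and buckets, the last two items above, are Hive's two ways of sharding a table: partitions split rows by a column's literal value (one directory per value), while buckets split them by a hash of a column into a fixed number of files. A rough Python sketch of the distinction (Hive's actual hash function differs; the sample rows are made up):

```python
def assign_bucket(value, num_buckets):
    # Hive-style bucketing: a row lands in bucket hash(column) mod N.
    # Python's hash() stands in for Hive's own hash here.
    return hash(value) % num_buckets

rows = [("2024-01-01", "alice"), ("2024-01-01", "bob"), ("2024-01-02", "carol")]

# Partition by date: one group per distinct value, cardinality-dependent
partitions = {}
for dt, user in rows:
    partitions.setdefault(dt, []).append(user)

# Bucket by user into 4 buckets: file count fixed regardless of cardinality
buckets = {}
for dt, user in rows:
    buckets.setdefault(assign_bucket(user, 4), []).append(user)
```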

Sqoop

  • Introducing Sqoop
  • Sqoop installation
  • Working of Sqoop
  • Understanding connectors
  • Importing data from MySQL to Hadoop HDFS
  • Selective imports
  • Importing data to Hive
  • Importing to HBase
  • Exporting data to MySQL from Hadoop
  • Controlling import process

Flume

  • What is Flume
  • Applications of Flume
  • Advantages of Flume
  • Flume architecture
  • Data flow in Flume
  • Flume features
  • Flume Event
  • Flume Agent
    • Sources
    • Channels
    • Sinks
  • Log Data in Flume

HBase

  • What is HBase
  • History Of HBase
  • The NoSQL Scenario
  • HBase & HDFS
  • Physical Storage
  • HBase v/s RDBMS
  • Features of HBase
  • HBase Data model
  • Master server
  • Region servers & Regions
  • HBase Shell
  • Create table and column family
  • The HBase Client API

Spark

  • Introduction to Apache Spark
  • Features of Spark
  • Spark built on Hadoop
  • Components of Spark
  • Resilient Distributed Datasets
  • Data Sharing using Spark RDD
  • Iterative Operations on Spark RDD
  • Interactive Operations on Spark RDD
  • Spark shell
  • RDD transformations
  • Actions
  • Programming with RDD
    • Start Shell
    • Create RDD
    • Execute Transformations
    • Caching Transformations
    • Applying Action
    • Checking output
  • GraphX overview

Scala Overview

  • Introduction to Scala
  • Spark & Scala interdependence
  • Objects & Classes
  • Class definition in Scala
  • Basic Data Types
  • Operators in Scala
  • Control structures
  • Fields in Scala
  • Functions in Scala
  • Collections in Scala
    • Mutable collection
    • Immutable collection

Zookeeper Overview

  • Zookeeper Introduction
  • Distributed Application
  • Benefits of Distributed Applications
  • Why use Zookeeper
  • Zookeeper Architecture
  • Hierarchical namespace
  • Znodes
  • Stat structure of a Znode
  • Electing a leader

Oozie & Hue Overview

  • Introduction to Apache Oozie
  • Oozie Workflow
  • Oozie Coordinators
  • Property File
  • Oozie Bundle system
  • CLI and extensions
  • Overview of Hue

MongoDB Overview

  • Introduction to MongoDB
  • MongoDB v/s RDBMS
  • Why & Where to use MongoDB
  • Databases & Collections
  • Inserting & querying documents
  • Schema Design
  • CRUD Operations

Planning Hadoop Cluster

  • Architecture of Hadoop Cluster
  • Workflow of Hadoop Cluster
  • HDFS Writes
  • Preparing for HDFS Writes
  • Pipelined HDFS Write
  • NameNode Functionality
  • Replicating Missing Replicas
  • HDFS Reads
  • Factors for Planning Hadoop Cluster
  • Single-Node and Multi-Node Cluster Configuration
  • HDFS Block replication and rack awareness
  • Topology and Components of Hadoop Cluster

Cluster Maintenance

  • Checking HDFS Status
  • Breaking the cluster
  • Copying Data Between Clusters
  • Adding and Removing Cluster Nodes
  • Rebalancing the cluster
  • Name Node Metadata Backup
  • Cluster Upgrading

Advanced Cluster Configuration Features

  • Hadoop Configuration Overview
  • Types of Configuration Files
  • Hadoop Cluster and MapReduce Configuration Parameters with Values
  • Hadoop Environment Setup
  • Include and Exclude Configuration Files

Managing and Scheduling Jobs

  • Managing Jobs
  • The FIFO and Fair Schedulers
  • How to stop and start jobs running on the cluster

Cluster Monitoring, Troubleshooting and Optimizing

  • General System conditions to Monitor
  • NameNode and JobTracker Web UIs
  • View and Manage Hadoop's Log files
  • Ganglia Monitoring Tool
  • Common cluster issues and their resolutions

YARN

  • Introduction to YARN
  • Need for YARN
  • YARN Architecture
  • YARN Installation and Configuration

Extending Hadoop

  • Simplifying information access
  • Enabling SQL-like querying with Hive
  • Installing Pig to create MapReduce jobs
  • Imposing a tabular view on HDFS with HBase
  • Configuring Oozie to schedule workflows

Installing and Managing Hadoop Ecosystem

  • Sqoop
  • Flume
  • Hive
  • Pig
  • HBase
  • Oozie

Hadoop Analytics using R (For Data Scientists)

Functions & plots In R

  • Measuring the central tendency – the mode
  • Measuring spread – variance and standard deviation
  • Visualizing numeric variables – boxplots
  • Visualizing numeric variables – histograms
  • Visualizing numeric variables – qqplot
  • Understanding numeric data – uniform and normal distributions
  • Exploring relationships between variables
  • Visualizing relationships – scatterplots
  • Exploring numeric variables

Read and Write Operations in R

  • Reading from CSV
  • Reading from URL
  • Reading from Excel
  • Writing to CSV & PMML

Integrating R

  • Implementing Association rule mining in R
  • Integrating R with Hadoop using RHadoop and RMR package
  • Writing MapReduce Jobs in R and executing them on Hadoop
  • Implementing Machine Learning Algorithms on larger Data Sets with Apache Mahout

Databases and Introduction to Machine Learning Concept

  • Use SQL databases to store and organize data
  • Access stored data with the MySQL query language
  • Introduction to Machine Learning
  • Supervised and Unsupervised Learning Techniques

Regression Methods and Supervised Learning Techniques

  • Creating predictive models
  • Classification Using Nearest Neighbors
  • Linear Regression
  • Multiple linear regression model
  • Logistic Regression
  • Decision Tree Classifier
  • Clustering
  • What are Random Forests?
  • Features of Random Forest
  • Out-of-Bag Error Estimate
  • Naive Bayes Classifier
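Simple linear regression, listed above, fits y = a + b·x by ordinary least squares, which has a closed-form solution. A minimal pure-Python sketch with made-up data:

```python
def fit_linear(xs, ys):
    # Ordinary least squares for y = a + b*x (closed-form solution):
    # b = cov(x, y) / var(x), a = mean(y) - b * mean(x)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Perfectly linear data recovers the true intercept and slope
a, b = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])  # y = 1 + 2x, so a = 1, b = 2
```

Multiple linear regression generalizes the same idea to several predictors, where the closed form becomes the normal equations over a design matrix.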

Unsupervised Machine Learning Techniques

  • Introduction of K-Means Clustering
  • K-means in Euclidean space
  • K-means as optimization
  • Understanding TF-IDF and Cosine Similarity and their application to the Vector Space Model
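TF-IDF and cosine similarity can be computed directly in Python. This minimal sketch uses raw term frequency and idf = log(N / df); real libraries vary in their exact weighting and smoothing, and the sample documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    # Term frequency: raw counts per document
    tfs = [Counter(doc.split()) for doc in docs]
    # Inverse document frequency: log(N / df) per term
    vocab = set().union(*tfs)
    n = len(docs)
    idf = {t: math.log(n / sum(1 for tf in tfs if t in tf)) for t in vocab}
    # Each document becomes a sparse vector {term: tf * idf}
    return [{t: tf[t] * idf[t] for t in tf} for tf in tfs]

def cosine(u, v):
    # Cosine similarity between two sparse vectors (dicts)
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if norm(u) and norm(v) else 0.0

docs = ["big data hadoop", "big data spark", "cooking recipes"]
vecs = tfidf(docs)
sim_related = cosine(vecs[0], vecs[1])     # share "big" and "data"
sim_unrelated = cosine(vecs[0], vecs[2])   # no terms in common
```

In the vector space model, documents closer in angle (higher cosine) are judged more similar, regardless of their lengths.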

Deep learning

  • Deep Network
  • Optimization for Training Deep Models
  • Convolutional Networks
  • Understanding Support Vector Machines
  • Retrieve data using SQL statements
  • Using kernels for non-linear spaces

Project

Project name: Live Project

Project description: Students will be assigned a project, which they will execute under the careful guidance of the faculty.