Hadoop – Basic

Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models.

Course Overview

An application built on the Hadoop framework runs in an environment that provides distributed storage and computation across clusters of computers.

Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Prerequisites and Entry Criteria

For Whom:

  • Software developers and architects
  • Managers
  • Testing / QA Engineers

Needs:

  • VirtualBox – Ubuntu Guest OS
  • Hadoop binaries
  • Internet Connection

Key Benefits:

  • Mastery of core Hadoop concepts
  • HDFS
  • MapReduce
  • Setting up a cluster and writing a basic MapReduce task
  • Introduction to the tools Pig, Hive, and Sqoop

Covering (10am to 6pm):

  • Introduction to Big Data and Hadoop (0.5 Hr)
  • HDFS Architecture (0.5 Hr)
  • Hadoop Deployment (1 Hr)
  • MapReduce (1 Hr)
  • YARN Architecture (1 Hr)
  • Hadoop toolset – Introductions (1 Hr)
  • Hands-on (2 Hr)

OUTLINE OF COURSE CONTENTS

1. INTRODUCTION TO BIG DATA & HADOOP

  • What is Big Data?
  • Data Explosion and its Sources
  • Types of Data – Structured, Semi-structured, Unstructured data
  • Characteristics of Big Data
  • Limitations of Traditional Large-Scale Systems
  • Use Cases for Big Data
  • Challenges of Big Data
  • Hadoop Introduction – What is Hadoop? Why Hadoop?
  • Hadoop Job Trends
  • History of Hadoop
  • Hadoop Core Components – MapReduce & HDFS

2. HDFS ARCHITECTURE

  • Introduction to Hadoop Distributed File System
  • Regular File System vs. HDFS
  • HDFS Architecture
  • Components of HDFS – NameNode, DataNode, SecondaryNameNode
  • HDFS Features – Fault Tolerance, Horizontal Scaling, Data
    Replication, Rack Awareness
  • Anatomy of a file write on HDFS
  • Anatomy of a file read on HDFS
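
To make the data replication topic concrete, here is a minimal Java sketch of the arithmetic behind HDFS block splitting. It assumes the Hadoop 2 defaults of a 128 MB block size and a replication factor of 3, both of which are configurable per cluster. The class and method names are hypothetical and are not part of the Hadoop API.

```java
public class HdfsBlockSketch {

    // Number of blocks a file is split into (ceiling division).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Total block replicas stored across the cluster.
    static long replicaCount(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    // Raw disk usage: HDFS does not pad the last block, so usage is
    // simply the file size multiplied by the replication factor.
    static long rawStorageBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // assumed default block size
        int replication = 3;                   // assumed default replication factor
        long fileSize = 1024L * 1024 * 1024;   // example: a 1 GiB file

        System.out.println("blocks   = " + blockCount(fileSize, blockSize));
        System.out.println("replicas = " + replicaCount(fileSize, blockSize, replication));
        System.out.println("raw MiB  = " + rawStorageBytes(fileSize, replication) / (1024 * 1024));
    }
}
```

Under these assumptions a 1 GiB file occupies 8 blocks, 24 block replicas, and 3072 MiB of raw storage.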

3. HADOOP DEPLOYMENT

  • Deployment Modes – Standalone Mode, Pseudo-Distributed Mode, Fully Distributed Mode
  • Pseudo-Distributed Mode Virtual Machine Setup on Windows
  • Oracle VM VirtualBox – Introduction
  • Install VirtualBox
  • Create a VM in Oracle VM VirtualBox
  • Download and Install Hadoop Packages
  • Hadoop Configuration
  • HDFS, MapReduce and YARN parameters
  • Hadoop Core Services – Daemon Process Status using jps
  • Hadoop WebUI
  • Eclipse development environment setup

4. MAPREDUCE

  • What is MapReduce and why is it popular?
  • MapReduce Framework – Introduction, Driver, Mapper, Reducer, Combiner, Split, Shuffle & Sort
  • Example: WordCount, the "Hello World" of MapReduce
  • Use cases of MapReduce
  • Real-time uses of MapReduce
  • Input Splits in MapReduce
  • Hands on with MapReduce Programming
  • MapReduce Architecture
  • Responsibilities of the JobTracker and TaskTracker in classic MapReduce (v1)
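
As a preview of the WordCount example, the sketch below mimics the map, shuffle & sort, and reduce phases in plain Java, without any Hadoop dependency. It is an illustration of the data flow only: a real job would use Hadoop's Mapper and Reducer classes and run across a cluster. The class and method names here are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle & sort phase: group values by key; the TreeMap keeps keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum all values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: wire the three phases together over a list of input lines.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(countWords(Arrays.asList("Hello Hadoop", "hello world")));
    }
}
```

In a real cluster the map calls run in parallel on separate input splits, and the framework performs the shuffle over the network before the reducers run.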

5. YARN ARCHITECTURE

  • Hadoop 1.0 Limitations
  • MapReduce Limitations
  • YARN Architecture
  • Classic vs. YARN
  • Speculative Execution
  • Counters – Retrieving Job Information
  • Understanding Data Types of Keys and Values
  • Understanding Input/Output Formats, SequenceFile Input/Output Format
  • Map-Side Join, Reduce-Side Join, Distributed Join, Replicated Join, Composite Join, Cartesian Product

6. HADOOP TOOLSET - INTRODUCTIONS

  • HBase – NoSQL Data Operations
  • Pig – Scripting
  • Hive – SQL Query
  • Sqoop and Flume – Data Ingestion
  • Oozie – Workflow

7. HANDS-ON

  • Build a Hadoop v2 cluster (Parts 3, 4 & 5)
  • Basic MapReduce project

Testimonials

  • My objective in attending the training was met. The trainer was very experienced.

    Thariq Ahmed K P AMB INDIA PVT LTD
  • A good trainer. Presentation skill & confidence level is good. Course material is good!

    Piyush Dhani ORACLE
  • Trainer was very good with the examples that he gave to improve & understand the concepts better.

    Prabhakar Cirivn MICROSOFT
  • Trainer is confident, cool & well organized!

    M Ashok TECH MAHINDRA