Course Length:
5 Days
Course Description:
This hands-on Hadoop training course is designed for experienced developers and provides a fast track to building reliable, scalable application systems with Hadoop open-source software. The class introduces the Hadoop frameworks and tools that are specifically geared towards processing large datasets. Practical case studies are demonstrated in class to show how Hadoop is used in the real world today to solve different problems. MapReduce training is an essential component of this course.
Who Should Attend:
This course is designed for developers seeking a solid foundation in Hadoop architecture.
Benefits of Attendance:
Upon completion of this course, students will be able to:
  • Understand Hadoop's main components and architecture
  • Be comfortable working with the Hadoop Distributed File System
  • Understand the MapReduce abstraction and how it works
  • Program a MapReduce job
  • Master Hadoop Input and Output data formats
  • Be comfortable working with MapReduce Counters and Joins
  • Optimize a Hadoop cluster for the best performance based on specific job requirements
  • Deal with Hadoop component failures and recoveries
  • Get familiar with related Hadoop projects: HBase, Hive and Pig
  • Know best practices for using Hadoop in the enterprise
Prerequisites:
Students taking this course should be competent Java programmers, since concepts taught in class will be reinforced through extensive programming exercises in the Hadoop implementation of MapReduce (which is in Java). Note that this is a course on algorithms and "thinking at scale", not on Hadoop programming. Therefore, we expect you to "pick up" the details of the Hadoop API without explicit instruction from us. Of course, we will assist you by providing resources and a reasonable amount of guidance. In addition, students are assumed to have knowledge of basic probability and statistics (e.g., axioms of probability, Bayes' Theorem, relative frequency estimation, etc.) and also a solid understanding of basic computer architecture (e.g., microprocessor architectures, memory hierarchies, cache coherence protocols, etc.).
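As a rough gauge of the Java level assumed, the following is a minimal sketch of the map, shuffle, and reduce flow applied to word counting. It is written in plain Java and deliberately uses no Hadoop classes; the class name WordCountSketch and its methods are illustrative assumptions, not part of the Hadoop API or of the course materials.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the map -> shuffle -> reduce flow.
// No Hadoop classes are used; every name here is hypothetical.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, kept in sorted key order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: run the three phases in sequence on a single machine.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(emitted).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {at=1, big=2, clusters=1, data=2, scale=1}
        System.out.println(wordCount(List.of("big data big clusters", "data at scale")));
    }
}
```

In a real Hadoop job the framework performs the shuffle and distributes the map and reduce calls across the cluster; this single-process version only illustrates the data flow between the phases.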
Course Outline:
  • Introduction to MapReduce
    1. Administrivia
    2. Overview of cloud computing
    3. Overview of MapReduce and the distributed file system
  • Hadoop: Nuts and Bolts
    1. Writing, running, debugging Hadoop programs
    2. Hadoop behind the scenes
  • MapReduce: the programming environment
    1. MapReduce algorithm design and design patterns
  • Text retrieval algorithms
    1. Introduction to information retrieval
    2. Basics of indexing and retrieval
    3. Inverted indexing in MapReduce
    4. Retrieval at scale
  • Graph algorithms
    1. Graph problems and representations
    2. Parallel breadth-first search
    3. PageRank
  • MapReduce and databases
    1. Relational databases vs. MapReduce
    2. MapReduce algorithms for processing relational data
    3. OLTP vs. OLAP (data warehousing and business intelligence)
  • Hidden Markov models
    1. Hidden Markov models
    2. Expectation maximization
  • Language models
    1. N-gram language models
    2. Parameter estimation for web-scale language models
  • Large-scale graphs
    1. Scalable identity resolution in email collections
    2. DNA sequence assembly
  • Dryad and DryadLINQ
    1. Dryad and DryadLINQ: Guest lecture by Michael Isard.
  • Bigtable, Hive, and Pig
    1. Bigtable
    2. Hive
    3. Pig