Big Data Masters Program

1

Introduction to Big Data

2

Big Data System Requirements

3

Monolithic vs Distributed System

4

Distributed System Architecture

5

What is Hadoop And Evolution of Hadoop

6

Core Components of Hadoop

7

HDFS Architecture:

8

What is Node And What is Cluster

9

Data Block & Block Size

10

Slave Node, Master Node, Data Node & Name Node

11

Metadata And Replication Factor

12

Heart Beat & Fault Tolerance

13

Handling Namenode Failure

14

What is SPOF

15

FSimage & Edit Logs

16

Secondary Namenode

17

Name Node Recovery

18

Check Pointing

19

Understanding Replication Factor

20

What is Rack And Rack Failure

21

Rack Awareness Mechanism

22

Block Report

23

Namenode High Availability

24

Quorum Journal Manager & Quorum Journal Node

25

Understanding Linux File System

26

List & Parameters of List Command

27

Touch, Mkdir, Rmdir & Other Linux Commands

28

HDFS Commands:

29

List Files & Directories

30

How HDFS Commands Work

31

‘ls’ Command With Various Parameters

32

Create, Remove File/Directory

33

Copy & Get Files/Folders From Local to HDFS & Vice Versa

34

Move Files/Folders From HDFS to HDFS

35

Change Replication Factor Dynamically

36

View File Metadata Information

1

Introduction to MapReduce

2

Stages in MapReduce

3

What is Key-Value

4

What is Map & What is Reduce

5

Example to Understand Map & Reduce

6

Word Count Example in MapReduce

7

Record Reader

8

Mapper Phase

9

Reducer Phase

10

MapReduce Shuffle & Sort

11

Inside Map & Reduce Phase

12

Wordcount Example in MapReduce

13

Typical MapReduce Flow

14

Blocks in MapReduce

15

Default Number of Mappers & Reducers

16

Understanding Number of Mappers/Reducers

17

MapReduce Framework Behind the Scenes

18

Role of Hash Function in MapReduce

19

Partitioning in MapReduce

20

How to Choose Number of Reducers

21

How Hash Function Works

22

Understanding Shuffle & Sort

23

Example: Calculating Max Temperature in a Day

24

Combiner Function in MapReduce

25

Advantages of Combiners

26

When to Use or Not to Use Combiner

27

Example1: Filtering Data using MapReduce

28

Example2: Finding Distinct Values

29

Example3: Finding Top 3 Most Influential users

30

Realtime Use Case: Google Web Search

31

MapReduce Programming

32

MR Code Explanation

33

How to Write Map Reduce Code

34

Mapper Code

35

Reducer Code

36

Main Code

37

Finding the Frequency of Each Word in a File

38

MapReduce Jars

39

MapReduce Practical Sessions

40

Word Count Program – Practical Session1

41

Jar Creation & Execution – Practical Session2:

42

How to Create a Jar

43

How to Execute the Jar

44

How to Track a Job

45

How to Track All Previous Jobs

46

MR Program Variations – Practical Session3:

47

How to Change Number of Reducers

48

Writing Custom Partitioner Logic

49

Changing Number of Reducers to Zero

50

Introducing Combiner

51

Writing Custom Combiner logic

52

Introduction to Partitioners

53

Partitioners Code Example

1

Sqoop Fundamentals

2

Sqoop Basics

3

What is Sqoop

4

Sqoop Workflow

5

Key Features of Sqoop

6

Sqoop Import

7

Sqoop Export

8

Connecting to MySQL

9

Accessing MySQL Databases from Hadoop

10

Accessing MySQL Tables from Hadoop

11

Sqoop Import Practicals

12

Sqoop Export Practicals

13

Sqoop Job

14

Sqoop Incremental Load

15

Sqoop Default Import

16

Sqoop Free-Form Query Import

17

Sqoop Direct Import

18

Importing Data Into Hive

19

Importing Data Into HBase

20

Sqoop Validate

21

When a Sqoop Export May Fail

1

Hive Overview:

2

Transactional System and Analytical System

3

Examples of Transactional Systems

4

Examples of Analytical Systems

5

What is Hive

6

Hive Query Language (HQL)

7

Understanding Hive Table

8

Introduction to Hive Metadata

9

Why Hive over traditional databases

10

Transactional and Analytical Processing

11

What is Data Warehouse

12

Hive Architecture

13

Hive on top of Hadoop

14

How Hive Works

15

Transactional vs Analytical Processing

16

Data Warehouse Concept

17

The Hive Metastore

18

Hive vs RDBMS

19

HQL vs SQL

20

Hive Subqueries, Views & Index

21

Transactional and Analytical Processing

22

What is Data Warehouse

23

Hive Architecture

24

Hive on Hadoop

25

Hive Metastore

26

Hive vs. RDBMS

27

Hive Complex Data Types

28

Hive Array, Map & Struct

29

Hive Built-in Functions

30

Hive UDF, UDAF & UDTF

31

Hive Lateral Views

32

Hive Subqueries

33

Hive Views

34

Hive Normalization vs Denormalization

1

Hive Structure Level Optimizations:

2

Hive Partitioning

3

Hive Partitioning With 2 Columns

4

Hive Bucketing

5

Hive Partitioning With Bucketing

6

Hive Query Level Optimizations:

7

Hive Join Optimizations

8

Hive Bucket Map Join Optimizations

9

Hive Window Functions

10

Hive Ranking

11

Hive Sorting

12

Hive File Format

13

Row vs Column File Formats

14

Specialized File Formats

15

Internals of ORC File Formats

16

Internals of Parquet File Formats

17

ORC vs Parquet File Formats

18

Hive Compression Techniques

19

Hive Vectorization

20

Changing the Hive Engine

21

Hive Thrift Server

1

HBase Basics

2

Key requirements of database

3

Limitations of Hadoop

4

Google Bigtable concept for quick searching

5

Implementation of Bigtable as HBase

6

Properties of HBase

7

What HBase can offer

8

Row based storage vs Columnar storage

9

Advantages of columnar storage

10

Normalization vs Denormalization

11

CRUD Operation

12

RDBMS vs HBase

13

HBase data model

14

4-Dimensional data model

15

CAP Theorem

16

HBase Architecture

17

HBase Region Server

18

Region, MemStore, WAL & Block Cache

19

HFile

20

Zookeeper

21

HMaster & Meta Table

22

HBase Architecture components in detail

23

HBase Read/Write operations

24

Compaction

25

HBase Data Update

26

HBase Data Deletion

27

Handling Server Failures

28

HBase Practicals

29

Handling HBase Service Failures

30

Create & List Table

31

Insert Records in Table

32

Scan(view) & Get records from table

33

Delete a column

34

Describe a table

35

Check table exists or not

36

Drop table – Understanding how it works

37

Parameters of get command

38

Parameters of scan command

39

HBase file structure in HDFS

40

How to disable/enable a table

41

Various filters in HBase

42

Count Records

1

What is Cassandra

2

What a Cassandra Cluster Looks Like

3

Tunable read/write Consistency

4

HBase vs Cassandra

5

Integration with Hadoop (Mini Project)

6

HBase-Hive Integration

1

Why Scala

2

Where to Run Scala Code

3

Scala Code Using IDE

4

Scala Basics

5

Var vs val

6

Type inference

7

Data types in Scala

8

String Interpolation

9

String Comparison

10

Flow control: If else

11

Match Case

12

For Loop

13

While loop

14

Scala Functional Programming

15

How to define a function

16

Higher order function

17

Anonymous function

18

Scala Collections

19

Array

20

List

21

Tuple

22

Range

23

Set

24

Map

25

Scala Functional Programming:

26

Why Scala

27

Modes of writing Scala code

28

What is functional programming

29

What is a function

30

What is a pure function?

31

First class function

32

Higher order function

33

Anonymous function

34

Immutability

35

Loop

36

Recursion

37

Tail recursion

38

Statement vs Expression

39

Closure

40

Scala type system

41

Scala operators

42

Anonymous function

43

Placeholder syntax

44

Partially applied functions

45

Function currying

1

What is App class in Scala

2

Default args, named args & variable args

3

Difference between Nil, null, None & Nothing

4

What is Option in Scala

5

What is Unit in Scala

6

Dealing with nulls in Scala

7

What is yield

8

What is Vector

9

Scala if guards & pattern guards

10

What is “for comprehensions”

11

Difference between “==” in Java and Scala

12

Difference between strict val vs lazy val

13

What are default packages in Scala

14

What is Scala apply method

15

What is a diamond problem in Scala

16

What is a trait

17

Why Scala is the top choice for big data

18

What is Apache Spark

1

What is Apache Spark

2

Understanding Spark cluster

3

Is Spark a replacement for Hadoop

4

Why Spark is faster than MapReduce

5

How data is stored in Spark

6

What is RDD

7

What is DAG

8

RDD Lineage

9

Resiliency

10

Immutability

11

Transformation & Action

12

Lazy Evaluation

13

Word count program in Spark

14

Word count program in PySpark

15

Word count problem real-time example

1

Spark Real-Time Example

2

Broadcast Variable

3

Accumulators

4

How Spark Executes Program on the Cluster

5

Spark Driver and Executors

6

Client Mode, Cluster Mode and Local Mode: Analyzing Log Messages – Hands on

7

Narrow vs Wide Transformations

8

Stages in Spark

9

Difference Between reduceByKey & reduce

10

Difference Between groupByKey & reduceByKey

11

Pair RDD

12

Pair RDD vs Map

13

Understanding Default Parallelism

14

Difference Between repartition & coalesce

15

When to Increase/Decrease Partitions

16

Spark on YARN Architecture

17

YARN – Yet Another Resource Negotiator

18

Application Master

19

Containers

1

Cache vs Persist

2

Spark Storage Levels

3

Difference Between DAG & Lineage

4

How to Submit a Spark Job

5

Real-time example – Finding top movies based on ratings

6

Spark Ecosystem

7

Map vs Map Partitions

8

Introduction to Spark Structured API

9

Spark DataFrame

10

Understanding SparkSession

11

SparkSession vs SparkContext

12

Dataframe with Various Transformations

13

RDD vs DataFrame vs Datasets

14

Challenges with DataFrame

15

Spark Dataset API

16

Difference Between DataFrame and Dataset

17

Benefits of Dataset

18

Creating Dataframe/Datasets from Various File Formats

19

Read Modes & Schema

20

Ways to Define the Schema

21

Defining an Explicit Schema

1

Writing Output to Sink (spark.write)

2

Spark File Layout

3

Benefits of Repartitions

4

partitionBy & bucketBy

5

Saving file in Various file format

6

Introduction to Spark SQL

7

Storing Data in Persistent Manner

8

Handling Spark Metadata

9

Low & High level Transformations

10

Referring to a Column in Dataframe/Dataset

11

Column String

12

Column Object

13

Column Expression

14

Spark UDF using Structured API

15

Adding Column in Dataframe

16

Dataframe to Dataset Using Case Class

17

Dataset to DataFrame Conversion

18

Spark Catalog

19

Registering UDF with Driver

20

Transformations Hands on Examples

21

Aggregate Transformations

22

Simple Aggregations

23

Grouping Aggregations

24

Window Aggregations

25

Joins on DataFrame

26

Simple Join (Shuffle Sort Merge Join)

27

Broadcast Join

28

Dealing With Ambiguous Column Names

29

Dealing With Nulls

30

Internals of Join Operations

31

When to Use Simple Join vs Broadcast Join

32

Grouping Aggregation Real-time Example

33

Inferring Data in Spark SQL

34

Quiz

35

Assignment

36

Assignment Solution

1

Level of Optimizations

2

Resource level optimizations

3

Application level optimizations

4

Cluster level optimizations

5

How to calculate no of Executors

6

Thin Executor

7

Fat Executor

8

How to calculate no of Executors

9

How to Calculate Memory Allocation

10

How to Calculate No of Cores

11

Heap Memory

12

Off-Heap Memory

13

Hands on With Real-time cluster

14

Understanding Cluster Configurations

15

Real-time Example: Moving data to HDFS using an edge node and working around it in a real-time cluster

16

Static Resource allocation

17

Dynamic Resource allocation

18

Understanding Memory Usage in Spark

19

Execution Memory

20

Storage Memory

21

Practical Demonstration: Cache & Persist

22

Java Serializer vs Kryo Serializer

23

Quiz

24

Assignment

25

Assignment Solution

1

Broadcast Join Practical Demonstrations

2

Broadcast Join Using RDD

3

When to Use Broadcast Join

4

Broadcast Join Using Dataframe

5

Visualizing Broadcast Join with Structured API

6

Practical Demo on Repartition vs Coalesce

7

Client Mode vs Cluster Mode When using Spark submit

8

Spark Join Optimizations

9

Spark Advanced Optimizations: Sort Aggregate vs Hash Aggregate

10

Spark Catalyst Optimizer

11

Quiz

12

Assignment

13

Assignment Solution

1

What is Real-time Processing

2

The Importance of Real-time Processing

3

Batch processing vs Real-time Stream Processing

4

Spark Streaming Data

5

Spark discretized stream or DStream

6

Batch & Batch Interval

7

Is Spark a real-time streaming engine

8

Stream Processing in Spark

9

Transformed DStream

10

Understanding Producer & Consumer

11

Practical on Real-time Processing

12

Stream Transformations

13

Stateless Transformations

14

Stateful Transformations

15

Window Operations

16

Batch Interval

17

Window Size

18

Sliding Interval

19

Practical on Stateless Transformation

20

Practical on Stateful Transformation

21

reduceByKey vs updateStateByKey

22

Working With Sliding Window

23

reduceByKeyAndWindow Transformation

24

reduceByWindow Transformation

25

countByWindow Transformation

26

Quiz

27

Assignment

28

Assignment Solution

1

What Is Structured Streaming

2

Requirement Of Structured Streaming

3

Limitations Of Spark Streaming

4

Benefits Of Spark Structured Streaming

5

Practical – Wordcount Example On Structured Streaming

6

Dynamically Setting The Shuffle Partitions

7

Data Stream Writer Output Modes

8

Datastream Output Modes – append, update & complete

9

Spark Streaming Graceful Shutdown

10

How Does Spark Streaming Code Execute Internally

11

How a Job Is Converted to Micro-batches

12

Trigger Point For Micro Batches

13

Types of Triggers – unspecified, time interval, one time, continuous

14

Types of Data Sources – Socket Source, Rate Source, File Source, Kafka Source

15

Limitations of socket source

16

Practical on File Data Source

17

Types of Spark Streaming Output Data Options

18

Fault Tolerance and Exactly Once Guarantee

19

Understanding Checkpoint Location

20

Stateful vs Stateless Transformations

21

Managed Stateful Operations vs UnManaged Stateful Operations

22

Types of Aggregations – Continuous Aggregations vs Time Bound Aggregations

23

Window Transformations

24

updateStateByKey, reduceByKeyAndWindow, reduceByWindow, countByWindow

25

Types of windows – Tumbling Time Window, Sliding Time Window

26

Streaming Joins

27

Streaming Dataframe to static dataframe

28

Streaming Dataframe With Another Streaming Dataframes

29

Quiz

30

Assignment

31

Assignment Solution

1

Introduction To Kafka

2

Kafka Architecture

3

Kafka Key Concepts/Fundamentals

4

Overview Of Zookeeper And Its Role In Kafka Cluster

5

Cluster, Nodes, Brokers, Topics

6

Consumer, Producers, Logs, Partitions

7

Concept Of Consumer Groups

8

Leader & Follower Partition

9

Installing One Node Kafka Cluster On Local

10

Installing Multi Broker Kafka Cluster On Local

11

Command Line Producer And Consumer

12

Replication Concept For Fault Tolerance

13

How Data Is Stored In Brokers

14

Log Segments, Message Offsets, Message Index

15

ISR List / Minimum ISR

16

Committed Vs Uncommitted Messages

17

Writing A Kafka Producer In Java

18

Writing A Kafka Consumer In Java

19

Achieving Exactly Once Semantics

20

Integrating Kafka With Spark Structured Streaming

21

Quiz

22

Assignment

23

Assignment Solution

1

AWS EMR (Elastic MapReduce)

2

What is a VM (Virtual Machine)

3

On-Premise vs Cloud Setup

4

Major Vendors of Hadoop Distribution

5

Why Cloud & Big Data on Cloud

6

Major Cloud Providers of Bigdata

7

What is EMR

8

HDFS vs S3

9

What Is S3

10

Important Instances in AWS

11

Kinds of Nodes in Cluster

12

Transient vs Long Running Cluster

13

Running Spark Code on EMR

14

How to Track Your Job

15

Copy File From S3 to Local

16

Zeppelin Notebook

17

Types of EC2 Instances

18

How to Create a VM

19

What is a Keypair

20

Elastic IP

21

AWS Storage, Networking & CLI

22

Instance Store

23

S3 & EBS

24

Public IP Vs Private IP

25

Network Switches

26

Security Group

27

AWS Command Line Interface

28

Launch An EMR Cluster Using Advanced Options

1

What is Athena

2

When do we require Athena

3

What problem Athena Solves

4

How Athena Works

5

Athena Practical Demonstration

6

How to create a normal table manually on CSV data residing in S3

7

How to minimize data scanning in Athena

8

How to create a partitioned table on a Parquet file

9

Inferring Schema automatically using AWS Glue

10

Glue Catalog

11

Quiz

12

Assignment

13

Assignment Solution

1

One end-to-end pipeline PROJECT involving all Major components like Sqoop, HDFS, Hive, HBase, Spark… etc.

2

Interview Preparation Tips:

3

Resume Building

4

15+ Mock Interview Recordings

5

Mock Interview

6

Interview Questions

7

How to Handle Managerial Round Qs

Introduction to Big Data

MapReduce - Distributed Computing Framework

Apache Sqoop - Data Ingestion to Hadoop

Apache Hive

Apache Hive Advanced

NoSQL Databases - HBase

NoSQL Databases - Cassandra Overview

Learning Scala - A Guide to Functional Programming

Apache Spark - General Purpose Cluster Computing Framework

Apache Spark Introduction

Apache Spark - Advanced

Apache Spark - Structured API Part-1

Apache Spark - Structured API Part-2

Apache Spark - Optimization Part-1

Apache Spark - Optimization Part-2

Apache Spark - Streaming

Apache Spark - Streaming Part-2

Apache Kafka - Distributed Event Streaming Platform

Big Data on Cloud

AWS Athena:

Final Project