Big Data with Spark Scala ProDegree

1

Introduction to Big Data

2

Big Data System Requirements

3

Monolithic vs Distributed System

4

Distributed System Architecture

5

What is Hadoop And Evolution of Hadoop

6

Core Components of Hadoop

7

HDFS Architecture:

8

What is Node And What is Cluster

9

Data Block & Block Size

10

Slave Node, Master Node, Data Node & Name Node

11

Metadata And Replication Factor

12

Heart Beat & Fault Tolerance

13

Handling Namenode Failure

14

What is SPOF

15

FSimage & Edit Logs

16

Secondary Namenode

17

Name Node Recovery

18

Check Pointing

19

Understanding Replication Factor

20

What is Rack And Rack Failure

21

Rack Awareness Mechanism

22

Block Report

23

Namenode High Availability

24

Quorum Journal Manager & Quorum Journal Node

25

Understanding Linux File System

26

List & Parameters of List Command

27

Touch, Mkdir, Rmdir & Other Linux Commands

28

HDFS Commands:

29

List Files & Directories

30

How HDFS Commands Work

31

‘ls’ Command With Various Parameters

32

Create, Remove File/Directory

33

Copy & Get Files/Folders From Local to HDFS & Vice Versa

34

Move Files/Folders From HDFS to HDFS

35

Change Replication Factor Dynamically

36

View File Metadata Information

1

Introduction to MapReduce

2

Stages in MapReduce

3

What is Key-Value

4

What is Map & What is Reduce

5

Example to Understand Map & Reduce

6

Word Count Example in MapReduce

7

Record Reader

8

Mapper Phase

9

Reducer Phase

10

MapReduce Shuffle & Sort

11

Inside Map & Reduce Phase

12

Wordcount Example in MapReduce

13

Typical MapReduce Flow

14

Blocks in MapReduce

15

Default Number of Mappers & Reducers

16

Understanding Shuffle & Sort

17

Example: Calculating Max Temperature in a Day

18

Realtime Use Case: Google Web Search

19

MapReduce Programming

20

MR Code Explanation

21

How to Write Map Reduce Code

22

Mapper Code

23

Reducer Code

24

Main Code

25

Finding the Frequency of Each Word in a File

26

MapReduce Jars

27

MapReduce Practical Sessions

28

Word Count Program – Practical Session 1

29

Jar Creation & Execution – Practical Session 2:

30

How to Create a Jar

31

How to Execute the Jar

32

How to Track a Job

33

How to Track All Previous Jobs

34

MR Program Variations – Practical Session 3:

35

How to Change Number of Reducers

1

Hive Overview:

2

Transactional System and Analytical System

3

Examples of Transactional Systems

4

Examples of Analytical Systems

5

What is Hive

6

Hive Query Language (HQL)

7

Understanding Hive Table

8

Introduction to Hive Metadata

9

Why Hive over traditional databases

10

Transactional and Analytical Processing

11

What is Data Warehouse

12

Hive Architecture

13

Hive on top of Hadoop

14

How Hive Works

15

Transactional vs Analytical Processing

16

Data Warehouse Concept

17

The Hive Metastore

18

Hive vs RDBMS

19

HQL vs SQL

20

Hive Subqueries, Views & Index

21

Transactional and Analytical Processing

22

What is Data Warehouse

23

Hive Architecture

24

Hive on Hadoop

25

Hive Metastore

26

Hive vs. RDBMS

27

Hive Complex Data Types

28

Hive Array, Map & Struct

29

Hive Built-in Functions

30

Hive UDF, UDAF & UDTF

31

Hive Lateral Views

32

Hive Subqueries

33

Hive Views

34

Hive Normalization vs Denormalization

1

Why Scala

2

Where to Run Scala Code

3

Scala Code Using IDE

4

Scala Basics

5

var vs val

6

Type inference

7

Data types in Scala

8

String Interpolation

9

String Comparison

10

Flow control: If else

11

Match Case

12

For Loop

13

While loop

14

Scala Functional Programming

15

How to define a function

16

Higher order function

17

Anonymous function

18

Scala Collections

19

Array

20

List

21

Tuple

22

Range

23

Set

24

Map

25

Scala Functional Programming:

26

Why Scala

27

Modes of writing Scala code

28

What is functional programming

29

What is a function

30

What is a pure function?

31

First class function

32

Higher order function

33

Anonymous function

34

Immutability

35

Loop

36

Recursion

37

Tail recursion

38

Statement vs Expression

39

Closure

40

Scala type system

41

Scala operators

42

Anonymous function

43

Placeholder syntax

44

Partially applied functions

45

Function currying

1

What is App class in Scala

2

Default args, named args & variable args

3

Difference between Nil, null, None & Nothing

4

What is Option in Scala

5

What is Unit in Scala

6

Dealing with nulls in Scala

7

What is yield

8

What is Vector

9

Scala if guards & pattern guards

10

What is “for comprehensions”

11

Difference between “==” in Java and Scala

12

Difference between strict val vs lazy val

13

What are default packages in Scala

14

What is Scala apply method

15

What is the diamond problem in Scala

16

What is a trait

17

Why Scala is the top choice for big data

18

What is Apache Spark

1

What is Apache Spark

2

Understanding Spark cluster

3

Is Spark a replacement for Hadoop

4

Why Spark is faster than MapReduce

5

How data is stored in Spark

6

What is RDD

7

What is DAG

8

RDD Lineage

9

Resiliency

10

Immutability

11

Transformation & Action

12

Lazy Evaluation

13

Word count program in Spark

14

Word count program in PySpark

15

Word count problem real-time example

1

Spark Real-Time Example

2

Broadcast Variable

3

Accumulators

4

How Spark Executes Program on the Cluster

5

Spark Driver and Executors

6

Client Mode, Cluster Mode and Local Mode: Analyzing Log Messages – Hands-on

7

Narrow vs Wide Transformations

8

Stages in Spark

9

Difference Between reduceByKey & reduce

10

Difference Between groupByKey & reduceByKey

11

Pair RDD

12

Pair RDD vs Map

13

Understanding Default Parallelism

14

Difference Between repartition & coalesce

15

When to Increase/Decrease Partitions

16

Spark on YARN Architecture

17

YARN – Yet Another Resource Negotiator

18

Application Master

19

Containers

1

Cache vs Persist

2

Spark Storage Levels

3

Difference Between DAG & Lineage

4

How to Submit a Spark Job

5

Real-time example – Finding top movies based on ratings

6

Spark Ecosystem

7

map vs mapPartitions

8

Introduction to Spark Structured API

9

Spark DataFrame

10

Understanding SparkSession

11

SparkSession vs SparkContext

12

DataFrame with Various Transformations

13

RDD vs DataFrame vs Datasets

14

Challenges with DataFrame

15

Spark Dataset API

16

Difference Between DataFrame and Dataset

17

Benefits of Dataset

18

Creating DataFrames/Datasets from Various File Formats

19

Read Modes & Schema

20

Ways to Define the Schema

21

Defining an Explicit Schema

1

Writing Output to Sink (DataFrame.write)

2

Spark File Layout

3

Benefits of Repartitions

4

partitionBy & bucketBy

5

Saving Files in Various File Formats

6

Introduction to Spark SQL

7

Storing Data in a Persistent Manner

8

Handling Spark Metadata

9

Low- & High-Level Transformations

10

Referring to a Column in a DataFrame/Dataset

11

Column String

12

Column Object

13

Column Expression

14

Spark UDF using Structured API

15

Adding a Column to a DataFrame

16

DataFrame to Dataset Using a Case Class

17

Dataset to DataFrame Conversion

18

Spark Catalog

19

Registering a UDF with the Driver

20

Transformations: Hands-on Examples

21

Aggregate Transformations

22

Simple Aggregations

23

Grouping Aggregations

24

Window Aggregations

25

Joins on DataFrame

26

Simple Join (Shuffle Sort Merge Join)

27

Broadcast Join

28

Dealing With Ambiguous Column Names

29

Dealing With Nulls

30

Internals of Join Operations

31

When to Use Simple Join vs Broadcast Join

32

Grouping Aggregation Real-time Example

33

Inferring Data in Spark SQL

34

Quiz

35

Assignment

36

Assignment Solution

1

Levels of Optimization

2

Resource level optimizations

3

Application level optimizations

4

Cluster level optimizations

5

How to Calculate the Number of Executors

6

Thin Executor

7

Fat Executor

8

How to Calculate the Number of Executors

9

How to Calculate Memory Allocation

10

How to Calculate the Number of Cores

11

Heap Memory

12

Off-Heap Memory

13

Hands-on With a Real-time Cluster

14

Understanding Cluster Configurations

15

Real-time Example: Moving Data to HDFS Using an Edge Node and Working Around It in a Real-time Cluster

16

Static Resource Allocation

17

Dynamic Resource Allocation

18

Understanding Memory Usage in Spark

19

Execution Memory

20

Storage Memory

21

Practical Demonstration: Cache & Persist

22

Java Serializer vs Kryo Serializer

23

Quiz

24

Assignment

25

Assignment Solution

1

Broadcast Join Practical Demonstrations

2

Broadcast Join Using RDD

3

When to Use Broadcast Join

4

Broadcast Join Using DataFrame

5

Visualizing Broadcast Join with Structured API

6

Practical Demo on Repartition vs Coalesce

7

Client Mode vs Cluster Mode When Using spark-submit

8

Spark Join Optimizations

9

Spark Advanced Optimizations: Sort Aggregate vs Hash Aggregate

10

Spark Catalyst Optimizer

11

Quiz

12

Assignment

13

Assignment Solution

1

What is Real-time Processing

2

The Importance of Real-time Processing

3

Batch processing vs Real-time Stream Processing

4

Spark Streaming Data

5

Spark Discretized Stream (DStream)

6

Batch & Batch Interval

7

Is Spark a Real-time Streaming Engine

8

Stream Processing in Spark

9

Transformed DStream

10

Understanding Producer & Consumer

11

Practical on Real-time Processing

12

Stream Transformations

13

Stateless Transformations

14

Stateful Transformations

15

Window Operations

16

Batch Interval

17

Window Size

18

Sliding Interval

19

Practical on Stateless Transformation

20

Practical on Stateful Transformation

21

reduceByKey vs updateStateByKey

22

Working With Sliding Window

23

reduceByKeyAndWindow Transformation

24

reduceByWindow Transformation

25

countByWindow Transformation

26

Quiz

27

Assignment

28

Assignment Solution

1

What Is Structured Streaming

2

Requirement Of Structured Streaming

3

Limitations Of Spark Streaming

4

Benefits Of Spark Structured Streaming

5

Practical – Wordcount Example On Structured Streaming

6

Dynamically Setting The Shuffle Partitions

7

Data Stream Writer Output Modes

8

Datastream Output Modes – append, update & complete

9

Spark Streaming Graceful Shutdown

10

How Spark Streaming Code Executes Internally

11

How a Job Is Converted to Micro-batches

12

Trigger Point For Micro-batches

13

Types of Triggers – unspecified, time interval, one time, continuous

14

Types of Data Sources – Socket Source, Rate Source, File Source, Kafka Source

15

Limitations of socket source

16

Practical on File Data Source

17

Types of Spark Streaming Output Data Options

18

Fault Tolerance and Exactly Once Guarantee

19

Understanding Checkpoint Location

20

Stateful vs Stateless Transformations

21

Managed Stateful Operations vs Unmanaged Stateful Operations

22

Types of Aggregations – Continuous Aggregations vs Time Bound Aggregations

23

Window Transformations

24

updateStateByKey, reduceByKeyAndWindow, reduceByWindow, countByWindow

25

Types of windows – Tumbling Time Window, Sliding Time Window

26

Streaming Joins

27

Streaming DataFrame to Static DataFrame

28

Streaming DataFrame With Another Streaming DataFrame

29

Quiz

30

Assignment

31

Assignment Solution

1

Introduction To Kafka

2

Kafka Architecture

3

Kafka Key Concepts/Fundamentals

4

Overview Of ZooKeeper And Its Role In Kafka Cluster

5

Cluster, Nodes, Brokers, Topics

6

Consumer, Producers, Logs, Partitions

7

Concept Of Consumer Groups

8

Leader & Follower Partition

9

Installing One Node Kafka Cluster On Local

10

Installing Multi Broker Kafka Cluster On Local

11

Command Line Producer And Consumer

12

Replication Concept For Fault Tolerance

13

How Data Is Stored In Brokers

14

Log Segments, Message Offsets, Message Index

15

ISR List / Minimum ISR

16

Committed Vs Uncommitted Messages

17

Writing A Kafka Producer In Java

18

Writing A Kafka Consumer In Java

19

Achieving Exactly Once Semantics

20

Integrating Kafka With Spark Structured Streaming

21

Quiz

22

Assignment

23

Assignment Solution

1

One end-to-end pipeline project involving all major components like Sqoop, HDFS, Hive, HBase, Spark… etc.

2

Interview Preparation Tips:

3

Resume Building

4

15+ Mock Interview Recordings

5

Mock Interview

6

Interview Questions

7

How to Handle Managerial Round Questions

Introduction to Big Data

MapReduce - Distributed Computing Framework

Apache Hive

Learning Scala - A Guide to Functional Programming

Apache Spark - General Purpose Cluster Computing Framework

Apache Spark Introduction

Apache Spark - Advanced

Apache Spark - Structured API Part-1

Apache Spark - Structured API Part-2

Apache Spark - Optimization Part-1

Apache Spark - Optimization Part-2

Apache Spark - Streaming

Apache Spark - Streaming Part-2

Apache Kafka - Distributed Event Streaming Platform

Final Project