Big Data with Spark Scala ProDegree

Introduction to Big Data
1
Introduction to Big Data
2
Big Data System Requirements
3
Monolithic vs Distributed System
4
Distributed System Architecture
5
What is Hadoop And Evolution of Hadoop
6
Core Components of Hadoop
7
HDFS Architecture:
8
What is Node And What is Cluster
9
Data Block & Block Size
10
Slave Node, Master Node, Data Node & Name Node
11
Metadata And Replication Factor
12
Heart Beat & Fault Tolerance
13
Handling Namenode Failure
14
What is SPOF
15
FSimage & Edit Logs
16
Secondary Namenode
17
Name Node Recovery
18
Check Pointing
19
Understanding Replication Factor
20
What is Rack And Rack Failure
21
Rack Awareness Mechanism
22
Block Report
23
Namenode High Availability
24
Quorum Journal Manager & Quorum Journal Node
25
Understanding Linux File System
26
List & Parameters of List Command
27
Touch, Mkdir, Rmdir & Other Linux Commands
28
HDFS Commands:
29
List Files & Directories
30
How HDFS Commands Work
31
‘ls’ Command With Various Parameters
32
Create, Remove File/Directory
33
Copy & Get Files/Folders From Local to HDFS & Vice Versa
34
Move Files/Folders From HDFS to HDFS
35
Change Replication Factor Dynamically
36
View File Metadata Information
MapReduce - Distributed Computing Framework
1
Introduction to MapReduce
2
Stages in MapReduce
3
What is Key-Value
4
What is Map & What is Reduce
5
Example to Understand Map & Reduce
6
Word Count Example in MapReduce
7
Record Reader
8
Mapper Phase
9
Reducer Phase
10
MapReduce Shuffle & Sort
11
Inside Map & Reduce Phase
12
Wordcount Example in MapReduce
13
Typical MapReduce Flow
14
Blocks in MapReduce
15
Default Number of Mappers & Reducers
16
Understanding Shuffle & Sort
17
Example: Calculating Max Temperature in a Day
18
Realtime Use Case: Google Web Search
19
MapReduce Programming
20
MR Code Explanation
21
How to Write Map Reduce Code
22
Mapper Code
23
Reducer Code
24
Main Code
25
Finding the Frequency of Each Word in a File
26
MapReduce Jars
27
MapReduce Practical Sessions
28
Word Count Program – Practical Session 1
29
Jar Creation & Execution – Practical Session 2:
30
How to Create a Jar
31
How to Execute the Jar
32
How to Track a Job
33
How to Track All Previous Jobs
34
MR Program Variations – Practical Session 3:
35
How to Change Number of Reducers
Apache Hive
1
Hive Overview:
2
Transactional System and Analytical System
3
Examples of Transactional Systems
4
Examples of Analytical Systems
5
What is Hive
6
Hive Query Language (HQL)
7
Understanding Hive Table
8
Introduction to Hive Metadata
9
Why Hive over traditional databases
10
Transactional and Analytical Processing
11
What is Data Warehouse
12
Hive Architecture
13
Hive on top of Hadoop
14
How Hive Works
15
Transactional vs Analytical Processing
16
Data Warehouse Concept
17
The Hive Metastore
18
Hive vs RDBMS
19
HQL vs SQL
20
Hive Subqueries Views & Index
21
Transactional and Analytical Processing
22
What is Data Warehouse
23
Hive Architecture
24
Hive on Hadoop
25
Hive Metastore
26
Hive vs. RDBMS
27
Hive Complex Data Types
28
Hive Array, Map & Struct
29
Hive Built-in Functions
30
Hive UDF, UDAF & UDTF
31
Hive Lateral Views
32
Hive Subqueries
33
Hive Views
34
Hive Normalization vs Denormalization
Learning Scala - A Guide to Functional Programming
1
Why Scala
2
Where to Run Scala Code
3
Scala Code Using IDE
4
Scala Basics
5
Var vs val
6
Type inference
7
Data types in Scala
8
String Interpolation
9
String Comparison
10
Flow control: If else
11
Match Case
12
For Loop
13
While loop
14
Scala Functional Programming
15
How to define a function
16
Higher order function
17
Anonymous function
18
Scala Collections
19
Array
20
List
21
Tuple
22
Range
23
Set
24
Map
25
Scala Functional Programming:
26
Why Scala
27
Modes of writing Scala code
28
What is functional programming
29
What is a function
30
What is a pure function?
31
First class function
32
Higher order function
33
Anonymous function
34
Immutability
35
Loop
36
Recursion
37
Tail recursion
38
Statement vs Expression
39
Closure
40
Scala type system
41
Scala operators
42
Anonymous function
43
Placeholder syntax
44
Partially applied functions
45
Function currying
Apache Spark - General Purpose Cluster Computing Framework
1
What is App class in Scala
2
Default args, named args & variable args
3
Difference between Nil, null, None & Nothing
4
What is option in Scala
5
What is unit in Scala
6
Dealing with nulls in Scala
7
What is yield
8
What is vector
9
Scala if guards & pattern guards
10
What is “for comprehensions”
11
Difference between “==” in Java and Scala
12
Difference between strict val vs lazy val
13
What are default packages in Scala
14
What is Scala apply method
15
What is a diamond problem in Scala
16
What is a trait
17
Why Scala is a top choice for big data
18
What is Apache Spark
Apache Spark Introduction
1
What is Apache Spark
2
Understanding Spark cluster
3
Is Spark a replacement to Hadoop
4
Why Spark is faster than MapReduce
5
How data is stored in Spark
6
What is RDD
7
What is DAG
8
RDD Lineage
9
Resiliency
10
Immutability
11
Transformation & Action
12
Lazy Evaluation
13
Word count program in Spark
14
Word count program in PySpark
15
Word count problem real-time example
Apache Spark - Advanced
1
Spark Real-Time Example
2
Broadcast Variable
3
Accumulators
4
How Spark Executes Program on the Cluster
5
Spark Driver and Executors
6
Client Mode, Cluster Mode and Local Mode: Analyzing Log Messages – Hands on
7
Narrow vs Wide Transformations
8
Stages in Spark
9
Difference Between reduceByKey & reduce
10
Difference Between groupByKey & reduceByKey
11
Pair RDD
12
Pair RDD vs Map
13
Understanding Default Parallelism
14
Difference Between repartition & coalesce
15
When to Increase/Decrease Partitions
16
Spark on YARN Architecture
17
YARN – Yet Another Resource Negotiator
18
Application Master
19
Containers
Apache Spark - Structured API Part-1
1
Cache vs Persist
2
Spark Storage Levels
3
Difference Between DAG & Lineage
4
How to Submit a Spark Job
5
Real-time example – Finding top movies based on ratings
6
Spark Ecosystem
7
Map vs Map Partitions
8
Introduction to Spark Structured API
9
Spark DataFrame
10
Understanding SparkSession
11
SparkSession vs SparkContext
12
Dataframe with Various Transformations
13
RDD vs DataFrame vs Datasets
14
Challenges with DataFrame
15
Spark Dataset API
16
Difference Between DataFrame and Dataset
17
Benefits of Dataset
18
Creating Dataframe/Datasets from Various File Formats
19
Read Modes & Schema
20
Ways to Define the Schema
21
Defining an Explicit Schema
Apache Spark - Structured API Part-2
1
Writing Output to Sink (spark.write)
2
Spark File Layout
3
Benefits of Repartitions
4
partitionBy & bucketBy
5
Saving file in Various file format
6
Introduction to SparkSql
7
Storing Data in Persistent Manner
8
Handling Spark Metadata
9
Low & High level Transformations
10
Referring to a Column in Dataframe/Dataset
11
Column String
12
Column Object
13
Column Expression
14
Spark UDF using Structured API
15
Adding Column in Dataframe
16
Dataframe to Dataset Using Case Class.
17
Dataset to DataFrame Conversion
18
Spark Catalog
19
Registering UDF with Driver
20
Transformations Hands on Examples
21
Aggregate Transformations
22
Simple Aggregations
23
Grouping Aggregations
24
Window Aggregations
25
Joins on DataFrame
26
Simple Join (Shuffle Sort Merge Join)
27
Broadcast Join
28
Dealing With Ambiguous Column Names
29
Dealing With Nulls
30
Internals of Join Operations
31
When to Use Simple Join vs. Broadcast Join
32
Grouping Aggregation Real-time Example
33
Inferring Data in SparkSQL
34
Quiz
35
Assignment
36
Assignment Solution
Apache Spark - Optimization Part-1
1
Level of Optimizations
2
Resource level optimizations
3
Application level optimizations
4
Cluster level optimizations
5
How to Calculate Number of Executors
6
Thin Executor
7
Fat Executor
8
How to Calculate Number of Executors
9
How to Calculate Memory Allocation
10
How to Calculate Number of Cores
11
Heap Memory
12
Off-Heap Memory
13
Hands on With Real-time cluster
14
Understanding Cluster Configurations
15
Realtime Example: Moving Data to HDFS using an Edge node and working around it in a realtime cluster
16
Static Resource allocation
17
Dynamic Resource allocation
18
Understanding Memory Usage in Spark
19
Execution Memory
20
Storage Memory
21
Practical Demonstration: Cache & Persist
22
Java Serializer vs Kryo Serializer
23
Quiz
24
Assignment
25
Assignment Solution
Apache Spark - Optimization Part-2
1
Broadcast Join Practical Demonstrations
2
Broadcast Join Using RDD
3
When to Use Broadcast Join
4
Broadcast Join Using Dataframe
5
Visualizing Broadcast Join with Structured API
6
Practical Demo on Repartition vs Coalesce
7
Client Mode vs Cluster Mode When using Spark submit
8
Spark Join Optimizations
9
Spark Advanced Optimizations: Sort Aggregate vs Hash Aggregate
10
Spark Catalyst Optimizer
11
Quiz
12
Assignment
13
Assignment Solution
Apache Spark - Streaming
1
What is Real-time Processing
2
The Importance of Real-time Processing
3
Batch processing vs Real-time Stream Processing
4
Spark Streaming Data
5
Spark discretized stream or DStream
6
Batch & Batch Interval
7
Is Spark a real-time streaming engine
8
Stream Processing in Spark
9
Transformed DStream
10
Understanding Producer & Consumer
11
Practical on Real-time Processing
12
Stream Transformations
13
Stateless Transformations
14
Stateful Transformations
15
Window Operations
16
Batch Interval
17
Window Size
18
Sliding Interval
19
Practical on Stateless Transformation
20
Practical on Stateful Transformation
21
reduceByKey vs updateStateByKey
22
Working With Sliding Window
23
reduceByKeyAndWindow Transformation
24
reduceByWindow Transformation
25
countByWindow Transformation
26
Quiz
27
Assignment
28
Assignment Solution
Apache Spark - Streaming Part-2
1
What Is Structured Streaming
2
Requirement Of Structured Streaming
3
Limitations Of Spark Streaming
4
Benefits Of Spark Structured Streaming
5
Practical – Wordcount Example On Structured Streaming
6
Dynamically Setting The Shuffle Partitions
7
Data Stream Writer Output Modes
8
Datastream Output Modes – append, update & complete
9
Spark Streaming Graceful Shutdown
10
How Does Spark Streaming Code Execute Internally
11
How a Job Is Converted to Micro Batches
12
Trigger Point For Micro Batches
13
Types of Triggers – unspecified, time interval, one time, continuous
14
Types of Data Sources – Socket Source, Rate Source, File Source, Kafka Source
15
Limitations of socket source
16
Practical on File Data Source
17
Types of Spark Streaming Output Data Options
18
Fault Tolerance and Exactly Once Guarantee
19
Understanding Checkpoint Location
20
Stateful vs Stateless Transformations
21
Managed Stateful Operations vs UnManaged Stateful Operations
22
Types of Aggregations – Continuous Aggregations vs Time Bound Aggregations
23
Window Transformations
24
updateStateByKey, reduceByKeyAndWindow, reduceByWindow, countByWindow
25
Types of windows – Tumbling Time Window, Sliding Time Window
26
Streaming Joins
27
Streaming Dataframe to static dataframe
28
Streaming Dataframe With Another Streaming Dataframes
29
Quiz
30
Assignment
31
Assignment Solution
Apache Kafka - Distributed Event Streaming Platform
1
Introduction To Kafka
2
Kafka Architecture
3
Kafka Key Concepts/Fundamentals
4
Overview Of ZooKeeper And Its Role In Kafka Cluster
5
Cluster, Nodes, Brokers, Topics
6
Consumer, Producers, Logs, Partitions
7
Concept Of Consumer Groups
8
Leader & Follower Partition
9
Installing One Node Kafka Cluster On Local
10
Installing Multi Broker Kafka Cluster On Local
11
Command Line Producer And Consumer
12
Replication Concept For Fault Tolerance
13
How Data Is Stored In Brokers
14
Log Segments, Message Offsets, Message Index
15
ISR List / Minimum ISR
16
Committed Vs Uncommitted Messages
17
Writing A Kafka Producer In Java
18
Writing A Kafka Consumer In Java
19
Achieving Exactly Once Semantics
20
Integrating Kafka With Spark Structured Streaming.
21
Quiz
22
Assignment
23
Assignment Solution
Final Project
1
One end-to-end pipeline project involving all major components like Sqoop, HDFS, Hive, HBase, Spark… etc.
2
Interview Preparation Tips:
3
Resume Building
4
15+ Mock Interview Recordings
5
Mock Interview
6
Interview Questions
7
How to Handle Managerial Round Qs