Big Data Engineering Full Course Part 1 | 17 Hours

Updated: February 26, 2025

The Data Tech


Summary

This comprehensive YouTube video introduces beginners and intermediate learners to the world of Big Data and Data Engineering. It covers essential topics such as the misconceptions surrounding Big Data, the intricacies of Hadoop and Apache projects, and the significance of understanding the various data layers. The video also delves into practical aspects such as setting up Hadoop clusters, executing MapReduce jobs, configuring Hive for data processing, and exploring the basics of Spark integration with Hadoop. Overall, it provides a well-rounded guide for anyone aspiring to build a career in Big Data and Data Engineering.


Introduction to Big Data Engineering

Introduction to Big Data, explanation of the course content, and the importance of data engineering for beginners and intermediate learners.

Introduction to Data and Hadoop

Explanation of Hadoop as a solution in the Big Data domain, the difference between Hadoop and Big Data, and common misconceptions about Big Data.

Volume Problem in Big Data

Clarification on the misconception that volume is the only problem in Big Data, discussion on other key problems like quality, velocity, and variety of data.

Discussion on Data Volume in Big Data

Explanation of the misunderstanding related to data volume in Big Data, comparison of data volumes in different projects, and addressing common interview questions regarding data volume.

Discussion on Big Data Technologies

Overview of the various technologies in the Big Data domain, discussion on Hadoop and Apache projects, and the importance of understanding different data layers in Big Data.

Open Source and Apache Software Foundation

Explanation of open source concept, donation of Hadoop project to Apache Software Foundation, and the significance of having projects in Apache for career opportunities.

Learning Path for Data Engineering

Customized roadmap for learning Data Engineering, including steps for learning Linux, SQL, programming languages, cloud computing, and data technology stacks.

Data Engineering Career Preparation

Guidance on preparing for a career in Data Engineering, covering topics such as project challenges, performance optimization, resume preparation, and job search strategies.

Introduction to Hadoop Framework

Explaining the components of the Hadoop framework and answering frequently asked questions about HDFS (Hadoop Distributed File System) and file systems in general.

File System Concepts

Defining what a file system is, giving examples of distributed file systems such as S3, the Google File System, and Facebook's file system, and explaining the concept of blocks in file systems.

Client-Server Architecture

Describing client and server relationship in a distributed environment, differentiating between Master-Slave and Peer-to-Peer cluster types, and highlighting the importance of distributed computing.

Types of Distributed Systems

Discussing Master-Slave and Peer-to-Peer cluster infrastructures, focusing on the communication and fault tolerance mechanisms in a distributed cluster environment.

Hadoop Architecture Evolution

Exploring the architectural differences between Hadoop versions, emphasizing the role of daemon processes, and detailing the cluster and node concepts in a Hadoop environment.

Cluster Configuration and Role Assignment

Detailing the configuration of clusters, assigning roles to nodes, selecting hardware configurations, and explaining data distribution on nodes in a Hadoop cluster.

Read-Write Architecture in HDFS

Explaining the read-write architecture in HDFS (Hadoop Distributed File System), describing the process of data storage, replication, and failover mechanisms in a distributed environment.

Explanation of Metadata Safeguarding in Hadoop Version 2

In Hadoop version 2, a fix for metadata safeguarding was introduced to prevent data loss in case of hard disk crashes.

Client API's Interaction in Hadoop

The client API in Hadoop interacts with the data nodes for read and write operations, ensuring data replication for reliability.

Pipeline Process for Write vs. Read Operations

The pipeline process in Hadoop is utilized for write operations, while read operations involve reading only one copy of the data for efficiency.

Handling Write Request Failures

In case of write request failures, the pipeline process sends acknowledgments and redirects the write request to another node for successful data storage.

Explanation of High Availability in Hadoop Cluster

Hadoop cluster concepts include multiple daemons such as the namenode, job tracker, resource manager, and secondary namenode, which work together to provide high availability and fault tolerance.

Different Types of HDFS Clusters

HDFS clusters can be set up as single-node pseudo clusters for testing purposes or multi-node clusters for production environments, with each node serving a specific role for data storage and processing.

Overview of Hadoop Ecosystem Components

The Hadoop ecosystem consists of various components like Hive, Pig, Sqoop, Flume, Oozie, and HBase, each serving different data processing and management functions within the framework.

Hadoop Configuration

Explained the importance of configuration in a Hadoop setup, including specifying the daemons (data node, task tracker), setting up SSH for communication among nodes, and removing the need to enter passwords multiple times.

SSH Configuration

Detailed the process of configuring SSH for passwordless communication among nodes in a multi-node Hadoop setup, including generating SSH keys, appending public keys, and ensuring seamless communication without password prompts.

Formatting HDFS

Discussed the concept of formatting HDFS, creating the Hadoop directories, explaining the implications of errors in the process, and setting up configurations on the slave node.

File Upload in HDFS

Demonstrated the process of uploading files to HDFS using command line tools, including creating directories, uploading files, understanding block size, replication, and file management in HDFS.
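
As a minimal sketch of the same steps, assuming a running cluster with the standard `hdfs` CLI on the PATH; the directory and file names below are hypothetical, not taken from the video:

```python
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand and raise if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Create a directory in HDFS and upload a local file into it.
hdfs("-mkdir", "-p", "/user/demo/input")          # hypothetical HDFS path
hdfs("-put", "sample.txt", "/user/demo/input/")   # hypothetical local file

# Inspect the upload: directory listing, then block and replication details.
hdfs("-ls", "/user/demo/input")
subprocess.run(["hdfs", "fsck", "/user/demo/input/sample.txt",
                "-files", "-blocks"], check=True)
```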

Space and File Count Quota in HDFS

Explained the significance of space and file count quotas in HDFS, setting quotas, managing disk space allocations, and resolving errors related to quotas for efficient data management in HDFS.
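
A hedged sketch of the quota commands discussed here, driven from Python; the directory and limits are made-up examples:

```python
import subprocess

def hdfs_admin(*args):
    subprocess.run(["hdfs", "dfsadmin", *args], check=True)

# Limit /user/demo to 1 GB of raw space and at most 1000 names (files + dirs).
hdfs_admin("-setSpaceQuota", "1g", "/user/demo")
hdfs_admin("-setQuota", "1000", "/user/demo")

# Check current quota usage; writes start failing once either limit is hit.
subprocess.run(["hdfs", "dfs", "-count", "-q", "-h", "/user/demo"], check=True)
```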

File Compaction in Hive

Elaborated on the concept of file compaction in Hive, merging files to reduce file counts, optimize space usage, and enhance performance in Hadoop ecosystem components like Hive.

MapReduce Introduction

Provided an introduction to MapReduce, discussing its role in Hadoop, significance in parallel processing, its relation to Spark, and the relevance of understanding MapReduce for job interviews and working with big data technologies.

MapReduce Execution

Explained the execution process of MapReduce jobs, the map and reduce phases, input-output data handling, data locality, and the transformational aspects of mapping and reducing tasks in the MapReduce framework.
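
The video develops the map and reduce phases with a Java word-count example; as a rough illustration of the same flow, here is a Hadoop Streaming style mapper and reducer sketched in Python, where each script reads lines from stdin and emits tab-separated key/value pairs:

```python
#!/usr/bin/env python3
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum the counts per word; input arrives sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```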

Reducer Operations

Explained the role and decision-making process of reducers in a MapReduce program. Discusses scenarios and options for the number of reducers based on the data processing requirements.

Mapper and Reducer Configuration

Describes the configuration and selection of mappers and reducers in a MapReduce job. Differentiates and explains the functionality of mapper and reducer classes in the program.

Job Tracker and Name Node Interaction

Details the interaction between the job tracker and name node in a MapReduce job. Covers the flow of information retrieval and task initiation based on block distribution.

Data Node and Task Tracker Operation

Explains the functionality of data nodes and task trackers in a MapReduce job. Describes the data transmission and task execution processes between these components.

Reducer Task Execution

Describes the execution flow and completion process of reducer tasks in a MapReduce program. Explores the data flow from local file systems to HDFS after reducer completion.

Handling Task Failures

Discusses the procedures for handling task failures in MapReduce jobs. Explains the process of task reassignment and restarting in case of failures.

Parallelism and Block Distribution

Explores the concept of parallelism and block distribution in MapReduce jobs. Describes the challenges and considerations for achieving parallelism within nodes.

Input and Output Formats

Explains the importance of input and output formats in MapReduce jobs. Describes the key-value pair requirements for mapper and reducer inputs and outputs.

YARN Architecture and Spark Works

Explanation of the YARN architecture and how Spark works within it. Details on the application master, node manager, resource manager, and the flow of execution in both YARN and Spark.

Setting Up in Eclipse IDE

Steps to set up a project in Eclipse IDE for Hadoop, including creating a project, managing libraries, and resolving dependencies.

Copying JAR Files to Linux Server

Instructions on copying JAR files from Eclipse to a Linux server where Hadoop is running, including creating a lib folder and pasting the jar files.

Configuring Java Build Path

Details on configuring the Java build path in Eclipse by adding jar files to the project's build path to resolve dependencies.

Running a MapReduce Job

Demonstration of running a MapReduce job on a Linux server with Hadoop, including uploading input files to HDFS, executing the job, and checking the output in HDFS.
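
A short, hypothetical sketch of the same invocation from Python; the jar name, driver class, and HDFS paths are placeholders, not the ones used in the video:

```python
import subprocess

# Run the job jar that was copied from Eclipse to the Linux server.
subprocess.run(["hadoop", "jar", "wordcount.jar", "WordCountDriver",
                "/user/demo/wc_input", "/user/demo/wc_output"], check=True)

# Read the reducer output that the job wrote back to HDFS.
subprocess.run(["hdfs", "dfs", "-cat",
                "/user/demo/wc_output/part-r-00000"], check=True)
```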

Understanding Input Split in MapReduce

Explains the concept of input split in MapReduce and how it divides data into logical blocks for processing across nodes, ensuring efficient processing and data retrieval.

Speculative Execution in MapReduce

Description of speculative execution in MapReduce, where duplicate copies of slow or stuck tasks are launched alongside the originals; whichever copy finishes first is used and the other is killed, helping the job complete on time.

Introduction to Hive

Overview of Hive as a query language for Hadoop, its architecture, and how it abstracts MapReduce to allow SQL-like queries for data processing and transformation.

Understanding Hive Metastore

Explanation of Hive metastore, where table metadata is stored in an RDBMS separate from data storage, detailing the use of embedded and remote metastores in Hive installations.

Installing Hive with a MySQL Metastore

Explains the installation process of Hive with a MySQL metastore, using commands like 'sudo apt-get install' and configuring the MySQL connector JAR for metastore storage.

Running Hive Queries

Discusses how Hive queries execute on the MapReduce or Spark execution engine rather than in MySQL or Oracle; the RDBMS only holds the metastore, so there is no query-time dependency between Hive and MySQL.

Storing Metadata in MySQL

Describes the process of creating a metadata store in MySQL for Hive, including downloading the MySQL connector JAR, configuring hive-site.xml, and initializing the schema in MySQL.

Creating a Metastore in MySQL

Demonstrates creating a metastore database in MySQL and initializing the schema using Hive and MySQL commands so that the metadata tables are stored there.
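
A minimal sketch of those two steps, assuming the MySQL connector JAR is already in `$HIVE_HOME/lib` and hive-site.xml points at the MySQL instance; the database name and credentials are hypothetical:

```python
import subprocess

# Create the metastore database in MySQL (prompts for the MySQL password).
subprocess.run(["mysql", "-u", "root", "-p",
                "-e", "CREATE DATABASE IF NOT EXISTS metastore;"], check=True)

# Let Hive's schematool create the metadata tables inside that database.
subprocess.run(["schematool", "-dbType", "mysql", "-initSchema"], check=True)
```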

Loading Data in Hive Tables

Explains loading data into Hive tables from the local file system and from HDFS, including commands like 'LOAD DATA LOCAL INPATH' and 'INSERT INTO' for loading data into Hive tables.
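
The video runs these statements in the Hive CLI; as a stand-in, the same statements issued through a Hive-enabled Spark session, with hypothetical table and file names:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-load-demo")
         .enableHiveSupport().getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING) "
          "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")

# Load from the local file system (LOCAL keyword) versus from HDFS (no LOCAL).
spark.sql("LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees")

# Row-by-row insert, contrasted in the video with bulk loads.
spark.sql("INSERT INTO employees VALUES (101, 'Asha')")
```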

Running Spark Programs with Hive

Highlights the process of running a Spark program to load data directly into Hive tables, emphasizing the use of bulk 'INSERT INTO'-style loads rather than row-by-row inserts.
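
A hypothetical sketch of a Spark job bulk-loading a DataFrame into a Hive table; the table and column names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("spark-to-hive")
         .enableHiveSupport().getOrCreate())

df = spark.createDataFrame([(103, "Ravi"), (104, "Meena")], ["id", "name"])

# insertInto appends the rows into an existing Hive table by column position;
# saveAsTable would instead create a new managed table from the DataFrame.
df.write.insertInto("employees")
```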

Understanding Partition in Hive

Explores the concept of partition in Hive, comparing it to logical partitions on a laptop's hard drive and the significance of partitioning data for improved performance.
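
A hedged sketch of a partitioned table created through Spark SQL; the table, column, and partition values are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Each distinct value of the partition column becomes its own HDFS
# subdirectory, so filters on that column skip whole directories.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_part (order_id INT, amount DOUBLE)
    PARTITIONED BY (country STRING)
""")
spark.sql("INSERT INTO sales_part PARTITION (country='IN') VALUES (1, 250.0)")
spark.sql("SELECT * FROM sales_part WHERE country = 'IN'").show()
```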

Implementing Bucketing in Hive

Introduces bucketing in Hive tables, detailing the use of buckets for optimizing performance through data distribution and explaining the concept of hash partitioning for bucket selection.

Understanding Modulus in Bucketing

Explains the concept of using modulus to determine bucket positions in HDFS based on remainder values. Illustrates how to assign records to buckets based on the remainder obtained from division.
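
A toy Python illustration of the remainder rule (Hive's real bucketing uses its own hash function, so this is only the arithmetic idea, not Hive's exact placement):

```python
NUM_BUCKETS = 4  # hypothetical bucket count

def bucket_for(key: int) -> int:
    """Bucket index = key mod number_of_buckets, i.e. the division remainder."""
    return key % NUM_BUCKETS

for emp_id in [101, 102, 103, 104, 105]:
    # 101 -> 1, 102 -> 2, 103 -> 3, 104 -> 0, 105 -> 1
    print(emp_id, "-> bucket", bucket_for(emp_id))
```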

Custom Partitioning

Discusses the importance of custom partitioning and the decision-making process behind choosing buckets. Emphasizes the significance of understanding bucket use cases for optimal data organization.

Bucket Use Case Explained

Explores the use case of bucketing in partitioning strategies. Provides insights into how bucketing aids in organizing and querying data efficiently, especially in scenarios where unique values pose challenges for partitioning.

Hive Bucket Table Creation

Walks through the process of creating bucketed tables in Hive and explains the significance of bucket counts. Demonstrates the creation of buckets and how data is distributed across them.

Storage Formats: ORC vs. Text

Contrasts the efficiency of ORC file format with text file format in Hive, highlighting the advantages of ORC in terms of compression rates and performance optimization. Illustrates the impact of storage format on data size and query execution time.
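
A minimal sketch of the comparison, creating the same schema in both formats through Spark SQL; the table names and rows are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS txn_text (id INT, amount DOUBLE) STORED AS TEXTFILE")
spark.sql("CREATE TABLE IF NOT EXISTS txn_orc  (id INT, amount DOUBLE) STORED AS ORC")

spark.sql("INSERT INTO txn_text VALUES (1, 250.0), (2, 300.0)")

# Copying the rows into the ORC table stores them compressed and columnar,
# which is why the ORC copy ends up much smaller and faster to scan.
spark.sql("INSERT INTO txn_orc SELECT * FROM txn_text")
```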

Hive Acid Tables

Introduces ACID (transactional) tables in Hive, detailing their behavior for insert, update, and delete operations. Discusses the setup required to enable ACID tables for better data management.

User Defined Functions in Hive

Explains the concept of user-defined functions (UDFs) in Hive, showcasing how to create and use custom functions to perform specific data transformations. Provides a step-by-step guide on designing and executing UDFs in Hive with Java programming.

Implementing Custom Functions

Demonstrates the process of implementing custom UDFs in Hive using Java programming. Shows how to write UDFs for data manipulation and concatenation operations, emphasizing the flexibility and functionality of custom functions in data processing workflows.
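
The video's UDF is written in Java and registered in Hive; as a loose analogue only, here is a PySpark UDF performing a similar concatenation and registered for use from SQL (names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def full_name(first: str, last: str) -> str:
    # Hypothetical transformation: join first and last name with a space.
    return f"{first} {last}"

spark.udf.register("full_name", full_name, StringType())

spark.createDataFrame([("Asha", "Rao")], ["first", "last"]) \
     .createOrReplaceTempView("people")
spark.sql("SELECT full_name(first, last) AS name FROM people").show()
```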

Introduction to Spark

Understanding the basics of Spark and its prerequisites such as Hadoop knowledge, programming languages (Java, Scala, Python), and core concepts of big data frameworks.

Components of Spark

Exploring the components of Spark including Spark SQL for structured APIs, Spark Streaming for real-time data processing, and Spark MLlib for machine learning activities.

Batch Processing vs Stream Processing

Distinguishing between batch processing and stream processing in data processing, with examples and use cases explained.

Spark Integration with Hadoop

Discussing how Spark integrates with Hadoop, the components involved, and the differences between Hadoop and Spark processing methods.

Integration with Big Data Technologies

Exploring how Spark can connect with various big data technologies and ETL tools outside the Hadoop ecosystem for seamless integration and data processing.

Installation and Setup of Spark

Guidance on installing and setting up Spark, including prerequisites like Java, understanding deployment modes (Standalone, YARN), and configuring environmental variables.

Spark API and Data Processing

Introduction to the three main Spark APIs (RDD, DataFrames, and DataSets), their characteristics, and their role in distributed and fault-tolerant data processing.
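
A minimal sketch contrasting the RDD and DataFrame APIs (Datasets are the typed API available in Scala and Java rather than Python); the sample data is made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([("IN", 3), ("US", 5)])             # low-level, no schema
df = spark.createDataFrame(rdd, ["country", "count"])    # structured, with a schema

print(rdd.collect())
df.show()
```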

Transformations and Actions in Spark

Explaining the concepts of transformations and actions in Spark, and how they shape the coding and execution process in Spark programs.

Lazy Evaluation in Spark

Understanding the concept of lazy evaluation in Spark, ensuring actions are called to trigger Spark processing efficiently.
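
A short sketch tying this and the previous section together: transformations only build the lineage, and nothing executes until an action is called (the data here is a made-up range of numbers):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(1, 11))

# Transformations: lazily recorded, not executed yet.
evens = nums.filter(lambda n: n % 2 == 0)
squared = evens.map(lambda n: n * n)

# Actions: trigger the actual computation on the cluster.
print(squared.collect())   # [4, 16, 36, 64, 100]
print(squared.count())     # 5
```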

Spark Testing and Development

Explaining the testing and development process in Spark using shells (Scala shell, Python shell) for quick code testing and debugging.

Starting Spark Service

Discusses starting the Spark service and the processes involved in bringing up a Spark cluster, including launching the master and worker processes, verifying them with the jps command, and understanding the different Java processes involved in Spark.

Cluster Mode and Port Numbers

Explains running programs in an IDE, building JAR files, and running in cluster mode, along with accessing the Spark Master UI at localhost on port 8080. The significance of the RPC and web port numbers for monitoring processes is also highlighted.

Spark Shell and Spark Context

An overview of the Spark shell and its UI, and the distinction between the Scala (spark-shell) and Python (pyspark) shells. The concept of the Spark Context and its role in initializing and executing programs in Spark is discussed.

Parallel Processing in Spark

In-depth details on parallel processing, mapping, reducing, and the key concept of grouping tasks in Spark. A comparison with Hadoop's MapReduce approach is made, emphasizing the differences in terminology and execution.
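
A hedged Spark equivalent of the earlier word-count MapReduce flow; the HDFS input and output paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("hdfs:///user/demo/wc_input")
            .flatMap(lambda line: line.split())      # "map" side: emit words
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))        # "reduce" side: sum per word

counts.saveAsTextFile("hdfs:///user/demo/wc_output_spark")
```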

Spark Job Execution Flow

The execution flow of Spark jobs is explained, starting from submitting a job, creating drivers and executors, resource allocation, data distribution, task execution, and the role of the cluster manager in managing resources efficiently.

Fault Tolerance in Spark

Details on fault tolerance in Spark, including scenarios of executor failures, actions taken by the Spark Master in case of failures, data replication strategies, and handling failures in a Spark Standalone architecture. The importance of avoiding single points of failure is emphasized.


FAQ

Q: What is the importance of data engineering for beginners and intermediate learners?

A: Data engineering is important for beginners and intermediate learners as it involves the processes of collecting, storing, and analyzing data efficiently to derive valuable insights and make informed decisions.

Q: What are some common misconceptions about Big Data?

A: Some common misconceptions about Big Data include assuming that volume is the only problem, when in reality, problems related to data quality, velocity, and variety are equally important in the Big Data domain.

Q: What is Hadoop and how does it differ from Big Data?

A: Hadoop is a solution in the Big Data domain that provides a framework for distributed storage and processing of large data sets. While Big Data encompasses a broader concept of dealing with large and complex data, Hadoop specifically addresses the infrastructure and tools required for handling such data.

Q: Explain the architecture of a Hadoop cluster.

A: A Hadoop cluster typically consists of master daemons such as the namenode, job tracker (resource manager in Hadoop version 2), and secondary namenode, plus worker nodes running datanodes and task trackers. These processes work together to ensure high availability and fault tolerance in data storage and processing operations.
