HDFS and MapReduce can be confusing at times. Let’s break down the entire process step-by-step with a concrete example.
The example I’m going to use is calculating the average movie rating per genre from a CSV file. First, we have to upload the CSV file into HDFS, and then we will run a MapReduce job to compute the average ratings.
We assume:
- Data format: genre,rating
- Goal: average rating per genre
- Environment: Hadoop (HDFS + YARN + MapReduce)
STEP 0 — Your local file (before Hadoop)
On your local machine you have a file:
genre_ratings.csv
Contents:
Drama,4.5
Comedy,3.0
Drama,5.0
Sci-Fi,4.8
Comedy,3.5
At this point:
- Hadoop is not involved
- This is a normal OS file
STEP 1 — Uploading the CSV into HDFS
You now copy the file into HDFS.
Command
hdfs dfs -mkdir -p /movies/genre_ratings
hdfs dfs -put genre_ratings.csv /movies/genre_ratings/
What happens internally (very important)
The client contacts the NameNode.
The NameNode:
- Checks permissions
- Chooses DataNodes to store the blocks
The file is split into HDFS blocks:
- Default block size = 128 MB
- Our file is small → 1 block
The block is:
- Streamed to the first DataNode, which pipelines it to the replicas
- Written to 3 DataNodes in total (default replication factor = 3)
Different views of the file:
- User view of HDFS:
  /movies/genre_ratings/genre_ratings.csv
- NameNode view of the file (NameNode metadata):
  genre_ratings.csv → block_1 → DN2, DN4, DN7
  (a larger file spanning two blocks would add another entry, e.g. block_2 → DN1, DN3, DN6)
- DataNode view of the blocks:
  DN2: blk_1   DN4: blk_1   DN7: blk_1
Data is now distributed and fault-tolerant.
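You can inspect the actual block placement yourself (exact output varies by Hadoop version):

hdfs fsck /movies/genre_ratings/genre_ratings.csv -files -blocks -locations

This lists every block of the file and the DataNodes holding its replicas.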
STEP 2 — Submitting the MapReduce job
You submit a MapReduce job (e.g., Hadoop Streaming).
Example command
hadoop jar hadoop-streaming.jar \
-files mapper.py,reducer.py \
-input /movies/genre_ratings/genre_ratings.csv \
-output /movies/output_avg \
-mapper mapper.py \
-reducer reducer.py

(-files ships the two scripts to every task node; without it the job fails unless the scripts are already installed on each node)
What Hadoop does now
The job is submitted to the YARN ResourceManager.
The ResourceManager:
- Allocates a container
- Launches the ApplicationMaster in it
The ApplicationMaster:
- Requests block locations from the NameNode
- Determines the number of map tasks
- Requests containers for those tasks from the ResourceManager
Data Locality Optimization
- NameNode knows where blocks reside
- ApplicationMaster requests containers on the same nodes
- Each mapper reads its data locally from the DataNode's disk
Each mapper runs:
- In its own YARN container
- As a separate JVM process
- One mapper per InputSplit
STEP 3 — InputSplits and Map task creation (data locality)
InputSplit creation
- Our file is far smaller than 128 MB
- 1 block → 1 InputSplit → 1 mapper
If the file were terabytes in size:
- Thousands of blocks
- Thousands of mappers (see the arithmetic sketch at the end of this step)
For every InputSplit,
- One mapper is created.
- YARN tries to schedule mappers on DataNodes where data resides (data locality).
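To make the split arithmetic concrete, a quick sketch (the 2 TB input size is an assumed example):

import math

block_size = 128 * 1024**2   # default HDFS block size: 128 MB
file_size = 2 * 1024**4      # hypothetical 2 TB input file

num_splits = math.ceil(file_size / block_size)
print(num_splits)            # 16384 → roughly 16,384 map tasks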
STEP 4 — Mapper execution (line-by-line)
What a mapper receives
Each mapper receives:
(key, value)
Example:
(0, "Drama,4.5")
(10, "Comedy,3.0")
key = byte offset of the line (ignored by our logic)
value = one line from the CSV
Mapper logic (conceptual)
- Read one line
- Split by comma
- Convert rating to float
- Emit genre as key
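A minimal mapper.py sketch for Hadoop Streaming, assuming Python 3 is available on the task nodes (with Streaming and text input, each line arrives on stdin; the byte-offset key is not passed to the script):

#!/usr/bin/env python3
# mapper.py -- a minimal sketch of the mapper logic above.
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        genre, rating = line.split(",")
        value = float(rating)
    except ValueError:
        continue  # skip malformed lines
    # Emit key<TAB>value; the value packs (partial_sum, count)
    print(f"{genre}\t{value},1")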
Mapper output (logical)
Drama → (4.5,1)
Comedy → (3.0,1)
Drama → (5.0,1)
Sci-Fi → (4.8,1)
Comedy → (3.5,1)
Mapper output (physical reality)
- Output is buffered in memory
- When the buffer fills:
  - Data is spilled to disk
  - Each spill file is sorted by key (genre)
Example spill file:
Comedy (3.0,1)
Comedy (3.5,1)
Drama (4.5,1)
Drama (5.0,1)
Sci-Fi (4.8,1)
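Both the buffer size and the spill threshold are per-job configuration (Hadoop 2/3 property names; defaults are 100 MB and 80%). For example, to enlarge the sort buffer when submitting a Streaming job:

hadoop jar hadoop-streaming.jar \
-D mapreduce.task.io.sort.mb=256 \
-D mapreduce.map.sort.spill.percent=0.80 \
...

(generic -D options must come before the Streaming-specific ones)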
STEP 5 — Combiner (optional but realistic)
Before the shuffle:
- The combiner runs on the mapper's node
- It performs local aggregation using the same logic as the reducer, minus the final division
After combiner
Comedy → (6.5,2)
Drama → (9.5,2)
Sci-Fi → (4.8,1)
This drastically reduces network traffic.
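With Streaming you would add -combiner combiner.py to the job command (and ship the script via -files). A minimal sketch of such a combiner, assuming the mapper output format above; combiner.py is a name chosen here for illustration. It must emit partial (sum, count) pairs rather than averages, because averaging is not associative:

#!/usr/bin/env python3
# combiner.py -- hypothetical local aggregator (illustrative name).
# Input: "genre<TAB>sum,count" lines, sorted by key within this mapper.
import sys

current, total, count = None, 0.0, 0
for line in sys.stdin:
    genre, value = line.strip().split("\t")
    s, c = value.split(",")
    if genre != current and current is not None:
        print(f"{current}\t{total},{count}")   # flush previous genre
        total, count = 0.0, 0
    current = genre
    total += float(s)
    count += int(c)
if current is not None:
    print(f"{current}\t{total},{count}")       # flush last genre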
STEP 6 — Shuffle and Sort phase
This is the heart of MapReduce, and usually the most expensive phase of the job, because it moves data over the network.
What happens
Mapper outputs are:
- Partitioned by key
- Sent over the network
All records with the same genre go to the same reducer (the partitioner decides; see the sketch below).
The reducer receives its data already merged and sorted by key.
Example reducer input:
Comedy → [(6.5,2)]
Drama → [(9.5,2)]
Sci-Fi → [(4.8,1)]
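Which reducer a key goes to is decided by the partitioner; the default is Java's HashPartitioner. A Python rendering of its logic, illustrative only since Python's hash() differs from Java's hashCode():

def partition(key: str, num_reducers: int) -> int:
    # Java equivalent: (key.hashCode() & Integer.MAX_VALUE) % numPartitions
    return (hash(key) & 0x7FFFFFFF) % num_reducers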
STEP 7 — Reducer execution
Reducer logic
For each genre:
- Sum all partial sums
- Sum all counts
- Compute average
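A minimal reducer.py sketch matching the mapper above (Streaming guarantees its stdin is sorted by key, so all records for a genre arrive contiguously):

#!/usr/bin/env python3
# reducer.py -- a minimal sketch of the reducer logic above.
import sys

current, total, count = None, 0.0, 0
for line in sys.stdin:
    genre, value = line.strip().split("\t")
    s, c = value.split(",")
    if genre != current and current is not None:
        print(f"{current}\t{total / count}")   # average for previous genre
        total, count = 0.0, 0
    current = genre
    total += float(s)
    count += int(c)
if current is not None:
    print(f"{current}\t{total / count}")       # average for last genre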
Reducer computation
Comedy:
sum = 6.5
count = 2
avg = 3.25
Drama:
sum = 9.5
count = 2
avg = 4.75
Sci-Fi:
sum = 4.8
count = 1
avg = 4.8
Reducer output
Comedy 3.25
Drama 4.75
Sci-Fi 4.8
STEP 8 — Writing output to HDFS
The reducer writes its output to HDFS.
Stored as:
/movies/output_avg/part-00000
- The output block is replicated across DataNodes
- The job is marked SUCCESS (an empty _SUCCESS marker file appears in the output directory)
STEP 9 — Viewing the result
Command
hdfs dfs -cat /movies/output_avg/part-00000
Output
Comedy 3.25
Drama 4.75
Sci-Fi 4.8
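This job used a single reducer, so there is exactly one part file. With several reducers (-D mapreduce.job.reduces=N) each writes its own part-NNNNN file; hdfs dfs -getmerge collects them into one local file:

hdfs dfs -getmerge /movies/output_avg ./avg_ratings.txt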
STEP 10 — Job completion & cleanup
- YARN releases containers
- Temporary files removed
- Logs stored for debugging
FULL PIPELINE
Local CSV
↓
HDFS upload (blocks + replication)
↓
InputSplit creation
↓
Mapper (parse, emit genre → rating)
↓
Combiner (local aggregation)
↓
Shuffle & Sort
↓
Reducer (sum, count, avg)
↓
Output written to HDFS