HDFS and MapReduce can be confusing at times. Let’s break down the entire process step by step with a concrete example.

The example we’ll use is calculating the average movie rating per genre from a CSV file: first we upload the CSV into HDFS, then we run a MapReduce job over it to compute the averages.

We assume:

  • Data format:

    genre,rating

  • Goal: average rating per genre

  • Environment: Hadoop (HDFS + YARN + MapReduce)


STEP 0 — Your local file (before Hadoop)

On your local machine you have a file:

genre_ratings.csv

Contents:

Drama,4.5
Comedy,3.0
Drama,5.0
Sci-Fi,4.8
Comedy,3.5

At this point:

  • Hadoop is not involved
  • This is a normal OS file

STEP 1 — Uploading the CSV into HDFS

You now copy the file into HDFS.

Command

hdfs dfs -mkdir -p /movies/genre_ratings
hdfs dfs -put genre_ratings.csv /movies/genre_ratings/
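
You can confirm the upload with a directory listing:

hdfs dfs -ls /movies/genre_ratings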

What happens internally (very important)

  1. Client contacts NameNode

  2. NameNode:

    • Checks permissions
    • Chooses DataNodes to store blocks
  3. File is split into HDFS blocks

    • Default block size = 128 MB
    • Your file is small → 1 block
  4. Block is:

    • Written to 3 DataNodes (replication factor = 3)
  5. Different views of the file:

    • User view of HDFS:
    /movies/genre_ratings/genre_ratings.csv
    
    • NameNode view of the file (NameNode metadata; our small file is a single block — a larger file would map to several)
     genre_ratings.csv
         block_1 → DN2, DN4, DN7
    
    • DataNode view of blocks
     DN2: blk_1
     DN4: blk_1
     DN7: blk_1
    

Data is now distributed and fault-tolerant.
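
You can inspect this block map yourself; the standard hdfs fsck tool reports each block and its DataNode locations (exact IDs and hostnames will differ on your cluster):

hdfs fsck /movies/genre_ratings/genre_ratings.csv -files -blocks -locations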


STEP 2 — Submitting the MapReduce job

You submit a MapReduce job; here we use Hadoop Streaming, which lets the mapper and reducer be ordinary Python scripts.

Example command

hadoop jar hadoop-streaming.jar \
  -files mapper.py,reducer.py \
  -input /movies/genre_ratings/genre_ratings.csv \
  -output /movies/output_avg \
  -mapper mapper.py \
  -reducer reducer.py

The -files option ships the two scripts to every node so the map and reduce tasks can execute them.

What Hadoop does now

The job is submitted to the YARN ResourceManager.

  1. ResourceManager:

    • Allocates containers

    • Launches an ApplicationMaster

  2. ApplicationMaster:

    • Requests block locations from the NameNode

    • Determines number of map tasks

Data Locality Optimization

  • NameNode knows where blocks reside
  • ApplicationMaster requests containers on the same nodes
  • Mapper reads data locally from the DataNode disk

Each mapper runs:

  • In its own YARN container
  • As a separate JVM process
  • One mapper per InputSplit

STEP 3 — InputSplits and Map task creation (data locality)

InputSplit creation

  • File size < 128 MB
  • 1 block → 1 InputSplit → 1 mapper

If the file were terabytes in size:

  • Thousands of blocks
  • Thousands of mappers
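
For example, a 1 TB file at the default 128 MB block size spans 1 TB / 128 MB = 8,192 blocks, so Hadoop would launch roughly 8,192 map tasks.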

For every InputSplit,

  • One mapper is created.
  • YARN tries to schedule mappers on DataNodes where data resides (data locality).

STEP 4 — Mapper execution (line-by-line)

What a mapper receives

Each mapper receives:

(key, value)

Example:

(0, "Drama,4.5")
(10, "Comedy,3.0")
  • key = byte offset (ignored)
  • value = one line from CSV

Mapper logic (conceptual)

  1. Read one line
  2. Split by comma
  3. Convert rating to float
  4. Emit genre as key with (rating, 1) as the value
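
In Hadoop Streaming that logic is an ordinary script. A minimal mapper.py sketch (it assumes well-formed genre,rating lines and uses the tab-separated key/value convention that streaming expects):

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    genre, rating = line.split(",")
    rating = float(rating)          # convert the rating to a number
    # Key and value separated by a tab (Hadoop Streaming's default)
    print(f"{genre}\t{rating},1")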

Mapper output (logical)

Drama  → (4.5,1)
Comedy → (3.0,1)
Drama  → (5.0,1)
Sci-Fi → (4.8,1)
Comedy → (3.5,1)

Mapper output (physical reality)

  • Output is buffered in memory

  • When the buffer fills:

    • Data is spilled to disk
    • Each spill is sorted by key (genre)

Example spill file:

Comedy  (3.0,1)
Comedy  (3.5,1)
Drama   (4.5,1)
Drama   (5.0,1)
Sci-Fi  (4.8,1)

STEP 5 — Combiner (optional but realistic)

Before shuffle:

  • A combiner runs on the mapper’s node
  • It performs local aggregation

After combiner

Comedy → (6.5,2)
Drama  → (9.5,2)
Sci-Fi → (4.8,1)

This drastically reduces network traffic.
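
A minimal combiner.py sketch, wired in by adding -combiner combiner.py to the streaming command (not shown above). Two caveats: the combiner must emit partial (sum, count) pairs, never averages — an unweighted average of averages is wrong — and Hadoop may run it zero, one, or several times, so it must not affect correctness:

#!/usr/bin/env python3
import sys

current_genre = None
total, count = 0.0, 0

# Input arrives sorted by key, e.g. "Comedy\t3.0,1"
for line in sys.stdin:
    genre, value = line.strip().split("\t")
    s, c = value.split(",")
    if genre != current_genre:
        if current_genre is not None:
            # Emit a partial (sum,count), NOT an average
            print(f"{current_genre}\t{total},{count}")
        current_genre = genre
        total, count = 0.0, 0
    total += float(s)
    count += int(c)

if current_genre is not None:
    print(f"{current_genre}\t{total},{count}")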


STEP 6 — Shuffle and Sort phase

This is the heart of MapReduce, and typically the most expensive phase of the job, because it moves data across the network.

What happens

  1. Mapper outputs are:

    • Partitioned by key
    • Sent over the network
  2. All records with the same genre go to the same reducer (the partitioner decides; see the sketch below)

  3. Reducer receives data already sorted
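
The default partitioner in Hadoop is hash-based: Java's HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A conceptual Python sketch (illustration only — Python's built-in hash() is randomized per process, unlike Java's deterministic hashCode):

def partition(genre: str, num_reducers: int) -> int:
    # Mask keeps the hash non-negative; modulo picks the reducer
    return (hash(genre) & 0x7FFFFFFF) % num_reducers

With a single reducer every key lands in partition 0, which is why one reducer receives all three genres in our example.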

Example reducer input:

Comedy → [(6.5,2)]
Drama  → [(9.5,2)]
Sci-Fi → [(4.8,1)]

STEP 7 — Reducer execution

Reducer logic

For each genre:

  1. Sum all partial sums
  2. Sum all counts
  3. Compute average
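
A minimal reducer.py sketch (it assumes the tab-separated genre → sum,count records produced above, and relies on the shuffle having sorted its input by key):

#!/usr/bin/env python3
import sys

current_genre = None
total, count = 0.0, 0

# Input arrives sorted by key, e.g. "Comedy\t6.5,2"
for line in sys.stdin:
    genre, value = line.strip().split("\t")
    s, c = value.split(",")
    if genre != current_genre:
        if current_genre is not None:
            print(f"{current_genre}\t{total / count}")
        current_genre = genre
        total, count = 0.0, 0
    total += float(s)
    count += int(c)

# Flush the final group
if current_genre is not None:
    print(f"{current_genre}\t{total / count}")

Because streaming scripts just read stdin and write stdout, you can dry-run the whole pipeline without a cluster (assuming the scripts are executable):

cat genre_ratings.csv | ./mapper.py | sort | ./reducer.py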

Reducer computation

Comedy:
  sum = 6.5
  count = 2
  avg = 3.25

Drama:
  sum = 9.5
  count = 2
  avg = 4.75

Sci-Fi:
  sum = 4.8
  count = 1
  avg = 4.8

Reducer output

Comedy   3.25
Drama    4.75
Sci-Fi   4.8

STEP 8 — Writing output to HDFS

  1. Reducer writes output to HDFS

  2. Stored as:

    /movies/output_avg/part-00000
    
  3. Output blocks are replicated across DataNodes

  4. Job is marked successful (an empty _SUCCESS marker file appears in the output directory)


STEP 9 — Viewing the result

Command

hdfs dfs -cat /movies/output_avg/part-00000

Output

Comedy   3.25
Drama    4.75
Sci-Fi   4.8
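
With more than one reducer you would get several output files (part-00000, part-00001, …); a glob reads them all at once:

hdfs dfs -cat /movies/output_avg/part-*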

STEP 10 — Job completion & cleanup

  • YARN releases containers
  • Temporary files removed
  • Logs stored for debugging

FULL PIPELINE

Local CSV
   ↓
HDFS upload (blocks + replication)
   ↓
InputSplit creation
   ↓
Mapper (parse, emit genre → rating)
   ↓
Combiner (local aggregation)
   ↓
Shuffle & Sort
   ↓
Reducer (sum, count, avg)
   ↓
Output written to HDFS