
Merging Small Files in Hadoop

The small files problem in Hadoop and five approaches to solve it: HDFSConcat, IdentityMapper/Reducer, FileUtil.copyMerge, Hadoop File Crush, and Hive concatenate.

26 January 2013 · 3 min read

Before getting to the solutions, let me summarize what the problem is and why it matters. If you already know the details, you can skip straight to the solutions.

Problem

Each file smaller than the HDFS block size still occupies a full block entry in the NameNode's metadata. Actual disk usage, however, is based on the file's size, so don't be confused: even a small file will not consume a full block of storage.

  • Each block consumes some amount of RAM on the NameNode.
  • The NameNode can address a fixed number of blocks which depends on RAM available.
  • Every small file added to the cluster therefore eats into the NameNode's memory budget.

So sooner or later we reach the point where we can no longer add data to the cluster, because the NameNode cannot address any new blocks.

Let me summarize the problem with a simple example:

We have 1000 small files, each 100KB.

  • Each file will have one block mapped against it, so the NameNode will create 1000 blocks.
  • Actual space consumption will be only 1000 * 100 KB (assume replication = 1).
  • If we had a combined file of 1000 * 100 KB then it would have consumed just 1 block on the NameNode.
  • Information about each block is stored in NameNode RAM. So instead of just 1 entry, we are consuming 1000 entries in NameNode RAM.
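The arithmetic above can be sketched quickly. The figure of roughly 150 bytes of NameNode heap per metadata object is a commonly cited rule of thumb, not an exact number, and the exact cost varies by Hadoop version:

```java
public class NameNodeMemoryEstimate {
    public static void main(String[] args) {
        // Rule of thumb (approximate): each file object and each block object
        // costs roughly 150 bytes of NameNode heap.
        final long BYTES_PER_OBJECT = 150;

        long smallFiles = 1000;
        // 1000 file objects + 1000 block objects = 2000 metadata objects
        long smallFileHeap = smallFiles * 2 * BYTES_PER_OBJECT;

        // One merged 100 MB file fits in a single 128 MB block:
        // 1 file object + 1 block object = 2 metadata objects
        long mergedHeap = 2 * BYTES_PER_OBJECT;

        System.out.println("1000 small files: ~" + smallFileHeap + " bytes of NameNode heap");
        System.out.println("1 merged file:    ~" + mergedHeap + " bytes of NameNode heap");
    }
}
```

Either way the bytes on disk are the same; the difference is purely in NameNode metadata.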

Solution

The solution is straightforward: merge the small files into bigger ones such that they consume full blocks.

I am listing the approaches you can follow depending upon your choice:

1) HDFSConcat

If the files have the same block size and replication factor, you can use an HDFS-specific API (HDFSConcat) to concatenate two or more files into a single file without copying any blocks.
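In code this is exposed as DistributedFileSystem.concat. A minimal sketch follows; the paths are hypothetical, and this needs a running HDFS cluster on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ConcatExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // concat() is HDFS-specific, so we need the DistributedFileSystem handle.
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        Path target = new Path("/data/merged/part-0");  // hypothetical paths
        Path[] sources = { new Path("/data/in/a"), new Path("/data/in/b") };

        // Moves the sources' blocks onto the end of the target without copying
        // any data; the source files are removed afterwards.
        dfs.concat(target, sources);
    }
}
```

Because no bytes are copied, this is by far the cheapest option when its preconditions (matching block size and replication) are met.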

2) IdentityMapper and IdentityReducer

Write a MapReduce job with an IdentityMapper and IdentityReducer: it outputs exactly what it reads, but into fewer files. You control the number of output files by configuring the number of reducers.
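A sketch of such a job using the newer MapReduce API, where the base Mapper and Reducer classes are already identity pass-throughs. One caveat: with TextInputFormat the byte-offset keys end up in the output unless you drop them in a custom mapper. The class name and paths are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "merge-small-files");
        job.setJarByClass(MergeJob.class);

        // The base Mapper and Reducer pass records through unchanged,
        // so no custom classes are needed for a pure merge.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // One reducer -> one output file; raise this for larger datasets.
        job.setNumReduceTasks(1);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```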

3) FileUtil.copyMerge

FileUtil.copyMerge in the Hadoop API copies all files under a source directory into a single file on the destination file system.
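A sketch of how it might be called; the paths are hypothetical and a configured HDFS is assumed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyMergeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path srcDir = new Path("/data/small-files");   // hypothetical paths
        Path dstFile = new Path("/data/merged.txt");

        // Concatenates every file under srcDir into dstFile.
        // deleteSource=false keeps the originals; the last argument is a
        // string appended after each file (here a newline separator).
        FileUtil.copyMerge(fs, srcDir, fs, dstFile, false, conf, "\n");
    }
}
```

Unlike HDFSConcat, this physically copies the data, so it works regardless of block size but costs I/O proportional to the data size.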

4) Hadoop File Crush

Crush small files in DFS to fewer, larger files.


Sample invocation:

hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728 \
  --input-format=text \
  --output-format=text \
  --compress=none \
  /path/to/my/input /path/to/my/output/merge 20100221175612

Where 20100221175612 is a timestamp you supply at execution time.

See the File Crush documentation for more details. Of the approaches listed here, this is the tool I would recommend for solving this problem.

Advantages of File Crush:

  1. Automatically takes care of sub-folders
  2. Options to do in-folder or off-folder merges
  3. Lets you decide the output format — supported formats are Text and Sequence files
  4. Lets you compress data

5) Hive Concatenate Utility

If your Hive tables are using ORC format, you can use the Hive concatenate utility.

First, find partitions modified during the last 7 days:

SELECT D.NAME, T.TBL_NAME, P.PART_NAME
FROM hive.PARTITIONS P
JOIN hive.TBLS T ON P.TBL_ID = T.TBL_ID
JOIN hive.DBS D ON T.DB_ID = D.DB_ID
WHERE DATEDIFF(NOW(), FROM_UNIXTIME(P.CREATE_TIME)) <= 7;

Then run the Hive ALTER command to concatenate for all the partitions which need it:

ALTER TABLE tbl_name PARTITION (part_spec) CONCATENATE;

Closing Notes

Many people also recommend HAR files and SequenceFiles; those are worth considering if you are willing to change the format of your data.

Thanks for reading. Please also share your approach of solving this problem in the comments below.
