How HBase Minor Compaction Works
Understanding HBase minor compaction -- how files are selected for compaction using the ratio algorithm, with a worked example showing the selection logic.
Compaction is the process in which HBase combines small files (HStoreFiles) into bigger ones.
There are two types:
- Minor: Takes a few adjacent files and merges them into one.
- Major: Takes all the files in a store (one column family of a region) and merges them into one.
This post covers minor compaction. If you want to read about major compaction, please read the other post: How HBase Major Compaction Works. I suggest reading minor compaction first.
Let’s see what determines how many files count as “a few” in minor compaction.
Configuration Properties
The following properties affect minor compaction:
# Minimum number of StoreFiles per Store to be selected for a compaction to occur.
# Default: 2
hbase.hstore.compaction.min=2
# Maximum number of StoreFiles to compact per minor compaction.
# Default: 10
hbase.hstore.compaction.max=10
# Any StoreFile smaller than this setting will automatically be a candidate for compaction.
hbase.hstore.compaction.min.size
# Any StoreFile larger than this setting will automatically be excluded from compaction.
hbase.hstore.compaction.max.size
# Ratio used in the compaction file selection algorithm.
hbase.hstore.compaction.ratio
File Selection Algorithm
The files used for a minor compaction are chosen with the following logic:
A file is selected for compaction when
file_size <= sum(smaller_files_size) * hbase.hstore.compaction.ratio
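To make the rule concrete, here is a minimal Python sketch of the check for a single file. It assumes that "smaller files" means the files that come after the candidate in oldest-to-newest order, which is how the worked example in this post applies it. The function name passes_ratio_test is hypothetical and is not part of HBase; the ratio values are illustrative only.

```python
from typing import Sequence


def passes_ratio_test(file_size: int, newer_sizes: Sequence[int], ratio: float) -> bool:
    """Return True when file_size <= sum(newer_sizes) * ratio.

    Hypothetical helper: newer_sizes is assumed to hold the sizes of the
    files that follow the candidate in oldest-to-newest order.
    """
    return file_size <= sum(newer_sizes) * ratio


# A 100-byte file followed by files totalling 97 bytes.
print(passes_ratio_test(100, [50, 23, 12, 12], 1.0))  # False: 100 > 97
# A larger ratio makes the same file qualify (97 * 1.2 is about 116.4).
print(passes_ratio_test(100, [50, 23, 12, 12], 1.2))  # True
```

A higher ratio lets larger files into a compaction, so each minor compaction tends to rewrite more data but leaves fewer files behind.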
Worked Example
Consider the following configuration:
hbase.hstore.compaction.ratio=1.0
hbase.hstore.compaction.min=3
hbase.hstore.compaction.max=5
hbase.hstore.compaction.min.size=10
hbase.hstore.compaction.max.size=1000
The following StoreFiles exist: 100, 50, 23, 12, and 12 bytes (oldest to newest).
With the above parameters, the files selected for minor compaction are 23, 12, and 12.
Why? Recall the rule: a file is selected when file_size <= sum(smaller_files_size) * ratio. Walking the files from oldest to newest:
- 100: No, because sum(50, 23, 12, 12) * 1.0 = 97, and 100 > 97
- 50: No, because sum(23, 12, 12) * 1.0 = 47, and 50 > 47
- 23: Yes, because sum(12, 12) * 1.0 = 24, and 23 <= 24
- 12: Yes, because the previous file has been included, and this does not exceed the max-file limit of 5
- 12: Yes, because the previous file has been included, and this does not exceed the max-file limit of 5
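The walk-through above can be reproduced with a short Python sketch. This is a simplified model of the rules as described in this post, not HBase's actual implementation: the function name and parameter names are hypothetical, files larger than max_size are assumed to be dropped before the walk, and a file smaller than min_size is assumed to qualify without the ratio test.

```python
from typing import List


def select_for_minor_compaction(
    sizes: List[int],      # StoreFile sizes in bytes, oldest to newest
    ratio: float = 1.0,    # hbase.hstore.compaction.ratio
    min_files: int = 3,    # hbase.hstore.compaction.min
    max_files: int = 5,    # hbase.hstore.compaction.max
    min_size: int = 10,    # hbase.hstore.compaction.min.size
    max_size: int = 1000,  # hbase.hstore.compaction.max.size
) -> List[int]:
    """Simplified sketch of minor-compaction file selection (assumed model)."""
    # Assumption: files above max_size are excluded outright.
    candidates = [s for s in sizes if s <= max_size]

    # Walk oldest to newest and stop at the first file that qualifies.
    start = len(candidates)
    for i, size in enumerate(candidates):
        newer_total = sum(candidates[i + 1:])
        if size < min_size or size <= newer_total * ratio:
            start = i
            break

    # Once one file qualifies, the newer files after it are included too,
    # capped at max_files.
    selected = candidates[start:start + max_files]

    # Too few files selected means no compaction in this model.
    return selected if len(selected) >= min_files else []


print(select_for_minor_compaction([100, 50, 23, 12, 12]))  # [23, 12, 12]
```

With min_files raised to 4, the same input returns an empty list in this model, because only three files qualify.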
Hope this helps in understanding HBase minor compaction.
Happy Hadooping :)