When we think about Hadoop, we think about very large files stored in HDFS and a lot of data transfer among the nodes of the Hadoop cluster, both while storing HDFS blocks and while running MapReduce tasks. If you could somehow reduce the file size, it would help you reduce the storage requirements as well as the data transfer across the network. That's where data compression in Hadoop helps.
Data compression at various stages in Hadoop
You can compress data in Hadoop MapReduce at various stages.
- Compressing input files- You can compress the input files, which reduces the storage space used in HDFS. If you compress the input files, they are decompressed automatically when they are processed by a MapReduce job. The appropriate codec is determined from the file name extension. For example, if the file name extension is .snappy, the Hadoop framework automatically uses SnappyCodec to decompress the file.
- Compressing the map output- You can compress the intermediate map output. Map output is written to disk, and since data from several map outputs is used by a reducer, that data has to be transferred to the node where the reduce task is running. By compressing the intermediate map output you reduce both the disk space used and the data transferred across the network.
- Compressing output files- You can also compress the output of a MapReduce job. The sketch after this list shows how these stages can be configured in a job driver.
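As a minimal sketch of how these stages can be wired up, the hypothetical driver below enables Snappy compression for the intermediate map output and gzip compression for the final job output. The class name, job name and the choice of codecs are illustrative assumptions, not a fixed recipe; compressed input files need no extra code because the framework picks the codec from the input file name extension.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionDemoDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Compress the intermediate map output with Snappy to cut down the
    // data written to local disk and shuffled across the network.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compression demo");
    // ... set jar, mapper, reducer, input/output paths as usual ...

    // Compress the final job output with gzip.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

    // Compressed input files need no extra code; the framework picks the
    // codec from the input file's extension and decompresses on the fly.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```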
Hadoop compression formats
There are many different compression formats available in the Hadoop framework. You will have to use the one that suits your requirements.
Parameters that you need to look for are-
- Time it takes to compress.
- Space saving.
- Whether the compression format is splittable or not.
Deflate– It is the compression algorithm implemented by zlib; the Deflate compression algorithm is also used by the gzip compression tool. Filename extension is .deflate.
gzip- gzip compression is based on the Deflate compression algorithm. Gzip compression is not as fast as LZO or Snappy, but it compresses better so the space saving is greater.
Gzip is not splittable.
Filename extension is .gz.
bzip2- Using bzip2 for compression provides a higher compression ratio, but the compression and decompression speed is slow.
Bzip2 is splittable; BZip2Codec implements the SplittableCompressionCodec interface, which provides the capability to compress/decompress a stream starting at any arbitrary position (see the sketch below).
Filename extension is .bz2.
- Refer Compressing File in bzip2 Format in Hadoop - Java Program to see how to compress using bzip2 format to get a splittable compressed file.
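To illustrate what the SplittableCompressionCodec interface makes possible, here is a hedged sketch that opens a bzip2 file in HDFS and starts decompressing from an arbitrary byte offset. The file path and the offset are made-up assumptions, and a real record reader would also handle records that straddle the adjusted boundaries.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.SplitCompressionInputStream;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class BZip2SplitRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/test/data.bz2");      // hypothetical input file

    long start = 1024 * 1024L;                        // some offset inside the file
    long end = fs.getFileStatus(file).getLen();       // read up to the end

    BZip2Codec codec = new BZip2Codec();
    codec.setConf(conf);
    Decompressor decompressor = CodecPool.getDecompressor(codec);

    try (FSDataInputStream rawIn = fs.open(file);
         SplitCompressionInputStream in = codec.createInputStream(
             rawIn, decompressor, start, end,
             SplittableCompressionCodec.READ_MODE.BYBLOCK)) {
      // The codec moves the requested offset to the nearest bzip2 block boundary
      System.out.println("Adjusted start: " + in.getAdjustedStart());
      byte[] buf = new byte[4096];
      int read = in.read(buf);
      System.out.println("Read " + read + " decompressed bytes from mid-file");
    } finally {
      CodecPool.returnDecompressor(decompressor);
    }
  }
}
```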
Snappy– The Snappy compressor from Google provides fast compression and decompression, but the compression ratio is lower.
Snappy is not splittable.
Filename extension is .snappy.
- Refer Compressing File in snappy Format in Hadoop - Java Program to see how to compress using snappy format.
LZO– LZO, just like Snappy, is optimized for speed, so it compresses and decompresses faster but the compression ratio is lower.
LZO is not splittable by default, but you can index the LZO files as a pre-processing step to make them splittable.
Filename extension is .lzo.
- Refer How to Configure And Use LZO Compression in Hadoop to see how to compress using LZO format and how to index lzo files.
LZ4– Has fast compression and decompression speed, but the compression ratio is lower.
LZ4 is not splittable.
Filename extension is .lz4.
Zstandard– Zstandard is a real-time compression algorithm providing high compression ratios. It offers a very wide range of compression/speed trade-offs.
Zstandard is not splittable.
Filename extension is .zst.
Codecs in Hadoop
Codec, short for compressor-decompressor, is the implementation of a compression-decompression algorithm. In the Hadoop framework there are different codec classes for the different compression formats; you use the codec class for the compression format you are working with. The codec classes in Hadoop are as follows-
Deflate – org.apache.hadoop.io.compress.DefaultCodec or org.apache.hadoop.io.compress.DeflateCodec (DeflateCodec is an alias for DefaultCodec). This codec uses zlib compression.
Gzip – org.apache.hadoop.io.compress.GzipCodec
Bzip2 – org.apache.hadoop.io.compress.BZip2Codec
Snappy – org.apache.hadoop.io.compress.SnappyCodec
LZO – com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec
The LZO libraries are GPL licensed and don't come with the Hadoop release. The Hadoop codec for LZO has to be downloaded separately.
LZ4– org.apache.hadoop.io.compress.Lz4Codec
Zstandard – org.apache.hadoop.io.compress.ZStandardCodec
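As a small sketch of how these codec classes are used in practice, the hypothetical program below uses CompressionCodecFactory to pick the codec from a file's extension and decompress it in HDFS. The command-line argument and output naming are illustrative assumptions.

```java
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                       // e.g. /user/test/file.gz
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputPath = new Path(uri);

    // Pick the codec by looking at the file name extension (.gz, .bz2, ...)
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(inputPath);
    if (codec == null) {
      System.err.println("No codec found for " + uri);
      System.exit(1);
    }

    // Strip the compression suffix to build the decompressed file's name
    String outputUri =
        CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

    try (InputStream in = codec.createInputStream(fs.open(inputPath));
         OutputStream out = fs.create(new Path(outputUri))) {
      IOUtils.copyBytes(in, out, 4096, false);
    }
  }
}
```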
Compression and Input splits
As you may know, a MapReduce job calculates the number of input splits for the job and launches as many map tasks as there are splits. These map tasks process the data referenced by the input splits in parallel.
If you compress the input file using a compression format that is not splittable, it won't be possible to read data at an arbitrary point in the stream, so a map task can't read just its own split of the data. In this scenario MapReduce won't create multiple input splits and the whole file will be processed by one mapper, which means you get no advantage from parallel processing and incur data transfer overhead as well.
Let's clarify it with an example. If you have a 1 GB file, it will be partitioned and stored as 8 data blocks in HDFS (with a block size of 128 MB). A MapReduce job using this file will also create 8 input splits and launch 8 mappers to process those splits in parallel.
Now, if you compress this 1 GB file using gzip (which is not splittable), HDFS still stores the file as 8 separate blocks. But since it is not possible to read data at an arbitrary point in the compressed gzip stream, the MapReduce job won't calculate multiple input splits and will launch only one map task to process all 8 HDFS blocks. So you lose the advantage of parallel processing and there is no data locality optimization either. Even though the map task is launched on a node where the data for one of the blocks is stored, the data for all the other blocks has to be transferred to the node where the map task runs.
Note here that whether the compression format used is splittable or not is a factor for text files only. If you are using a container file format like SequenceFile or Avro, then splitting is supported even if the compressor used is not splittable, like Snappy or gzip.
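To make the splittability rule concrete, here is a simplified sketch of the kind of check a text input format performs when deciding whether a file can be split. The class and method names are illustrative, not the framework's exact code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {

  // Returns true if the file at the given path can be divided into multiple
  // input splits: either it is not compressed at all, or its codec supports
  // starting decompression at an arbitrary position in the stream.
  public static boolean isSplittable(Configuration conf, Path file) {
    CompressionCodec codec =
        new CompressionCodecFactory(conf).getCodec(file);
    if (codec == null) {
      return true;                                      // plain, uncompressed file
    }
    return codec instanceof SplittableCompressionCodec; // e.g. BZip2Codec
  }
}
```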
Compression increases CPU processing
There is a performance cost associated with compressing and decompressing data. Though you save on storage and there is less I/O activity, compression and decompression require extra CPU cycles.
Though in most cases compressing data improves overall job performance, make sure you weigh the pros and cons and measure the performance gains you actually get with compressed data.
That's all for this topic Data Compression in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!