Monday, August 22, 2016

Avro vs Parquet vs ORCFile as Hadoop storage files

When we develop big data applications and systems on Hadoop, every time we store data in a Hadoop cluster we think about the best way to store it. Storing petabytes of data raises many challenges, including how much storage is required and how to read the data faster.

In Hadoop, you can store your files in many formats. I would like to share some of these options and when to use each of them.

How to store data files in Hadoop and what are the available options:

1) Avro
Apache Avro™ is a data serialization system. Avro provides row-based data storage with binary data serialization, and the schema is stored in the file itself.

Use Case: Use Avro if you need binary serialization of your data while keeping a self-contained schema in row-based data files.
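
To make the "schema travels with the rows" idea concrete, here is a minimal sketch in plain Python. It is not the Avro file format and does not use the Avro library; the SCHEMA dictionary, the two supported types and the function names are all assumptions made up for this illustration. It only shows the general pattern: a schema header stored in the file, followed by rows written one after another in binary.

```python
import io
import json
import struct

# Hypothetical schema, for illustration only (not a real Avro schema file).
SCHEMA = {
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}


def _write_value(buf, ftype, value):
    """Write one field value in a simple binary form."""
    if ftype == "long":
        buf.write(struct.pack(">q", value))
    elif ftype == "string":
        raw = value.encode("utf-8")
        buf.write(struct.pack(">I", len(raw)))
        buf.write(raw)
    else:
        raise ValueError("unsupported type: " + ftype)


def _read_value(buf, ftype):
    """Read one field value written by _write_value."""
    if ftype == "long":
        return struct.unpack(">q", buf.read(8))[0]
    if ftype == "string":
        (length,) = struct.unpack(">I", buf.read(4))
        return buf.read(length).decode("utf-8")
    raise ValueError("unsupported type: " + ftype)


def write_rows(schema, rows):
    """Return bytes: schema header first, then the rows, one row at a time."""
    buf = io.BytesIO()
    header = json.dumps(schema).encode("utf-8")
    buf.write(struct.pack(">I", len(header)))
    buf.write(header)
    buf.write(struct.pack(">I", len(rows)))
    for row in rows:
        for field in schema["fields"]:
            _write_value(buf, field["type"], row[field["name"]])
    return buf.getvalue()


def read_rows(data):
    """Decode the rows using only the schema found inside the data."""
    buf = io.BytesIO(data)
    (header_len,) = struct.unpack(">I", buf.read(4))
    schema = json.loads(buf.read(header_len).decode("utf-8"))
    (row_count,) = struct.unpack(">I", buf.read(4))
    rows = []
    for _ in range(row_count):
        rows.append({f["name"]: _read_value(buf, f["type"]) for f in schema["fields"]})
    return schema, rows
```

The reader never receives the schema separately. It recovers it from the header, which is what "self-contained" means here.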

Read more about Avro:

2) Parquet
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Use Case: You want to store data in column-based files and save on storage. Parquet applies efficient encoding and compression schemes to the data on your Hadoop clusters, and it works with different processing frameworks and programming languages.
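
Why does a column layout save storage? A rough sketch in plain Python, under the assumption of a tiny made-up table: once the values of one column sit next to each other, repeated values can be dictionary-encoded and run-length encoded. Parquet uses encodings of this kind, but the code below is only a toy model, not the Parquet format or its API, and all names in it are hypothetical.

```python
from itertools import groupby

# Made-up sample rows, for illustration only.
ROWS = [
    {"country": "US", "amount": 10},
    {"country": "US", "amount": 20},
    {"country": "US", "amount": 30},
    {"country": "DE", "amount": 40},
    {"country": "DE", "amount": 50},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup dictionary."""
    dictionary = []
    positions = {}
    codes = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(ROWS)
dictionary, codes = dictionary_encode(columns["country"])
runs = run_length_encode(codes)
```

With this sample data the five country strings shrink to a two-entry dictionary and two runs. In a row layout the same values are interleaved with the amounts, so these runs never form.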

Read more about Apache Parquet:

3) ORCFile
Apache ORC is billed by its project as the smallest, fastest columnar storage for Hadoop workloads. ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written.

Use Case: Use ORC when you need to store your data in columnar storage in Hadoop and retrieve it quickly and efficiently. An ORC file contains its own schema, which helps make reading values fast.
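
The "internal index" idea can be sketched in a few lines of plain Python. This is a toy model under my own assumptions, not the ORC format or its API: a column is split into fixed-size chunks, the writer records the minimum and maximum of each chunk, and the reader uses those statistics to skip chunks that cannot contain the requested values. The function names and the chunk size are hypothetical.

```python
def build_index(values, chunk_size):
    """Split a column into chunks and record (min, max) for each chunk."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    index = [(min(chunk), max(chunk)) for chunk in chunks]
    return chunks, index


def scan_range(chunks, index, low, high):
    """Return values in [low, high] and how many chunks had to be read."""
    matches = []
    chunks_read = 0
    for chunk, (chunk_min, chunk_max) in zip(chunks, index):
        # Skip the chunk when its min/max range cannot overlap the query.
        if chunk_max < low or chunk_min > high:
            continue
        chunks_read += 1
        matches.extend(v for v in chunk if low <= v <= high)
    return matches, chunks_read


# Made-up column of 12 values, stored in chunks of 4.
chunks, index = build_index(list(range(1, 13)), chunk_size=4)
matches, chunks_read = scan_range(chunks, index, low=6, high=7)
```

Here only one of the three chunks is read to answer the query. The other two are skipped using the index alone, without touching their values.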

Read more about Apache ORC:

Hope this helps!
