How to choose the best big data file format
Managing large datasets comes with its own complexities. Storing big data costs more, and processing it adds network, CPU, and I/O costs. Generally, the bigger the dataset, the higher the storage and processing costs.
But first, let’s start with the basics.
What is big data?
Big data is a mix of structured, semi-structured or unstructured data collected by companies.
The fact is, there is no single ideal format for every use case. Smaller, faster data structures tend to serve business analytics apps best. When it comes to data management, modern apps and business analytics need processes that can transform data into the most suitable file formats.
The most popular file formats used for big data analytical purposes include Avro, JSON, Parquet, and CSV.
So, what should you consider while picking the file format to use?
- Readability.
- Write speed and support for complex data structures.
- Compression support.
- Schema evolution support.
- Whether the files are splittable.
Having taken care of the basics, let’s dive into the file formats.
JSON
JavaScript Object Notation (JSON) is usually the go-to file format for web communication. Why is it the de facto choice?
One, its readability makes it easy to transmit and work with data in apps. Two, it supports complex and nested data structures. Three, its documents are relatively compact. Finally, you can easily weave it into cloud technologies, IoT, and mobile apps.
It is commonly used in NoSQL databases like Couchbase, Cassandra, Azure Cosmos DB, Amazon DynamoDB, and MongoDB.
Benefits of using JSON
- Most languages have built-in JSON support with simple serialization and deserialization libraries (see the sketch after this list).
- It supports hierarchical structures, which simplifies storing related data in a single document and representing complex relationships.
- It supports object lists.
- Most current tools have built-in JSON support.
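As a minimal sketch of that serialization round trip, here's what it looks like with Python's standard json module (the nested event record is just an illustration):

```python
import json

# A nested record: the kind of hierarchical structure JSON handles well.
event = {
    "user": {"id": 42, "name": "Ada"},
    "tags": ["login", "mobile"],
}

# Serialize to a JSON string, then parse it back.
payload = json.dumps(event)
restored = json.loads(payload)

assert restored["user"]["name"] == "Ada"
```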
Cons
- It is not the best choice for storing and analyzing data, as it often leads to performance degradation.
- It is a row-oriented format, so its compression ratio is worse than that of column-oriented formats.
- Every record repeats its field names as metadata, giving it a much higher data volume than other formats.
CSV
Comma-Separated Values (CSV) files, like JSON, are a row-based format. They are popular for exchanging tabular data across systems as plain text. For this reason, they're championed by spreadsheet jockeys.
CSV is commonly used by consumer, scientific, and business applications because it makes data easy to share.
Benefits of using CSV
- It is human readable and simple to edit manually.
- Its structure is straightforward.
- Almost all applications can process CSV.
- It is easy to parse and generate (see the sketch after this list).
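A quick sketch of that parse/generate round trip using Python's standard csv module (the people.csv file name and fields are just illustrations):

```python
import csv

rows = [
    {"name": "Ada", "age": 36},
    {"name": "Alan", "age": 41},
]

# Write tabular data as plain text.
with open("people.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back. Note that every value comes back as a string,
# illustrating CSV's lack of column types.
with open("people.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row)  # {'name': 'Ada', 'age': '36'}
```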
Cons
- It does a poor job of supporting special characters, and it has no column types.
- There is no standard method for representing binary data.
- Importing CSV files is tricky because there is no clear way to distinguish an empty quoted string from NULL.
- You can't work with complex or nested structures.
Parquet
It's pretty new in comparison to the two bigshots above. Launched in 2013, it stores nested data in a column-based format, which makes it very efficient in both performance and storage.
The Parquet developers pitch it as the best format for big data problems. It is optimized for WORM (Write Once, Read Many) workloads: it's relatively slow to write but very fast to read, so if you have a read-heavy workload, Parquet is the ISH.
It's ideal for data warehouse solutions handling large datasets.
Benefits of using Parquet
- The compression ratio is very impressive: up to 75%.
- Being a columnar format, it reduces disk I/O, since queries can read only the columns they need.
- It is the fastest format for read workflows, especially when only a subset of columns is required (see the sketch after this list).
- It is very easy to work with the files as you can backup, move, and replicate them with ease.
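A minimal sketch of that write-once, read-many pattern, assuming pandas with a Parquet engine such as pyarrow installed (the events.parquet file and columns are just illustrations):

```python
import pandas as pd  # assumes pandas plus a Parquet engine (e.g. pyarrow)

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["DE", "US", "KE"],
    "revenue": [9.99, 4.50, 12.00],
})

# Write once...
df.to_parquet("events.parquet")

# ...read many times, pulling only the columns a query needs.
subset = pd.read_parquet("events.parquet", columns=["country", "revenue"])
print(subset)
```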
Cons
- Tool support is narrower than for CSV or JSON; it is mostly used inside big data ecosystems such as Spark, Hive, and Impala.
- Files are immutable, so it doesn't support data modification, and schema evolution is limited.
Avro
Launched within the Hadoop project in 2009, Avro is a highly splittable, row-based format used as a serialization platform. It's very efficient and compact because it stores data in binary form.
Almost all major programming languages can process it, including Java, C, C++, C#, Python, and Ruby.
An Avro schema is defined in JSON, so it's easy to read and interpret, and you can easily update its components over time (schema evolution).
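As a rough sketch, here's how defining that JSON schema and writing and reading an Avro file might look in Python with the third-party fastavro package (the User schema and users.avro file name are just illustrations):

```python
from fastavro import parse_schema, reader, writer

# The schema itself is plain JSON, easy to read and to evolve.
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}
parsed = parse_schema(schema)

records = [{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]

# Serialize records to a compact binary file.
with open("users.avro", "wb") as out:
    writer(out, parsed, records)

# Deserialize; the embedded schema tells the reader how to decode.
with open("users.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```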
Benefits of using Avro
- It has incredible performance: it's lightweight and delivers fast data serialization and deserialization.
- Avro is highly splittable and compressible making it a good data storage file format.
Summary
The fastest format to write is CSV, while JSON is the most human-readable format. Parquet is fast at reading column subsets, and Avro is fast at reading all columns at a go.
The most optimized formats for big data are Avro and Parquet. They offer remarkable compression and "splittability", and they support complex data structures. However, their write performance and human readability leave something to be desired.