Lately, we’ve had a lot of people asking about the configuration settings available when you store data in Parquet format. This is a great question and I want to go over a few of the basics about the format to answer it.
Even though Parquet is a column-oriented format, the largest sections of data are groups of rows, called row groups. Records are organized into row groups so that the file is splittable and each split contains complete records. Here’s a picture of how data is stored for a simple schema with two columns: A, in green, and B, in blue:
If the entire file were organized by columns then the underlying HDFS blocks would contain just a column or two of each record. Reassembling those records to process them would require shuffling almost all of the data around to the right place. Here’s a picture of the same total amount of data in A and B, but organized by column only:
There’s another benefit to organizing data into row groups: memory consumption. Before Parquet can write the first data value in column B, it needs to write the last value of column A. All column-oriented formats need to buffer record data in memory until those records are written all at once. Row groups help by putting a limit on how much data will be held in memory.
You can control row group size by setting parquet.block.size, in bytes (default: 128MB). Parquet buffers data in its final encoded and compressed form, which uses less memory and means that the amount of buffered data is the same as the row group size on disk. That makes the row group size the most important setting. It controls both:
- The amount of memory consumed for each open Parquet file, and
- The layout of column data on disk
The row group setting is a trade-off between these two. It is generally better to organize data into larger contiguous column chunks to get better I/O performance, but this comes at the cost of using more memory.
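As a concrete illustration, a Hadoop configuration snippet for this setting might look like the following (the 256MB value is only an example; the right value depends on your memory budget, as discussed below):

```xml
<!-- Illustrative Hadoop configuration fragment: raises the Parquet
     row group target from the 128MB default to 256MB. The value is
     in bytes (256 * 1024 * 1024 = 268435456). -->
<property>
  <name>parquet.block.size</name>
  <value>268435456</value>
</property>
```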
That leads to the next level down in the Parquet file: column chunks. Row groups are divided into column chunks, the column-oriented part of the format that is pictured above as the sections of blue and green within a row group. The benefits of Parquet come from this organization: storing data by column lets Parquet use type-specific encodings, and then compression, to fit more values into fewer bytes when writing, and skip data for columns you don’t need when reading.
The total row group size is divided between the column chunks. The more columns you have, the smaller each column’s portion of the total row group. Column chunk sizes also vary widely depending on how densely Parquet can store the values, so the portion used for each column is usually skewed.
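To make the skew concrete, here is a back-of-the-envelope sketch in plain Python. The per-column encoded sizes are made up for illustration; real numbers depend entirely on your data and its encodings:

```python
# Hypothetical encoded sizes (in bytes) for the column chunks of one
# row group written with the 128MB default target.
chunk_sizes = {
    "id": 8 * 1024 * 1024,            # densely encoded numeric column
    "description": 110 * 1024 * 1024, # large text column
    "is_active": 128 * 1024,          # sparse boolean, compresses well
}

# The column chunks together make up the row group, but the split
# between them is heavily skewed toward the text column.
total = sum(chunk_sizes.values())
for column, size in chunk_sizes.items():
    share = size / total
    print(f"{column}: {size / (1024 * 1024):.1f}MB ({share:.1%} of the row group)")
```

The point of the sketch is that the "is_active" chunk is thousands of times smaller than the "description" chunk, even though both live in the same row group.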
There’s no magic answer for setting the row group size, but this does all lead to a few best practices:
1. Know your memory limits
Total memory for writes is approximately the row group size times the number of open files. If this is too high, processes die with out-of-memory errors.
On the read side, memory consumption can be reduced by skipping some columns, but you should still expect to need some constant fraction of the row group size, such as a half or a third.
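As a rough illustration, you can work backwards from a memory budget to a safe row group size. This is plain Python arithmetic with assumed numbers, not a Parquet API:

```python
# Hypothetical sizing: each task may spend 1GB of heap on Parquet
# write buffers, and writes to 8 open files (e.g. 8 partitions) at once.
memory_budget = 1024 * 1024 * 1024  # 1GB per task, in bytes
open_files = 8

# Writer memory is roughly row_group_size * open_files, so the largest
# safe row group size is the budget divided by the number of open files.
max_row_group_size = memory_budget // open_files
print(f"max row group size: {max_row_group_size / (1024 * 1024):.0f}MB")
```

Under these assumptions the answer works out to the 128MB default; with more open files per task, the safe row group size shrinks accordingly.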
2. Test with your data
Write a file or two using the defaults and use parquet-tools to see the size distributions for columns in your data. Then, try to pick a value that puts the majority of those columns at a few megabytes in each row group.
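For example, suppose parquet-tools reports the (hypothetical) per-column chunk sizes below for a file written with the defaults. Assuming chunk sizes scale roughly linearly with the row group size, a short script can estimate how a different setting would move the small columns toward the few-megabyte range:

```python
# Hypothetical per-column chunk sizes (bytes per row group) observed
# with the 128MB default row group size.
observed_chunks = {
    "id": 2 * 1024 * 1024,
    "event_time": 3 * 1024 * 1024,
    "payload": 48 * 1024 * 1024,
}
default_row_group = 128 * 1024 * 1024

def chunk_sizes_at(row_group_size):
    """Estimate per-column chunk sizes for a different row group size,
    assuming sizes scale roughly linearly with the row group size."""
    scale = row_group_size / default_row_group
    return {col: size * scale for col, size in observed_chunks.items()}

# Trying a 256MB row group doubles each estimate, pushing the small
# columns from ~2-3MB up to ~4-6MB per chunk.
for col, size in chunk_sizes_at(256 * 1024 * 1024).items():
    print(f"{col}: {size / (1024 * 1024):.1f}MB")
```

The linear-scaling assumption is only an estimate, so re-check the actual sizes with parquet-tools after rewriting a file with the new setting.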
3. Align with HDFS blocks
Make sure that some whole number of row groups makes up approximately one HDFS block. Each row group must be processed by a single task, so row groups larger than the HDFS block size will read a lot of data remotely. Row groups that spill over into adjacent blocks have the same problem.
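A quick divisibility check, again with hypothetical sizes, shows what "aligned" means here:

```python
# Hypothetical layout: 256MB HDFS blocks and a 128MB row group target.
hdfs_block_size = 256 * 1024 * 1024
row_group_size = 128 * 1024 * 1024

# Aligned when a whole number of row groups fills each HDFS block
# exactly, so no row group straddles a block boundary.
row_groups_per_block = hdfs_block_size / row_group_size
aligned = hdfs_block_size % row_group_size == 0
print(f"{row_groups_per_block:.2f} row groups per block, aligned: {aligned}")

# A 96MB row group would not divide the block evenly: the third row
# group would straddle the boundary and be read partly from a remote node.
print(f"96MB aligned: {hdfs_block_size % (96 * 1024 * 1024) == 0}")
```

Keep in mind that actual row group sizes vary a little from the target, so alignment in practice is approximate rather than exact.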
A final note: row group sizes are approximate, because Parquet flushes a row group when it thinks the next record will exceed the configured size.