Parquet row group size

Lately, we’ve had a lot of people asking about the configuration settings available when you store data in Parquet format. This is a great question and I want to go over a few of the basics about the format to answer it.

Row groups

Even though Parquet is a column-oriented format, the largest sections of data are groups of rows. Records are organized into row groups so that the file is splittable and each split contains complete records. Here’s a simple picture of how data is stored for a simple schema with columns A, in green, and B, in blue:

Parquet row group layout

Row groups are used to keep all the columns of each record in the same HDFS block so records can be reassembled from a single block.

If the entire file were organized by columns, then the underlying HDFS blocks would contain just a column or two of each record. Reassembling those records to process them would require shuffling almost all of the data to the right place. Here’s a picture of the same total amount of data in A and B, but organized by column only:

Column-oriented layout without row groups

Without row groups, there is no alignment with the underlying HDFS blocks, so reassembling complete records requires remote reads.

There’s another benefit to organizing data into row groups: memory consumption. Before Parquet can write the first data value in column B, it needs to write the last value of column A. All column-oriented formats need to buffer record data in memory until those records are written all at once. Row groups help by putting a limit on how much data will be held in memory.

You can control row group size by setting parquet.block.size, in bytes (default: 128MB). Parquet buffers data in its final encoded and compressed form, which uses less memory and means that the amount of buffered data is the same as the row group size on disk[1]. That makes the row group size the most important setting. It controls both:

  1. The amount of memory consumed for each open Parquet file, and
  2. The layout of column data on disk

The row group setting is a trade-off between these two. It is generally better to organize data into larger contiguous column chunks to get better I/O performance, but this comes at the cost of using more memory.

Column Chunks

That leads to the next level down in the Parquet file: column chunks. Row groups are divided into column chunks, the column-oriented part of the format, pictured above as the sections of blue and green within a row group. The benefits of Parquet come from this organization: storing data by column lets Parquet use type-specific encodings and then compression to pack more values into fewer bytes when writing, and to skip data for columns you don’t need when reading.

Parquet row group with increasing number of columns

The row group is a fixed size. As the number of columns increases, the row group is divided into smaller chunks.

The total row group size is divided between the column chunks. The more columns you have, the smaller each column’s portion of the total row group. Column chunk sizes also vary widely depending on how densely Parquet can store the values, so the portion used for each column is usually skewed.
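A quick back-of-envelope calculation makes this concrete (illustrative figures only, not measurements from a real file):

```python
# Rough sketch: the row group budget is split across column chunks.
row_group_bytes = 128 * 1024 * 1024  # parquet.block.size default

for n_columns in (4, 40, 400):
    avg_chunk = row_group_bytes / n_columns
    print(f"{n_columns:>4} columns -> ~{avg_chunk / 1024 / 1024:.1f} MB per chunk")

# In practice the split is skewed: a well-encoded low-cardinality column
# may use far less than this average, a wide text column far more.
```

At 400 columns the average chunk is well under a megabyte, which starts to erode the I/O benefit of contiguous column storage.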

Recommendations

There’s no magic answer for setting the row group size, but this does all lead to a few best practices:

1. Know your memory limits

Total memory for writes is approximately the row group size times the number of open files. If this is too high, processes die with OutOfMemoryErrors.

On the read side, memory consumption can be reduced by ignoring some columns, but reading will usually still require half, a third, or some other constant fraction of your row group size.
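These two rules of thumb are easy to turn into an estimate. The numbers below are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical cluster numbers, for illustration only.
row_group_bytes = 128 * 1024 * 1024   # parquet.block.size
open_files_per_task = 10              # e.g. writing 10 partitions at once

# Write side: roughly one full row group buffered per open file.
write_memory = row_group_bytes * open_files_per_task
print(f"~{write_memory / 1024 / 1024:.0f} MB buffered for writes")

# Read side: projecting a subset of columns still costs some constant
# fraction of the row group, e.g. a third if you read a third of the bytes.
read_memory = row_group_bytes // 3
print(f"~{read_memory / 1024 / 1024:.0f} MB per reader at ~1/3 of columns")
```

If the write-side figure approaches your task's heap size, either lower the row group size or reduce how many files each task keeps open.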

2. Test with your data

Write a file or two using the defaults and use parquet-tools to see the size distributions for columns in your data. Then, try to pick a value that puts the majority of those columns at a few megabytes in each row group.

3. Align with HDFS blocks

Make sure some whole number of row groups makes up approximately one HDFS block. Each row group must be processed by a single task, so row groups larger than the HDFS block size will read a lot of data remotely. Row groups that spill over into adjacent blocks have the same problem.
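The alignment check is simple division. A sketch with example sizes (a 256 MB HDFS block is an assumption; substitute your cluster's dfs.blocksize):

```python
# Sketch: check how row groups line up with HDFS blocks.
hdfs_block = 256 * 1024 * 1024   # e.g. dfs.blocksize
row_group = 128 * 1024 * 1024    # parquet.block.size

groups_per_block, remainder = divmod(hdfs_block, row_group)
print(f"{groups_per_block} row group(s) per block, {remainder} bytes left over")

# remainder == 0 means row groups never straddle a block boundary,
# so each task can read its row groups from local replicas.
```

Powers of two for both settings make this come out evenly; an odd row group size like 100 MB in a 256 MB block leaves 56 MB that spills into the next block.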


[1] Parquet flushes a row group when it thinks the next record will exceed that size.
