
Various Parquet encodings

Nanoparquet defaults

Currently the defaults are decided based on the R types. This might change in the future. In general, the defaults will likely change until nanoparquet reaches version 1.0.0.

Current encoding defaults:

  • Definition levels always use RLE. (Nanoparquet does not currently write repetition levels, but they'll also use RLE, once implemented.)

  • factor columns use RLE_DICTIONARY.

  • logical columns use RLE if the average run length of the first 10,000 values is at least 15. Otherwise they use the PLAIN encoding.

  • integer, double and character columns use RLE_DICTIONARY if at least two thirds of their values are repeated. Otherwise they use the PLAIN encoding.

  • list columns of raw vectors currently always use the PLAIN encoding.
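
To make the rule for logical columns concrete, here is a minimal Python sketch of the run-length heuristic. It is not nanoparquet's implementation: the function names are hypothetical, and how missing values count towards a run is not stated above, so the sketch simply treats every distinct value as its own kind of run.

```python
from itertools import groupby

def average_run_length(values, sample_size=10_000):
    """Mean length of runs of equal consecutive values in the first
    `sample_size` elements (hypothetical helper, for illustration only)."""
    sample = values[:sample_size]
    if not sample:
        return 0.0
    # groupby() yields one group per run of equal consecutive values.
    n_runs = sum(1 for _ in groupby(sample))
    return len(sample) / n_runs

def choose_logical_encoding(values, min_avg_run=15):
    """Apply the stated threshold: RLE for long runs, PLAIN otherwise."""
    return "RLE" if average_run_length(values) >= min_avg_run else "PLAIN"
```

A column of 100 TRUE values followed by 100 FALSE values has an average run length of 100 and would get RLE, while a strictly alternating column has an average run length of 1 and would get PLAIN.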

Parquet encodings

See https://github.com/apache/parquet-format/blob/master/Encodings.md for more details on Parquet encodings.

PLAIN encoding

Supported types: all.

In general values are written back to back:

  • Integer types are little endian.

  • Floating point types follow the IEEE standard.

  • BYTE_ARRAY: for each element, there is a little endian 4-byte length and then the bytes themselves.

  • FIXED_LEN_BYTE_ARRAY: bytes are written back to back.

Nanoparquet can read and write this encoding for all primitive types.
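
As an illustration of the layout described above, the following Python sketch produces the PLAIN bytes for a few physical types using the standard `struct` module. The function names are made up for this example, and the sketch covers only the value bytes, not page headers or definition levels.

```python
import struct

def plain_int32(values):
    """INT32 values, each as 4 little endian bytes, back to back."""
    return b"".join(struct.pack("<i", v) for v in values)

def plain_double(values):
    """DOUBLE values as 8-byte IEEE 754 numbers, little endian."""
    return b"".join(struct.pack("<d", v) for v in values)

def plain_byte_array(values):
    """BYTE_ARRAY values: a 4-byte little endian length, then the bytes."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)
```

For example, `plain_byte_array([b"ab", b""])` yields the length 2, the two bytes `ab`, and then a length of 0 for the empty element.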

RLE_DICTIONARY encoding

Supported types: dictionary indices in data pages.

This encoding combines run-length encoding and bit-packing. Repeated sequences of the same value can be run-length encoded, and non-repeated parts are bit packed. It is used for the data pages of dictionary encoded columns, which store indices into the dictionary. The dictionary pages themselves are PLAIN encoded.

The deprecated PLAIN_DICTIONARY name is treated the same as RLE_DICTIONARY.

Nanoparquet can read and write this encoding.
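
The split between a dictionary page and index data can be sketched in Python as follows. This is an illustration under simplifying assumptions, not nanoparquet's code: `dictionary_encode` is a hypothetical name, the dictionary keeps values in order of first appearance, and the indices are returned as a plain list rather than being run-length encoded and bit packed.

```python
def dictionary_encode(values):
    """Return (dictionary, indices, bit_width) for a sequence of values.

    The dictionary would go into a PLAIN encoded dictionary page, and the
    indices into the data pages, packed with `bit_width` bits per index.
    """
    dictionary = []
    positions = {}
    indices = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    # Number of bits needed to store the largest index.
    bit_width = max(len(dictionary) - 1, 0).bit_length()
    return dictionary, indices, bit_width
```

With three distinct values the largest index is 2, so two bits per index are enough, however long the original strings are.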

RLE encoding

Supported types: BOOLEAN. Also for definition and repetition levels.

This is the same encoding as RLE_DICTIONARY, with a slightly different header. It combines run-length encoding and bit packing. It is used for BOOLEAN columns, and also for definition and repetition levels.

Nanoparquet can read and write this encoding.
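
The following Python sketch decodes the run-length / bit-packing hybrid, following the format described in the Parquet encoding specification linked above. It deliberately skips the header that differs between the two encodings (a one-byte bit width for RLE_DICTIONARY, a 4-byte length for RLE) and takes the bit width as an argument instead. The function names are hypothetical and input validation is omitted.

```python
def read_uleb128(buf, pos):
    """Read an unsigned LEB128 varint from buf, return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_hybrid(buf, bit_width, num_values):
    """Decode `num_values` integers from a sequence of hybrid runs."""
    out = []
    pos = 0
    value_bytes = (bit_width + 7) // 8
    mask = (1 << bit_width) - 1
    while len(out) < num_values:
        header, pos = read_uleb128(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values each,
            # packed starting at the least significant bit of each byte.
            n_groups = header >> 1
            n_bytes = n_groups * bit_width
            bits = int.from_bytes(buf[pos:pos + n_bytes], "little")
            pos += n_bytes
            for i in range(n_groups * 8):
                out.append((bits >> (i * bit_width)) & mask)
        else:
            # RLE run: (header >> 1) copies of a single value.
            count = header >> 1
            value = int.from_bytes(buf[pos:pos + value_bytes], "little")
            pos += value_bytes
            out.extend([value] * count)
    # A bit-packed run may be padded up to a multiple of 8 values.
    return out[:num_values]
```

For instance, with a bit width of 3, the bytes `10 04` encode eight copies of the value 4 as a single RLE run, and the bytes `03 88 C6 FA` encode the values 0 through 7 as one bit-packed group.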

BIT_PACKED encoding (deprecated in favor of RLE)

Supported types: none. Only for definition and repetition levels, but RLE should be used instead.

This is a simple bit packing encoding for integers that was previously used for encoding definition and repetition levels. It is not used in new Parquet files, because the RLE encoding includes bit packing and supersedes it.

Nanoparquet currently cannot read or write the BIT_PACKED encoding.

DELTA_BINARY_PACKED encoding

Supported types: INT32, INT64.

This encoding efficiently encodes integer columns if the differences between consecutive elements are often the same, and/or the differences between consecutive elements are small. The extreme case of an arithmetic sequence can be encoded in O(1) space.

Nanoparquet can read this encoding, but cannot currently write it.
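
To see why an arithmetic sequence is so cheap, the Python sketch below computes the quantities the encoding is built around: the first value, the minimum delta, and the deltas relative to that minimum. It is a simplification with a hypothetical function name: the real format works on blocks and miniblocks, each with its own minimum delta and bit widths, and stores headers as varints, all of which is left out here.

```python
def delta_summary(values):
    """Return (first, min_delta, adjusted_deltas, bit_width) for a
    non-empty sequence of integers."""
    first = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas, default=0)
    # Only the non-negative differences from the minimum are bit packed.
    adjusted = [d - min_delta for d in deltas]
    bit_width = max(adjusted, default=0).bit_length()
    return first, min_delta, adjusted, bit_width
```

For the sequence 7, 10, 13, 16, 19 every adjusted delta is zero, so the bit width is 0 and only the first value and the minimum delta need to be stored.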

DELTA_LENGTH_BYTE_ARRAY encoding

Supported types: BYTE_ARRAY.

This encoding uses DELTA_BINARY_PACKED to encode the length of all byte array elements. It is especially efficient for short byte array elements, i.e. a column of short strings.

Nanoparquet can read this encoding, but cannot currently write it.
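
The idea can be sketched in Python as a split into a list of lengths and one concatenated data buffer. The function names are hypothetical, and the sketch stops short of actually applying DELTA_BINARY_PACKED to the lengths.

```python
def split_lengths_and_data(values):
    """Separate byte strings into their lengths and the concatenated bytes."""
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data

def join_lengths_and_data(lengths, data):
    """Inverse of split_lengths_and_data()."""
    out = []
    pos = 0
    for n in lengths:
        out.append(data[pos:pos + n])
        pos += n
    return out
```

Compared to PLAIN, which spends four bytes on the length of every element, the lengths here form a single integer sequence that compresses well when the strings have similar sizes.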

DELTA_BYTE_ARRAY encoding

Supported types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY.

This encoding is efficient if consecutive byte array elements share the same prefix, because each element can reuse a prefix of the previous element.

Nanoparquet can read this encoding, but cannot currently write it.
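
Prefix sharing can be illustrated with a short Python sketch. The function names are hypothetical, and the result is left as two plain lists: in the actual format, per the Parquet specification, the prefix lengths are stored with DELTA_BINARY_PACKED and the suffixes with DELTA_LENGTH_BYTE_ARRAY.

```python
def prefix_encode(values):
    """Return (prefix_lengths, suffixes): for each element, how many leading
    bytes it shares with the previous element, and the remaining bytes."""
    prefix_lengths = []
    suffixes = []
    previous = b""
    for v in values:
        n = 0
        limit = min(len(previous), len(v))
        while n < limit and previous[n] == v[n]:
            n += 1
        prefix_lengths.append(n)
        suffixes.append(v[n:])
        previous = v
    return prefix_lengths, suffixes

def prefix_decode(prefix_lengths, suffixes):
    """Rebuild the original elements from prefix lengths and suffixes."""
    out = []
    previous = b""
    for n, suffix in zip(prefix_lengths, suffixes):
        previous = previous[:n] + suffix
        out.append(previous)
    return out
```

For the elements `axis`, `axle`, `babble`, `babyhood`, the prefix lengths are 0, 2, 0, 3 and the suffixes are `axis`, `le`, `babble`, `yhood`. Sorted columns, such as file paths or URLs, tend to benefit the most.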

BYTE_STREAM_SPLIT encoding

Supported types: FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY.

This encoding stores the first bytes of the elements first, then the second bytes, etc. It does not reduce the size in itself, but may allow more efficient compression.

Nanoparquet can read this encoding, but cannot currently write it.
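
The byte transposition can be sketched in Python with the standard `struct` module. The function names and the `fmt` parameter are inventions of this example: `fmt` is a `struct` format string for a single little endian value, for example `"<f"` for FLOAT or `"<i"` for INT32.

```python
import struct

def byte_stream_split(values, fmt="<f"):
    """Write byte 0 of every value, then byte 1 of every value, and so on."""
    width = struct.calcsize(fmt)
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # raw[k::width] picks the k-th byte of each value.
    return b"".join(raw[k::width] for k in range(width))

def byte_stream_join(data, fmt="<f"):
    """Inverse of byte_stream_split()."""
    width = struct.calcsize(fmt)
    n = len(data) // width
    raw = bytes(data[k * n + i] for i in range(n) for k in range(width))
    return [struct.unpack_from(fmt, raw, i * width)[0] for i in range(n)]
```

The output has exactly the same size as the PLAIN bytes. The gain, if any, comes from the compressor applied afterwards: bytes in the same position, such as the sign and exponent bytes of floating point numbers, are often similar across values and end up next to each other.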

See also

See write_parquet() for how to select a non-default encoding when writing Parquet files.