Parquet encodings

Various Parquet encodings

Nanoparquet defaults

Currently the defaults are decided based on the R types. This might change in the future. In general, the defaults will likely change until nanoparquet reaches version 1.0.0.

Current encoding defaults:

Definition levels always use RLE. (Nanoparquet does not currently write repetition levels, but they'll also use RLE, once implemented.)
factor columns use RLE_DICTIONARY.
logical columns use RLE if the average run length of the first 10,000 values is at least 15. Otherwise they use the PLAIN encoding.
integer, double and character columns use RLE_DICTIONARY if at least two thirds of their values are repeated. Otherwise they use the PLAIN encoding.
list columns of raw vectors currently always use the PLAIN encoding.
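
The two data-dependent rules above can be sketched as follows. This is only an illustration in Python, not nanoparquet's actual implementation: the function names are hypothetical, and the reading of "repeated" as "duplicates an earlier value" is an assumption, since the text does not define it precisely.

```python
from itertools import groupby

def average_run_length(values, sample_size=10_000):
    """Average length of runs of equal values in the first `sample_size` values."""
    sample = values[:sample_size]
    if not sample:
        return 0.0
    runs = sum(1 for _ in groupby(sample))  # one group per run of equal values
    return len(sample) / runs

def choose_logical_encoding(values):
    # Rule from the text: RLE if the average run length is at least 15.
    return "RLE" if average_run_length(values) >= 15 else "PLAIN"

def choose_dictionary_encoding(values):
    # ASSUMPTION: a value counts as "repeated" if it duplicates an earlier one.
    if not values:
        return "PLAIN"
    repeated = len(values) - len(set(values))
    # At least two thirds repeated, written without floating point.
    return "RLE_DICTIONARY" if repeated * 3 >= len(values) * 2 else "PLAIN"
```

Under these assumptions, a logical vector with long constant stretches picks RLE, while an alternating one falls back to PLAIN.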

See https://github.com/apache/parquet-format/blob/master/Encodings.md for more details on Parquet encodings.

`PLAIN` encoding

Supported types: all.

In general, values are written back to back:

Integer types are little endian.
Floating point types follow the IEEE standard.
BYTE_ARRAY: for each element, there is a little endian 4-byte length and then the bytes themselves.
FIXED_LEN_BYTE_ARRAY: bytes are written back to back.
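
The layout rules above can be illustrated with a few lines of Python. This is a minimal sketch of the byte layout only (the helper names are made up for this example), and it ignores page headers, compression and missing values.

```python
import struct

def plain_int32(values):
    """INT32 values, each as 4 little endian bytes, back to back."""
    return b"".join(struct.pack("<i", v) for v in values)

def plain_double(values):
    """DOUBLE values as 8-byte IEEE 754 numbers, little endian."""
    return b"".join(struct.pack("<d", v) for v in values)

def plain_byte_array(values):
    """BYTE_ARRAY values: a 4-byte little endian length, then the bytes."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))
        out += v
    return bytes(out)

print(plain_int32([1, 256]).hex())          # 0100000000010000
print(plain_byte_array([b"ab", b""]).hex()) # 02000000616200000000
```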

Nanoparquet can read and write this encoding for all primitive types.

`RLE_DICTIONARY` encoding

Supported types: dictionary indices in data pages.

This encoding combines run-length encoding and bit packing. Runs of the same value are run-length encoded, and the non-repeated parts are bit packed. It is used for the data pages of dictionary encoded columns, which store indices into the dictionary. The dictionary pages themselves are PLAIN encoded.
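
As a rough illustration of the split between dictionary page and data page, the Python sketch below builds a dictionary in order of first appearance and replaces each value with its index. The function name is hypothetical and the ordering of dictionary entries is an assumption; the resulting indices are what would then be run-length encoded and bit packed.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) with entries in order of first appearance."""
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

dictionary, indices = dictionary_encode(["a", "b", "a", "a", "c"])
# dictionary == ["a", "b", "c"], indices == [0, 1, 0, 0, 2]

# Number of bits needed to store the largest index.
bit_width = max(len(dictionary) - 1, 0).bit_length()  # 2 for three entries
```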

The deprecated PLAIN_DICTIONARY name is treated the same as RLE_DICTIONARY.

Nanoparquet can read and write this encoding.

`RLE` encoding

Supported types: BOOLEAN. Also for definition and repetition levels.

This is the same encoding as RLE_DICTIONARY, with a slightly different header. It combines run-length encoding and bit packing. It is used for BOOLEAN columns, and also for definition and repetition levels.
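
To make the combination of run-length encoding and bit packing more concrete, here is a small Python decoder for a sequence of runs, following the layout described in the Parquet format documentation linked above. It is a sketch, not nanoparquet's code: the function names are made up, and it deliberately skips the header that precedes the runs, so the bit width is passed in by the caller.

```python
def read_uleb128(buf, pos):
    """Read an unsigned LEB128 varint, return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7

def decode_hybrid(buf, bit_width, num_values):
    """Decode RLE and bit-packed runs until `num_values` values are read."""
    values = []
    pos = 0
    value_bytes = (bit_width + 7) // 8
    mask = (1 << bit_width) - 1
    while len(values) < num_values:
        header, pos = read_uleb128(buf, pos)
        if header & 1:
            # Bit-packed run: groups of 8 values, packed from the lowest bit.
            num_groups = header >> 1
            nbytes = num_groups * bit_width
            chunk = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(num_groups * 8):
                values.append((chunk >> (i * bit_width)) & mask)
        else:
            # RLE run: a repeat count, then one value in whole bytes.
            run_length = header >> 1
            value = int.from_bytes(buf[pos:pos + value_bytes], "little")
            pos += value_bytes
            values.extend([value] * run_length)
    return values[:num_values]

# Four zeros as an RLE run, then 0..7 as one bit-packed group (bit width 3).
data = bytes([0x08, 0x00, 0x03, 0x88, 0xC6, 0xFA])
print(decode_hybrid(data, bit_width=3, num_values=12))
# [0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7]
```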

Nanoparquet can read and write this encoding.

`BIT_PACKED` encoding (deprecated in favor of `RLE`)

Supported types: none. Only for definition and repetition levels, but RLE should be used instead.

This is a simple bit packing encoding for integers that was previously used for encoding definition and repetition levels. It is not used in new Parquet files because the RLE encoding already includes bit packing and is generally the better choice.

Nanoparquet currently cannot read or write the BIT_PACKED encoding.

`DELTA_BINARY_PACKED` encoding

Supported types: INT32, INT64.

This encoding efficiently encodes integer columns if the differences between consecutive elements are often the same, and/or the differences between consecutive elements are small. The extreme case of an arithmetic sequence can be encoded in O(1) space.
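
The idea can be sketched in Python as follows. This is a simplified illustration, not the real format: it treats the whole column as a single block, whereas the actual encoding works on blocks and miniblocks, each with its own minimum delta and bit width. The function name is hypothetical.

```python
def delta_summary(values):
    """Return (first value, minimum delta, deltas relative to it, bit width)."""
    first = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return first, 0, [], 0
    min_delta = min(deltas)
    # Subtracting the minimum makes every stored delta non-negative.
    residuals = [d - min_delta for d in deltas]
    bit_width = max(residuals).bit_length()
    return first, min_delta, residuals, bit_width

# Arithmetic sequence: every relative delta is 0, so 0 bits per value.
print(delta_summary(list(range(0, 700, 7)))[3])   # 0

# Small but varying differences need only a few bits per value.
print(delta_summary([7, 5, 3, 1, 2, 3, 4, 5]))
# (7, -2, [0, 0, 0, 3, 3, 3, 3], 2)
```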

Nanoparquet can read this encoding, but cannot currently write it.

`DELTA_LENGTH_BYTE_ARRAY` encoding

Supported types: BYTE_ARRAY.

This encoding uses DELTA_BINARY_PACKED to encode the length of all byte array elements. It is especially efficient for short byte array elements, i.e. a column of short strings.
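
A minimal Python sketch of the idea: the lengths are separated from the bytes, so that the lengths can be delta encoded and the bytes stored as one contiguous block. The function name is made up for this example, and the DELTA_BINARY_PACKED step for the lengths is left out.

```python
def split_lengths_and_data(values):
    """Separate a list of byte strings into lengths and concatenated bytes."""
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data

lengths, data = split_lengths_and_data([b"Hello", b"World", b"Foobar", b"ABCDEF"])
# lengths == [5, 5, 6, 6], to be encoded with DELTA_BINARY_PACKED
# data == b"HelloWorldFoobarABCDEF"
```

Compared to PLAIN, short strings no longer carry a separate 4-byte length each, which is where the saving for short elements comes from.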

Nanoparquet can read this encoding, but cannot currently write it.

`DELTA_BYTE_ARRAY` encoding

Supported types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY.

This encoding is efficient if consecutive byte array elements share the same prefix, because each element can reuse a prefix of the previous element.
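
The prefix sharing can be sketched in Python like this. The function names are hypothetical, and the sketch only computes the prefix lengths and suffixes; how these are stored afterwards is not shown.

```python
def front_code(values):
    """Return (prefix lengths, suffixes) relative to the previous element."""
    prefix_lengths = []
    suffixes = []
    previous = b""
    for v in values:
        n = 0
        limit = min(len(previous), len(v))
        while n < limit and previous[n] == v[n]:
            n += 1
        prefix_lengths.append(n)
        suffixes.append(v[n:])
        previous = v
    return prefix_lengths, suffixes

def front_decode(prefix_lengths, suffixes):
    """Rebuild the original elements from prefix lengths and suffixes."""
    values = []
    previous = b""
    for n, suffix in zip(prefix_lengths, suffixes):
        previous = previous[:n] + suffix
        values.append(previous)
    return values

words = [b"axis", b"axle", b"babble", b"babyhood"]
print(front_code(words))
# ([0, 2, 0, 3], [b'axis', b'le', b'babble', b'yhood'])
```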

Nanoparquet can read this encoding, but cannot currently write it.

`BYTE_STREAM_SPLIT` encoding

Supported types: FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY.

This encoding stores the first bytes of the elements first, then the second bytes, etc. It does not reduce the size in itself, but may allow more efficient compression.
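
The byte shuffling can be illustrated with the Python sketch below, which uses `struct` to get the little endian bytes of each value. The function names are made up, and the `fmt` parameter is an assumption of this example (`"<f"` for FLOAT, `"<d"` for DOUBLE).

```python
import struct

def byte_stream_split(values, fmt="<f"):
    """Write byte 0 of every value, then byte 1 of every value, and so on."""
    width = struct.calcsize(fmt)
    raw = b"".join(struct.pack(fmt, v) for v in values)
    return b"".join(raw[k::width] for k in range(width))

def byte_stream_join(data, count, fmt="<f"):
    """Inverse of byte_stream_split for `count` values."""
    width = struct.calcsize(fmt)
    raw = bytearray(len(data))
    for k in range(width):
        raw[k::width] = data[k * count:(k + 1) * count]
    return [struct.unpack_from(fmt, raw, i * width)[0] for i in range(count)]

encoded = byte_stream_split([1.0, 2.0])
print(encoded.hex())                 # 0000000080003f40
print(byte_stream_join(encoded, 2))  # [1.0, 2.0]
```

The output has the same size as the input. The potential gain comes from similar bytes, such as the exponent bytes of nearby floating point values, ending up next to each other before compression.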

Nanoparquet can read this encoding, but cannot currently write it.
