Various Parquet encodings
Nanoparquet defaults
Currently the defaults are decided based on the R types. This might change in the future. In general, the defaults will likely change until nanoparquet reaches version 1.0.0.
Current encoding defaults:
Definition levels always use RLE. (Nanoparquet does not currently write repetition levels, but they will also use RLE once implemented.)
factor columns use RLE_DICTIONARY.
logical columns use RLE if the average run length of the first 10,000 values is at least 15. Otherwise they use the PLAIN encoding.
integer, double and character columns use RLE_DICTIONARY if at least two thirds of their values are repeated. Otherwise they use the PLAIN encoding.
list columns of raw vectors currently always use the PLAIN encoding.
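The run-length rule for logical columns can be illustrated with a small sketch. This is not nanoparquet's implementation: the function names are hypothetical, the sample size and threshold are simply the numbers quoted above, and how missing values are counted is not specified here.

```python
from itertools import groupby

def average_run_length(values, sample_size=10_000):
    """Mean length of runs of equal consecutive values in the leading sample."""
    sample = values[:sample_size]
    if not sample:
        return 0.0
    # groupby yields one group per run of identical consecutive values.
    n_runs = sum(1 for _ in groupby(sample))
    return len(sample) / n_runs

def pick_logical_encoding(values, threshold=15):
    # Hypothetical helper mirroring the stated rule, not nanoparquet's code.
    return "RLE" if average_run_length(values) >= threshold else "PLAIN"
```

A column made of long runs, such as 30 TRUE values followed by 30 FALSE values, has an average run length of 30 and would pick RLE under this rule, while an alternating column has an average run length of 1 and would fall back to PLAIN.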
Parquet encodings
See https://github.com/apache/parquet-format/blob/master/Encodings.md for more details on Parquet encodings.
PLAIN encoding
Supported types: all.
In general values are written back to back:
Integer types are little endian.
Floating point types follow the IEEE standard.
BYTE_ARRAY: for each element, there is a little endian 4-byte length and then the bytes themselves.
FIXED_LEN_BYTE_ARRAY: bytes are written back to back.
Nanoparquet can read and write this encoding for all primitive types.
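As a rough illustration of the PLAIN layout described above, the following Python sketch packs a few values with the standard struct module. The function names are made up for this example, and only INT32, DOUBLE and BYTE_ARRAY are covered.

```python
import struct

def plain_int32(values):
    # Each INT32 takes 4 bytes, little endian, written back to back.
    return struct.pack(f"<{len(values)}i", *values)

def plain_double(values):
    # Each DOUBLE is an 8-byte IEEE 754 value, little endian.
    return struct.pack(f"<{len(values)}d", *values)

def plain_byte_array(values):
    # Each element: 4-byte little-endian length, then the raw bytes.
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))
        out += v
    return bytes(out)
```

For instance, plain_byte_array([b"ab", b""]) produces a length of 2, the two bytes, and then a length of 0 with no payload.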
RLE_DICTIONARY encoding
Supported types: dictionary indices in data pages.
This encoding combines run-length encoding and bit packing. Repeated sequences of the same value can be run-length encoded, and non-repeated parts are bit packed. It is used for the data pages of dictionary encoded columns, which store indices into the dictionary. The dictionary pages themselves are PLAIN encoded.
The deprecated PLAIN_DICTIONARY name is treated the same as RLE_DICTIONARY.
Nanoparquet can read and write this encoding.
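The split between a dictionary page and index data can be sketched as follows. This is a simplified illustration with hypothetical function names: it only builds the dictionary and the indices, and leaves out the PLAIN encoding of the dictionary and the run-length/bit-packed encoding of the indices.

```python
def dictionary_encode(values):
    """Return (dictionary, indices); the dictionary keeps first-seen order."""
    dictionary = []
    positions = {}
    indices = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices

def index_bit_width(dictionary):
    # Bits needed to represent the largest index, len(dictionary) - 1.
    return max(len(dictionary) - 1, 0).bit_length()
```

A column with three distinct values needs only 2 bits per index, which is why this encoding pays off when values repeat often.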
RLE encoding
Supported types: BOOLEAN. Also for definition and repetition levels.
This is the same encoding as RLE_DICTIONARY, with a slightly different header. It combines run-length encoding and bit packing. It is used for BOOLEAN columns, and also for definition and repetition levels.
Nanoparquet can read and write this encoding.
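The run structure shared by RLE and RLE_DICTIONARY can be sketched with a small decoder. It follows the hybrid layout in the Parquet format documentation as I understand it: each run starts with a varint header whose lowest bit selects a run-length run (0) or a bit-packed run (1). The function names are hypothetical, and the sketch expects only the runs themselves, without the leading header that differs between the two encodings.

```python
def read_varint(buf, pos):
    """Read an unsigned LEB128 varint; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_hybrid(buf, bit_width, num_values):
    values = []
    pos = 0
    mask = (1 << bit_width) - 1
    while len(values) < num_values:
        header, pos = read_varint(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values,
            # packed starting from the least significant bit.
            n_groups = header >> 1
            n_bytes = n_groups * bit_width
            bits = int.from_bytes(buf[pos:pos + n_bytes], "little")
            pos += n_bytes
            for i in range(n_groups * 8):
                values.append((bits >> (i * bit_width)) & mask)
        else:
            # Run-length run: one value stored in ceil(bit_width / 8)
            # bytes, repeated (header >> 1) times.
            width_bytes = (bit_width + 7) // 8
            value = int.from_bytes(buf[pos:pos + width_bytes], "little")
            pos += width_bytes
            values.extend([value] * (header >> 1))
    # A bit-packed run may be padded to a multiple of 8 values.
    return values[:num_values]
```

With a bit width of 3, the bytes 0x03 0x88 0xC6 0xFA decode to the values 0 through 7 (one bit-packed group), and 0x10 0x05 decodes to eight copies of 5 (one run-length run).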
BIT_PACKED encoding (deprecated in favor of RLE)
Supported types: none. Only for definition and repetition levels, but RLE should be used instead.
This is a simple bit packing encoding for integers that was previously used for encoding definition and repetition levels. It is not used in new Parquet files, because the RLE encoding already includes bit packing and is more efficient.
Nanoparquet currently cannot read or write the BIT_PACKED encoding.
DELTA_BINARY_PACKED encoding
Supported types: INT32, INT64.
This encoding efficiently encodes integer columns if the differences between consecutive elements are often the same, and/or the differences between consecutive elements are small. The extreme case of an arithmetic sequence can be encoded in O(1) space.
Nanoparquet can read this encoding, but cannot currently write it.
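The idea behind delta encoding can be sketched in a few lines. This is a deliberately simplified illustration, not the on-disk format: the actual encoding splits values into blocks and miniblocks, each with its own minimum delta and bit width. The function names are hypothetical.

```python
def delta_encode(values):
    """Return (first value, minimum delta, deltas relative to the minimum)."""
    first = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas, default=0)
    adjusted = [d - min_delta for d in deltas]
    return first, min_delta, adjusted

def bits_per_delta(adjusted):
    # Bit width needed to bit pack the adjusted (non-negative) deltas.
    return max(adjusted, default=0).bit_length()

def delta_decode(first, min_delta, adjusted):
    values = [first]
    for d in adjusted:
        values.append(values[-1] + d + min_delta)
    return values
```

For an arithmetic sequence every adjusted delta is zero, so the deltas need zero bits each and only the first value and the common difference carry information.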
DELTA_LENGTH_BYTE_ARRAY encoding
Supported types: BYTE_ARRAY.
This encoding uses DELTA_BINARY_PACKED to encode the lengths of all byte array elements, followed by the concatenated bytes of the elements. It is especially efficient for short byte array elements, for example a column of short strings.
Nanoparquet can read this encoding, but cannot currently write it.
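A minimal sketch of the separation between lengths and data, with hypothetical function names. In the real encoding the list of lengths would itself be DELTA_BINARY_PACKED encoded; here it is kept as a plain Python list.

```python
def split_lengths(values):
    """Separate byte arrays into a list of lengths and one concatenated buffer."""
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data

def join_lengths(lengths, data):
    """Rebuild the byte arrays from the lengths and the concatenated buffer."""
    values = []
    pos = 0
    for n in lengths:
        values.append(data[pos:pos + n])
        pos += n
    return values
```

Compared to PLAIN, the 4-byte length in front of every element is replaced by a compactly encoded run of small integers.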
DELTA_BYTE_ARRAY encoding
Supported types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY.
This encoding is efficient if consecutive byte array elements share the same prefix, because each element can reuse a prefix of the previous element.
Nanoparquet can read this encoding, but cannot currently write it.
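Prefix sharing can be sketched as below. The function names are hypothetical, and the sketch stops at prefix lengths and suffixes; the further encoding of those two parts is left out.

```python
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefix_encode(values):
    """Store each element as (length of prefix shared with the previous one, suffix)."""
    prefix_lengths, suffixes = [], []
    previous = b""
    for v in values:
        n = common_prefix_len(previous, v)
        prefix_lengths.append(n)
        suffixes.append(v[n:])
        previous = v
    return prefix_lengths, suffixes

def prefix_decode(prefix_lengths, suffixes):
    values = []
    previous = b""
    for n, s in zip(prefix_lengths, suffixes):
        previous = previous[:n] + s
        values.append(previous)
    return values
```

For the elements axis, axle, babble, babyhood this yields prefix lengths 0, 2, 0, 3 and suffixes axis, le, babble, yhood, so sorted or hierarchical strings such as file paths shrink considerably.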
BYTE_STREAM_SPLIT encoding
Supported types: FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY.
This encoding stores the first bytes of the elements first, then the second bytes, etc. It does not reduce the size in itself, but may allow more efficient compression.
Nanoparquet can read this encoding, but cannot currently write it.
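The byte transposition can be sketched for FLOAT values as follows. The function names and the fmt parameter (a struct format string) are assumptions made for this example.

```python
import struct

def byte_stream_split(values, fmt="<f"):
    """Write byte 0 of every value, then byte 1 of every value, and so on."""
    width = struct.calcsize(fmt)
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # raw[k::width] picks the k-th byte of each value.
    return b"".join(raw[k::width] for k in range(width))

def byte_stream_join(data, fmt="<f"):
    """Inverse of byte_stream_split."""
    width = struct.calcsize(fmt)
    n = len(data) // width
    # Stream k occupies data[k * n:(k + 1) * n]; interleave them again.
    raw = bytes(data[k * n + i] for i in range(n) for k in range(width))
    return [struct.unpack_from(fmt, raw, i * width)[0] for i in range(n)]
```

The output has exactly the same size as the input. The benefit is that bytes with similar roles, such as the exponent bytes of neighboring floats, end up next to each other, which a general-purpose compressor can exploit.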
See also
write_parquet() on how to select a non-default encoding when writing Parquet files.