How nanoparquet maps R types to Parquet types.
## R's data types
When writing out a data frame, nanoparquet maps R's data types to Parquet logical types. The following table is a summary of the mapping. For the details, see below.
| R type | Parquet type | Default | Notes |
|--------|--------------|---------|-------|
| bit64::integer64 | INT64 | x | NA_integer64_ marks missing values. |
| blob::blob | BYTE_ARRAY | x | Missing values are NULL. |
| " | FIXED_LEN_BYTE_ARRAY | | All entries must have the same length. Missing values are NULL. |
| character | STRING (BYTE_ARRAY) | x | I.e. STRSXP. Converted to UTF-8. |
| " | BYTE_ARRAY | | |
| " | FIXED_LEN_BYTE_ARRAY | | |
| " | ENUM | | |
| " | UUID | | |
| Date | DATE | x | |
| difftime | INT64 | x | If not hms::hms. Arrow metadata marks it as Duration(NS). |
| factor | STRING | x | Arrow metadata marks it as a factor. |
| " | ENUM | | |
| hms::hms | TIME(true, MILLIS) | x | Sub-millisecond precision is lost. |
| integer | INT(32, true) | x | I.e. INTSXP. |
| " | INT64 | | |
| " | INT96 | | |
| " | DECIMAL (INT32) | | |
| " | DECIMAL (INT64) | | |
| " | INT(8, *) | | |
| " | INT(16, *) | | |
| " | INT(32, signed) | | |
| list | LIST (INT32 elements) | x | List of integer vectors. NULL entries and NA elements are supported. |
| " | LIST (DOUBLE elements) | x | List of double vectors. NULL entries and NA elements are supported. |
| " | LIST (STRING elements) | x | List of character vectors. NULL entries and NA elements are supported. |
| " | BYTE_ARRAY | | Must be a list of raw vectors. Missing values are NULL. |
| " | FIXED_LEN_BYTE_ARRAY | | Must be a list of raw vectors of the same length. Missing values are NULL. |
| logical | BOOLEAN | x | I.e. LGLSXP. |
| numeric | DOUBLE | x | I.e. REALSXP. |
| " | INT96 | | |
| " | FLOAT | | |
| " | DECIMAL (INT32) | | |
| " | DECIMAL (INT64) | | |
| " | INT(*, *) | | |
| " | FLOAT16 | | |
| POSIXct | TIMESTAMP(true, MICROS) | x | Sub-microsecond precision is lost. |
The non-default mappings can be selected via the `schema` argument. E.g. to write out a factor column called `name` as `ENUM`, use

```r
write_parquet(..., schema = parquet_schema(name = "ENUM"))
```

The detailed mapping rules are listed below, in order of preference. These rules will likely change until nanoparquet reaches version 1.0.0.
- `bit64::integer64` objects (from the bit64 package) are written as `INT64`. nanoparquet handles any object that inherits the `integer64` class this way. `NA_integer64_` (i.e. `INT64_MIN`) marks missing values.
- `blob::blob` objects (from the blob package) are written as `BYTE_ARRAY`. `blob::blob` is a list of raw vectors, and nanoparquet handles any object that inherits the `blob` class this way, even if the blob package is not installed. Missing values (i.e. `NULL` list entries) are supported.
- Factors (i.e. vectors that inherit the factor class) are converted to character vectors using `as.character()`, then written as a `STRSXP` (character vector) type. The fact that a column is a factor is stored in the Arrow metadata (see below), unless the `nanoparquet.write_arrow_metadata` option is set to `FALSE`.
- Dates (i.e. the `Date` class) are written as the `DATE` logical type, which is an `INT32` type internally.
- `hms` objects (from the hms package) are written as the `TIME(true, MILLIS)` logical type, which is internally the `INT32` Parquet type. Sub-millisecond precision is lost.
- `POSIXct` objects are written as the `TIMESTAMP(true, MICROS)` logical type, which is internally the `INT64` Parquet type. Sub-microsecond precision is lost.
- `difftime` objects (that are not `hms` objects, see above) are written as an `INT64` Parquet type, noting in the Arrow metadata (see below) that this column has type `Duration` with `NANOSECONDS` unit.
- Integer vectors (`INTSXP`) are written as the `INT(32, true)` logical type, which corresponds to the `INT32` type.
- Real vectors (`REALSXP`) are written as the `DOUBLE` type.
- Character vectors (`STRSXP`) are written as the `STRING` logical type, which has the `BYTE_ARRAY` type. They are always converted to UTF-8 before writing.
- Logical vectors (`LGLSXP`) are written as the `BOOLEAN` type.
- Other vectors error currently.
You can use `infer_parquet_schema()` on a data frame to map R data types to Parquet data types.
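As a quick way to see the default mappings in action, you can run `infer_parquet_schema()` on a small data frame before writing it. A minimal sketch (the data frame and its column names are made up for illustration):

```r
library(nanoparquet)

# A data frame exercising several of the default mappings above
df <- data.frame(
  i = 1:3,                          # integer   -> INT(32, true)
  x = c(1.5, 2.5, NA),              # numeric   -> DOUBLE
  s = c("a", "b", "c"),             # character -> STRING (BYTE_ARRAY)
  b = c(TRUE, FALSE, NA),           # logical   -> BOOLEAN
  d = as.Date("2024-01-01") + 0:2   # Date      -> DATE
)

# Shows the Parquet type each column would get, without writing a file
infer_parquet_schema(df)
```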
To change the default R to Parquet mapping, use `parquet_schema()` and the `schema` argument of `write_parquet()`. Currently supported non-default mappings are:
- `integer` to `INT64`,
- `integer` to `INT96`,
- `double` to `INT96`,
- `double` to `FLOAT`,
- `character` to `BYTE_ARRAY`,
- `character` to `FIXED_LEN_BYTE_ARRAY`,
- `character` to `ENUM`,
- `factor` to `ENUM`,
- `integer` to `DECIMAL` & `INT32`,
- `integer` to `DECIMAL` & `INT64`,
- `double` to `DECIMAL` & `INT32`,
- `double` to `DECIMAL` & `INT64`,
- `integer` to `INT(8, *)`, `INT(16, *)`, `INT(32, signed)`,
- `double` to `INT(*, *)`,
- `character` to `UUID`,
- `double` to `FLOAT16`,
- `list` of `integer` vectors to `LIST` with `INT32` elements,
- `list` of `double` vectors to `LIST` with `DOUBLE` elements,
- `list` of `character` vectors to `LIST` with `STRING` elements,
- `list` of `raw` vectors to `BYTE_ARRAY`,
- `list` of `raw` vectors to `FIXED_LEN_BYTE_ARRAY`,
- `blob::blob` to `FIXED_LEN_BYTE_ARRAY`.
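A partial schema only needs to name the columns whose mapping should change; the rest keep their defaults. A small sketch (column names and the temporary file are made up for illustration):

```r
library(nanoparquet)

df <- data.frame(
  kind = c("red", "green", "red"),
  val  = c(1.25, 2.5, 3.75)
)

tmp <- tempfile(fileext = ".parquet")

# Non-default mappings: character -> ENUM, double -> FLOAT
write_parquet(
  df, tmp,
  schema = parquet_schema(kind = "ENUM", val = "FLOAT")
)

# Inspect the types that were actually written
read_parquet_schema(tmp)
```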
## Parquet's data types

When reading a Parquet file, nanoparquet also relies on logical types and the Arrow metadata (if present, see below), in addition to the low level data types. The following table summarizes the mappings. See more details below.
| Parquet type | R type | Notes |
|--------------|--------|-------|
| Logical types | | |
| BSON | character | |
| DATE | Date | |
| DECIMAL | numeric | REALSXP, potentially losing precision. |
| ENUM | character | |
| FLOAT16 | numeric | REALSXP |
| INT(8, *) | integer | |
| INT(16, *) | integer | |
| INT(32, *) | integer | Large unsigned values may overflow! |
| INT(64, *) | numeric | REALSXP, or integer64 if read_int64_type option is set. |
| INTERVAL | list(raw) | Missing values are NULL. |
| JSON | character | |
| LIST | list | Elements are read as their corresponding R type. |
| MAP | | Not supported. |
| STRING | factor | If Arrow metadata says it is a factor. Also UTF8. |
| " | character | Otherwise. Also UTF8. |
| TIME | hms::hms | Also TIME_MILLIS and TIME_MICROS. |
| TIMESTAMP | POSIXct | Also TIMESTAMP_MILLIS and TIMESTAMP_MICROS. |
| UUID | character | In 00112233-4455-6677-8899-aabbccddeeff form. |
| UNKNOWN | | Not supported. |
| Primitive types | | |
| BOOLEAN | logical | |
| BYTE_ARRAY | factor | If Arrow metadata says it is a factor. |
| " | blob::blob | Otherwise. Missing values are NULL. |
| DOUBLE | numeric | REALSXP |
| FIXED_LEN_BYTE_ARRAY | blob::blob | Missing values are NULL. |
| FLOAT | numeric | REALSXP |
| INT32 | integer | |
| INT64 | numeric | REALSXP, or integer64 if read_int64_type option is set. |
| INT96 | POSIXct | |
The exact rules are below. These rules will likely change until nanoparquet reaches version 1.0.0.
- The `BOOLEAN` type is read as a logical vector (`LGLSXP`).
- The `STRING` logical type and the `UTF8` converted type are read as a character vector with UTF-8 encoding.
- The `DATE` logical type and the `DATE` converted type are read as a `Date` R object.
- The `TIME` logical type and the `TIME_MILLIS` and `TIME_MICROS` converted types are read as an `hms` object, see the hms package.
- The `TIMESTAMP` logical type and the `TIMESTAMP_MILLIS` and `TIMESTAMP_MICROS` converted types are read as `POSIXct` objects. If the logical type has the `UTC` flag set, then the time zone of the `POSIXct` object is set to `UTC`.
- `INT32` is read as an integer vector (`INTSXP`).
- `INT64` is read as a real vector (`REALSXP`) by default. If the `read_int64_type` option in `parquet_options()` is set to `"integer64"` or `"bit64::integer64"`, it is read as a `bit64::integer64` vector instead. `NA_integer64_` (i.e. `INT64_MIN`) marks missing values.
- `DOUBLE` and `FLOAT` are read as real vectors (`REALSXP`).
- `INT96` is read as a `POSIXct` vector with the `tzone` attribute set to `"UTC"`. It was an old convention to store time stamps as `INT96` objects.
- The `DECIMAL` converted type (`FIXED_LEN_BYTE_ARRAY` or `BYTE_ARRAY` type) is read as a real vector (`REALSXP`), potentially losing precision.
- The `ENUM` logical type is read as a character vector.
- The `UUID` logical type is read as a character vector that uses the `00112233-4455-6677-8899-aabbccddeeff` form.
- The `FLOAT16` logical type is read as a real vector (`REALSXP`).
- `BYTE_ARRAY` is read as a factor object if the file was written by Arrow and the original data type of the column was a factor. (See 'The Arrow metadata' below.)
- Otherwise `BYTE_ARRAY` and `FIXED_LEN_BYTE_ARRAY` are read as a `blob::blob` object. `blob::blob` is a list of raw vectors with class `blob` (and related vctrs classes). The blob package is not required to use this object; it is simply a list of raw vectors. Missing values are denoted by `NULL`.
Other logical and converted types are read as their annotated low level types:

- `INT(8, true)`, `INT(16, true)` and `INT(32, true)` are read as integer vectors because they are `INT32` internally in Parquet.
- `INT(64, true)` is read as a real vector (`REALSXP`), unless the `read_int64_type` option is set (see above).
- Unsigned integer types `INT(8, false)`, `INT(16, false)` and `INT(32, false)` are read as integer vectors (`INTSXP`). Large positive values may overflow into negative values; this is a known issue that we will fix.
- `INT(64, false)` is read as a real vector (`REALSXP`), unless the `read_int64_type` option is set (see above). Large positive values may overflow into negative values; this is a known issue that we will fix.
- `INTERVAL` is a fixed length byte array, and nanoparquet reads it as a list of raw vectors. Missing values are denoted by `NULL`.
- `JSON` columns are read as character vectors (`STRSXP`).
- `BSON` columns are read as character vectors (`STRSXP`).
These types are not yet supported:
- Nested `LIST` types (lists of lists) are not supported.
- The `MAP` logical type is not supported.
- The `UNKNOWN` logical type is not supported.
You can use the `read_parquet_schema()` function to see how R would read the columns of a Parquet file. Look at the `r_type` column.
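For instance, a quick sketch of inspecting the `r_type` column of a freshly written file (the data frame and temporary file are made up for illustration):

```r
library(nanoparquet)

tmp <- tempfile(fileext = ".parquet")
write_parquet(
  data.frame(d = Sys.Date(), n = pi, s = "txt"),
  tmp
)

sch <- read_parquet_schema(tmp)
# r_type shows the R type each Parquet column will be read as
sch[, c("name", "type", "r_type")]
```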
## The Arrow metadata
Apache Arrow (i.e. the arrow R package) adds additional metadata to Parquet files when writing them in `arrow::write_parquet()`. Then, when reading the file in `arrow::read_parquet()`, it uses this metadata to recreate the same Arrow and R data types as before writing.

`nanoparquet::write_parquet()` also adds the Arrow metadata to Parquet files, unless the `nanoparquet.write_arrow_metadata` option is set to `FALSE`.

Similarly, `nanoparquet::read_parquet()` uses the Arrow metadata in the Parquet file (if present), unless the `nanoparquet.use_arrow_metadata` option is set to `FALSE`.

The Arrow metadata is stored in the file level key-value metadata, with key `ARROW:schema`.
Currently nanoparquet uses the Arrow metadata for two things:
- It uses it to detect factors. Without the Arrow metadata factors are read as string vectors.
- It uses it to detect `difftime` objects. Without the Arrow metadata these are read as `INT64` columns, containing the time difference in nanoseconds.
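The factor case can be seen by toggling `use_arrow_metadata` on read. A sketch (the data frame and temporary file are made up for illustration):

```r
library(nanoparquet)

df <- data.frame(f = factor(c("lo", "hi", "lo")))
tmp <- tempfile(fileext = ".parquet")

# Written with the Arrow metadata (the default), the factor survives
write_parquet(df, tmp)
class(read_parquet(tmp)$f)                    # "factor"

# Ignoring the Arrow metadata, the column is a plain string vector
opts <- parquet_options(use_arrow_metadata = FALSE)
class(read_parquet(tmp, options = opts)$f)    # "character"
```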
## See also

`nanoparquet-package` for options that modify the type mappings.