How nanoparquet maps R types to Parquet types.
R's data types
When writing out a data frame, nanoparquet maps R's data types to Parquet logical types. The following table is a summary of the mapping. For the details see below.
R type | Parquet type | Default | Notes |
character | STRING (BYTE_ARRAY) | x | I.e. STRSXP. Converted to UTF-8. |
" | BYTE_ARRAY | ||
" | FIXED_LEN_BYTE_ARRAY | ||
" | ENUM | ||
" | UUID | ||
Date | DATE | x | |
difftime | INT64 | x | If not hms::hms. Arrow metadata marks it as Duration(NS). |
factor | STRING | x | Arrow metadata marks it as a factor. |
" | ENUM | ||
hms::hms | TIME(true, MILLIS) | x | Sub-milliseconds precision is lost. |
integer | INT(32, true) | x | I.e. INTSXP. |
" | INT64 | ||
" | INT96 | ||
" | DECIMAL (INT32) | ||
" | DECIMAL (INT64) | ||
" | INT(8, *) | ||
" | INT(16, *) | ||
" | INT(32, signed) | ||
list | BYTE_ARRAY | Must be a list of raw vectors. Messing values are NULL . | |
" | FIXED_LEN_BYTE_ARRAY | Must be a list of raw vectors of the same length. Missing values are NULL . | |
logical | BOOLEAN | x | I.e. LGLSXP. |
numeric | DOUBLE | x | I.e. REALSXP. |
" | INT96 | ||
" | FLOAT | ||
" | DECIMAL (INT32) | ||
" | DECIMAL (INT64) | ||
" | INT(*, *) | ||
" | FLOAT16 | ||
POSIXct | TIMESTAMP(true, MICROS) | x | Sub-microsecond precision is lost. |
The non-default mappings can be selected via the schema
argument. E.g.
to write out a factor column called 'name' as ENUM
, use
write_parquet(..., schema = parquet_schema(name = "ENUM"))
The detailed mapping rules are listed below, in order of preference. These rules will likely change until nanoparquet reaches version 1.0.0.
Factors (i.e. vectors that inherit the factor class) are converted to character vectors using
as.character()
, then written as aSTRSXP
(character vector) type. The fact that a column is a factor is stored in the Arrow metadata (see below), unless thenanoparquet.write_arrow_metadata
option is set toFALSE
.Dates (i.e. the
Date
class) is written asDATE
logical type, which is anINT32
type internally.hms
objects (from the hms package) are written asTIME(true, MILLIS)
. logical type, which is internally theINT32
Parquet type. Sub-milliseconds precision is lost.POSIXct
objects are written asTIMESTAMP(true, MICROS)
logical type, which is internally theINT64
Parquet type. Sub-microsecond precision is lost.difftime
objects (that are nothms
objects, see above), are written as anINT64
Parquet type, and noting in the Arrow metadata (see below) that this column has typeDuration
withNANOSECONDS
unit.Integer vectors (
INTSXP
) are written asINT(32, true)
logical type, which corresponds to theINT32
type.Real vectors (
REALSXP
) are written as theDOUBLE
type.Character vectors (
STRSXP
) are written as theSTRING
logical type, which has theBYTE_ARRAY
type. They are always converted to UTF-8 before writing.Logical vectors (
LGLSXP
) are written as theBOOLEAN
type.Other vectors error currently.
You can use infer_parquet_schema()
on a data frame to map R data types
to Parquet data types.
To change the default R to Parquet mapping, use parquet_schema()
and
the schema
argument of write_parquet()
. Currently supported
non-default mappings are:
integer
toINT64
,integer
toINT96
,double
toINT96
,double
toFLOAT
,character
toBYTE_ARRAY
,character
toFIXED_LEN_BYTE_ARRAY
,character
toENUM
,factor
toENUM
,integer
toDECIAML
&INT32
,integer
toDECIAML
&INT64
,double
toDECIAML
&INT32
,double
toDECIAML
&INT64
,integer
toINT(8, *)
,INT(16, *)
,INT(32, signed)
,double
toINT(*, *)
,character
toUUID
,double
toFLOAT16
,list
ofraw
vectors toBYTE_ARRAY
,list
ofraw
vectors toFIXED_LEN_BYTE_ARRAY
.
Parquet's data types
When reading a Parquet file nanoparquet also relies on logical types and the Arrow metadata (if present, see below) in addition to the low level data types. The following table summarizes the mappings. See more details below.
Parquet type | R type | Notes |
Logical types | ||
BSON | character | |
DATE | Date | |
DECIMAL | numeric | REALSXP, potentially losing precision. |
ENUM | character | |
FLOAT16 | numeric | REALSXP |
INT(8, *) | integer | |
INT(16, *) | integer | |
INT(32, *) | integer | Large unsigned values may overflow! |
INT(64, *) | numeric | REALSXP |
INTERVAL | list(raw) | Missing values are NULL . |
JSON | character | |
LIST | Not supported. | |
MAP | Not supported. | |
STRING | factor | If Arrow metadata says it is a factor. Also UTF8. |
" | character | Otherwise. Also UTF8. |
TIME | hms::hms | Also TIME_MILLIS and TIME_MICROS. |
TIMESTAMP | POSIXct | Also TIMESTAMP_MILLIS and TIMESTAMP_MICROS. |
UUID | character | In 00112233-4455-6677-8899-aabbccddeeff form. |
UNKNOWN | Not supported. | |
Primitive types | ||
BOOLEAN | logical | |
BYTE_ARRAY | factor | If Arrow metadata says it is a factor. |
" | list(raw) | Otherwise. Missing values are NULL . |
DOUBLE | numeric | REALSXP |
FIXED_LEN_BYTE_ARRAY | list(raw) | Missing values are NULL . |
FLOAT | numeric | REALSXP |
INT32 | integer | |
INT64 | numeric | REALSXP |
INT96 | POSIXct |
The exact rules are below. These rules will likely change until nanoparquet reaches version 1.0.0.
The
BOOLEAN
type is read as a logical vector (LGLSXP
).The
STRING
logical type and theUTF8
converted type is read as a character vector with UTF-8 encoding.The
DATE
logical type and theDATE
converted type are read as aDate
R object.The
TIME
logical type and theTIME_MILLIS
andTIME_MICROS
converted types are read as anhms
object, see the hms package.The
TIMESTAMP
logical type and theTIMESTAMP_MILLIS
andTIMESTAMP_MICROS
converted types are read asPOSIXct
objects. If the logical type has theUTC
flag set, then the time zone of thePOSIXct
object is set toUTC
.INT32
is read as an integer vector (INTSXP
).INT64
,DOUBLE
andFLOAT
are read as real vectors (REALSXP
).INT96
is read as aPOSIXct
read vector with thetzone
attribute set to"UTC"
. It was an old convention to store time stamps asINT96
objects.The
DECIMAL
converted type (FIXED_LEN_BYTE_ARRAY
orBYTE_ARRAY
type) is read as a real vector (REALSXP
), potentially losing precision.The
ENUM
logical type is read as a character vector.The
UUID
logical type is read as a character vector that uses the00112233-4455-6677-8899-aabbccddeeff
form.The
FLOAT16
logical type is read as a real vector (REALSXP
).BYTE_ARRAY
is read as a factor object if the file was written by Arrow and the original data type of the column was a factor. (See 'The Arrow metadata below.)Otherwise
BYTE_ARRAY
is read a list of raw vectors, with missing values denoted byNULL
.
Other logical and converted types are read as their annotated low level types:
INT(8, true)
,INT(16, true)
andINT(32, true)
are read as integer vectors because they areINT32
internally in Parquet.INT(64, true)
is read as a real vector (REALSXP
).Unsigned integer types
INT(8, false)
,INT(16, false)
andINT(32, false)
are read as integer vectors (INTSXP
). Large positive values may overflow into negative values, this is a known issue that we will fix.INT(64, false)
is read as a real vector (REALSXP
). Large positive values may overflow into negative values, this is a known issue that we will fix.INTERVAL
is a fixed length byte array, and nanoparquet reads it as a list of raw vectors. Missing values are denoted byNULL
.JSON
columns are read as character vectors (STRSXP
).BSON
columns are read as raw vectors (RAWSXP
).
These types are not yet supported:
Nested types (
LIST
,MAP
) are not supported.The
UNKNOWN
logical type is not supported.
You can use the read_parquet_schema()
function to see how R would read
the columns of a Parquet file. Look at the r_type
column.
The Arrow metadata
Apache Arrow (i.e. the arrow R package) adds additional metadata to
Parquet files when writing them in arrow::write_parquet()
. Then,
when reading the file in arrow::read_parquet()
, it uses this metadata
to recreate the same Arrow and R data types as before writing.
nanoparquet::write_parquet()
also adds the Arrow metadata to Parquet
files, unless the nanoparquet.write_arrow_metadata
option is set to
FALSE
.
Similarly, nanoparquet::read_parquet()
uses the Arrow metadata in the
Parquet file (if present), unless the nanoparquet.use_arrow_metadata
option is set to FALSE.
The Arrow metadata is stored in the file level key-value metadata, with
key ARROW:schema
.
Currently nanoparquet uses the Arrow metadata for two things:
It uses it to detect factors. Without the Arrow metadata factors are read as string vectors.
It uses it to detect
difftime
objects. Without the arrow metadata these are read asINT64
columns, containing the time difference in nanoseconds.
See also
nanoparquet-package for options that modify the type mappings.