You can use this schema to specify how to write out a data frame to a Parquet file with write_parquet().
Arguments
- ...: Parquet type specifications, see below. For backwards compatibility, you can supply a file name here, and then parquet_schema() behaves as read_parquet_schema().
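For example, a minimal sketch of both uses, assuming write_parquet() takes the schema via a schema argument and matches schema entries to columns by name; the data frame and temporary file are purely illustrative:
library(nanoparquet)
# A data frame with an integer and a double column.
df <- data.frame(id = 1:3, val = c(1.5, 2.5, 3.5))
# Spell out the Parquet types explicitly instead of relying on "AUTO".
sch <- parquet_schema(id = "INT32", val = "DOUBLE")
tmp <- tempfile(fileext = ".parquet")
write_parquet(df, tmp, schema = sch)
# Backwards-compatible form: given a file name, parquet_schema()
# behaves like read_parquet_schema() and returns the file's schema.
parquet_schema(tmp)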
Value
Data frame with the same columns as read_parquet_schema(): file_name, name, r_type, type, type_length, repetition_type, converted_type, logical_type, num_children, scale, precision, field_id.
Details
A schema is a list of potentially named type specifications, stored in a data frame. Each (potentially named) argument of parquet_schema() may be a character scalar or a list. Parameterized types must be specified as a list (see the sketch after the type list below); primitive Parquet types may be specified as a string or a list.
Possible types:
Special type:
- "AUTO": this is not a Parquet type, but it tells write_parquet() to map the R type to Parquet automatically, using the default mapping rules.
Primitive Parquet types:
- "BOOLEAN"
- "INT32"
- "INT64"
- "INT96"
- "FLOAT"
- "DOUBLE"
- "BYTE_ARRAY"
- "FIXED_LEN_BYTE_ARRAY": fixed-length byte array. It needs a type_length parameter, an integer between 0 and 2^31-1.
Parquet logical types:
- "STRING"
- "ENUM"
- "UUID"
- "INTEGER": signed or unsigned integer. It needs a bit_width and an is_signed parameter. bit_width must be 8, 16, 32 or 64; is_signed must be TRUE or FALSE.
- "INT": same as "INTEGER". The Parquet documentation uses "INT", but the actual specification uses "INTEGER". Both are supported in nanoparquet.
- "DECIMAL": decimal number of specified scale and precision. It needs the precision and primitive_type parameters. It also supports the scale parameter, which defaults to zero if not specified.
- "FLOAT16"
- "DATE"
- "TIME": needs an is_adjusted_utc (TRUE or FALSE) and a unit parameter. unit must be "MILLIS", "MICROS" or "NANOS".
- "TIMESTAMP": needs an is_adjusted_utc (TRUE or FALSE) and a unit parameter. unit must be "MILLIS", "MICROS" or "NANOS".
- "JSON"
- "BSON"
Logical types MAP, LIST and UNKNOWN are not supported currently.
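A minimal sketch combining several of the parameterized types listed above (the column names are arbitrary; the parameter names are the ones documented for each type):
parquet_schema(
  id    = list("INTEGER", bit_width = 32, is_signed = FALSE),
  price = list("DECIMAL", precision = 10, scale = 2, primitive_type = "INT64"),
  ts    = list("TIMESTAMP", is_adjusted_utc = TRUE, unit = "MICROS"),
  hash  = list("FIXED_LEN_BYTE_ARRAY", type_length = 16)
)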
Converted types are deprecated in the Parquet specification in favor of logical types, but parquet_schema() accepts some converted types as a syntactic shortcut for the corresponding logical types:
- INT_8 means list("INT", bit_width = 8, is_signed = TRUE).
- INT_16 means list("INT", bit_width = 16, is_signed = TRUE).
- INT_32 means list("INT", bit_width = 32, is_signed = TRUE).
- INT_64 means list("INT", bit_width = 64, is_signed = TRUE).
- TIME_MICROS means list("TIME", is_adjusted_utc = TRUE, unit = "MICROS").
- TIME_MILLIS means list("TIME", is_adjusted_utc = TRUE, unit = "MILLIS").
- TIMESTAMP_MICROS means list("TIMESTAMP", is_adjusted_utc = TRUE, unit = "MICROS").
- TIMESTAMP_MILLIS means list("TIMESTAMP", is_adjusted_utc = TRUE, unit = "MILLIS").
- UINT_8 means list("INT", bit_width = 8, is_signed = FALSE).
- UINT_16 means list("INT", bit_width = 16, is_signed = FALSE).
- UINT_32 means list("INT", bit_width = 32, is_signed = FALSE).
- UINT_64 means list("INT", bit_width = 64, is_signed = FALSE).
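For instance, these two calls should produce equivalent schemas, the first using the converted-type shortcut and the second the spelled-out logical type:
parquet_schema(n = "UINT_8")
parquet_schema(n = list("INT", bit_width = 8, is_signed = FALSE))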
Missing values
Each type might also have a repetition_type parameter, with possible values "REQUIRED", "OPTIONAL" or "REPEATED". "REQUIRED" columns do not allow missing values. Missing values are allowed in "OPTIONAL" columns. "REPEATED" columns are currently not supported in write_parquet().
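A small sketch of marking one (arbitrarily named) column as required and another as optional; per the rules above, missing values are only allowed in the optional one:
parquet_schema(
  id   = list("INT32", repetition_type = "REQUIRED"),
  note = list("STRING", repetition_type = "OPTIONAL")
)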
Examples
parquet_schema(
c1 = "INT32",
c2 = list("INT", bit_width = 64, is_signed = TRUE),
c3 = list("STRING", repetition_type = "OPTIONAL")
)
#> # A data frame: 3 × 12
#> file_name name r_type type type_length repetition_type converted_type
#> * <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 NA c1 NA INT32 NA NA NA
#> 2 NA c2 NA INT64 NA NA INT_64
#> 3 NA c3 NA BYTE_… NA OPTIONAL UTF8
#> # ℹ 5 more variables: logical_type <I<list>>, num_children <int>,
#> # scale <int>, precision <int>, field_id <int>