You can use the schema returned by parquet_schema() to specify how to write
out a data frame to a Parquet file with write_parquet().
Arguments
- ...
Parquet type specifications, see 'Details' below. For backwards compatibility, you can supply a file name here, and then
parquet_schema() behaves as read_parquet_schema().
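For example, both of these calls read the schema of an existing Parquet file (the file name is only illustrative); read_parquet_schema() is the preferred form for new code:
parquet_schema("mydata.parquet")       # legacy form, kept for backwards compatibility
read_parquet_schema("mydata.parquet")  # preferred equivalent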
Value
Data frame with the same columns as read_parquet_schema():
file_name, name, r_type, type, type_length, repetition_type, converted_type, logical_type, num_children, scale, precision, field_id.
Details
A schema is a list of potentially named type specifications. nanoparquet
stores a schema as a data frame. Each (potentially named) argument of
parquet_schema() may be a character scalar or a list. Parameterized
types need to be specified as a list. Primitive Parquet types may be
specified as a string or a list; see the sketch below.
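As a quick sketch (the column names are arbitrary), a non-parameterized type can be given as a plain string, while a parameterized type must be given as a list:
parquet_schema(
  c1 = "DOUBLE",                                       # plain string for a non-parameterized type
  c2 = list("INT", bit_width = 16, is_signed = FALSE)  # list for a parameterized type
)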
Possible types:
Special type:
"AUTO": this is not a Parquet type, but it tellswrite_parquet()to map the R type to Parquet automatically, using the default mapping rules.
Primitive Parquet types:
"BOOLEAN""INT32""INT64""INT96""FLOAT""DOUBLE""BYTE_ARRAY""FIXED_LEN_BYTE_ARRAY": fixed-length byte array. It needs atype_lengthparameter, an integer between 0 and 2^31-1.
Parquet logical types:
"STRING""ENUM""UUID""INTEGER": signed or unsigned integer. It needs abit_widthand anis_signedparameter.bit_widthmust be 8, 16, 32 or 64.is_signedmust beTRUEorFALSE."INT": same as"INTEGER". The Parquet documentation uses"INT", but the actual specification uses"INTEGER". Both are supported in nanoparquet."DECIMAL": decimal number of specified scale and precision. It needs theprecisionandprimitive_typeparameters. Also supports thescaleparameter, it defaults to zero if not specified."FLOAT16""DATE""TIME": needs anis_adjusted_utc(TRUEorFALSE) and aunitparameter.unitmust be"MILLIS","MICROS"or"NANOS"."TIMESTAMP": needs anis_adjusted_utc(TRUEorFALSE) and aunitparameter.unitmust be"MILLIS","MICROS"or"NANOS"."JSON""BSON"
Logical types MAP, LIST and UNKNOWN are not supported currently.
Converted types are deprecated in the Parquet specification in favor of
logical types, but parquet_schema() accepts some converted types as a
syntactic shortcut for the corresponding logical types:
- INT_8 means list("INT", bit_width = 8, is_signed = TRUE).
- INT_16 means list("INT", bit_width = 16, is_signed = TRUE).
- INT_32 means list("INT", bit_width = 32, is_signed = TRUE).
- INT_64 means list("INT", bit_width = 64, is_signed = TRUE).
- TIME_MICROS means list("TIME", is_adjusted_utc = TRUE, unit = "MICROS").
- TIME_MILLIS means list("TIME", is_adjusted_utc = TRUE, unit = "MILLIS").
- TIMESTAMP_MICROS means list("TIMESTAMP", is_adjusted_utc = TRUE, unit = "MICROS").
- TIMESTAMP_MILLIS means list("TIMESTAMP", is_adjusted_utc = TRUE, unit = "MILLIS").
- UINT_8 means list("INT", bit_width = 8, is_signed = FALSE).
- UINT_16 means list("INT", bit_width = 16, is_signed = FALSE).
- UINT_32 means list("INT", bit_width = 32, is_signed = FALSE).
- UINT_64 means list("INT", bit_width = 64, is_signed = FALSE).
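For example, the following two calls should create the same schema, the first via the converted-type shortcut:
parquet_schema(flags = "UINT_8")
parquet_schema(flags = list("INT", bit_width = 8, is_signed = FALSE))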
Missing values
Each type might also have a repetition_type parameter, with possible
values "REQUIRED", "OPTIONAL" or "REPEATED". "REQUIRED" columns
do not allow missing values. Missing values are allowed in "OPTIONAL"
columns. "REPEATED" columns are currently not supported in
write_parquet().
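A sketch of a schema that forbids missing values in one column and allows them in another (the column names are arbitrary):
parquet_schema(
  id = list("INT64", repetition_type = "REQUIRED"),    # no missing values allowed
  note = list("STRING", repetition_type = "OPTIONAL")  # missing values allowed
)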
Examples
parquet_schema(
c1 = "INT32",
c2 = list("INT", bit_width = 64, is_signed = TRUE),
c3 = list("STRING", repetition_type = "OPTIONAL")
)
#> # A data frame: 3 × 12
#> file_name name r_type type type_length repetition_type
#> * <chr> <chr> <chr> <chr> <int> <chr>
#> 1 NA c1 NA INT32 NA NA
#> 2 NA c2 NA INT64 NA NA
#> 3 NA c3 NA BYTE_ARRAY NA OPTIONAL
#> # ℹ 6 more variables: converted_type <chr>, logical_type <I<list>>,
#> # num_children <int>, scale <int>, precision <int>, field_id <int>