Writes the contents of an R data frame into a Parquet file.
Usage
write_parquet(
  x,
  file,
  schema = NULL,
  compression = c("snappy", "gzip", "zstd", "uncompressed"),
  encoding = NULL,
  metadata = NULL,
  row_groups = NULL,
  options = parquet_options()
)
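As a quick orientation, a minimal round trip might look like the sketch below. It assumes the nanoparquet package is installed. `read_parquet()` is nanoparquet's reader and is used here only to check the result. The data frame and the temporary path are made up for illustration.

```r
library(nanoparquet)

# A small illustrative data frame (hypothetical data).
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Write to a temporary file with a non-default compression, then read it back.
path <- tempfile(fileext = ".parquet")
write_parquet(df, path, compression = "zstd")
back <- read_parquet(path)

# Write to memory instead: with ":raw:" the file contents come back
# as a raw vector rather than being written to disk.
buf <- write_parquet(df, ":raw:")
```

Every Parquet file starts with the four magic bytes `PAR1`, so `rawToChar(buf[1:4])` is a cheap sanity check on the in-memory result.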
Arguments
- x
Data frame to write.
- file
Path to the output file. If this is the string ":raw:", then the data frame is written to a memory buffer, and the memory buffer is returned as a raw vector.
- schema
Parquet schema. Specify a schema to tweak the default nanoparquet R -> Parquet type mappings. Use parquet_schema() to create a schema that you can use here, or read_parquet_schema() to use the schema of a Parquet file.
- compression
Compression algorithm to use. Currently "snappy" (the default), "gzip", "zstd", and "uncompressed" are supported.
- encoding
Encoding to use. Possible values:
If NULL, the appropriate encoding is selected automatically: RLE or PLAIN for BOOLEAN columns, RLE_DICTIONARY for other columns with many repeated values, and PLAIN otherwise.
If it is a single (unnamed) character string, then it will be used for all columns.
If it is an unnamed character vector of encoding names of the same length as the number of columns in the data frame, then those encodings will be used for each column.
If it is a named character vector, then the names must be unique and each name must match a column name, to specify the encoding of that column. The special empty name ("") applies to the rest of the columns. If there is no empty name, the rest of the columns will use the default encoding.
If NA_character_ is specified for a column, the default encoding is used for the column.
If a specified encoding is invalid for a certain column type, or nanoparquet does not implement it, write_parquet() throws an error.
This version of nanoparquet supports the following encodings: PLAIN, GROUP_VAR_INT, PLAIN_DICTIONARY, RLE, BIT_PACKED, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, RLE_DICTIONARY, BYTE_STREAM_SPLIT.
See parquet-encodings for more about encodings.
- metadata
Additional key-value metadata to add to the file. This must be a named character vector, or a data frame with character columns called key and value.
- row_groups
Row groups of the Parquet file. If NULL, and x is a grouped data frame, then the groups are used as row groups. The rows will be reordered to match groups. If NULL, and x is not a grouped data frame, then the num_rows_per_row_group option is used from the options argument, see parquet_options(). Otherwise it must be an integer vector, specifying the starts of the row groups.
- options
Nanoparquet options, see parquet_options().
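The encoding, metadata and row_groups arguments can be combined in one call, roughly as sketched below. This is an illustration, not a recommended configuration: the data frame and the metadata key are hypothetical, and the sketch assumes that row group starts are 1-based row indices beginning at 1.

```r
library(nanoparquet)

# Hypothetical example data with a logical, a character and a double column.
df <- data.frame(
  flag  = c(TRUE, FALSE, TRUE, TRUE),
  code  = c("x", "x", "y", "y"),
  value = c(1.5, 2.5, 3.5, 4.5)
)

path <- tempfile(fileext = ".parquet")
write_parquet(
  df,
  path,
  # Named vector: 'code' gets RLE_DICTIONARY; the unnamed element has the
  # empty name "" and therefore applies to the remaining columns.
  encoding = c(code = "RLE_DICTIONARY", "PLAIN"),
  # Extra key-value metadata as a named character vector.
  metadata = c(source = "example"),
  # Assumed 1-based starts: rows 1-2 form one row group, rows 3-4 another.
  row_groups = c(1L, 3L)
)

back <- read_parquet(path)
```

If an encoding is not valid for a column's type, the call is expected to fail with an error rather than fall back silently, as described under the encoding argument.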
Details
write_parquet() converts string columns to UTF-8 encoding by calling base::enc2utf8(). It does the same for factor levels.
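To see what this conversion means in practice, the base R sketch below applies enc2utf8() to a latin1-encoded string. It only illustrates the effect of base::enc2utf8() on a character vector and is not nanoparquet's internal code. The example string is made up.

```r
# Build "café" as latin1 bytes: the final character is the single byte 0xE9.
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
Encoding(x) <- "latin1"

# Convert to UTF-8: the same character is now the two bytes 0xC3 0xA9.
y <- enc2utf8(x)
Encoding(y)    # "UTF-8"
charToRaw(y)   # 63 61 66 c3 a9
```

The same conversion can be applied by hand to factor levels with `enc2utf8(levels(f))` for a factor `f`.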