The schema of the data frame must be compatible with the schema of the file.
Usage
append_parquet(
x,
file,
compression = c("snappy", "gzip", "zstd", "uncompressed"),
encoding = NULL,
row_groups = NULL,
options = parquet_options()
)
Arguments
- x
Data frame to append.
- file
Path to the output file.
- compression
Compression algorithm to use for the newly written data. See
write_parquet().
- encoding
Encoding to use for the newly written data. It does not have to be the same as the encoding of data in
file. See write_parquet() for possible values.
- row_groups
Row groups of the new, extended Parquet file.
append_parquet() can only change the last existing row group, and if row_groups is specified, it has to respect this. That is, if the existing file has n rows, and the last row group starts at k (k <= n), then the first row group in row_groups that refers to the new data must start at k or n+1. (It is simpler to specify num_rows_per_row_group in options, see parquet_options(), instead of row_groups. Only use row_groups if you need complete control.)
- options
Nanoparquet options for the new data, see
parquet_options(). The keep_row_groups option also affects whether append_parquet() overwrites existing row groups in file.
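A minimal sketch of a basic append, assuming nanoparquet is installed in a version that provides append_parquet(). The data frames, column names, and temporary path are made up for illustration; both data frames use the same column names and types so that their schemas are compatible.

```r
library(nanoparquet)

# Illustrative data: two chunks with identical column names and types.
first_chunk <- data.frame(id = 1:5, score = c(0.5, 1.5, 2.5, 3.5, 4.5))
second_chunk <- data.frame(id = 6:10, score = c(5.5, 6.5, 7.5, 8.5, 9.5))

path <- tempfile(fileext = ".parquet")

# Create the file with the first chunk, then append the second chunk.
write_parquet(first_chunk, path)
append_parquet(second_chunk, path, compression = "snappy")

# Read the extended file back to check that both chunks are present.
combined <- read_parquet(path)
nrow(combined)
```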
Warning
This function is not atomic! If it is interrupted, it may leave the file in a corrupt state. To work around this, create a copy of the original file, append the new data to the copy, and then rename the new, extended file to the original one.
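The copy, append, and rename workaround could be sketched as follows. The helper safe_append_parquet() is hypothetical and not part of nanoparquet; it relies only on base R file functions. Whether the final rename is truly atomic depends on the platform and file system, so treat this as an illustration rather than a guarantee.

```r
library(nanoparquet)

# Hypothetical helper, not part of nanoparquet.
safe_append_parquet <- function(x, file, ...) {
  # Put the temporary copy next to the original, so that the final
  # rename stays on the same file system.
  tmp <- tempfile(tmpdir = dirname(file), fileext = ".parquet")
  if (!file.copy(file, tmp)) {
    stop("Could not copy ", file)
  }
  # Clean up the copy if anything fails; harmless after a successful rename.
  on.exit(unlink(tmp), add = TRUE)

  # Append to the copy; the original is untouched if this is interrupted.
  append_parquet(x, tmp, ...)

  # Replace the original with the extended copy.
  if (!file.rename(tmp, file)) {
    stop("Could not replace ", file)
  }
  invisible(file)
}

# Usage with illustrative data.
path <- tempfile(fileext = ".parquet")
write_parquet(data.frame(id = 1:3), path)
safe_append_parquet(data.frame(id = 4:6), path)
result <- read_parquet(path)
```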
About row groups
A Parquet file may be partitioned into multiple row groups, and indeed
most large Parquet files are. append_parquet() is only able to update
the existing file along the row group boundaries. There are two
possibilities:
- append_parquet() keeps all existing row groups in file, and creates new row groups for the new data. This mode can be forced by the keep_row_groups option in options, see parquet_options().
- Alternatively, append_parquet() overwrites the last row group in file with its existing contents plus the beginning of the new data. This mode makes more sense if the last row group is small, because many small row groups are inefficient.
By default append_parquet() chooses between the two modes automatically,
aiming to create row groups with at least num_rows_per_row_group
(see parquet_options()) rows. You can customize this behavior with
the keep_row_groups option and the row_groups argument.
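The two modes can be compared with a small sketch. It assumes that read_parquet_info() is available in the installed nanoparquet version and that it reports a num_row_groups column; the file paths and data are illustrative only.

```r
library(nanoparquet)

first <- data.frame(id = 1:10)
more <- data.frame(id = 11:20)

path_keep <- tempfile(fileext = ".parquet")
path_auto <- tempfile(fileext = ".parquet")
write_parquet(first, path_keep)
write_parquet(first, path_auto)

# Force the first mode: keep the existing row groups as they are and
# write the new data into new row groups.
append_parquet(
  more,
  path_keep,
  options = parquet_options(keep_row_groups = TRUE)
)

# Default: append_parquet() chooses the mode automatically, based on
# num_rows_per_row_group.
append_parquet(more, path_auto)

# Compare the resulting row group layout of the two files.
info_keep <- read_parquet_info(path_keep)
info_auto <- read_parquet_info(path_auto)
info_keep$num_row_groups
info_auto$num_row_groups
```

Both files hold the same rows; only the row group layout may differ between them.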