# Changelog
## nanoparquet 0.4.2

CRAN release: 2025-02-22

- `write_parquet()` no longer fails when writing files with a zero-length first page (#122).
- `read_parquet()` can now read Parquet files that do not contain the dictionary page offset in their metadata. Polars creates such files (#132).
## nanoparquet 0.4.1

CRAN release: 2025-02-10

- `write_parquet()` now correctly converts double `Date` columns to integer columns (@eitsupi, #116).
- `read_parquet()` now correctly reads `FLOAT` columns from files with multiple row groups.
- `read_parquet()` now correctly reads Parquet files that have column chunks with both dictionary-encoded and non-dictionary-encoded pages (#110).
## nanoparquet 0.4.0

CRAN release: 2025-01-29

- API changes:
  - `parquet_schema()` is now called `read_parquet_schema()`. The new `parquet_schema()` function falls back to `read_parquet_schema()` if it is called with a single string argument, with a warning.
  - `parquet_info()` is now called `read_parquet_info()`. `parquet_info()` still works for now, with a warning.
  - `parquet_metadata()` is now called `read_parquet_metadata()`. `parquet_metadata()` still works, with a warning.
  - `parquet_column_types()` is now deprecated and issues a warning. Use `read_parquet_schema()` or the new `infer_parquet_schema()` function instead.
- Other improvements:
  - The new `parquet_schema()` function creates a Parquet schema from scratch. You can use this schema as the new `schema` argument of `write_parquet()`, to specify how the columns of a data frame should be mapped to Parquet types (see the sketch at the end of this section).
  - New `append_parquet()` function to append a data frame to an existing Parquet file.
  - New `col_select` argument for `read_parquet()` to read a subset of columns from a Parquet file.
  - `write_parquet()` can now write multiple row groups. By default it puts at most 10 million rows into a single row group. You can choose the row groups manually with the `row_groups` argument.
  - `write_parquet()` now writes minimum and maximum values per row group for most types; see `?parquet_options()` for turning this off. It also writes out the number of non-missing values.
  - Newly supported type conversions in `write_parquet()` via the `schema` argument:
    - `integer` to `INT64`,
    - `integer` to `INT96`,
    - `double` to `INT96`,
    - `double` to `FLOAT`,
    - `character` to `BYTE_ARRAY`,
    - `character` to `FIXED_LEN_BYTE_ARRAY`,
    - `character` to `ENUM`,
    - `factor` to `ENUM`,
    - `integer` to `DECIMAL`, `INT32`,
    - `integer` to `DECIMAL`, `INT64`,
    - `double` to `DECIMAL`, `INT32`,
    - `double` to `DECIMAL`, `INT64`,
    - `integer` to `INT(8, *)`, `INT(16, *)`, `INT(32, signed)`,
    - `double` to `INT(*, *)`,
    - `character` to `UUID`,
    - `double` to `FLOAT16`,
    - `list` of `raw` vectors to `BYTE_ARRAY`,
    - `list` of `raw` vectors to `FIXED_LEN_BYTE_ARRAY`.
  - `write_parquet()` can now write version 2 data pages. The default is still version 1, but this might change in the future.
  - `write_parquet(file = ":raw:")` now works correctly for larger data frames (#77).
  - New `compression_level` option to select the compression level manually. See `?parquet_options` for details (#91).
  - `read_parquet()` can now read from an R connection (#71).
  - `read_parquet()` now reads `DECIMAL` values correctly from `INT32` and `INT64` columns if their `scale` is not zero.
  - `read_parquet()` now reads `JSON` columns as character vectors, as documented.
  - `read_parquet()` now reads the `FLOAT16` logical type as a real (double) vector.
  - The `class` argument of `parquet_options()` and the `nanoparquet.class` option now work again (#104).
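A minimal sketch combining the renamed readers with the new writer features. The data frame, file name, and option values here are invented for illustration; argument names follow `?write_parquet` and `?parquet_options`:

```r
library(nanoparquet)

df <- data.frame(id = 1:5, name = letters[1:5], score = c(1.5, 2, 2.5, 3, NA))
tmp <- tempfile(fileext = ".parquet")

# Map the data frame columns to specific Parquet types with the new
# parquet_schema(), passed as the `schema` argument of write_parquet().
write_parquet(
  df, tmp,
  schema = parquet_schema(id = "INT64", name = "STRING", score = "FLOAT"),
  compression = "gzip",
  options = parquet_options(
    compression_level = 9,      # new option, see ?parquet_options
    num_rows_per_row_group = 2  # write several small row groups
  )
)

# Append more rows to the existing file.
append_parquet(df, tmp)

# Read back only a subset of the columns.
read_parquet(tmp, col_select = c("id", "score"))

# The renamed metadata readers; the old names still work, with a warning.
read_parquet_schema(tmp)
read_parquet_info(tmp)
read_parquet_metadata(tmp)

# infer_parquet_schema() replaces the deprecated parquet_column_types().
infer_parquet_schema(df)
```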
## nanoparquet 0.3.0

CRAN release: 2024-06-17

- `read_parquet()` type mapping changes:
  - The `STRING` logical type and the `UTF8` converted type are still read as a character vector, but `BYTE_ARRAY` types without a converted or logical type are not any more, and are read as a list of raw vectors. Missing values are indicated as `NULL` values.
  - The `DECIMAL` converted type is read as a `REALSXP` now, even if its type is `FIXED_LEN_BYTE_ARRAY`, not just if it is `BYTE_ARRAY`.
  - The `UUID` logical type is now read as a character vector, formatted as `00112233-4455-6677-8899-aabbccddeeff`.
  - `BYTE_ARRAY` and `FIXED_LEN_BYTE_ARRAY` types without logical or converted types, or with unsupported ones (`FLOAT16`, `INTERVAL`), are now read into a list of raw vectors. Missing values are denoted by `NULL`.
- `write_parquet()` now automatically uses dictionary encoding for columns that have many repeated values. Only the first 10,000 rows are used to decide whether to use a dictionary. Similarly, logical columns are written in RLE encoding if they contain runs of repeated values. `NA` values are ignored when selecting the encoding (#18).
- `write_parquet()` can now write a data frame to a memory buffer, returned as a raw vector, if the special `":raw:"` file name is used (#31).
- `read_parquet()` can now read Parquet files with V2 data pages (#37).
- Both `read_parquet()` and `write_parquet()` now support GZIP and ZSTD compressed Parquet files.
- `read_parquet()` now supports the `RLE` encoding for `BOOLEAN` columns, and also supports the `DELTA_BINARY_PACKED`, `DELTA_LENGTH_BYTE_ARRAY`, `DELTA_BYTE_ARRAY` and `BYTE_STREAM_SPLIT` encodings.
- The `parquet_columns()` function is now called `parquet_column_types()`, and it can now map the column types of a data frame to Parquet types.
- `parquet_info()`, `parquet_metadata()` and `parquet_column_types()` now work if the `created_by` metadata field is unset.
- New `parquet_options()` function that you can use to set nanoparquet options for a single `read_parquet()` or `write_parquet()` call, as in the sketch below.
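A minimal sketch of these additions, assuming the documented `write_parquet()` and `read_parquet()` signatures; `mtcars` and the temporary file are only for illustration:

```r
library(nanoparquet)

tmp <- tempfile(fileext = ".parquet")

# GZIP and ZSTD compression are now supported in addition to snappy.
write_parquet(mtcars, tmp, compression = "zstd")

# Set nanoparquet options for a single call, e.g. the extra class
# added to the data frame that read_parquet() returns.
read_parquet(tmp, options = parquet_options(class = "tbl"))

# The special ":raw:" file name writes to a memory buffer instead of a file.
buf <- write_parquet(mtcars, ":raw:")
is.raw(buf)  # TRUE
```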
## nanoparquet 0.2.0
CRAN release: 2024-05-30
- First release on CRAN. It contains the Parquet reader from https://github.com/hannes/miniparquet, a Parquet writer, functions to read Parquet metadata, and many improvements.