Metadata of all pages of a Parquet file
Value
Data frame with columns:
file_name
: file name.row_group
: id of the row group the page belongs to, an integer between 0 and the number of row groups minus one.column
: id of the column. An integer between the number of leaf columns minus one. Note that only leaf columns are considered, as non-leaf columns do not have any pages.page_type
:DATA_PAGE
,INDEX_PAGE
,DICTIONARY_PAGE
orDATA_PAGE_V2
.page_header_offset
: offset of the data page (its header) in the file.uncompressed_page_size
: does not include the page header, as per Parquet spec.compressed_page_size
: without the page header.crc
: integer, checksum, if present in the file, can beNA
.num_values
: number of data values in this page, includeNULL
(NA
in R) values.encoding
: encoding of the page, current possible encodings: "PLAIN", "GROUP_VAR_INT", "PLAIN_DICTIONARY", "RLE", "BIT_PACKED", "DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY", "RLE_DICTIONARY", "BYTE_STREAM_SPLIT".definition_level_encoding
: encoding of the definition levels, seeencoding
for possible values. This can be missing in V2 data pages, where they are always RLE encoded.repetition_level_encoding
: encoding of the repetition levels, seeencoding
for possible values. This can be missing in V2 data pages, where they are always RLE encoded.data_offset
: offset of the actual data in the file.page_header_length
: size of the page header, in bytes.
Details
Reading all the page headers might be slow for large files, especially if the file has many small pages.
See also
read_parquet_page()
to read a page.
Examples
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
nanoparquet:::parquet_pages(file_name)
#> # A data frame: 19 × 14
#> file_name row_group column page_type page_header_offset
#> <chr> <int> <int> <chr> <dbl>
#> 1 /home/runner/work/_temp/… 0 0 DATA_PAGE 4
#> 2 /home/runner/work/_temp/… 0 1 DATA_PAGE 6741
#> 3 /home/runner/work/_temp/… 0 2 DICTIONA… 10766
#> 4 /home/runner/work/_temp/… 0 2 DATA_PAGE 12259
#> 5 /home/runner/work/_temp/… 0 3 DICTIONA… 13334
#> 6 /home/runner/work/_temp/… 0 3 DATA_PAGE 15211
#> 7 /home/runner/work/_temp/… 0 4 DATA_PAGE 16239
#> 8 /home/runner/work/_temp/… 0 5 DICTIONA… 31726
#> 9 /home/runner/work/_temp/… 0 5 DATA_PAGE 31759
#> 10 /home/runner/work/_temp/… 0 6 DATA_PAGE 32031
#> 11 /home/runner/work/_temp/… 0 7 DATA_PAGE 42952
#> 12 /home/runner/work/_temp/… 0 8 DICTIONA… 53749
#> 13 /home/runner/work/_temp/… 0 8 DATA_PAGE 55009
#> 14 /home/runner/work/_temp/… 0 9 DATA_PAGE 55925
#> 15 /home/runner/work/_temp/… 0 10 DATA_PAGE 59312
#> 16 /home/runner/work/_temp/… 0 11 DICTIONA… 65063
#> 17 /home/runner/work/_temp/… 0 11 DATA_PAGE 67026
#> 18 /home/runner/work/_temp/… 0 12 DICTIONA… 68019
#> 19 /home/runner/work/_temp/… 0 12 DATA_PAGE 71089
#> # ℹ 9 more variables: uncompressed_page_size <int>,
#> # compressed_page_size <int>, crc <int>, num_values <int>,
#> # encoding <chr>, definition_level_encoding <chr>,
#> # repetition_level_encoding <chr>, data_offset <dbl>,
#> # page_header_length <int>