Transformations

pq includes commands for restructuring Parquet files: selecting columns, slicing rows, merging files, and splitting into partitions.

Setup

Create a product catalog and save it as products.json:

[
  {"id": 1, "name": "Widget A", "category": "tools", "price": 9.99, "stock": 150},
  {"id": 2, "name": "Widget B", "category": "tools", "price": 14.99, "stock": 80},
  {"id": 3, "name": "Gadget X", "category": "electronics", "price": 49.99, "stock": 30},
  {"id": 4, "name": "Gadget Y", "category": "electronics", "price": 79.99, "stock": 15},
  {"id": 5, "name": "Gizmo", "category": "electronics", "price": 199.99, "stock": 5},
  {"id": 6, "name": "Thingamajig", "category": "misc", "price": 4.99, "stock": 500},
  {"id": 7, "name": "Doohickey", "category": "tools", "price": 24.99, "stock": 60},
  {"id": 8, "name": "Whatchamacallit", "category": "misc", "price": 2.99, "stock": 1000}
]

Convert the JSON file to Parquet:

$ pq import products.json -o products.parquet
Converted 8 rows to products.parquet

Select: project columns

pq select writes a new file containing only the columns listed with -c (comma-separated):

$ pq select products.parquet -c name,price -o name_price.parquet
Wrote 8 rows to name_price.parquet

The output file has a reduced schema:

$ pq schema name_price.parquet
Schema (2 columns):
├── name: string (nullable)
└── price: float64 (nullable)
$ pq head name_price.parquet
╭─────────────────┬────────╮
│ name            ┆ price  │
╞═════════════════╪════════╡
│ Widget A        ┆ 9.99   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Widget B        ┆ 14.99  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Gadget X        ┆ 49.99  │
│ ...             ┆ ...    │
╰─────────────────┴────────╯
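To make the projection concrete, here is a minimal plain-Python model of what the command above does to each row. It is only a sketch of the semantics, not pq's implementation; the helper name project is hypothetical and the data is the first three rows of products.json.

```python
# Plain-Python model of column projection; not how pq is implemented.
rows = [
    {"id": 1, "name": "Widget A", "category": "tools", "price": 9.99, "stock": 150},
    {"id": 2, "name": "Widget B", "category": "tools", "price": 14.99, "stock": 80},
    {"id": 3, "name": "Gadget X", "category": "electronics", "price": 49.99, "stock": 30},
]


def project(rows, columns):
    """Keep only the named columns of every row (hypothetical helper)."""
    return [{col: row[col] for col in columns} for row in rows]


name_price = project(rows, ["name", "price"])
print(name_price[0])  # {'name': 'Widget A', 'price': 9.99}
```

The row count is unchanged; only the set of columns shrinks, which matches the "Wrote 8 rows" message for the full file.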

Slice: extract a row range

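pq slice extracts a contiguous range of rows. --offset is the number of rows to skip and --limit is the number of rows to keep, so the command below writes the third through fifth rows: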

$ pq slice products.parquet --offset 2 --limit 3 -o middle.parquet
Wrote 3 rows to middle.parquet

$ pq cat middle.parquet
╭─────────────┬────┬──────────┬────────┬───────╮
│ category    ┆ id ┆ name     ┆ price  ┆ stock │
╞═════════════╪════╪══════════╪════════╪═══════╡
│ electronics ┆ 3  ┆ Gadget X ┆ 49.99  ┆ 30    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ electronics ┆ 4  ┆ Gadget Y ┆ 79.99  ┆ 15    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ electronics ┆ 5  ┆ Gizmo    ┆ 199.99 ┆ 5     │
╰─────────────┴────┴──────────┴────────┴───────╯
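The offset and limit arithmetic can be sketched with an ordinary Python list slice. This is an illustrative model under the assumption that the offset is zero-based, which is what the output above suggests; slice_rows is a hypothetical helper, not part of pq.

```python
# Model of --offset / --limit using product ids only; not pq's implementation.
ids = [1, 2, 3, 4, 5, 6, 7, 8]


def slice_rows(rows, offset, limit):
    """Skip `offset` rows, then keep at most `limit` rows (hypothetical helper)."""
    return rows[offset:offset + limit]


middle = slice_rows(ids, offset=2, limit=3)
print(middle)  # [3, 4, 5]
```

The result lines up with the ids in middle.parquet: skipping two rows lands on id 3, and three rows are kept.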

Merge: combine files

Create a second file, products_new.json, with new products:

[
  {"id": 9, "name": "Sprocket", "category": "tools", "price": 34.99, "stock": 40},
  {"id": 10, "name": "Doodad", "category": "misc", "price": 7.99, "stock": 200}
]

Convert it to Parquet as well:

$ pq import products_new.json -o products_new.parquet
Converted 2 rows to products_new.parquet

pq merge concatenates files with matching schemas:

$ pq merge products.parquet products_new.parquet -o all_products.parquet
Merged 2 files, wrote 10 rows to all_products.parquet

$ pq count all_products.parquet
10

Schema modes

By default, merge uses strict mode: all files must have identical schemas. Use --schema-mode to handle schema differences:

Mode       Behavior
strict     All schemas must match exactly (default)
union      Keep all columns, fill missing with nulls
intersect  Keep only columns present in all files
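The three modes can be modelled in a few lines of Python. This is a sketch of the behaviour described above, not pq's code: it compares column names only (it ignores types, and it assumes strict mode is sensitive to column order), the helpers merged_columns and merge_rows are hypothetical, and the discount column is invented for the example.

```python
# Model of strict / union / intersect on column names; not pq's implementation.
def merged_columns(schemas, mode="strict"):
    """Return the output column list for a list of per-file column lists."""
    first = schemas[0]
    if mode == "strict":
        if any(schema != first for schema in schemas[1:]):
            raise ValueError("schemas differ; strict mode refuses to merge")
        return list(first)
    if mode == "union":
        columns = []
        for schema in schemas:
            for col in schema:
                if col not in columns:
                    columns.append(col)
        return columns
    if mode == "intersect":
        return [col for col in first if all(col in s for s in schemas[1:])]
    raise ValueError(f"unknown mode: {mode!r}")


def merge_rows(files, mode="strict"):
    """Concatenate rows, filling columns a file lacks with None (null)."""
    schemas = [list(rows[0].keys()) for rows in files]
    columns = merged_columns(schemas, mode)
    return [{col: row.get(col) for col in columns} for rows in files for row in rows]


old = [{"id": 1, "name": "Widget A", "price": 9.99}]
new = [{"id": 9, "name": "Sprocket", "discount": 0.1}]  # "discount" is made up

union_rows = merge_rows([old, new], "union")
intersect_rows = merge_rows([old, new], "intersect")
print(list(union_rows[1].keys()))      # ['id', 'name', 'price', 'discount']
print(union_rows[1]["price"])          # None
print(list(intersect_rows[0].keys()))  # ['id', 'name']
```

In this model, calling merge_rows([old, new]) with the default strict mode raises ValueError, mirroring the requirement that all schemas match.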

Split by row count

pq split --rows N breaks a file into chunks of at most N rows; the last chunk holds whatever remains:

$ pq split products.parquet --rows 3 -o split_rows/
Wrote 3 rows to split_rows/products_0000.parquet
Wrote 3 rows to split_rows/products_0001.parquet
Wrote 2 rows to split_rows/products_0002.parquet
Split 8 rows into 3 files in split_rows/
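The chunk sizes follow from simple arithmetic: 8 rows at 3 per file gives ceil(8 / 3) = 3 files, the last holding the 2 leftover rows. The sketch below reproduces that plan in Python. chunk_plan is a hypothetical helper, and the zero-padded naming pattern is inferred from the output above rather than from pq's documentation.

```python
import math


def chunk_plan(total_rows, rows_per_file, stem="products"):
    """List (file name, row count) pairs for a row-count split (hypothetical helper)."""
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    n_files = math.ceil(total_rows / rows_per_file)
    plan = []
    for i in range(n_files):
        start = i * rows_per_file
        size = min(rows_per_file, total_rows - start)  # last chunk may be short
        plan.append((f"{stem}_{i:04d}.parquet", size))
    return plan


for name, size in chunk_plan(8, 3):
    print(name, size)
# products_0000.parquet 3
# products_0001.parquet 3
# products_0002.parquet 2
```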

Split by partition column

--partition-by writes one directory per distinct value of the column, named column=value (Hive-style partitioning). Tools such as Spark, Trino, and DuckDB can read this layout:

$ pq split products.parquet --partition-by category -o split_category/

Each partition directory contains only the matching rows:

$ pq cat split_category/category=electronics/products.parquet
╭─────────────┬────┬──────────┬────────┬───────╮
│ category    ┆ id ┆ name     ┆ price  ┆ stock │
╞═════════════╪════╪══════════╪════════╪═══════╡
│ electronics ┆ 3  ┆ Gadget X ┆ 49.99  ┆ 30    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ electronics ┆ 4  ┆ Gadget Y ┆ 79.99  ┆ 15    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ electronics ┆ 5  ┆ Gizmo    ┆ 199.99 ┆ 5     │
╰─────────────┴────┴──────────┴────────┴───────╯

$ pq cat split_category/category=tools/products.parquet
╭──────────┬────┬───────────┬───────┬───────╮
│ category ┆ id ┆ name      ┆ price ┆ stock │
╞══════════╪════╪═══════════╪═══════╪═══════╡
│ tools    ┆ 1  ┆ Widget A  ┆ 9.99  ┆ 150   │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ tools    ┆ 2  ┆ Widget B  ┆ 14.99 ┆ 80    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ tools    ┆ 7  ┆ Doohickey ┆ 24.99 ┆ 60    │
╰──────────┴────┴───────────┴───────┴───────╯
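The directory layout can be sketched as a group-by on the partition column. The Python below is a model only, not pq's implementation: partition_paths is a hypothetical helper, rows are reduced to id and category, and the misc path is inferred from the same column=value pattern rather than shown in the output above.

```python
from collections import defaultdict

# (id, category) pairs from products.json.
rows = [
    {"id": 1, "category": "tools"},
    {"id": 2, "category": "tools"},
    {"id": 3, "category": "electronics"},
    {"id": 4, "category": "electronics"},
    {"id": 5, "category": "electronics"},
    {"id": 6, "category": "misc"},
    {"id": 7, "category": "tools"},
    {"id": 8, "category": "misc"},
]


def partition_paths(rows, column, filename="products.parquet"):
    """Map each Hive-style relative path to the ids it would hold (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[f"{column}={row[column]}/{filename}"].append(row["id"])
    return dict(groups)


layout = partition_paths(rows, "category")
print(layout["category=electronics/products.parquet"])  # [3, 4, 5]
print(layout["category=tools/products.parquet"])        # [1, 2, 7]
```

Each row lands in exactly one directory, so the partition files together contain the same 8 rows as the input.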