Getting Started

pq is a Swiss Army knife for Parquet files. This tutorial walks through the basics: importing data from JSON, inspecting and previewing files, querying with SQL, and exporting back to JSON.

Create a sample dataset

Start with a small JSON file containing user records:

users.json

[
  {"name": "Alice", "age": 30, "city": "New York", "score": 92.5, "active": true},
  {"name": "Bob", "age": 25, "city": "Los Angeles", "score": 88.0, "active": true},
  {"name": "Charlie", "age": 35, "city": "Chicago", "score": 76.3, "active": false},
  {"name": "Diana", "age": 28, "city": "New York", "score": 95.1, "active": true},
  {"name": "Eve", "age": 32, "city": "Los Angeles", "score": 81.7, "active": false}
]
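
One way to create this file from a terminal is a shell here-document. This is only a convenience sketch and not part of pq; any text editor works just as well. The quoted 'EOF' delimiter keeps the shell from expanding anything inside the JSON.

```shell
# Write the five sample records to users.json in the current directory.
# The quoted 'EOF' delimiter stops the shell from expanding the body.
cat > users.json <<'EOF'
[
  {"name": "Alice", "age": 30, "city": "New York", "score": 92.5, "active": true},
  {"name": "Bob", "age": 25, "city": "Los Angeles", "score": 88.0, "active": true},
  {"name": "Charlie", "age": 35, "city": "Chicago", "score": 76.3, "active": false},
  {"name": "Diana", "age": 28, "city": "New York", "score": 95.1, "active": true},
  {"name": "Eve", "age": 32, "city": "Los Angeles", "score": 81.7, "active": false}
]
EOF
```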

Import into Parquet

Convert the JSON file to Parquet with import; the -o flag names the output file:

$ pq import users.json -o users.parquet
Converted 5 rows to users.parquet

Inspect file metadata

Use info to see a summary of the file:

$ pq info users.parquet
File:         users.parquet
Size:         1.8 KB
Rows:         5
Row Groups:   1
Columns:      5
Format:       v1.0
Created by:   parquet-rs version 53.4.1
Compression:  ZSTD(ZstdLevel(1))
Metadata:
  ARROW:schema:
    active: boolean (nullable)
    age: int64 (nullable)
    city: string (nullable)
    name: string (nullable)
    score: float64 (nullable)

View the schema

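The schema command shows column names and types. In this example the columns are listed alphabetically rather than in the key order used in users.json: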

$ pq schema users.parquet
Schema (5 columns):
├── active: boolean (nullable)
├── age: int64 (nullable)
├── city: string (nullable)
├── name: string (nullable)
└── score: float64 (nullable)

Preview data with head and tail

Show the first 3 rows:

$ pq head users.parquet -n 3
╭────────┬─────┬─────────────┬─────────┬───────╮
│ active ┆ age ┆ city        ┆ name    ┆ score │
╞════════╪═════╪═════════════╪═════════╪═══════╡
│ true   ┆ 30  ┆ New York    ┆ Alice   ┆ 92.5  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true   ┆ 25  ┆ Los Angeles ┆ Bob     ┆ 88.0  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false  ┆ 35  ┆ Chicago     ┆ Charlie ┆ 76.3  │
╰────────┴─────┴─────────────┴─────────┴───────╯

Show the last 2 rows:

$ pq tail users.parquet -n 2
╭────────┬─────┬─────────────┬───────┬───────╮
│ active ┆ age ┆ city        ┆ name  ┆ score │
╞════════╪═════╪═════════════╪═══════╪═══════╡
│ true   ┆ 28  ┆ New York    ┆ Diana ┆ 95.1  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false  ┆ 32  ┆ Los Angeles ┆ Eve   ┆ 81.7  │
╰────────┴─────┴─────────────┴───────┴───────╯

Count rows

The count command prints the total number of rows in the file:

$ pq count users.parquet
5

Dump rows with cat

The cat command prints the rows of a file; use --limit to cap how many are shown:

$ pq cat users.parquet --limit 3
╭────────┬─────┬─────────────┬─────────┬───────╮
│ active ┆ age ┆ city        ┆ name    ┆ score │
╞════════╪═════╪═════════════╪═════════╪═══════╡
│ true   ┆ 30  ┆ New York    ┆ Alice   ┆ 92.5  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true   ┆ 25  ┆ Los Angeles ┆ Bob     ┆ 88.0  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false  ┆ 35  ┆ Chicago     ┆ Charlie ┆ 76.3  │
╰────────┴─────┴─────────────┴─────────┴───────╯

Random sample

The sample command returns a random selection of rows, with -n setting how many. Use --seed for reproducible results:

$ pq sample users.parquet -n 2 --seed 42
╭────────┬─────┬─────────────┬─────────┬───────╮
│ active ┆ age ┆ city        ┆ name    ┆ score │
╞════════╪═════╪═════════════╪═════════╪═══════╡
│ true   ┆ 25  ┆ Los Angeles ┆ Bob     ┆ 88.0  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false  ┆ 35  ┆ Chicago     ┆ Charlie ┆ 76.3  │
╰────────┴─────┴─────────────┴─────────┴───────╯

Query with SQL

Use the sql command to run SQL queries. Reference a Parquet file in the FROM clause by its ./ path, wrapped in single quotes:

$ pq sql "SELECT name, age FROM './users.parquet' WHERE age > 28 ORDER BY age"
╭─────────┬─────╮
│ name    ┆ age │
╞═════════╪═════╡
│ Alice   ┆ 30  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Eve     ┆ 32  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Charlie ┆ 35  │
╰─────────┴─────╯
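
The quoting in that command has two layers: the outer double quotes belong to the shell, and the inner single quotes are part of the SQL text that pq receives. The sketch below builds the same query from a shell variable to make the layering visible. The variable names file and query are illustrative, and the guard around pq is only there so the snippet is harmless on a machine where pq or the file is missing.

```shell
# Path to the Parquet file; the ./ prefix matches the form used in this tutorial.
file="./users.parquet"

# Double quotes let the shell expand $file, while the inner single quotes
# pass through unchanged and mark the path inside the SQL text.
query="SELECT name, age FROM '$file' WHERE age > 28 ORDER BY age"

# Show the exact text that would be handed to pq.
printf '%s\n' "$query"

# Run the query only when pq is on PATH and the file exists.
if command -v pq >/dev/null 2>&1 && [ -f "$file" ]; then
  pq sql "$query"
fi
```

Wrapping the whole query in single quotes instead would prevent the shell from expanding $file, so the double-quoted form is the one to use when the path comes from a variable.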

Export back to JSON

Export the Parquet file back to JSON:

$ pq export users.parquet -o users_out.json
Exported 5 rows to users_out.json

Verify the exported content:

$ cat users_out.json
[
  {
    "active": true,
    "age": 30,
    "city": "New York",
    "name": "Alice",
    "score": 92.5
  },
  ...
]
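
For a quick sanity check beyond reading the file, the record counts of the original and exported files can be compared with standard tools. This is a rough sketch, not a pq feature: count_records is a hypothetical helper, and it assumes each record contains exactly one "name" key, as in the sample data and the excerpt above. It does not compare values.

```shell
# Hypothetical helper: count records by counting lines that hold a "name" key.
# Assumption: each record has exactly one "name" key and no two share a line.
count_records() {
  grep -c '"name":' "$1"
}

# Compare the original and exported files when both are present.
if [ -f users.json ] && [ -f users_out.json ]; then
  if [ "$(count_records users.json)" -eq "$(count_records users_out.json)" ]; then
    echo "record counts match"
  else
    echo "record counts differ" >&2
  fi
fi
```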