SQL Queries #

pq sql executes SQL queries on Parquet files using Apache DataFusion. Files are referenced in the FROM clause with single-quoted paths.

Setup: create test data #

A users table:

users.json

[
  {"name": "Alice", "age": 30, "city": "New York", "score": 92.5, "active": true},
  {"name": "Bob", "age": 25, "city": "Los Angeles", "score": 88.0, "active": true},
  {"name": "Charlie", "age": 35, "city": "Chicago", "score": 76.3, "active": false},
  {"name": "Diana", "age": 28, "city": "New York", "score": 95.1, "active": true},
  {"name": "Eve", "age": 32, "city": "Los Angeles", "score": 81.7, "active": false}
]

An orders table:

orders.json

[
  {"order_id": 1, "customer": "Alice", "product": "Widget", "quantity": 3, "price": 9.99},
  {"order_id": 2, "customer": "Bob", "product": "Gadget", "quantity": 1, "price": 24.99},
  {"order_id": 3, "customer": "Alice", "product": "Gadget", "quantity": 2, "price": 24.99},
  {"order_id": 4, "customer": "Charlie", "product": "Widget", "quantity": 5, "price": 9.99},
  {"order_id": 5, "customer": "Diana", "product": "Doohickey", "quantity": 1, "price": 49.99},
  {"order_id": 6, "customer": "Bob", "product": "Widget", "quantity": 2, "price": 9.99},
  {"order_id": 7, "customer": "Eve", "product": "Gadget", "quantity": 1, "price": 24.99},
  {"order_id": 8, "customer": "Alice", "product": "Doohickey", "quantity": 1, "price": 49.99}
]

Import both files:

$ pq import users.json -o users.parquet
Converted 5 rows to users.parquet
$ pq import orders.json -o orders.parquet
Converted 8 rows to orders.parquet

SELECT with WHERE and ORDER BY #

Filter and sort rows:

$ pq sql "SELECT customer, product, quantity
           FROM './orders.parquet'
           WHERE quantity > 1
           ORDER BY quantity DESC"
╭──────────┬─────────┬──────────╮
│ customer ┆ product ┆ quantity │
╞══════════╪═════════╪══════════╡
│ Charlie  ┆ Widget  ┆ 5        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ Alice    ┆ Widget  ┆ 3        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ Alice    ┆ Gadget  ┆ 2        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ Bob      ┆ Widget  ┆ 2        │
╰──────────┴─────────┴──────────╯

LIMIT #

Return only the top results:

$ pq sql "SELECT product, SUM(quantity) as total_qty
           FROM './orders.parquet'
           GROUP BY product
           ORDER BY total_qty DESC
           LIMIT 2"
╭─────────┬───────────╮
│ product ┆ total_qty │
╞═════════╪═══════════╡
│ Widget  ┆ 10        │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Gadget  ┆ 4         │
╰─────────┴───────────╯

GROUP BY with aggregates #

Summarize orders per customer:

$ pq sql "SELECT customer,
                 COUNT(*) as num_orders,
                 SUM(quantity * price) as total_spent
           FROM './orders.parquet'
           GROUP BY customer
           ORDER BY total_spent DESC"
╭──────────┬────────────┬─────────────╮
│ customer ┆ num_orders ┆ total_spent │
╞══════════╪════════════╪═════════════╡
│ Alice    ┆ 3          ┆ 129.94      │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Diana    ┆ 1          ┆ 49.99       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Charlie  ┆ 1          ┆ 49.95       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bob      ┆ 2          ┆ 44.97       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Eve      ┆ 1          ┆ 24.99       │
╰──────────┴────────────┴─────────────╯

Aggregate functions #

Compute min, max, average, and sum across all orders:

$ pq sql "SELECT MIN(price) as min_price,
                 MAX(price) as max_price,
                 ROUND(AVG(price), 2) as avg_price,
                 SUM(quantity) as total_items
           FROM './orders.parquet'"
╭───────────┬───────────┬───────────┬─────────────╮
│ min_price ┆ max_price ┆ avg_price ┆ total_items │
╞═══════════╪═══════════╪═══════════╪═════════════╡
│ 9.99      ┆ 49.99     ┆ 25.62     ┆ 16          │
╰───────────┴───────────┴───────────┴─────────────╯

Saving results to files #

Use -o to write query results to a file. The format is auto-detected from the extension.

Save to Parquet:

$ pq sql "SELECT name, age FROM './users.parquet' WHERE active = true ORDER BY age" \
    -o active_users.parquet
Wrote 3 rows to active_users.parquet

$ pq cat active_users.parquet
╭───────┬─────╮
│ name  ┆ age │
╞═══════╪═════╡
│ Bob   ┆ 25  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Diana ┆ 28  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Alice ┆ 30  │
╰───────┴─────╯

Save to JSON:

$ pq sql "SELECT name, score FROM './users.parquet' ORDER BY score DESC LIMIT 3" \
    -o top_scores.json
Wrote 3 rows to top_scores.json

$ cat top_scores.json
[
  {"name": "Diana", "score": 95.1},
  {"name": "Alice", "score": 92.5},
  {"name": "Bob", "score": 88.0}
]

JOIN two files #

Join users with their orders to see order counts by city:

$ pq sql "SELECT u.name, u.city, COUNT(o.order_id) as num_orders
           FROM './users.parquet' u
           JOIN './orders.parquet' o ON u.name = o.customer
           GROUP BY u.name, u.city
           ORDER BY num_orders DESC"
╭─────────┬─────────────┬────────────╮
│ name    ┆ city        ┆ num_orders │
╞═════════╪═════════════╪════════════╡
│ Alice   ┆ New York    ┆ 3          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bob     ┆ Los Angeles ┆ 2          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Diana   ┆ New York    ┆ 1          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Eve     ┆ Los Angeles ┆ 1          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Charlie ┆ Chicago     ┆ 1          │
╰─────────┴─────────────┴────────────╯

Querying nested fields #

Parquet files often contain structs, lists, and maps. DataFusion supports bracket notation to reach into nested columns.

Create a file with nested data:

events.json

[
  {"event": "click", "user_id": 402, "city": "Seattle", "payload": {"action": "buy", "metadata": {"source": "ad", "campaign": "summer"}}},
  {"event": "view", "user_id": 117, "city": "Portland", "payload": {"action": "browse", "metadata": {"source": "organic", "campaign": null}}},
  {"event": "click", "user_id": 892, "city": "Denver", "payload": {"action": "buy", "metadata": {"source": "ad", "campaign": "winter"}}},
  {"event": "view", "user_id": 402, "city": "Seattle", "payload": {"action": "browse", "metadata": {"source": "referral", "campaign": null}}},
  {"event": "click", "user_id": 117, "city": "Portland", "payload": {"action": "signup", "metadata": {"source": "ad", "campaign": "summer"}}}
]

$ pq import events.json -o events.parquet
Converted 5 rows to events.parquet

Access struct fields with bracket notation:

$ pq sql "SELECT event, payload['action'] as action, city
           FROM './events.parquet'
           WHERE payload['action'] = 'buy'"
╭───────┬────────┬─────────╮
│ event ┆ action ┆ city    │
╞═══════╪════════╪═════════╡
│ click ┆ buy    ┆ Seattle │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ click ┆ buy    ┆ Denver  │
╰───────┴────────┴─────────╯

Reach into deeply nested structs by chaining brackets:

$ pq sql "SELECT payload['action'] as action,
                 payload['metadata']['source'] as source,
                 COUNT(*) as n
           FROM './events.parquet'
           GROUP BY payload['action'], payload['metadata']['source']
           ORDER BY n DESC"
╭────────┬──────────┬───╮
│ action ┆ source   ┆ n │
╞════════╪══════════╪═══╡
│ buy    ┆ ad       ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌┤
│ browse ┆ organic  ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌┤
│ browse ┆ referral ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌┤
│ signup ┆ ad       ┆ 1 │
╰────────┴──────────┴───╯

Filter on nested values and join with flat data:

$ pq sql "SELECT e.city, COUNT(*) as ad_clicks
           FROM './events.parquet' e
           WHERE e.event = 'click'
             AND e.payload['metadata']['source'] = 'ad'
           GROUP BY e.city
           ORDER BY ad_clicks DESC"
╭──────────┬────────────╮
│ city     ┆ ad_clicks  │
╞══════════╪════════════╡
│ Portland ┆ 1          │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Denver   ┆ 1          │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Seattle  ┆ 1          │
╰──────────┴────────────╯

SQL reference #

pq uses Apache DataFusion for SQL execution. See the DataFusion SQL reference for the full list of supported functions and syntax.