Sampling¶
ASQL provides intuitive syntax for random sampling, essential for exploratory data analysis and working with large datasets.
Fixed Sample Size¶
Get a random sample of N rows:
from orders
sample 100
This returns 100 random rows from the table.
Generated SQL:
SELECT * FROM orders ORDER BY RANDOM() LIMIT 100
Percentage Sampling¶
Get a random percentage of rows:
from orders
sample 10%
This returns approximately 10% of the rows, randomly selected.
Generated SQL (using SQL standard TABLESAMPLE):
SELECT * FROM orders TABLESAMPLE BERNOULLI(10)
Dialect Variations¶
| Dialect | SQL Output |
|---|---|
| PostgreSQL/Redshift | TABLESAMPLE BERNOULLI(10) or TABLESAMPLE SYSTEM(10) |
| BigQuery | TABLESAMPLE SYSTEM(10 PERCENT) |
| Snowflake | SAMPLE (10) |
| DuckDB | USING SAMPLE 10% |
Stratified Sampling¶
Get N random rows per group value (stratified sampling):
from orders
sample 100 per category
This returns 100 random rows for each distinct value of category.
Generated SQL:
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY category ORDER BY RANDOM()) <= 100
Use Cases¶
- Balanced training datasets — Ensure ML models see equal examples from each class
- Representative samples — Get proportional data across categories
- Quality checks — Sample from each segment for review
Combining with Other Operations¶
Sampling can be combined with filters and other operations:
from orders
where status = "completed"
sample 1000
order by created_at
Note: The order by applies to the final sampled result, not to the sampling process itself.
Real-World Examples¶
Quick Data Exploration¶
-- Get a quick look at recent orders
from orders
sample 50
order by -created_at
Stratified Sample for Analysis¶
-- Sample orders from each region
from orders
sample 100 per region
select region, customer_id, amount, created_at
Sample for Model Training¶
-- Balanced sample across categories
from products
sample 500 per category
select *
Sample with Filters¶
-- Sample active users with purchases
from users
where is_active
where total_purchases > 0
sample 1000
Performance Considerations¶
| Method | Performance | Best For |
|---|---|---|
sample N |
Requires full scan | Small tables, any dialect |
sample N% |
Fast (block sampling) | Large tables with TABLESAMPLE support |
sample N per col |
Window function overhead | Stratified sampling |
Tips¶
- For very large tables, prefer
sample N%which uses block-level sampling - For reproducibility, consider adding a seed (dialect-specific)
- For exact counts,
sample Nis guaranteed;sample N%is approximate
Quick Reference¶
| Syntax | Description | Example |
|---|---|---|
sample N |
Random N rows | sample 100 |
sample N% |
Random N percent | sample 10% |
sample N per col |
N rows per group | sample 100 per category |
Next Steps¶
- Pipeline Basics — Combining sample with other operations
- Window Functions — How stratified sampling works
- Examples — More sampling patterns