fake data. done right.
because your real data is a disaster anyway
your real data
is lying to you
You've seen it. A column called email containing "test@test.com", "N/A", and a phone number. A date field with 1900-01-01. An age column that thinks someone is 847 years old. A country column that contains 73 different spellings of "Italy".
Real data is not "raw" — it's rotten. And the worst part? You can't share it. It's sensitive, it's GDPR-locked, it's "confidential", it lives in a database only accessible from the office VPN on Tuesdays.
☠ real data
- null values where there shouldn't be any
- emails like "aaa@bbb" or just "no"
- ages of 0, 999, or -3
- duplicate IDs that are "definitely unique"
- dates from 1900 and 2099 mixed together
- GDPR-protected — can't share with the team
- requires a VPN, two approvals and a sacrifice
- breaks your pipeline in a new way every Monday
✓ fauxdata
- zero nulls — unless you asked for them
- emails that look like actual emails
- ages between 18 and 90, as specified
- IDs that are actually unique
- dates strictly within your range
- shareable, reproducible, seedable
- runs in milliseconds, from any machine
- validated before it even reaches your pipeline
define once.
generate forever.
Write a YAML schema. Run one command. Get a perfect, validated, reproducible dataset — in any format, any size, any locale.
schema-first
One YAML file defines everything: column types, ranges, presets, and validation rules. The schema is both the blueprint and the contract.
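As a sketch, a minimal schema might look like the following. The `seed` and `locale` keys are taken directly from the features described on this page; the column syntax (`name`, `preset`, `type`, `min`, `max`, `unique`) is illustrative and may differ from the actual schema format:

```yaml
# Hypothetical schema sketch. Only `seed` and `locale` are documented
# on this page; the column keys below are illustrative.
seed: 42          # same seed, same dataset, every time
locale: IT        # Italian names, cities, IBANs, phone formats
rows: 1000
columns:
  - name: id
    preset: uuid4
    unique: true  # IDs that are actually unique
  - name: full_name
    preset: name
  - name: email
    preset: email
  - name: age
    type: int
    min: 18       # validation rule: ages between 18 and 90
    max: 90
```

The same file then drives both generation and validation, so the constraints you write here are the ones your data is checked against.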
locale-aware
Set locale: IT and get Italian names, cities, email domains, IBANs, and phone formats — all coherent within each row. Works for 100+ countries.
validated by design
The same schema that defines generation also drives validation. Run --validate and know your data is correct before it touches your pipeline.
reproducible
Set seed: 42 and generate the exact same dataset every time. Share the schema, share the seed, share the data.
pipeline-friendly
Use --out - to pipe data directly to stdout. No files, no noise — just clean data flowing through your tools.
multi-format
CSV, Parquet, JSON, JSONL — the dataset, not the tool, decides the format. Switch with one flag.
YAML so readable
your PM could write it
A schema is a plain YAML file. It describes the structure of your dataset, the constraints for each column, and the validation rules to apply. One file. Everything in it.
Available presets: name · email · phone_number · city · country_code_2 · company · job · address · postcode · ipv4 · uuid4 · iban · url · user_name · sentence · word and more.
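For instance, a columns section built entirely from presets could read like this. The preset names come from the list above; the surrounding column syntax is an assumption about the schema format:

```yaml
# Illustrative column definitions using documented preset names.
columns:
  - name: company
    preset: company
  - name: website
    preset: url
  - name: office_ip
    preset: ipv4
  - name: postcode
    preset: postcode
```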
four commands.
one tool.
| command | what it does |
|---|---|
| `fauxdata generate SCHEMA` | Generate a dataset from a YAML schema. Options: `--rows`, `--format`, `--seed`, `--out -` (stdout), `--validate` |
| `fauxdata validate DATASET SCHEMA` | Validate an existing file against a schema. Exits with code 1 on failure — CI-ready. |
| `fauxdata preview DATASET` | Show the first N rows and column statistics (type, nulls, unique, min/max). |
| `fauxdata init [--name]` | Interactive wizard to create a new schema template. |
up and running
in 30 seconds.
uv tool install fauxdata-cli
fauxdata --help
uv installs fauxdata as an isolated tool available from any directory — no virtualenv activation needed.
pip install fauxdata-cli
fauxdata --help