fake data. done right.
because your real data is a disaster anyway
your real data
is lying to you
You've seen it. A column called email containing "test@test.com", "N/A", and a phone number. A date field with 1900-01-01. An age column that thinks someone is 847 years old. A country column that contains 73 different spellings of "Italy".
Real data is not "raw" — it's rotten. And the worst part? You can't share it. It's sensitive, it's GDPR-locked, it's "confidential", it lives in a database only accessible from the office VPN on Tuesdays.
☠ real data
- null values where there shouldn't be any
- emails like "aaa@bbb" or just "no"
- ages of 0, 999, or -3
- duplicate IDs that are "definitely unique"
- dates from 1900 and 2099 mixed together
- GDPR-protected — can't share with the team
- requires a VPN, two approvals and a sacrifice
- breaks your pipeline in a new way every Monday
✓ fauxdata
- zero nulls — unless you asked for them
- emails that look like actual emails
- ages between 18 and 90, as specified
- IDs that are actually unique
- dates strictly within your range
- shareable, reproducible, seedable
- runs in milliseconds, from any machine
- validated before it even reaches your pipeline
define once.
generate forever.
Write a YAML schema. Run one command. Get a perfect, validated, reproducible dataset — in any format, any size, any locale.
schema-first
One YAML file defines everything: column types, ranges, presets, and validation rules. The schema is both the blueprint and the contract.
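As a sketch, a minimal schema might look like the following. The `seed` and `locale` keys are taken directly from the features described on this page; the column syntax (`name`, `preset`, `type`, `min`, `max`, `unique`) is illustrative and may differ from the actual schema format:

```yaml
# Hypothetical schema sketch. Only `seed` and `locale` are documented
# on this page; the column keys below are illustrative.
seed: 42          # same seed, same dataset, every time
locale: IT        # Italian names, cities, IBANs, phone formats
rows: 1000
columns:
  - name: id
    preset: uuid4
    unique: true  # IDs that are actually unique
  - name: full_name
    preset: name
  - name: email
    preset: email
  - name: age
    type: int
    min: 18       # validation rule: ages between 18 and 90
    max: 90
```

The same file then drives both generation and validation, so the constraints you write here are the ones your data is checked against.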
locale-aware
Set locale: IT and get Italian names, cities, email domains, IBANs, and phone formats — all coherent within each row. Works for 100+ countries.
validated by design
The same schema that defines generation also drives validation. Run --validate and know your data is correct before it touches your pipeline.
reproducible
Set seed: 42 and generate the exact same dataset every time. Share the schema, share the seed, share the data.
pipeline-friendly
Use --out - to pipe data directly to stdout. No files, no noise — just clean data flowing through your tools.
multi-format
CSV, Parquet, JSON, JSONL — the dataset, not the tool, decides the format. Switch with one flag.
YAML so readable
your PM could write it
A schema is a plain YAML file. It describes the structure of your dataset, the constraints for each column, and the validation rules to apply. One file. Everything in it.
Available presets: name · email · phone_number · city · country_code_2 · company · job · address · postcode · ipv4 · uuid4 · iban · url · user_name · sentence · word and more.
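For instance, a columns section built entirely from presets could read like this. The preset names come from the list above; the surrounding column syntax is an assumption about the schema format:

```yaml
# Illustrative column definitions using documented preset names.
columns:
  - name: company
    preset: company
  - name: website
    preset: url
  - name: office_ip
    preset: ipv4
  - name: postcode
    preset: postcode
```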
four commands.
one tool.
| command | what it does |
|---|---|
| `fauxdata generate SCHEMA` | Generate a dataset from a YAML schema. Options: `--rows`, `--format`, `--seed`, `--out -` (stdout), `--validate` |
| `fauxdata validate DATASET SCHEMA` | Validate an existing file against a schema. Exits with code 1 on failure — CI-ready. |
| `fauxdata preview DATASET` | Show the first N rows and column statistics (type, nulls, unique, min/max). |
| `fauxdata init [--name]` | Interactive wizard to create a new schema template. |
up and running
in 30 seconds.
uv tool install fauxdata-cli
fauxdata --help
uv installs fauxdata as an isolated tool available from any directory — no virtualenv activation needed.
pip install fauxdata-cli
fauxdata --help