Tometo Tomato
Fuzzy join for messy CSV data
Your data has typos. Your reference file doesn’t. Let’s fix that.
Tometo Tomato is a command-line tool that connects two CSV files by similarity, not just exact matches. It handles typos, abbreviations, accents, and formatting differences — so you can enrich your messy data using an authoritative reference table.
Built on DuckDB and rapidfuzz.
How It Works
The workflow has two steps:
      Your file          Reference file
    (dirty data)         (ground truth)
          │                     │
          └──── tometo_tomato ──┘
                     │
               Mapping table
          (dirty → clean + score)
                     │
        ┌── exact join (duckdb) ──┐
        │                         │
    Your file              Corrected file
    (original)               (enriched)
Step 1. Run tometo_tomato to fuzzy-match your dirty column against a reference file. The output is a mapping table that links each dirty value to the best match.
Step 2. Use a standard exact join (e.g. with DuckDB) to bring the corrections back into your original file.
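As a rough illustration of Step 2, the sketch below joins the original rows to the mapping table on the exact dirty values. It is not part of tometo_tomato: it assumes Python's standard-library `sqlite3` as a stand-in for DuckDB so that it runs without extra installs, and the helper name `enrich` is hypothetical. The rows are copied from the Quick Example tables.

```python
import sqlite3

# Rows copied from the Quick Example: the dirty input and the mapping table.
work_rows = [
    ("Castro", "Pugla"),
    ("Castro", "Lombardia"),
    ("Calliano", "Trentno-Alto Adige"),
    ("Calliano", "Piemnte"),
]
mapping_rows = [
    # (municipality, region, ref_municipality_name, ref_region, istat_code)
    ("Castro", "Pugla", "Castro", "Puglia", "075019"),
    ("Castro", "Lombardia", "Castro", "Lombardia", "016065"),
    ("Calliano", "Trentno-Alto Adige", "Calliano", "Trentino-Alto Adige", "022032"),
    ("Calliano", "Piemnte", "Calliano", "Piemonte", "005013"),
]


def enrich(work, mapping):
    """Left-join work rows to the mapping table on the exact dirty values.

    Hypothetical helper for illustration; uses SQLite instead of DuckDB.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE work (municipality TEXT, region TEXT)")
    con.execute(
        "CREATE TABLE mapping (municipality TEXT, region TEXT, "
        "ref_municipality_name TEXT, ref_region TEXT, istat_code TEXT)"
    )
    con.executemany("INSERT INTO work VALUES (?, ?)", work)
    con.executemany("INSERT INTO mapping VALUES (?, ?, ?, ?, ?)", mapping)
    rows = con.execute(
        """
        SELECT w.municipality, w.region, m.ref_region, m.istat_code
        FROM work AS w
        LEFT JOIN mapping AS m
          ON w.municipality = m.municipality
         AND w.region = m.region
        ORDER BY w.rowid
        """
    ).fetchall()
    con.close()
    return rows


enriched = enrich(work_rows, mapping_rows)
for row in enriched:
    print(row)
```

A left join keeps every original row, so values that found no match in Step 1 come back with empty correction columns instead of disappearing.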
Quick Example
Your file (work.csv), with typos in the region column:

| municipality | region |
|---|---|
| Castro | Pugla |
| Castro | Lombardia |
| Calliano | Trentno-Alto Adige |
| Calliano | Piemnte |

Reference file (reference.csv):

| municipality_name | region | istat_code |
|---|---|---|
| Castro | Puglia | 075019 |
| Castro | Lombardia | 016065 |
| Calliano | Trentino-Alto Adige | 022032 |
| Calliano | Piemonte | 005013 |
    tometo_tomato work.csv reference.csv \
      -j "municipality,municipality_name" \
      -j "region,region" \
      -a istat_code -s -t 70 \
      -o mapping.csv

Resulting mapping table (mapping.csv):

| municipality | region | ref_municipality_name | ref_region | istat_code | avg_score |
|---|---|---|---|---|---|
| Castro | Pugla | Castro | Puglia | 075019 | 95.5 |
| Castro | Lombardia | Castro | Lombardia | 016065 | 100.0 |
| Calliano | Trentno-Alto Adige | Calliano | Trentino-Alto Adige | 022032 | 98.6 |
| Calliano | Piemnte | Calliano | Piemonte | 005013 | 96.7 |
Matching on two column pairs (-j repeated) ensures “Castro, Pugla” maps to the correct Castro in Puglia, not the one in Lombardia.
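To see why the second column pair breaks the tie, here is a minimal sketch of averaging per-column similarity and keeping the best reference row. It is an illustration, not the tool's implementation: it assumes `difflib.SequenceMatcher` from Python's standard library as a stand-in for rapidfuzz (so scores may not always agree with `avg_score`), and the names `similarity` and `best_match` are hypothetical. The default threshold of 70 mirrors `-t 70` in the command above.

```python
from difflib import SequenceMatcher

# Reference rows copied from the Quick Example: (municipality_name, region, istat_code).
reference = [
    ("Castro", "Puglia", "075019"),
    ("Castro", "Lombardia", "016065"),
    ("Calliano", "Trentino-Alto Adige", "022032"),
    ("Calliano", "Piemonte", "005013"),
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib ratio, not rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(municipality: str, region: str, threshold: float = 70.0):
    """Return (reference_row, average_score) for the best row, or None below threshold."""
    scored = [
        (ref, (similarity(municipality, ref[0]) + similarity(region, ref[1])) / 2)
        for ref in reference
    ]
    ref, score = max(scored, key=lambda pair: pair[1])
    return (ref, score) if score >= threshold else None


# Both Castro rows match the municipality perfectly; the region score decides.
print(best_match("Castro", "Pugla"))
```

Scoring on the municipality alone would leave the two Castro rows tied at 100, which is why the region pair is needed to pick the right one.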