# Use Case Guide: Cleaning Dirty Data with a Reference Table
## The Problem
You have a CSV file where one or more columns contain messy values: typos, missing characters, inconsistent capitalization. A typical example is a column of city or municipality names entered by hand.
You also have an authoritative reference file — an official list with the correct names and additional attributes (such as codes or identifiers) that you need to bring into your working file.
The challenge: a regular join fails because the values do not match exactly. tometo_tomato solves this with a fuzzy join — it finds the best approximate match for each row, even when the text is not identical.
## The Workflow
The process has two steps:
```
 Your file              Reference file
(dirty data)            (ground truth)
     │                        │
     └───── tometo_tomato ────┘
                  │
            Mapping table
           (dirty → clean)
                  │
     ┌── exact join (duckdb) ──┐
     │                         │
 Your file               Corrected file
(original)                (enriched)
```
Step 1. Run tometo_tomato to produce a mapping table that links each dirty value to its best match in the reference file.
Step 2. Use a standard exact join (for example with DuckDB) to bring the corrections back into your original file.
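The two steps can be sketched in a few lines of Python. This is a toy illustration that uses difflib's `SequenceMatcher` as the similarity metric and the Example A sample data; it is not how tometo_tomato is implemented:

```python
from difflib import SequenceMatcher

# Step 1 (sketch): fuzzy-match each dirty value against the reference
# list and keep the best candidate above a threshold.
reference = {"Roma": "058091", "Milano": "015146", "Napoli": "063049"}
dirty = ["Roma", "Rma", "Milno", "Milano", "Npoli"]

def best_match(value, candidates, threshold=70):
    score, match = max(
        (SequenceMatcher(None, value, c).ratio() * 100, c) for c in candidates
    )
    return match if score >= threshold else None

mapping = {v: best_match(v, reference) for v in dirty}

# Step 2 (sketch): with the mapping in hand, enrichment is an exact
# lookup; no fuzziness is needed anymore.
rows = [("Rma", 2300), ("Milno", 800)]
enriched = [
    (city, amount, mapping[city], reference[mapping[city]])
    for city, amount in rows
]
print(enriched[0])  # ('Rma', 2300, 'Roma', '058091')
```

The point of the split is that the fuzzy (expensive, approximate) work happens once, producing a small mapping; the join back into the full dataset is then a cheap exact operation.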
## Example A: Single Join Column

### The data
Your working file (work_simple.csv):
| municipality | amount |
|---|---|
| Roma | 1500 |
| Rma | 2300 |
| Milno | 800 |
| Milano | 1200 |
| Npoli | 950 |
Reference file (reference_simple.csv):
| municipality_name | istat_code |
|---|---|
| Roma | 058091 |
| Milano | 015146 |
| Napoli | 063049 |
### Step 1: Generate the mapping
```bash
tometo_tomato work_simple.csv reference_simple.csv \
  -j "municipality,municipality_name" \
  -a istat_code \
  -s -t 70 \
  -o mapping.csv
```

Flags explained:

- `-j "municipality,municipality_name"` — the pair of columns to compare (input column, reference column)
- `-a istat_code` — an extra column from the reference file to include in the output
- `-s` — show the similarity score
- `-t 70` — minimum similarity threshold (0-100)
- `-o mapping.csv` — output file path
Result (mapping.csv):
| municipality | ref_municipality_name | istat_code | avg_score |
|---|---|---|---|
| Roma | Roma | 058091 | 100.0 |
| Rma | Roma | 058091 | 85.7 |
| Milano | Milano | 015146 | 100.0 |
| Milno | Milano | 015146 | 90.9 |
| Npoli | Napoli | 063049 | 90.9 |
Each row in the mapping connects a dirty value to the best matching reference value. The avg_score column shows how confident the match is (100 = perfect).
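For intuition about the numbers: a normalized edit similarity reproduces the scores in this table. The sketch below uses difflib's `SequenceMatcher.ratio` (2*M/T, scaled to 0-100); this is an assumption for illustration, not necessarily the metric tometo_tomato applies internally:

```python
from difflib import SequenceMatcher

def score(a, b):
    # ratio() = 2*M/T, where M counts matching characters and
    # T is the combined length of both strings.
    return round(SequenceMatcher(None, a, b).ratio() * 100, 1)

print(score("Rma", "Roma"))      # 85.7
print(score("Milno", "Milano"))  # 90.9
print(score("Npoli", "Napoli"))  # 90.9
```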
### Step 2: Join back to your original file
```bash
duckdb -c "
COPY (
  SELECT
    w.*,
    m.ref_municipality_name AS municipality_correct,
    m.istat_code,
    ROUND(m.avg_score, 1) AS score
  FROM read_csv_auto('work_simple.csv') w
  JOIN read_csv_auto('mapping.csv') m
    ON w.municipality = m.municipality
) TO 'work_corrected.csv' (HEADER);
"
```

The join condition is `w.municipality = m.municipality` — an exact join on the dirty column. This works because the mapping table contains exactly the original (dirty) values from your file.
Result (work_corrected.csv):
| municipality | amount | municipality_correct | istat_code | score |
|---|---|---|---|---|
| Roma | 1500 | Roma | 058091 | 100.0 |
| Rma | 2300 | Roma | 058091 | 85.7 |
| Milno | 800 | Milano | 015146 | 90.9 |
| Milano | 1200 | Milano | 015146 | 100.0 |
| Npoli | 950 | Napoli | 063049 | 90.9 |
Every row in your original file now has the correct municipality name and its ISTAT code.
## Example B: Multi-Column Join (Disambiguation)

### Why a single column is not always enough
Some municipality names exist in multiple regions. For example, in Italy:
| Municipality | Region | ISTAT Code |
|---|---|---|
| Castro | Puglia | 075019 |
| Castro | Lombardia | 016065 |
| Samone | Trentino-Alto Adige | 022164 |
| Samone | Piemonte | 001244 |
If you match only on the municipality name, “Castro” in Puglia could be incorrectly linked to “Castro” in Lombardia. You need a second column — the region — to disambiguate.
### The data
Your working file (work_file.csv):
```
id,municipality,region,amount
1,San Valentino in Abruzzo Citeriore,Abrzzo,1500
2,San Valentino Torio,Campnia,2300
3,Castro,Pugla,800
4,Castro,Lombardia,1200
5,Samone,Trntino-Alto Adige,950
6,Samone,Piemonte,3100
7,Calliano,Trentno-Alto Adige,400
8,Calliano,Piemnte,2750
9,Livo,Trentno-Alto Adige,1800
10,Livo,Lombardia,600
```
Notice the typos in the region column: “Abrzzo”, “Campnia”, “Pugla”, “Trntino-Alto Adige”, “Piemnte”.
Reference file (reference.csv):
```
municipality_name,region,istat_code
San Valentino in Abruzzo Citeriore,Abruzzo,068037
San Valentino Torio,Campania,065131
Castro,Puglia,075019
Castro,Lombardia,016065
Samone,Trentino-Alto Adige,022164
Samone,Piemonte,001244
Calliano,Trentino-Alto Adige,022032
Calliano,Piemonte,005013
Livo,Trentino-Alto Adige,022107
Livo,Lombardia,013130
```
### Step 1: Generate the mapping with two join pairs
```bash
tometo_tomato work_file.csv reference.csv \
  -j "municipality,municipality_name" \
  -j "region,region" \
  -a istat_code \
  -s -t 70 \
  -o mapping.csv
```

The key difference: the `-j` flag is repeated for each column pair. The tool computes a similarity score for each pair and averages them. This means both the municipality name and the region must be similar for a match to succeed.
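The averaging is easy to verify by hand. A sketch, again using difflib's ratio as an assumed stand-in for the tool's internal metric:

```python
from difflib import SequenceMatcher

def score(a, b):
    return SequenceMatcher(None, a, b).ratio() * 100

def avg_score(pairs):
    # One similarity per join pair, then the arithmetic mean.
    return round(sum(score(a, b) for a, b in pairs) / len(pairs), 1)

# "Castro, Pugla" against the two reference candidates:
puglia = avg_score([("Castro", "Castro"), ("Pugla", "Puglia")])
lombardia = avg_score([("Castro", "Castro"), ("Pugla", "Lombardia")])
print(puglia)              # 95.5
print(puglia > lombardia)  # True
```

Even though the municipality column matches both candidates perfectly, the region column drags the wrong candidate's average down, which is exactly what disambiguates the two Castros.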
Result (mapping.csv):
| municipality | region | ref_municipality_name | ref_region | istat_code | avg_score |
|---|---|---|---|---|---|
| Castro | Pugla | Castro | Puglia | 075019 | 95.5 |
| Castro | Lombardia | Castro | Lombardia | 016065 | 100.0 |
| Samone | Trntino-Alto Adige | Samone | Trentino-Alto Adige | 022164 | 98.6 |
| Samone | Piemonte | Samone | Piemonte | 001244 | 100.0 |
| Calliano | Trentno-Alto Adige | Calliano | Trentino-Alto Adige | 022032 | 98.6 |
| Calliano | Piemnte | Calliano | Piemonte | 005013 | 96.7 |
“Castro, Pugla” correctly maps to “Castro, Puglia” (not Lombardia) because the averaged score on both columns picks the right region.
### Step 2: Join back using both columns
```bash
duckdb -c "
COPY (
  SELECT
    w.*,
    m.ref_municipality_name AS municipality_correct,
    m.ref_region AS region_correct,
    m.istat_code,
    ROUND(m.avg_score, 1) AS score
  FROM read_csv_auto('work_file.csv') w
  JOIN read_csv_auto('mapping.csv') m
    ON w.municipality = m.municipality
   AND w.region = m.region
  ORDER BY w.id::INT
) TO 'work_corrected.csv' (HEADER);
"
```

The join now uses both dirty columns: `ON w.municipality = m.municipality AND w.region = m.region`. This ensures each row maps to exactly one entry in the mapping table.
Result (work_corrected.csv):
| id | municipality | region | amount | municipality_correct | region_correct | istat_code | score |
|---|---|---|---|---|---|---|---|
| 1 | San Valentino in Abruzzo Citeriore | Abrzzo | 1500 | San Valentino in Abruzzo Citeriore | Abruzzo | 068037 | 96.2 |
| 2 | San Valentino Torio | Campnia | 2300 | San Valentino Torio | Campania | 065131 | 96.7 |
| 3 | Castro | Pugla | 800 | Castro | Puglia | 075019 | 95.5 |
| 4 | Castro | Lombardia | 1200 | Castro | Lombardia | 016065 | 100.0 |
| 5 | Samone | Trntino-Alto Adige | 950 | Samone | Trentino-Alto Adige | 022164 | 98.6 |
| 6 | Samone | Piemonte | 3100 | Samone | Piemonte | 001244 | 100.0 |
| 7 | Calliano | Trentno-Alto Adige | 400 | Calliano | Trentino-Alto Adige | 022032 | 98.6 |
| 8 | Calliano | Piemnte | 2750 | Calliano | Piemonte | 005013 | 96.7 |
| 9 | Livo | Trentno-Alto Adige | 1800 | Livo | Trentino-Alto Adige | 022107 | 98.6 |
| 10 | Livo | Lombardia | 600 | Livo | Lombardia | 013130 | 100.0 |
All 10 rows are correctly matched, with the right ISTAT code for each municipality-region pair.
## Tips and Best Practices

### Reviewing low-confidence matches
Use the score column to spot matches that might be wrong. A quick check:
```bash
duckdb -c "
SELECT * FROM read_csv_auto('mapping.csv')
WHERE avg_score < 95
ORDER BY avg_score;
"
```

Rows with lower scores deserve manual review.
### Handling ambiguous matches
When two reference rows score equally well for the same input, tometo_tomato flags the match as ambiguous and excludes it from the clean output to avoid inserting incorrect data. Use `--output-ambiguous` (`-u`) to write these cases to a separate file for inspection:
```bash
tometo_tomato work_file.csv reference.csv \
  -j "municipality,municipality_name" \
  -j "region,region" \
  -a istat_code \
  -s -t 70 \
  -o mapping.csv \
  -u ambiguous.csv
```

### Unmatched rows
If an input row has no reference match above the threshold, it still appears in the mapping file — but with empty reference columns. This is a LEFT JOIN: no input row is ever lost. You can filter these out in Step 2 or handle them separately.
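A minimal sketch of that contract, with a hypothetical unmatched value ("Xyz") whose reference columns come back empty:

```python
import csv
import io

# A mapping file where "Xyz" found no reference match above the
# threshold: the row survives (LEFT JOIN), but with empty columns.
mapping_csv = """municipality,ref_municipality_name,istat_code,avg_score
Rma,Roma,058091,85.7
Xyz,,,
"""

rows = list(csv.DictReader(io.StringIO(mapping_csv)))
matched = [r for r in rows if r["ref_municipality_name"]]
unmatched = [r for r in rows if not r["ref_municipality_name"]]
print(len(rows), len(matched), len(unmatched))  # 2 1 1
```

The same filter works in SQL: test the reference column for NULL/empty before joining the corrections back.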
### Choosing the right threshold
- 90-100: strict, only very close matches (good for clean data with minor typos)
- 80-90: moderate, handles abbreviations and missing characters
- 70-80: permissive, catches more distant matches but needs manual review
### Performance on large datasets
For large files, use `--block-prefix N` to avoid comparing every pair of rows:

```bash
tometo_tomato work_file.csv reference.csv \
  -j "municipality,municipality_name" \
  -j "region,region" \
  -a istat_code \
  --block-prefix 3 \
  -s -t 70 \
  -o mapping.csv
```

This compares only rows whose join columns share the same first 3 characters, dramatically reducing computation time.
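Prefix blocking is a standard fuzzy-matching optimization. A sketch of the idea (not tometo_tomato's implementation), using hypothetical reference data:

```python
from collections import defaultdict

# Index reference values by their first N characters; a dirty value is
# then compared only against candidates in its own block.
def build_blocks(values, n=3):
    index = defaultdict(list)
    for v in values:
        index[v[:n]].append(v)
    return index

reference = ["Roma", "Milano", "Napoli", "Palermo"]
index = build_blocks(reference)

def candidates(value, n=3):
    return index.get(value[:n], [])

print(candidates("Milno"))  # ['Milano'] -- one comparison instead of four
print(candidates("Rma"))    # [] -- the prefix typo falls outside every block
```

The second call shows the trade-off: a typo inside the prefix itself ("Rma" vs "Roma") lands in a different block and is never compared, so matches can be silently lost. Choose N with your data's typo patterns in mind.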