Use Case Guide: Cleaning Dirty Data with a Reference Table

The Problem

You have a CSV file where one or more columns contain messy values: typos, missing characters, inconsistent capitalization. A typical example is a column of city or municipality names entered by hand.

You also have an authoritative reference file — an official list with the correct names and additional attributes (such as codes or identifiers) that you need to bring into your working file.

The challenge: a regular join fails because the values do not match exactly. tometo_tomato solves this with a fuzzy join — it finds the best approximate match for each row, even when the text is not identical.

The Workflow

The process has two steps:

 Your file          Reference file
 (dirty data)       (ground truth)
     │                    │
     └──── tometo_tomato ─┘
                │
          Mapping table
          (dirty → clean)
                │
     ┌── exact join (duckdb) ──┐
     │                         │
 Your file               Corrected file
 (original)              (enriched)

Step 1. Run tometo_tomato to produce a mapping table that links each dirty value to its best match in the reference file.

Step 2. Use a standard exact join (for example with DuckDB) to bring the corrections back into your original file.
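The two steps can be sketched in a few lines of Python. This is a toy illustration, not tometo_tomato's implementation: `difflib`'s `SequenceMatcher` stands in for whatever similarity metric the tool actually uses, and the names `best_match` and `work_rows` are invented here.

```python
from difflib import SequenceMatcher

def best_match(dirty, reference, threshold=70):
    """Step 1: find the reference value most similar to a dirty value."""
    scored = [(SequenceMatcher(None, dirty, ref).ratio() * 100, ref)
              for ref in reference]
    score, match = max(scored)
    return (match, round(score, 1)) if score >= threshold else (None, None)

reference = ["Roma", "Milano", "Napoli"]
work_rows = [("Rma", 2300), ("Milno", 800)]

# Step 1: the mapping table, dirty value -> (clean value, score)
mapping = {d: best_match(d, reference) for d, _ in work_rows}

# Step 2: an exact join on the dirty value brings the corrections back
corrected = [(d, amount, *mapping[d]) for d, amount in work_rows]
print(corrected)  # [('Rma', 2300, 'Roma', 85.7), ('Milno', 800, 'Milano', 90.9)]
```

The real tool does Step 1 at scale; Step 2 stays a plain exact join, which is why any SQL engine can finish the job.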

Example A: Single Join Column

The data

Your working file (work_simple.csv):

municipality,amount
Roma,1500
Rma,2300
Milno,800
Milano,1200
Npoli,950

Reference file (reference_simple.csv):

municipality_name,istat_code
Roma,058091
Milano,015146
Napoli,063049

Step 1: Generate the mapping

tometo_tomato work_simple.csv reference_simple.csv \
  -j "municipality,municipality_name" \
  -a istat_code \
  -s -t 70 \
  -o mapping.csv

Flags explained:

  • -j "municipality,municipality_name" — the pair of columns to compare (input column, reference column)
  • -a istat_code — an extra column from the reference file to include in the output
  • -s — show the similarity score
  • -t 70 — minimum similarity threshold (0-100)
  • -o mapping.csv — output file path

Result (mapping.csv):

municipality,ref_municipality_name,istat_code,avg_score
Roma,Roma,058091,100.0
Rma,Roma,058091,85.7
Milano,Milano,015146,100.0
Milno,Milano,015146,90.9
Npoli,Napoli,063049,90.9

Each row in the mapping connects a dirty value to the best matching reference value. The avg_score column shows how confident the match is (100 = perfect).
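The guide does not state which similarity metric tometo_tomato uses, but the scores above are consistent with a normalized edit similarity such as Python's `difflib` ratio (twice the number of matched characters divided by the total length of both strings, scaled to 0-100). A quick, purely illustrative check:

```python
from difflib import SequenceMatcher

def score(a, b):
    # 2 * matched_chars / (len(a) + len(b)), scaled to 0-100
    return round(SequenceMatcher(None, a, b).ratio() * 100, 1)

print(score("Rma", "Roma"))     # 85.7
print(score("Milno", "Milano")) # 90.9
print(score("Npoli", "Npoli"))  # 100.0
```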

Step 2: Join back to your original file

duckdb -c "
COPY (
  SELECT
    w.*,
    m.ref_municipality_name AS municipality_correct,
    m.istat_code,
    ROUND(m.avg_score, 1) AS score
  FROM read_csv_auto('work_simple.csv') w
  JOIN read_csv_auto('mapping.csv') m
    ON w.municipality = m.municipality
) TO 'work_corrected.csv' (HEADER);
"

The join condition is w.municipality = m.municipality — an exact join on the dirty column. This works because the mapping table contains exactly the original (dirty) values from your file.

Result (work_corrected.csv):

municipality,amount,municipality_correct,istat_code,score
Roma,1500,Roma,058091,100.0
Rma,2300,Roma,058091,85.7
Milno,800,Milano,015146,90.9
Milano,1200,Milano,015146,100.0
Npoli,950,Napoli,063049,90.9

Every row in your original file now has the correct municipality name and its ISTAT code.

Example B: Multi-Column Join (Disambiguation)

Why a single column is not always enough

Some municipality names exist in multiple regions. For example, in Italy:

Municipality  Region               ISTAT Code
Castro        Puglia               075019
Castro        Lombardia            016065
Samone        Trentino-Alto Adige  022164
Samone        Piemonte             001244

If you match only on the municipality name, “Castro” in Puglia could be incorrectly linked to “Castro” in Lombardia. You need a second column — the region — to disambiguate.

The data

Your working file (work_file.csv):

id,municipality,region,amount
1,San Valentino in Abruzzo Citeriore,Abrzzo,1500
2,San Valentino Torio,Campnia,2300
3,Castro,Pugla,800
4,Castro,Lombardia,1200
5,Samone,Trntino-Alto Adige,950
6,Samone,Piemonte,3100
7,Calliano,Trentno-Alto Adige,400
8,Calliano,Piemnte,2750
9,Livo,Trentno-Alto Adige,1800
10,Livo,Lombardia,600

Notice the typos in the region column: “Abrzzo”, “Campnia”, “Pugla”, “Trntino-Alto Adige”, “Trentno-Alto Adige”, “Piemnte”.

Reference file (reference.csv):

municipality_name,region,istat_code
San Valentino in Abruzzo Citeriore,Abruzzo,068037
San Valentino Torio,Campania,065131
Castro,Puglia,075019
Castro,Lombardia,016065
Samone,Trentino-Alto Adige,022164
Samone,Piemonte,001244
Calliano,Trentino-Alto Adige,022032
Calliano,Piemonte,005013
Livo,Trentino-Alto Adige,022107
Livo,Lombardia,013130

Step 1: Generate the mapping with two join pairs

tometo_tomato work_file.csv reference.csv \
  -j "municipality,municipality_name" \
  -j "region,region" \
  -a istat_code \
  -s -t 70 \
  -o mapping.csv

The key difference: the -j flag is repeated for each column pair. The tool computes a similarity score for each pair and averages them. This means both the municipality name and the region must be similar for a match to succeed.
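The averaging can be checked by hand. Assuming the same difflib-style score as before (a stand-in for the tool's actual metric), the sketch below reproduces the 95.5 for “Castro, Pugla” and shows why the Lombardia row loses:

```python
from difflib import SequenceMatcher

def score(a, b):
    return SequenceMatcher(None, a, b).ratio() * 100

def avg_score(pairs):
    # one similarity score per join pair, averaged into a single match score
    return round(sum(score(a, b) for a, b in pairs) / len(pairs), 1)

# "Castro, Pugla" vs the Puglia reference row: name exact, region close
print(avg_score([("Castro", "Castro"), ("Pugla", "Puglia")]))     # 95.5
# ...vs the Lombardia row: the region mismatch drags the average
# below the -t 70 threshold
print(avg_score([("Castro", "Castro"), ("Pugla", "Lombardia")]))  # 57.1
```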

Result (mapping.csv, excerpt):

municipality,region,ref_municipality_name,ref_region,istat_code,avg_score
Castro,Pugla,Castro,Puglia,075019,95.5
Castro,Lombardia,Castro,Lombardia,016065,100.0
Samone,Trntino-Alto Adige,Samone,Trentino-Alto Adige,022164,98.6
Samone,Piemonte,Samone,Piemonte,001244,100.0
Calliano,Trentno-Alto Adige,Calliano,Trentino-Alto Adige,022032,98.6
Calliano,Piemnte,Calliano,Piemonte,005013,96.7

“Castro, Pugla” correctly maps to “Castro, Puglia” (not Lombardia) because the averaged score on both columns picks the right region.

Step 2: Join back using both columns

duckdb -c "
COPY (
  SELECT
    w.*,
    m.ref_municipality_name AS municipality_correct,
    m.ref_region AS region_correct,
    m.istat_code,
    ROUND(m.avg_score, 1) AS score
  FROM read_csv_auto('work_file.csv') w
  JOIN read_csv_auto('mapping.csv') m
    ON w.municipality = m.municipality
    AND w.region = m.region
  ORDER BY w.id::INT
) TO 'work_corrected.csv' (HEADER);
"

The join now uses both dirty columns: ON w.municipality = m.municipality AND w.region = m.region. This ensures each row maps to exactly one entry in the mapping table.

Result (work_corrected.csv):

id,municipality,region,amount,municipality_correct,region_correct,istat_code,score
1,San Valentino in Abruzzo Citeriore,Abrzzo,1500,San Valentino in Abruzzo Citeriore,Abruzzo,068037,96.2
2,San Valentino Torio,Campnia,2300,San Valentino Torio,Campania,065131,96.7
3,Castro,Pugla,800,Castro,Puglia,075019,95.5
4,Castro,Lombardia,1200,Castro,Lombardia,016065,100.0
5,Samone,Trntino-Alto Adige,950,Samone,Trentino-Alto Adige,022164,98.6
6,Samone,Piemonte,3100,Samone,Piemonte,001244,100.0
7,Calliano,Trentno-Alto Adige,400,Calliano,Trentino-Alto Adige,022032,98.6
8,Calliano,Piemnte,2750,Calliano,Piemonte,005013,96.7
9,Livo,Trentno-Alto Adige,1800,Livo,Trentino-Alto Adige,022107,98.6
10,Livo,Lombardia,600,Livo,Lombardia,013130,100.0

All 10 rows are correctly matched, with the right ISTAT code for each municipality-region pair.

Tips and Best Practices

Reviewing low-confidence matches

Use the score column to spot matches that might be wrong. A quick check:

duckdb -c "
SELECT * FROM read_csv_auto('mapping.csv')
WHERE avg_score < 95
ORDER BY avg_score;
"

Rows with lower scores deserve manual review.

Handling ambiguous matches

When two reference rows score equally well for the same input, tometo_tomato flags the match as ambiguous and excludes it from the clean output to avoid inserting incorrect data. Use `-u` (`--output-ambiguous`) to write these cases to a separate file for inspection:

tometo_tomato work_file.csv reference.csv \
  -j "municipality,municipality_name" \
  -j "region,region" \
  -a istat_code \
  -s -t 70 \
  -o mapping.csv \
  -u ambiguous.csv
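As a toy illustration of how such a tie arises (hypothetical logic, not the tool's code): matching “Castro” on the name alone gives two reference rows a perfect score, while adding the region breaks the tie.

```python
from difflib import SequenceMatcher

def score(a, b):
    return SequenceMatcher(None, a, b).ratio() * 100

# Reference rows as (municipality, region) pairs
reference = [("Castro", "Puglia"), ("Castro", "Lombardia")]

# Matching "Castro" on the name alone: both rows score 100 -> ambiguous
name_scores = [score("Castro", name) for name, _ in reference]
is_ambiguous = name_scores.count(max(name_scores)) > 1
print(is_ambiguous)  # True

# Averaging in the region score breaks the tie
avg = [(score("Castro", n) + score("Pugla", r)) / 2 for n, r in reference]
best = reference[avg.index(max(avg))]
print(best)  # ('Castro', 'Puglia')
```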

Unmatched rows

If an input row has no reference match above the threshold, it still appears in the mapping file — but with empty reference columns. This is a LEFT JOIN: no input row is ever lost. You can filter these out in Step 2 or handle them separately.
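In Python terms, separating matched from unmatched rows is a simple filter on an empty reference column. The column name is taken from Example A; the in-memory CSV below is invented for illustration.

```python
import csv, io

# A mapping file where "Xyz" found no match above the threshold:
# its reference columns are left empty (LEFT JOIN semantics)
mapping_csv = """municipality,ref_municipality_name,istat_code,avg_score
Rma,Roma,058091,85.7
Xyz,,,
"""

rows = list(csv.DictReader(io.StringIO(mapping_csv)))
matched = [r for r in rows if r["ref_municipality_name"]]
unmatched = [r for r in rows if not r["ref_municipality_name"]]
print([r["municipality"] for r in unmatched])  # ['Xyz']
```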

Choosing the right threshold

  • 90-100: strict, only very close matches (good for clean data with minor typos)
  • 80-90: moderate, handles abbreviations and missing characters
  • 70-80: permissive, catches more distant matches but needs manual review
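To see the threshold's effect on a concrete pair, take “Rma” vs “Roma”, which scored 85.7 in Example A: it survives `-t 80` but is dropped at `-t 90` (again using a difflib-style score as a stand-in for the tool's metric):

```python
from difflib import SequenceMatcher

# "Rma" vs "Roma" scores ~85.7: accepted at -t 80, rejected at -t 90
s = SequenceMatcher(None, "Rma", "Roma").ratio() * 100
print(s >= 80, s >= 90)  # True False
```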

Performance on large datasets

For large files, use --block-prefix N to avoid comparing every pair of rows:

tometo_tomato work_file.csv reference.csv \
  -j "municipality,municipality_name" \
  -j "region,region" \
  -a istat_code \
  --block-prefix 3 \
  -s -t 70 \
  -o mapping.csv

This only compares rows where the first 3 characters of the join columns match, dramatically reducing computation time.
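A sketch of the blocking idea in Python (illustrative only): group reference values by their first N characters and compare each input value only against its own block. Note the tradeoff: a typo in the first N characters (like “Rma”) puts the value in the wrong block, so blocking can miss matches it would otherwise find.

```python
from collections import defaultdict

def block_by_prefix(values, n):
    """Group values by their first n characters."""
    blocks = defaultdict(list)
    for v in values:
        blocks[v[:n]].append(v)
    return blocks

dirty = ["Rma", "Milno", "Npoli", "Milano"]
reference = ["Roma", "Milano", "Napoli"]

ref_blocks = block_by_prefix(reference, 3)
# Each dirty value is compared only against reference values sharing
# its 3-character prefix, instead of the full reference list.
candidates = {d: ref_blocks.get(d[:3], []) for d in dirty}
print(candidates["Milno"])  # ['Milano']
print(candidates["Rma"])    # [] (typo in the prefix, no candidates)
```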