Tometo Tomato

Fuzzy join for messy CSV data

Your data has typos. Your reference file doesn’t. Let’s fix that.

Tometo Tomato is a command-line tool that joins two CSV files on similar values rather than only on exact matches. It handles typos, abbreviations, accents, and formatting differences, so you can enrich your messy data using an authoritative reference table.

Built on DuckDB and rapidfuzz.

How It Works

The workflow has two steps:

 Your file           Reference file
 (dirty data)        (ground truth)
      │                     │
      └──── tometo_tomato ──┘
                 │
           Mapping table
           (dirty → clean + score)
                 │
      ┌── exact join (duckdb) ──┐
      │                         │
  Your file               Corrected file
  (original)              (enriched)

Step 1. Run tometo_tomato to fuzzy-match the dirty columns in your file against a reference file. The output is a mapping table that links each dirty value to its best match in the reference file, along with a similarity score.

Step 2. Join the mapping table to your original file with a standard exact join (e.g. with DuckDB) to bring the corrections back into your data.
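
The exact join in Step 2 can be sketched in plain Python. This is only an illustration and not part of tometo_tomato: it assumes the mapping table keeps the original dirty columns next to the matched values, with the column names shown in the Quick Example, and the helper name apply_mapping is hypothetical. The third working row is invented to show what happens when no mapping exists.

```python
import csv
import io

# Illustrative data; column names follow the Quick Example and are assumptions.
WORK_CSV = """municipality,region
Castro,Pugla
Castro,Lombardia
Unknown,Nowhere
"""

MAPPING_CSV = """municipality,region,ref_municipality_name,ref_region,istat_code,avg_score
Castro,Pugla,Castro,Puglia,075019,95.5
Castro,Lombardia,Castro,Lombardia,016065,100.0
"""


def apply_mapping(work_rows, mapping_rows, keys, add_fields):
    """Left-join work_rows to mapping_rows on exact equality of the key columns."""
    # Index the mapping table by its dirty key values for constant-time lookup.
    lookup = {tuple(m[k] for k in keys): m for m in mapping_rows}
    enriched = []
    for row in work_rows:
        match = lookup.get(tuple(row[k] for k in keys))
        out = dict(row)
        for field in add_fields:
            # Rows without a mapping keep an empty value instead of being dropped.
            out[field] = match[field] if match else ""
        enriched.append(out)
    return enriched


work = list(csv.DictReader(io.StringIO(WORK_CSV)))
mapping = list(csv.DictReader(io.StringIO(MAPPING_CSV)))
result = apply_mapping(
    work,
    mapping,
    keys=("municipality", "region"),
    add_fields=("ref_region", "istat_code"),
)
for row in result:
    print(row)
```

The same result can be obtained with a SQL LEFT JOIN on the key columns in DuckDB or any other engine. The dictionary lookup here only makes the mechanics visible: the join is exact because the mapping table already contains the dirty values verbatim.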

Quick Example

Your working file

municipality  region
Castro        Pugla
Castro        Lombardia
Calliano      Trentno-Alto Adige
Calliano      Piemnte

Reference file

municipality_name  region                istat_code
Castro             Puglia                075019
Castro             Lombardia             016065
Calliano           Trentino-Alto Adige   022032
Calliano           Piemonte              005013

tometo_tomato work.csv reference.csv \
  -j "municipality,municipality_name" \
  -j "region,region" \
  -a istat_code -s -t 70 \
  -o mapping.csv

Mapping result: each dirty value linked to its best match

municipality  region              ref_municipality_name  ref_region           istat_code  avg_score
Castro        Pugla               Castro                 Puglia               075019      95.5
Castro        Lombardia           Castro                 Lombardia            016065      100.0
Calliano      Trentno-Alto Adige  Calliano               Trentino-Alto Adige  022032      98.6
Calliano      Piemnte             Calliano               Piemonte             005013      96.7

Matching on two column pairs (-j repeated) ensures “Castro, Pugla” maps to the correct Castro in Puglia, not the one in Lombardia.
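
The effect of the second column pair can be illustrated with a short Python sketch. It is not tometo_tomato's implementation. It uses difflib.SequenceMatcher from the standard library as a stand-in for rapidfuzz, averages the per-pair similarity (an assumption suggested by the avg_score column), and treats the threshold as a minimum average score (an assumption about what -t 70 means). The function names similarity and best_match are hypothetical.

```python
from difflib import SequenceMatcher

# Reference rows copied from the Quick Example.
REFERENCE = [
    {"municipality_name": "Castro", "region": "Puglia", "istat_code": "075019"},
    {"municipality_name": "Castro", "region": "Lombardia", "istat_code": "016065"},
    {"municipality_name": "Calliano", "region": "Trentino-Alto Adige", "istat_code": "022032"},
    {"municipality_name": "Calliano", "region": "Piemonte", "istat_code": "005013"},
]

# (working column, reference column) pairs, mirroring the two -j options.
PAIRS = [("municipality", "municipality_name"), ("region", "region")]


def similarity(a, b):
    """Similarity on a 0-100 scale; a stand-in scorer, not the tool's own."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def best_match(row, reference, pairs, threshold=70.0):
    """Return (reference row, average score) for the highest-scoring candidate.

    The reference row is None when the best average falls below the threshold.
    """
    best, best_score = None, -1.0
    for ref in reference:
        scores = [similarity(row[work_col], ref[ref_col]) for work_col, ref_col in pairs]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best, best_score = ref, avg
    if best_score < threshold:
        return None, best_score
    return best, best_score


dirty = {"municipality": "Castro", "region": "Pugla"}

# With both pairs, the region column breaks the tie between the two Castro rows.
two_col, two_score = best_match(dirty, REFERENCE, PAIRS)
print(two_col["istat_code"], round(two_score, 1))

# With the municipality pair alone, both Castro rows get the same score,
# so nothing in the data favours Puglia over Lombardia.
one_col_scores = [
    similarity(dirty["municipality"], ref["municipality_name"]) for ref in REFERENCE
]
print(one_col_scores[:2])
```

Averaging is only one way to combine per-column scores. Whatever the combination rule, the point is the same: a candidate has to be close on every pair, so a perfect municipality match cannot compensate for a region that does not fit.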

