Tometo Tomato

Fuzzy join for messy CSV data

Your data has typos. Your reference file doesn’t. Let’s fix that.

Tometo Tomato is a command-line tool that connects two CSV files by similarity, not just exact matches. It handles typos, abbreviations, accents, and formatting differences — so you can enrich your messy data using an authoritative reference table.

Built on DuckDB and rapidfuzz.

How It Works

The workflow has two steps:

 Your file           Reference file
 (dirty data)        (ground truth)
      │                     │
      └──── tometo_tomato ──┘
                 │
           Mapping table
           (dirty → clean + score)
                 │
      ┌── exact join (duckdb) ──┐
      │                         │
  Your file               Corrected file
  (original)              (enriched)

Step 1. Run tometo_tomato to fuzzy-match the dirty columns in your file against a reference file. The output is a mapping table that links each dirty value to its best match in the reference file, together with a similarity score.

Step 2. Use a standard exact join (e.g. with DuckDB) between your original file and the mapping table to bring the corrected values back into your data.
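The exact join in Step 2 can be sketched in a few lines. This is only an illustration in plain Python rather than DuckDB: the inline CSV text stands in for your working file and the mapping table, the function name apply_mapping is hypothetical, and the third working row is deliberately left without a mapping entry to show that unmatched rows are kept.

```python
import csv
import io

# Hypothetical in-memory stand-ins for the working file and the mapping table.
WORK_CSV = """municipality,region
Castro,Pugla
Castro,Lombardia
Calliano,Piemnte
"""

MAPPING_CSV = """municipality,region,ref_municipality_name,ref_region,istat_code,avg_score
Castro,Pugla,Castro,Puglia,075019,95.5
Castro,Lombardia,Castro,Lombardia,016065,100.0
"""


def apply_mapping(work_text, mapping_text, keys=("municipality", "region")):
    """Left-join working rows to the mapping table on exact key equality."""
    mapping_rows = list(csv.DictReader(io.StringIO(mapping_text)))
    # Columns the mapping contributes beyond the join keys.
    added = [c for c in mapping_rows[0] if c not in keys] if mapping_rows else []
    lookup = {tuple(r[k] for k in keys): r for r in mapping_rows}

    enriched = []
    for row in csv.DictReader(io.StringIO(work_text)):
        match = lookup.get(tuple(row[k] for k in keys), {})
        # Rows with no mapping entry are kept, with empty correction fields.
        enriched.append({**row, **{c: match.get(c, "") for c in added}})
    return enriched


rows = apply_mapping(WORK_CSV, MAPPING_CSV)
for row in rows:
    print(row["municipality"], row["region"], "->", row["ref_region"], row["istat_code"])
```

Every value is kept as text here, which preserves the leading zero in codes such as 075019. Tools that infer column types may read such a column as a number, so it is worth checking how the code column is typed after the join.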

Quick Example

Your working file
municipality	region
Castro	Pugla
Castro	Lombardia
Calliano	Trentno-Alto Adige
Calliano	Piemnte

Reference file
municipality_name	region	istat_code
Castro	Puglia	075019
Castro	Lombardia	016065
Calliano	Trentino-Alto Adige	022032
Calliano	Piemonte	005013

tometo_tomato work.csv reference.csv \
  -j "municipality,municipality_name" \
  -j "region,region" \
  -a istat_code -s -t 70 \
  -o mapping.csv

Mapping result — each dirty value linked to its best match
municipality	region	ref_municipality_name	ref_region	istat_code	avg_score
Castro	Pugla	Castro	Puglia	075019	95.5
Castro	Lombardia	Castro	Lombardia	016065	100.0
Calliano	Trentno-Alto Adige	Calliano	Trentino-Alto Adige	022032	98.6
Calliano	Piemnte	Calliano	Piemonte	005013	96.7

Matching on two column pairs (-j repeated) ensures “Castro, Pugla” maps to the correct Castro in Puglia, not the one in Lombardia.
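The disambiguation described above can be sketched as follows. This is not the tool's implementation: it assumes that avg_score is the mean of the per-pair similarity scores (as the column name suggests) and that -t is a minimum score, and it uses difflib from the Python standard library as a stand-in for rapidfuzz. The names similarity and best_match are hypothetical.

```python
from difflib import SequenceMatcher

# Reference rows from the example: (municipality_name, region, istat_code).
REFERENCE = [
    ("Castro", "Puglia", "075019"),
    ("Castro", "Lombardia", "016065"),
    ("Calliano", "Trentino-Alto Adige", "022032"),
    ("Calliano", "Piemonte", "005013"),
]


def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(municipality, region, threshold=70.0):
    """Return (reference_row, avg_score) for the best candidate, or None.

    Assumption: the overall score is the mean of the two per-pair scores,
    and candidates scoring below the threshold are rejected.
    """
    scored = [
        (ref, (similarity(municipality, ref[0]) + similarity(region, ref[1])) / 2)
        for ref in REFERENCE
    ]
    ref, score = max(scored, key=lambda pair: pair[1])
    return (ref, score) if score >= threshold else None


# On the municipality column alone, both Castro rows are perfect matches.
tied = [ref for ref in REFERENCE if similarity("Castro", ref[0]) == 100.0]
print(len(tied), "reference rows tie on municipality alone")

# Adding the region pair breaks the tie.
match = best_match("Castro", "Pugla")
print(match[0], round(match[1], 1))
```

With only the municipality pair, nothing distinguishes the two Castro rows. Once the region score is averaged in, the Puglia row scores well above the Lombardia row, so the first tie disappears.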
