CLI Reference

Synopsis

tometo_tomato INPUT_FILE REFERENCE_FILE [OPTIONS]

INPUT_FILE is the CSV with messy data. REFERENCE_FILE is the CSV with correct, authoritative data.

Options

Core

Core options
Flag                     Short  Description
--join-pair COL1,COL2    -j     Column pair to compare (input column, reference column). Repeatable for multi-column matching.
--add-field FIELD        -a     Extra column from the reference file to include in the output. Repeatable.
--output-clean FILE      -o     Path for the clean matches output file. Default: clean_matches.csv
--output-ambiguous FILE  -u     Path for the ambiguous matches file. Only created if ambiguous records exist.
--threshold N            -t     Minimum similarity score (0-100). Default: 85
--show-score             -s     Include the avg_score column in the output.
--force                  -f     Overwrite existing output files without prompting.

Matching Algorithm

Matching options
Flag                 Short  Description
--scorer ALGO               Fuzzy matching algorithm: ratio (default) or token_set_ratio.
--infer-pairs        -i     Automatically infer column pairs from similar header names.
--infer-threshold N  -I     Similarity threshold (0-1) for header name inference. Default: 0.7
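
As a rough illustration of what header inference with a 0-1 threshold could look like, the sketch below pairs each input header with its most similar reference header and keeps the pair only if the similarity reaches the threshold. The function name infer_pairs and the use of difflib are assumptions made for this example; the similarity measure tometo_tomato actually applies to header names is not documented here and may differ.

```python
from difflib import SequenceMatcher


def infer_pairs(input_headers, reference_headers, threshold=0.7):
    """Pair each input header with its most similar reference header.

    Hypothetical helper: difflib stands in for whatever measure the tool uses.
    """
    pairs = []
    for left in input_headers:
        # Score every reference header and keep the best one.
        best_score, best_right = max(
            (SequenceMatcher(None, left.lower(), right.lower()).ratio(), right)
            for right in reference_headers
        )
        if best_score >= threshold:
            pairs.append((left, best_right))
    return pairs


result = infer_pairs(
    ["municipality", "region", "notes"],
    ["municipality_name", "region", "istat_code"],
)
print(result)  # [('municipality', 'municipality_name'), ('region', 'region')]
```

Under this stand-in measure, "notes" has no sufficiently similar counterpart and is left unpaired, which is the behavior a threshold is meant to provide.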

Normalization

By default, tometo_tomato normalizes join columns before comparison: it converts values to lowercase, trims leading and trailing whitespace, and collapses multiple spaces into one. The flags below adjust that behavior.

Normalization options
Flag                 Short  Description
--latinize                  Strip accents and special characters before matching (e.g. é → e). Original characters are preserved in output.
--keep-alphanumeric  -k     Remove punctuation and special characters, keeping only letters, numbers, and spaces.
--raw-case                  Disable case normalization (case-sensitive matching).
--raw-whitespace            Disable whitespace normalization (no trimming or space collapsing).
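
The normalization steps described in this section can be approximated in a few lines of Python. This is a minimal sketch, not the tool's implementation: the function name normalize, its keyword arguments (named after the flags), and the order in which the steps are applied are all assumptions.

```python
import re
import unicodedata


def normalize(value, latinize=False, keep_alphanumeric=False,
              raw_case=False, raw_whitespace=False):
    """Approximate the join-column normalization (hypothetical helper)."""
    if latinize:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", value)
        value = "".join(c for c in decomposed if not unicodedata.combining(c))
    if keep_alphanumeric:
        # Keep letters, digits, and whitespace only.
        value = "".join(c for c in value if c.isalnum() or c.isspace())
    if not raw_case:
        value = value.lower()
    if not raw_whitespace:
        # Collapse runs of whitespace and trim both ends.
        value = re.sub(r"\s+", " ", value).strip()
    return value


print(normalize("  Forlì   Cesena "))                    # forlì cesena
print(normalize("Forlì-Cesena", latinize=True))          # forli-cesena
print(normalize("S. Giovanni", keep_alphanumeric=True))  # s giovanni
```

The normalized value is used only for comparison; as noted above, the original characters are what appear in the output.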

Performance

Performance options
Flag              Short  Description
--block-prefix N         Only compare records sharing the same first N characters in each join column. Reduces the number of comparisons on large datasets, at the cost of missing matches whose first N characters differ.
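
To see why prefix blocking cuts the work, the sketch below groups reference values by their first N characters and compares each input value only against its own group. It covers a single join column, and the function name candidate_pairs and the lowercasing of the prefix are assumptions for illustration, not a description of the tool's internals.

```python
from collections import defaultdict


def candidate_pairs(input_values, reference_values, prefix_len=3):
    """Yield only (input, reference) pairs that share the same prefix."""
    # Index reference values by their (lowercased) prefix.
    blocks = defaultdict(list)
    for ref in reference_values:
        blocks[ref[:prefix_len].lower()].append(ref)
    # Each input value is compared only within its own block.
    for value in input_values:
        for ref in blocks.get(value[:prefix_len].lower(), []):
            yield value, ref


inputs = ["Milano", "Milan", "Torino"]
references = ["Milano", "Torino", "Tortona", "Napoli"]
pairs = list(candidate_pairs(inputs, references))
print(len(pairs), "comparisons instead of", len(inputs) * len(references))
# 4 comparisons instead of 12
```

The trade-off is visible in the same sketch: an input value such as "Mliano" would fall into the "mli" block and never be compared with "Milano", so blocking suits data whose leading characters are reliable.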

Output Control

Output control options
Flag       Short  Description
--verbose  -v     Increase verbosity. Use -vv for debug output.
--quiet    -q     Suppress all output except errors.
--version         Show version and exit.

Examples

Basic single-column match

tometo_tomato work.csv reference.csv \
  -j "city,city_name" \
  -a city_code \
  -s -t 85 \
  -o mapping.csv

Multi-column match (disambiguation)

tometo_tomato work.csv reference.csv \
  -j "municipality,municipality_name" \
  -j "region,region" \
  -a istat_code \
  -s -t 70 \
  -o mapping.csv

With normalization

tometo_tomato work.csv reference.csv \
  -j "name,ref_name" \
  --latinize \
  --keep-alphanumeric \
  -a code \
  -s -t 80 \
  -o mapping.csv

With blocking for large datasets

tometo_tomato work.csv reference.csv \
  -j "city,city" -j "region,region" \
  -a city_code \
  --block-prefix 3 \
  -s -t 85 \
  -o mapping.csv

Token set ratio (word-order independent)

Useful when names have different word counts (e.g. “Reggio Calabria” vs. “Reggio di Calabria”):

tometo_tomato work.csv reference.csv \
  -j "city,city_name" \
  --scorer token_set_ratio \
  -s -t 90 \
  -o mapping.csv
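
For readers unfamiliar with token set ratio, the sketch below follows the commonly used formulation: split both strings into token sets, then compare the shared tokens against each side's full set and take the best score. It uses difflib as a stand-in for the real scorer, so exact values may differ from what tometo_tomato reports; the function names are chosen for this example.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity on a 0-100 scale (difflib stand-in for the real scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def token_set_ratio(a, b):
    """Compare the shared tokens against each side's full token set."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    shared = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = f"{shared} {only_a}".strip()
    combined_b = f"{shared} {only_b}".strip()
    return max(ratio(shared, combined_a),
               ratio(shared, combined_b),
               ratio(combined_a, combined_b))


print(token_set_ratio("Reggio Calabria", "Reggio di Calabria"))  # 100.0
print(token_set_ratio("Calabria Reggio", "Reggio di Calabria"))  # 100.0
```

Because one name's tokens are a subset of the other's, the extra word "di" and the word order do not lower the score. This is also why a high threshold such as 90 remains practical with this scorer.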

Automated pipeline (no prompts)

tometo_tomato work.csv reference.csv \
  -j "name,ref_name" \
  -a code \
  -t 85 \
  -o mapping.csv \
  --force --quiet

Capture ambiguous matches

tometo_tomato work.csv reference.csv \
  -j "city,city_name" \
  -a city_code \
  -s -t 80 \
  -o mapping.csv \
  -u ambiguous.csv

How Scores Work

The avg_score column represents the average fuzzy similarity across all join pairs, on a 0-100 scale:

  • 100 = perfect match (after normalization)
  • 90-99 = minor differences (a missing character, a typo)
  • 80-89 = moderate differences (abbreviations, missing words)
  • < 80 = distant matches (manual review recommended)

When using multiple -j pairs, the score is the average of individual pair scores. For example, with -j "city,city" -j "region,region":

  • city score: 100 (exact match)
  • region score: 90.9 (“Pugla” vs “Puglia”)
  • avg_score: (100 + 90.9) / 2 = 95.5
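
The arithmetic above can be reproduced with a short sketch. It assumes difflib's ratio as a stand-in for the tool's scorer (for this example it gives the same 90.9 for "Pugla" vs "Puglia"); the function names and the city value "Bari" are made up for illustration.

```python
from difflib import SequenceMatcher


def pair_score(a, b):
    """Similarity of two values on a 0-100 scale (stand-in scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def avg_score(input_row, reference_row, join_pairs):
    """Average the per-pair scores across all join pairs."""
    scores = [pair_score(input_row[i], reference_row[r]) for i, r in join_pairs]
    return sum(scores) / len(scores)


input_row = {"city": "Bari", "region": "Pugla"}
reference_row = {"city": "Bari", "region": "Puglia"}
score = avg_score(input_row, reference_row,
                  [("city", "city"), ("region", "region")])
print(round(score, 1))  # 95.5
```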

How Ambiguity Works

A match is ambiguous when two or more reference rows achieve the same maximum avg_score for a given input row (and that score meets the threshold).

  • Ambiguous rows are excluded from --output-clean to prevent inserting incorrect data
  • Use --output-ambiguous to inspect the tied candidates
  • If --output-ambiguous is not set, the tool prints a warning with the count of ambiguous records
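
The tie rule can be sketched as a small classification step over the scored candidates for one input row. The function name classify, the tuple layout, and the "unmatched" label are hypothetical; only the rule itself (a tie at the maximum score that meets the threshold is ambiguous) comes from the description above.

```python
def classify(candidates, threshold=85):
    """Label one input row's candidates as clean, ambiguous, or unmatched.

    `candidates` is a list of (reference_row_id, avg_score) tuples.
    """
    if not candidates:
        return "unmatched", []
    best = max(score for _, score in candidates)
    if best < threshold:
        return "unmatched", []
    # Every reference row that reaches the maximum score is a winner.
    winners = [ref for ref, score in candidates if score == best]
    return ("clean" if len(winners) == 1 else "ambiguous"), winners


print(classify([("ref_1", 92.0), ("ref_2", 88.5)]))  # ('clean', ['ref_1'])
print(classify([("ref_1", 92.0), ("ref_2", 92.0)]))  # ('ambiguous', ['ref_1', 'ref_2'])
print(classify([("ref_1", 60.0)]))                   # ('unmatched', [])
```

Adding a second -j pair is the usual way to break such ties, since two reference rows that tie on one column often differ on another.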