Why Fuzzy Matching?

The Problem

Anyone who works with data regularly encounters columns that should follow a standard encoding — an official list of values — but instead contain typos, extra whitespace, special characters instead of accented letters, inconsistent capitalization, and so on.

Here are some Italian city names, written incorrectly:

Examples of incorrect city names
city type of error
Cefalu’ should use ù instead of u'
Reggio Calabria the official name is Reggio di Calabria
RODENGO-SAIANO should be Rodengo Saiano, without - and not all uppercase


If you wanted to associate each city with the code1 that ISTAT — the Italian National Institute of Statistics — assigns to each municipality, a simple exact join would fail because the names don’t match. Associating a code to each municipality is very important: it enables you to link data from different sources that would otherwise be incomparable, and to automatically generate maps, since mapping software often uses these codes to identify municipalities.

Pay Attention

Data educators usually shout: CODES, NOT LABELS!!

The codes for Italian administrative units are public and freely available under CC-BY-4.0 on various sections of the ISTAT website. One source is SITUAS, which has a page with the “List of codes and names of territorial units”, downloadable in CSV and JSON format.

Official ISTAT municipality data
city region istat_city_code
Cefalù Sicilia 082027
Reggio di Calabria Calabria 080063
Rodengo Saiano Lombardia 017163


If you try to join the incorrect data with this official data, you get no results.


Tometo Tomato was created to solve this problem: it lets you easily perform joins that are not based on exact matches, but on similarity, and then enrich, modify, and correct your messy input table by comparing it against a reference table.


Sample Files

These are our two sample tables, available as input.csv and ref_sample.csv so you can download them and try the examples yourself.

Using SQL

The Exact JOIN

The first test is to run a simple SQL join query to see what happens with our sample data.

The raw data
city region
Cefalu’ Sicilia
Reggio Calabria CALABRIA
RODENGO-SAIANO Lombardia
The reference data
city region city_code
Cefalù Sicilia 082027
Reggio di Calabria Calabria 080063
Rodengo Saiano Lombardia 017164


The query below uses a LEFT JOIN so that all rows from the left table appear in the output, even if they have no match in the right table.

SELECT
  i.*,
  r.city_code
FROM read_csv_auto('input.csv') AS i
LEFT JOIN read_csv_auto('ref_sample.csv') AS r
  ON i.city = r.city AND i.region = r.region
Note

Note: read_csv_auto is a DuckDB function that reads CSV files directly without importing them into a table first. This makes quick experiments easy — no temporary tables needed.

The result is disappointing: none of the rows find a match, so city_code is always NULL.

Result of an exact join between messy and official data
city region istat_city_code
Cefalu’ Sicilia NULL
Reggio Calabria CALABRIA NULL
RODENGO-SAIANO Lombardia NULL

The Fuzzy JOIN

A “fuzzy” join finds matches even when values are not exactly equal, but merely similar. For example, we could consider city names that differ by just a few characters, using the Levenshtein distance (the minimum number of single-character edits needed to transform one string into another).

We could rewrite the previous query using DuckDB’s levenshtein function, matching cities that differ by at most 2 characters:

SELECT
  i.city AS input_city,
  i.region AS input_region,
  r.city AS ref_city,
  r.region AS ref_region,
  r.city_code,
  levenshtein(i.city, r.city) AS levenshtein_distance
FROM read_csv_auto('input.csv') AS i
JOIN read_csv_auto('ref_sample.csv') AS r
  ON i.region = r.region
  AND levenshtein(i.city, r.city) <= 2;
Result with Levenshtein distance ≤ 2
input_city input_region ref_city ref_region city_code levenshtein_distance
Cefalu’ Sicilia Cefalù Sicilia 082027 2

The only match is Cefalu', because the Levenshtein distance between Cefalu' and Cefalù is 2: replacing u' with ù and removing '.

If we increase the threshold to 10, nothing changes — because Rodengo Saiano is written in uppercase with a hyphen, and the distance between RODENGO-SAIANO and Rodengo Saiano is 12:

SELECT levenshtein('RODENGO-SAIANO', 'Rodengo Saiano') AS distance;

12

Setting the threshold to 15 gives only 2 matches, because the exact JOIN on region fails between CALABRIA and Calabria. So we need to use a distance function for the region column too:

SELECT
  i.city AS input_city,
  i.region AS input_region,
  r.city AS ref_city,
  r.region AS ref_region,
  r.city_code,
  levenshtein(i.city, r.city) AS levenshtein_distance
FROM read_csv_auto('input.csv') AS i
JOIN read_csv_auto('ref_sample.csv') AS r
  ON levenshtein(i.city, r.city) <= 15
  AND levenshtein(LOWER(i.region), LOWER(r.region)) < 10

But the output is not what we expect — not 3 rows (one per municipality), but 8 rows:

Result of a fuzzy join with permissive thresholds
input_city input_region ref_city ref_region city_code levenshtein_distance
Cefalu’ Sicilia Cefalù Sicilia 082027 2
Cefalu’ Sicilia Reggio di Calabria Calabria 080063 15
Cefalu’ Sicilia Rodengo Saiano Lombardia 017163 12
Reggio Calabria CALABRIA Cefalù Sicilia 082027 12
Reggio Calabria CALABRIA Reggio di Calabria Calabria 080063 3
Reggio Calabria CALABRIA Rodengo Saiano Lombardia 017163 10
RODENGO-SAIANO Lombardia Cefalù Sicilia 082027 14
RODENGO-SAIANO Lombardia Rodengo Saiano Lombardia 017163 12

With exact joins, the goal is to find a single clear match for each row. In the world of fuzzy matching, the rules change. By using permissive distance thresholds, we are no longer asking the database “find the one correct match”, but rather:

For each input row, find all reference rows that satisfy these loose similarity criteria.

If an input row is “somewhat similar” to multiple reference rows, the database creates an output row for each. This multiplication effect is known as a Cartesian product of matches.

After generating all possible matches, we need to filter them to keep only the best one per input record. The process has three phases:

  1. Find all possible matches and calculate distances
  2. Rank each match by total distance
  3. Select only the top-ranked match (the most similar)
-- Phase 1: find all candidates and calculate distances
WITH all_matches AS (
    SELECT
        i.city AS input_city,
        i.region AS input_region,
        r.city AS ref_city,
        r.region AS ref_region,
        r.city_code,
        levenshtein(i.city, r.city) AS city_distance,
        levenshtein(i.region, r.region) AS region_distance
    FROM read_csv_auto('input.csv') AS i
    JOIN read_csv_auto('ref_sample.csv') AS r
      ON levenshtein(i.city, r.city) <= 15
     AND levenshtein(i.region, r.region) < 10
),
-- Phase 2: rank the candidates
ranked_matches AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY input_city, input_region
            ORDER BY (city_distance + region_distance) ASC
        ) AS rn
    FROM all_matches
)
-- Phase 3: keep only the best candidate (rn = 1)
SELECT
    input_city,
    input_region,
    ref_city,
    ref_region,
    city_code,
    city_distance,
    region_distance
FROM ranked_matches
WHERE rn = 1;
Result of a fuzzy join, selecting the best match
input_city input_region ref_city ref_region city_code city_distance region_distance
Cefalu’ Sicilia Cefalù Sicilia 082027 2 0
RODENGO-SAIANO Lombardia Rodengo Saiano Lombardia 017163 12 0
Reggio Calabria CALABRIA Reggio di Calabria Calabria 080063 3 7

Reducing Distances with Normalization

Here is a summary using the Italian municipality of Forza d'Agro, written incorrectly as Forza D'Agro:

  • uppercase D instead of lowercase
  • two extra spaces (one between Forza and D'Agro, one trailing)
  • no accent on the o
  • a typographic apostrophe \u2019 instead of a straight one '
Note

Note: spaces are shown as · for visibility.

Reducing Levenshtein distance through normalization steps
description left right distance function applied
Start Forza··D\u2019Agro· Forza·d'Agr\u00f2 7
Lowercase forza··d\u2019agro· forza·d'agr\u00f2 6 LOWER('value')
Trim + collapse spaces forza·d\u2019agro forza·d'agr\u00f2 5 regexp_replace(trim(LOWER('value')), '\s+', ' ')
Strip accents forza·d\u2019agro forza·d'agro 3 strip_accents(...)
Keep only alphanumeric forza·dagro forza·dagro 0 regexp_replace(..., '[^a-zA-Z0-9 ]', '', 'g')


So even a really messy value, full of errors, can be normalized to reduce the distance to zero — meaning Forza D\u2019Agro becomes equivalent to Forza d'Agr\u00f2 and we can retrieve its identifier code.

What Comes Next

In this walkthrough we covered the basics of measuring string distance and normalizing text for more effective comparison: case, whitespace, accents, and special characters.

However, this is just the beginning. The fuzzy matching world is much broader. For larger and messier datasets, more sophisticated methods exist:

  • Phonetic algorithms: Instead of looking at how words are spelled, these encode words by how they sound. Methods like Metaphone or Soundex excel at recognizing that “Smith” and “Smythe” are probably the same thing.
  • Statistical models (n-grams): These break strings into small fragments (pairs or triplets of characters) and compare their frequency, which is very effective at finding similarities even when word order differs.

Data cleaning tools like OpenRefine integrate dozens of these advanced algorithms, allowing you to cluster and correct similar data with great effectiveness.

Tometo Tomato wraps all of this complexity into a single command. See the Use Case Guide for the complete step-by-step workflow.