Why Fuzzy Matching?

The Problem

Anyone who works with data regularly encounters columns that should follow a standard encoding — an official list of values — but instead contain typos, extra whitespace, special characters instead of accented letters, inconsistent capitalization, and so on.

Here are some Italian city names, written incorrectly:

Examples of incorrect city names
city	type of error
Cefalu’	should use `ù` instead of `u'`
Reggio Calabria	the official name is `Reggio di Calabria`
RODENGO-SAIANO	should be `Rodengo Saiano`, without `-` and not all uppercase

If you wanted to associate each city with the code¹ that ISTAT — the Italian National Institute of Statistics — assigns to each municipality, a simple exact join would fail because the names don’t match. Associating a code to each municipality is very important: it enables you to link data from different sources that would otherwise be incomparable, and to automatically generate maps, since mapping software often uses these codes to identify municipalities.

Pay Attention

Data educators usually shout: CODES, NOT LABELS!!

The codes for Italian administrative units are public and freely available under CC-BY-4.0 on various sections of the ISTAT website. One source is SITUAS, which has a page with the “List of codes and names of territorial units”, downloadable in CSV and JSON format.

Official ISTAT municipality data
city	region	istat_city_code
Cefalù	Sicilia	082027
Reggio di Calabria	Calabria	080063
Rodengo Saiano	Lombardia	017163

If you try to join the incorrect data with this official data, you get no results.

Tometo Tomato was created to solve this problem: it lets you easily perform joins that are not based on exact matches, but on similarity, and then enrich, modify, and correct your messy input table by comparing it against a reference table.

Sample Files

These are our two sample tables, available as input.csv and ref_sample.csv so you can download them and try the examples yourself.

Using SQL

The Exact JOIN

The first test is to run a simple SQL join query to see what happens with our sample data.

The raw data
city	region
Cefalu’	Sicilia
Reggio Calabria	CALABRIA
RODENGO-SAIANO	Lombardia

The reference data
city	region	city_code
Cefalù	Sicilia	082027
Reggio di Calabria	Calabria	080063
Rodengo Saiano	Lombardia	017164

The query below uses a LEFT JOIN so that all rows from the left table appear in the output, even if they have no match in the right table.

SELECT
  i.*,
  r.city_code
FROM read_csv_auto('input.csv') AS i
LEFT JOIN read_csv_auto('ref_sample.csv') AS r
  ON i.city = r.city AND i.region = r.region

Note

Note: read_csv_auto is a DuckDB function that reads CSV files directly without importing them into a table first. This makes quick experiments easy — no temporary tables needed.

The result is disappointing: none of the rows find a match, so city_code is always NULL.

Result of an exact join between messy and official data
city	region	istat_city_code
Cefalu’	Sicilia	NULL
Reggio Calabria	CALABRIA	NULL
RODENGO-SAIANO	Lombardia	NULL

The Fuzzy JOIN

A “fuzzy” join finds matches even when values are not exactly equal, but merely similar. For example, we could consider city names that differ by just a few characters, using the Levenshtein distance (the minimum number of single-character edits needed to transform one string into another).

We could rewrite the previous query using DuckDB’s levenshtein function, matching cities that differ by at most 2 characters:

SELECT
  i.city AS input_city,
  i.region AS input_region,
  r.city AS ref_city,
  r.region AS ref_region,
  r.city_code,
  levenshtein(i.city, r.city) AS levenshtein_distance
FROM read_csv_auto('input.csv') AS i
JOIN read_csv_auto('ref_sample.csv') AS r
  ON i.region = r.region
  AND levenshtein(i.city, r.city) <= 2;

Result with Levenshtein distance ≤ 2
input_city	input_region	ref_city	ref_region	city_code	levenshtein_distance
Cefalu’	Sicilia	Cefalù	Sicilia	082027	2

The only match is Cefalu', because the Levenshtein distance between Cefalu' and Cefalù is 2: replacing u' with ù and removing '.

If we increase the threshold to 10, nothing changes — because Rodengo Saiano is written in uppercase with a hyphen, and the distance between RODENGO-SAIANO and Rodengo Saiano is 12:

SELECT levenshtein('RODENGO-SAIANO', 'Rodengo Saiano') AS distance;

12

Setting the threshold to 15 gives only 2 matches, because the exact JOIN on region fails between CALABRIA and Calabria. So we need to use a distance function for the region column too:

SELECT
  i.city AS input_city,
  i.region AS input_region,
  r.city AS ref_city,
  r.region AS ref_region,
  r.city_code,
  levenshtein(i.city, r.city) AS levenshtein_distance
FROM read_csv_auto('input.csv') AS i
JOIN read_csv_auto('ref_sample.csv') AS r
  ON levenshtein(i.city, r.city) <= 15
  AND levenshtein(LOWER(i.region), LOWER(r.region)) < 10

But the output is not what we expect — not 3 rows (one per municipality), but 8 rows:

Result of a fuzzy join with permissive thresholds
input_city	input_region	ref_city	ref_region	city_code	levenshtein_distance
Cefalu’	Sicilia	Cefalù	Sicilia	082027	2
Cefalu’	Sicilia	Reggio di Calabria	Calabria	080063	15
Cefalu’	Sicilia	Rodengo Saiano	Lombardia	017163	12
Reggio Calabria	CALABRIA	Cefalù	Sicilia	082027	12
Reggio Calabria	CALABRIA	Reggio di Calabria	Calabria	080063	3
Reggio Calabria	CALABRIA	Rodengo Saiano	Lombardia	017163	10
RODENGO-SAIANO	Lombardia	Cefalù	Sicilia	082027	14
RODENGO-SAIANO	Lombardia	Rodengo Saiano	Lombardia	017163	12

With exact joins, the goal is to find a single clear match for each row. In the world of fuzzy matching, the rules change. By using permissive distance thresholds, we are no longer asking the database “find the one correct match”, but rather:

For each input row, find all reference rows that satisfy these loose similarity criteria.

If an input row is “somewhat similar” to multiple reference rows, the database creates an output row for each. This multiplication effect is known as a Cartesian product of matches.

After generating all possible matches, we need to filter them to keep only the best one per input record. The process has three phases:

Find all possible matches and calculate distances
Rank each match by total distance
Select only the top-ranked match (the most similar)

-- Phase 1: find all candidates and calculate distances
WITH all_matches AS (
    SELECT
        i.city AS input_city,
        i.region AS input_region,
        r.city AS ref_city,
        r.region AS ref_region,
        r.city_code,
        levenshtein(i.city, r.city) AS city_distance,
        levenshtein(i.region, r.region) AS region_distance
    FROM read_csv_auto('input.csv') AS i
    JOIN read_csv_auto('ref_sample.csv') AS r
      ON levenshtein(i.city, r.city) <= 15
     AND levenshtein(i.region, r.region) < 10
),
-- Phase 2: rank the candidates
ranked_matches AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY input_city, input_region
            ORDER BY (city_distance + region_distance) ASC
        ) AS rn
    FROM all_matches
)
-- Phase 3: keep only the best candidate (rn = 1)
SELECT
    input_city,
    input_region,
    ref_city,
    ref_region,
    city_code,
    city_distance,
    region_distance
FROM ranked_matches
WHERE rn = 1;

Result of a fuzzy join, selecting the best match
input_city	input_region	ref_city	ref_region	city_code	city_distance	region_distance
Cefalu’	Sicilia	Cefalù	Sicilia	082027	2	0
RODENGO-SAIANO	Lombardia	Rodengo Saiano	Lombardia	017163	12	0
Reggio Calabria	CALABRIA	Reggio di Calabria	Calabria	080063	3	7

Reducing Distances with Normalization

Here is a summary using the Italian municipality of Forza d'Agro, written incorrectly as Forza D'Agro:

uppercase D instead of lowercase
two extra spaces (one between Forza and D'Agro, one trailing)
no accent on the o
a typographic apostrophe \u2019 instead of a straight one '

Note

Note: spaces are shown as · for visibility.

Reducing Levenshtein distance through normalization steps
description	left	right	distance	function applied
Start	`Forza··D\u2019Agro·`	`Forza·d'Agr\u00f2`	7
Lowercase	`forza··d\u2019agro·`	`forza·d'agr\u00f2`	6	`LOWER('value')`
Trim + collapse spaces	`forza·d\u2019agro`	`forza·d'agr\u00f2`	5	`regexp_replace(trim(LOWER('value')), '\s+', ' ')`
Strip accents	`forza·d\u2019agro`	`forza·d'agro`	3	`strip_accents(...)`
Keep only alphanumeric	`forza·dagro`	`forza·dagro`	0	`regexp_replace(..., '[^a-zA-Z0-9 ]', '', 'g')`

So even a really messy value, full of errors, can be normalized to reduce the distance to zero — meaning Forza D\u2019Agro becomes equivalent to Forza d'Agr\u00f2 and we can retrieve its identifier code.

What Comes Next

In this walkthrough we covered the basics of measuring string distance and normalizing text for more effective comparison: case, whitespace, accents, and special characters.

However, this is just the beginning. The fuzzy matching world is much broader. For larger and messier datasets, more sophisticated methods exist:

Phonetic algorithms: Instead of looking at how words are spelled, these encode words by how they sound. Methods like Metaphone or Soundex excel at recognizing that “Smith” and “Smythe” are probably the same thing.
Statistical models (n-grams): These break strings into small fragments (pairs or triplets of characters) and compare their frequency, which is very effective at finding similarities even when word order differs.

Data cleaning tools like OpenRefine integrate dozens of these advanced algorithms, allowing you to cluster and correct similar data with great effectiveness.

Tometo Tomato wraps all of this complexity into a single command. See the Use Case Guide for the complete step-by-step workflow.

Footnotes

Codes of Italian municipalities ↩︎