Why Fuzzy Matching?
The Problem
Anyone who works with data regularly encounters columns that should follow a standard encoding — an official list of values — but instead contain typos, extra whitespace, special characters instead of accented letters, inconsistent capitalization, and so on.
Here are some Italian city names, written incorrectly:
| city | type of error |
|---|---|
| Cefalu’ | should use ù instead of u' |
| Reggio Calabria | the official name is Reggio di Calabria |
| RODENGO-SAIANO | should be Rodengo Saiano, without - and not all uppercase |
If you wanted to associate each city with the code1 that ISTAT — the Italian National Institute of Statistics — assigns to each municipality, a simple exact join would fail because the names don’t match. Associating a code to each municipality is very important: it enables you to link data from different sources that would otherwise be incomparable, and to automatically generate maps, since mapping software often uses these codes to identify municipalities.
Data educators usually shout: CODES, NOT LABELS!!
The codes for Italian administrative units are public and freely available under CC-BY-4.0 on various sections of the ISTAT website. One source is SITUAS, which has a page with the “List of codes and names of territorial units”, downloadable in CSV and JSON format.
| city | region | istat_city_code |
|---|---|---|
| Cefalù | Sicilia | 082027 |
| Reggio di Calabria | Calabria | 080063 |
| Rodengo Saiano | Lombardia | 017163 |
If you try to join the incorrect data with this official data, you get no results.
Tometo Tomato was created to solve this problem: it lets you easily perform joins that are not based on exact matches, but on similarity, and then enrich, modify, and correct your messy input table by comparing it against a reference table.
Sample Files
These are our two sample tables, available as input.csv and ref_sample.csv so you can download them and try the examples yourself.
Using SQL
The Exact JOIN
The first test is to run a simple SQL join query to see what happens with our sample data.
| city | region |
|---|---|
| Cefalu’ | Sicilia |
| Reggio Calabria | CALABRIA |
| RODENGO-SAIANO | Lombardia |
| city | region | city_code |
|---|---|---|
| Cefalù | Sicilia | 082027 |
| Reggio di Calabria | Calabria | 080063 |
| Rodengo Saiano | Lombardia | 017164 |
The query below uses a LEFT JOIN so that all rows from the left table appear in the output, even if they have no match in the right table.
SELECT
i.*,
r.city_code
FROM read_csv_auto('input.csv') AS i
LEFT JOIN read_csv_auto('ref_sample.csv') AS r
ON i.city = r.city AND i.region = r.regionNote: read_csv_auto is a DuckDB function that reads CSV files directly without importing them into a table first. This makes quick experiments easy — no temporary tables needed.
The result is disappointing: none of the rows find a match, so city_code is always NULL.
| city | region | istat_city_code |
|---|---|---|
| Cefalu’ | Sicilia | NULL |
| Reggio Calabria | CALABRIA | NULL |
| RODENGO-SAIANO | Lombardia | NULL |
The Fuzzy JOIN
A “fuzzy” join finds matches even when values are not exactly equal, but merely similar. For example, we could consider city names that differ by just a few characters, using the Levenshtein distance (the minimum number of single-character edits needed to transform one string into another).
We could rewrite the previous query using DuckDB’s levenshtein function, matching cities that differ by at most 2 characters:
SELECT
i.city AS input_city,
i.region AS input_region,
r.city AS ref_city,
r.region AS ref_region,
r.city_code,
levenshtein(i.city, r.city) AS levenshtein_distance
FROM read_csv_auto('input.csv') AS i
JOIN read_csv_auto('ref_sample.csv') AS r
ON i.region = r.region
AND levenshtein(i.city, r.city) <= 2;| input_city | input_region | ref_city | ref_region | city_code | levenshtein_distance |
|---|---|---|---|---|---|
| Cefalu’ | Sicilia | Cefalù | Sicilia | 082027 | 2 |
The only match is Cefalu', because the Levenshtein distance between Cefalu' and Cefalù is 2: replacing u' with ù and removing '.
If we increase the threshold to 10, nothing changes — because Rodengo Saiano is written in uppercase with a hyphen, and the distance between RODENGO-SAIANO and Rodengo Saiano is 12:
SELECT levenshtein('RODENGO-SAIANO', 'Rodengo Saiano') AS distance;
12Setting the threshold to 15 gives only 2 matches, because the exact JOIN on region fails between CALABRIA and Calabria. So we need to use a distance function for the region column too:
SELECT
i.city AS input_city,
i.region AS input_region,
r.city AS ref_city,
r.region AS ref_region,
r.city_code,
levenshtein(i.city, r.city) AS levenshtein_distance
FROM read_csv_auto('input.csv') AS i
JOIN read_csv_auto('ref_sample.csv') AS r
ON levenshtein(i.city, r.city) <= 15
AND levenshtein(LOWER(i.region), LOWER(r.region)) < 10But the output is not what we expect — not 3 rows (one per municipality), but 8 rows:
| input_city | input_region | ref_city | ref_region | city_code | levenshtein_distance |
|---|---|---|---|---|---|
| Cefalu’ | Sicilia | Cefalù | Sicilia | 082027 | 2 |
| Cefalu’ | Sicilia | Reggio di Calabria | Calabria | 080063 | 15 |
| Cefalu’ | Sicilia | Rodengo Saiano | Lombardia | 017163 | 12 |
| Reggio Calabria | CALABRIA | Cefalù | Sicilia | 082027 | 12 |
| Reggio Calabria | CALABRIA | Reggio di Calabria | Calabria | 080063 | 3 |
| Reggio Calabria | CALABRIA | Rodengo Saiano | Lombardia | 017163 | 10 |
| RODENGO-SAIANO | Lombardia | Cefalù | Sicilia | 082027 | 14 |
| RODENGO-SAIANO | Lombardia | Rodengo Saiano | Lombardia | 017163 | 12 |
With exact joins, the goal is to find a single clear match for each row. In the world of fuzzy matching, the rules change. By using permissive distance thresholds, we are no longer asking the database “find the one correct match”, but rather:
For each input row, find all reference rows that satisfy these loose similarity criteria.
If an input row is “somewhat similar” to multiple reference rows, the database creates an output row for each. This multiplication effect is known as a Cartesian product of matches.
After generating all possible matches, we need to filter them to keep only the best one per input record. The process has three phases:
- Find all possible matches and calculate distances
- Rank each match by total distance
- Select only the top-ranked match (the most similar)
-- Phase 1: find all candidates and calculate distances
WITH all_matches AS (
SELECT
i.city AS input_city,
i.region AS input_region,
r.city AS ref_city,
r.region AS ref_region,
r.city_code,
levenshtein(i.city, r.city) AS city_distance,
levenshtein(i.region, r.region) AS region_distance
FROM read_csv_auto('input.csv') AS i
JOIN read_csv_auto('ref_sample.csv') AS r
ON levenshtein(i.city, r.city) <= 15
AND levenshtein(i.region, r.region) < 10
),
-- Phase 2: rank the candidates
ranked_matches AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY input_city, input_region
ORDER BY (city_distance + region_distance) ASC
) AS rn
FROM all_matches
)
-- Phase 3: keep only the best candidate (rn = 1)
SELECT
input_city,
input_region,
ref_city,
ref_region,
city_code,
city_distance,
region_distance
FROM ranked_matches
WHERE rn = 1;| input_city | input_region | ref_city | ref_region | city_code | city_distance | region_distance |
|---|---|---|---|---|---|---|
| Cefalu’ | Sicilia | Cefalù | Sicilia | 082027 | 2 | 0 |
| RODENGO-SAIANO | Lombardia | Rodengo Saiano | Lombardia | 017163 | 12 | 0 |
| Reggio Calabria | CALABRIA | Reggio di Calabria | Calabria | 080063 | 3 | 7 |
Reducing Distances with Normalization
Here is a summary using the Italian municipality of Forza d'Agro, written incorrectly as Forza D'Agro:
- uppercase
Dinstead of lowercase - two extra spaces (one between
ForzaandD'Agro, one trailing) - no accent on the
o - a typographic apostrophe
\u2019instead of a straight one'
Note: spaces are shown as · for visibility.
| description | left | right | distance | function applied |
|---|---|---|---|---|
| Start | Forza··D\u2019Agro· |
Forza·d'Agr\u00f2 |
7 | |
| Lowercase | forza··d\u2019agro· |
forza·d'agr\u00f2 |
6 | LOWER('value') |
| Trim + collapse spaces | forza·d\u2019agro |
forza·d'agr\u00f2 |
5 | regexp_replace(trim(LOWER('value')), '\s+', ' ') |
| Strip accents | forza·d\u2019agro |
forza·d'agro |
3 | strip_accents(...) |
| Keep only alphanumeric | forza·dagro |
forza·dagro |
0 | regexp_replace(..., '[^a-zA-Z0-9 ]', '', 'g') |
So even a really messy value, full of errors, can be normalized to reduce the distance to zero — meaning Forza D\u2019Agro becomes equivalent to Forza d'Agr\u00f2 and we can retrieve its identifier code.
What Comes Next
In this walkthrough we covered the basics of measuring string distance and normalizing text for more effective comparison: case, whitespace, accents, and special characters.
However, this is just the beginning. The fuzzy matching world is much broader. For larger and messier datasets, more sophisticated methods exist:
- Phonetic algorithms: Instead of looking at how words are spelled, these encode words by how they sound. Methods like Metaphone or Soundex excel at recognizing that “Smith” and “Smythe” are probably the same thing.
- Statistical models (n-grams): These break strings into small fragments (pairs or triplets of characters) and compare their frequency, which is very effective at finding similarities even when word order differs.
Data cleaning tools like OpenRefine integrate dozens of these advanced algorithms, allowing you to cluster and correct similar data with great effectiveness.
Tometo Tomato wraps all of this complexity into a single command. See the Use Case Guide for the complete step-by-step workflow.