Table Joins are a way to join 2 separate layers based on a common attribute value. QGIS has a Join Attributes By Field Value algorithm that allows you to table joins. A limitation of this algorithm is that the field values must match exactly. If the values differ slightly – the join will fail. There are many times where you are trying to join 2 layers from different sources and they contain values which are similar but may not match exactly. Fortunately QGIS now has built-in fuzzy string matching functions that can be used – along with Aggregate function – to do table join based on fuzzy matches.
I will demonstrate this here using a point layer of city names from Geonames and data table of census data from Socioeconomic Data and Applications Center (SEDAC). You can download a geopackage file fuzzy_join.gpkg containing the layers shown below.
The demo project contains 2 layers. cities and census. We want to join the census layer with cities
We can do a join based on the name contained in the asciiname column in cities layer and NAME column from census layer.
If we try the regular table join by trying to match the names in these columns, there are many records that cannot be matched. This is because the names are not a 100% match.
We can add find a fuzzy match using the built-in Levenshtein distance function. Let’s add a new virtual field to the cities layer.
Using an aggregate function, we can find all similar names (low levenshtein distance). Here we are using array_agg aggegate, so the result is a list of matching names.
aggregate( layer:= 'census', aggregate:='array_agg', expression:=NAME, filter:=levenshtein(NAME, attribute(@parent, 'asciiname')) <= 2, order_by:=levenshtein(NAME, attribute(@parent, 'asciiname')) )
You may have noticed the order_by expression. This is a very useful new feature added in QGIS 3.8 and it is what enables us to pick the name which is the closest match by sorting by levensthein distance. Once we have the sorted array of names, we can call array_first function to pick the first one (closest match). The final expression is below
array_first(aggregate( layer:= 'census', aggregate:='array_agg', expression:=NAME, filter:=levenshtein(NAME, attribute(@parent, 'asciiname')) <= 2, order_by:=levenshtein(NAME, attribute(@parent, 'asciiname')) ))
Now we can attempt a table join using the newly added fuzzymatch virtual field. Since fuzzymatch is a virtual field that will be evaluated for each feature at runtime – this process make take a long time. For this example, the join took approximately 10 minutes on my laptop.
This performs much better and results in more matches than an exact match. The failed matches are down from 30 to 6.
In the table below, you can see the names that were matched using the fuzzy matching algorithm. There are slight differences between them, but they likely refer to the same place.
In this example, the filter expression considers all the features from the target layer. This may not be ideal in cases where there is a possibility of similar names in the target. For example, you may find a similar city name from a district that matches a city from another district. To mitigate such problem, you may use additional rules in filter expression that restrict the selection to features from the same district as the @parent feature. If the target layer is a spatial layer, you can also filter by proximity.
All cities with a population > 15000 or capitals, cities15000.zip downloaded from GeoNames
India Village-Level Geospatial Socio-Economic Data Set, v1 (1991, 2001) Meiyappan, P., P. S. Roy, A. Soliman, T. Li, P. Mondal, S. Wang, and A. K. Jain. 2018. India Village-Level Geospatial Socio-Economic Data Set: 1991, 2001. Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC).https://doi.org/10.7927/H4CN71ZJ. Accessed 26-09-2019