K-Means Clustering is a popular algorithm for automatically grouping points into natural clusters. QGIS comes with a Processing Toolbox algorithm ‘K-means clustering’ that can take a vector layer and group features into N clusters. A problem with this algorithm is that you do not have control over how many points end up in each cluster. Many applications require you to segment your data layer into equal sized clusters or clusters having a minimum number of points. Some examples where you may need this

- When planning for FTTH (Fiber-to-the-Home) network one may want to divide a neighborhood into clusters of at least 250 houses for placement of a node.
- Dividing a sales territory/ customers equally among sales teams with customers in the same region are assigned to the same team.

There is a variation of the K-means algorithm called Constrained K-Means Clustering that uses graph theory to find optimal clusters with a user supplied minimum number of points belonging to given clusters. Stanislaw Adaszewski has a nice Python implementation of this algorithm that I have adapted to be used as a Processing Toolbox algorithm in QGIS.

Warning!

I have heard feedback from users that this algorithm doesn’t work on all types of point distributions and may get stuck while finding an optimal solution. I am looking into ways to improve the code and will appreciate if you had feedback.

To test this algorithm, I downloaded the Address Points layer from Washington DCOpen Data Catalog and selected a subset of 1000 random points.

Let’s see how we can create 10 spatial clusters of these addresses. The default QGIS K-Means algorithm clusters the points as follows with number of points per cluster varying between 47 and 169.

Once you add the new Processing Script (instructions below), you will get a new tool named **Constrained K-Means Clustering** – where you can specify a minimum number of points per cluster.

Running this algorithm on the input data with minimum 100 points per cluster, results in the output below – with each cluster having exactly 100 points assigned to it. (Since we asked for 10 clusters from a total of 1000 points)

Want to try this script on your own data?

- Download the constrainted_kmeans.py script
- In QGIS, open Settings → User Profiles → Open Active Profile Folder
- Copy the constrained_kmeans.py script to processing → scripts folder.
- Restart QGIS and launch the script from Processing Toolbox → Scripts → Constrained K-Means Clustering

This script works out-of-the-box on Windows and Mac with official QGIS packages. Linux QGIS users may have to install the dependencies via pip if you don’t already have it.

pip3 install networkx numpy

Below is the full code for the script for reference

```
import networkx as nx
import numpy as np
from itertools import groupby
from PyQt5.QtCore import QCoreApplication, QVariant
from qgis.core import (QgsProcessing, QgsProcessingAlgorithm,
QgsProcessingParameterFeatureSource, QgsProcessingParameterNumber,
QgsProcessingParameterFeatureSink,QgsFields, QgsField, QgsWkbTypes,
QgsFeatureSink, QgsProcessingUtils)
class ConstrainedKMeansAlgorithm(QgsProcessingAlgorithm):
"""Calculates the 2D distance based k-means cluster number for each input feature"""
INPUT = 'INPUT'
CLUSTERS = 'CLUSTERS'
MINPOINTS = 'MINPOINTS'
OUTPUT = 'OUTPUT'
def initAlgorithm(self, config=None):
self.addParameter(
QgsProcessingParameterFeatureSource(
'INPUT',
self.tr('Input Layer'),
types=[QgsProcessing.TypeVectorAnyGeometry]
)
)
self.addParameter(
QgsProcessingParameterNumber(
'CLUSTERS',
self.tr('Number of Clusters'),
QgsProcessingParameterNumber.Integer,
5, False, 1
)
)
self.addParameter(
QgsProcessingParameterNumber(
'MINPOINTS',
self.tr('Miminum Number of Points per Cluster'),
QgsProcessingParameterNumber.Integer,
1, False, 1
)
)
self.addParameter(
QgsProcessingParameterFeatureSink(
self.OUTPUT,
'Clusters',
QgsProcessing.TypeVectorAnyGeometry
)
)
def processAlgorithm(self, parameters, context, feedback):
source= self.parameterAsSource(parameters, self.INPUT, context)
k = self.parameterAsInt(parameters, self.CLUSTERS, context)
minpoints = self.parameterAsInt(parameters, self.MINPOINTS, context)
outputFields = source.fields()
newFields = QgsFields()
newFields.append(QgsField('CLUSTER_ID', QVariant.Int))
newFields.append(QgsField('CLUSTER_SIZE', QVariant.Int))
outputFields = QgsProcessingUtils.combineFields(outputFields, newFields)
sink, dest_id = self.parameterAsSink(
parameters,
self.OUTPUT,
context,
outputFields,
source.wkbType(),
source.sourceCrs()
)
feedback.pushInfo(self.tr( "Collecting input points"))
features = [f for f in source.getFeatures()]
data = []
for f in features:
geometry = f.geometry()
if geometry.wkbType() == QgsWkbTypes.Point:
point = geometry
else:
point = geometry.centroid()
data.append([point.asPoint().x(), point.asPoint().y()])
feedback.pushInfo(self.tr( "Input ready"))
feedback.pushInfo(self.tr( "Computing clusters"))
demand = [minpoints] * k
C, M, f = constrained_kmeans(data, demand)
feedback.pushInfo(self.tr( "Clusters ready"))
# M is the cluster assignment for the data points
# Compute cluster sizes
sorted_M = np.sort(M)
sizes = [len(list(group)) for key, group in groupby(sorted_M)]
for index, out_f in enumerate(features):
attributes = out_f.attributes()
cluster_id = M[index].item()
attributes.append(cluster_id + 1)
attributes.append(sizes[cluster_id])
out_f.setAttributes(attributes)
sink.addFeature(out_f, QgsFeatureSink.FastInsert)
return {self.OUTPUT: sink}
def name(self):
return 'constrained_kmeans'
def displayName(self):
return self.tr('Constrained K-Means Clustering')
def shortHelpString(self):
return self.tr('Constrained K-Means Clustering algorithm PyQGIS implementation')
def group(self):
return self.tr(self.groupId())
def groupId(self):
return ''
def tr(self, string):
return QCoreApplication.translate('Processing', string)
def createInstance(self):
return ConstrainedKMeansAlgorithm()
# Code adapted from https://adared.ch/constrained-k-means-implementation-in-python/
def constrained_kmeans(data, demand, maxiter=None, fixedprec=1e9):
data = np.array(data)
min_ = np.min(data, axis = 0)
max_ = np.max(data, axis = 0)
C = min_ + np.random.random((len(demand), data.shape[1])) * (max_ - min_)
M = np.array([-1] * len(data), dtype=np.int)
itercnt = 0
while True:
itercnt += 1
# memberships
g = nx.DiGraph()
g.add_nodes_from(range(0, data.shape[0]), demand=-1) # points
for i in range(0, len(C)):
g.add_node(len(data) + i, demand=demand[i])
# Calculating cost...
cost = np.array([np.linalg.norm(np.tile(data.T, len(C)).T - np.tile(C, len(data)).reshape(len(C) * len(data), C.shape[1]), axis=1)])
# Preparing data_to_C_edges...
data_to_C_edges = np.concatenate((np.tile([range(0, data.shape[0])],
len(C)).T, np.tile(np.array([range(data.shape[0], data.shape[0] +
C.shape[0])]).T, len(data)).reshape(len(C) * len(data), 1), cost.T *
fixedprec), axis=1).astype(np.uint64) # Adding to graph
g.add_weighted_edges_from(data_to_C_edges)
a = len(data) + len(C)
g.add_node(a, demand=len(data)-np.sum(demand))
C_to_a_edges = np.concatenate((np.array([range(len(data), len(data) + len(C))]).T, np.tile([[a]], len(C)).T), axis=1)
g.add_edges_from(C_to_a_edges)
# Calculating min cost flow...
f = nx.min_cost_flow(g)
# assign
M_new = np.ones(len(data), dtype=np.int) * -1
for i in range(len(data)):
p = sorted(f[i].items(), key=lambda x: x[1])[-1][0]
M_new[i] = p - len(data)
# stop condition
if np.all(M_new == M):
# Stop
return (C, M, f)
M = M_new
# compute new centers
for i in range(len(C)):
C[i, :] = np.mean(data[M==i, :], axis=0)
if maxiter is not None and itercnt >= maxiter:
# Max iterations reached
return (C, M, f)
```

Used this script with + 25 000 points and for 40 clusters. After five hours I gave up, it seems extremely slow for larger data set

It is indeed slow for large datasets but I have tested it with 15k points and it finished in 10 minutes. Would you be able to share your dataset with me at ujaval@spatialthoughts.com. Would like to take a look and see if there are ways to speed up.

Hello,

I tried to apply this script to various datasets, ranging from a few points to larger amounts.

Every time, it looks like the script loops endlessly, even for very simple cases (50 points, 2 clusters..), I can’t even stop its execution manually in QGIS.

Is this something that happened for you ?

Ya. I got some feedback recently that this script doesn’t work for all datasets and gets stuck in finding a solution for some distributions. The dataset linked in the post and many other datasets I tested still work, but some datasets 0 like your fail. I have looked into it but haven’t yet figured out why.

A wild guess : maybe that’s due to the formatting of the dataset.

What kind of dataset/formatting are you using ?

I tried with shapefile and geopackage, doesn’t seem to work out, but as the code needs properly formatted data to process, maybe that’s where it gets messy with the variety of formats that can be used for geospatial data.

The format doesn’t seem to be the issue.

I did manage to make it work by limiting the number of processed features.

Then, I did numerous runs, each time changing the parameters and limit number of processed features.

From what I can tell, sometimes a given configuration (for instance : 150 features max, 75 points min per cluster, 2 clusters) will work, and sometimes not.

I can also tell that points are always taken in the same order, so the problem is not coming from, for instance, faulty geometries among the dataset.

Thus, it may come from a part of the code where a random function is called.

I will get into your code to try and debug, but as you wrote it yourself, may my feedback be of help to find where’s the issue.

Thanks. The issue is in the code for the ‘Constrained K-Means’ implementation taken from this project. https://adared.ch/constrained-k-means-implementation-in-python/

[…] K-Means Clustering with Equal Sized Clusters in QGIS […]

At first we also had problems with no-end-runs but eventually tried it with WGS84 projection (before ETRS89) and then it worked. Maybe that’s the case for others as well.

Hey ujaval,

thanks for you great work in the (q)gis sector. We already had contact because of my order_by overlay_intersects problem.

I also found a solution for this problem just now:

first, i’ve put the “def constrained_kmeans” part into the “def processAlgorithm” part so i could use the “feedback.pushInfo” command to find the sequence the code hangs. This ofc wasn’t necessary to solve the problem, but i found that it hangs at

“# Calculating min cost flow…

f = nx.min_cost_flow(g)”

my solution was to reduce “fixedprec=1e9” (one billion, 1^9) to “fixedprec=1e6” (one million, 1^6).

Now the code finishes everytime. I have to admit that i had not time to fully understand why, but the result fits my needs.

Very cool. Thanks for letting me know. I have some test cases where it failed so let me test. Will update the code with your suggestion.

Could you test? After some more testing i get different results with the same settings and origin data. That is somehow crazy.

kmeans uses random starting values and it is possible the solution will be not identical. Read more on how KMeans works. As you changed the code it may not be allowing the algorithm to find the optimal clusters.

Allright, thank you. The Clustersize fits well, but the clusters are always shifted, implying a random factor.