K-Means Clustering with Equal Sized Clusters in QGIS

K-Means Clustering is a popular algorithm for automatically grouping points into natural clusters. QGIS comes with a Processing Toolbox algorithm ‘K-means clustering’ that can take a vector layer and group features into N clusters. A problem with this algorithm is that you do not have control over how many points end up in each cluster. Many applications require you to segment your data layer into equal sized clusters or clusters having a minimum number of points. Some examples where you may need this

  • When planning for FTTH (Fiber-to-the-Home) network one may want to divide a neighborhood into clusters of at least 250 houses for placement of a node.
  • Dividing a sales territory/ customers equally among sales teams with customers in the same region are assigned to the same team.

There is a variation of the K-means algorithm called Constrained K-Means Clustering that uses graph theory to find optimal clusters with a user supplied minimum number of points belonging to given clusters. Stanislaw Adaszewski has a nice Python implementation of this algorithm that I have adapted to be used as a Processing Toolbox algorithm in QGIS.


I have heard feedback from users that this algorithm doesn’t work on all types of point distributions and may get stuck while finding an optimal solution. I am looking into ways to improve the code and will appreciate if you had feedback.

To test this algorithm, I downloaded the Address Points layer from Washington DCOpen Data Catalog and selected a subset of 1000 random points.

Let’s see how we can create 10 spatial clusters of these addresses. The default QGIS K-Means algorithm clusters the points as follows with number of points per cluster varying between 47 and 169.

Default K-Means Algorithm with varying cluster sizes

Once you add the new Processing Script (instructions below), you will get a new tool named Constrained K-Means Clustering – where you can specify a minimum number of points per cluster.

Running this algorithm on the input data with minimum 100 points per cluster, results in the output below – with each cluster having exactly 100 points assigned to it. (Since we asked for 10 clusters from a total of 1000 points)

Constrained K-Means algorithm with equal cluster sizes

Want to try this script on your own data?

  • Download the constrainted_kmeans.py script
  • In QGIS, open Settings → User Profiles → Open Active Profile Folder
  • Copy the constrained_kmeans.py script to processing → scripts folder.
  • Restart QGIS and launch the script from Processing Toolbox → Scripts → Constrained K-Means Clustering

This script works out-of-the-box on Windows and Mac with official QGIS packages. Linux QGIS users may have to install the dependencies via pip if you don’t already have it.

pip3 install networkx numpy

Below is the full code for the script for reference

import networkx as nx
import numpy as np
from itertools import groupby
from PyQt5.QtCore import QCoreApplication, QVariant

from qgis.core import (QgsProcessing, QgsProcessingAlgorithm, 
    QgsProcessingParameterFeatureSource, QgsProcessingParameterNumber,
    QgsProcessingParameterFeatureSink,QgsFields, QgsField, QgsWkbTypes,
    QgsFeatureSink, QgsProcessingUtils)

class ConstrainedKMeansAlgorithm(QgsProcessingAlgorithm):
    """Calculates the 2D distance based k-means cluster number for each input feature"""

    def initAlgorithm(self, config=None):
                self.tr('Input Layer'),
                self.tr('Number of Clusters'),
                5, False, 1
                self.tr('Miminum Number of Points per Cluster'),
                1, False, 1

    def processAlgorithm(self, parameters, context, feedback):
        source= self.parameterAsSource(parameters, self.INPUT, context)
        k = self.parameterAsInt(parameters, self.CLUSTERS, context)
        minpoints = self.parameterAsInt(parameters, self.MINPOINTS, context)
        outputFields = source.fields()
        newFields = QgsFields()
        newFields.append(QgsField('CLUSTER_ID', QVariant.Int))
        newFields.append(QgsField('CLUSTER_SIZE', QVariant.Int))

        outputFields = QgsProcessingUtils.combineFields(outputFields, newFields)
        sink, dest_id = self.parameterAsSink(
        feedback.pushInfo(self.tr( "Collecting input points"))
        features = [f for f in source.getFeatures()]
        data = []
        for f in features:
            geometry = f.geometry()
            if geometry.wkbType() == QgsWkbTypes.Point:
                point = geometry
                point = geometry.centroid()
            data.append([point.asPoint().x(), point.asPoint().y()])
        feedback.pushInfo(self.tr( "Input ready"))
        feedback.pushInfo(self.tr( "Computing clusters"))

        demand = [minpoints] * k
        C, M, f = constrained_kmeans(data, demand)
        feedback.pushInfo(self.tr( "Clusters ready"))

        # M is the cluster assignment for the data points
        # Compute cluster sizes
        sorted_M = np.sort(M)
        sizes = [len(list(group)) for key, group in groupby(sorted_M)]
        for index, out_f in enumerate(features):
            attributes = out_f.attributes()
            cluster_id = M[index].item()
            attributes.append(cluster_id + 1)

            sink.addFeature(out_f, QgsFeatureSink.FastInsert)
        return {self.OUTPUT: sink} 

    def name(self):
        return 'constrained_kmeans'

    def displayName(self):
        return self.tr('Constrained K-Means Clustering')
    def shortHelpString(self):
        return self.tr('Constrained K-Means Clustering algorithm PyQGIS implementation')

    def group(self):
        return self.tr(self.groupId())

    def groupId(self):
        return ''

    def tr(self, string):
        return QCoreApplication.translate('Processing', string)

    def createInstance(self):
        return ConstrainedKMeansAlgorithm()

# Code adapted from https://adared.ch/constrained-k-means-implementation-in-python/
def constrained_kmeans(data, demand, maxiter=None, fixedprec=1e9):
  data = np.array(data)
  min_ = np.min(data, axis = 0)
  max_ = np.max(data, axis = 0)
  C = min_ + np.random.random((len(demand), data.shape[1])) * (max_ - min_)
  M = np.array([-1] * len(data), dtype=np.int)
  itercnt = 0
  while True:
    itercnt += 1
    # memberships
    g = nx.DiGraph()
    g.add_nodes_from(range(0, data.shape[0]), demand=-1) # points
    for i in range(0, len(C)):
      g.add_node(len(data) + i, demand=demand[i])
    # Calculating cost...
    cost = np.array([np.linalg.norm(np.tile(data.T, len(C)).T - np.tile(C, len(data)).reshape(len(C) * len(data), C.shape[1]), axis=1)])
    # Preparing data_to_C_edges...
    data_to_C_edges = np.concatenate((np.tile([range(0, data.shape[0])],
    len(C)).T, np.tile(np.array([range(data.shape[0], data.shape[0] +
    C.shape[0])]).T, len(data)).reshape(len(C) * len(data), 1), cost.T *
    fixedprec), axis=1).astype(np.uint64) # Adding to graph

    a = len(data) + len(C)
    g.add_node(a, demand=len(data)-np.sum(demand))
    C_to_a_edges = np.concatenate((np.array([range(len(data), len(data) + len(C))]).T, np.tile([[a]], len(C)).T), axis=1)
    # Calculating min cost flow...
    f = nx.min_cost_flow(g)
    # assign
    M_new = np.ones(len(data), dtype=np.int) * -1
    for i in range(len(data)):
      p = sorted(f[i].items(), key=lambda x: x[1])[-1][0]
      M_new[i] = p - len(data)
    # stop condition
    if np.all(M_new == M):
      # Stop
      return (C, M, f)
    M = M_new
    # compute new centers
    for i in range(len(C)):
      C[i, :] = np.mean(data[M==i, :], axis=0)
    if maxiter is not None and itercnt >= maxiter:
      # Max iterations reached
      return (C, M, f)


Leave a Comment

  1. Used this script with + 25 000 points and for 40 clusters. After five hours I gave up, it seems extremely slow for larger data set

  2. Hello,

    I tried to apply this script to various datasets, ranging from a few points to larger amounts.

    Every time, it looks like the script loops endlessly, even for very simple cases (50 points, 2 clusters..), I can’t even stop its execution manually in QGIS.

    Is this something that happened for you ?

    • Ya. I got some feedback recently that this script doesn’t work for all datasets and gets stuck in finding a solution for some distributions. The dataset linked in the post and many other datasets I tested still work, but some datasets 0 like your fail. I have looked into it but haven’t yet figured out why.

      • A wild guess : maybe that’s due to the formatting of the dataset.
        What kind of dataset/formatting are you using ?

        I tried with shapefile and geopackage, doesn’t seem to work out, but as the code needs properly formatted data to process, maybe that’s where it gets messy with the variety of formats that can be used for geospatial data.

  3. The format doesn’t seem to be the issue.

    I did manage to make it work by limiting the number of processed features.
    Then, I did numerous runs, each time changing the parameters and limit number of processed features.

    From what I can tell, sometimes a given configuration (for instance : 150 features max, 75 points min per cluster, 2 clusters) will work, and sometimes not.

    I can also tell that points are always taken in the same order, so the problem is not coming from, for instance, faulty geometries among the dataset.

    Thus, it may come from a part of the code where a random function is called.

    I will get into your code to try and debug, but as you wrote it yourself, may my feedback be of help to find where’s the issue.

  4. At first we also had problems with no-end-runs but eventually tried it with WGS84 projection (before ETRS89) and then it worked. Maybe that’s the case for others as well.

  5. Hey ujaval,

    thanks for you great work in the (q)gis sector. We already had contact because of my order_by overlay_intersects problem.

    I also found a solution for this problem just now:
    first, i’ve put the “def constrained_kmeans” part into the “def processAlgorithm” part so i could use the “feedback.pushInfo” command to find the sequence the code hangs. This ofc wasn’t necessary to solve the problem, but i found that it hangs at
    “# Calculating min cost flow…
    f = nx.min_cost_flow(g)”

    my solution was to reduce “fixedprec=1e9” (one billion, 1^9) to “fixedprec=1e6” (one million, 1^6).
    Now the code finishes everytime. I have to admit that i had not time to fully understand why, but the result fits my needs.

    • Very cool. Thanks for letting me know. I have some test cases where it failed so let me test. Will update the code with your suggestion.

      • Could you test? After some more testing i get different results with the same settings and origin data. That is somehow crazy.

      • kmeans uses random starting values and it is possible the solution will be not identical. Read more on how KMeans works. As you changed the code it may not be allowing the algorithm to find the optimal clusters.

      • Allright, thank you. The Clustersize fits well, but the clusters are always shifted, implying a random factor.

  6. I have a hierarchical data where I want to create 2 clusters per Union Council. I want to join wards within a Union Council to create 2 clusters within each UC. Is this possible

  7. It was working fine and then I updated to qgis 3.32.3 and now I get:

    Traceback (most recent call last):
    File “C:\Users\godbyu\AppData\Roaming\QGIS\QGIS3\profiles\default\processing\scripts\constrainted_kmeans.py”, line 92, in processAlgorithm
    C, M, f = constrained_kmeans(data, demand)
    File “C:\Users\godbyu\AppData\Roaming\QGIS\QGIS3\profiles\default\processing\scripts\constrainted_kmeans.py”, line 140, in constrained_kmeans
    M = np.array([-1] * len(data), dtype=np.int)
    File “C:\PROGRA~1\QGIS33~1.3\apps\Python39\lib\site-packages\numpy\__init__.py”, line 284, in __getattr__
    raise AttributeError(“module {!r} has no attribute ”
    AttributeError: module ‘numpy’ has no attribute ‘int’

    Any ideas?

      • Hey, Ujaval – thanks for all your effort in this. I am using this to make clusters out of 1276 points (farm plots centroids) to a cluster set of around 8-10 plots (I set min at 8, cluster count at 105).

        The runtime error is same as George Godby’s above. Does this solution mean that I open the python script and change all .int datatypes into .int_ ones? I am on QGIS 3.34.2 / Win11.

      • So, I ran the script again after changing references to np.int to np.int_ in lines 140 and 172. The script now runs but throws the error below:

        Collecting input points
        Input ready
        Computing clusters
        Traceback (most recent call last):
        File “C:\Users\atree\AppData\Roaming\QGIS\QGIS3\profiles\default\processing\scripts\constrainted_kmeans.py”, line 92, in processAlgorithm
        C, M, f = constrained_kmeans(data, demand)
        File “C:\Users\atree\AppData\Roaming\QGIS\QGIS3\profiles\default\processing\scripts\constrainted_kmeans.py”, line 169, in constrained_kmeans
        f = nx.min_cost_flow(g)
        File “C:\PROGRA~1\QGIS\apps\Python39\lib\site-packages\networkx\algorithms\flow\mincost.py”, line 184, in min_cost_flow
        return nx.network_simplex(G, demand=demand, capacity=capacity, weight=weight)[1]
        File “”, line 2, in network_simplex
        File “C:\PROGRA~1\QGIS\apps\Python39\lib\site-packages\networkx\utils\decorators.py”, line 78, in _not_implemented_for
        return not_implement_for_func(*args, **kwargs)
        File “C:\PROGRA~1\QGIS\apps\Python39\lib\site-packages\networkx\algorithms\flow\networksimplex.py”, line 546, in network_simplex
        raise nx.NetworkXUnfeasible(“no flow satisfies all node demands”)
        networkx.exception.NetworkXUnfeasible: no flow satisfies all node demands

        Execution failed after 1.16 seconds

        Loading resulting layers
        Algorithm ‘Constrained K-Means Clustering’ finished

        Any ideas?

Leave a Reply