K-Means Clustering with Equal Sized Clusters in QGIS

K-Means Clustering is a popular algorithm for automatically grouping points into natural clusters. QGIS comes with a Processing Toolbox algorithm ‘K-means clustering’ that takes a vector layer and groups its features into N clusters. A limitation of this algorithm is that you have no control over how many points end up in each cluster, while many applications require you to segment your data layer into equal-sized clusters, or clusters with a minimum number of points. Some examples where you may need this:

  • When planning an FTTH (Fiber-to-the-Home) network, you may want to divide a neighborhood into clusters of at least 250 houses for the placement of a node.
  • When dividing customers in a sales territory equally among sales teams, you may want customers in the same region to be assigned to the same team.

There is a variation of the K-means algorithm called Constrained K-Means Clustering that uses graph theory to find optimal clusters subject to a user-supplied minimum number of points per cluster. Stanislaw Adaszewski has a nice Python implementation of this algorithm, which I have adapted into a Processing Toolbox algorithm for QGIS.
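
At its core, the constrained algorithm replaces the usual “assign each point to the nearest center” step with a minimum-cost flow problem: each point supplies one unit of flow, each cluster center has a demand equal to its minimum size, and edge costs are point-to-center distances. Here is a minimal, self-contained sketch of that formulation using networkx; the point names, distances, and demands are purely illustrative, not taken from the script below.

import networkx as nx

# Toy instance: 4 points, 2 clusters, each cluster must get at least 2 points.
points = ['p0', 'p1', 'p2', 'p3']
clusters = ['c0', 'c1']
# Illustrative point-to-cluster distances (the real script uses Euclidean distances).
dist = {('p0', 'c0'): 1, ('p0', 'c1'): 5,
        ('p1', 'c0'): 2, ('p1', 'c1'): 4,
        ('p2', 'c0'): 5, ('p2', 'c1'): 1,
        ('p3', 'c0'): 4, ('p3', 'c1'): 2}

g = nx.DiGraph()
for p in points:
    g.add_node(p, demand=-1)   # each point supplies one unit of flow
for c in clusters:
    g.add_node(c, demand=2)    # each cluster must absorb at least 2 units
for (p, c), d in dist.items():
    g.add_edge(p, c, weight=d)

# Supply (4 units) exactly matches demand (2 + 2) here, so no surplus node
# is needed; the full script adds one to absorb points beyond the minimums.
flow = nx.min_cost_flow(g)
for p in points:
    print(p, '->', max(flow[p], key=flow[p].get))
# Prints: p0 -> c0, p1 -> c0, p2 -> c1, p3 -> c1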

Warning!

I have heard feedback from users that this algorithm doesn’t work on all types of point distributions and may get stuck while searching for an optimal solution. I am looking into ways to improve the code and would appreciate your feedback.

To test this algorithm, I downloaded the Address Points layer from the Washington DC Open Data Catalog and selected a subset of 1000 random points.

Let’s see how we can create 10 spatial clusters of these addresses. The default QGIS K-Means algorithm clusters the points as shown below, with the number of points per cluster varying between 47 and 169.

Default K-Means Algorithm with varying cluster sizes
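
If you script your workflows, the same built-in run can be reproduced from the QGIS Python console. This is a sketch assuming QGIS 3.x, where the built-in algorithm id is native:kmeansclustering and your address points are the active layer.

import processing
from qgis.utils import iface

# Run the built-in K-means with 10 clusters; note there is no parameter
# to control how many points end up in each cluster.
result = processing.run('native:kmeansclustering', {
    'INPUT': iface.activeLayer(),
    'CLUSTERS': 10,
    'OUTPUT': 'memory:kmeans'
})
kmeans_layer = result['OUTPUT']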

Once you add the new Processing Script (instructions below), you will get a new tool named Constrained K-Means Clustering, where you can specify a minimum number of points per cluster.

Running this algorithm on the input data with a minimum of 100 points per cluster results in the output below, with each cluster having exactly 100 points assigned to it (since we asked for 10 clusters from a total of 1000 points).

Constrained K-Means algorithm with equal cluster sizes
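
The tool can also be driven from the Python console once the script is installed. A sketch, assuming the Processing scripts provider exposes it under the id script:constrained_kmeans (derived from the name() method in the code below):

import processing
from qgis.utils import iface

# 10 clusters over 1000 points with at least 100 points each,
# so every cluster receives exactly 100 points.
result = processing.run('script:constrained_kmeans', {
    'INPUT': iface.activeLayer(),
    'CLUSTERS': 10,
    'MINPOINTS': 100,
    'OUTPUT': 'memory:constrained'
})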

Want to try this script on your own data?

  • Download the constrained_kmeans.py script
  • In QGIS, open Settings → User Profiles → Open Active Profile Folder (or locate the folder programmatically, as shown in the snippet after this list)
  • Copy the constrained_kmeans.py script to the processing → scripts folder.
  • Restart QGIS and launch the script from Processing Toolbox → Scripts → Constrained K-Means Clustering
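
If you are unsure where the active profile folder lives, this snippet, run from the QGIS Python console, prints the scripts folder path (QgsApplication.qgisSettingsDirPath() returns the active profile directory):

import os
from qgis.core import QgsApplication

# The Processing scripts folder sits under the active profile directory.
profile_dir = QgsApplication.qgisSettingsDirPath()
print(os.path.join(profile_dir, 'processing', 'scripts'))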

This script works out-of-the-box on Windows and Mac with the official QGIS packages. Linux QGIS users may have to install the dependencies via pip if they are not already present:

pip3 install networkx numpy
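
To confirm the dependencies are visible to QGIS, you can run a quick check in the QGIS Python console; if this prints two version numbers without an ImportError, the script will find them:

import networkx
import numpy

# Both imports must succeed for the script to load.
print(networkx.__version__, numpy.__version__)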

Below is the full code of the script, for reference:

import networkx as nx
import numpy as np
from itertools import groupby
from PyQt5.QtCore import QCoreApplication, QVariant

from qgis.core import (QgsProcessing, QgsProcessingAlgorithm,
    QgsProcessingParameterFeatureSource, QgsProcessingParameterNumber,
    QgsProcessingParameterFeatureSink, QgsFields, QgsField, QgsWkbTypes,
    QgsFeatureSink, QgsProcessingUtils)


class ConstrainedKMeansAlgorithm(QgsProcessingAlgorithm):
    """Calculates the 2D distance based k-means cluster number for each input feature"""
    INPUT = 'INPUT'
    CLUSTERS = 'CLUSTERS'
    MINPOINTS = 'MINPOINTS'
    OUTPUT = 'OUTPUT'

    
    def initAlgorithm(self, config=None):
        self.addParameter(
            QgsProcessingParameterFeatureSource(
                'INPUT',
                self.tr('Input Layer'),
                types=[QgsProcessing.TypeVectorAnyGeometry]
            )
        )
        
        self.addParameter(
            QgsProcessingParameterNumber(
                'CLUSTERS',
                self.tr('Number of Clusters'),
                QgsProcessingParameterNumber.Integer,
                5, False, 1
            )
        )
        
        self.addParameter(
            QgsProcessingParameterNumber(
                'MINPOINTS',
                self.tr('Minimum Number of Points per Cluster'),
                QgsProcessingParameterNumber.Integer,
                1, False, 1
            )
        )
        
        
        self.addParameter(
            QgsProcessingParameterFeatureSink(
                self.OUTPUT,
                'Clusters',
                QgsProcessing.TypeVectorAnyGeometry
            )
        )

    def processAlgorithm(self, parameters, context, feedback):
        source = self.parameterAsSource(parameters, self.INPUT, context)
        k = self.parameterAsInt(parameters, self.CLUSTERS, context)
        minpoints = self.parameterAsInt(parameters, self.MINPOINTS, context)
        
        outputFields = source.fields()
        newFields = QgsFields()
        newFields.append(QgsField('CLUSTER_ID', QVariant.Int))
        newFields.append(QgsField('CLUSTER_SIZE', QVariant.Int))

        outputFields = QgsProcessingUtils.combineFields(outputFields, newFields)
        sink, dest_id = self.parameterAsSink(
            parameters,
            self.OUTPUT,
            context,
            outputFields,
            source.wkbType(),
            source.sourceCrs()
            )
        feedback.pushInfo(self.tr( "Collecting input points"))
        
        features = [f for f in source.getFeatures()]
        data = []
        for f in features:
            geometry = f.geometry()
            if geometry.wkbType() == QgsWkbTypes.Point:
                point = geometry
            else:
                point = geometry.centroid()
            
            data.append([point.asPoint().x(), point.asPoint().y()])
        
        feedback.pushInfo(self.tr( "Input ready"))
        feedback.pushInfo(self.tr( "Computing clusters"))

        # Every cluster must receive at least 'minpoints' points; the
        # min-cost flow inside constrained_kmeans() enforces these demands.
        demand = [minpoints] * k
        C, M, f = constrained_kmeans(data, demand)
        feedback.pushInfo(self.tr("Clusters ready"))

        # M is the cluster assignment for the data points.
        # Compute cluster sizes; every cluster is non-empty since each
        # must receive at least 'minpoints' points.
        sorted_M = np.sort(M)
        sizes = [len(list(group)) for key, group in groupby(sorted_M)]

        for index, out_f in enumerate(features):
            attributes = out_f.attributes()
            cluster_id = M[index].item()
            # Write 1-based cluster ids and the size of the assigned cluster
            attributes.append(cluster_id + 1)
            attributes.append(sizes[cluster_id])

            out_f.setAttributes(attributes)
            sink.addFeature(out_f, QgsFeatureSink.FastInsert)
        # Return the destination id of the sink layer
        return {self.OUTPUT: dest_id}

    def name(self):
        return 'constrained_kmeans'

    def displayName(self):
        return self.tr('Constrained K-Means Clustering')
        
    def shortHelpString(self):
        return self.tr('Constrained K-Means Clustering algorithm PyQGIS implementation')

    def group(self):
        return self.tr(self.groupId())

    def groupId(self):
        return ''

    def tr(self, string):
        return QCoreApplication.translate('Processing', string)

    def createInstance(self):
        return ConstrainedKMeansAlgorithm()

# Code adapted from https://adared.ch/constrained-k-means-implementation-in-python/
def constrained_kmeans(data, demand, maxiter=None, fixedprec=1e9):
  data = np.array(data)
  
  min_ = np.min(data, axis = 0)
  max_ = np.max(data, axis = 0)
  
  C = min_ + np.random.random((len(demand), data.shape[1])) * (max_ - min_)
  M = np.array([-1] * len(data), dtype=int)
  
  itercnt = 0
  while True:
    itercnt += 1
    
    # memberships
    g = nx.DiGraph()
    g.add_nodes_from(range(0, data.shape[0]), demand=-1) # points
    for i in range(0, len(C)):
      g.add_node(len(data) + i, demand=demand[i])
    
    # Calculating cost: distance from every point to every cluster center
    cost = np.array([np.linalg.norm(
        np.tile(data.T, len(C)).T -
        np.tile(C, len(data)).reshape(len(C) * len(data), C.shape[1]),
        axis=1)])

    # Preparing data_to_C_edges: (point, cluster, scaled cost) triples;
    # costs are scaled by fixedprec and cast to integers for min_cost_flow
    data_to_C_edges = np.concatenate((
        np.tile([range(0, data.shape[0])], len(C)).T,
        np.tile(np.array([range(data.shape[0],
                                data.shape[0] + C.shape[0])]).T,
                len(data)).reshape(len(C) * len(data), 1),
        cost.T * fixedprec), axis=1).astype(np.uint64)

    # Adding to graph
    g.add_weighted_edges_from(data_to_C_edges)
    

    # Artificial node 'a' absorbs the surplus flow, i.e. the points
    # beyond the minimum demands of the clusters
    a = len(data) + len(C)
    g.add_node(a, demand=len(data) - np.sum(demand))
    C_to_a_edges = np.concatenate((np.array([range(len(data), len(data) + len(C))]).T, np.tile([[a]], len(C)).T), axis=1)
    g.add_edges_from(C_to_a_edges)
    
    
    # Calculating min cost flow...
    f = nx.min_cost_flow(g)
    
    # assign each point to the cluster that received its unit of flow
    M_new = np.ones(len(data), dtype=int) * -1
    for i in range(len(data)):
      p = sorted(f[i].items(), key=lambda x: x[1])[-1][0]
      M_new[i] = p - len(data)
      
    # stop condition
    if np.all(M_new == M):
      # Stop
      return (C, M, f)
      
    M = M_new
      
    # compute new centers
    for i in range(len(C)):
      C[i, :] = np.mean(data[M==i, :], axis=0)
      
    if maxiter is not None and itercnt >= maxiter:
      # Max iterations reached
      return (C, M, f)
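
Since constrained_kmeans() has no QGIS dependencies, you can also exercise it on its own, which is handy for investigating the convergence issues mentioned in the warning above. A minimal sketch with synthetic data; the maxiter argument bounds the run in case it fails to converge:

import numpy as np

# 200 random 2D points, 4 clusters of at least 50 points each,
# i.e. exactly equal-sized clusters.
np.random.seed(42)
data = np.random.rand(200, 2)
C, M, f = constrained_kmeans(data, demand=[50] * 4, maxiter=100)

print('Centers:')
print(C)
print('Cluster sizes:', np.bincount(M))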

7 Comments


  1. Used this script with 25,000+ points and 40 clusters. After five hours I gave up; it seems extremely slow for larger datasets.

  2. Hello,

    I tried to apply this script to various datasets, ranging from a few points to larger amounts.

    Every time, it looks like the script loops endlessly, even for very simple cases (50 points, 2 clusters…). I can’t even stop its execution manually in QGIS.

    Is this something that happened for you?

    • Ya, I got some feedback recently that this script doesn’t work for all datasets and gets stuck finding a solution for some distributions. The dataset linked in the post and many other datasets I tested still work, but some datasets, like yours, fail. I have looked into it but haven’t yet figured out why.

      • A wild guess: maybe that’s due to the formatting of the dataset.
        What kind of dataset/format are you using?

        I tried with shapefile and geopackage and it doesn’t seem to work either way; but since the code needs properly formatted data to process, maybe that’s where it gets messy, given the variety of formats that can be used for geospatial data.

  3. The format doesn’t seem to be the issue.

    I did manage to make it work by limiting the number of processed features.
    Then I did numerous runs, each time changing the parameters and the limit on the number of processed features.

    From what I can tell, sometimes a given configuration (for instance: 150 features max, 75 points min per cluster, 2 clusters) will work, and sometimes not.

    I can also tell that the points are always taken in the same order, so the problem is not coming from, for instance, faulty geometries in the dataset.

    Thus, it may come from a part of the code where a random function is called.

    I will dig into your code to try to debug it; since you wrote it yourself, I hope my feedback helps you find where the issue is.
