K-Means Clustering with Equal Sized Clusters in QGIS

K-Means Clustering is a popular algorithm for automatically grouping points into natural clusters. QGIS comes with a Processing Toolbox algorithm ‘K-means clustering’ that can take a vector layer and group features into N clusters. A problem with this algorithm is that you do not have control over how many points end up in each cluster. Many applications require you to segment your data layer into equal sized clusters or clusters having a minimum number of points. Some examples where you may need this

  • When planning for FTTH (Fiber-to-the-Home) network one may want to divide a neighborhood into clusters of at least 250 houses for placement of a node.
  • Dividing a sales territory/ customers equally among sales teams with customers in the same region are assigned to the same team.

There is a variation of the K-means algorithm called Constrained K-Means Clustering that uses graph theory to find optimal clusters with a user supplied minimum number of points belonging to given clusters. Stanislaw Adaszewski has a nice Python implementation of this algorithm that I have adapted to be used as a Processing Toolbox algorithm in QGIS.

To test this algorithm, I downloaded the Address Points layer from Washington DCOpen Data Catalog and selected a subset of 1000 random points.

Let’s see how we can create 10 spatial clusters of these addresses. The default QGIS K-Means algorithm clusters the points as follows with number of points per cluster varying between 47 and 169.

Default K-Means Algorithm with varying cluster sizes

Once you add the new Processing Script (instructions below), you will get a new tool named Constrained K-Means Clustering – where you can specify a minimum number of points per cluster.

Running this algorithm on the input data with minimum 100 points per cluster, results in the output below – with each cluster having exactly 100 points assigned to it. (Since we asked for 10 clusters from a total of 1000 points)

Constrained K-Means algorithm with equal cluster sizes

Want to try this script on your own data?

  • Download the constrainted_kmeans.py script
  • In QGIS, open Settings → User Profiles → Open Active Profile Folder
  • Copy the constrained_kmeans.py script to processing → scripts folder.
  • Restart QGIS and launch the script from Processing Toolbox → Scripts → Constrained K-Means Clustering

This script works out-of-the-box on Windows and Mac with official QGIS packages. Linux QGIS users may have to install the dependencies via pip if you don’t already have it.

pip3 install networkx numpy

Below is the full code for the script for reference

import networkx as nx
import numpy as np
from itertools import groupby
from PyQt5.QtCore import QCoreApplication, QVariant

from qgis.core import (QgsProcessing, QgsProcessingAlgorithm, 
    QgsProcessingParameterFeatureSource, QgsProcessingParameterNumber,
    QgsProcessingParameterFeatureSink,QgsFields, QgsField, QgsWkbTypes,
    QgsFeatureSink, QgsProcessingUtils)

class ConstrainedKMeansAlgorithm(QgsProcessingAlgorithm):
    """Calculates the 2D distance based k-means cluster number for each input feature"""

    def initAlgorithm(self, config=None):
                self.tr('Input Layer'),
                self.tr('Number of Clusters'),
                5, False, 1
                self.tr('Miminum Number of Points per Cluster'),
                1, False, 1

    def processAlgorithm(self, parameters, context, feedback):
        source= self.parameterAsSource(parameters, self.INPUT, context)
        k = self.parameterAsInt(parameters, self.CLUSTERS, context)
        minpoints = self.parameterAsInt(parameters, self.MINPOINTS, context)
        outputFields = source.fields()
        newFields = QgsFields()
        newFields.append(QgsField('CLUSTER_ID', QVariant.Int))
        newFields.append(QgsField('CLUSTER_SIZE', QVariant.Int))

        outputFields = QgsProcessingUtils.combineFields(outputFields, newFields)
        sink, dest_id = self.parameterAsSink(
        feedback.pushInfo(self.tr( "Collecting input points"))
        features = [f for f in source.getFeatures()]
        data = []
        for f in features:
            geometry = f.geometry()
            if geometry.wkbType() == QgsWkbTypes.Point:
                point = geometry
                point = geometry.centroid()
            data.append([point.asPoint().x(), point.asPoint().y()])
        feedback.pushInfo(self.tr( "Input ready"))
        feedback.pushInfo(self.tr( "Computing clusters"))

        demand = [minpoints] * k
        C, M, f = constrained_kmeans(data, demand)
        feedback.pushInfo(self.tr( "Clusters ready"))

        # M is the cluster assignment for the data points
        # Compute cluster sizes
        sorted_M = np.sort(M)
        sizes = [len(list(group)) for key, group in groupby(sorted_M)]
        for index, out_f in enumerate(features):
            attributes = out_f.attributes()
            cluster_id = M[index].item()
            attributes.append(cluster_id + 1)

            sink.addFeature(out_f, QgsFeatureSink.FastInsert)
        return {self.OUTPUT: sink} 

    def name(self):
        return 'constrained_kmeans'

    def displayName(self):
        return self.tr('Constrained K-Means Clustering')
    def shortHelpString(self):
        return self.tr('Constrained K-Means Clustering algorithm PyQGIS implementation')

    def group(self):
        return self.tr(self.groupId())

    def groupId(self):
        return ''

    def tr(self, string):
        return QCoreApplication.translate('Processing', string)

    def createInstance(self):
        return ConstrainedKMeansAlgorithm()

# Code adapted from https://adared.ch/constrained-k-means-implementation-in-python/
def constrained_kmeans(data, demand, maxiter=None, fixedprec=1e9):
  data = np.array(data)
  min_ = np.min(data, axis = 0)
  max_ = np.max(data, axis = 0)
  C = min_ + np.random.random((len(demand), data.shape[1])) * (max_ - min_)
  M = np.array([-1] * len(data), dtype=np.int)
  itercnt = 0
  while True:
    itercnt += 1
    # memberships
    g = nx.DiGraph()
    g.add_nodes_from(range(0, data.shape[0]), demand=-1) # points
    for i in range(0, len(C)):
      g.add_node(len(data) + i, demand=demand[i])
    # Calculating cost...
    cost = np.array([np.linalg.norm(np.tile(data.T, len(C)).T - np.tile(C, len(data)).reshape(len(C) * len(data), C.shape[1]), axis=1)])
    # Preparing data_to_C_edges...
    data_to_C_edges = np.concatenate((np.tile([range(0, data.shape[0])],
    len(C)).T, np.tile(np.array([range(data.shape[0], data.shape[0] +
    C.shape[0])]).T, len(data)).reshape(len(C) * len(data), 1), cost.T *
    fixedprec), axis=1).astype(np.uint64) # Adding to graph

    a = len(data) + len(C)
    g.add_node(a, demand=len(data)-np.sum(demand))
    C_to_a_edges = np.concatenate((np.array([range(len(data), len(data) + len(C))]).T, np.tile([[a]], len(C)).T), axis=1)
    # Calculating min cost flow...
    f = nx.min_cost_flow(g)
    # assign
    M_new = np.ones(len(data), dtype=np.int) * -1
    for i in range(len(data)):
      p = sorted(f[i].items(), key=lambda x: x[1])[-1][0]
      M_new[i] = p - len(data)
    # stop condition
    if np.all(M_new == M):
      # Stop
      return (C, M, f)
    M = M_new
    # compute new centers
    for i in range(len(C)):
      C[i, :] = np.mean(data[M==i, :], axis=0)
    if maxiter is not None and itercnt >= maxiter:
      # Max iterations reached
      return (C, M, f)

Leave a Reply