Spatial Homogeneity Testing of Raingauge Data with Advanced QGIS Expressions

Rainfall is arguably the most frequently measured hydro-meteorological variable. It is a required input for many hydrological applications like runoff computations, flood forecasting as well as engineering design of structures. However, rainfall data in its raw form contain many gaps and inconsistent values. Therefore it is important to do rigorous validation of rain-gauge observation before incorporating them into analysis.

World Bank’s National Hydrology Project (NHP) prescribes a set of primary and secondary validation methods in the Manual of Rainfall Data Validation.
Of particular interest to me are the spatial methods aimed to identify suspect values by comparison with neighboring stations. This spatial homogeneity test requires complex spatial and statistical data processing that can be quite challenging. I got an opportunity to work on a project that required automating the entire process of identifying and testing suspect stations. I ended up implementing it in QGIS using just Expressions and Processing Modeler. The whole solution required no custom code and was easily usable by an analyst in the QGIS environment. In this post, I will explain the details of the test and show you how you can use similar techniques for your own analysis.

This workflow was presented as a live session on QGIS Open Day. You can watch the recording to understand the concepts and implementation.

Overview Presentation

Video Walkthrough

Spatial Correlation of Rainfall Data

Rainfall exhibits some degree of spatial consistency. The spatial correlation of rainfall measurement depends on the following factors

Duration (less correlation at shorter duration)
Distance (less correlation with increasing distance)
Type of precipitation (less correlation for convectional rainfall)
Terrain (less correlation for hilly terrain)

Nearest Neighbor Test

A reasonable assumption is that rainfall data exhibit adequate spatial correlation within a short distance when aggregated over a longer duration. Given this, we can apply a validation test to check if the measured value at station is similar to neighboring station values. This is the basis for the Nearest Neighbor Test. The test is described as follows

An estimate of the interpolated rainfall value at a station is obtained on the basis of weighted average of rainfall observed at the surrounding stations. Whenever the difference between observed and estimated values exceed the expected limiting value, such values are considered as suspect values and they are then flagged for further investigation for possible causes of their discrepancies.
Manual on Rainfall Data Validation, NHP, India

Implementing the Nearest Neighbor Test in QGIS

We will now see how to implement this test in QGIS. The implementation is carried out in the following steps:

Identify neighbor stations
Compute distances to neighbors
Identify suspect stations based on statistical tests
Build a Processing Model to automate the entire process

Get The Dataset

For this post, we will use open precipitation data from Global Historical Climatology Network (GHCN). The example in this post uses Precipitation data for the state of Florida for June 2020.

Identify Neighbor Stations

The test prescribes the following conditions must be met when selecting neighbor stations for the test.

The distance between the test and the neighbouring station must be less than a specified maximum correlation distance, say R_max (kms)
A maximum of 8 neighbouring stations can be considered for interpolation
To reduce the spatial bias in selection, it is appropriate to consider a maximum of only two stations within each quadrant

We have a point layer named ghcn_stations that has the location of rain-gauge stations in Florida. We will use the aggregate() function to find the neighboring stations within a distance of 10km (R_max). I describe the technique for finding neighbors in detail in my another post Find Neighbor Polygons using Summary Aggregate Function in QGIS.

If you just wanted n nearest neighbors for each point, you can use the 'Distance Matrix' algorithm from the Processing Toolbox. Here we use expressions because the condition for neighbor selection is more complex.

The following aggregate expression selects 8 nearest neighbors for each station in the layer

array_to_string(array_slice(array_remove_all(aggregate(
layer:= 'ghcn_stations',
aggregate:='array_agg',
expression:="NAME",
filter:=distance($geometry, geometry(@parent)) < 10000,
order_by:=distance($geometry, geometry(@parent))
), "NAME"), 0, 7))

The above expression works and selects neighboring stations that meet 2 out of 3 criteria. They are all within 10km and we have a maximum of 8 stations. But we are still not guaranteed of having a maximum of 2 stations per quadrant. We can add an additional filter to find stations that are within certain angles matching each quadrant. We can add 4 new fields with 2 stations for each qudrant using the expressions below.

Q1

array_to_string(array_slice(array_remove_all(aggregate(
layer:= 'ghcn_stations',
aggregate:='array_agg',
expression:="NAME",
concatenator:=',',
filter:=distance($geometry, geometry(@parent)) < 10000 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) >=0 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) < 90,
order_by:=distance($geometry, geometry(@parent))), "NAME"), 0, 1))

Q2

array_to_string(array_slice(array_remove_all(aggregate(
layer:= 'ghcn_stations',
aggregate:='array_agg',
expression:="NAME",
concatenator:=',',
filter:=distance($geometry, geometry(@parent)) < 10000 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) >=270 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) < 360,
order_by:=distance($geometry, geometry(@parent))), "NAME"), 0, 1))

Q3

array_to_string(array_slice(array_remove_all(aggregate(
layer:= 'ghcn_stations',
aggregate:='array_agg',
expression:="NAME",
concatenator:=',',
filter:=distance($geometry, geometry(@parent)) < 10000 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) >=180 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) < 270,
order_by:=distance($geometry, geometry(@parent))), "NAME"), 0, 1))

Q4

array_to_string(array_slice(array_remove_all(aggregate(
layer:= 'ghcn_stations',
aggregate:='array_agg',
expression:="NAME",
concatenator:=',',
filter:=distance($geometry, geometry(@parent)) < 10000 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) >=90 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) < 180,
order_by:=distance($geometry, geometry(@parent))), "NAME"), 0, 1))

Now we have 4 fields, each containing upto 2 stations. We can combine them and get our final list of neighbors. We use the array functions along with the coalesce() function that ensures an empty array is created in case any of the quadrants has empty (NULL) values. We will save the result of this operation to a layer named stations_with_neighbors.

neighbors

array_to_string(array_remove_all(array_cat(
coalesce(string_to_array(Q1), array()),
coalesce(string_to_array(Q2), array()),
coalesce(string_to_array(Q3), array()),
coalesce(string_to_array(Q4), array())), "NAME"))

Compute distances to neighbors

We now have a list of neighbors for each station in the layer stations_with_neighbors. To apply the spatial homogeneity test, we should know the distances to each station. We can now use the array_foreach() function to add a new field for every station with the distance to each of its neighbors. I have outlined a similar technique for measure border lengths in detail in my post Calculating Shared Border Lengths Between Polygons.

We have the list of neighbors in the field named "neighbors". We can compute the distance in kilometers using the expression below and save it to a field "distances".. Save the result of this operation to a layer station_with_distances.

distances
 array_to_string(array_foreach(string_to_array("neighbors"),
 round(distance($geometry,
 geometry(get_feature('stations_with_neighbors', 'NAME', @element)))/1000,
 2)))

At this point we have all the data needed to run the statistical analysis for the test.

You may save the table as a CSV and use a statistical software (Excel, R, ..) of your choice to do the spatial homogeneity test. But to automate the entire process, we will do this in QGIS itself.

Identify Suspect Stations

To do the various statistical tests, we will use QGIS expression engine. The expressions get quite complex and we will need to build these expressions dynamically. We will use a function called eval() – which allows you to build those expressions using the expression engine and then evaluate them.

The statistical tests require number of neighbor stations N for each station. We can compute that easily using the following expression.

N

array_length(string_to_array( "neighbors" ))

We will now compute the estimated precipitation P_est using observed values at neighbors and computing the inverse distance weighted mean. Enter the following expression. They key here is the eval() function – which takes a string and executes it. We create a string separated by the + character and pass it to eval() which executes it, giving us the sum.

In newer versions of QGIS, there is a built-in array function array_sum() that will greatly simplify the expressions below.

P_est

eval(array_to_string(array_foreach(string_to_array("neighbors"),
(1/(distance($geometry,
geometry(get_feature('stations_with_distances', 'NAME', @element)))/1000)^2)
*attributes(get_feature('stations_with_distances', 'NAME', @element))['PRCP']
),
'+'))
/
eval(array_to_string(array_foreach(string_to_array("neighbors"),
(1/(distance($geometry,
geometry(get_feature('stations_with_distances', 'NAME', @element)))/1000)^2)),
'+'))

Similarly, we can compute mean and standard deviation of neighbor precipitation.

mean

eval(array_to_string(array_foreach(
 string_to_array( "neighbors" ),
 attributes(get_feature('stations_with_distances', 'NAME', @element))['PRCP']),
 '+')) / "N"

stddev

sqrt(eval(array_to_string(array_foreach(
 string_to_array( "neighbors" ),
 (attributes(get_feature('stations_with_distances', 'NAME', @element))['PRCP'] -
 "mean" )^2),
 '+')) / "N" )

Final variable we need is the absolute difference between observed and interpolated values.

absdiff
 
abs("PRCP" - "P_est")

Now we can run the test and flag suspect stations

suspect

CASE WHEN
  ("absdiff" <= 75 and "absdiff" <= 2* "stddev")  or "N" <= 2 or "absdiff" is NULL
THEN 'No' 
 ELSE 'Yes'
 END

Build a Processing Model

The biggest advantage of implementing the test using QGIS expressions is that we can now automate the full process using the Processing Modeler. This means we can run the test on many different inputs, or on a regular basis without having to run all the intermediate steps.

Here we are using the ‘Field Calculator’ algorithm with the expressions defined earlier. One important change is that instead of hard-coding the field names which contains inputs needed for the analysis, we can let the user specify them as model inputs. For example, we can have a model input named 'Neighbors Field'. This value can be accessed in the model using the variable @NeighborsField. The variable will contain the name of the field, so we can use the function attribute(@NeighborsField) to fetch the actual attribute value. This expression is equivalent to "NAME" field we have used before, but it is more flexible since it can run on input data that may have fields named differently.

Once the model is built, you can save it inside the project and access it from the ‘Project Models’ section of the Processing Toolbox.

With our model, we can run the test easily on any input precipitation data without ever entering any expresssion manually.

If you want to run the model on your system, download the data.gpkg file containing the sample data as well as the processing model. The completed model is saved inside a QGIS project and is available under Project Models folder in the Processing Toolbox once the project is opened.

If you are interested in learning QGIS Automation techniques, please check out our Advanced QGIS course which is freely available for self-study.

5 Comments

Thayer Y says:

November 29, 2020 at 3:31 am

Where do you get your compiled QGIS? I have never seen a log window for Field Calculator in Mac and Windows versions. That would be very handy to have when debugging expressions.

- ujaval says:
  
  November 29, 2020 at 7:35 am
  
  This is standard QGIS but the field calculator is the algorithm from Processing Toolbox →Field Calculator which is my preferred way of using expressions.
  
Neil Falconer says:

November 29, 2020 at 11:46 pm

Thank you for sharing these techniques, it is greatly appreciated!!. I would however like to ask whether the 2 stations per quadrant strictly requires that the axes be based on 0, 90, 180, 270? Essentially it could be possible to rotate the quandrants and one could then possible catch more quadrants with 2 stations per quadrant

- ujaval says:
  
  November 30, 2020 at 11:28 am
  
  Possibly. But you would want to use the same axes for all stations. The idea is that you should select stations from all directions because there maybe bias due to terrain or other factors in a particular direction. This manual goes into much more depth on various methods. http://nhp.mowr.gov.in/Conference-2/Manual_Rainfall_Data_Validation.pdf
  
Zheng Zhipeng says:

December 7, 2020 at 9:20 am

Thanks for the sharing.I have a question:When I use the algorithm from Processing Toolbox →Field Calculator I can get the right result(104 points have no neighbors),but when I click Open Filed Caculator from the Toolbars using the same expressions as you provide,I can not get the whole result(235 points have no neighbors),I know it missing some result.Is this a Bug?I have tested the data with different QGIS version(3.16.1,3.14.16,3.10.12) and get the same result.

Spatial Thoughts

Blog

Spatial Homogeneity Testing of Raingauge Data with Advanced QGIS Expressions

Overview Presentation

Video Walkthrough

Spatial Correlation of Rainfall Data

Nearest Neighbor Test

Implementing the Nearest Neighbor Test in QGIS

Get The Dataset

Identify Neighbor Stations

Compute distances to neighbors

Identify Suspect Stations

Build a Processing Model

5 Comments

Leave a ReplyCancel reply

Overview Presentation

Video Walkthrough

Spatial Correlation of Rainfall Data

Nearest Neighbor Test

Implementing the Nearest Neighbor Test in QGIS

Get The Dataset

Identify Neighbor Stations

Compute distances to neighbors

Identify Suspect Stations

Build a Processing Model

Leave a ReplyCancel reply

Discover more from Spatial Thoughts