Rainfall is arguably the most frequently measured hydro-meteorological variable. It is a required input for many hydrological applications like runoff computations, flood forecasting as well as engineering design of structures. However, rainfall data in its raw form contain many gaps and inconsistent values. Therefore it is important to do rigorous validation of rain-gauge observation before incorporating them into analysis.

World Bank’s National Hydrology Project (NHP) prescribes a set of primary and secondary validation methods in the Manual of Rainfall Data Validation.

Of particular interest to me are the spatial methods aimed to identify suspect values by comparison with neighboring stations. This spatial homogeneity test requires complex spatial and statistical data processing that can be quite challenging. I got an opportunity to work on a project that required automating the entire process of identifying and testing suspect stations. I ended up implementing it in QGIS using just Expressions and Processing Modeler. The whole solution required no custom code and was easily usable by an analyst in the QGIS environment. In this post, I will explain the details of the test and show you how you can use similar techniques for your own analysis.

This workflow was presented as a live session on QGIS Open Day. You can watch the recording to understand the concepts and implementation.

### Overview Presentation

### Video Walkthrough

## Spatial Correlation of Rainfall Data

Rainfall exhibits some degree of spatial consistency. The spatial correlation of rainfall measurement depends on the following factors

- Duration (
*less correlation at shorter duration*) - Distance (
*less correlation with increasing distance*) - Type of precipitation (
*less correlation for convectional rainfall*) - Terrain (
*less correlation for hilly terrain*)

## Nearest Neighbor Test

**A reasonable assumption is that rainfall data exhibit adequate spatial correlation within a short distance when aggregated over a longer duration.** Given this, we can apply a validation test to check if the measured value at station is similar to neighboring station values. This is the basis for the *Nearest Neighbor Test*. The test is described as follows

Manual on Rainfall Data Validation, NHP, India

An estimate of the interpolated rainfall value at a station is obtained on the basis of weighted average of rainfall observed at the surrounding stations. Whenever the difference between observed and estimated values exceed the expected limiting value, such values are considered as suspect values and they are then flagged for further investigation for possible causes of their discrepancies.

### Implementing the Nearest Neighbor Test in QGIS

We will now see how to implement this test in QGIS. The implementation is carried out in the following steps:

- Identify neighbor stations
- Compute distances to neighbors
- Identify suspect stations based on statistical tests
- Build a Processing Model to automate the entire process

### Get The Dataset

For this post, we will use open precipitation data from Global Historical Climatology Network (GHCN). The example in this post uses Precipitation data for the state of Florida for June 2020.

You can download the data.gpkg file containing the sample data as well as the processing model.

### Identify Neighbor Stations

The test prescribes the following conditions must be met when selecting neighbor stations for the test.

- The distance between the test and the neighbouring station must be less than a specified maximum correlation distance, say R
_{max}(kms) - A maximum of 8 neighbouring stations can be considered for interpolation
- To reduce the spatial bias in selection, it is appropriate to consider a maximum of only two stations within each quadrant

We have a point layer named `ghcn_stations`

that has the location of rain-gauge stations in Florida. We will use the `aggregate()`

function to find the neighboring stations within a distance of 10km (R_{max}). I describe the technique for finding neighbors in detail in my another post Find Neighbor Polygons using Summary Aggregate Function in QGIS.

If you just wanted n nearest neighbors for each point, you can use the

`'Distance Matrix'`

algorithm from the Processing Toolbox. Here we use expressions because the condition for neighbor selection is more complex.

The following aggregate expression selects 8 nearest neighbors for each station in the layer

```
array_to_string(array_slice(array_remove_all(aggregate(
layer:= 'ghcn_stations',
aggregate:='array_agg',
expression:="NAME",
filter:=distance($geometry, geometry(@parent)) < 10000,
order_by:=distance($geometry, geometry(@parent))
), "NAME"), 0, 7))
```

The above expression works and selects neighboring stations that meet 2 out of 3 criteria. They are all within 10km and we have a maximum of 8 stations. But we are still not guaranteed of having a maximum of 2 stations per quadrant. We can add an additional filter to find stations that are within certain angles matching each quadrant. We can add 4 new fields with 2 stations for each qudrant using the expressions below.

**Q1**
array_to_string(array_slice(array_remove_all(aggregate(
layer:= 'ghcn_stations',
aggregate:='array_agg',
expression:="NAME",
concatenator:=',',
filter:=distance($geometry, geometry(@parent)) < 10000 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) >=0 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) < 90,
order_by:=distance($geometry, geometry(@parent))), "NAME"), 0, 1))
**Q2**
array_to_string(array_slice(array_remove_all(aggregate(
layer:= 'ghcn_stations',
aggregate:='array_agg',
expression:="NAME",
concatenator:=',',
filter:=distance($geometry, geometry(@parent)) < 10000 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) >=270 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) < 360,
order_by:=distance($geometry, geometry(@parent))), "NAME"), 0, 1))
**Q3**
array_to_string(array_slice(array_remove_all(aggregate(
layer:= 'ghcn_stations',
aggregate:='array_agg',
expression:="NAME",
concatenator:=',',
filter:=distance($geometry, geometry(@parent)) < 10000 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) >=180 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) < 270,
order_by:=distance($geometry, geometry(@parent))), "NAME"), 0, 1))
**Q4**
array_to_string(array_slice(array_remove_all(aggregate(
layer:= 'ghcn_stations',
aggregate:='array_agg',
expression:="NAME",
concatenator:=',',
filter:=distance($geometry, geometry(@parent)) < 10000 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) >=90 and
angle_at_vertex(make_line(geometry(@parent), $geometry), 0) < 180,
order_by:=distance($geometry, geometry(@parent))), "NAME"), 0, 1))

Now we have 4 fields, each containing upto 2 stations. We can combine them and get our final list of neighbors. We use the array functions along with the `coalesce() `

function that ensures an empty array is created in case any of the quadrants has empty (NULL) values. We will save the result of this operation to a layer named `stations_with_neighbors`

.

**neighbors**
array_to_string(array_remove_all(array_cat(
coalesce(string_to_array(Q1), array()),
coalesce(string_to_array(Q2), array()),
coalesce(string_to_array(Q3), array()),
coalesce(string_to_array(Q4), array())), "NAME"))

### Compute distances to neighbors

We now have a list of neighbors for each station in the layer `stations_with_neighbors`

. To apply the spatial homogeneity test, we should know the distances to each station. We can now use the `array_foreach()`

function to add a new field for every station with the distance to each of its neighbors. I have outlined a similar technique for measure border lengths in detail in my post Calculating Shared Border Lengths Between Polygons.

We have the list of neighbors in the field named `"neighbors"`

. We can compute the distance in kilometers using the expression below and save it to a field `"distances"`

.. Save the result of this operation to a layer `station_with_distances`

.

**distances**
array_to_string(array_foreach(string_to_array("neighbors"),
round(distance($geometry,
geometry(get_feature('stations_with_neighbors', 'NAME', @element)))/1000,
2)))

At this point we have all the data needed to run the statistical analysis for the test.

You may save the table as a CSV and use a statistical software (Excel, R, ..) of your choice to do the spatial homogeneity test. But to automate the entire process, we will do this in QGIS itself.

### Identify Suspect Stations

To do the various statistical tests, we will use QGIS expression engine. The expressions get quite complex and we will need to build these expressions dynamically. We will use a function called `eval()`

– which allows you to build those expressions using the expression engine and then evaluate them.

The statistical tests require number of neighbor stations **N** for each station. We can compute that easily using the following expression.

**N**
array_length(string_to_array( "neighbors" ))

We will now compute the estimated precipitation **P _{est}** using observed values at neighbors and computing the inverse distance weighted mean. Enter the following expression. They key here is the eval() function – which takes a string and executes it. We create a string separated by the + character and pass it to eval() which executes it, giving us the sum.

**P**_{est}
eval(array_to_string(array_foreach(string_to_array("neighbors"),
(1/(distance($geometry,
geometry(get_feature('stations_with_distances', 'NAME', @element)))/1000)^2)
*attributes(get_feature('stations_with_distances', 'NAME', @element))['PRCP']
),
'+'))
/
eval(array_to_string(array_foreach(string_to_array("neighbors"),
(1/(distance($geometry,
geometry(get_feature('stations_with_distances', 'NAME', @element)))/1000)^2)),
'+'))

Similarly, we can compute `mean`

and `standard deviation`

of neighbor precipitation.

**mean**
eval(array_to_string(array_foreach(
string_to_array( "neighbors" ),
attributes(get_feature('stations_with_distances', 'NAME', @element))['PRCP']),
'+')) / "N"

**stddev**
sqrt(eval(array_to_string(array_foreach(
string_to_array( "neighbors" ),
(attributes(get_feature('stations_with_distances', 'NAME', @element))['PRCP'] -
"mean" )^2),
'+')) / "N" )

Final variable we need is the `absolute difference`

between observed and interpolated values.

**absdiff**
abs("PRCP" - "P_est")

Now we can run the test and flag suspect stations

**suspect**
CASE WHEN
("absdiff" <= 75 and "absdiff" <= 2* "stddev") or "N" <= 2 or "absdiff" is NULL
THEN 'No'
ELSE 'Yes'
END

### Build a Processing Model

The biggest advantage of implementing the test using QGIS expressions is that we can now automate the full process using the Processing Modeler. This means we can run the test on many different inputs, or on a regular basis without having to run all the intermediate steps.

Here we are using the ‘Field Calculator’ algorithm with the expressions defined earlier. One important change is that instead of hard-coding the field names which contains inputs needed for the analysis, we can let the user specify them as model inputs. For example, we can have a model input named `'Neighbors Field'.`

This value can be accessed in the model using the variable `@NeighborsField`

. The variable will contain the name of the field, so we can use the function `attribute(@NeighborsField)`

to fetch the actual attribute value. This expression is equivalent to `"NAME"`

field we have used before, but it is more flexible since it can run on input data that may have fields named differently.

Once the model is built, you can save it inside the project and access it from the ‘Project Models’ section of the Processing Toolbox.

With our model, we can run the test easily on any input precipitation data without ever entering any expresssion manually.

Where do you get your compiled QGIS? I have never seen a log window for Field Calculator in Mac and Windows versions. That would be very handy to have when debugging expressions.

This is standard QGIS but the field calculator is the algorithm from Processing Toolbox →Field Calculator which is my preferred way of using expressions.

Thank you for sharing these techniques, it is greatly appreciated!!. I would however like to ask whether the 2 stations per quadrant strictly requires that the axes be based on 0, 90, 180, 270? Essentially it could be possible to rotate the quandrants and one could then possible catch more quadrants with 2 stations per quadrant

Possibly. But you would want to use the same axes for all stations. The idea is that you should select stations from all directions because there maybe bias due to terrain or other factors in a particular direction. This manual goes into much more depth on various methods. http://nhp.mowr.gov.in/Conference-2/Manual_Rainfall_Data_Validation.pdf

Thanks for the sharing.I have a question:When I use the algorithm from Processing Toolbox →Field Calculator I can get the right result(104 points have no neighbors),but when I click Open Filed Caculator from the Toolbars using the same expressions as you provide,I can not get the whole result(235 points have no neighbors),I know it missing some result.Is this a Bug?I have tested the data with different QGIS version(3.16.1,3.14.16,3.10.12) and get the same result.