## Tuesday, 10 August 2010

### Explaining clustering to non-mathematicians and non-geographers 4

In the previous posts on this topic I gave details of my dataset of 28 variables of police force expenditure, discussed how this can be mapped in 28-dimensional space to give each force a unique location, and then introduced the concept of dividing the forces into hierarchical clusters based on those locations. The process I illustrated in the previous post was the divisive, or top-down, approach, which starts with all the forces in one cluster and then divides into two, then three, then four clusters, and so on, until there are 43 clusters containing one force each. I think this is how most non-mathematicians approach clustering.

An alternative approach, and one which is simpler mathematically, is the bottom-up or agglomerative approach. This starts with 43 separate clusters and merges them, one pair at a time, until only one cluster remains.

The important thing to grasp in this post is that the criterion for clustering is a measurement of the distance between clusters, and this distance can be measured in various ways. The example I am illustrating was calculated using SPSS version 17. I have selected the Euclidean distance measure, which is basically the straight-line (shortest) distance between two points, measured to the centre (or centroid) of each cluster. I used the squared version, which SPSS recommends and which gives longer distances more weight. The process uses the agglomerative approach.
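To make the two ideas in the paragraph above concrete, here is a minimal Python sketch of a squared Euclidean distance and a centroid. The three "forces" and their two variables are made up for illustration (the real dataset has 28 variables), and the function names are my own, not anything from SPSS.

```python
def squared_euclidean(a, b):
    """Sum of the squared differences between two points, variable by variable."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Centre of a group of points: the mean of each variable in turn."""
    n = len(points)
    return [sum(values) / n for values in zip(*points)]


# Hypothetical forces described by only two variables each
force_a = [1.0, 2.0]
force_b = [4.0, 6.0]
force_c = [1.0, 3.0]

print(squared_euclidean(force_a, force_b))  # 3*3 + 4*4 = 25.0
print(squared_euclidean(force_a, force_c))  # 0*0 + 1*1 = 1.0
print(centroid([force_a, force_c]))         # [1.0, 2.5]
```

The straight-line distance from A to B is 5 and from A to C is 1, but once squared the gap widens to 25 against 1, which is what "gives longer distances more weight" means in practice.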

The process starts with 43 clusters, each containing one force. The centroid of each cluster is therefore the location of that force in the 28-dimensional space, which the computer calculates by plotting the values of the 28 variables for each force. The computer is then asked to find the two closest centroids (by squared Euclidean distance), which happen to be Bedfordshire and South Yorkshire (these are the most similar forces as far as expenditure patterns are concerned). It then merges the two clusters (now there are 42 clusters) and calculates the centroid of the new cluster. It then looks for the two closest clusters again. This time the Derbyshire and Kent clusters are merged and the centroid of that new cluster is calculated (now 41 clusters). The two closest clusters are found again. This time the Bedfordshire and South Yorkshire cluster is merged with Durham (40 clusters).
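The merge-and-recalculate loop described above can be sketched in a few lines of Python. This is a simplified illustration of the idea, not how SPSS implements it internally, and the four forces "A" to "D" with two variables each are invented so the arithmetic can be checked by hand.

```python
from itertools import combinations


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return [sum(values) / n for values in zip(*points)]


def agglomerate(data):
    """Merge the two closest clusters repeatedly until one remains.

    data maps a force name to its list of variable values.
    Returns one tuple per stage: (kept, absorbed, distance, clusters left).
    """
    clusters = {name: [name] for name in data}  # cluster label -> member forces
    stages = []
    while len(clusters) > 1:
        # Recalculate every cluster's centroid from its current members
        centres = {label: centroid([data[m] for m in members])
                   for label, members in clusters.items()}
        # Find the pair of clusters whose centroids are closest
        a, b = min(combinations(clusters, 2),
                   key=lambda p: squared_euclidean(centres[p[0]], centres[p[1]]))
        distance = squared_euclidean(centres[a], centres[b])
        clusters[a] = clusters[a] + clusters.pop(b)  # b is absorbed into a
        stages.append((a, b, distance, len(clusters)))
    return stages


# Hypothetical data: four forces, two variables each
toy = {"A": [0.0, 0.0], "B": [0.0, 2.0], "C": [10.0, 10.0], "D": [10.0, 11.0]}
stages = agglomerate(toy)
for stage in stages:
    print(stage)
```

With this toy data C and D merge first (they are closest), then A and B, and finally the two pairs are joined, mirroring the 43, 42, 41, 40 countdown in the real example.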

1 Avon & Somerset, 2 Bedfordshire, 3 Cambridgeshire, 4 Cheshire, 5 City of London, 6 Cleveland, 7 Cumbria, 8 Derbyshire, 9 Devon & Cornwall, 10 Dorset, 11 Durham, 12 Dyfed-Powys, 13 Essex, 14 Gloucestershire, 15 Greater Manchester, 16 Gwent, 17 Hampshire, 18 Hertfordshire, 19 Humberside, 20 Kent, 21 Lancashire, 22 Leicestershire, 23 Lincolnshire, 24 Merseyside, 25 Metropolitan Police, 26 Norfolk, 27 North Wales, 28 North Yorkshire, 29 Northamptonshire, 30 Northumbria, 31 Nottinghamshire, 32 South Wales, 33 South Yorkshire, 34 Staffordshire, 35 Suffolk, 36 Surrey, 37 Sussex, 38 Thames Valley, 39 Warwickshire, 40 West Mercia, 41 West Midlands, 42 West Yorkshire, 43 Wiltshire

This Agglomeration Table is produced by SPSS. It takes a little bit of understanding. I listed the forces in alphabetical order, so the cluster numbers correspond to the forces as shown above (but remember that by stage 3 cluster 2 already has two forces in it, 2 and 33; the last three columns help you keep track of this). The stages refer to the steps in the process, so stage 1 is when 42 clusters are formed from 43. The Coefficient column gives an indication of how good the fit of the clustering is. For instance, it appears that 33 and 32 clusters (stages 10 and 11) are a better fit than 34 clusters (stage 9).
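As a reading aid for the table, the short Python sketch below replays only the first three merges described earlier in this post and shows two things: how a stage number translates into a number of clusters, and how a cluster number such as 2 comes to stand for several forces. It assumes that the merged cluster keeps the number of the first force listed, which is how the merges read in this post; the function and variable names are my own.

```python
N_FORCES = 43


def clusters_after(stage, n=N_FORCES):
    """Each stage merges two clusters into one, so one cluster disappears per stage."""
    return n - stage


# (cluster kept, cluster absorbed) for stages 1 to 3, as described in the post:
# Bedfordshire + South Yorkshire, Derbyshire + Kent, then Durham joins cluster 2
merges = [(2, 33), (8, 20), (2, 11)]

# Start with every force in its own cluster, numbered alphabetically
members = {i: [i] for i in range(1, N_FORCES + 1)}

for stage, (kept, absorbed) in enumerate(merges, start=1):
    members[kept] += members.pop(absorbed)
    print(stage, clusters_after(stage), kept, members[kept])
# 1 42 2 [2, 33]
# 2 41 8 [8, 20]
# 3 40 2 [2, 33, 11]
```

The same subtraction gives 34 clusters after stage 9 and 33 and 32 clusters after stages 10 and 11, matching the figures quoted above.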

I am interested in the Metropolitan Police (25), so it is the other end of the table that is of interest.

Cluster 3 (Cambridgeshire) is merged with cluster 37 (Sussex) at stage 33 (11 clusters). At stage 34 the Metropolitan Police (25) is merged with that cluster (cluster 3). This means that the Metropolitan Police is one of the last forces to be put in a cluster with other forces, but it is by no means the most dissimilar force as far as expenditure is concerned. It probably comes about 7th in the list, behind Cumbria, City of London, Avon and Somerset, Warwickshire, North Wales and Norfolk. Interestingly, 6 clusters is a better fit than 7 clusters.

The maps of clusters 2 to 7 are displayed in the previous post. Even though the actual process is agglomerative, it is easier for non-mathematicians to visualise it as if it were divisive.