This post is a continuation of the analysis presented in the last three posts.
In this post I am going to outline how I have clustered the Council Wards in London based on the "violence against person" incidents in 2009. I am clustering the demand profile that these incidents present to the police. The idea is to find similar Wards in London based on the variables I use.
The choice of variables is critical. I hope I have shown in the earlier posts that Wards can be grouped by the number of relevant incidents in the year, split into those that occur between midnight and 4am on Saturday and Sunday mornings and those that occur at any other time. I am therefore using only two variables at this stage.
This means each Ward can be plotted on a two-dimensional graph, with one axis for the incidents in each of those two time periods. This is shown above.
I then load the simple three-column spreadsheet into SPSS17 and perform two different types of clustering calculation. I have outlined in some detail how these calculations work in six posts starting here. I used two different methods, K-means and Ward's hierarchical (do not be confused: Ward is the person who devised the method and has nothing to do with Council Wards; it's just a burden I have to bear when writing about my clustering analysis).
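For readers without SPSS, the K-means step can be sketched in a few lines of Python. This is only a minimal illustration of the general technique (Lloyd's algorithm), not what SPSS17 does internally, which may choose starting centres and stopping rules differently. The function name `kmeans`, the ward names and the incident counts are all made up for the example; the three columns mirror the spreadsheet layout of Ward, x (weekend midnight to 4am incidents) and y (all other incidents).

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic Lloyd's k-means for a list of (x, y) tuples (illustrative only)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = []
        for px, py in points:
            dists = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:
            break  # no point changed cluster, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids


# Hypothetical three-column table: (ward, x, y). Not real data.
wards = [
    ("A", 1, 10), ("B", 2, 12), ("C", 1, 11),
    ("D", 20, 200), ("E", 22, 210), ("F", 21, 205),
]
points = [(x, y) for _, x, y in wards]
labels, centroids = kmeans(points, k=2)

for (name, x, y), label in zip(wards, labels):
    print(name, x, y, "cluster", label)
```

Ward's hierarchical method works differently: it starts with every Ward as its own cluster and repeatedly merges the pair whose merger adds least to the within-cluster variance, so the two methods need not agree on where to draw the boundaries.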
So this is what the map of London looks like when produced in ARCMap, with the accompanying graphs produced in MSExcel.
I have specified six clusters and I have sized the graphs so the x and y axes have approximately the same scale. I have tried to keep the same colours in the maps and the graphs. If you look at the graphs closely you will see that the two different methods split the clusters up in similar but slightly different ways. What becomes obvious is that, because the y axis has many more incidents in most cases, the clustering does not really take any notice of the x axis. The maps are therefore just clusters based on incidents that happen outside the Saturday and Sunday midnight to 4am period. This is not good enough. I have decided that both variables are equally important, so I will have to make adjustments to ensure this is reflected in the clustering.
Mathematically this is quite simple (I hope I have got this correct having made that bold statement). First I calculate the x axis value as a percentage of the y axis value in each Ward, that is (x/y)*100. These percentages are then all added together and divided by the number of Wards to give an average percentage, which in this case is 10.70% (to two decimal places). So to put the x axis variable on the same scale as the y axis, 100%/10.70% gives a figure of 9.34 (to two decimal places). This 9.34 is then used to multiply the x variable incidents for each Ward.
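The rescaling described above can be written out as a short Python sketch. The function name `scale_factor` and the incident counts are invented for illustration, and the sketch assumes every Ward has at least one incident on the y axis, because otherwise x/y is undefined.

```python
def scale_factor(x_counts, y_counts):
    """Return (average of x as a percentage of y, multiplier for x)."""
    # Assumption: no ward has y == 0, otherwise the ratio is undefined.
    percentages = [100.0 * x / y for x, y in zip(x_counts, y_counts)]
    mean_pct = sum(percentages) / len(percentages)
    return mean_pct, 100.0 / mean_pct


# Hypothetical counts for three wards. Not real data.
x_counts = [4, 10, 24]     # weekend midnight to 4am incidents
y_counts = [50, 100, 200]  # incidents at all other times

mean_pct, factor = scale_factor(x_counts, y_counts)
scaled_x = [x * factor for x in x_counts]

print(mean_pct)  # average percentage across wards
print(factor)    # multiplier applied to every x value
print(scaled_x)  # x values on roughly the same scale as y
```

With these made-up numbers the percentages are 8, 10 and 12, so the average is 10% and every x value is multiplied by 10. Working from the unrounded average rather than a displayed value such as 10.70% can shift the last decimal place of the multiplier.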
The resulting three column table is loaded into SPSS17 etc. and the following maps and graphs are produced.
The clustering now takes both variables equally into account, but the two methods split the clusters differently, especially the brown and orange clusters. More in subsequent posts.