Clustering wireless access points using the k-means method

Data visualization and analysis are now widely used in the telecommunications industry, and much of that analysis depends on geospatial data. This is likely because telecommunication networks are themselves geographically dispersed, so analyzing how they are distributed can be of considerable value.

Data


To illustrate the k-means clustering algorithm, we will use the geographic dataset of free public WiFi in New York, which is available at NYC Open Data. Specifically, k-means is used to form WiFi usage clusters from the latitude and longitude data.
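As a rough sketch of what the algorithm does with coordinate pairs, the loop below alternates the two steps of the basic (Lloyd) form of k-means: assign every point to its nearest center, then move each center to the mean of its points. The points are synthetic, and the names pts, centers and cluster are made up for this illustration; none of them come from the NYC dataset. R's kmeans() uses the Hartigan-Wong algorithm by default, so this is a simplified picture rather than what kmeans() runs internally.

```r
set.seed(42)
# Synthetic coordinates: two well-separated groups of 20 points (not NYC data)
pts <- cbind(lat = c(rnorm(20, 40.70, 0.005), rnorm(20, 40.80, 0.005)),
             lon = c(rnorm(20, -74.00, 0.005), rnorm(20, -73.90, 0.005)))

# Initial centers: two of the points themselves (rows 1 and 21)
centers <- pts[c(1, 21), , drop = FALSE]

for (iter in 1:10) {
  # Assignment step: squared distance from every point to every center
  d2 <- sapply(seq_len(nrow(centers)),
               function(j) rowSums(sweep(pts, 2, centers[j, ])^2))
  cluster <- max.col(-d2, ties.method = "first")  # index of the nearest center

  # Update step: move each center to the mean of the points assigned to it
  new_centers <- t(sapply(seq_len(nrow(centers)),
                          function(j) colMeans(pts[cluster == j, , drop = FALSE])))

  # Stop once the centers no longer move
  if (all(abs(new_centers - centers) < 1e-12)) break
  centers <- new_centers
}

# Built-in kmeans() started from the same two points, for comparison
fit <- kmeans(pts, centers = pts[c(1, 21), ])
```

On data this cleanly separated, the hand-written loop and kmeans() should arrive at the same grouping; on real data the two can differ, because the result depends on the starting centers and on the algorithm variant.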

Latitude and longitude data are extracted from the data set itself using the programming language R:

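#1. Prepare data
# Read the dataset downloaded from NYC Open Data
newyork <- read.csv("NYC_Free_Public_WiFi_03292017.csv")
# Keep only the coordinates; the resulting columns are named
# newyork.LAT and newyork.LON
newyorkdf <- data.frame(newyork$LAT, newyork$LON)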

Here is a sample of the data:



Determining the number of clusters


Next, we determine the number of clusters with the code below, which plots the within-groups sum of squares for 1 to 20 clusters.

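#2. Determine number of clusters
set.seed(20) # kmeans starts from random centers; fix the seed for reproducibility
# With one cluster, the within-groups sum of squares is the total sum of squares
wss <- (nrow(newyorkdf)-1)*sum(apply(newyorkdf,2,var))
# For 2 to 20 clusters, add up the within-cluster sums of squares of each fit
for (i in 2:20) wss[i] <- sum(kmeans(newyorkdf,
                                     centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")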



The graph shows that the curve levels off at around 11 clusters, so this is the number of clusters used in the k-means model.
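To make the quantity on the vertical axis concrete, here is a minimal sketch that computes the within-groups sum of squares by hand and compares it with what kmeans() reports. It runs on synthetic points rather than the NYC data, and the names pts, manual_wss and wss_k1 are introduced only for this illustration.

```r
set.seed(1)
# Synthetic coordinates: two groups of 30 points (not NYC data)
pts <- data.frame(lat = c(rnorm(30, 40.70, 0.01), rnorm(30, 40.80, 0.01)),
                  lon = c(rnorm(30, -74.00, 0.01), rnorm(30, -73.90, 0.01)))

fit <- kmeans(pts, centers = 2, nstart = 10)

# Look up, for every point, the center of the cluster it was assigned to
assigned_centers <- fit$centers[fit$cluster, ]

# Within-groups sum of squares: squared distances of points to their own center
manual_wss <- sum((as.matrix(pts) - assigned_centers)^2)

# With a single cluster the same quantity is (n - 1) * sum of column variances
wss_k1 <- (nrow(pts) - 1) * sum(apply(pts, 2, var))

all.equal(manual_wss, fit$tot.withinss)  # matches the sum of fit$withinss
all.equal(wss_k1, fit$totss)             # matches the total sum of squares
```

Because adding clusters can only shrink this total, the curve always falls as the number of clusters grows; the point of the plot is to find where further clusters stop reducing it by much.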

K-means analysis


The k-means analysis is then carried out:

#3. K-Means Cluster Analysis
set.seed(20)
fit <- kmeans(newyorkdf, 11) # 11 cluster solution
# get cluster means
aggregate(newyorkdf,by=list(fit$cluster),FUN=mean)
# append cluster assignment
newyorkdf <- data.frame(newyorkdf, fit$cluster)
newyorkdf
newyorkdf$fit.cluster <- as.factor(newyorkdf$fit.cluster)
library(ggplot2)
# refer to the column by name inside aes() rather than through newyorkdf$
ggplot(newyorkdf, aes(x=newyork.LON, y=newyork.LAT, color=fit.cluster)) + geom_point()

The newyorkdf data frame now contains the latitude, longitude and cluster label of each point:
> newyorkdf
newyork.LAT newyork.LON fit.cluster
1 40.75573 -73.94458 1
2 40.75533 -73.94413 1
3 40.75575 -73.94517 1
4 40.75575 -73.94517 1
5 40.75575 -73.94517 1
6 40.75575 -73.94517 1
...
80 40.84832 -73.82075 11

Here is a visual illustration:



This plot is useful, but the visualization becomes even more valuable when it is overlaid on a map of New York itself.

# devtools::install_github("zachcp/nycmaps")
library(maps)     # provides map()
library(nycmaps)
map(database="nyc")
#this should also work with ggplot and ggalt
nyc <- map_data("nyc")
gg  <- ggplot()
gg  <- gg + 
  geom_map(
    data=nyc, 
    map=nyc,
    aes(x=long, y=lat, map_id=region))
# map colour to the cluster label inside aes() so that each cluster
# gets its own colour and a legend entry
gg +
  geom_point(data = newyorkdf, aes(x = newyork.LON, y = newyork.LAT, colour = fit.cluster),
             alpha = .5) + ggtitle("New York Public WiFi")



This type of clustering gives a good picture of how the WiFi network is structured across the city. It suggests that the geographic region marked by cluster 1 sees a lot of WiFi traffic, whereas the smaller number of connections in cluster 6 may indicate low WiFi traffic.
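A comparison like this rests on how many points fall into each cluster. As a small sketch, assuming synthetic points in place of the NYC data and the made-up names pts, sizes and summary_df, the per-cluster counts can be tabulated next to the cluster centers like this:

```r
set.seed(7)
# Synthetic coordinates: one dense group of 40 points and one sparse group of 10
pts <- data.frame(
  lat = c(rnorm(40, 40.75, 0.005), rnorm(10, 40.85, 0.005)),
  lon = c(rnorm(40, -73.98, 0.005), rnorm(10, -73.82, 0.005))
)

fit <- kmeans(pts, centers = 2, nstart = 10)

# Number of points assigned to each cluster (same values as fit$size)
sizes <- table(fit$cluster)

# Mean latitude and longitude per cluster, as in the aggregate() call of step 3
centers <- aggregate(pts, by = list(cluster = fit$cluster), FUN = mean)

# One row per cluster: its label, center coordinates and point count
summary_df <- data.frame(centers, n = as.vector(sizes))
summary_df
```

With the real data, the same table built from fit and newyorkdf would show which of the 11 clusters are densely or sparsely populated.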

K-means clustering alone does not tell us why traffic for a particular cluster is high or low. For example, cluster 6 might have a high population density, yet low internet speeds could still result in fewer connections.

However, this clustering algorithm provides an excellent starting point for further analysis and makes it easier to decide what additional information to collect. Using this map, for example, you can form hypotheses about individual geographic clusters. The original article is here.
