
Clustering wireless access points using the k-means method
Data visualization and analysis are now widely used in the telecommunications industry, and much of that analysis depends on geospatial data. This is probably because telecommunication networks are themselves geographically dispersed, so analyzing how they are spread out can be very valuable.
Data
To illustrate the k-means clustering algorithm, we will use the geographic dataset of free public WiFi in New York, which is available at NYC Open Data. The k-means algorithm is used here to form WiFi usage clusters based on latitude and longitude.
The latitude and longitude columns are extracted from the dataset in R:
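# 1. Prepare data
# Read the hotspot table downloaded from NYC Open Data
newyork <- read.csv("NYC_Free_Public_WiFi_03292017.csv")
# Keep only the coordinates; data.frame() names the columns newyork.LAT and newyork.LON
newyorkdf <- data.frame(newyork$LAT, newyork$LON)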
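kmeans() stops with an error if its input contains missing values, so it can be worth checking the coordinates before clustering. A minimal sketch on a made-up three-row table (the names `raw` and `coords` are illustrative assumptions; the article does not say whether the real file has missing coordinates):

```r
# Made-up stand-in for the hotspot table, with one missing latitude.
raw <- data.frame(
  LAT = c(40.75573, NA, 40.84832),
  LON = c(-73.94458, -73.94413, -73.82075)
)

# Count missing values per column.
colSums(is.na(raw))

# Keep only rows where both coordinates are present.
coords <- raw[complete.cases(raw), c("LAT", "LON")]
nrow(coords)
```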
Here is a sample of the data:

Determining the number of clusters
Next, we determine the number of clusters using the code below, which shows the result in a graph.
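# 2. Determine number of clusters
# For k = 1 the within-groups sum of squares is (n - 1) times the summed column variances
wss <- (nrow(newyorkdf)-1)*sum(apply(newyorkdf,2,var))
# For k = 2..20, fit k-means and add up the within-cluster sums of squares
for (i in 2:20) wss[i] <- sum(kmeans(newyorkdf, centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")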

The graph shows that the curve flattens out at around 11 clusters: beyond that point, adding more clusters reduces the within-groups sum of squares only slightly. We therefore use 11 clusters in the k-means model.
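Because kmeans() starts from random centers, the elbow curve can vary somewhat between runs. A minimal sketch of the same idea on synthetic data, with a fixed seed and several random starts per value of k through the `nstart` argument (the three made-up point clouds and the names `centers_true`, `pts`, and `wss_demo` are assumptions for illustration, not part of the original dataset):

```r
# Synthetic stand-in for the coordinate table: three well-separated blobs.
set.seed(1)
centers_true <- rbind(c(40.70, -74.00), c(40.78, -73.95), c(40.85, -73.85))
pts <- do.call(rbind, lapply(1:3, function(j) {
  cbind(LAT = rnorm(50, centers_true[j, 1], 0.005),
        LON = rnorm(50, centers_true[j, 2], 0.005))
}))

# Total within-cluster sum of squares for k = 1..8,
# keeping the best of 10 random starts for each k.
wss_demo <- sapply(1:8, function(k) {
  kmeans(pts, centers = k, nstart = 10)$tot.withinss
})

plot(1:8, wss_demo, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

On data like this the curve drops sharply up to the true number of blobs and changes little afterwards, which is the "elbow" one looks for.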
K-means analysis
Next, we run the k-means analysis:
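# 3. K-Means Cluster Analysis
set.seed(20) # fix the random starting centers so the result is reproducible
fit <- kmeans(newyorkdf, 11) # 11 cluster solution
# get cluster means
aggregate(newyorkdf,by=list(fit$cluster),FUN=mean)
# append cluster assignment (the new column is named fit.cluster)
newyorkdf <- data.frame(newyorkdf, fit$cluster)
newyorkdf
# treat the label as a category so ggplot uses a discrete color scale
newyorkdf$fit.cluster <- as.factor(newyorkdf$fit.cluster)
library(ggplot2)
# refer to the column by name inside aes() instead of newyorkdf$fit.cluster
ggplot(newyorkdf, aes(x=newyork.LON, y=newyork.LAT, color=fit.cluster)) + geom_point()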
The newyorkdf dataset contains information about latitude, longitude and cluster label:
> newyorkdf
newyork.LAT newyork.LON fit.cluster
1 40.75573 -73.94458 1
2 40.75533 -73.94413 1
3 40.75575 -73.94517 1
4 40.75575 -73.94517 1
5 40.75575 -73.94517 1
6 40.75575 -73.94517 1
...
80 40.84832 -73.82075 11
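The cluster label of each row is simply the index of the nearest cluster center, so the fitted centers can also be used to label a location that was not in the data. A minimal sketch, assuming a small made-up set of coordinates (`demo`) and a hypothetical helper `nearest_cluster`; neither is part of the original analysis:

```r
# Made-up coordinates: two groups of 30 points each.
set.seed(20)
demo <- data.frame(
  LAT = c(rnorm(30, 40.75, 0.004), rnorm(30, 40.85, 0.004)),
  LON = c(rnorm(30, -73.94, 0.004), rnorm(30, -73.82, 0.004))
)
fit_demo <- kmeans(demo, centers = 2, nstart = 5)

# Hypothetical helper: index of the center with the smallest
# squared Euclidean distance to (lat, lon).
nearest_cluster <- function(fit, lat, lon) {
  d2 <- (fit$centers[, "LAT"] - lat)^2 + (fit$centers[, "LON"] - lon)^2
  unname(which.min(d2))
}

# Label a new location and compare it with the label of the first demo point.
new_label <- nearest_cluster(fit_demo, 40.7557, -73.9446)
new_label == fit_demo$cluster[1]
```

The distance here is computed directly on degrees of latitude and longitude, which matches what kmeans() does with this kind of input.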
Here is a visual illustration:

This illustration is useful, but the visualization becomes more valuable when it is overlaid on a map of New York itself.
# devtools::install_github("zachcp/nycmaps")
library(maps)    # provides map()
library(ggplot2) # provides map_data() and geom_map()
library(nycmaps)
map(database="nyc")
# this should also work with ggplot and ggalt
nyc <- map_data("nyc")
gg <- ggplot() +
  geom_map(
    data=nyc,
    map=nyc,
    aes(x=long, y=lat, map_id=region))
# map colour to the cluster label inside aes() so each cluster gets its own colour
gg +
  geom_point(data = newyorkdf,
             aes(x = newyork.LON, y = newyork.LAT, colour = fit.cluster),
             alpha = .5) + ggtitle("New York Public WiFi")

This type of clustering gives a good picture of how a WiFi network is structured across a city. Here it suggests that the geographic region marked by cluster 1 shows a lot of WiFi traffic, whereas the fewer connections in cluster 6 may indicate low WiFi traffic.
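How many points fall into each cluster can be read directly from the cluster labels. A minimal sketch on made-up labels (the vector `labels_demo` is an assumption standing in for `fit$cluster`, and its values are not taken from the real model):

```r
# Made-up cluster labels standing in for fit$cluster.
labels_demo <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 1, 6)

# Count points per cluster and sort from most to fewest.
sizes <- sort(table(labels_demo), decreasing = TRUE)
sizes

# Share of all points that falls into each cluster.
round(prop.table(sizes), 2)
```

With the fitted model, `table(fit$cluster)` or `fit$size` gives the corresponding counts.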
K-means clustering alone does not tell us why traffic for a particular cluster is high or low. For example, cluster 6 might have a high population density, but low internet speeds could still result in fewer connections.
However, this clustering algorithm provides a good starting point for further analysis and makes it easier to decide what additional information to collect. For example, the map above can serve as a basis for hypotheses about individual geographic clusters. The original article is here.