# How to choose the best place to open a branch and visualize the results on maps

Choosing a location for a new branch is a high-stakes decision. Mistakes can be expensive, especially in capital-intensive industries. Most often, such decisions are made by managers relying on expert judgment: knowledge of the city, the industry, and previous experience.

In this article I will show how analytics can help in making such decisions: how to collect data on population and real estate prices, how to build interactive visualizations, and whether the number of clients depends on the distance to the branch, the year the house was built, and the value of the property.

#### City population down to the individual house

**Code to create a map**

```
# Heat map for St. Petersburg
import pandas as pd
import folium
from folium.plugins import HeatMap

# Load the data: population per house, our branches, competitors
df = pd.read_csv('people_spb.csv')
filial = pd.read_csv('filial.csv')
competitor = pd.read_csv('competitors.csv')

# Create the base map
hmap = folium.Map(location=[59.95, 30.15], zoom_start=11)

# Population layer: each heat map point is (lat, lng, weight)
people = folium.FeatureGroup(name='Population of St. Petersburg')
hm = HeatMap(
    list(zip(df.lat.values, df.lng.values, df['People'])),
    min_opacity=.1,
    max_val=df['People'].max(),  # newer folium releases compute this automatically and may warn
    radius=15,
    blur=25,
    max_zoom=1,
)
people.add_child(hm)

# Markers with branch addresses
filial_markers = folium.FeatureGroup(name='Branch addresses')
for index, row in filial.iterrows():
    folium.Marker(
        location=[row['lat'], row['lng']],
        popup=row['Name'],
        icon=folium.Icon(color='blue', icon='cloud'),
    ).add_to(filial_markers)

# Markers with competitor addresses
competitor_markers = folium.FeatureGroup(name='Competitor addresses')
for index, row in competitor.iterrows():
    folium.Marker(
        location=[row['lat'], row['lng']],
        popup=row['Name'],
        icon=folium.Icon(color='red'),
    ).add_to(competitor_markers)

# Add the layers to the map
hmap.add_child(people)
hmap.add_child(filial_markers)
hmap.add_child(competitor_markers)

# Add the layer control
folium.LayerControl(collapsed=False).add_to(hmap)

# Save the resulting map to an HTML file
hmap.save('people_spb.html')
```

To estimate the population of each house, we used data from the Housing and Communal Services Reform portal. The portal provides information on each house: year of construction, living area, and number of residential units. The population estimate for each house was based on the number of apartments and the total living area: on average about 3 people per apartment, with slight differences for some houses and municipal districts.

Above is a heat map of population density in St. Petersburg. Our internal map also contains a separate layer with customer density, which makes it easier to find blank spots: areas with low coverage.
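
The population estimate described in this section can be sketched roughly as follows. The 3 people per apartment comes from the text; the field names, the fallback based on living area, and the 20 m² per person figure are illustrative assumptions, not values from the study.

```python
def estimate_population(apartments, living_area_m2,
                        people_per_apartment=3.0,
                        area_per_person_m2=20.0):
    """Estimate the number of residents in one house.

    people_per_apartment=3.0 follows the average quoted in the text;
    area_per_person_m2 is a hypothetical fallback used only when the
    apartment count is missing.
    """
    if apartments:
        return apartments * people_per_apartment
    if living_area_m2:
        return living_area_m2 / area_per_person_m2
    return 0.0


# Hypothetical houses, for illustration only
houses = [
    {"address": "House A", "apartments": 120, "living_area_m2": 6400.0},
    {"address": "House B", "apartments": None, "living_area_m2": 3000.0},
]
for house in houses:
    house["people"] = estimate_population(house["apartments"],
                                          house["living_area_m2"])
```

Per-district adjustments, which the text mentions, could be added by passing a different `people_per_apartment` for each municipal district.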

#### Customer Addresses

Due to the specifics of the business, our database had addresses for almost all customers. All that remained was to find the geographic coordinates of each address, a process known as geocoding. To get the coordinates, I used the geocoder package for Python. The following problems came up during geocoding:

- Some addresses are entered incorrectly, for example with the wrong building (korpus) number or letter. In that case the geocoder can “place” the client in a kindergarten or an office building. For such cases, I had to write a procedure that moved the coordinates to the nearest residential building within 200 m.
- Some points had an abnormally high number of customers: the city center, the middle of a major street, the middle of a district. Such coordinates resulted from incorrectly filled-in addresses and could distort the overall picture, so they were removed before modeling.
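
The first fix in the list, moving a point to the nearest residential building within 200 m, could look roughly like this. It is a minimal sketch rather than the procedure actually used: the function names and the plain linear search are assumptions, and a spatial index would be preferable for a whole city.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def snap_to_residential(point, residential, max_dist_m=200):
    """Return the nearest residential building within max_dist_m, or None."""
    best, best_dist = None, max_dist_m
    for house in residential:
        d = haversine_m(point[0], point[1], house[0], house[1])
        if d <= best_dist:
            best, best_dist = house, d
    return best


# Hypothetical coordinates (lat, lng), for illustration only
residential = [(59.9500, 30.3000), (59.9600, 30.3200)]
near = snap_to_residential((59.9505, 30.3005), residential)  # about 60 m away
far = snap_to_residential((59.9000, 30.2000), residential)   # several km away
```

Points for which the function returns `None` have no residential building nearby and would need manual review or exclusion.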

As a result, we obtained exact house coordinates for 93% of customers. Now a map like this can be built:

*Random data is plotted on a map for part of St. Petersburg.*

**Code to create a map**

```
import pandas as pd
import folium
from folium.plugins import MarkerCluster

# Load the data: one row per customer with lat/lng columns
df = pd.read_csv('data.csv')

cmap = folium.Map(location=[59.95525, 30.2923], zoom_start=13)

# Cluster nearby markers so the map stays readable when zoomed out
mc = MarkerCluster()
for i, row in df.iterrows():
    mc.add_child(folium.Marker(location=[row.lat, row.lng]))
cmap.add_child(mc)

# Save the resulting map to an HTML file
cmap.save('marker_map.html')
```

Such a map turned out to be a convenient tool for testing hypotheses. For example, the business assumed that certain types of houses (Soviet mass-built series: “ship” houses, the 504 series, Khrushchev-era blocks, etc.) would contain none of our clients. This turned out to be not entirely true. The proportion of customers among residents of such houses is indeed low, but they still need to be taken into account: there are so many of these houses in the city that they provide up to 20% of the client flow.
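
The difference between a low proportion of customers and a large share of the client flow is easy to see in a small calculation. All numbers below are made up for illustration and are not data from the study.

```python
# Hypothetical figures, for illustration only
house_types = {
    "Soviet mass series": {"population": 2_000_000, "clients": 10_000},
    "Other housing": {"population": 3_000_000, "clients": 40_000},
}

total_clients = sum(t["clients"] for t in house_types.values())
for name, t in house_types.items():
    # Penetration: what fraction of residents are clients
    t["penetration"] = t["clients"] / t["population"]
    # Flow share: what fraction of all clients live in this house type
    t["flow_share"] = t["clients"] / total_clients
```

In this toy example the mass series has less than half the penetration of other housing, yet still accounts for a fifth of all clients because so many people live there.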

#### Borders of municipal districts

The population and customer data from the previous sections can be aggregated by municipal district and plotted on a map. With info windows and custom coloring, the result is very informative. There is already an excellent step-by-step article on Habr showing how to build such maps.

#### Property Value

Determining property prices proved to be a difficult task. As a first step, we managed to obtain all real estate sale listings since the beginning of 2018, about 700 thousand records.

For each house, the cost per square meter was calculated as the median over its listings. For the 20% of houses without listings, we estimated the cost per square meter with a model. The main predictor is the price per square meter of the 15 nearest houses, with more weight given to houses with similar characteristics: year of construction, number of residents, project type. The model's average error on the test set was 9.5%, which is quite acceptable for our study, especially considering that even within one house the cost per square meter can vary greatly depending on floor, renovation, area, and other factors.
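
The two steps described above (a per-house median, then a neighbor-based estimate for houses without listings) might be sketched as below. This is not the model from the study: the weighting rule, the `similarity_bonus` value, and the 10-year window are hypothetical, and only the year of construction is used as a similarity feature.

```python
import math
from statistics import median


def median_price_per_house(ads):
    """ads: iterable of (house_id, price_per_m2). Returns {house_id: median}."""
    by_house = {}
    for house_id, price in ads:
        by_house.setdefault(house_id, []).append(price)
    return {house_id: median(prices) for house_id, prices in by_house.items()}


def estimate_price(target, known, k=15, similarity_bonus=2.0):
    """Weighted mean price of the k nearest houses with known prices.

    Neighbors built within 10 years of the target get similarity_bonus
    times more weight (a hypothetical weighting rule).
    """
    cos_lat = math.cos(math.radians(target["lat"]))

    def dist(h):
        # Planar approximation in degrees, used only to rank nearby houses
        return math.hypot(h["lat"] - target["lat"],
                          (h["lng"] - target["lng"]) * cos_lat)

    neighbors = sorted(known, key=dist)[:k]
    weights = [similarity_bonus if abs(h["year"] - target["year"]) <= 10 else 1.0
               for h in neighbors]
    return sum(w * h["price"] for w, h in zip(weights, neighbors)) / sum(weights)


# Hypothetical listings and houses, for illustration only
ads = [("a", 100_000), ("a", 120_000), ("a", 110_000), ("b", 90_000)]
prices = median_price_per_house(ads)

known = [
    {"lat": 59.950, "lng": 30.300, "year": 1975, "price": 100_000},
    {"lat": 59.951, "lng": 30.301, "year": 2005, "price": 130_000},
]
target = {"lat": 59.9505, "lng": 30.3005, "year": 1972}
estimate = estimate_price(target, known)
```

The median makes the per-house price robust to a few unusually cheap or expensive listings, which matters given how much prices vary within one building.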

#### Distance from home to branch

The chart for four branches shows how the proportion of customers in a house depends on the distance to the branch. For some branches there are sharp jumps, which suggests the influence of other factors (age of the house, property price).
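
A curve of this kind can be computed by grouping houses into distance bins and dividing clients by population within each bin. The sketch below assumes a 500 m bin width and an input layout of `(distance_m, population, clients)`, neither of which is specified in the article.

```python
def share_by_distance(houses, bin_m=500):
    """houses: iterable of (distance_m, population, clients).

    Returns {bin_start_m: clients / population} for each distance bin.
    """
    totals = {}
    for distance_m, population, clients in houses:
        start = int(distance_m // bin_m) * bin_m
        pop, cli = totals.get(start, (0, 0))
        totals[start] = (pop + population, cli + clients)
    return {start: cli / pop
            for start, (pop, cli) in sorted(totals.items()) if pop}


# Hypothetical houses around one branch, for illustration only
houses = [(120, 300, 12), (430, 200, 8), (700, 500, 10), (1600, 400, 2)]
shares = share_by_distance(houses)
```

Running this separately for each branch gives one curve per branch, which is what makes branch-specific jumps visible.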

#### Age of the house

The relationship between the year a house was built and the proportion of customers is also interesting.

For further modeling, the age of the house was divided into 5 meaningful categories:

Period | Description |
---|---|
1700-1960 | Old housing stock and Stalin-era buildings |
1960-1990 | The period of mass Soviet development |
1990-2000 | Infill buildings in old quarters, many brick houses |
2000-2010 | The period of economic recovery. A lot of housing is being built in good locations. |
2010-2018 | Mass development in less well-located and remote areas |
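
Mapping a construction year to one of these categories takes only a few lines. Because the periods in the table share their end points, the sketch assumes that each boundary year belongs to the later period.

```python
import bisect

# Boundary years between periods; treating each boundary as the start of
# the next period is an assumption.
BOUNDS = [1960, 1990, 2000, 2010]
LABELS = ["1700-1960", "1960-1990", "1990-2000", "2000-2010", "2010-2018"]


def period_of(year_built):
    """Map a construction year to one of the five period labels."""
    return LABELS[bisect.bisect_right(BOUNDS, year_built)]
```

For example, `period_of(1975)` falls into the mass Soviet development period, and `period_of(2015)` into the most recent one.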

#### Price per sq. m

Price correlates with customer share, but the relationship is weaker than that between customer share and the age of the house. One possible reason is that the age of the house correlates with the age of its residents, and a person's age strongly affects how often they seek medical services.

#### Model description

This analysis later grew into a full-fledged model that takes coordinates as input and returns the number of visits from new customers. The article is already long, so I will describe the model only briefly.

For ease of interpretation, linear regression was chosen as the model. The target variable is the proportion of customers in a house; the factors are the logarithm of the distance to the nearest branch, the cost of housing, and the year the house was built. All three factors proved significant and were included in the model.
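
Written out, a model of this kind has the form below. The symbols are introduced here for illustration: the article does not report the fitted coefficients, nor whether the construction year enters as a number or through the five period categories.

```latex
s_i = \beta_0 + \beta_1 \ln(d_i) + \beta_2\, p_i + \beta_3\, y_i + \varepsilon_i
```

Here $s_i$ is the proportion of customers in house $i$, $d_i$ the distance to the nearest branch, $p_i$ the price per square meter, and $y_i$ the year of construction.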

Substituting new coordinates into the model (that is, changing the distance to the nearest branch for the affected houses) yields a new number of clients for the entire network. Subtracting the previous number of clients from it gives the net effect.

This formulation is convenient because new locations are selected with the positions of the current branches already taken into account. There is no need to model “cannibalization” between branches separately.
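
The net-effect calculation can be illustrated with a toy version of the model. Everything here is an assumption made for the sketch: coordinates are taken to be in meters in a local planar projection, and `toy_share` is a made-up curve that depends only on distance, not the fitted regression.

```python
import math


def predicted_clients(houses, branches, share_fn):
    """Sum over houses of population * share, where the share depends on
    the distance to the nearest branch."""
    total = 0.0
    for x, y, population in houses:
        nearest = min(math.hypot(x - bx, y - by) for bx, by in branches)
        total += population * share_fn(nearest)
    return total


def net_effect(houses, branches, candidate, share_fn):
    """Clients gained across the whole network by adding one branch."""
    before = predicted_clients(houses, branches, share_fn)
    after = predicted_clients(houses, branches + [candidate], share_fn)
    return after - before


def toy_share(distance_m):
    """Hypothetical share curve that declines with the log of distance."""
    return max(0.0, 0.05 - 0.005 * math.log(max(distance_m, 1.0)))


# Hypothetical houses (x, y, population) and one existing branch
houses = [(0.0, 0.0, 1000), (5000.0, 0.0, 1000)]
branches = [(100.0, 0.0)]
gain = net_effect(houses, branches, (4900.0, 0.0), toy_share)
```

Because each house is always assigned to its nearest branch, a candidate placed next to an existing branch changes almost no distances and yields a gain close to zero, which is how the formulation accounts for cannibalization.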

The search for optimal points across the whole city was a simple enumeration of coordinates every 500 m. To estimate the effect of opening several branches, the points were placed sequentially.
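
One way to read “placed sequentially” is a greedy procedure: pick the best grid point, add it to the network, and repeat. The sketch below follows that reading under the same assumptions as a toy model would need: planar coordinates in meters and a made-up share curve in place of the fitted regression.

```python
import math


def total_clients(houses, branches, share_fn):
    """Sum over houses of population * share at the nearest-branch distance."""
    total = 0.0
    for x, y, population in houses:
        nearest = min(math.hypot(x - bx, y - by) for bx, by in branches)
        total += population * share_fn(nearest)
    return total


def grid_points(x_min, x_max, y_min, y_max, step_m=500):
    """Candidate coordinates on a regular grid, every step_m meters."""
    nx = int((x_max - x_min) // step_m) + 1
    ny = int((y_max - y_min) // step_m) + 1
    return [(x_min + i * step_m, y_min + j * step_m)
            for i in range(nx) for j in range(ny)]


def place_sequentially(houses, branches, candidates, share_fn, n_new):
    """Greedy placement: choose the best point, add it, then repeat."""
    branches = list(branches)
    chosen = []
    for _ in range(n_new):
        best = max(candidates,
                   key=lambda c: total_clients(houses, branches + [c], share_fn))
        branches.append(best)
        chosen.append(best)
    return chosen


def toy_share(distance_m):
    """Hypothetical share curve, for illustration only."""
    return 1.0 / (1.0 + distance_m / 1000.0)


# Hypothetical houses (x, y, population) and one existing branch
houses = [(0.0, 0.0, 100), (3000.0, 0.0, 500), (0.0, 3000.0, 300)]
branches = [(0.0, 0.0)]
candidates = grid_points(0, 3000, 0, 3000)
chosen = place_sequentially(houses, branches, candidates, toy_share, n_new=2)
```

Greedy placement is not guaranteed to find the best combination of several branches, but it keeps the search linear in the number of grid points per branch.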

#### Results

We managed to replace the wall map, on which we drew district borders by hand and tried to make sense of the result, with convenient interactive maps. Employees no longer have to manually correct thousands of addresses and match them to municipal districts. We enriched the data and moved from the level of the municipal district down to individual houses.

We identified several very promising and non-obvious locations for new branches and built a model that compares candidate points automatically and impartially.

Interesting results (not presented in this article) were obtained when the business lines were divided into “geo-dependent” and “geo-independent”. The former should be part of new branches, while the latter can be developed within current locations.