How can I predict the number of emergency calls in different parts of the city using geodata analysis?

    Try to solve the problem from the Geohack . 112 online hackathon . Given: the territory of Moscow and the Moscow region was divided into squares of sizes from 500 to 500 meters. The average number of emergency calls per day (numbers 112, 101, 102, 103, 104, 010, 020, 030, 040) is presented as the initial data. The region in question was divided into western and eastern parts. Participants are invited to study in the western part to predict the number of emergency calls for all eastern squares.


    In the zones.csv tablelists all the squares with their coordinates. All squares are located in Moscow, or at a short distance from Moscow. The squares located in the western part of the sample are intended for training the model - for these squares the average number of emergency calls per square per day is
    known : calls_daily : on all days
    calls_workday : on weekdays
    calls_weekend : on weekends
    calls_wd {D} : on weekday D (0 - Monday, 6 - Sunday)

    On the squares from the eastern part of the sample, it is necessary to forecast the number of calls for all days of the week. Assessment of the quality of the prediction will be carried out on a subset of squares, which do not include squares, from where calls come extremely rarely. A subset of the target squares has is_target = 1 in the table. For test squares, the values ​​of calls_ * and is_target are hidden.

    Three types of squares are marked on the map:
    Green - from the training part, non-target
    Red - from the training part, target
    Blue - test, you need to build a forecast on them.


    As a solution, you must provide a CSV table with predictions for all test squares, for each square - by all days of the week.


    Quality is judged only by a subset of the target squares. Participants do not know which of the target squares, however, the principle of choosing the target squares in the training and test parts is identical. During the competition, quality is estimated at 30% of the test target squares (randomly selected), at the end of the competition the results are summed up for the remaining 70% of the squares.

    Prediction Quality Metric - Kendall's Tau-b rank correlation coefficient, is considered as the proportion of pairs of objects with incorrectly ordered predictions adjusted for pairs with the same value of the target variable. The metric evaluates the order in which predictions relate to each other, rather than their exact values. Different days of the week are considered independent sampling elements, i.e. the correlation coefficient is predicted for all test pairs (zone_id, day of the week).

    We read the table of squares using Pandas
    import pandas
    df_zones = pandas.read_csv('data/zones.csv', index_col='zone_id')

    Retrieving Features from OpenStreetMap

    The prediction of the number of calls can be made using machine learning methods. To do this, each square needs to build a vector with a characteristic description. Having trained the model in the marked-up western part of Moscow, it can be used to predict the target variable in the eastern part.

    According to the conditions of the competition, data for solving the problem can be taken from any open sources. When it comes to describing a small area on a map, the first thing that comes to mind is OpenStreetMap, an open, non-profit electronic map created by the community.

    The OSM map is a set of elements of three types:
    Node: points on the
    Way map : roads, areas, are defined by a set of points
    Relation: connections between elements, for example, combining a road from several parts.

    Elements can have a set of tags - key-value pairs. Here is an example of a store that is indicated on the map as an element of type Node:


    Actual data uploading of OpenStreetMap can be taken from the website . Since we are interested in Moscow and its immediate surroundings, we upload the RU-MOS.osm.pbf file - part of OpenStreetMap from the corresponding region in the osm.pbf binary format . There is a simple osmread library for reading such files from Python.

    Before we start working, we consider from OSM all the elements of type Node from the region we need that have tags (the rest are eliminated and will not be used in the future).

    The organizers of the competition prepared a base line, which is available here . All the code below is contained in this baseline.

    Loading objects from OpenStreetMap
    import pickle
    import osmread
    from tqdm import tqdm_notebook
    LAT_MIN, LAT_MAX = 55.309397, 56.13526
    LON_MIN, LON_MAX = 36.770379, 38.19270
    osm_file = osmread.parse_file('osm/RU-MOS.osm.pbf')
    tagged_nodes = [
        for entry in tqdm_notebook(osm_file, total=18976998)
        if isinstance(entry, osmread.Node)
        if len(entry.tags) > 0
        if (LAT_MIN < < LAT_MAX) and (LON_MIN < entry.lon < LON_MAX)
    # Сохраним список с выбранными объектами в отдельный файл
    with open('osm/tagged_nodes.pickle', 'wb') as fout:
        pickle.dump(tagged_nodes, fout, protocol=pickle.HIGHEST_PROTOCOL)
    # Файл с сохраненными объектами можно будет быстро загрузить
    with open('osm/tagged_nodes.pickle', 'rb') as fin:
        tagged_nodes = pickle.load(fin)

    When working in Python, geodata can be quickly visualized on an interactive map using the folium library , which under the hood Leaflet.js is a standard solution for displaying OpenStreetMap.

    Visualization example using folium
    import folium
    fmap = folium.Map([55.753722, 37.620657])
    # нанесем ж/д станции
    for node in tagged_nodes:
        if node.tags.get('railway') == 'station':
            folium.CircleMarker([, node.lon], radius=3).add_to(fmap)
    # выделим квадраты с наибольшим числом вызовов
    calls_thresh = df_zones.calls_daily.quantile(.99)
    for _, row in df_zones.query('calls_daily > @calls_thresh').iterrows():
            bounds=((row.lat_bl, row.lon_bl), (row.lat_tr, row.lon_tr)),
    # карту можно сохранить и посмотреть в браузере'map_demo.html')

    We aggregate the obtained set of points by squares and construct simple signs:

    1. The distance from the center of the square to the Kremlin
    2. The number of points in the radius R from the center of the square (for different values ​​of R)
    a. all points tagged with
    b. train stations
    c. shops
    d. public transport stops
    3. Maximum and average distance from the center of the square to the nearest points of the types listed above

    For a quick search for points in proximity to the center of the square, we will use the kd tree data structure implemented in the SciKit-Learn package in the NearestNeighbors class.

    Building a table with a characteristic description of the squares
    import collections 
    import math
    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    kremlin_lat, kremlin_lon = 55.753722, 37.620657
    def dist_calc(lat1, lon1, lat2, lon2):
        R = 6373.0
        lat1 = math.radians(lat1)
        lon1 = math.radians(lon1)
        lat2 = math.radians(lat2)
        lon2 = math.radians(lon2)
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * \
              math.sin(dlon / 2)**2
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
        return R * c
    df_features = collections.OrderedDict([])
    df_features['distance_to_kremlin'] = df_zones.apply(
        lambda row: dist_calc(row.lat_c, row.lon_c, kremlin_lat, kremlin_lon), axis=1)
    # набор фильтров точек, по которым будет считаться статистика
        ('tagged', lambda 
         node: len(node.tags) > 0),
        ('railway', lambda node: node.tags.get('railway') == 'station'),
        ('shop', lambda node: 'shop' in node.tags),
        ('public_transport', lambda node: 'public_transport' in node.tags),
    # центры квадратов в виде матрицы
    X_zone_centers = df_zones[['lat_c', 'lon_c']].as_matrix()
    for prefix, point_filter in POINT_FEATURE_FILTERS:
        # берем подмножество точек в соответствии с фильтром
        coords = np.array([
            [, node.lon]
            for node in tagged_nodes
            if point_filter(node)
        # строим структуру данных для быстрого поиска точек
        neighbors = NearestNeighbors().fit(coords)
        # признак вида "количество точек в радиусе R от центра квадрата"
        for radius in [0.001, 0.003, 0.005, 0.007, 0.01]:
            dists, inds = neighbors.radius_neighbors(X=X_zone_centers, radius=radius)
            df_features['{}_points_in_{}'.format(prefix, radius)] = \
                np.array([len(x) for x in inds])
        # признак вида "расстояние до ближайших K точек"
        for n_neighbors in [3, 5, 10]:
            dists, inds = neighbors.kneighbors(X=X_zone_centers, n_neighbors=n_neighbors)
            df_features['{}_mean_dist_k_{}'.format(prefix, n_neighbors)] = \
            df_features['{}_max_dist_k_{}'.format(prefix, n_neighbors)] = \
            df_features['{}_std_dist_k_{}'.format(prefix, n_neighbors)] = \
        # признак вида "расстояние до ближайшей точки"
        df_features['{}_min'.format(prefix)] = dists.min(axis=1)
    # Создаем таблицу и сохраняем ее в features.csv
    df_features = pandas.DataFrame(df_features, index=df_zones.index)

    As a result, for each square from the training and test samples, we have a characteristic description that can be used for prediction. A slight modification of the proposed code can take into account objects of other types in the signs. Participants can add their own characteristics, thereby giving the model more information about urban development.

    Predicting the number of challenges

    Experienced participants in data analysis competitions know that to get a high place it is important not only to look at the public leaderboard, but also to do your own validation on the training sample. Let's make a simple partition of the training part of the sample into subsamples for training and validation in the ratio of 70/30.

    Take only the target squares and train the random forest model (RandomForestRegressor) to predict the average number of calls per day.

    Highlighting a validation subsample and training RandomForest
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    df_zones_train = df_zones.query('is_test == 0 & is_target == 1')
    idx_train, idx_valid = train_test_split(df_zones_train.index, test_size=0.3)
    X_train = df_features.loc[idx_train, :]
    y_train = df_zones.loc[idx_train, 'calls_daily']
    model = RandomForestRegressor(n_estimators=100, n_jobs=4), y_train)

    We will evaluate the quality on the validation subsample, for all days of the week we will build the same forecast. The quality metric is the non-parametric correlation coefficient of Kendall tau-b, it is implemented in the SciPy package as a function scipy.stats.kendalltau.

    It turns out Validation score: 0.656881482683
    This is not bad, because the value of the metric 0 means no correlation, and 1 means a complete monotonous correspondence of real values ​​and predicted.

    Validation forecast and quality assessment
    from scipy.stats import kendalltau
    X_valid = df_features.loc[idx_valid, :]
    y_valid = df_zones.loc[idx_valid, 'calls_daily']
    y_pred = model.predict(X_valid)
    target_columns = ['calls_wd{}'.format(d) for d in range(7)]
    df_valid_target = df_zones.loc[idx_valid, target_columns]
    df_valid_predictions = pandas.DataFrame(collections.OrderedDict([
        (column_name, y_pred)
        for column_name in target_columns
    ]), index=idx_valid)
    df_comparison = pandas.DataFrame({
        'target': df_valid_target.unstack(),
        'prediction': df_valid_predictions.unstack(),
    valid_score = kendalltau(df_comparison['target'], df_comparison['prediction']).correlation
    print('Validation score:', valid_score)

    Before entering the ranks of participants who will receive an invitation to the Data Fest, one small step remains: to build a table with predictions on all test squares and send them to the system.

    Building predictions on a test
    idx_test = df_zones.query('is_test == 1').index
    X_test = df_features.loc[idx_test, :]
    y_pred = model.predict(X_test)
    df_test_predictions = pandas.DataFrame(collections.OrderedDict([
        (column_name, y_pred)
        for column_name in target_columns
    ]), index=idx_test)

    Why analyze your location data?

    The main sources of data, not only spatial (geographical), but also temporal, for any telecom company are the complexes of equipment for signal reception and transmission located on the entire territory of the service countries — Base Stations (BS). Spatial analysis of data is technically more difficult, but can bring significant benefits and features that add a significant contribution to the effectiveness of machine learning models.

    At MTS, the use of geographic data helps expand the knowledge base. They can be used to improve the quality of the service provided when analyzing subscribers' complaints about the quality of communication, improving the coverage area of ​​the signal, planning network development and solving other problems where spatio-temporal communication is an important factor.

    Increasing, especially in large cities, population density, rapid development of infrastructure and road network, satellite images and vector maps, the number of buildings and POIs (points of interest) from open maps of the area associated with the provision of services - all these spatial data can be obtained from external sources. Such data can be used as an additional source of information, and also allow you to get an objective picture, regardless of network coverage.

    Do you like the challenge? Then we invite you to take part in the Geohack . 112 geographic data analysis online hackathon . Register and upload your solutions to the siteuntil April 24th. The authors of the top three results will receive cash prizes. Separate nominations are provided for participants who have presented the best public solution to the problem and the best visualization of data journalism. The total prize fund of MTS GeoHack is 500,000 rubles . We hope to see interesting new approaches to generating spatial features, visualizing geodata and using new open sources of information.

    The award ceremony will be held April 28 at the DataFest conference .

    Also popular now: