Analyzing NHL Player Careers with Survival Regression and Python

    Hello, Habr! Today, we will consider one of the approaches to assessing temporary risk, which is based on the survival curve and regression of the same name, and apply it to the analysis of the career length of NHL players.

    When will this patient relapse? When will our customer leave? Answers to such questions can be found using survival analysis, which can be used in all areas where the time period from the "birth" to the "death" of the object is studied, or similar events: the period from the receipt of equipment to its failure, from the beginning of use company services and before giving up, etc. Most often, these models are used in medicine, where it is necessary to assess the risk of death in a patient, which is the reason for the name of the model, but they are also applicable in the manufacturing, banking and insurance sectors.

    image


    Obviously, among the objects under consideration there will always be those for which “death” (failure, the occurrence of an insured event) has not yet occurred, and not to take them into account is a common mistake among analysts, since in this case they do not include the most relevant information in the sample and can artificially lower the time interval. To solve this problem, a survival analysis was developed that estimates the potential "life expectancy" of observations with the "death" that has not yet occurred.

    NHL Player Career Analysis


    Data and Survival Curve


    We talked about the application of this approach in medicine and economics, but now we will consider a less trivial example - the duration of the career of NHL players. We all know that hockey is a very dynamic and unpredictable game where you can play at the highest level as Gordy Howe (26 seasons) for many years, or you can end your career after a couple of seasons due to an unsuccessful collision. Therefore, we will find out: how many seasons can a hockey player expect to spend in the NHL and what factors affect the duration of a career?

    Let's look at the data already cleared and ready for analysis:

    df.head(3)

    NamePositionPointsBalanceCareer_startCareer_lengthObserved
    Olli JokinenF419-581997181
    Kevyn adamsF72-141997eleven1
    Matt pettingerF99-442000101

    Data was collected on 688 NHL hockey players who have played since 1979 and played at least 20 matches, their position on the field, the number of points scored, the team’s benefit (±), the beginning of a career in the NHL and its duration. The observed column indicates whether the player completed his career, that is, players with a value of 0 are still playing in the league.

    Let's look at the distribution of players by the number of seasons played:

    The distribution is similar to the lognormal with a median of 11 seasons. These statistics take into account only the number of seasons played by current players played before 2017, so the median score is clearly underestimated.

    To get a more accurate value, we will evaluate the survival function using the lifelines library :

    from lifelines import KaplanMeierFitter
    kmf = KaplanMeierFitter()
    kmf.fit(df.career_length, event_observed = df.observed)
    

    Out:

    The library syntax is similar to scikit-learn and its fit / predict: initiating KaplanMeierFitter we then train the model on the data. As an argument, the fit method takes the career_length time interval and the observed vector .

    Finally, when the model is trained, we can build a survival function for NHL players:

    kmf.survival_function_.plot()


    Each point in the graph is the probability that a player will play more than t seasons. As you can see, the threshold in 6 seasons is overcome by 80% of the players, while a very small number of players manage to play more than 17 seasons in the NHL. Now more strictly:

    print(kmf.median_)
    print(kmf.survival_function_.KM_estimate[20])
    print(kmf.survival_function_.KM_estimate[5])

    Out:13.0
    0.0685611305647
    0.888063315225

    Thus, 50% of NHL players will play over 13 seasons, which is 2 seasons more than the initial estimate. The chances of hockey players to play more than 5 and 20 seasons are 88.8% and 6.9%, respectively. Great incentive for young players!

    We can also derive a survival function, along with confidence intervals for probability:

    kmf.plot()


    Similarly, we construct the threat function using the Nelson-Aalen procedure:

    from lifelines import NelsonAalenFitter
    naf = NelsonAalenFitter()
    naf.fit(df.career_length,event_observed=df.observed)
    naf.plot()



    As you can see on the graph, during the first 10 years the risk for an NHL player to end his career at the end of the season is extremely small, but after season 10, this risk sharply increases. In other words, after spending 10 seasons in the NHL, the player will increasingly think about ending his career.

    Comparison of Forwards and Defenders


    Now compare the duration of a career defender and striker of the NHL. Initially, we assume that the attackers, on average, play much longer than the defenders, since they are most often more popular among fans, so they often get multi-year contracts. Also, with age, players become slower, which is the hardest hit for defenders, who are finding it increasingly difficult to deal with brisk forwards.

    ax = plt.subplot(111)
    kmf.fit(df.career_length[df.Position == 'D'], event_observed=df.observed[df.Position == 'D'], 
            label="Defencemen")
    kmf.plot(ax=ax, ci_force_lines=True)
    kmf.fit(df.career_length[df.Position == 'F'], event_observed=df.observed[df.Position == 'F'], 
            label="Forwards")
    kmf.plot(ax=ax, ci_force_lines=True)
    plt.ylim(0,1);


    It seems that the length of the career of the defenders is slightly shorter than that of the attackers. To make sure the conclusion is correct, we will conduct a statistical test based on the chi-square criterion:

    from lifelines.statistics import logrank_test
    dem = (df["Position"] == "F")
    L = df.career_length
    O = df.observed
    results = logrank_test(L[dem], L[~dem], O[dem], O[~dem], alpha=.90 )
    results.print_summary()

    Results
       t 0: -1
       test: logrank
       alpha: 0.9
       null distribution: chi squared
       df: 1
       __ p-value ___|__ test statistic __|____ test result ____|__ is significant __
             0.05006 |              3.840 |      Reject Null    |        True  

    So, at the 10% significance level we reject the null hypothesis: the defenders really have a greater chance of ending their careers earlier than the attackers, and, as can be seen from the graph, the difference increases towards the end of the career, which also confirms the thesis put forward earlier: the old defender has less time game and is becoming less and less in demand.

    Survival regression


    Often, we also have other factors affecting the duration of “life” that we would like to take into account. For this, a survival regression model was developed, which, like the classical linear regression, has a dependent variable and a set of factors.

    Consider one of the popular approaches to assessing the regression parameters - the additive Aalen model, which as the dependent variable chose not the time intervals themselves, but the values ​​of the threat function calculated on their basis$ \ lambda (t) $:

    $ \ lambda (t) = b_ {0} (t) + b_ {1} (t) x_ {1} + ... + b_ {n} (t) x_ {n} $


    Let's move on to the implementation of the model. As factors of the model, we take the number of points, player position, career start date, and overall team utility (±):

    from lifelines import AalenAdditiveFitter
    import patsy # используем библиотеку patsy, чтобы создать design матрицу из факторов 
    # -1 добавляет столбец из единиц к матрице, чтобы обучить константу модели
    X = patsy.dmatrix('Position + Points + career_start + Balance -1', df, return_type='dataframe')
    aaf = AalenAdditiveFitter(coef_penalizer=1.0, fit_intercept=True) # добавляем penalizer, который делает регрессию более стабильной, в случае мультиколлинеарности или малой выборки)
    

    Now that everything is ready, we will train survival regression:

    X['L'] = df['career_length'] 
    X['O'] = df['observed'] # добавляем career_length и observed к матрице факторов
    aaf.fit(X, 'L', event_col='O')

    Let's try to estimate how many more seasons 2 players will play in the NHL: Derek Stepan, a striker who started his career in 2010 and scored 310 points with a utility score of +109 and compare it with the less successful NHL player - Klas Dalbek, who got into the NHL in 2014, scored 11 points with a utility score of -12.

    ix1 = (df['Position'] == 'F') & (df['Points'] == 360) & (df['career_start'] == 2010) & (df['Balance'] == +109)
    ix2 = (df['Position'] == 'D') & (df['Points'] == 11) & (df['career_start'] == 2014) & (df['Balance'] == -12)
    stepan = X.ix[ix1]
    dahlbeck = X.ix[ix2]
    ax = plt.subplot(2,1,1)
    aaf.predict_cumulative_hazard(oshie).plot(ax=ax)
    aaf.predict_cumulative_hazard(jones).plot(ax=ax)
    plt.legend(['Stepan', 'Dahlbeck'])
    plt.title('Функция угрозы')
    ax = plt.subplot(2,1,2)
    aaf.predict_survival_function(oshie).plot(ax=ax);
    aaf.predict_survival_function(jones).plot(ax=ax);
    plt.legend(['Stepan', 'Dahlbeck']);



    As expected, a more successful player has a higher career expectancy. So, Stepan has a probability of playing more than 11 seasons - 80%, while Dalbek has only 55%. The threat curve also shows that starting from season 13, the risk of ending a career in the next season increases sharply, with Dalbek growing faster than Stepan.

    Cross validation


    To more strictly evaluate the quality of regression, we use the cross-validation procedure built into the lifelines library . At the same time, when working with censored data, we cannot use the standard error and similar criteria as a quality metric; therefore, the libraries use the concordance index or the consent index, which is a generalization of the AUC metric. We will carry out a 5-step cross-validation:

    from lifelines.utils import k_fold_cross_validation
    score = k_fold_cross_validation(aaf, X, 'L', event_col='O', k=5)
    print (np.mean(score))
    print (np.std(score))

    Out:0.764216775131
    0.0269169670161
    

    The average accuracy for all iterations is 76.4% with a deviation of 2.7%, which indicates a rather good quality of the algorithm.

    References


    1. Documentation Library lifelines
    2. J. Klein, M. Moeschberger. Survival Analysis. Techniques for censored and truncated data is a book with many examples and datasets that can be accessed through the KMsurv package in R.

    Also popular now: