ML model to predict 20 or more goals.

Discussion in 'By The Numbers' started by GVM, Dec 27, 2018.

  1. GVM

    GVM Registered User

    Joined:
    Apr 9, 2017
    Messages:
    10
    Likes Received:
    28
    Trophy Points:
    11
    I have minimal domain knowledge of advanced stats. I should perhaps read up.

    With this in mind, I wondered if it was possible for me to predict players who have scored 20 + goals at even strength in 2017-2018 based solely on advanced stats culled from www.hockey-reference.com.

    Search:
    For single season, in 2017-18, in Even Strength situations, playing skater, sorted by descending Games Played.

    Link:

    Player Advanced Stats Finder | Hockey-Reference.com

    Simple binary classification model:
    Based on data from hockey-reference.com, has a player scored 20 + goals?

    I removed all of the derived value columns: CF%, CF% rel, C/60, Crel/60, FF%, FF% rel, PDO.
    I also removed all of the face-off stats: FOW, FOL, FO%

    Data looks like this:
    GP  CF    CA    FF    FA    oiSH%  oiSV%  oZS%  dZS%  TOI/Gm  HIT  BLK  TK  GV  G
    82  1572  1636  1242  1285  9.4    91.4   42.2  57.8  18.9    61   100  33  81  0
    61  458   582   326   444   11.8   90.7   45.5  54.5  9.5     55   21   27  14  0
    82  1135  957   784   716   10.1   91     58.9  41.1  12.6    37   24   30  21  1
    82  1105  1137  773   840   5.5    90.7   55.9  44.1  13.2    32   35   28  49  0
    78  1406  1641  1024  1237  7      90.5   44.9  55.1  19.3    101  125  15  57  0
    82  1377  1302  1032  1004  10.8   90.5   58.1  41.9  15.9    132  21   21  64  1
    G column stands for goals: 0 means less than 20, 1 means 20 or more goals scored.

    Class distribution:
    Number of True cases: 53 (5.96%)
    Number of False cases: 837 (94.04%)
    Data split:
    74.94% in training set
    25.06% in test set

    Verifying predicted value split:
    Original True : 53 (5.96%) Original False : 837 (94.04%)
    Training True : 37 (5.55%) Training False : 630 (94.45%)
    Test True : 16 (7.17%) Test False : 207 (92.83%)
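
    The split above does not keep the same class share on both sides (5.55% true in training, 7.17% in test). A stratified split is one way to even that out. Below is a minimal standard-library sketch, assuming the labels are a plain list of 0/1 values; `stratified_split`, `test_frac` and `seed` are made-up names for illustration, not what produced the numbers in this post.

```python
import random

def stratified_split(labels, test_frac=0.25, seed=42):
    """Return (train_idx, test_idx), keeping the class share similar in both."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Shuffle the rows of one class, then carve off its share for the test set.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Same class counts as the post: 53 true, 837 false.
labels = [1] * 53 + [0] * 837
train_idx, test_idx = stratified_split(labels)

train_true = sum(labels[i] for i in train_idx)
test_true = sum(labels[i] for i in test_idx)
print(f"train: {train_true}/{len(train_idx)} true")
print(f"test:  {test_true}/{len(test_idx)} true")
```

    With these counts, both sides end up close to the original 5.96% share of true cases.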

    I tried a bunch of classification algorithms, and they all produced pretty much the same results.

    Output from the logistic regression algorithm on the training set with predictions made on the test set:

    Accuracy: 0.9596

    Confusion matrix:
    n = 223        predicted < 20   predicted >= 20
    actual < 20    207              0
    actual >= 20   9                7
    Meaning that out of 223 rows, we had a total of 9 bad predictions, all of them 20-goal scorers that the model missed.
    The model, without any tweaking, is really good at predicting fewer than 20, but needs work on predicting 20 or more: it caught only 7 of the 16.
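
    With only about 6% true cases, accuracy alone is flattering: always answering "less than 20" would already get 207 of 223 right. A small sketch that pulls precision, recall and that baseline out of the four cells of the confusion matrix above; `summarize` is a helper name invented for illustration.

```python
def summarize(tn, fp, fn, tp):
    """Basic metrics from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # Of the rows predicted positive, how many really were positive.
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        # Of the rows that really were positive, how many were caught.
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Accuracy of always guessing the larger class.
        "baseline_accuracy": max(tn + fp, fn + tp) / total,
    }

# Cells from the test-set confusion matrix in this post.
metrics = summarize(tn=207, fp=0, fn=9, tp=7)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

    Recall on the 20-or-more class (7 of 16) is the number to watch when tweaking the model, since accuracy barely moves above the baseline.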

    If I remove the number of games played from the features, I believe that my >= 20 predictions would improve.
    Some players have 20 or more goals with fewer than 70 games played.

    Suspect I would've achieved similar results had I used only derived value columns.

    Anyway, just a possible scenario on how a NHL team, or a fantasy league pooler could make use of advanced stats.

    Cheers
     
    Last edited: Dec 30, 2018
  2. Mike Farkas

    Mike Farkas Grace Personified

    Joined:
    Jun 28, 2006
    Messages:
    10,051
    Likes Received:
    3,014
    Trophy Points:
    186
    Location:
    PA
    I don't understand this...but I want to.
     
    GVM likes this.
  3. Mud the ACAS

    Mud the ACAS St. Louis Blues: 2019 Stanley Cup Champions

    Joined:
    Dec 15, 2002
    Messages:
    24,516
    Likes Received:
    3,220
    Trophy Points:
    265
    Gender:
    Male
    Occupation:
    Actuary
    Location:
    Celebrating a Stanley Cup title
    You're not really "predicting" anything, though. What you're really doing is saying, "these are the attributes that appear to distinguish 20-goal scorers from non-20-goal scorers" but it doesn't predict who will be a 20-goal scorer. Or, you're simply validating that players with 20+ goals in a season tend to have a certain set of characteristics with respect to advanced statistics. And even with that, is there any "a-ha" moment in the analysis that says something that no one really expected a priori?

    To make this a prediction you'd want to take information from prior years and say, "given past information, this is what it tells us about the future." I mean, I think what you've done could be interesting, but it's not predictive.
     
    Hockey Outsider likes this.
  4. Doctor No

    Doctor No Registered User

    Joined:
    Oct 26, 2005
    Messages:
    8,098
    Likes Received:
    1,636
    Trophy Points:
    149
    Agreed - although it's interesting and it's a great start.

    What I would do is modify your data set above; currently you have (data from year N) and (20 goals in year N?). What would make your model (perhaps) predictive would be to take (data from year N) and (20 goals in year N+1)?
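
    In code, that change is mostly a matter of how rows are paired up. A rough sketch, assuming each row is a dict with `player`, `season` (start year), `features` and `goals` keys; those field names and the sample rows are invented for illustration and are not the hockey-reference.com export format.

```python
def pair_with_next_season(rows, threshold=20):
    """Pair year-N features with a label taken from year N+1 goals."""
    goals_by_key = {(r["player"], r["season"]): r["goals"] for r in rows}
    X, y = [], []
    for r in rows:
        next_goals = goals_by_key.get((r["player"], r["season"] + 1))
        if next_goals is None:
            continue  # no following season on record for this player
        X.append(r["features"])
        y.append(1 if next_goals >= threshold else 0)
    return X, y

# Made-up rows, purely to show the shape of the data.
rows = [
    {"player": "A", "season": 2016, "features": [1100, 1000], "goals": 12},
    {"player": "A", "season": 2017, "features": [1300, 1050], "goals": 24},
    {"player": "B", "season": 2016, "features": [700, 900], "goals": 5},
    {"player": "B", "season": 2017, "features": [650, 880], "goals": 3},
]
X, y = pair_with_next_season(rows)
print(X, y)
```

    Player A's 2016 features get the label 1 because of the 24 goals in 2017; the 2017 rows are dropped because there is no following season to label them with.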
     
    Hockey Outsider and GVM like this.
  5. GVM

    GVM Registered User

    This is a simple machine learning 101 binary classification (true or false) model applied to advanced stats.
    Given certain metrics, can the model predict 1 or 0?

    At a very high level...

    Phase 2 of my process would look to make better predictions for the true class.
    For this I could try:
    Removing players with 0 goals from the data.
    Removing or adding columns from the data.
    Normalizing the data.
    Messing around with the weight of features (columns).
    Having better domain knowledge of advanced stats :)
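
    On the normalizing item in the list above: columns such as CF (hundreds to thousands) and TOI/Gm (under 20) sit on very different scales, which can matter for regularized logistic regression or distance-based classifiers. A minimal z-score sketch using only the standard library; `zscore_columns` is a made-up helper name, and the three rows reuse CF and TOI/Gm values from the sample data in the first post.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    # A constant column has stdev 0; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# [CF, TOI/Gm] for three of the sample rows.
rows = [[1572, 18.9], [458, 9.5], [1135, 12.6]]
scaled = zscore_columns(rows)
for row in scaled:
    print([round(v, 2) for v in row])
```

    In a real run the means and standard deviations would come from the training set only and then be reused on the test set.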

    Once satisfied with my true prediction rate, I could repeat the process with other seasons.
    I could calculate the historical mean of each metric for 20-goal scorers, for instance.

    If you want a 20 goal guy on your team, can you develop a player to this historical mean?

    We could then move on to a regression (quantitative) model.
    Prorate the metrics given a number of games played, compare it to the historical mean, and predict 20 + or not.
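
    The pro-rating step could look something like this. It is only a sketch: it scales a counting stat to a full 82-game schedule and checks it against the goal threshold, and it leaves out the comparison with a historical mean. `prorate` and `on_pace_for` are hypothetical helper names.

```python
def prorate(value, games_played, full_season=82):
    """Scale a counting stat to a full season at the same per-game rate."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return value * full_season / games_played

def on_pace_for(goals, games_played, threshold=20):
    """True if the pro-rated goal total reaches the threshold."""
    return prorate(goals, games_played) >= threshold

print(round(prorate(15, 61), 1))   # 20.2
print(on_pace_for(15, 61))         # True
print(on_pace_for(15, 70))         # False
```

    A player with 15 goals in 61 games is on a 20-goal pace; the same 15 goals over 70 games is not.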

    We often hear or read about models these days, but no explanation is ever given. This is a sample of what a model could look like. In a nutshell, this is how weather is forecast, how cars drive themselves, and how Vegas sets its odds.

    Cheers
     
  6. morehockeystats

    morehockeystats Unusual hockey stats

    Joined:
    Dec 13, 2016
    Messages:
    391
    Likes Received:
    87
    Trophy Points:
    46
    Occupation:
    sysadmin
    Location:
    San Jose
    Train your model on data up to 16/17 included.
    Make it predict the 17/18 goal scorers.
    Post the results of the prediction here.
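
    A sketch of that kind of split, assuming each row carries a `season` field holding the start year (so 2016 means 16/17); the field names and sample rows are made up for illustration.

```python
def split_by_season(rows, last_train_season):
    """Train on seasons up to and including last_train_season, test on later ones."""
    train = [r for r in rows if r["season"] <= last_train_season]
    test = [r for r in rows if r["season"] > last_train_season]
    return train, test

# Made-up rows; season is the start year (2017 means 17/18).
rows = [
    {"player": "A", "season": 2015, "goals": 18},
    {"player": "A", "season": 2016, "goals": 21},
    {"player": "A", "season": 2017, "goals": 25},
    {"player": "B", "season": 2016, "goals": 4},
    {"player": "B", "season": 2017, "goals": 7},
]
train, test = split_by_season(rows, last_train_season=2016)
print(len(train), "training rows,", len(test), "test rows")
```

    Unlike a random split, nothing from 17/18 leaks into training, so the test result reads as a forecast.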
     
    Hockey Outsider likes this.
  7. GVM

    GVM Registered User

    Lowering the target threshold would increase the share of true cases, making a positive less of a rare event.
    I would expect a better rate for positive predictions. I'll also remove the players with 0 goals from the data.
    Will post my findings later today. Leaving shortly for a family ski day!
     
  8. GVM

    GVM Registered User

    Day late...
    I removed players with 0 goals from the dataset and lowered the target threshold to 17 or more goals.
    Same algorithm, no adding or removing of features.

    We now have:
    Original True : 85 (11.81%) Original False : 635 (88.19%)
    Training True : 65 (12.04%) Training False : 475 (87.96%)
    Test True : 20 (11.11%) Test False : 160 (88.89%)

    Accuracy of test: 93%
    Confusion matrix:
    n = 180        predicted < 17   predicted >= 17
    actual < 17    153              7
    actual >= 17   6                14

    Total of 13 incorrect predictions.

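    Of the 20 players who actually scored 17 or more, how many did the model catch: 14, or 70% (recall).
    When a positive value is predicted, how often was it right: 14 of 21, or about 67% (precision).

    As I initially suspected, increasing the share of true cases in the data improved the positive prediction rate: 14 of 20 caught now versus 7 of 16 before, although overall accuracy slipped from 96% to 93%.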

    I also noticed a nearly 1 to 1 correlation between the Corsi and Fenwick stats, so I could drop one or the other from the model and get similar results.
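
    A quick way to check that kind of redundancy is a Pearson correlation between the two columns. The sketch below uses only the CF and FF values from the six sample rows in the first post, so it illustrates the check rather than reproducing the result on the full dataset; `pearson` is a hand-rolled helper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# CF and FF from the six sample rows in the first post.
cf = [1572, 458, 1135, 1105, 1406, 1377]
ff = [1242, 326, 784, 773, 1024, 1032]
r = pearson(cf, ff)
print(f"CF vs FF: r = {r:.3f}")
```

    Two features this closely correlated carry almost the same information, which fits with dropping one of them and seeing similar results.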
     
  9. supsens

    supsens Registered User

    Joined:
    Oct 6, 2013
    Messages:
    2,738
    Likes Received:
    402
    Trophy Points:
    94
    Is there any real difference here or is this like using shots then shooting% and 'predicting' the exact amount of goals every player got? It looks like a lot of work but without knowing your model it seems... pointless?
     
  10. glucker

    glucker Registered User

    Joined:
    Aug 22, 2008
    Messages:
    7,791
    Likes Received:
    1,280
    Trophy Points:
    139
    Location:
    London, ON
    I think you misunderstood what he was saying... he meant using the data from the seasons up to 2016/2017, and using it to "predict" 2017/2018 results.
     
    morehockeystats likes this.
  11. GVM

    GVM Registered User

    Ah man.. I did misread. Beyond the scope of the original post.. However, I will work on this and post my findings.
     
