ML model to predict 20 or more goals.

GVM

Registered User
Apr 9, 2017
10
28
I have minimal domain knowledge of advanced stats. I should perhaps read up.

With this in mind, I wondered if it was possible for me to predict players who have scored 20 + goals at even strength in 2017-2018 based solely on advanced stats culled from www.hockey-reference.com.

Search:
For single season, in 2017-18, in Even Strength situations, playing skater, sorted by descending Games Played.

Link:

Player Advanced Stats Finder | Hockey-Reference.com

Simple binary classification model:
Based on data from hockey-reference.com, has a player scored 20 + goals?

I removed all of the derived value columns: CF%, CF% rel, C/60, Crel/60, FF%, FF% rel, PDO.
I also removed all of the face-off stats: FOW, FOL, FO%

Data looks like this:
GP    CF    CA    FF    FA  oiSH%  oiSV%  oZS%  dZS%  TOI/Gm  HIT  BLK  TK  GV  G
82  1572  1636  1242  1285    9.4   91.4  42.2  57.8    18.9   61  100  33  81  0
61   458   582   326   444   11.8   90.7  45.5  54.5     9.5   55   21  27  14  0
82  1135   957   784   716   10.1     91  58.9  41.1    12.6   37   24  30  21  1
82  1105  1137   773   840    5.5   90.7  55.9  44.1    13.2   32   35  28  49  0
78  1406  1641  1024  1237      7   90.5  44.9  55.1    19.3  101  125  15  57  0
82  1377  1302  1032  1004   10.8   90.5  58.1  41.9    15.9  132   21  21  64  1
G column stands for goals: 0 means less than 20, 1 means 20 or more goals scored.

Class distribution:
Number of True cases: 53 (5.96%)
Number of False cases: 837 (94.04%)
Data split:
74.94% in training set
25.06% in test set

Verifying predicted value split:
Original True : 53 (5.96%) Original False : 837 (94.04%)
Training True : 37 (5.55%) Training False : 630 (94.45%)
Test True : 16 (7.17%) Test False : 207 (92.83%)
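
For anyone who wants to reproduce a split like this, here is a minimal sketch in Python with scikit-learn. The post does not say which library or settings were used, so `train_test_split`, the 25% test size, the stratification, and the random seeds are all assumptions. The labels are a synthetic stand-in with the same 53/837 class counts, and the features are random placeholders, not the actual player data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels: 53 positives (20+ goals) and 837 negatives,
# matching the class counts reported above. This is NOT the real data.
y = np.array([1] * 53 + [0] * 837)

# Placeholder feature matrix with 14 columns (the table above minus G).
rng = np.random.default_rng(0)
X = rng.normal(size=(len(y), 14))

# stratify=y keeps the share of positives roughly equal in both sets,
# which matters when only about 6% of the rows are positive.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

for name, labels in [("Training", y_train), ("Test", y_test)]:
    print(f"{name} True: {labels.sum()} ({labels.mean():.2%})")
```

Without stratification, a random split can drift between the two sets, as it did above (5.55% positives in training versus 7.17% in test).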

I tried a bunch of classification algorithms, and they all produced pretty much the same results.

Output from the logistic regression algorithm on the training set with predictions made on the test set:

Accuracy: 0.9596

Confusion matrix:
n = 223        predicted < 20   predicted >= 20
actual < 20               207                 0
actual >= 20                9                 7
Meaning that out of 223 rows, we had a total of 9 bad predictions.
The model, without any tweaking, is really good at predicting fewer than 20, but needs work on predicting 20 or more.
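
To put numbers on "needs work", the short sketch below recomputes accuracy, recall, and precision from the four counts in the confusion matrix. The variable names are mine; the counts come straight from the post.

```python
# Counts read from the test-set confusion matrix in the post.
tn, fp = 207, 0   # actual < 20:  predicted < 20, predicted >= 20
fn, tp = 9, 7     # actual >= 20: predicted < 20, predicted >= 20

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # share of real 20-goal scorers the model found
precision = tp / (tp + fp)     # share of positive predictions that were right
baseline = (tn + fp) / total   # accuracy of always predicting "< 20"

print(f"accuracy  {accuracy:.4f}")
print(f"recall    {recall:.4f}")
print(f"precision {precision:.4f}")
print(f"baseline  {baseline:.4f}")
```

The model finds 7 of the 16 actual 20-goal scorers (recall of about 44%), and always predicting "fewer than 20" would already score about 93% accuracy on this test set. With classes this lopsided, accuracy alone says little.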

If I remove the # of games played from the features, I believe that my => 20 predictions would improve.
Some players have 20 or more goals with fewer than 70 games played.

Suspect I would've achieved similar results had I used only derived value columns.

Anyway, just a possible scenario of how an NHL team, or a fantasy league pooler, could make use of advanced stats.

Cheers
 

Ted Hoffman

The other Rick Zombo
Dec 15, 2002
29,220
8,625
You're not really "predicting" anything, though. What you're really doing is saying, "these are the attributes that appear to distinguish 20-goal scorers from non-20-goal scorers" but it doesn't predict who will be a 20-goal scorer. Or, you're simply validating that players with 20+ goals in a season tend to have a certain set of characteristics with respect to advanced statistics. And even with that, is there any "a-ha" moment in the analysis that says something that no one really expected a priori?

To make this a prediction you'd want to take information from prior years and say, "given past information, this is what it tells us about the future." I mean, I think what you've done could be interesting, but it's not predictive.
 

Doctor No

Registered User
Oct 26, 2005
9,250
3,971
hockeygoalies.org
Agreed - although it's interesting and it's a great start.

What I would do is modify your data set above; currently you have (data from year N) and (20 goals in year N?). What would make your model (perhaps) predictive would be to take (data from year N) and (20 goals in year N+1)?
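
A minimal sketch of that relabeling, in plain Python. The player names, seasons, feature values, and goal totals below are hypothetical placeholders, and `build_next_season_examples` is a name made up for this illustration. Each row keeps its year N features but takes its label from the same player's year N+1 goals.

```python
# Hypothetical rows: (player, season start year, features, even-strength goals).
rows = [
    ("Player A", 2015, {"CF": 1200, "TOI/Gm": 14.1}, 18),
    ("Player A", 2016, {"CF": 1350, "TOI/Gm": 15.3}, 22),
    ("Player A", 2017, {"CF": 1377, "TOI/Gm": 15.9}, 24),
    ("Player B", 2016, {"CF": 458, "TOI/Gm": 9.5}, 4),
    ("Player B", 2017, {"CF": 500, "TOI/Gm": 10.0}, 6),
]

def build_next_season_examples(rows, threshold=20):
    """Pair year-N features with a label taken from year N+1 goals."""
    goals_by_key = {(player, year): goals for player, year, _, goals in rows}
    examples = []
    for player, year, features, _ in rows:
        next_goals = goals_by_key.get((player, year + 1))
        if next_goals is None:
            continue  # no following season on record, so no label
        examples.append((player, year, features, int(next_goals >= threshold)))
    return examples

examples = build_next_season_examples(rows)
for player, year, _, label in examples:
    print(player, year, "->", label)
```

Rows from the most recent season drop out because their outcome is not known yet; those are exactly the rows a trained model would then be asked to predict.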
 

GVM

This is a simple machine learning 101 binary classification (true or false) model applied to advanced stats.
Given certain metrics, can the model predict 1 or 0?

At a very high level...

Phase 2 of my process would look to make better predictions for true.
For this I could try:
Removing players with 0 goals from the data.
Removing or adding columns from the data.
Normalizing the data.
Mess around with weight of features (columns)
Having better domain knowledge of advanced stats :)
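
Two of the ideas above, normalizing the data and adjusting weights, can be sketched with scikit-learn. The thread never names the tooling, so the library is an assumption, and the data here is synthetic (generated with `make_classification`), not the hockey stats. Note also that the sketch weights the two classes via `class_weight="balanced"` rather than weighting individual features, which is a different technique from the one listed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (roughly 6% positives); a stand-in only.
X, y = make_classification(
    n_samples=890, n_features=14, n_informative=6,
    weights=[0.94, 0.06], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# StandardScaler normalizes each column; class_weight="balanced" makes
# errors on the rare positive class cost more during fitting.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
print("recall on positives:", recall_score(y_test, y_pred))
```

Balanced class weights usually trade some false positives for more true positives, so the confusion matrix is worth checking alongside recall.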

Once satisfied with my true prediction rate, I could repeat the process with other seasons.
I could calculate the historical mean of each metric for a 20-goal scorer, for instance.

If you want a 20 goal guy on your team, can you develop a player to this historical mean?

We could then move on to a regression (quantitative) model.
Prorate the metrics given a number of games played, compare it to the historical mean, and predict 20 + or not.
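
Prorating here just means scaling a counting stat to a full schedule at the same per-game rate. A small sketch follows; the function name `prorate` and the 17-goals-in-61-games figures are hypothetical, and 82 games is the length of an NHL regular season.

```python
def prorate(count, games_played, season_games=82):
    """Scale a counting stat to a full season at the same per-game rate."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return count * season_games / games_played

# Hypothetical player: 17 goals in 61 games.
projected = prorate(17, 61)
print(f"projected goals over 82 games: {projected:.1f}")
print("20+ pace:", projected >= 20)
```

In this example 17 goals in 61 games works out to a pace of roughly 22.9 over 82 games, so the player clears the 20-goal bar on a prorated basis despite falling short in raw totals.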

We often hear or read about models these days, but no explanation is ever given. This is a sample of what a model could look like. In a nutshell, this is how weather is forecast, how cars drive themselves, and how odds are set by Vegas.

Cheers
 

GVM

Quote:
"Train your model on data up to 16/17 included.
Make it predict the 17/18 goal scorers.
Post the results of the prediction here."

Lowering the target threshold would increase its ratio, making it less of a rare event.
I would expect a better rate for positive prediction. I'll also remove the 0 goals from the data.
Will post my findings later today. Leaving shortly for a family ski day!
 

GVM

Day late...
I removed players with 0 goals from the dataset and lowered the target threshold to 17 or more goals.
Same algorithm, no adding or removing of features.

We now have:
Original True : 85 (11.81%) Original False : 635 (88.19%)
Training True : 65 (12.04%) Training False : 475 (87.96%)
Test True : 20 (11.11%) Test False : 160 (88.89%)

Accuracy of test: 93%
Confusion matrix:
n = 180        predicted < 17   predicted >= 17
actual < 17               153                 7
actual >= 17                6                14

Total of 13 incorrect predictions.

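Of the players who actually scored 17 or more goals, how many did the model catch: 14 of 20, or 70% (recall).
When a positive value is predicted, how often had the player actually scored 17 or more goals: 14 of 21, or about 67% (precision).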

As I initially suspected, increasing the ratio of true cases improved the positive predictions: the model now catches 14 of 20 actual positives, up from 7 of 16.

I also noticed a nearly 1 to 1 correlation between the corsi and fenwick stats. I could drop one or the other from the model and get similar results.
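
As a quick illustration of that check, the sketch below computes the Pearson correlation between CF and FF using only the six sample rows posted earlier in the thread. Six rows is far too few to draw conclusions from; the full dataset would be needed to confirm the near 1-to-1 relationship. The `pearson` helper is written out by hand so the example needs nothing beyond the standard library.

```python
from math import sqrt

# CF and FF values from the six sample rows shown earlier in the thread.
cf = [1572, 458, 1135, 1105, 1406, 1377]
ff = [1242, 326, 784, 773, 1024, 1032]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson(cf, ff)
print(f"CF vs FF correlation on the sample rows: {r:.3f}")
```

When two features are this strongly correlated, dropping one rarely changes a logistic regression's predictions much, though it can make the remaining coefficients easier to interpret.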
 

supsens

Registered User
Oct 6, 2013
6,577
2,000
Is there any real difference here, or is this like using shots and then shooting % to "predict" the exact number of goals every player got? It looks like a lot of work, but without knowing your model it seems... pointless?
 

glucker

Registered User
Aug 22, 2008
7,883
1,421
London, ON
GVM said:
"Lowering the target threshold would increase its ratio, making it less of a rare event.
I would expect a better rate for positive prediction. I'll also remove the 0 goals from the data.
Will post my findings later today. Leaving shortly for a family ski day!"
I think you misunderstood what he was saying... he meant using the data from the seasons up to 2016/2017, and using it to "predict" 2017/2018 results.
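
A rough sketch of that kind of time-based split, in plain Python. The rows and the `season` field (encoded here as the year a season ends, so 2018 means 2017/2018) are hypothetical, as is the `temporal_split` helper. The point is only that the test season never leaks into training.

```python
# Hypothetical labelled rows; "season" is the year the season ended.
examples = [
    {"player": "Player A", "season": 2016, "label": 0},
    {"player": "Player A", "season": 2017, "label": 1},
    {"player": "Player A", "season": 2018, "label": 1},
    {"player": "Player B", "season": 2017, "label": 0},
    {"player": "Player B", "season": 2018, "label": 0},
]

def temporal_split(examples, test_season):
    """Train on every season before test_season, test on that season only."""
    train = [e for e in examples if e["season"] < test_season]
    test = [e for e in examples if e["season"] == test_season]
    return train, test

train, test = temporal_split(examples, test_season=2018)
print(len(train), "training rows;", len(test), "test rows")
```

This differs from a random 75/25 split of a single season, where the model sees outcomes from the same year it is being tested on.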
 

GVM

glucker said:
"I think you misunderstood what he was saying... he meant using the data from the seasons up to 2016/2017, and using it to "predict" 2017/2018 results."
Ah man.. I did misread. Beyond the scope of the original post.. However, I will work on this and post my findings.
 
