ML model to predict 20 or more goals.

GVM · Dec 27, 2018

I have minimal domain knowledge of advanced stats. I should perhaps read up.

With this in mind, I wondered whether I could predict which players scored 20+ goals at even strength in 2017-2018 based solely on advanced stats culled from www.hockey-reference.com.

Search:
For single season, in 2017-18, in Even Strength situations, playing skater, sorted by descending Games Played.

Link:
Player Advanced Stats Finder | Hockey-Reference.com

Simple binary classification model:
Based on data from hockey-reference.com, has a player scored 20+ goals?

I removed all of the derived value columns: CF%, CF% rel, C/60, Crel/60, FF%, FF% rel, PDO.
I also removed all of the face-off stats: FOW, FOL, FO%

Data looks like this:

GP	CF	CA	FF	FA	oiSH%	oiSV%	oZS%	dZS%	TOI/Gm	HIT	BLK	TK	GV	G
82	1572	1636	1242	1285	9.4	91.4	42.2	57.8	18.9	61	100	33	81	0
61	458	582	326	444	11.8	90.7	45.5	54.5	9.5	55	21	27	14	0
82	1135	957	784	716	10.1	91	58.9	41.1	12.6	37	24	30	21	1
82	1105	1137	773	840	5.5	90.7	55.9	44.1	13.2	32	35	28	49	0
78	1406	1641	1024	1237	7	90.5	44.9	55.1	19.3	101	125	15	57	0
82	1377	1302	1032	1004	10.8	90.5	58.1	41.9	15.9	132	21	21	64	1

The G column stands for goals: 0 means fewer than 20 goals scored, 1 means 20 or more.

Class distribution:
Number of True cases: 53 (5.96%)
Number of False cases: 837 (94.04%)
Data split:
74.94% in training set
25.06% in test set

Verifying predicted value split:
Original True : 53 (5.96%) Original False : 837 (94.04%)
Training True : 37 (5.55%) Training False : 630 (94.45%)
Test True : 16 (7.17%) Test False : 207 (92.83%)
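
The True share drifts a little between the training set (5.55%) and the test set (7.17%), which is what a plain random split tends to do when positives are this rare. A rough sketch of a stratified split, which samples each class separately so both sets stay close to the original 5.96%. The `stratified_split` helper, the seed, and the rounding are my own assumptions, and this is not necessarily how the split in this post was made.

```python
import random

def stratified_split(labels, test_fraction=0.25, seed=42):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Class counts from the post: 53 True, 837 False.
labels = [1] * 53 + [0] * 837
train_idx, test_idx = stratified_split(labels)

def share_true(idx):
    """Percentage of True labels among the given row indices."""
    return 100 * sum(labels[i] for i in idx) / len(idx)

print(f"Training True: {share_true(train_idx):.2f}%")
print(f"Test True:     {share_true(test_idx):.2f}%")
```

With these counts both sets land within a fraction of a point of 5.96%, so the test set is a fairer picture of the full data.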

I tried a bunch of classification algorithms, and they all produced pretty much the same results.

Output from logistic regression, fitted on the training set, with predictions made on the test set:

Accuracy: 0.9596

Confusion matrix:

n = 223	predicted < 20	predicted >= 20
actual < 20	207	0
actual >= 20	9	7

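Meaning that out of 223 rows, we had a total of 9 bad predictions, all of them 20+ scorers predicted as fewer than 20.
The model, without any tweaking, is really good at predicting fewer than 20, but needs work on predicting 20 or more: it found only 7 of the 16 actual 20+ scorers.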
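
To put numbers on that, here is a small sketch that derives the usual metrics straight from the four cells of the confusion matrix. The counts are the ones in the table; the variable names and the "always predict fewer than 20" baseline are my additions for comparison.

```python
# Counts from the confusion matrix in the post (positive class = 20 or more goals).
tn, fp = 207, 0   # actual < 20
fn, tp = 9, 7     # actual >= 20
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # share of actual 20+ scorers the model found
precision = tp / (tp + fp)     # share of predicted 20+ that were correct
baseline = (tn + fp) / total   # accuracy of always predicting "< 20"

print(f"accuracy  {accuracy:.4f}")
print(f"recall    {recall:.4f}")
print(f"precision {precision:.4f}")
print(f"baseline  {baseline:.4f}")
```

Accuracy comes out at 0.9596, but always guessing "fewer than 20" would already score 0.9283, so with classes this lopsided accuracy flatters the model. Recall (0.4375) is the number that needs work.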

If I remove the number of games played from the features, I believe that my 20+ predictions would improve.
Some players have 20 or more goals with fewer than 70 games played.

I suspect I would've achieved similar results had I used only the derived value columns.

Anyway, just a possible scenario for how an NHL team or a fantasy league pooler could make use of advanced stats.

Cheers

Michael Farkas · Dec 27, 2018

I don't understand this...but I want to.

Ted Hoffman · Dec 27, 2018

You're not really "predicting" anything, though. What you're really doing is saying, "these are the attributes that appear to distinguish 20-goal scorers from non-20-goal scorers" but it doesn't predict who will be a 20-goal scorer. Or, you're simply validating that players with 20+ goals in a season tend to have a certain set of characteristics with respect to advanced statistics. And even with that, is there any "a-ha" moment in the analysis that says something that no one really expected a priori?

To make this a prediction you'd want to take information from prior years and say, "given past information, this is what it tells us about the future." I mean, I think what you've done could be interesting, but it's not predictive.

Doctor No · Dec 27, 2018

Agreed - although it's interesting and it's a great start.

What I would do is modify your data set above; currently you have (data from year N) and (20 goals in year N?). What would make your model (perhaps) predictive would be to take (data from year N) and (20 goals in year N+1)?
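
A minimal sketch of that reshaping, assuming you have one row per player per season. The players, feature values, and the `next_season_pairs` helper are all hypothetical and only illustrate pairing year-N stats with a year-N+1 label.

```python
# Hypothetical rows: (player, season start year, features, goals). All values are made up.
rows = [
    ("Player A", 2016, {"CF": 1100, "TOI/Gm": 14.2}, 18),
    ("Player A", 2017, {"CF": 1250, "TOI/Gm": 15.8}, 23),
    ("Player B", 2016, {"CF": 900, "TOI/Gm": 12.1}, 9),
    ("Player B", 2017, {"CF": 950, "TOI/Gm": 12.5}, 11),
    ("Player C", 2017, {"CF": 700, "TOI/Gm": 10.0}, 4),  # no 2016 row
]

def next_season_pairs(rows, threshold=20):
    """Pair each player's year-N features with a 0/1 label from year N+1."""
    goals = {(player, year): g for player, year, _, g in rows}
    pairs = []
    for player, year, features, _ in rows:
        following = goals.get((player, year + 1))
        if following is None:
            continue  # no next season on record, so no label
        pairs.append((player, year, features, int(following >= threshold)))
    return pairs

for pair in next_season_pairs(rows):
    print(pair)
```

Only rows with a following season survive, so the most recent season in the data supplies labels but never features.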

GVM · Dec 28, 2018

This is a simple machine learning 101 binary classification (true or false) model applied to advanced stats.
Given certain metrics, can the model predict 1 or 0?

At a very high level...

Phase 2 of my process would look to make better predictions for true.
For this I could try:
Removing players with 0 goals from the data.
Removing or adding columns from the data.
Normalizing the data.
Messing around with the weight of features (columns).
Gaining better domain knowledge of advanced stats.
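
On the "normalizing" item: the columns sit on very different scales (CF in the hundreds or thousands, TOI/Gm in the teens), which can matter for some algorithms. One common option is z-score standardization, sketched below on two columns from the six sample rows in the first post. The `standardize` helper is my own, and whether it helps this particular model is untested.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale a column to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - mu) / sigma for v in values]

# Two columns from the six sample rows in the first post.
cf = [1572, 458, 1135, 1105, 1406, 1377]
toi_per_game = [18.9, 9.5, 12.6, 13.2, 19.3, 15.9]

print([round(z, 2) for z in standardize(cf)])
print([round(z, 2) for z in standardize(toi_per_game)])
```

In a real run the mean and standard deviation should come from the training set only and then be applied to the test set, so nothing about the test rows leaks into training.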

Once satisfied with my true prediction rate, I could repeat the process with other seasons.
I could calculate the historical mean for the metrics of a 20-goal scorer, for instance.

If you want a 20 goal guy on your team, can you develop a player to this historical mean?

We could then move on to a regression (quantitative) model.
Prorate the metrics given a number of games played, compare it to the historical mean, and predict 20 + or not.
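
Prorating here just means scaling a counting stat to a full schedule at the same per-game rate. A small sketch, assuming an 82-game season; the `prorate` helper is hypothetical, and the inputs are the 61-game sample row from the first post.

```python
def prorate(value, games_played, season_games=82):
    """Scale a counting stat to a full season at the same per-game rate."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return value * season_games / games_played

# Second sample row from the first post: 61 GP, 458 CF, 55 HIT.
print(round(prorate(458, 61), 1))
print(round(prorate(55, 61), 1))
```

This only makes sense for counting stats such as CF, FF, HIT or BLK. Rates and percentages (oiSH%, oZS%, TOI/Gm) are already per-game or per-shot and should be left alone.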

We often hear or read about models these days, but no explanation is ever given. This is a sample of what a model could look like. In a nutshell, this is how weather is forecast, cars drive themselves, and odds are given by Vegas.

Cheers

morehockeystats · Dec 29, 2018

Train your model on data up to 16/17 included.
Make it predict the 17/18 goal scorers.
Post the results of the prediction here.
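
A minimal sketch of that kind of split, assuming each row carries its season. The rows and the `split_by_season` helper are hypothetical, with made-up values; the point is only that the 17/18 rows never reach training.

```python
# Hypothetical (season start year, features, label) rows; values are made up.
rows = [
    (2014, [1010, 13.5], 0),
    (2015, [1190, 15.1], 1),
    (2016, [980, 12.9], 0),
    (2017, [1230, 16.0], 1),  # 2017 = the 17/18 season
    (2017, [870, 11.2], 0),
]

def split_by_season(rows, test_season=2017):
    """Train on every season before test_season, test on test_season only."""
    train = [r for r in rows if r[0] < test_season]
    test = [r for r in rows if r[0] == test_season]
    return train, test

train, test = split_by_season(rows)
print(len(train), "training rows,", len(test), "test rows")
```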

GVM · Dec 29, 2018

morehockeystats said:
Train your model on data up to 16/17 included.
Make it predict the 17/18 goal scorers.
Post the results of the prediction here.

Lowering the target threshold would increase the ratio of true cases, making it less of a rare event.
I would expect a better rate for positive prediction. I'll also remove the 0 goals from the data.
Will post my findings later today. Leaving shortly for a family ski day!

GVM · Dec 30, 2018

Day late...
I removed players with 0 goals from the dataset and lowered the target threshold to 17 or more goals.
Same algorithm, no adding or removing of features.
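
The two changes amount to a filter and a relabel. A rough sketch, assuming you still have the raw goal totals before they were turned into 0/1; the `relabel` helper and the goal counts are hypothetical.

```python
def relabel(goal_counts, threshold=17, drop_zero=True):
    """Drop 0-goal players (optional) and return 0/1 labels for the rest."""
    kept = [g for g in goal_counts if g > 0 or not drop_zero]
    return [int(g >= threshold) for g in kept]

# Hypothetical even-strength goal totals; values are made up.
goals = [0, 3, 17, 22, 0, 16, 19, 8]
labels = relabel(goals)

print(labels)
print(f"{100 * sum(labels) / len(labels):.1f}% True")
```

Both steps raise the share of True cases: dropping 0-goal players shrinks the False side, and the lower threshold grows the True side.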

We now have:
Original True : 85 (11.81%) Original False : 635 (88.19%)
Training True : 65 (12.04%) Training False : 475 (87.96%)
Test True : 20 (11.11%) Test False : 160 (88.89%)

Accuracy of test: 93%
Confusion matrix:

n = 180	predicted < 17	predicted >= 17
actual < 17	153	7
actual >= 17	6	14

Total of 13 incorrect predictions.

Of the players who actually scored 17 or more goals, how often did the model predict it: 70% (14 of 20). When it predicted 17 or more, it was right 67% of the time (14 of 21).

As I initially suspected, increasing the ratio of true cases improved the positive predictions: in the first run the model found only 7 of 16 (44%).

I also noticed a nearly 1-to-1 correlation between the Corsi and Fenwick stats. I could drop one or the other from the model and get similar results.
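
That overlap shows up even in the six sample rows from the first post. A quick sketch computing the Pearson correlation between CF and FF; the `pearson` helper is my own, and six rows are only a sanity check, not a substitute for running it on the full dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# CF and FF from the six sample rows in the first post.
cf = [1572, 458, 1135, 1105, 1406, 1377]
ff = [1242, 326, 784, 773, 1024, 1032]

print(f"r(CF, FF) = {pearson(cf, ff):.3f}")
```

A value close to 1 means the two columns carry largely the same information, which is consistent with dropping one of them changing little.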

supsens · Jan 1, 2019

Is there any real difference here or is this like using shots then shooting% and 'predicting' the exact amount of goals every player got? It looks like a lot of work but without knowing your model it seems... pointless?

glucker · Jan 2, 2019

GVM said:
Lowering the target threshold would increase the ratio of true cases, making it less of a rare event.
I would expect a better rate for positive prediction. I'll also remove the 0 goals from the data.
Will post my findings later today. Leaving shortly for a family ski day!

I think you misunderstood what he was saying... he meant using the data from the seasons up to 2016/2017, and using it to "predict" 2017/2018 results.

GVM · Jan 2, 2019

glucker said:
I think you misunderstood what he was saying... he meant using the data from the seasons up to 2016/2017, and using it to "predict" 2017/2018 results.

Ah man.. I did misread. Beyond the scope of the original post.. However, I will work on this and post my findings.
