Introducing a new stat: Location Adjusted Expected Goals Percentage

Wesleyy · Aug 28, 2013

http://hockeymetrics.net/introducing-a-new-stat-location-adjusted-expected-goals-percentage/

Basically an improvement on Corsi: it takes shot location into account on top of shot quantity. Took me over a month of work to finish; would love to hear your feedback.

do0glas · Aug 28, 2013

unfortunately i cant view the heat maps at work.

just to help me understand better: you gave a percentage weight to spots on the ice given how often goals are scored from there league-wide?

Wesleyy · Aug 28, 2013

do0glas said:
unfortunately i cant view the heat maps at work.

just to help me understand better: you gave a percentage weight to spots on the ice given how often goals are scored from there league-wide?

right, the heatmap is just a visual representation of the percentages. It's pretty much what you would expect, the percentage lowers as it becomes further from the slot.

do0glas · Aug 28, 2013

Wesleyy said:
right, the heatmap is just a visual representation of the percentages. It's pretty much what you would expect, the percentage lowers as it becomes further from the slot.

okay,

im trying to wrap my head around it all.

is the LAEGAP an average percentage, so basically you took all of the heatmap data and gave every shot an average? or does each individual player get a percentage based on where they take the majority of their shots?

either way, i think you did a great job on this. im just trying to better understand it :dunce:

Bear of Bad News · Aug 28, 2013

Wesleyy, thanks for putting this together. I intend to comment more later, but a quick read-through suggests that this is a great step forward.

Wesleyy · Aug 28, 2013

do0glas said:
okay,

im trying to wrap my head around it all.

is the LAEGAP an average percentage, so basically you took all of the heatmap data and gave every shot an average? or does each individual player get a percentage based on where they take the majority of their shots?

either way, i think you did a great job on this. im just trying to better understand it

Essentially, the heatmap is a visual representation of the league-average shooting percentage at each position on the ice. EGF is the estimated "+" in +/- for each player if his shots at their respective locations each have a league-average chance of going in. EGA is the "-". EG% is EGF/(EGF+EGA). LAEGAP is an estimate of the lowest plausible "true" EG%: the lower bound of a 95% confidence interval. The more you play, the smaller the interval and the higher your LAEGAP will be, assuming you stay perfectly consistent with your past performance throughout your ice time.
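A minimal sketch of the definitions above, for readers who prefer code. The thread does not say how the 95% interval is computed, so the lower bound here uses a plain normal-approximation (Wald) interval and a hypothetical sample size `n`; both are assumptions for illustration, not the method behind LAEGAP.

```python
import math

def eg_pct(egf, ega):
    """EG% = EGF / (EGF + EGA)."""
    return egf / (egf + ega)

def lower_bound(p, n, z=1.96):
    """Lower end of a normal-approximation 95% interval around p.

    ASSUMPTION: the Wald interval and the sample size n are illustrative
    choices; the thread does not specify the interval actually used.
    """
    return p - z * math.sqrt(p * (1.0 - p) / n)

# Same EG%, different amounts of ice time (hypothetical sample sizes).
p = eg_pct(egf=55.0, ega=45.0)
small_sample = lower_bound(p, n=100)
large_sample = lower_bound(p, n=1000)
```

With the same EG%, the larger sample gives a narrower interval and therefore a higher lower bound, which matches the "the more you play, the higher your LAEGAP" behaviour described in the post.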

Ohashi_Jouzu* · Aug 28, 2013

Really like where this is going, what it strives to do, and the apparent reliability. Good job, bro. Looking forward to everything this can turn into.

Cunneen · Aug 28, 2013

Unfortunately, I don't think the shot location data is even close to accurate enough for us to create metrics based off shot location. The data that the NHL tracks is just horrendous. Truly horrendous.

http://www.habseyesontheprize.com/2013/2/20/4005122/how-reliable-is-the-nhl-com-shot-tracker

Bear of Bad News · Aug 29, 2013

Cunneen said:
Unfortunately, I don't think the shot location data is even close to accurate enough for us to create metrics based off shot location. The data that the NHL tracks is just horrendous. Truly horrendous.

http://www.habseyesontheprize.com/2013/2/20/4005122/how-reliable-is-the-nhl-com-shot-tracker

Reading that blog, I don't see the errors as being as large as the blogger claims.

Plus, the foundation of the blog's claim is that the other measurement is perfectly accurate (which it certainly can't be). The fact that the two measurements differ isn't 100% the fault of the NHL's tally.

Beyond that, let's suppose that the measurement is as inaccurate as the blogger claims - it's still pretty good. What's a better analytical tool; one that's slightly off (but still gets the general gist that shots closer to the net are better scoring chances) or one that treats all shots on goal as equally likely to produce a goal (as CORSI does)?

Wesleyy · Aug 29, 2013

Taco MacArthur said:
Reading that blog, I don't see the errors as being as large as the blogger claims.

Plus, the foundation of the blog's claim is that the other measurement is perfectly accurate (which it certainly can't be). The fact that the two measurements differ isn't 100% the fault of the NHL's tally.

Beyond that, let's suppose that the measurement is as inaccurate as the blogger claims - it's still pretty good. What's a better analytical tool; one that's slightly off (but still gets the general gist that shots closer to the net are better scoring chances) or one that treats all shots on goal as equally likely to produce a goal (as CORSI does)?

On top of that, I've put a lot of effort into correcting the data as best as I could. I think my method of correcting the recording bias in the NHL is the most accurate out of all the other ones I've read. Obviously it's not perfect, but over a large sample size it should provide a much better estimation than the raw data.

Cunneen, if you read the methodology post (I don't know if you have read it yet), in which I go into a lot more detail about how I attempt to solve the recording bias problem, you might change your mind regarding the data accuracy.

matnor · Aug 29, 2013

First off, let me just say that I think this is very interesting work and I think you have done a very good job! I have a couple of comments and suggestions which could possibly be used to improve your work further.

1) I read the methodology paper and I am a bit unsure how you correct for arena bias. It seems to me that you just remove the average error for each arena but that seems a bit odd as shots taken close to the net are unlikely to have the same error as those taken far from the net. Wouldn't this also mean that you record some shots as taken behind the net when they were actually taken right in front. Maybe I'm missing something here, I haven't really thought it through.

2) The weighting function you use seems perfectly fine but is arbitrarily chosen. If you are really interested, you can use a data-driven method to select the weighting function. My suggestion would be to use kernel regression with a cross-validation method to select the bandwidth. It might be that there isn't enough data to get a small enough bandwidth, but it could be worth trying. I can recommend using the np-package in R for this. I should say, this is a really technical comment that is by no means necessary, and I don't really know what your background is, but if you are interested in learning about nonparametric estimation techniques it might be fun to test

3) I'm not really sure I think the way you use the lower bound of the confidence interval to take care of the small sample issue is the best way. Just spitballing an idea here: what if you instead used a Bayesian method setting the prior shooting percentage to be 0? I'm not very familiar with Bayesian statistics but it seems to me that it could take care of the problem. Otherwise, I know that people often only want to show a single estimate but I think it's better to show the confidence interval to indicate just how uncertain the statistic is.

4) It would be really nice to see a scatterplot comparing your method with regular CORSI to get an impression of how important shot location is.

Anyway, these comments aren't that important and it seems that what you have done works perfectly fine, just throwing out some ideas.
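To make suggestion 2 concrete, here is a minimal sketch of kernel regression with a cross-validated bandwidth, written in plain Python rather than the R np package mentioned above. The one-dimensional setup (shot distance against a 0/1 goal outcome), the Gaussian kernel, the candidate bandwidths, and the toy data are all assumptions made up for illustration.

```python
import math

def nw_estimate(x0, xs, ys, h, skip=None):
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel of bandwidth h.

    skip: index of an observation to leave out (used for cross-validation).
    """
    num = 0.0
    den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = math.exp(-0.5 * ((x - x0) / h) ** 2)
        num += w * y
        den += w
    # If every weight underflows to zero, fall back to 0.0.
    return num / den if den > 0.0 else 0.0

def loo_cv_error(xs, ys, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    total = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        total += (y - nw_estimate(x, xs, ys, h, skip=i)) ** 2
    return total / len(xs)

def select_bandwidth(xs, ys, candidates):
    """Pick the candidate bandwidth with the smallest leave-one-out error."""
    return min(candidates, key=lambda h: loo_cv_error(xs, ys, h))

# Hypothetical toy data: shot distance in feet, 1 = goal, 0 = no goal.
distances = [5, 8, 10, 12, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60]
goals = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

best_h = select_bandwidth(distances, goals, candidates=[2.0, 5.0, 10.0, 20.0])
near = nw_estimate(10, distances, goals, h=10.0)
far = nw_estimate(55, distances, goals, h=10.0)
```

The point of the cross-validation step is that the smoothing width is chosen by out-of-sample error instead of being fixed by hand; whether the real shot data supports a small enough bandwidth is exactly the open question raised in the post.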

Ohashi_Jouzu* · Aug 29, 2013

Taco MacArthur said:
Reading that blog, I don't see the errors as being as large as the blogger claims.

Plus, the foundation of the blog's claim is that the other measurement is perfectly accurate (which it certainly can't be). The fact that the two measurements differ isn't 100% the fault of the NHL's tally.

Beyond that, let's suppose that the measurement is as inaccurate as the blogger claims - it's still pretty good. What's a better analytical tool; one that's slightly off (but still gets the general gist that shots closer to the net are better scoring chances) or one that treats all shots on goal as equally likely to produce a goal (as CORSI does)?

Totally agree that it looks like a step in the right direction. Matnor also makes some interesting suggestions for fine-tuning.

Kershaw · Aug 29, 2013

Damn looks like a lot of work was put into it, I will continue to follow this. Great work and it is pretty challenging. And I agree that this is a step in the right direction.

blue425 · Aug 29, 2013

Gave it a look and my brain melted after a few minutes. Damn fine work though.

I'll try again..

do0glas · Aug 29, 2013

Wesleyy said:
Essentially, the heatmap is a visual representation of the league-average shooting percentage at each position on the ice. EGF is the estimated "+" in +/- for each player if his shots at their respective locations each have a league-average chance of going in. EGA is the "-". EG% is EGF/(EGF+EGA). LAEGAP is an estimate of the lowest plausible "true" EG%: the lower bound of a 95% confidence interval. The more you play, the smaller the interval and the higher your LAEGAP will be, assuming you stay perfectly consistent with your past performance throughout your ice time.

okay, i like that explanation better. maybe it was tougher since i couldnt view the heat map.

So Corsi is a possession metric, rather than a shot metric. it just happens to use shots as the basis to determine possession. Would you say this is more of a scoring metric? who are the guys that can sustain a consistently higher than average shot quality?

ive always thought that a passes completed percentage in the offensive zone is a more reliable possession metric, but no one tracks that data like they do for soccer.

LAEGAP seems like it would be great to have alongside Corsi (i.e. player x seems to really boost his team's possession on ice, but does he improve the actual goals percentage in a tangible way?) so it really complements Corsi rather than improves upon it, imo. so for someone like Tyler Kennedy, who seems to just take shots from anywhere...does his volume shooting really have a tangible effect on the ice, or is it just keeping the puck in the zone hoping for rebounds?

great stuff

Jyrki · Aug 29, 2013

Very interesting work. Thank you so much!

BTW, have you aggregated data from before 2011-12? I like how the results could be replicated between the last two seasons, but it would be even better to see how the metric stacks up over a larger selection of seasons. I've been flirting with some new statistics, and I've noticed the correlation from season to season can vary quite a bit.

EDIT:

To jump into the above conversation, I think the debate over whether shots adjusted for location make for a better metric than simple shot-taking ends up being more philosophical. For instance, a team can spend a lot of time in the offensive zone with the ultimate goal of producing lots of point shots, or it can attempt to produce relatively few shots from the slot. When we consider that sort of scenario, you can say a location-adjusted metric would be better, since it accounts for the fact that the latter team is trying to generate higher-percentage plays than the former team.

Badger Mayhew* · Aug 29, 2013

The link says you're 17? I was not expecting such quality work to be done by somebody your age. Very impressive.

Ohashi_Jouzu* · Aug 29, 2013

do0glas said:
ive always thought that a passes completed percentage in the offensive zone is a more reliable possession metric, but no one tracks that data like they do for soccer.

I've always kinda liked the idea of this, as well. Like many hockey playing Canadians, my summer sport/passion was soccer, and I've also wondered what kind of trends we'd see if average number of touches from possession to scoring play, or consecutive during possession in general, was tracked. It would be interesting if teams with the lowest average touches to create a scoring play from possession happened to be viewed as the more "potent" offenses, or if teams with the most touches per possession were seen as good "possession" teams.

Jyrki · Aug 29, 2013

Ohashi_Jouzu said:
I've always kinda liked the idea of this, as well. Like many hockey playing Canadians, my summer sport/passion was soccer, and I've also wondered what kind of trends we'd see if average number of touches from possession to scoring play, or consecutive during possession in general, was tracked. It would be interesting if teams with the lowest average touches to create a scoring play from possession happened to be viewed as the more "potent" offenses, or if teams with the most touches per possession were seen as good "possession" teams.

Problem with tracking passes is that hockey is much faster paced than soccer, and it's arguable if many common plays that end up with the puck going to another player can actually be called a "pass" (e.g. is a dump-in a pass? a deflection? a banked shot? a loose puck? a clearing attempt?)

Wesleyy · Aug 30, 2013

matnor said:
First off, let me just say that I think this is very interesting work and I think you have done a very good job! I have a couple of comments and suggestions which could possibly be used to improve your work further.

1) I read the methodology paper and I am a bit unsure how you correct for arena bias. It seems to me that you just remove the average error for each arena but that seems a bit odd as shots taken close to the net are unlikely to have the same error as those taken far from the net. Wouldn't this also mean that you record some shots as taken behind the net when they were actually taken right in front. Maybe I'm missing something here, I haven't really thought it through.

2) The weighting function you use seems perfectly fine but is arbitrarily chosen. If you are really interested, you can use a data-driven method to select the weighting function. My suggestion would be to use kernel regression with a cross-validation method to select the bandwidth. It might be that there isn't enough data to get a small enough bandwidth, but it could be worth trying. I can recommend using the np-package in R for this. I should say, this is a really technical comment that is by no means necessary, and I don't really know what your background is, but if you are interested in learning about nonparametric estimation techniques it might be fun to test

3) I'm not really sure I think the way you use the lower bound of the confidence interval to take care of the small sample issue is the best way. Just spitballing an idea here: what if you instead used a Bayesian method setting the prior shooting percentage to be 0? I'm not very familiar with Bayesian statistics but it seems to me that it could take care of the problem. Otherwise, I know that people often only want to show a single estimate but I think it's better to show the confidence interval to indicate just how uncertain the statistic is.

4) It would be really nice to see a scatterplot comparing your method with regular CORSI to get an impression of how important shot location is.

Anyway, these comments aren't that important and it seems that what you have done works perfectly fine, just throwing out some ideas.

1) I agree it's not perfect, and some points do end up on the other side of the goal line or blue line. Obviously the shots at a given arena do not all vary by the same distance, but I considered the other options and decided that this would land closest to their actual locations. I've also attempted to ease this error by regressing the points. Since we can only really measure trends in recorder bias, I think the current method is good enough. A better solution could be to use visual anchors (faceoff circles and dots, goal line, blue line) instead of positive/negative x/y points to correct the recording bias, on the assumption that the recorders plot shot locations using those visual anchors. Even then, I think I would have to regress the points to a certain extent, and the difference between that method and my current method would be marginal.

2) The 5-foot radius is partly arbitrary. I decided on 5 ft for two reasons. One, it is the approximate distance from a player's stick blade to his skate, so a recorder could technically have a 5-foot margin of error on either side depending on the player's handedness. Two, it was about the largest distance bias an arena had (NYI with -4.3 and 3.3 ft on the positive end). The 75% exponential weighting was definitely arbitrary though. I'm not familiar with non-parametric regression; my understanding is that it selects weights based on the number of data points available? Since I am not familiar with it, I can't say for sure, but since, like I mentioned before, we can really only measure trends in recorder bias, an improved regression method will most likely have only a minute effect on the data while adding a whole lot of complexity to the stat.

3) I agree with posting the interval; I think including only the lower bound confused some people. I will probably update the tables to include the probability and its interval when I have the time. As for the Bayesian interval, I think it's parallel to confidence intervals, and using one over the other would essentially be a lateral move. As for setting the prior to 0, I have no idea what you mean by that since, in my understanding, credible intervals rely completely on the prior to make an accurate prediction, so setting them all to zero would make it useless? Maybe I'm misunderstanding something from your post.

4) What would a Corsi vs LAEGAP plot prove? How is that a measurement of the importance of shot location? It is 100% certain that shooting at a historically high percentage location will have a higher chance of going in versus shooting at a historically low percentage location. Again, maybe I'm misunderstanding something.
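For readers following point 1, the sketch below shows what "remove the average error for each arena" could look like in code. It is only an illustration under assumptions: the offset is estimated as the gap between an arena's mean recorded location and the league-wide mean, and `shrink` is a hypothetical knob standing in for the "regressing the points" step. The methodology post, not this sketch, describes what was actually done.

```python
from collections import defaultdict
from statistics import mean

def arena_offsets(shots):
    """Mean (dx, dy) of each arena's recorded shots relative to the league mean.

    shots: list of (arena, x, y) tuples, coordinates in feet.
    ASSUMPTION: the league-wide mean location is used as the reference.
    """
    league_x = mean(x for _, x, _ in shots)
    league_y = mean(y for _, _, y in shots)
    by_arena = defaultdict(list)
    for arena, x, y in shots:
        by_arena[arena].append((x, y))
    return {
        arena: (mean(x for x, _ in pts) - league_x,
                mean(y for _, y in pts) - league_y)
        for arena, pts in by_arena.items()
    }

def correct_shots(shots, shrink=1.0):
    """Shift every shot by its arena's offset.

    shrink: hypothetical factor in [0, 1]; values below 1 remove only part
    of the measured offset.
    """
    offsets = arena_offsets(shots)
    return [
        (arena, x - shrink * offsets[arena][0], y - shrink * offsets[arena][1])
        for arena, x, y in shots
    ]

# Hypothetical toy data: arena "A" records shots 5 ft deeper and 1 ft wider
# than the league mean, arena "B" the opposite.
toy = [("A", 60, 0), ("A", 70, 4), ("B", 50, -2), ("B", 60, 2)]
offsets = arena_offsets(toy)
corrected = correct_shots(toy)
```

Because every shot in an arena is shifted by the same amount, a shot recorded right in front of the net can land behind the goal line after correction, which is the side effect matnor raised and Wesleyy acknowledges above.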

Jyrki said:
Very interesting work. Thank you so much!

BTW, have you aggregated data from before 2011-12? I like how the results could be replicated between the last two seasons, but it would be even better to see how the metric stacks up over a larger selection of seasons. I've been flirting with some new statistics, and I've noticed the correlation from season to season can vary quite a bit.

EDIT:

To jump into the above conversation, I think the debate over whether shots adjusted for location make for a better metric than simple shot-taking ends up being more philosophical. For instance, a team can spend a lot of time in the offensive zone with the ultimate goal of producing lots of point shots, or it can attempt to produce relatively few shots from the slot. When we consider that sort of scenario, you can say a location-adjusted metric would be better, since it accounts for the fact that the latter team is trying to generate higher-percentage plays than the former team.

I haven't compiled the data for seasons prior to 2011 yet. I am working on another article that needs the past seasons' data, so when I finish that I will update and post more correlation plots for the older seasons.

It's actually not philosophical at all whether adjusting for location makes for a better metric. It is certain (assuming the location data is not so inaccurate that it resembles pure randomness, which it isn't). I think what you mean is that one team might go for more point shots than high-percentage shots and end up scoring more goals because it was able to take so many shots. This is actually the core idea of LAEGAP/EG%. For example, if team A has 20 shots at the blue line, where the average shooting percentage is .10, and team B has 4 shots at a .25 location, team A will have a better EGF (2 vs 1) than team B. Team A is expected to score 2 goals, and team B is expected to score 1 goal.
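The arithmetic in that example can be written out as a short sketch. The function name and the (count, probability) representation are illustrative choices; the shot counts and percentages are the ones from the post.

```python
def expected_goals(shots):
    """Sum of shot count times league-average scoring probability.

    shots: list of (number_of_shots, scoring_probability) pairs,
    one pair per location on the ice.
    """
    return sum(count * prob for count, prob in shots)

team_a = [(20, 0.10)]  # 20 shots from the blue line at .10 each
team_b = [(4, 0.25)]   # 4 shots from a .25 location

egf_a = expected_goals(team_a)
egf_b = expected_goals(team_b)

# EG% for team A if these two teams were playing each other.
eg_pct_a = egf_a / (egf_a + egf_b)
```

Volume and location both feed the same total, so a team firing many low-percentage shots can still come out ahead of a team taking a few high-percentage ones.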

do0glas said:
So Corsi is a possession metric, rather than a shot metric. it just happens to use shots as the basis to determine possession. Would you say this is more of a scoring metric?

Every statistic that anyone has come up with tries to predict one thing and one thing only: wins. Possession and Corsi are rated so highly because they have a strong correlation with winning, which means scoring more than your opponent. If you think about it logically, there are only 3 ways LAEGAP/EG% won't translate into wins:

1. the data is inaccurate.
2. players on the team have less than average shooting skills.
3. your goalie sucks

Corsi has all three exceptions, plus two:

1. your team shoots in low scoring areas (doesn't create good chances)
2. your team allows shots from high scoring areas (allows good chances against)

do0glas said:
who are the guys that can sustain a consistently higher than average shot quality?

That's something I am going to look at in the future. There's so much content and info to extract from this data and so many articles to write. I am pretty excited :laugh:

Badger Mayhew said:
The link says you're 17? I was not expecting such quality work to be done by somebody your age. Very impressive.

Ha thanks, I actually just turned 17 this month.

Jyrki · Aug 30, 2013

Wesleyy said:
I haven't compiled the data for seasons prior to 2011 yet. I am working on another article that needs the past seasons' data, so when I finish that I will update and post more correlation plots for the older seasons.

Good to know!

Just wondering, how do you compile the shot distance data? I recall last month there being a thread on just that, and I don't think anyone came up with a satisfactory answer.

Wesleyy said:
It's actually not philosophical at all whether adjusting for location makes for a better metric. It is certain (assuming the location data is not so inaccurate that it resembles pure randomness, which it isn't). I think what you mean is that one team might go for more point shots than high-percentage shots and end up scoring more goals because it was able to take so many shots. This is actually the core idea of LAEGAP/EG%. For example, if team A has 20 shots at the blue line, where the average shooting percentage is .10, and team B has 4 shots at a .25 location, team A will have a better EGF (2 vs 1) than team B. Team A is expected to score 2 goals, and team B is expected to score 1 goal.

My apologies if I wasn't clear enough; I'm not contesting that LAEGAP might be a better win predictor than Corsi differential, I was just arguing about their nature as possession predictors. Say a team has a neutral LAEGAP: it roughly tells us the team is as efficient in its offensive possessions as it is inefficient on the defensive, or vice versa. It doesn't say much about how much a team has either owned the puck or let it be controlled by opponents, however. Corsi has its own shortcomings as a possession statistic, but we can all agree that taking a shot implies owning the puck, and allowing a shot against implies not having the puck.

Not sure if we're all concerned with that, though. :laugh:

Ohashi_Jouzu* · Aug 30, 2013

VinnyC said:
Problem with tracking passes is that hockey is much faster paced than soccer, and it's arguable if many common plays that end up with the puck going to another player can actually be called a "pass" (e.g. is a dump-in a pass? a deflection? a banked shot? a loose puck? a clearing attempt?)

Simply the number of touches would be fine, and it wouldn't have to be 100% controlled along the way, either, as long as the opponent doesn't gain possession. Just something different from a possession time statistic separated into zones of the ice (although that might be nice, too), or similar, like you see in soccer.

I dunno, I haven't really thought about it that much. Just a random thought that occasionally comes up.

The Legend · Aug 30, 2013

Wesleyy said:
http://hockeymetrics.net/introducing-a-new-stat-location-adjusted-expected-goals-percentage/

Basically an improvement on Corsi: it takes shot location into account on top of shot quantity. Took me over a month of work to finish; would love to hear your feedback.

It's a smart step; but will need to eventually be adjusted for opposition. I assume that certain teams have defensive styles that will make shots from different locations "higher quality". This works in a balanced schedule with balanced lines.

Blue Blooded · Aug 30, 2013

I find this really interesting, great work!

One question though; it seems like you have based the metric on shots, wouldn't it have been better to base it on Corsi/Fenwick instead?

1. You'd get a larger sample size for each player.

2. Just because a shot from an area is more likely to go in it doesn't mean that a shot attempt is more likely to do the same. There might be a higher frequency of blocks, or harder to hit the net from that area.

Point #2 is likely pretty (or completely) insignificant. But there is a reason Corsi and Fenwick are preferred over shots, shouldn't you have used one of/both of them instead?

Devilsfan992 · Aug 30, 2013

Great work! :handclap:

Still surprised you're only 17 years old. This is senior-year-of-college/post-grad work. I wish Pete DeBoer could read this journal and realize he should be consistently starting Mark Fayne.
