First off, let me just say that I think this is very interesting work and you have done a very good job! I have a couple of comments and suggestions that could possibly be used to improve your work further.
1) I read the methodology paper and I am a bit unsure how you correct for arena bias. It seems to me that you just remove the average error for each arena, but that seems a bit odd, as shots taken close to the net are unlikely to have the same error as those taken far from the net. Wouldn't this also mean that you record some shots as taken behind the net when they were actually taken right in front? Maybe I'm missing something here; I haven't really thought it through.
2) The weighting function you use seems perfectly fine but is arbitrarily chosen. If you are really interested, you can use a data-driven method to select the weighting function. My suggestion would be to use kernel regression with a cross-validation method to select the bandwidth. It might be that there isn't enough data to get a small enough bandwidth, but it could be worth trying. I can recommend using the np package in R for this. I should say, this is a really technical comment that is by no means necessary, and I don't really know what your background is, but if you are interested in learning about nonparametric estimation techniques it might be fun to test.
3) I'm not really sure that using the lower bound of the confidence interval is the best way to take care of the small sample issue. Just spitballing an idea here: what if you instead used a Bayesian method setting the prior shooting percentage to be 0? I'm not very familiar with Bayesian statistics but it seems to me that it could take care of the problem. Otherwise, I know that people often only want to show a single estimate, but I think it's better to show the confidence interval to indicate just how uncertain the statistic is.
4) It would be really nice to see a scatterplot comparing your method with regular CORSI to get an impression of how important shot location is.
Anyway, these comments aren't that important and it seems that what you have done works perfectly fine, just throwing out some ideas.
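For anyone curious what the kernel regression idea in point 2 looks like in practice, here is a minimal Python sketch. It is not the author's method and not the R np package; it is a plain Nadaraya-Watson estimator with a Gaussian kernel, with the bandwidth picked by leave-one-out cross-validation. The shot data is synthetic, and the names `distance`, `goal`, and the candidate bandwidths are made up for illustration.

```python
import numpy as np

def nw_predict(x_train, y_train, x_eval, bandwidth):
    """Nadaraya-Watson estimate at x_eval using a Gaussian kernel."""
    u = (x_eval[:, None] - x_train[None, :]) / bandwidth
    w = np.exp(-0.5 * u ** 2)
    return (w @ y_train) / w.sum(axis=1)

def loo_cv_score(x, y, bandwidth):
    """Mean squared leave-one-out prediction error for one bandwidth."""
    u = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * u ** 2)
    np.fill_diagonal(w, 0.0)  # each point is left out of its own fit
    denom = np.maximum(w.sum(axis=1), 1e-12)  # avoid division by zero
    pred = (w @ y) / denom
    return float(np.mean((y - pred) ** 2))

def select_bandwidth(x, y, candidates):
    """Return the candidate bandwidth with the lowest leave-one-out error."""
    scores = [loo_cv_score(x, y, h) for h in candidates]
    return candidates[int(np.argmin(scores))], scores

# Synthetic example (assumption): shooting percentage decays with distance.
rng = np.random.default_rng(0)
distance = rng.uniform(0, 60, size=2000)        # feet from the net, made up
true_pct = 0.25 * np.exp(-distance / 20)        # made-up underlying curve
goal = (rng.uniform(size=distance.size) < true_pct).astype(float)

candidates = [1.0, 2.0, 5.0, 10.0, 20.0]        # bandwidths in feet, made up
best_h, scores = select_bandwidth(distance, goal, candidates)

grid = np.array([5.0, 50.0])
estimate = nw_predict(distance, goal, grid, best_h)
```

The bandwidth plays the role of the fixed radius: a small value trusts only nearby shots, a large one smooths across the whole rink, and cross-validation lets the data decide between them.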
1) I agree it's not perfect, and some points do end up on the other side of the goal/blue line. Obviously the shots at each arena do not all vary by the same distance, but I considered all the other options and decided that this would be closest to their actual locations. I've also attempted to ease this error by regressing the points. Since we can only really measure trends in recorder bias, I think the current method is good enough. A better solution could be to use visual anchors like the faceoff circles/dots, goal line, and blue line, instead of pos/neg x/y points, to correct the recording bias, based on the assumption that the recorders plot shot locations using those visual anchors. Even then, I think I would have to regress the points to a certain extent, and the difference between that method and my current method would be marginal.
2) The 5-foot radius is partly arbitrary. I decided upon 5 ft for two reasons. One, it is the approximate distance from a player's stick blade to his skate, so a recorder could technically have a 5-foot margin of error on either side depending on the player's handedness. Two, it was the largest distance bias an arena had (NYI with -4.3 and 3.3 ft on the positive end). The 75% exponential weighting was definitely arbitrary though. I'm not familiar with non-parametric regression; my understanding is that it selects weights based on the amount of data points available? Since I am not familiar with it, I can't say for sure, but since, like I mentioned before, we can really only measure trends in recorder bias, an improved regression method will most likely have only a minute effect on the data while adding a whole lot of complexity to the stat.
3) I agree with posting the interval; I think including just the lower bound confused some people. I will probably update the tables to include the probabilities and their intervals when I have the time. As for the Bayesian interval, I think it's parallel to confidence intervals, and using one over another would be essentially a lateral move. As for setting the prior to 0, I have no idea what you mean by that as, in my understanding, credible intervals rely completely on the prior to make an accurate prediction, so setting them all to zero would make it useless? Maybe I'm misunderstanding something from your post.
4) What would a Corsi vs LAEGAP plot prove? How is that a measurement of the importance of shot location? It is 100% certain that shooting at a historically high percentage location will have a higher chance of going in versus shooting at a historically low percentage location. Again, maybe I'm misunderstanding something.
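On the Bayesian idea from point 3, one possible reading is a Beta prior on shooting percentage that pulls small samples toward a low value. The sketch below is only that reading, not something either poster specified. A Beta prior cannot have a mean of exactly 0, so it uses a small prior mean instead; `prior_mean` and `prior_shots` are hypothetical values chosen for illustration.

```python
def shrunk_shooting_pct(goals, shots, prior_mean=0.02, prior_shots=50.0):
    """Posterior mean shooting percentage under a Beta prior.

    prior_mean and prior_shots are assumptions: the prior behaves like
    prior_shots imaginary shots that scored at a rate of prior_mean.
    """
    a = prior_mean * prior_shots            # prior pseudo-goals
    b = (1.0 - prior_mean) * prior_shots    # prior pseudo-misses
    return (a + goals) / (a + b + shots)

# Same raw percentage (50%), very different sample sizes.
small = shrunk_shooting_pct(goals=2, shots=4)
large = shrunk_shooting_pct(goals=200, shots=400)
```

With only 4 shots the estimate stays close to the prior, while with 400 shots the data dominates, which is the same small-sample protection the lower bound of the confidence interval is meant to give.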
Very interesting work. Thank you so much!
BTW, have you aggregated data from before 2011-12? I like how the results could be replicated between the last two seasons, but it would be even better to see how the metric stacks up over a larger selection of seasons. I've been flirting with some new statistics, and I've noticed the correlation from season to season can vary quite a bit.
EDIT:
To jump into the above conversation, I think the debate ends up being more philosophical: whether shots taken, adjusted for location, make for a better metric than simple shot-taking. For instance, one team can spend a lot of time in the offensive zone with the ultimate goal of producing lots of point shots, while another attempts to produce relatively few shots from the slot. When we consider that sort of scenario, you can say a location-adjusted metric would be better, since it accounts for the fact that the latter team is trying to generate higher percentage plays than the former team.
I haven't compiled the data for seasons prior to 2011 yet. I am working on another article that needs the past seasons' data, so when I finish that I will update and post more correlation plots for the older seasons.
It's actually not philosophical at all whether adjusting for location makes for a better metric. It is certain (assuming the location data is not so inaccurate that it resembles pure randomness, which it isn't). I think what you mean is that one team might try to go for more point shots than high percentage shots, and end up scoring more goals because they were able to take so many shots. This is actually the core idea of LAEGAP/EG%. For example, if team A has 20 shots at the blue line, where the average shooting percentage is .10, and team B has 4 shots at a .25 location, team A will have a better EGF (2 vs 1) than team B. Team A is expected to score 2 goals (20 × .10), and team B is expected to score 1 goal (4 × .25).
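The arithmetic in that example can be written out as a short Python sketch. It assumes, as the example implies, that EGF is the sum over locations of shot count times the historical shooting percentage at that location; the function name and data layout are made up for illustration.

```python
def expected_goals(shots_by_location):
    """Sum of (shot count x historical shooting percentage) over locations."""
    return sum(count * pct for count, pct in shots_by_location)

# (shot count, shooting percentage at that location), from the example above.
team_a = [(20, 0.10)]   # 20 shots from the blue line
team_b = [(4, 0.25)]    # 4 shots from a higher percentage location

egf_a = expected_goals(team_a)
egf_b = expected_goals(team_b)
print(f"Team A EGF: {egf_a:.2f}, Team B EGF: {egf_b:.2f}")
```

A team with several shot locations would simply have more pairs in its list.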
So Corsi is a possession metric, rather than a shot metric. It just happens to use shots as the basis to determine possession. Would you say this is more of a scoring metric?
Every statistic that anyone has come up with tries to predict one thing and one thing only: wins. Possession and Corsi are rated so highly because they have a strong correlation with winning, which means scoring more than your opponent. If you think about it logically, there are only 3 ways LAEGAP/EG% won't translate into wins:
1. the data is inaccurate.
2. players on the team have less than average shooting skills.
3. your goalie sucks
Corsi has all three exceptions, plus two:
1. your team shoots in low scoring areas (doesn't create good chances)
2. your team allows shots from high scoring areas (allows good chances against)
Who are the guys that can sustain a consistently higher than average shot quality?
That's something I am going to look at in the future. There's so much content and info to extract from this data and so many articles to write. I am pretty excited.
The link says you're 17? I was not expecting such quality work to be done by somebody your age. Very impressive.
Ha thanks, I actually just turned 17 this month.