I've always wanted to see basically this exact stat done but I'm not savvy enough to do it myself. Especially the heat map, that's awesome.
One thing that struck me as odd was that even the top players had a LAEGAP of under 50%. As the sample sizes increase, those should go up, right? I think I'd be more interested in seeing the data without having it adjusted down due to degrees of confidence - maybe with a filter of a certain minimum number of events.
Great work though. Keep it up.
The reason the numbers are low is because he was using the lowest possible number in the interval.
For example, Dan Boyle's number may have been 53%, plus or minus 3.7% meaning it could lie anywhere in the range of 49.3% to 56.7%, but Wesleyy is simply reporting the lowest possible value.
Based on the data in the article, the +/- 3.7% is dramatically below the margin of error implied here. From the numbers, Boyle's LAEGAP actually appears to be 37.9/(37.9+23.4) = 37.9/61.3 = 61.8%, so for the lower bound shown to be 49.3%, the margin of error would have to be 12.5%.
Now, I don't have the full data so I may be incorrect, but this is just to explain why all the numbers seem low.
As he said:
To allow for easy sorting, we will take the lowest value possible in the interval. By taking the lowest possible value we will undervalue a player, particularly low event ones, much more often than we will overvalue them, which I think is the better of the two.
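To make that concrete, here's a rough sketch of the arithmetic in Python. The article doesn't say which confidence-interval method was used, so the normal approximation (and the z = 1.96 for ~95%) is just my assumption; the point is only to show how a point estimate gets pulled down to a sortable lower bound.

```python
import math

def laegap_lower_bound(eg_for, eg_against, z=1.96):
    """Point estimate of LAEGAP (expected goals for / total expected
    goals on ice) and the lower bound of a normal-approximation
    confidence interval. The article's exact interval method isn't
    specified, so this is illustrative only."""
    n = eg_for + eg_against
    p = eg_for / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, p - margin

# Boyle's numbers as quoted above: 37.9 expected goals for, 23.4 against.
p, lower = laegap_lower_bound(37.9, 23.4)
print(round(p * 100, 1))      # point estimate, ~61.8%
print(round(lower * 100, 1))  # lower bound under this approximation
                              # (won't exactly match the article's 49.3%,
                              # since the real method may differ)
```

Reporting `lower` instead of `p` is what makes every published number look low, and it penalizes low-event players hardest, exactly as the quoted passage says.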
Regarding the author and your publication of this: I first want to say great job. As others have said, it's extraordinary to do something like this at 17. I'm not sure what you want to do as a career, or where you want to go to university, but if you're interested in something with a math background, make sure to tell a recruiter about this.
Regarding the actual metric: I find it very interesting and it's certainly something worth looking into further, especially when combined with other metrics. Before something like this I could compare (over larger sample sizes) CORSI to GF% to try and get an idea if a player was helped or hindered by variation in shooting percentage. Something like on ice shooting percentage can help with this as well.
LAEGAP goes a step further and can help me answer a question like "Is player X really suffering from (causing?) poor shooting while on the ice, or does his presence merely increase the likelihood of bad shots being taken?" Scott Gomez is notorious for producing good corsi numbers but terrible on-ice shooting percentage numbers. This metric would help us see why there's such a discrepancy. Have he and his teammates simply shot poorly while he's on the ice, or do they just take more low-percentage shots?
As others have noted, it would be interesting to expand this to fenwick or corsi events (looking at corsi shooting percentage or fenwick shooting percentage), although current data collection does not allow this. The beauty of developing a framework like this is that it would be fairly simple to plug in the data if/when it becomes available.
I'm glad to see that you reviewed repeatability. That's one of the two key things I look at for a stat with regard to its predictive power: 1) Can players repeat it? and 2) Does it correlate with winning? Question 1 basically asks if it has any predictive power. If you can't repeat a performance of a given statistic, then it doesn't do much to tell me about your future results. Question 2 basically asks if it's useful (or harmful) to be good at a stat. Just because I can repeat a statistical performance doesn't mean it helps my team.
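For anyone who wants to run the repeatability check themselves, the usual quick version is just a year-over-year Pearson correlation on the same players. A minimal sketch (the season values here are made up purely for illustration):

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient. Applied to the same
    players' values in consecutive seasons, a high r suggests the
    stat is repeatable (question 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: each player's LAEGAP in back-to-back seasons.
year1 = [0.52, 0.48, 0.55, 0.45, 0.50]
year2 = [0.51, 0.47, 0.53, 0.46, 0.49]
print(round(pearson_r(year1, year2), 2))  # -> 0.99 for this toy data
```

In practice you'd also want to filter by a minimum event count first, since the lower-bound adjustment makes small samples noisy by construction.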
Now, by the very nature of being something based on expected goals, one would expect that LAEGAP would have a positive correlation with winning, but I'm curious if you looked at this at all. Did teams with strong LAEGAP, or large numbers of good LAEGAP players, tend to fare better than those with weak LAEGAP or a low number of good players?
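One simple way to test that would be to roll player LAEGAPs up to a team number, weighted by events so low-event players don't skew it, and then correlate those team values with standings points across the league. A sketch of the aggregation step (the roster data is hypothetical):

```python
def team_laegap(players):
    """Event-weighted team average of player LAEGAPs.
    `players` is a list of (laegap, events) pairs; weighting by
    expected-goal events keeps small-sample players from dominating."""
    total = sum(events for _, events in players)
    return sum(laegap * events for laegap, events in players) / total

# Hypothetical roster: (player LAEGAP, expected-goal events)
roster = [(0.55, 80.0), (0.48, 60.0), (0.52, 40.0)]
print(round(team_laegap(roster), 3))  # -> 0.52
```

You'd then run a plain correlation of these team values against point totals to answer question 2.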
And a quick technical note: On your chart in the LAEGAP vs corsi section you switched the games played and LAEGAP columns.