Shot quality isn't the only place where the publicly available data is inadequate. I pulled up this clip that I recorded two weeks ago while re-watching a game. Everyone agrees that it shows two high-danger chances, or at least scoring chances, but the public data has it listed as nothing... not even a Corsi or shot attempt. I tracked a couple of games closely to evaluate how accurate the public data is, and my conclusion was that it is terrible.
If I recall correctly, the NHL was having huge issues with its data early in the season. I don't know whether these guys like Evolving Wild or whatever just skim NHL data (or data from another source), or whether they collect their own.
I'll just say that, generally, hockey is going to be way harder to model than baseball, and much harder than basketball. In basketball you have a ton of scoring events and not a ton of "chaos" on chances. The ball generally goes in or it doesn't; blocks are rare, and there are a lot of instances where a perfect defense is irrelevant to scoring. Hockey is just *very* chaotic by nature. The link I posted above discussed how inefficient "home plate" chances are (a shooting percentage of 7.1% on what most companies would consider "high danger"), and it noted that one of the most common goals (by rate) in the NHL this year was an accidental deflection of an unscreened shot from the point.
The next two top scoring sequences in the NHL are actually two different types of broken plays — what we’d categorize as a mid-percentage broken play and then a high-percentage broken play. You can imagine how difficult these things are to put into context when you’re training people to watch the game the way you do.
The simple way of saying it is that a mid-percentage broken play would be a shot that comes, delivered to the net from the point, in the air. The player in front is waving at it, trying to deflect it and it inadvertently hits someone’s elbow or shin pad and ends up in the net. That’s a mid-percentage broken play because the intent was a mid-percentage shot, an in-the-air deflection, no screen.
A high-percentage broken play is a slot-line pass that’s intended for the receiving player. It doesn’t go through and it goes off their skate or their stick. Those went in 434 times last year. It’s neat because when you look at how the puck ends up in the net at the end of the season, the slot line is directly impacted in two of the top three sequences. If you can move the puck from one side of the ice to the other and force the other team to defend, you’re going to get more broken play goals.
This guy's theory is that the most dangerous chances come when the puck moves from one side of the ice to the other. This seems obvious to us, but the question is *how is the available data capturing what we all know to be true*. If you're categorizing a Stamkos/Ovi/Pastrnak one-timer as a low- or mid-danger shot, something is wrong with your numbers.
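To make that concrete, here's a rough Python sketch of the problem. Everything in it is my own assumption for illustration: the coordinate system, the 20 ft and 40 ft cutoffs, the `Shot` fields, and the "bump it up one tier" rule. It is not how any public model or tracking company actually grades shots. It just shows how a location-only rule can grade a cross-ice one-timer as "mid" while a rule that knows the pass crossed the slot line grades it "high".

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional


@dataclass
class Shot:
    # Hypothetical coordinates, in feet, relative to the attacking net.
    x: float  # distance out from the goal line
    y: float  # lateral offset from the middle of the ice (left < 0 < right)
    pass_from_y: Optional[float] = None  # lateral spot the pass came from, if any


def location_only_danger(shot: Shot) -> str:
    """Grade a shot purely by distance from the net (made-up cutoffs)."""
    dist = hypot(shot.x, shot.y)
    if dist <= 20:
        return "high"
    if dist <= 40:
        return "mid"
    return "low"


def crossed_slot_line(shot: Shot) -> bool:
    """True if the pass started on the opposite side of centre ice from the shot."""
    return shot.pass_from_y is not None and shot.pass_from_y * shot.y < 0


def danger_with_pass(shot: Shot) -> str:
    """Same grade as above, bumped up one tier when the pass crossed the slot line."""
    base = location_only_danger(shot)
    if crossed_slot_line(shot):
        return {"low": "mid", "mid": "high", "high": "high"}[base]
    return base


# A one-timer from the left circle off a pass from the right side.
one_timer = Shot(x=30, y=-22, pass_from_y=20)
print(location_only_danger(one_timer))  # graded on location alone
print(danger_with_pass(one_timer))      # graded with the cross-ice pass included
```

The point isn't the specific numbers. If the public feed only records where the shot came from, the `pass_from_y` information simply isn't there to use.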
Edit: To the extent this comes down to "eye test" versus "analytics", I don't know where a smart person can reasonably fall. It's harder to collect data in hockey than in the other sports, so someone is going to have to make some informed judgment calls on the input data for the analytics. Someone's going to need to look and see whether that shot was high danger or not. And the tricky thing is making sure that data is *good*.
Your statistical analysis can't be good if the input data isn't good. And right now I don't think the input data is particularly good. We can post heatmaps of where shots come from and make some judgment calls based on that, and it's probably a fairly decent proxy at a very macro level (as in, you can come up with pretty good general conclusions based on that data), but I think when you start trying to apply that big data to small situations (i.e. apply it directly to individual players), gaping holes form.