TomasHertlsRooster
Don’t say eye test when you mean points
Since the start of the 2007-08 season, there have been 180 NHL playoff series: 15 per season over 12 seasons.
The team with the higher place in the regular season standings has won 99 of those series. (55%)
The team with the higher regular season team GAR has won 108 of those series. (60%)
In other words, if you had picked the winner of every playoff series based on nothing but regular season standings rankings, you would've called 9 more series correctly than if you had flipped a coin for every series and landed on exactly half (90 of 180). If you had picked based on nothing but regular season team GAR, you would've called 9 more series correctly than the standings method, and 18 more than the 50% coin flip.
To put it more simply: the gap between using team GAR and using standings rankings is the same size as the gap between using standings rankings and flipping a coin. If we wrote Team GAR >> Standings Rankings >> Coin Flips, both ">>" gaps would be exactly the same size.
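The arithmetic behind those gaps is simple enough to check:

```python
# Tallying the 180 playoff series since 2007-08 from the post above.
series = 180
coin_flip = series // 2      # 90 correct if a coin landed on exactly half
standings = 99               # the higher-seeded team won 99 series
team_gar = 108               # the higher team-GAR side won 108 series

print(standings - coin_flip)         # 9: standings over coin flips
print(team_gar - standings)          # 9: team GAR over standings
print(round(team_gar / series, 2))   # 0.6: GAR's hit rate
```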
GAR is built to correlate closely with a team's performance, and it does: just under three quarters of the variance in a team's standings points since 2007-08 can be explained by its team GAR. Here is what this looks like:
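For what it's worth, that "share of standings points explained" figure is an R² value. A toy sketch of the computation (the numbers below are invented for illustration, not real league data):

```python
import numpy as np

# Hypothetical team GAR totals and standings points for eight made-up teams.
gar    = np.array([55.0, 40.0, 30.0, 20.0, 10.0, 0.0, -10.0, -20.0])
points = np.array([112.0, 104.0, 98.0, 93.0, 88.0, 80.0, 78.0, 70.0])

# R^2 of a linear relationship: square of the Pearson correlation.
r = np.corrcoef(gar, points)[0, 1]
r_squared = r ** 2   # share of standings-point variance explained by GAR
print(round(r_squared, 3))
```

With the real league data, this value would come out just under 0.75 per the claim above; the toy numbers here are only meant to show the calculation.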
So, why does GAR do a better job of predicting playoff series winners than standings points? Because, although the two track each other closely, GAR adjusts for context. I won't go into too much detail on how GAR does this, but to put it quite simply: it uses a regression to estimate how contextual factors such as opposition, rest, score state, and venue play into a team's results, and then subtracts the impact of those factors from the raw results.
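The adjustment described above can be sketched as an ordinary least-squares fit: regress raw results on contextual factors, then subtract the context-predicted portion. The factor names and numbers here are hypothetical placeholders, not the actual GAR model's inputs:

```python
import numpy as np

# Each row is one game: [home_ice (1/0), rest_days, opponent_strength]
context = np.array([
    [1, 2, 0.6],
    [0, 1, 0.4],
    [1, 3, 0.5],
    [0, 2, 0.7],
    [1, 1, 0.3],
])
goal_diff = np.array([2.0, -1.0, 1.0, -2.0, 3.0])   # raw per-game results

# Least-squares fit: how much of each result does context alone predict?
X = np.column_stack([np.ones(len(context)), context])   # add an intercept
coefs, *_ = np.linalg.lstsq(X, goal_diff, rcond=None)

context_effect = X @ coefs                # results predicted by context alone
adjusted = goal_diff - context_effect     # context-adjusted residual
print(np.round(adjusted, 2))
```

The residual that's left over after subtracting the context-predicted portion is the part attributed to the team itself; the real model works on far more data and factors, but the subtraction step is the same idea.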
As a thought experiment, let's say we took two identical teams full of replacement-level players, and replaced the Vegas Golden Knights with one of them, and the New Jersey Devils with the other. The team that replaced Vegas would do much better in the standings. Why? Because in that scenario, they are replacing the strongest team in the weakest division, while the team that replaced New Jersey is replacing the weakest team in the strongest division. The team that replaced Vegas might finish with 5, maybe 10 more points. But would that make them a better team? No! The teams are f***ing identical! This is why context is very important, and this is why a metric like GAR is better at predicting the winner of a playoff series than a metric like standings points. Both teams would have a team GAR of 0, but very different standings points.
I probably didn't need to explain what context means or why it's important; just about all hockey fans have an opinion on context and believe it should be adjusted for. By page 5 of every thread comparing two players with similar accomplishments and accolades, the discussion shifts from individual performance to each player's linemates and context. Find an Auston Matthews vs. Leon Draisaitl thread from last summer, and by page 5 the discussion will have shifted to fans downplaying their own players in a comparison between Rieder/Lucic/McDavid and Hyman/Nylander/Kapanen. Taylor Hall won the Hart Trophy in 2017-2018 not because he had the most points, but because he had the largest gap in points between himself and his next-closest teammate, and his weak team made the playoffs. Back in the 1940s, fans probably argued that Toe Blake wouldn't have scored as many points if he weren't lining up next to Elmer Lach and Maurice Richard on Montreal's top line.
This is all by way of saying that the concept of adjusting for context is not new. We can pretty much all agree that some sort of context should be considered when assessing players and teams. But the concept of using regressions to objectively adjust for context is very new, and it draws a ton of ire from more traditionalist hockey fans who prefer more arbitrary, subjective methods of doing so. In fact, I never see people shout "but context!" more often than when metrics like GAR and RAPM are brought up - even though these metrics actually do adjust for context, unlike metrics such as points, which are never held to that standard. I don't know if this is because people don't understand that GAR/RAPM adjust for context, or because they just want an easy excuse to discredit whatever assessment came alongside them, but it's rather misguided to slam these metrics for not including context. In reality, these metrics require the least additional context, because they already adjust for so much of it.
It's worth noting that everything I just looked at was at the team level. At the individual player level, things get a bit more dicey, and I think even the creators of GAR, along with its biggest proponents, would tell you that it does not perfectly distribute credit among individual players. For example, the top-6 players in PPO GAR (power play offense) are McDavid, Chiasson, Draisaitl, Bergeron, Pastrnak, and DeBrusk. Is it likely that Chiasson and DeBrusk, whose PP scoring rates rank 65th and 164th, respectively, are two of the six most positively impactful PP players in the league? No, probably not. It's far more likely that the model isn't giving enough PP credit to guys like McDavid, Draisaitl, and Pastrnak - whose respective PP scoring rates rank 1st through 3rd. At the team level, Edmonton and Boston rank 1st and 2nd in PP GAR, so it's likely that they have some of the league's best PP players - the credit just isn't perfectly distributed among them by GAR.
However, that doesn't mean GAR should be discredited entirely for individual players. The issues with credit distribution are more prominent on the power play than at even strength for a few reasons, and I went out of my way to pull out the wonkiest GAR results that I don't agree with. By understanding how GAR is calculated, it's easier to make subjective assessments of where GAR results may differ from a player's actual impact. And since GAR gives us a good picture of team quality, teams full of players with strong GAR will be good teams, and vice versa; so the players whose GAR is not at all indicative of their level of performance are more likely to be a few outliers who have offsetting outlier teammates at the other end of the spectrum. It's not like this year's Detroit Red Wings are going to be full of players with good GAR, or last year's Tampa Bay Lightning full of players with bad GAR, but the contextual adjustments made by GAR will show that a few players on those teams (Larkin, Mantha, Koekkoek, Schenn) do fit that bill. If a player's GAR really doesn't match their impact, it should be pretty easy to look under the hood, check whether they've got a teammate whose GAR is an outlier in the opposite direction, and figure out why.
Most importantly, because the contextual adjustments behind GAR make it so much more effective than raw standings at the team level, it's reasonable to say those same adjustments also make it more effective than the unadjusted metrics we use at the individual player level, even if there should still be more uncertainty around GAR at the player level.
To be clear, there should still be some uncertainty when approaching GAR at the team level as well; these metrics aren't perfect, and I've got my own issues with GAR that I could go into in more detail below. But by adjusting for context, they are better than the metrics that don't, and that shows up in how well they predict playoff series victories. They clearly do a better job of assessing team quality than raw standings points. And since team-level GAR is made up of the GAR of all of a team's players, GAR can tell us which teams are full of good players and which aren't, even if it will occasionally fail to perfectly distribute credit among the players on a given team.
If you want to come into this thread and discredit the merit of stats, in favor of the eye test, go ahead. But in that case, you have to discredit all stats, including the traditional ones like standings points that are clearly inferior to GAR. I don't think anybody is actually prepared to do that.