Predicting forwards' salaries

Hockey Outsider · Jan 4, 2021

Objective & scope

I wanted to see if it's possible to develop a model that predicts the salaries of restricted free agents ("RFA") and unrestricted free agents ("UFA"). As always, my goal is to present data in a way that's objective, transparent, and (hopefully) interesting.

Parameters

My analysis is based on the 2019 off-season, and I'm limiting it to forwards who played at least one NHL game in the prior season (2018-19). There were 134 such forwards (64 RFA's and 90 UFA's). The median number of games that these forwards played in the previous season was 68.

I excluded goalies because there aren't many transactions, so we'd run into sample size issues. I excluded defensemen because defense isn't particularly well-measured statistically, and would therefore be more difficult to capture in a formula.

Note that I'm trying to predict "incremental" salary - that is, salary over the NHL minimum of $700K. For example, Par Lindholm's contract had a cap hit of $850K, so his incremental salary is $150K.

Why bother?

Someone might ask me why I'm bothering to do this, since hockey-graphs.com ("HG") has done their own analysis (link - Projecting NHL Skater Contracts for the 2019 Offseason). My response is three-fold:

1. HG published their results, but they didn't disclose the actual formula. That's like climbing a mountain, and not taking a picture at the top. For me, knowing how the method works is as important as knowing the result.

2. The biggest difference in our approaches is that I'm only trying to predict the average cap hit, while HG is trying to predict both the average cap hit and the contract length. The reason I'm not bothering to model both is that cap hit and contract length are highly correlated (R = 0.81). Generally, players who are good enough to get a high salary can pressure their team to give them the security of a lengthy contract. Players who are earning below the league average, as a general rule, get shorter contracts, because they're replaceable. Reviewing the results of HG's model, they do a far better job predicting contract value than contract length. I wondered if their results would be stronger if they dropped contract length (since it's not particularly well-modelled as it is, and having two dependent variables that are highly correlated with each other creates challenges in the modelling).

3. HG's model is far more complex than mine. This wasn't my main objective, but I wanted to see how much accuracy is lost in doing a much simpler approach. This method took me about three hours to put together (the write-up took another three hours, give or take). I understand their methodology, but it would probably take me ~40 hours to replicate their results. If I was an NHL owner, of course I'd want my staff to do the most comprehensive analysis possible. But, as a fan, I wonder how much accuracy is really gained by using a much more convoluted method.

Correlations

Before putting together the model, I looked at more than 80 performance statistics and calculated the correlation between each one and the player's incremental cap hit. These statistics included:

Conventional stats (games played, goals, assists, points, penalty minutes, and ice time)
Offensive stats by situation (looking at goals, assists and points - separately at ES, PP and SH)
Rate stats (looking at these numbers on a per-game basis, and on a per-60-minute basis)
"Defensive" stats (hits, blocked shots, and face-offs taken, etc.)
Advanced stats (5-on-5 Corsi and Fenwick - both raw counts and the percentage, various zone start stats, etc.)

There's virtually no limit to which stats can be included here. In theory, you can include some really arcane stats - like venue-adjusted relative Fenwick when the score is tied and the game's on a Tuesday night. If someone thinks I'm missing something crucial, they can send me the data and I can report back on what the correlation is.

After looking at more than 80 statistics (and calculating the correlation between each one and the incremental cap hit), one emerged as the biggest driver of contract value - points. Yes, that simple statistic, which has been essentially unchanged for more than a century, gives a higher correlation than any other single statistic. It yields a correlation of 0.90 (I assume anybody who's read this far has a basic understanding of what a correlation coefficient is - this is a very high result for a single statistic in a complex modelling environment).

Goals (in isolation) and assists (in isolation) also produce very high correlations, but not quite as strong as total points. Similarly, looking only at even-strength points, or only at powerplay points, you still get high correlations, but points is still the strongest statistic. Points per game is weaker than total points (and, similarly, goals per game is weaker than total goals, and assists per game is weaker than total assists). I looked at per-60-minute production a number of different ways (ES only, PP only, and all situations - looking at goals only, assists only, and points in all three situations), and none of them have correlations anywhere near the top of the list.

Looking beyond offensive statistics, the correlations are weak (under 0.4) for penalty minutes, plus/minus, hits, blocked shots, and face-offs taken. Some of the advanced stats (raw number of "Corsi for", as an example) can get you a fairly high correlation, but I suspect that's only because it's already fairly strongly correlated with games played and ice time. Once you look at advanced stats beyond raw counting numbers (i.e. Corsi percentage, Fenwick percentage, etc.), the correlations are weak.

Interestingly, points is not used in the final model (because once we start using more than one variable, some other patterns emerge). But if you need to pick just one statistic - keep it simple and go with points.
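To make the screening step concrete, here's a minimal sketch of how a list of stats can be ranked by the strength of their correlation with incremental cap hit. The five players and all of their numbers are invented purely for illustration (they are not from my data set), and the `pearson` helper is just one straightforward way to compute the coefficient, not necessarily how the figures above were produced.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented numbers for five hypothetical forwards (NOT the real data).
stats = {
    "points": [60, 45, 30, 12, 5],
    "hits": [40, 120, 60, 150, 90],
}
incremental_cap_hit = [6_000_000, 4_200_000, 2_500_000, 600_000, 50_000]

# Rank each stat by the strength (absolute value) of its correlation.
ranked = sorted(
    ((abs(pearson(values, incremental_cap_hit)), name) for name, values in stats.items()),
    reverse=True,
)
for strength, name in ranked:
    print(f"{name}: {strength:.2f}")
```

With real data, `stats` would simply hold one list per statistic (80+ of them in my case), all in the same player order as the cap hit list.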

A dirty word

Before we move onto the model, let's talk about multicollinearity. Like most eight-syllable words, it sounds complicated, but it isn't. Essentially, multicollinearity happens when some of the inputs in a model are highly correlated with other inputs. For example, goals is highly correlated with shots on goal, total ice time, total assists, even-strength goals, etc. It would be undesirable to include everything in the model, because it would have a tough time figuring out what the actual impact of each variable is.

If I push my stats program to its limits (it can't handle 80+ variables at once), I'd get a model that's simultaneously 1) highly predictive of contract value and 2) completely useless. The reason the model would be useless is because many of the input stats wouldn't be statistically significant (because a bunch of the variables are moving together in sync). Therefore we want the model to be as concise as possible - to rely on the absolute fewest number of inputs required.

To make it clear how much of a problem multicollinearity can be, here's an excerpt of a "correlation matrix", showing the correlation between some of the variables I'm looking at:

It's an exaggeration to say "everything correlates with everything" - but only a slight exaggeration.
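For anyone who wants to build a correlation matrix like this themselves, here's a rough sketch. It standardizes each column into z-scores and averages the products, which is equivalent to the usual Pearson formula. The column names and season totals are made up for five hypothetical forwards (they are not my actual data), but they show the typical pattern: counting stats that all grow with playing time end up correlated with each other.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw values to z-scores (mean 0, population std dev 1)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of name -> list of values."""
    z = {name: standardize(values) for name, values in columns.items()}
    return {
        a: {b: mean(x * y for x, y in zip(z[a], z[b])) for b in z}
        for a in z
    }


# Invented season totals for five hypothetical forwards (NOT the real data).
columns = {
    "goals": [30, 22, 15, 8, 2],
    "shots": [250, 190, 160, 90, 30],
    "toi_minutes": [1500, 1350, 1200, 800, 300],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(r, 2) for r in row.values()])
```

Any pair of inputs with an off-diagonal value near 1 is a candidate for being dropped or combined before fitting a model.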

The model

This graph plots the actual vs predicted salary for all 134 forwards:

Here are the inputs:

The starting point - the predicted incremental cap hit (for simplicity, the "salary") is assumed to be zero
Age - for every year after (before) age 22, salary decreases (increases) by approximately $21K
Even strength goals - for each ESG scored, salary increases by approximately $124K
Powerplay goals - for each PPG scored, salary increases by approximately $174K
Even strength assists - for each ESA scored, salary increases by approximately $114K
Powerplay assists - for each PPA scored, salary increases by approximately $68K
Ice time - for each minute played (any situation), salary decreases by approximately $1K
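Putting those inputs together, the formula can be sketched as a small function. The dollar values are the rounded ("approximately") figures from the list above, so the output is an approximation too. The function and variable names are my own labels for this sketch, and the example player is hypothetical.

```python
NHL_MINIMUM = 700_000  # league minimum used throughout this thread

# Approximate dollar value per unit, as listed above (rounded figures).
COEFFICIENTS = {
    "years_past_22": -21_000,
    "es_goals": 124_000,
    "pp_goals": 174_000,
    "es_assists": 114_000,
    "pp_assists": 68_000,
    "toi_minutes": -1_000,
}


def predicted_incremental_cap_hit(age, es_goals, pp_goals, es_assists, pp_assists, toi_minutes):
    """Predicted salary above the league minimum, in dollars."""
    inputs = {
        "years_past_22": age - 22,  # negative for players younger than 22
        "es_goals": es_goals,
        "pp_goals": pp_goals,
        "es_assists": es_assists,
        "pp_assists": pp_assists,
        "toi_minutes": toi_minutes,
    }
    return sum(COEFFICIENTS[name] * value for name, value in inputs.items())


# Hypothetical 25-year-old: 15 ESG, 5 PPG, 20 ESA, 10 PPA in 1,200 minutes.
incremental = predicted_incremental_cap_hit(25, 15, 5, 20, 10, 1200)
print(incremental)                # 4427000
print(NHL_MINIMUM + incremental)  # 5127000
```

Adding the $700K minimum back converts the incremental figure into a predicted cap hit.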

Do the inputs make sense ("common sense" test)?

Before we talk about how accurate the model is, we need to consider if the inputs actually make sense. I would argue that they do.

The age variable makes sense. Most studies show that forwards peak relatively young (early to mid twenties). Very young players might get a premium based on the expectation of future improvement, but that ends soon. The promise of potential apparently doesn't last very long.

The weighting of the various offensive stats is interesting. Most people accept the premise that, in general, a goal is worth more than an assist. This is consistent with the data above. Between ES and PP, a goal is worth $149K on average, compared to $92K for an assist. This is a ratio of 1.62:1, which is pretty close to the overall ratio of assists to goals in the modern NHL (roughly 1.70:1).

The balance between ES and PP scoring is also interesting. The average ES point is worth $119K, and the average PP point is worth $122K. That makes sense - a point is a point, regardless of the situation, and the salary seems to reflect that. One surprising observation is there's a big spread between the value of goals vs assists on the powerplay, but not at even-strength. I suspect we have a small bit of multicollinearity happening here. The correlation between goals and assists is much higher at ES (compared to on the PP), and that might be skewing the results somewhat.
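The averages in the last two paragraphs can be roughly reproduced from the rounded coefficients. This sketch assumes simple (unweighted) averages of the ES and PP values, which is my assumption for illustration. The rounded inputs give $91K per assist and $121K per PP point, within $1K of the figures quoted above, and that small gap is presumably just rounding in the listed coefficients.

```python
# Approximate per-unit values from the model, in thousands of dollars.
ESG, PPG, ESA, PPA = 124, 174, 114, 68

avg_goal = (ESG + PPG) / 2      # 149.0
avg_assist = (ESA + PPA) / 2    # 91.0
avg_es_point = (ESG + ESA) / 2  # 119.0
avg_pp_point = (PPG + PPA) / 2  # 121.0

print(avg_goal, avg_assist, round(avg_goal / avg_assist, 2))
print(avg_es_point, avg_pp_point)
```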

Lastly, each minute played, regardless of situation, reduces salary by a small amount. This might reflect efficiency. All things being equal, a coach would prefer a player to produce a given amount of offense in fewer minutes. The adjustment is small, but it exists. There doesn't appear to be a meaningful distinction between ES, PP and SH ice time (thus total TOI is used).

Overall the inputs seem reasonable, based on common sense, so let's consider how accurate the model is.

How accurate is the model?

The model is highly accurate. The correlation coefficient (between actual and predicted salary) is 0.94, meaning that it explains about 88% of the variance in salary. When I compared this to the HG model, they also have a correlation of 0.94 (looking only at the cap hit - excluding the contract term). If you go to four decimal places, my model is actually slightly more accurate (0.9376 compared to HG's 0.9355) but it's close enough that, for all practical purposes, their predictive power is equal.
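The 88% figure is just the correlation coefficient squared. A quick check using the two four-decimal correlations quoted above (the variable names are mine):

```python
# Correlation coefficients quoted above (actual vs predicted cap hit).
r_mine = 0.9376
r_hg = 0.9355

# Squaring r gives the share of variance explained (R-squared).
print(f"mine: {r_mine ** 2:.3f}")  # mine: 0.879
print(f"HG:   {r_hg ** 2:.3f}")    # HG:   0.875
```

Either way you slice it, both models account for roughly 88% of the variation in cap hit.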

For the record, based on reading HG's article, I have zero doubt that they're better at building models than me. The reason it's so close is because I'm only projecting cap hit, while they're trying to project cap hit and the contract term (at the same time). As I mentioned, the contract term is modelled much less accurately than the cap hit. I suspect that if they solely focused on the cap hit, they might be able to bump the correlation up a bit higher.

I also checked to ensure that each coefficient in the model is statistically significant. Five of the six variables (ESG, PPG, ESA, PPA and TOI) are statistically significant at the 5% level. The age variable isn't statistically significant - yet the model remains somewhat more accurate when we keep it. I don't have an explanation for this - but since this variable makes the model stronger, and there's a very obvious "common sense" reason to retain it, I'm keeping it in.

All that being said - a correlation of 0.94 is an excellent result given the complexity of the data. That's not to say that the formula is perfect. If someone can make suggestions that would further enhance the accuracy, I'd welcome it.

What about prior year data?

One important question is whether the model becomes more accurate if we incorporate prior year information. That is, if we also use data from 2018 (rather than only from 2019), will that improve the accuracy of the forecast?

Before jumping into the model, let's get back to looking at correlations. As I showed earlier, the correlation between points in 2019, and the amount of incremental cap hit, was 0.90. If I look at the correlation based on the sum of 2018 and 2019 points, the correlation remains unchanged at 0.90. Using linear programming, I calculated the weight between current year and prior year point totals to maximize the correlation. I was able to maximize the correlation with a weighting of approximately 71/29. This pushes the correlation up to 0.91. So there's some incremental value when taking into account prior year performance, but the improvement is minuscule.
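The weighting search can be illustrated with a simple sketch. To be clear about the swap: this uses a plain grid search over the current-year weight rather than the linear programming mentioned above, and the data is invented so that the right answer is known in advance (the cap hits are built from an exact 70/30 blend). Function and variable names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_blend_weight(current, prior, target, steps=100):
    """Grid-search the current-year weight w (0 to 1) that maximizes the
    correlation between w*current + (1-w)*prior and the target."""
    best_w, best_r = 0.0, float("-inf")
    for i in range(steps + 1):
        w = i / steps
        blended = [w * c + (1 - w) * p for c, p in zip(current, prior)]
        r = pearson(blended, target)
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r


# Invented data (NOT the real set): cap hits built from an exact 70/30 blend,
# so the search should recover a current-year weight of 0.70.
points_current = [60, 45, 30, 12, 5]
points_prior = [50, 50, 20, 20, 0]
cap_hit = [70_000 * c + 30_000 * p for c, p in zip(points_current, points_prior)]

weight, r = best_blend_weight(points_current, points_prior, cap_hit)
print(f"best current-year weight: {weight:.2f}, correlation: {r:.3f}")
```

On real data the peak is much flatter than in this toy example, which is consistent with the tiny gain (0.90 to 0.91) reported above.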

I repeated the analysis looking at each of the key variables (ESG, PPG, ESA, PPA) separately. In each case, the correlation improves slightly (by 0.02 to 0.05) when doing a weighted average between 2018 and 2019. So, initially, there appears to be some benefit to incorporating prior year stats, but it's small.

I then tried to build a model based on stats from the current and prior year. Despite the fact that the correlations were slightly higher when we looked at prior year data, the model was clearly weaker as a result. First, the overall correlation was weaker than in the previous version. Second, a number of the variables - both current year and prior year - were no longer statistically significant at the 5% level. I can't prove why this is the case, but I suspect that the model again falls victim to multicollinearity (severe, in this case). Points correlate highly from year to year (0.83). Each of the four major components (ESG, PPG, ESA, PPA) has a correlation of at least 0.72 from year to year. There might be a benefit to looking at prior year data, but it's very small, and it gets lost in the interaction between current year and prior year numbers.

Objections to the model

One possible objection is the model doesn't distinguish between RFA and UFA players. That's not accurate though. I included that as a variable, and the model consistently showed that there's no meaningful difference in salary between RFA and UFA players. Consider that the RFA players averaged $70,088 per point scored, while UFA players averaged $69,081 per point scored - a trivial difference. (Actually, one of the most interesting things I've learned from doing this is UFA's, as a whole, appear to be paid almost exactly what they should be. It's the RFA's, as a group, who are overpaid - Mitch Marner and Patrik Laine being good examples. It seems puzzling that GM's pay more per point to RFA's when, in theory, they have more leverage in that situation).

A second possible objection is that the model is largely based on offensive production, and there's more to hockey than offensive stats. It's true - but I'm not convinced that there are any statistics that accurately measure defensive play. But it's conceivable that some of the difference (between actual and predicted cap hit) can be attributable to defense, or other aspects of the game not well-captured by conventional statistics.

A third possible objection is the model predicts that some players should have been paid less than $700K. But I don't think that's an issue with the model itself. The NHL and NHLPA have agreed to a minimum salary of that amount. It's possible that a team is forced to pay a player $700K, even though he's worth less than that based on performance. The player with the lowest predicted salary is Brad Malone. He produced zero points in 16 games and was 29 (well past the age where most players start declining). This suggests that he was worth less than $700K, but the Oilers had no choice but to pay that amount. (Or maybe they should have signed someone else, if they were forced to spend the league minimum anyway).

A fourth possible objection is the model doesn't take the salary cap into account. That's true - but it's also irrelevant. In 2020, the maximum salary was $16.3M. No RFA or UFA was predicted to have, or actually received, a salary anywhere close to that amount. It would take an extreme example to justify a salary that even approached the max (a 19-year-old who scored 60 goals and 90 assists playing only 15 minutes a game would get you close).
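Here's a rough check of that extreme example using the rounded coefficients from the model. The example above doesn't say how the 60 goals and 90 assists split between ES and PP, or how many games were played, so this sketch assumes everything came at even strength over a full 82-game season. Under those assumptions the prediction lands in the neighbourhood of the maximum (a different ES/PP split would move it somewhat).

```python
MAX_SALARY_2020 = 16_300_000  # maximum salary quoted above
NHL_MINIMUM = 700_000

# Hypothetical inputs: ASSUMES all scoring at even strength and 82 games,
# since the example doesn't specify either.
age, es_goals, pp_goals, es_assists, pp_assists = 19, 60, 0, 90, 0
toi_minutes = 82 * 15

# Same rounded coefficients as in the list of model inputs.
incremental = (
    -21_000 * (age - 22)
    + 124_000 * es_goals
    + 174_000 * pp_goals
    + 114_000 * es_assists
    + 68_000 * pp_assists
    - 1_000 * toi_minutes
)
predicted_cap_hit = NHL_MINIMUM + incremental
print(predicted_cap_hit)  # 17233000
```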

A fifth possible objection is the model is artificially boosted by including a bunch of borderline NHL players (who produce minimal offense and earn the league minimum, or just barely above that). Even if you remove all the players earning under $2M (which is more than half of the data), the model still holds up reasonably well, with a correlation of 0.84. (To be clear, I'm not re-running the model after throwing out more than half the data - that's using the same coefficients that I posted above). Obviously, it's weaker than before, but it still does a very good job of modelling the data, even if the lowest-paid players are excluded entirely.
