INTRODUCTION

As promised, I've done another analysis of secondary assists. All of the previous studies done in this area, including mine, have looked at the data using the same conceptual approach - examining year-over-year correlations. In other words - does knowing how many secondary assists a player records in YEAR X allow us to better estimate how many secondary assists, total assists, or points a player will record in YEAR X+1? (The answer is pretty clear - yes, it helps us predict points in the future, but not as well as goals or primary assists.)

For this study, I've analyzed the data in a completely different way. To the best of my knowledge, nobody has done this type of analysis before. The question that I was trying to answer is - does knowing how many secondary assists a player records help us predict how well his team scores when he's on the ice?

An example might help illustrate this concept. Let's say two players have the following stat lines:

Player 1: 30 goals, 25 primary assists, 25 secondary assists
Player 2: 30 goals, 25 primary assists, 10 secondary assists

What we really care about is - does Player 1's team do better when he's on the ice, compared to Player 2? If, as some have argued, secondary assists are just statistical noise, then we'd expect that knowing how many secondary assists a player records wouldn't improve our predictive ability. If secondary assists are more than just statistical noise, then knowing the number of secondary assists a player earns would improve the accuracy of our predictive models. (Obviously, looking at only two players is meaningless - there's way too much context not taken into account - but I'm going to look at thousands of data points).

My approach is as follows. First, I'll define the sample and validate the data. Second, I'll estimate how many 5-on-5 goals a player should be on the ice for, once we know his 5-on-5 goals and primary assists. Third, I'll do the same analysis, except this time I'll also use secondary assists, and see if we get a more useful prediction model.
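For anyone who wants to try the same comparison at home, here's a minimal sketch of the idea: fit one least-squares model with goals and primary assists, fit a second that also includes secondary assists, and compare the R^2 values. The data below is synthetic and invented purely for illustration (it is not the Natural Stat Trick sample), and the variable names and the use of NumPy are my own assumptions, not necessarily how the models in this post were built.

```python
import numpy as np

def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return in-sample R^2."""
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

# Synthetic per-60 rates, made up for illustration only (NOT the real sample).
rng = np.random.default_rng(0)
n = 4772
g60 = rng.gamma(4.0, 0.15, n)    # individual goals per 60
a1_60 = rng.gamma(4.0, 0.15, n)  # primary assists per 60
a2_60 = rng.gamma(4.0, 0.10, n)  # secondary assists per 60
# In this toy world secondary assists DO carry signal, by construction.
gf60 = 0.6 + 1.1 * g60 + 1.1 * a1_60 + 1.1 * a2_60 + rng.normal(0, 0.2, n)

r2_model1 = r_squared(np.column_stack([g60, a1_60]), gf60)
r2_model2 = r_squared(np.column_stack([g60, a1_60, a2_60]), gf60)
print(f"Model 1 R^2 (goals + primary assists):   {r2_model1:.3f}")
print(f"Model 2 R^2 (plus secondary assists):    {r2_model2:.3f}")
```

One thing to keep in mind: in-sample R^2 can never go down when you add a predictor, so what matters is the size of the improvement, not the fact that there is one.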

The conclusion - which will soon be obvious from the data - is that secondary assists have informational value.

DATA VALIDATION

Defining the sample

For this project, I'm looking at twelve seasons' worth of data - 2007-08 through 2018-19. I'm looking at forwards only, who have played at least 300 minutes of 5-on-5 ice time.
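The sample rule above can be sketched as a simple filter. The field names (`season`, `position`, `toi_5v5`), the position codes, and the example rows are all hypothetical, chosen only to show the logic; the actual export columns may look different.

```python
from collections import Counter

MIN_TOI_5V5 = 300.0                 # minutes of 5-on-5 ice time required
FORWARD_POSITIONS = {"C", "L", "R"} # assumed position codes for forwards
FIRST_SEASON, LAST_SEASON = 2008, 2019  # seasons labelled by ending year

def in_sample(row):
    """True if a player-season meets the criteria described in the post."""
    return (
        FIRST_SEASON <= row["season"] <= LAST_SEASON
        and row["position"] in FORWARD_POSITIONS
        and row["toi_5v5"] >= MIN_TOI_5V5
    )

# Made-up player-seasons, for illustration only.
rows = [
    {"season": 2019, "player": "Forward A", "position": "C", "toi_5v5": 1100.5},
    {"season": 2019, "player": "Forward B", "position": "L", "toi_5v5": 299.9},
    {"season": 2019, "player": "Defenseman C", "position": "D", "toi_5v5": 1500.0},
    {"season": 2013, "player": "Forward D", "position": "R", "toi_5v5": 300.0},
]

sample = [r for r in rows if in_sample(r)]
counts_by_season = Counter(r["season"] for r in sample)
print(counts_by_season)
```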

Source of data

All data has been taken from Natural Stat Trick.

Number of players in sample each year

Season   Count
2008     396
2009     398
2010     397
2011     393
2012     396
2013     339
2014     393
2015     410
2016     405
2017     407
2018     415
2019     423
Total    4772

In total, I have 4,772 data points. The numbers fall within a narrow range from 2008 to 2017 (except for 2013) - between 393 and 410 players met my criteria each year. There were significantly fewer players in 2013 because that was the lockout-shortened season. There was a slight uptick in 2018 and 2019 because Vegas joined the league, which opened up several more roster spots.

Validating the data

It was important to validate the data. First, I'm not positive the data from Natural Stat Trick is completely accurate. Second, I wanted to make sure I didn't make any errors in compiling and organizing the data (I needed to combine data from multiple databases from that site).

As one example - let's look at John Tavares. NHL.com shows him as having 187 goals, 146 primary assists, and 68 secondary assists at 5-on-5. My data has him at 189, 146, and 68, respectively. So both assist figures agree, and goals are off by 2. Obviously I can't spot-check 4,000+ lines of data, but I checked some players. Most of the data is identical, but I found some small discrepancies here and there (like I did for Tavares). None were significant, so I'm confident we have a good starting point for the analysis.

NHL.com, as far as I can tell, doesn't have on-ice 5-on-5 goals-for data. So I cross-referenced this against hockey-reference.com. They have Tavares at 557 5-on-5 GF through the 2019 season. My data has him at 556. Another immaterial difference, so again I think we have a good starting point for the analysis.
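The Tavares spot-check above boils down to a field-by-field comparison. Here is a small sketch using only the figures quoted in this section; the dictionary keys and the helper name are hypothetical.

```python
# 5-on-5 figures for John Tavares as quoted above.
my_data = {"goals": 189, "primary_assists": 146, "secondary_assists": 68}
nhl_com = {"goals": 187, "primary_assists": 146, "secondary_assists": 68}

def discrepancies(ours, reference):
    """Return {stat: ours - reference} for every stat where the sources differ."""
    return {stat: ours[stat] - reference[stat]
            for stat in ours if ours[stat] != reference[stat]}

diffs = discrepancies(my_data, nhl_com)
print(diffs)

# On-ice 5-on-5 goals for: my data (556) vs hockey-reference.com (557).
gf_gap = 556 - 557
print(gf_gap)
```

The same helper could be looped over any handful of players picked for checking; it only reports the stats that disagree.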

ANALYSIS 1 - GOALS AND PRIMARY ASSISTS ONLY

[Image: Model 1 regression results - media item 7045]

What does the model mean?

The model above predicts that a player's team will score at a baseline rate of 0.81 goals per 60 minutes at 5-on-5, plus an extra 1.20 G/60 for every goal the player scores, plus an extra 1.25 G/60 for every primary assist the player records.

Is the model accurate?

The R^2 is 0.72, which means the model explains about 72% of the variation in on-ice scoring rates. That makes it a fairly accurate model.

Does the model make sense?

There's nothing about it that strikes me as being obviously wrong - open to comments.

ANALYSIS 2 - GOALS, PRIMARY ASSISTS, AND SECONDARY ASSISTS

[Image: Model 2 regression results - media item 7047]

What does the model mean?

The model above predicts that a player's team will score at a baseline rate of 0.58 goals per 60 minutes at 5-on-5, plus an extra 1.10 G/60 for every goal the player scores, plus an extra 1.06 G/60 for every primary assist the player records, plus an extra 1.08 G/60 for every secondary assist the player records.
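To make the two sets of coefficients concrete, here's a rough sketch that plugs them into a pair of functions. The coefficients are the ones quoted in this post. Treating the inputs as per-60-minute rates is my assumption (it's the reading that puts the predictions on a G/60 scale), and the two players below are hypothetical, in the spirit of the Player 1 / Player 2 example from the introduction.

```python
def model_1(g60, a1_60):
    """Predicted on-ice 5-on-5 GF/60 from goals and primary assists only."""
    return 0.81 + 1.20 * g60 + 1.25 * a1_60

def model_2(g60, a1_60, a2_60):
    """Predicted on-ice 5-on-5 GF/60, also using secondary assists."""
    return 0.58 + 1.10 * g60 + 1.06 * a1_60 + 1.08 * a2_60

# Hypothetical per-60 rates: identical goals and primary assists,
# different secondary assists. These numbers are invented.
player_1 = {"g60": 0.9, "a1_60": 0.8, "a2_60": 0.7}
player_2 = {"g60": 0.9, "a1_60": 0.8, "a2_60": 0.3}

m1_p1 = model_1(player_1["g60"], player_1["a1_60"])
m1_p2 = model_1(player_2["g60"], player_2["a1_60"])
m2_p1 = model_2(**player_1)
m2_p2 = model_2(**player_2)

print(f"Model 1: {m1_p1:.2f} vs {m1_p2:.2f}")  # cannot tell the two apart
print(f"Model 2: {m2_p1:.2f} vs {m2_p2:.2f}")
print(f"Model 2 gap: {m2_p1 - m2_p2:.3f} GF/60")
```

Model 1 has no way to separate these two players, while Model 2 separates them by 1.08 times the difference in their secondary assist rates.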

Is the model accurate?

The R^2 ("coefficient of determination", if you want to get technical) is 0.82, which means this is an accurate model. Note that this is a much higher result than the 0.72 from the previous model. You can also see this pretty clearly by looking at the graph - the data is much closer to the trendline in this one.

Does the model make sense?

Compared to the previous model, this model significantly lowers the baseline level of offense (what a team would score with effectively zero contributions from a player), from 0.81 to 0.58 G/60. It slightly reduces the value of goals and primary assists, while recognizing a new variable - the secondary assist.

From a technical standpoint - the model possibly suffers from a defect called "multicollinearity". When there are two predictor variables that are highly correlated (as primary and secondary assists are), the model can have trouble separating the relative predictive value of each variable. So it's possible that the coefficient for one of the variables is too high, and the other is too low. Before someone tells me that this invalidates the entire model - it doesn't. The overall accuracy of a model isn't affected (just the relative value of the coefficients within the model may be off).
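One common way to gauge how much multicollinearity is present is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R^2). I haven't run this on the real data; the sketch below uses synthetic predictors, invented so that the two assist columns share a common "playmaking" component, and all names in it are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return in-sample R^2."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1.0 - float(residuals @ residuals) / float(((y - y.mean()) ** 2).sum())

def vif(X):
    """Variance inflation factor for each column of X."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)          # every predictor except j
        factors.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return factors

# Synthetic predictors, for illustration only (NOT the real sample).
rng = np.random.default_rng(1)
n = 2000
playmaking = rng.normal(0, 1, n)            # shared latent factor
a1 = playmaking + rng.normal(0, 0.5, n)     # stand-in for primary assists
a2 = playmaking + rng.normal(0, 0.5, n)     # stand-in for secondary assists
g = rng.normal(0, 1, n)                     # stand-in for goals, unrelated here

vifs = vif(np.column_stack([g, a1, a2]))
print([round(v, 2) for v in vifs])
```

A VIF near 1 means a predictor is close to independent of the others; the higher it climbs, the less stable that predictor's individual coefficient is, even though the model's overall fit is unaffected.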

WHAT DOES THIS ALL MEAN?

The data is quite clear. Although we can put together a reasonably accurate model using only goals and primary assists, the accuracy is clearly enhanced when we include secondary assists. This means that there's informational value in secondary assists. If there weren't, there wouldn't be any meaningful difference between Model 1 and Model 2.
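If you want to put a number on "meaningful difference", one standard option is the partial F-test for nested regressions, which can be computed from the two R^2 values alone. This is not something done in the analysis above; the sketch assumes both models were fit by ordinary least squares with an intercept on the same 4,772 player-seasons, and it uses the rounded R^2 figures quoted in this post.

```python
def nested_f_statistic(r2_reduced, r2_full, n_obs, n_full_predictors, n_added):
    """Partial F statistic for adding n_added predictors to a nested OLS model."""
    df_resid = n_obs - n_full_predictors - 1   # residual degrees of freedom, full model
    gain_per_added = (r2_full - r2_reduced) / n_added
    unexplained_per_df = (1.0 - r2_full) / df_resid
    return gain_per_added / unexplained_per_df

f_stat = nested_f_statistic(
    r2_reduced=0.72,      # Model 1: goals + primary assists
    r2_full=0.82,         # Model 2: plus secondary assists
    n_obs=4772,
    n_full_predictors=3,
    n_added=1,
)
print(f"F = {f_stat:.1f}")
```

With these inputs the statistic comes out in the thousands, far beyond any conventional significance threshold. The caveat is that the same players show up in multiple seasons, so the 4,772 rows aren't fully independent and the true figure would be less extreme.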

The only objection to this analysis that I can think of is someone might argue that this is a case of the tail wagging the dog. The argument might be - players who are good at recording secondary assists don't cause their team to score; they record lots of secondary assists because their team scores a lot when they're on the ice.

My response to that is - first of all, we're looking at a gigantic set of data, covering 12 seasons and more than 4,700 player-seasons. It's almost certain that, in some situations, players have racked up a lot of secondary assists by virtue of being on good teams. But there's no evidence whatsoever that this is the case across 4,700+ data points.

Second, look at which players recorded the most 5-on-5 secondary assists per 60 minutes. If we use 6,000 minutes (over the entire 12-year period) as a cutoff, the top twelve consists of H. Sedin, Benn, Getzlaf, Backstrom, St. Louis, Williams, Thornton, Kucherov, Scheifele, Ribeiro, Datsyuk and Crosby. With the exception of Justin Williams, these are all top offensive players (McDavid would have been 2nd on the list, had he met the minutes cutoff). It seems obvious that the players recording the most secondary assists per 60 aren't depth players leeching off the offensive talents of their teammates - they're among the best offensive players in the league, and the driving forces behind their teams.

My overall conclusion is that secondary assists aren't statistical noise. If they were - Model 2 wouldn't have been much more effective than Model 1 at predicting how many goals (per 60 minutes, at 5-on-5) a team scores when a player is on the ice.