Mirtle: Making sense of hockey's WAR debate, the stats battle that never ends

OTTAWA, CANADA - FEBRUARY 2: Hockey analyst Pierre McGuire makes notes on the lineup sheet prior to an NHL game between the Ottawa Senators and the Detroit Red Wings at Scotiabank Place on February 2, 2011 in Ottawa, Ontario, Canada.  (Photo by Andre Ringuette/NHLI via Getty Images)
By James Mirtle
Aug 17, 2018

I didn’t see where exactly it started this time, but it probably doesn’t matter.

Hockey’s little corner of social media got worked up about WAR stats last week. It’s a debate that blows up every year between various factions of numbers people, and this time it felt particularly ugly.

Like a lot of statistical concepts, WAR comes from baseball. The abbreviation stands for Wins Above Replacement, and the stat’s aim is “to summarize a player’s total contributions to their team in one statistic.”

That, as you can imagine, is incredibly difficult. Even more so in hockey than in baseball. Of late, what we’ve seen in NHL circles is various analysts coming up with their own variations of WAR, a process that Ian Tulloch explained in depth last fall.

What I wanted to do here was talk a little bit about WAR as a concept, explain the controversy and see if we can come up with any potential follow-through points that will get us out of this rut where the same debate resurfaces constantly. I enlisted the help of Matt Cane, a Canadian data scientist and hockey analyst now living in Boston, and Tyler Dellow, our resident NHL analytics guru here in Toronto.

Cane has defended WAR models in the past, while still declaring himself a skeptic. Dellow, meanwhile, is a full-on dissenter when it comes to this push to come up with one number to define an NHL player’s contributions.

This is obviously going to be a dense topic overall, so if it’s not your cup of tea, check out some of our other hockey content in the middle of August.


A sample WAR list from corsica.hockey

Mirtle: I’ll keep this really simple to start: Why are WAR stats so controversial in the hockey community and are we any closer to this being less of a heated debate?

Cane: I think there are two related drivers behind the debate around WAR stats. The first is that most (if not all) WAR metrics that are getting pushed these days are regression outputs, which are fairly black box and often require a good degree of technical knowledge to understand the general idea behind how they work. Some people see this as a fatal flaw, which isn’t unfair: It’s really difficult to audit a model’s output if a lot of the decisions are made behind the scenes. Others see it as a selling point: If our evaluation of players requires us to blend multiple (often conflicting) statistics together, then a cold, ruthless algorithm is better suited to handle the nuances than the human brain.
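To make the “regression output” idea concrete, here’s a minimal sketch in the spirit of the ridge-regression (RAPM-style) approaches behind several public WAR metrics. The shift data and player names are invented for illustration; a real model uses hundreds of thousands of shifts and many more controls.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Each row is one shift; columns are player indicators:
# +1 if that player was on the ice for the attacking team,
# -1 if on the ice for the defending team, 0 if on the bench.
# (Toy data: a real model has hundreds of thousands of rows.)
players = ["Player A", "Player B", "Player C", "Player D"]
X = np.array([
    [ 1,  1, -1, -1],
    [ 1, -1,  0,  0],
    [ 0,  1, -1,  0],
    [-1,  0,  1,  1],
    [ 1,  0, -1,  0],
])
# Target: expected goals per 60 minutes on each shift (made up).
y = np.array([2.9, 2.4, 2.7, 2.1, 3.0])

# The ridge penalty shrinks every estimate toward zero. That keeps
# the model stable, and it is also part of what makes it feel like
# a black box: the penalty, not the analyst, decides how to split
# credit between players who share the ice.
model = Ridge(alpha=1.0).fit(X, y)

for name, coef in zip(players, model.coef_):
    print(f"{name}: {coef:+.3f} xG/60 impact")
```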

The second point is that WAR stats, like nearly every other stat, often disagree with the eye test or other methods of evaluation. If you’re hesitant about the regression approach, this gives you ample evidence for why these models don’t make sense, but these same “There’s no way Jake Muzzin is that good” arguments have existed since the early days of Corsi. Given that I don’t think we’ll ever get to the “truth” of who’s good and who’s not, I don’t really see these example-based arguments — right or wrong — going away any time soon.

Dellow: I agree with Matt’s point that the technical knowledge required is a barrier to acceptance. At the same time, though, I don’t see it as the most significant issue. My own skepticism flows from a couple of points. Do we have the data needed to build a model that divides the credit appropriately? Do we understand the game well enough for these models to be right often enough that they’re giving us serious insight?

Unfortunately, as Matt hints at, this leads to a lot of arguing about examples. I’ve seen enough examples where I don’t believe whatever some model of GAR (Goals Above Replacement) is telling us, either because I can look at other data in conjunction with video, or because I know why it’s giving a certain answer and think that the model is wrong in how it’s dealing with something. One popular GAR model will tell you that John Carlson was more valuable to Washington’s PP last year than Alex Ovechkin, Nicklas Backstrom or Evgeny Kuznetsov, or that Jeff Petry was more valuable than Jonathan Drouin on Montreal’s power play.

When you dig into these power plays, it’s fairly easy to surmise why this model produces those answers: Carlson and Petry produced points and the model has no way of evaluating how valuable the plays leading to those points were. That’s a single example involving power plays — the most straightforward part of the game — where I happened to be familiar with the power plays in question. Whenever I look at one of these models, I find examples like this.

So, for me, it’s not so much that they’re “controversial.” It’s that I think that you have to answer a lot of questions about how the game works before you can build a sensible model. And you have to have the data to answer those questions. I don’t think we’re there on either front yet — although the lack of data to accurately value what’s happening seems like the bigger problem.

None of this is to say that trying to come up with a model that summarizes a player’s contribution to his team is a bad thing. To me, it’s just that the data we’re working with is limited, and there are blind spots in the public, data-driven understanding of how the game works. As those improve, I suspect we’ll have models that provoke less controversy.


Mirtle: I’m glad Matt brought up Corsi. From a mainstream perspective, in terms of the hockey media and fans talking about “analytics,” the shot attempt stats have been a worthwhile starting point, going back 10 years now. In that sense, I think Corsi’s simplicity is actually its strength — it’s fairly intuitive how it’s calculated and what it’s telling you. For all its flaws, it is accurately measuring one part of the game and highlighting its impact.

WAR, meanwhile, is often black boxed and presented as a list of players ranked by quality. What goes into that rank isn’t always clear — and what it’s telling you about the game isn’t either. In the beginning, Corsi felt like it unearthed some basic truth about the game. The WAR stats don’t currently seem able to offer that kind of clarity.

Is this a case where a WAR statistic may ultimately have more uses for teams than the general public and media?

Dellow: I don’t think so. If you had a WAR stat that you could rely upon, I suspect that it’d be tremendously useful for everyone around the game. If anything though, I think it would be less useful to teams than the general public.

A big issue for teams is identifying players who they should or shouldn’t target. A large part of that is determining whether you think a player can succeed in a specific role on your team. If you consider the offensive component of WAR in baseball, it’s largely driven by what a player does in his at-bats. In other words, it’s largely independent of a player’s team.

Hockey doesn’t work this way. Take James van Riemsdyk’s career on the power play as an example. For a lot of his career, he didn’t really produce much on the power play. He turned into a monster at the end of his time in Toronto but he had the benefit of a system that was designed to play to his strengths. If you’re a team considering acquiring him, knowing that he generated a lot of value on Toronto’s power play isn’t really enough. (Assuming, of course, that he did and that the role he played isn’t easily filled by someone else.) You’d need to be comfortable that you had the personnel to put him into a similar position.

In fact, rather than paying for van Riemsdyk’s past results, a smart team might be better off trying to identify a player who has the skills to do what he does but hasn’t had the opportunity to do so yet, for whatever reason. So from the perspective of evaluating players, WAR seems to me to be missing a lot of information that would be essential to teams. Their needs go beyond those of the general public.

Cane: I’d agree that in specific situations (particularly where systems play an outsized role, such as on the power play) translating results between teams will be difficult. But I’d imagine there’s enough movement between teams each year that, with a good WAR model, you’ll be able to account for that when you’re making projections. Even if WAR models aren’t great at translating between contexts, there’s no reason to think that they’re any worse than existing statistics.

I think the big value to teams is that it allows them to start evaluating each decision using a common currency. With single number metrics and a decent projection system you can look at each move in terms of the net benefit it provides to your team. It’s obviously never going to be enough to just do a raw accounting of what’s coming and what’s going using WAR, but it does at least give you a sanity check that can spark discussion about where the metrics may disagree with your intuition.
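As a toy illustration of that “common currency” accounting, with entirely invented projections: the net benefit of a move is just the projected WAR coming in minus the projected WAR going out.

```python
# Hypothetical projected WAR for next season (invented numbers).
incoming = {"Winger X": 1.8, "Prospect Y": 0.4}
outgoing = {"Defenceman Z": 1.1}

net_war = sum(incoming.values()) - sum(outgoing.values())
print(f"Net projected WAR of the trade: {net_war:+.1f}")
# A positive number is a sanity check, not a verdict; the useful
# part is the discussion it starts when it disagrees with intuition.
```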

And while it will likely always be more difficult for fans to use WAR than other stats, most of the time a player’s WAR tends to agree with the consensus view of a player’s ability or contribution, and where it doesn’t, WAR can still be understood as the product of its components. Most WAR metrics are simply a combination of shot attempt or expected goal metrics, quality of teammates/competition, score effects, and zone starts. How they’re combined can be complex, but you can often break down a player’s WAR the same way you’d break down their Corsi – by looking at the context that their stats were generated in.
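Cane’s breakdown point can be shown with a little arithmetic. This sketch assumes a set of goals-above-replacement components and a rule-of-thumb goals-per-win conversion; the numbers are invented and don’t come from any particular public model.

```python
# Hypothetical goals-above-replacement components for one player.
components = {
    "even-strength offence":  6.2,
    "even-strength defence": -1.5,
    "power play":             3.1,
    "penalty differential":   0.8,
}

GOALS_PER_WIN = 6.0  # assumed rule-of-thumb conversion

goals_above_replacement = sum(components.values())
war = goals_above_replacement / GOALS_PER_WIN

for part, goals in components.items():
    print(f"{part:>24}: {goals:+.1f} goals")
print(f"{'total':>24}: {goals_above_replacement:+.1f} goals = {war:.1f} WAR")
```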


Mirtle: If some of the WAR models stress test better than Corsi (and other statistics) using correlations to future success, is that the ultimate indicator that they’re useful? How do we settle on a particular WAR model as being better than the others given the black-box arrangement? Are there some that you like better than others and why?

Dellow: It’s tricky to test WAR models. To take an extreme example: say you simply assigned all the WAR a team generated to the sixth defenceman through the first half of the season and then looked at how WAR predicted the second half of the season. It’d probably do really well! Good teams tend to continue being good. If we had omniscient knowledge of how the value that a hockey team creates should be allocated though, we’d know that it was wrong. (I’d say that we’d know this is wrong without that but if someone made a model that did that, there’d be people defending it.)
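Dellow’s thought experiment is easy to reproduce in code. In this simulation (all numbers invented), every team’s first-half WAR is dumped onto a single player, yet the team-level test still shows a strong correlation with second-half results, because good teams keep being good regardless of how the credit is divided.

```python
import numpy as np

rng = np.random.default_rng(42)
n_teams = 31

true_strength = rng.normal(0, 3, n_teams)  # team quality, in wins
first_half = true_strength + rng.normal(0, 1.5, n_teams)
second_half = true_strength + rng.normal(0, 1.5, n_teams)

# The deliberately absurd model: assign ALL of each team's
# first-half WAR to its sixth defenceman. The team total is
# unchanged, so a team-level test can't tell it's wrong.
sixth_d_war = first_half.copy()

r = np.corrcoef(sixth_d_war, second_half)[0, 1]
print(f"Correlation with second-half results: {r:.2f}")
```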

As you move from the team level to allocating credit at the individual level, things get complicated. I expect that I’ll be comfortable with a WAR model once we’ve got better data that enables us to better understand how players impact the game.

Cane: I think one thing to remember is that WAR can be both a descriptive stat and a predictive stat. WAR as a descriptive stat can certainly be useful as a sanity check in something like awards voting (not that I’d ever suggest the voters have ever erred) without having to be repeatable enough to have predictive ability. But I do agree that if models jump around too much, it’ll make it difficult for people to trust their outputs, even if they are accurately capturing what happened in the past.

Each WAR model is going to have different strengths and weaknesses, so it’s hard to name a “best” model, and often looking at several together is a good way to balance these out (particularly if you have a structured system to weight them together). I think it’s important, though, to understand the general methodology of a model before you use it, and in particular what some of the methodological weaknesses might be. For example, multicollinearity (where it’s difficult to separate the impact of players who play together almost all the time, like the Sedins) is a big issue for a lot of WAR models, but the degree to which it’s a problem varies between models. That can often lead to wacky results (where Daniel looks like a great defender but Henrik looks below replacement level), and so there are some cases where digging into specific players’ results can highlight particular flaws in these models.
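The multicollinearity issue is easy to demonstrate with simulated data. In this sketch, two linemates share almost every shift, so a regression can split their combined impact between them almost arbitrarily; change the random seed and the split swings, even though the sum stays put. Players and numbers are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n_shifts = 1000

# Linemate A and linemate B are on the ice together ~99% of the time.
a_on = rng.random(n_shifts) < 0.99
b_on = np.where(rng.random(n_shifts) < 0.99, a_on, ~a_on)

X = np.column_stack([a_on, b_on]).astype(float)
# True combined impact is +0.5 xG/60, split evenly; noise swamps
# the handful of shifts where the two appear apart.
y = 0.25 * a_on + 0.25 * b_on + rng.normal(0, 1.0, n_shifts)

coef_a, coef_b = LinearRegression().fit(X, y).coef_
print(f"Estimated impact, A: {coef_a:+.2f}  B: {coef_b:+.2f}")
# Only the sum coef_a + coef_b is well identified here; one player
# can look great and the other below replacement on the same data.
```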


Mirtle: What’s the future of WAR-like statistics in hockey? If we get to the point where players are tagged and we have elaborate tracking data, in what ways does that improve WAR models? And is that the point at which they become more mainstream? Or more arcane and hard to “sell” to the general public?

Dellow: I think elaborate tracking data could improve WAR models because it would give the modellers a better understanding of what’s really happening. Right now there are a ton of assumptions baked into their models that I don’t really think are defensible. There are also huge limitations in terms of the data from which the models can draw inferences. Better data would enable the modellers to look at what actually happens on the ice and understand how the presence or absence of certain players changes the game as well as the way in which coaches are impacting things. In theory, that should lead people with technical ability to better answers.

We’ll see how mainstream WAR models ultimately become. If there were a public one that wasn’t obviously plagued with issues, I expect it would ultimately be well accepted. That’s certainly been the case in baseball, even though baseball has competing models that disagree with each other on some players. People want answers that are straightforward and easy to understand. Those answers have to either pass the smell test or provide overwhelming evidence that the counterintuitive conclusions are actually pretty bulletproof, though. If they don’t, they won’t make an impact.

Cane: I’m not sure I agree that WAR is any more flawed than any other methods we have right now. Every stat is going to have faulty assumptions, and every WAR model we have right now obviously has problems evaluating some players (and some of these problems are significant), but I don’t think the assumptions that go into the most popular models right now are any more flawed than the assumptions that underlie Corsi or Expected Goals, for example.

But I think tracking data is, overall, going to make WAR models both more accurate and (strangely enough) less relevant, as we’ll be better able to isolate and measure specific skills that players possess (and the value of those skills). WAR right now is often useful because it’s able to weight a lot of complex information together in a structured way, but you can also imagine a world where we can estimate a defender’s gap control ability accurately and where we can estimate the value of adding that specific skill to a specific team. WAR won’t be obsolete, of course, because there are obvious use cases for single number metrics, but we’ll have fewer cases where we have to rely on them in place of a more nuanced view.

This is more of an observation, but I think one part of WAR that’s often lost in this whole debate is the decision around what to use as replacement level, which has a huge impact on how most of the general public reacts to any given result. Across the public models that exist today, there are wild variations in where replacement level is set – for defencemen on Corsica it’s closer to league average, while the Evolving Wild model has it a lot lower. I’m not saying one is right over the other (though I’d lean towards the latter over the former), but I think we often ignore how replacement level is decided when we look at the strengths or weaknesses of a given model. Choosing replacement level is arguably (for teams at least) the most important decision that you make, since it sets the level at which you’re better off giving up on a player than giving them more minutes. But what a replacement level player looks like in the NHL is something that’s rarely (if ever) debated. All of which is to say, if people want something to yell about on Twitter, I’d be more than happy to try to untangle that web alongside them.
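To see how much the replacement-level choice matters, here’s the arithmetic in a sketch, with invented numbers and an assumed rule-of-thumb goals-per-win conversion: the same player, measured against two different baselines, comes out worth roughly twice as many wins under the lower one.

```python
GOALS_PER_WIN = 6.0      # assumed rule-of-thumb conversion
MINUTES_PLAYED = 1500    # hypothetical season workload

def war(player_rate, replacement_rate):
    """Wins above replacement from goal-impact rates per 60 minutes."""
    goals_above = (player_rate - replacement_rate) * MINUTES_PLAYED / 60
    return goals_above / GOALS_PER_WIN

player_rate = 0.30       # hypothetical goal impact per 60

# A baseline near league average vs. a much lower one: roughly
# the spread Cane describes between the public models.
print(f"Higher baseline: {war(player_rate, 0.10):.1f} WAR")
print(f"Lower baseline:  {war(player_rate, -0.10):.1f} WAR")
```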

James Mirtle

James Mirtle is the senior managing editor of The Athletic NHL. James joined The Athletic as the inaugural editor in Canada in 2016 and has covered hockey for the company ever since. He spent the previous 12 years as a sportswriter with The Globe and Mail. A native of Kamloops, B.C., he appears regularly on TSN Radio across Canada. Follow James on Twitter @mirtle