Nate Silver's Concept of "Rich Data" – and Why We're Not There (Yet) in Elections
[Image courtesy of sportsonearth]
A few weeks ago, I was at a meeting of election technologists where I talked about how I looked forward to the day when election administrators could analyze their field as well and as thoroughly as the new breed of analysts do sports.
The big reason we aren't there yet, of course, is that election administration has yet to collect data anywhere near as obsessively as sports statisticians do. In a recent piece for ESPN Magazine (reprinted at fivethirtyeight), noted author and analyst Nate Silver explains how and why sports data has so many advantages over the data available in other fields:
As I describe in my book “The Signal and the Noise: Why So Many Predictions Fail but Some Don’t,” the rapid and tangible progress in sports analytics is more the exception than the rule. It’s important to remind sports nerds — who, as they look at streams of PER or wRC+ numbers, have become a bit spoiled — of this fair and maybe even obvious point. Because out there in the wider world … [w]e still have tremendous trouble predicting how the economy will perform more than a few months in advance, or understanding why a catastrophic earthquake occurs at a particular place and time, or knowing whether a flu outbreak will turn into a bad one.
It’s not for any lack of interest in data and analytics. For a while, I gave a lot of talks to promote my book and met a lot of people I might not encounter otherwise: from Hollywood producers and CEOs of major companies to the dude from India who hoped to be the Billy Beane of cricket.
But there’s a perfect storm of circumstances in sports that makes rapid analytical progress possible decades before other fields have their Moneyball moments.
In particular, Silver notes three key differences between sports data and that available in other fields:
1. Sports has awesome data.
Give me a sec. Really, I’ll only need a second. I just went to Baseball-Reference.com and looked up how many at-bats have been taken in major league history. It’s 14,260,129.
The volume is impressive. But what’s more impressive is that I can go to RetroSheet.org and, for many of those 14 million at-bats, look up the hitter, the pitcher, who was on base, how many people attended the game and whether the second baseman wore boxers or briefs. It’s not just “big data.” It’s something much better: rich data.
By rich data, I mean data that’s accurate, precise and subjected to rigorous quality control. A few years ago, a debate raged about how many RBIs Cubs slugger Hack Wilson had in 1930. Researchers went to the microfiche, looked up box scores and found that it was 191, not 190. Absolutely nothing changed about our understanding of baseball, but it shows the level of scrutiny to which stats are subjected.
Compare that to something like evaluating the American economy. The problems aren’t in the third decimal place: We sometimes don’t even know whether the sign is positive or negative. When the recession hit in December 2007 — the worst economic collapse since the Great Depression — most economists didn’t believe we were in one at all. The recession wasn’t officially identified until December 2008. Imagine what this would be like in sports! We’re not sure how many points Damian Lillard scored last night, but we’re reasonably confident it was between 27 and negative 2. Check back in a few months.
As if statheads weren’t spoiled enough, we’re getting more data all the time. From PITCHf/x [in baseball] to SportVU [in basketball], we have nearly a three-dimensional record of every object on the field in real time. Questions once directed at scouts — Does Carmelo really get back on defense? What’s the break on Kershaw’s curve? — are now measurable.
We’re simply not there yet in elections; not even turnout – arguably the most widely collected stat in elections – is gathered, calculated or reported the same way from one jurisdiction to the next. Without that consistency, this kind of deep analytical dive is still a distant dream.
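To make the turnout point concrete, here is a minimal sketch of how the same ballots can yield noticeably different "turnout" figures depending on which denominator an office chooses. Every number and name below is made up for illustration; the three denominators are simply common choices, not a description of how any particular jurisdiction reports.

```python
# Hypothetical figures for one imaginary county; none come from a real election.
ballots_cast = 52_000
registered_voters = 80_000
voting_eligible_population = 100_000
voting_age_population = 110_000


def turnout_rate(ballots: int, denominator: int) -> float:
    """Return turnout as a percentage of the chosen denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * ballots / denominator


# Same ballots, three different (assumed) ways of defining the base population.
rates = {
    "registered voters": turnout_rate(ballots_cast, registered_voters),
    "voting-eligible population": turnout_rate(ballots_cast, voting_eligible_population),
    "voting-age population": turnout_rate(ballots_cast, voting_age_population),
}

for base, rate in rates.items():
    print(f"Turnout as a share of {base}: {rate:.1f}%")
```

In this invented example, a single election produces three defensible turnout numbers that are nearly 18 points apart, which is why comparing figures across offices is risky unless everyone agrees on the definition first.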
2. In sports, we know the rules.
And they don’t change much. As I noted, there has been little progress in predicting earthquakes. We know a few basic things — you’re more likely to experience an earthquake in California than in New Jersey — but not a lot more.
What’s the problem? “We’re looking at rock,” one seismologist lamented to me for my book. Unlike a thunderstorm, we can’t see an earthquake coming, nor can we directly observe what triggers it. Scientists have identified lots of correlations in earthquake data, but they have relatively little understanding of what causes one at any particular time. If there are a billion possible relationships in geology’s historical data, you’ll come up with a thousand million-to-one coincidences on the basis of chance alone. In seismology, for instance, there have been failed predictions about earthquake behavior in locations from Peru to Sumatra — all based on patterns that looked foolproof in the historical data but were random after all.
False positives are less of an issue in sports, where rules are explicit and where we know a lot about causality. Take how we evaluate pitcher performance. It turns out that if you want to forecast a pitcher’s future win-loss record, just about the last thing to look at is his previous record. Instead, focus on his ERA, or better yet his strikeout-to-walk ratio, or maybe even the PITCHf/x data on pitch velocity and location.
Why? Winning is the name of the game, and you win by allowing fewer runs than your opponent. So ERA says more about winning than a pitcher’s record. But you can do even better: Runs are prevented by striking out batters (and not walking them), and strikeouts are generated by throwing good pitches, which is why WHIP and strikeouts per nine innings also serve predictive purposes. Understanding the structure of the system gives statistical analysis a much higher batting average.
Again, we’re not yet there in election administration. Not only do states and localities conduct elections in their own varied ways, we don’t even have agreement at a more general level on whether the goal should be to maximize turnout, convenience or enfranchisement, or to minimize fraud, error or cost. Without that consensus, it’s hard to decide what data to collect and why. Let’s also not forget that, wins and losses aside (and unlike election offices), sports teams exist to make money – and nothing focuses the mind quite like the need to maximize profit.
3. Sports offers fast feedback and clear marks of success.
One hallmark of analytically progressive fields is the daily collection of new data that allows researchers to rapidly test ideas and chuck the silly ones. One example: dramatically improved weather forecasts. The accuracy of hurricane landfall predictions, for instance, has almost tripled over the past 30 years.
Sports, especially baseball, fits in this category too. In Billy Beane’s first few years running the A’s, the team had awful defenses — bad enough that Matt Stairs briefly played center. Beane theorized that because defense was so hard to quantify, he shouldn’t focus on it. His assumption turned out to be completely wrong. As statheads came to learn about defense, it proved to be more important than everyone thought, not less. Because the A’s were playing every day and Beane could study the defensive metrics like dWAR that emerged, he learned quickly and adjusted his approach. His more recent teams have had much-improved defenses.
In many ways, the lack of fast feedback might be the biggest challenge of all for elections. Despite the joke (rapidly becoming a reality in some places) that “there’s always an election somewhere,” there just aren’t as many opportunities to collect election data on a regular basis as there are in a typical sports schedule.
In short, then, the field of elections still has a way to go before it’s capable of the same kind of analysis to which sports are subjected.
And yet …
With the development of new data standards for election technology, it’s my hope that we’ll be able to begin identifying common data points (including but by no means limited to turnout) that can be used to analyze election administration. This information will become even more crucial as the voting experience expands and it becomes important to figure out not just how many people are voting, but how, where and when – and whether such patterns vary by age, race/ethnicity and/or geography. Once we do, it will be possible for election administrators to adjust their practices based on the data as rapidly as a GM, manager or coach of a sports team.
Of course, these are all still distant goals – and as Silver points out, sports (and especially baseball) have about a hundred-year head start on election administration. Still, by identifying the ways in which sports has developed the concept of “rich data,” there may be a way to adopt those advantages for elections sooner rather than later.
Now all we need is to figure out a reason to head to Florida and Arizona every February.