Gronke.new.forms.by.state.jpg

[Data plots courtesy of Paul Gronke of the Early Voting Information Center at Reed College; Graphic courtesy of the U.S. Election Assistance Commission]

I talk a lot here about the power of data to inform election policy. Today, however, I’d like to follow the timeless advice to writers – “show, don’t tell” – and walk through what it takes to find, format and display data in a way that tells a useful story about elections.

Our guide through this process will be my friend and colleague Paul Gronke, who is generally enthusiastic but almost levitates when the time comes to crunch election data. [Need proof? Just knowing this post was in the works got Paul so fired up he hammered out a post of his own over at ElectionUpdates.]

The raw material comes from the U.S. Election Assistance Commission’s release of data underlying the 2010 National Voter Registration Act report to Congress.

Job one is to download the raw data and then use the data codebook to identify the fields you’re interested in.

Mind you, that’s easier said than done; because of some formatting issues, Paul ended up taking the raw file and translating it into another (tab-delimited) format that he could import into Stata, his analysis program of choice.
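For readers who don't use Stata, here is a rough idea of what that translation step can look like in Python. This is only a sketch: the post doesn't say what format the raw file arrives in, so the comma delimiter, the column names and the sample row below are all assumptions made up for illustration.

```python
import csv
import io


def to_tab_delimited(raw_text, delimiter=","):
    """Re-emit delimited text as tab-delimited text.

    The input delimiter is an assumption; the real raw file may differ.
    """
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        # Trim stray whitespace around each field before writing it back out.
        writer.writerow([field.strip() for field in row])
    return out.getvalue()


# Hypothetical sample: a header row and one county record.
sample = 'state,county_code,new_forms\n"AL", 01001 ,120\n'
converted = to_tab_delimited(sample)
print(converted)
```

Everything stays a string here, which is deliberate: converting codes to numbers at this stage is exactly how leading zeros get lost.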

Once that’s done, there is usually still a little more cleanup to do. Here, leading zeros in the field used to identify counties don’t translate neatly, so Paul does a little sleight of hand to fix the issue. The data is then imported into the analysis software.
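The leading-zero problem is common enough to deserve a quick illustration. The sketch below is not Paul's actual fix (he works in Stata); it assumes the county identifier is a five-digit code that lost its leading zeros when it was read as a number, and the function name and sample values are hypothetical.

```python
def restore_county_code(value, width=5):
    """Pad a county code back to a fixed width with leading zeros.

    Assumes codes are all digits and `width` characters long;
    the five-digit width is an assumption, not something the data guarantees.
    """
    text = str(value).strip()
    if not text.isdigit():
        raise ValueError(f"unexpected county code: {value!r}")
    return text.zfill(width)


# Hypothetical codes: one read as an integer, two read as text.
codes = [1001, "6037", "36061"]
fixed = [restore_county_code(code) for code in codes]
print(fixed)  # ['01001', '06037', '36061']
```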

The first step in Stata is to generate a log file that identifies missing data and runs a series of checks to make sure the data is internally consistent.
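To make "missing" and "internally consistent" a little more concrete, here is a minimal sketch of that kind of check in Python. The field names (`total_forms`, `new_forms`) and the single rule that new forms cannot exceed total forms are assumptions for illustration; the real checks in Paul's log file are surely more extensive.

```python
def check_rows(rows):
    """Sort county records into 'missing' and 'inconsistent' buckets."""
    report = {"missing": [], "inconsistent": []}
    for row in rows:
        county = row["county"]
        total = row.get("total_forms")
        new = row.get("new_forms")
        if total is None or new is None:
            # A blank count cannot be checked, so just record it.
            report["missing"].append(county)
        elif new < 0 or total < 0 or new > total:
            # New forms are a subset of all forms, so they cannot exceed the total.
            report["inconsistent"].append(county)
    return report


# Hypothetical records: one clean, one with a gap, one that fails the check.
rows = [
    {"county": "A", "total_forms": 100, "new_forms": 30},
    {"county": "B", "total_forms": 50, "new_forms": None},
    {"county": "C", "total_forms": 40, "new_forms": 55},
]
report = check_rows(rows)
print(report)  # {'missing': ['B'], 'inconsistent': ['C']}
```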

Once that’s done, it’s time to do some visualization. The image at the top uses box plots to display, state by state, the proportion of registration forms in each county that were new registrations (as opposed to address and other changes). If you look carefully, you’ll note that there are counties (represented either by the dark dots or by the top end of a state box) that exceed 1 – which means the data says more than 100% of all registration forms were new. Obviously, that doesn’t make sense; a plot like this is a reason to go back and figure out what’s happening with the underlying data.
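The arithmetic behind a plot like this is simple enough to sketch. The Python below does not draw anything; it computes each county's share of new forms, groups the shares by state, and lists the counties whose share exceeds 1. The minimum, median and maximum are a rough stand-in for what a box plot displays, and the states, counties and counts are invented for illustration.

```python
from collections import defaultdict
from statistics import median


def summarize(records):
    """Return per-state (min, median, max) shares and the counties above 1."""
    by_state = defaultdict(list)
    impossible = []
    for state, county, new_forms, total_forms in records:
        if total_forms == 0:
            continue  # no forms received, so there is no share to compute
        share = new_forms / total_forms
        by_state[state].append(share)
        if share > 1:
            # More new forms than total forms: go back and check the source data.
            impossible.append((state, county))
    summary = {s: (min(v), median(v), max(v)) for s, v in by_state.items()}
    return summary, impossible


# Hypothetical records: (state, county, new forms, total forms).
records = [
    ("OR", "A", 30, 100),
    ("OR", "B", 45, 90),
    ("OR", "C", 20, 80),
    ("WA", "D", 60, 50),
    ("WA", "E", 10, 40),
]
summary, impossible = summarize(records)
print(impossible)  # [('WA', 'D')]
```

Flagging the impossible counties rather than silently dropping them keeps the decision about what to do with them (fix, exclude, or ask the jurisdiction) in the analyst's hands.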

Using the same general process, you can plot rejected forms by county within states …
Gronke.rejected.forms.by.state.jpg

… and the proportion of registration forms that came from the DMV (corrected to eliminate counties with proportions exceeding 100%):

Gronke.DMV.forms.corrected.jpg

Lather, rinse, repeat … and eventually, you get the data in the kind of shape that allows you to have a graphic designer produce something like this from the 2010 EAC report:

EAC.2010.NVRA.JPG

A picture like this demonstrates that only about 30% of registration forms are actually new registrations – and the majority of forms report address changes within a jurisdiction. That information might help policymakers assess how much of their registration activity results not in new voters but in maintenance of existing voter records – and consider changes that would streamline that work.

In many ways, policymaking is storytelling – and stories are easier to tell when you (or someone like Paul) are using data to draw a picture that tells them for you.