Computational Cartography & Gerrymandering

My brief recent dabbling with computational cartography (see Part 1 & Part 2) was inspired by Neil Freeman, whose project began as a (speculative) reform to the Electoral College.


Incidentally, this is a great idea. Every four years, when whichever party loses the White House starts complaining about how silly the Electoral College is, I feel compelled to point out that it's actually an excellent way to maximize the expected impact of any one voter's vote. The catch is that this requires the states to have equal populations, which they obviously don't.

I just ran a quick Monte Carlo to check the numbers on this. Let's say you voted in 10 million elections in your life, and in each one there were 100 million voters. (This is obviously way too many elections, but I wanted something that didn't require me to report the results in scientific notation.) Let's further stipulate that every one of those races was close: a dead heat, with the final vote count normally distributed around a 50/50 split with a standard deviation of 1% of the electorate. In this idealized case you could expect to be the tie-breaking vote — thus having your vote "matter" — in fewer than four of them. Four in ten million.

cnt = 10000000;     % elections you vote in over a lifetime
pop = 100000000;    % voters in each election
ties = [];
for i = 1:1000      % repeat the whole lifetime 1,000 times
  % Each election is a dead heat: normal around pop/2 with a 1% std dev.
  results = round(random('normal', pop/2, pop*.01, [cnt 1]));
  ties = [ties; sum(results == pop/2)];
end
mean(ties)          % ans = 3.9160

Boston Red Sox World Series trophy

If, on the other hand, there were 51 states of 1,960,784 voters each, you could expect to be the tie-breaker in your own state about 203 times. That only matters when the other 50 states split 25/25, which you can expect about 11.2% of the time. Your chance of casting the deciding vote therefore rises from 4-in-10,000,000 to 23-in-10,000,000, meaning your vote "matters" more than five times as much under this (modified) Electoral College system as it does under a single national winner-take-all vote.
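
Here's a quick sketch of that arithmetic in the same style as the script above. The 1% standard deviation and the coin-flip treatment of the other 50 states are the assumptions from the text, and binopdf comes from the same Statistics Toolbox the script already uses.

cnt      = 10000000;             % elections you vote in, as above
statePop = 1960784;              % voters per state (100 million / 51, rounded)

% Chance of casting the tie-breaking vote in your own state: simulate each
% election's vote count and count exact ties, as in the script above.
results   = round(random('normal', statePop/2, statePop*.01, [cnt 1]));
stateTies = sum(results == statePop/2)          % roughly 200 of 10 million

% Chance that the other 50 states split 25/25, each being a fair coin flip.
pSplit = binopdf(25, 50, 0.5)                   % about 0.112

% Expected number of elections in which your vote decides the presidency.
stateTies * pSplit                              % roughly 23 of 10 million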

All of this is far too numerical to explain to someone in casual conversation, so my standard defense of the Electoral College is to agree that we should abolish it as soon as the winner of the World Series is decided by whichever team scores the most runs over 63 innings rather than by the team that wins four of seven games. The reasons to prefer both the Electoral College and best-of-seven champions are the same, but I'm already way too far off track to cover that now.


Okay, back to the point. Equally sized states would be cool. But the chance of us redrawing state boundaries is zero.

Despite this, I think an automated system to draw boundaries for equally populated regions is far more than just an exercise. Why? Because while our state boundaries are effectively carved in stone by tradition — or in the case of my home of Maryland, literally carved in stone — there are political borders that are re-drawn with regularity: congressional districts.

That's how we end up with absurdity like this:

Maryland 3rd
Maryland's (totally reasonable, not at all farcical) 3rd Congressional District

Gerrymandering was one of the chief complaints that Glenn Reynolds made in his recent EconTalk appearance:

I'm also pretty much agnostic on term limits. I used to be opposed to them. And as I see how effective gerrymandering is, which I think is one of the reasons people hate Congress but love their Congressmen, or at least tolerate their Congressmen, because of gerrymandering, I think that's something that is difficult to deal with and probably wouldn't be helped that much by term limits. If you've got a gerrymandered district, especially with modern gerrymandering technology, the new guy who gets elected in it is likely to look an awful lot like the old guy because the district is so one-dimensional.

Taking the "especially with modern gerrymandering technology" as a given, how can we use technology to pre-empt this trend? It's usually best to address technical problems with technical solutions rather than social ones.

One option is to draw the boundaries the same way we do now, but reject any map whose score on some objective measure is unacceptable. For example, IDV Solutions has ranked districts by perimeter/sqrt(area). This is, by their own admission, a crude way to do things. To name just one problem, it favors districts bounded by straight lines like the Mason-Dixon Line over districts bounded by meandering natural features like the Potomac River.
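
To make that concrete, here's roughly what such a score looks like for a district given as a polygon. This is my own toy version, not IDV's actual code, and it assumes coordinates in a planar projection.

% Toy perimeter/sqrt(area) compactness score for a district polygon.
% Lower is more compact: a circle scores 2*sqrt(pi) (about 3.54), a square 4.
x = [0 10 10 0];                        % example: a long, thin 10-by-1 rectangle
y = [0  0  1 1];

xc = [x(:); x(1)];                      % close the polygon
yc = [y(:); y(1)];
perim = sum(hypot(diff(xc), diff(yc)));
score = perim / sqrt(polyarea(x, y))    % ~6.96 here; contorted districts score far higher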

Other measures are possible, obviously. One that I think would be useful is based on nearest neighbors. In a non-gerrymandered map the vast majority of addresses will have all of their k nearest neighbors in the same district. As districts become more complex, more people will have large proportions of their neighbors in other districts. If you limit the neighbors to those within the state, which is reasonable, you minimize the Mason-Dixon/Potomac problem.
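
Here's a rough sketch of that measure on toy data (random address locations and a made-up three-district assignment). A real version would use actual address points, restrict neighbors to the same state, and pick k more carefully.

% Toy version of the nearest-neighbor measure: for each address, what fraction
% of its k nearest neighbors fall in the same district?
n = 2000;  k = 10;
pts      = rand(n, 2);                    % fake address locations
district = ceil(3 * pts(:,1));            % made-up assignment: three vertical strips

sameFrac = zeros(n, 1);
for i = 1:n
  d = hypot(pts(:,1) - pts(i,1), pts(:,2) - pts(i,2));
  [~, idx] = sort(d);
  nbrs = idx(2:k+1);                      % k nearest neighbors, excluding the point itself
  sameFrac(i) = mean(district(nbrs) == district(i));
end

% Share of addresses that keep all k neighbors in their own district:
% close to 1 for compact districts, dropping as the map gets more contorted.
mean(sameFrac == 1)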

Another measure, which I've thought out less fully, would be to measure the Kolmogorov complexity of the map. The descriptive length of the Maryland 3rd is... well, "long" doesn't cut it. Compare that to the former definition of the Maryland 1st: "the Eastern Shore up to the Susquehanna River." Kolmogorov complexity itself is uncomputable, but I think we have very good proxies in this case. Whoever draws the maps must be able to present in writing, with no graphics, an unambiguous definition of the districts in fewer than k words. Or, alternatively, they must be able to describe the districts to a judge, aloud and without visual aids, in under c minutes.

There's a general principle here: If you can't explain it succinctly, it's a bad rule.

Another approach might be to measure the curvature or the fractal dimension of boundaries between districts. I've thought even less about this though.

The approach of capping some metric has the advantage of being simple, objective, and easy to implement. Whatever partisan committee now draws the borders presents its map to an adjudicator, who passes it through an established, preferably open-source, program that calculates the metrics. If any of them exceeds a pre-established threshold, the map is rejected. This is already an improvement over the subjective way we do things now.

A second option, suggested by David Friedman, is to have the partisan committee submit not a finished map, but a program to generate a map. The inputs to the program could contain population data, municipal boundaries, etc. but no information about party affiliation or past voting records, or any proxies for them. There would also be a cap on program length to keep anyone from submitting what is essentially just a precomputed lookup table.

Given the low numeracy I see in most legislators, I think it would be a hard sell to get them to produce such a program, but the idea has tremendous appeal to me. The whole point is to remove subjectivity and make the decision solely on the basis of explicitly acceptable criteria, which is exactly what computers do: they're entirely dispassionate, and they only know what you tell them.

We could take this one step further and simply codify the program to be used, rather than having partisan committees write their own after every census. I find a technocratic Bureau of Districting Algorithms much easier to imagine than local politicians wrestling with GIS algorithms.

Posted in CS / Science / Tech / Coding | Tagged , | Leave a comment

Roman Mars

The latest episode of Bullseye includes a great interview with radio producer and podcaster Roman Mars.

99% Invisible logo

Mars is the man behind the wonderful podcast 99% Invisible. (No relation to OWS.) 99% Invisible covers the design and architecture of everything from the extremely weird (Kowloon Walled City, razzle dazzle camouflage) to the entirely prosaic (broken Metro escalators, check-cashing stores, culs-de-sac) to the outright awesome (The Feltron Annual Report, Trappist ales).

I had always thought Mars had a background in architecture. Turns out he actually went to grad school to study genetics. A lot of what he said about studying and learning and why he went to grad school really resonated with me, which is why I'm writing this post. (Besides wanting to evangelize 99% Invisible, which I couldn't recommend more.)

The only complaint I have with the interview came when Mars said that if you simply read a list of his podcast topics without listening to the show you'd think they were the most boring things in the world.

I couldn't disagree more. The topics he chooses are exactly the sort of thing that leads people like me to descend into hour-long Wikipedia spelunking expeditions. (Except his investigations have way higher production value and are told far more artfully than anything the Wikipedia writing-by-committee process produces.) Don't you want to learn about how Gallaudet University designed buildings suitable for the deaf? Or how audio engineers sound-design the Olympics? Yes. Yes you do.

Posted in Uncategorized | Tagged , | Leave a comment

Armen Alchian & Unnecessary Mathematical Fireworks

Cato Daily Podcast :: Remembering Armen Alchian

Don Boudreaux's discussion of Armen Alchian's preference for clear prose over "mathematical pyrotechnics" reminded me of a few neural network researchers I know. I won't name names, because it wasn't a favorable comparison. There's far too much equation-based whizz-bangery going on in some papers.

I used to think the problem was insufficient sophistication in my own math background, but I've recently heard, independently, from two very smart people in our Applied Math/Scientific Computing program that they also find the math in a lot of these papers to be more of an obfuscating smoke screen than a clarifying explication. If they find it hard to follow, I've got good reason to believe the problem isn't just me.

Posted in Business / Economics, CS / Science / Tech / Coding | Tagged , | Leave a comment

Computational Cartography (Part 2)

In the previous post I talked a bit about how one might build a system that would algorithmically divide the US into 50 states of approximately equal population.

I've messed around with the problem a bit and developed a very rough approximation. Actually, I don't even feel comfortable calling this an approximation. But it is at least the scaffolding you could build an eventual solution around.

auto sized states - after

It imports the relevant data from the Census (after a lot of PitA data cleaning) and throws it up on the screen, connecting neighboring counties. Then it picks 50 random counties to be the centers of the states and runs a k-means update on them (weighted by the county populations) to iterate toward a better solution. The only wrinkle on basic clustering is that it uses each state's population to set a z-coordinate for its centroid when determining the distance to each county: states with more people in them sit higher up, which effectively makes them smaller. (There's a rough sketch of this step a little further down.)

The image above is what you end up with; here's where the system started:

auto sized states - before

The ratio of the most populous state to the least populous in America today is 66:1. As simple as this system is, some of the runs I tested got that ratio down to 11:1.
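
For the curious, here's a stripped-down sketch of that update on synthetic counties rather than the real Census data. The zScale knob and the random populations are my own stand-ins, and the population-derived z-coordinate is the only departure from vanilla k-means.

% Stripped-down sketch of the clustering step on synthetic data: random county
% locations and populations, 50 centroids, and a z-coordinate that lifts
% populous states so they effectively shrink.
nCounties = 3000;  nStates = 50;  zScale = 0.5;    % zScale is a made-up knob
xy  = rand(nCounties, 2);                          % fake county locations
pop = randi([1000 900000], nCounties, 1);          % fake county populations

centers = xy(randperm(nCounties, nStates), :);     % seed with 50 random counties
z = zeros(nStates, 1);                             % population-derived "height"

for iter = 1:50
  % Assign each county (at z = 0) to the nearest centroid in (x, y, z) space.
  dxy = hypot(xy(:,1) - centers(:,1)', xy(:,2) - centers(:,2)');
  d   = sqrt(dxy.^2 + (z.^2)');
  [~, assign] = min(d, [], 2);

  % Move each centroid to the population-weighted mean of its counties and
  % raise it in proportion to its share of the total population.
  for s = 1:nStates
    members = (assign == s);
    if any(members)
      centers(s, :) = pop(members)' * xy(members, :) / sum(pop(members));
      z(s) = zScale * sum(pop(members)) / sum(pop);
    end
  end
end

statePop = accumarray(assign, pop, [nStates 1]);
max(statePop) / min(statePop(statePop > 0))        % largest-to-smallest population ratio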

There are a thousand things to do to make this an actual solution to the problem of automatically re-drawing state boundaries. I need to do things from the small (use spherical geometry to accurately calculate the distance between pairs of lat/long coordinates), to the medium (link AK/HI/PR to the mainland better), to the major (balance state populations more explicitly). Nevertheless, I think this is a useful framework to build some intelligent decision making on top of.

Posted in CS / Science / Tech / Coding | Tagged | 1 Comment

Computational Cartography (Part 1)

In the sixth grade my Social Studies teacher lent me a book which had a whole section devoted to alternative maps of the US. What would it look like with a dozen states? Or 200? What if state lines were drawn according to climate and vegetation patterns, or ancestral ethnicities, or any other speculative idea? I'm fascinated by maps generally, and these were no exception.

Here's an example of a speculative map of the US by Neil Freeman that's been making the rounds in the last few days:

The United States redrawn as Fifty States with Equal Population
"Electoral college reform," Neil Freeman

My first reaction to seeing this was: Cool. Now how do I automate this? Those two thoughts go together in my head a lot. Rather than just idly speculating about how to get a machine to do this, I'm actually going to give it a try. Or at least take a crack at a first approximation.

Freeman's method was initialized algorithmically, but a lot of the decisions were subjective. He also uses a lot of data (drainage basins, commuting patterns) that would be too much of a pain for me to track down and integrate for this sort of spare-time project.

I'm starting by doing everything on a county-by-county basis, which is a useful limit to work with. Without limiting myself to some relatively coarse-grained pre-existing boundaries the computational possibilities would be staggeringly vast, and the result would probably look too foreign anyway. Limiting things to the 3,033 counties and county-equivalents (Louisianan parishes, Alaskan boroughs, Virginian independent cities, ...) will make things more manageable.

I'll need population data for each county, which I can get from the Census Bureau. I'll also need an adjacency graph of counties, which I thought would be tricky, but it turns out the Census Bureau has me covered there too. The location of each county is given in that first link. For the final rendering I'll also need the shape of each county; location is simple, but I'll have to do some digging for shape.

Methods... methods... hmmm.

A clustering approach seems like a natural way to start. Something like k-means, where you start with some arbitrary, incorrect centers for each state and move them around until each state is approximately equally populated. That would at least give you something to refine from.

K-means is related to Learning Vector Quantization, which was an ancestor of Self-Organizing Maps, so that's another direction to explore. I'm not sure off the top of my head how you'd enforce the population-similarity constraint. Or, for that matter, how you'd want to encode any of the information in feature space. (This 2011 paper by Skupin and Esperbé seems like a solid lead [pdf].) I'm not sure how an SOM would work for this, but I have a strong feeling there's potential down that road.

There's definitely a greedy hill-climbing approach that could be easily implemented. Which means there's a Simulated Annealing approach as well. No way of knowing a priori how effective either would be.

Those techniques — and many others — will need a fitness function. It would be simple enough to write one based on the inequality in state populations. (Ratio of smallest to largest, Gini coefficient, ...) There should also be a penalty for non-compact states, perhaps based on a perimeter-to-area ratio. But that may cause a problem for states along coastlines or rivers. Perhaps use the adjacency graph and penalize states with large diameters, weighted by their actual geographic distance, relative to the number of counties. In any event, Hawaii and Alaska may need special attention for any distance-based rules.
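
As a toy illustration of both the greedy hill-climbing idea above and a population-based fitness, here's a sketch on a grid of fake counties. The populations are random, the fitness is just the smallest-to-largest ratio, and contiguity checks and compactness penalties are left out entirely.

% Toy greedy hill-climb on a grid of fake counties, using the smallest-to-
% largest state-population ratio as the fitness (1.0 would be a perfect split).
side = 20;  nStates = 4;
pop    = randi([1000 900000], side, side);                  % fake county populations
assign = repmat(ceil((1:side) * nStates / side), side, 1);  % start as vertical strips

fitness = @(a) min(accumarray(a(:), pop(:), [nStates 1])) / ...
               max(accumarray(a(:), pop(:), [nStates 1]));

steps = [1 0; -1 0; 0 1; 0 -1];                         % 4-neighborhood
for pass = 1:2000
  i = randi(side);  j = randi(side);                    % pick a random county
  s = steps(randi(4), :);                               % and one of its neighbors
  ni = min(max(i + s(1), 1), side);                     % (clamped at the grid edge)
  nj = min(max(j + s(2), 1), side);

  candidate = assign;
  candidate(i, j) = assign(ni, nj);                     % hand the county to the neighbor's state
  if fitness(candidate) > fitness(assign)               % keep the move only if it helps
    assign = candidate;
  end
end

fitness(assign)                                         % population ratio after the greedy passes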

Genetic Algorithms might come in handy, especially if you already had a reasonable starting point to refine from. How would a chromosome be represented, though? Each county as one gene, taking values from 1 to 50? The state space would be huge. Something better than the naive approach would be needed.

I'm sure there are notes in either my Computational Geometry or GIS notebooks that would come in handy. I'll have to flip through those next time I'm in the lab. Maybe my Algorithmic Game Theory notes as well; I seem to remember some relevant problems on siting cell towers that may be adaptable.

This should give me plenty to get started with. I've got a lot of other irons in the fire, but hopefully in a few days I'll be back with another post laying out my preliminary efforts.

Posted in CS / Science / Tech / Coding | Tagged | 2 Comments

Netflix, again: HBR misses the point

I don't intend for this blog to be nothing but commentary about Netflix. I promise. But this is the intersection of business and technology and art, so it's got my attention.

HBR :: Grant McCracken :: Will Netflix Flourish Where Hollywood Failed?

So does Netflix have an edge? Is there any reason to think they can flourish where so many have failed? The apparent answer is data. Netflix has lots and lots of data. They know what we watch, when we watch, where we stop watching, where we repeat a scene, where we reach for the fast-forward button, and most critically, when we break off and move on. They know which movies sell well at 8:00 on a Friday night and which ones we like to watch on Sunday afternoon. They can surmise which directors, writers, and stars produce the most watchable entertainment. They have magnificent data.

(1) Yes, data is their edge. (2) They don't need to make better content than everyone else. They just need to make content good enough to give them bargaining chips when they strike deals with other content makers and distributors.

And that's a tragedy. Netflix has so much data that they are going to be tempted to climb into the creative tent and start offering "advice."

This is almost exactly wrong. Or not wrong, but useless. Producers are constantly offering "advice." The difference is that Netflix's advice will be based on numbers, while other producers' advice is based on sophistry and illusion. ("Does it contain any abstract reasoning concerning quantity or number? No. Does it contain any experimental reasoning concerning matter of fact and existence? No. Consign it then to the flames: For it can contain nothing but sophistry and illusion.")

They can claim to know exactly what works and what does not. Well, sorry, no. Knowing that something works leaves us a long way from knowing why something works. And this leaves us a long way from knowing how to reproduce it in another movie. The only thing this data can be absolutely sure to produce is arrogance. We have seen this mistake before.

Yes, they can claim to know exactly what works and what does not and why. Or they could not. There's nothing inherent in a quantitative approach that rules out epistemic humility. In fact, there's much to quantitative reasoning that makes it more humble. When's the last time you saw someone run a t-test on an executive's intuitions or gut feelings?

This means that whatever the data say, Netflix cannot tell a director, "We need a fight scene here." And it really can't say, "We need a fight scene at the 14-minute mark." Doing so, will not only drive creatives away, but viewers as well. As Henry Jenkins has said, viewers are newly sophisticated and critical. They can see formula a long way off. They can see plot mechanics the second they hit the screen. And the moment this happens, they are off.

Hold on a minute. Why would you assume that Netflix's results would be more formulaic than the traditional Hollywood approach? Humans can only sort out cause and effect when there are a couple of moving pieces. Computerized pattern recognition can do so in much more complicated environments. Doesn't it stand to reason that Netflix's discovered patterns will be more complex, and therefore less formulaic and noticeable, than the patterns that intuition- and tradition-guided producers hew to?

Netflix, therefore, will have to temper their itch to intervene. Naturally, we are not talking carte blanche here. We are not saying that we take any artist and turn them loose. Because we know a great deal of capital has been squandered by creatives keen to prove how artistic and avant garde they are. No, what we need are culture producers who are — in the language of Goldilocks — "just right." They need to be able to tell a story and obey some of the story-telling conventions even as they do new and interesting things to break and bend those conventions. Only then will painters paint and patrons watch.

The advice in this post is true for every company producing creative output; it's merely masquerading as being specifically about Netflix. It's not only more general than it's made out to be, it's arguably less applicable to Netflix than to their competitors.

I'm also more than a little weary of critiques being made against numerical decision making without any consideration of the faults of the non-numeric decision making it's displacing.

Posted in Business / Economics | Tagged , | Leave a comment

Netflix Marathon

Tyler Cowen :: Will marathon viewing become the TV norm?

On Friday, Netflix will release a drama expressly designed to be consumed in one sitting: “House of Cards,” a political thriller starring Kevin Spacey and Robin Wright. Rather than introducing one episode a week, as distributors have done since the days of black-and-white TVs, all 13 episodes will be streamed at the same time. “Our goal is to shut down a portion of America for a whole day,” the producer Beau Willimon said with a laugh.

Ad-financed shows — still a clear majority of viewing — may prefer to have impressions from the ads spread out over weeks and months rather than concentrated in one long marathon sitting.

On the other hand, when watching an hour-long show — or even a half-hour — I routinely see the same ad multiple times. Not ads for the same product or service, but the very same advertisement. I am sure there is a lot of literature on the trade-off between repetition and staleness here. (Note to self: ask about this at the next marketing quant lunch.)

house_of_cards

Cowen continues:

Furthermore the show itself relies more heavily on an effective and immediate burst of concentrated marketing, with little room to build word of mouth and roll out a campaign with stages.

Yes, you lose word-of-mouth, but you also lose the inevitable week-to-week decay as people drop out of the viewership pool. Most TV shows show a remarkably consistent exponential decay in viewership. It's not at all clear to me that the combination of word-of-mouth gains and audience-decay losses is preferable to having neither.

This is being framed as a contest between watching 13 episodes in one day and watching them over four months. My wife and I have been watching one episode of "House of Cards" every day or so, and I think this middle ground may be a better solution than either extreme. A two-week roll-out keeps viewers focused and concentrates marketing, but doesn't roll the dice on one big push.

Note that Netflix has an advantage that other outlets don't: they can continue to advertise the show for free through their own service. This won't drive new members to subscribe, but I think they benefit even when existing members watch the show. True, it doesn't boost revenue, but racking up higher viewership both makes it easier for them to create high-quality shows in the future and strengthens their in-house productions as a bargaining chip when negotiating with other content producers and distributors. That bargaining power, I think, is the real value of "House of Cards."

One media market which is still highly serialized and has clearly not come to grips with the implications of that is comics. Here is just one recent piece about this. People have been fretting over the serialization-vs-collection transition and the friction it causes since I started reading comics six years ago, and they don't seem any closer to resolving the tension.

PS "House of Cards" is very highly recommended. I haven't had a show I was this excited about binge-watching in a couple of years.

Posted in Business / Economics | Tagged , | Leave a comment

Hello world!

This is my new blog. I noticed that the book reviews and other notes on my personal webpage were getting long and unwieldy, so I've decided to migrate some of that content over here.

I plan on posting notes on what I've been reading, as well as thoughts about science, computation, business, art, and... whatever else. You know, blog stuff.

Posted in Uncategorized | 1 Comment