Updated: I noticed a floating-point error when using this, which I discuss correcting at the end of this post.
I thought I would post some of the bite-sized coding pieces I've done recently. To lead off, here's a Ruby function to find the distance between two points given their latitude and longitude.
Latitude is given in degrees north of the equator (use negatives for the Southern Hemisphere) and longitude is given in degrees east of the Prime Meridian (optionally use negatives for the Western Hemisphere).
include Math
DEG2RAD = PI/180.0
def lldist(lat1, lon1, lat2, lon2)
  rho = 3960.0                      # radius of the Earth, in miles
  theta1 = lon1*DEG2RAD             # longitudes to radians
  phi1 = (90.0-lat1)*DEG2RAD        # latitudes to colatitudes, in radians
  theta2 = lon2*DEG2RAD
  phi2 = (90.0-lat2)*DEG2RAD
  val = sin(phi1)*sin(phi2)*cos(theta1-theta2)+cos(phi1)*cos(phi2)
  val = [-1.0, val].max             # clamp to the domain of acos()
  val = [ val, 1.0].min
  psi = acos(val)                   # central angle between the two points
  return psi*rho                    # arc length = angle * radius
end
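A few quick sanity checks (the coordinates are arbitrary; these are just my own spot-checks of the function above):

lldist(40.0, -75.0, 40.0, -75.0)   # => 0.0, thanks to the clamping on val
lldist(0.0, 0.0, 0.0, 1.0)         # => ~69.1 miles, one degree of longitude at the equator
lldist(0.0, 0.0, 90.0, 0.0)        # => ~6220 miles, equator to the North Pole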
A couple of notes:
Everything with val at the bottom is to deal with an edge case that can crop up when you try to get the distance between a point and itself. In that case val should be equal to 1.0, but on my systems some floating-point errors creep in and I get 1.0000000000000002, which is out of range for the acos() function.
This returns the distance in miles. If you want some other unit, redefine rho with the appropriate value for the radius of the earth in your desired unit (6371 km, 1137 leagues, 4304730 passus, or what have you).
This assumes the Earth is spherical, which is a decent first approximation, but is still just that: a first approximation. ((As far as I'm concerned, this is my canonical example of the difference between a first and second approximation. The Earth isn't really an oblate spheroid either, but that makes a very good second approximation (good to about 100 m). (See John Cook here and here.) ))
I am currently writing a second version to account for the difference between geographic and geocentric latitude, which should do a good job of capturing the Earth's eccentricity. The math is not hard, but finding ground truth to validate my results against is, since the online calculators I've tried to check against do not make their assumptions clear. I did find a promising suite of tools for pilots, and I'd hope that if you're doing something as fraught with consequences as flying, you've accounted for these sorts of things.
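For the curious, the core of that second version is a one-line conversion from geodetic (map) latitude to geocentric latitude. Here's a minimal sketch assuming the WGS84 flattening; the constant and the helper name are mine, not from any standard library:

# Geodetic (map) latitude to geocentric latitude, both in degrees.
# Assumes WGS84: e^2 = f*(2 - f) with f = 1/298.257223563.
def geocentric_lat(lat)
  e2 = 0.00669437999014
  atan((1.0 - e2) * tan(lat*DEG2RAD)) / DEG2RAD
end

Presumably you would then feed geocentric latitudes into lldist, and perhaps also swap the fixed rho for a latitude-dependent radius.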
Update:
I've been putting this to use in a new project, and I've noticed an edge case that didn't crop up before. Due to some floating-point oddities, trying to get the distance between a point and itself will throw an error. In that case the value passed to acos() should be 1.0 but ends up being 1.0000000000000002 on my system. Since the domain of acos() is [-1,1] this is no good.
If you want to be on the safe side you can replace this...

psi = acos(sin(phi1)*sin(phi2)*cos(theta1-theta2)+cos(phi1)*cos(phi2))

...with this...

val = sin(phi1)*sin(phi2)*cos(theta1-theta2)+cos(phi1)*cos(phi2)
val = [-1.0, val].max
val = [ val, 1.0].min
psi = acos(val)
...and that will take care of things. (Yes, this is a verbose way of doing things, but I often prefer the verbose-and-overt to the laconic-and-recondite.)
Here is a question to think about. If religions help to create social capital by allowing people to signal conscientiousness, conformity, and trustworthiness [as Norenzayan claims], how does this relate to Bryan Caplan’s view that obtaining a college degree performs that function?
That might explain why the credentialist societies of Han China were relatively irreligious. Kling likes to use the Vickies/Thetes metaphor from Neal Stephenson's Diamond Age, and I think this dichotomy could play well with that. Wouldn't the tests required by the Reformed Distributed Republic fill this role, for instance?
This is by far the best, simplest explanation of Coase's insights that I have read. Having read plenty of Landsburg, that should not — indeed does not — surprise me.
His final 'graph is a digression, but a good point:
Coase’s Nobel Prize winning paper is surely one of the landmark papers of 20th century economics. It’s also entirely non-technical (which is fine), and (in my opinion) ridiculously verbose (which is annoying). It’s littered with numerical examples intended to illustrate several different but related points, but the points and the examples are so jumbled together that it’s often difficult to tell what point is being illustrated... Pioneering work is rarely presented cleanly, and Coase was a true pioneer.
And this is why I put little stock in "primary sources" when it comes to STEM. The intersection between people/publications who originate profound ideas and people/publications which explain profound ideas well is a narrow one. If what you want is the latter, don't automatically mistake it for the former. The best researchers are not the best teachers, and this is true as much for papers as it is for people.
That said, sometimes the originals are very good. Here are two other opinions on this, from Federico Pereiro and John Cook.
Start a font by tweaking all glyphs at once. With more than twenty parameters, design custom classical or experimental shapes. Once prototyping of the font is done, each point and curve of a glyph can be easily modified. Explore, modify, compare, export with infinite variations.
(Okay, so technically this may not belong on a "reading list.") Duncan previously created The History of Rome podcast, which is one of my favorites. Revolutions is his new project, and it just launched. Get on board now.
This would be a great starting place for high-school or freshmen STEM curricula. As a bonus, it has this nice epigraph from Richard Hamming:
"In science, if you know what you are doing, you should not be doing it. In engineering, if you do not know what you are doing, you should not be doing it. Of course, you seldom, if ever, see either pure state."
I'm at the tail end of a doctoral program and going on the job market. This is good advice. What's disappointing is that this would have been equally good and applicable advice for people going on the job market back when I started grad school. The fact that we're five years (!!) down the road and we still have need of these sorts of "surviving in horrid job markets" pieces is bleak.
All programming blogs need at least one post unofficially titled “Indisputable Proof That I Am Awesome.” These are usually my favorite kind of read, as the protagonist starts out with a head full of hubris, becomes mired in self-doubt, struggles on when others would have quit, and then ultimately triumphs over evil (that is to say, slow or buggy computer code), often at the expense of personal hygiene and/or sanity.
I'm a fan of the debugging narrative, and this is a fine example of the genre. I've been wrestling with code for mapping projections recently, so I feel Miller's pain specifically. In my opinion the Winkel Tripel is mathematically gross, but aesthetically unsurpassed. Hopefully I'll find some time in the next week or so to put up a post about my mapping project.
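For anyone wondering what "mathematically gross" means here: the forward equations aren't terrible, but the unnormalized sinc term hands you a 0/0 case to patch, and there's no closed-form inverse. A rough sketch of the forward projection as I understand the standard equations (outputs in units of Earth radii; this is my own toy code, not Miller's):

def winkel_tripel(lat, lon)
  deg2rad = Math::PI/180.0
  phi = lat*deg2rad
  lam = lon*deg2rad
  phi1 = Math.acos(2.0/Math::PI)                              # standard parallel, ~50.47 degrees
  alpha = Math.acos(Math.cos(phi)*Math.cos(lam/2.0))
  sinc = alpha.zero? ? 1.0 : Math.sin(alpha)/alpha            # unnormalized sinc, 0/0 case patched
  x = 0.5*(lam*Math.cos(phi1) + 2.0*Math.cos(phi)*Math.sin(lam/2.0)/sinc)
  y = 0.5*(phi + Math.sin(phi)/sinc)
  [x, y]
end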
File under "all college degrees are not created equal" or perhaps "no, junior, you may not borrow enough to buy a decent house in order to get a BA in psych."
Social cohesion can be thought of as a manifestation of how "iterated" people feel their interactions are, how likely they are to interact with the same people again and again and have to deal with long term consequences of locally optimal choices, or whether they feel they can "opt out" of consequences of interacting with some set of people in a poor way.
Munger links to some very good analysis, but it occurs to me that what is really needed is the variance of grades over time and not just the mean. (Obviously these two things are related, since the distribution is bounded on [0, 4]. A mean which has gone from 2.25 to 3.44 will almost certainly result in less variance here.)
I don't much care where the distribution is centered. I care how wide the distribution is — that's what lets observers distinguish one student from another. Rankings need inequality. Without it they convey no information.
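To put a number on that parenthetical: for any distribution on [a, b] with mean mu, the variance can't exceed (mu - a)(b - mu) (the Bhatia-Davis bound), so pushing the mean from 2.25 to 3.44 cuts the ceiling on the variance roughly in half. A quick back-of-the-envelope helper (mine, not Munger's):

# Upper bound on the variance of a grade distribution bounded on [0, 4].
def max_gpa_variance(mu, a = 0.0, b = 4.0)
  (mu - a) * (b - mu)
end

max_gpa_variance(2.25)   # => 3.9375
max_gpa_variance(3.44)   # => ~1.93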
I share Graur's and Tabarrok's wariness over "high impact false positives" in science. This is a big problem with no clear solutions.
The Graur et al. paper that Tabarrok discusses is entertaining in its incivility. Sometimes civility is not the correct response to falsehoods. It's refreshing to see scientists being so brutally honest with their opinions. Some might say they are too brutal, but at least they've got the honest part.
McCaffrey is completely right. But good luck to him reasoning people out of an opinion they were never reasoned into in the first place.
I do like the neologism "sustainable pricing" that he introduces. Bravo for that.
I would add a sixth reason to his list: accusations of "price gouging" are one rhetorical prong in an inescapable triple bind. A seller has three MECE (mutually exclusive, collectively exhaustive) choices: price goods higher than is common, the same as is common, or lower than is common. These choices will result in accusations of price gouging, collusion, and anti-competitive pricing, respectively. Since there is no way to win when dealing with people who level accusations of gouging, the only sensible thing to do is ignore them.
Executive summary: apiarists have agency, and the world isn't static. If the death rate of colonies increases, they respond by creating more colonies. Crisis averted.
"Betting Therapy" should be a thing. You go to a betting therapist and describe your fears — everything you're afraid will happen if you do X — and then the therapist offers to bet money on whether it actually happens to you or not. After you lose enough money, you stop being afraid.
This is the second machine learning competition hosted at Kaggle that I've gotten serious about entering and sunk time into only to be derailed by a paper deadline. I'm pretty frustrated. Since I didn't get a chance to submit anything for the contest itself, I'm going to outline the approach I was trying here.
First a bit of background on this particular contest. The data is very high dimensional (1875 features) and multicategorical (9 classes). You get 1000 labeled training points, which isn't nearly enough to learn a good classifier on this data. In addition you get ~130000 unlabeled points. The goal is to leverage all the unlabeled data to be able to build a decent classifier out of the labeled data. To top it off you have no idea what the data represents, so it's impossible to use any domain knowledge.
I saw this contest a couple of weeks ago shortly after hearing a colleague's PhD proposal. His topic is building networks of Kohonen Self-Organizing Maps (SOMs) for time-series data, so SOMs are where my mind went first. SOMs are a good fit for this task: they can learn on labeled or unlabeled data, and they're excellent at dimensionality reduction.
My approach was to use the unlabeled training data to learn a SOM, since they lend themselves well to unsupervised learning. Then I passed the labeled data to the SOM. The maximally active node (i.e. the node whose weight vector best matches the input vector, aka the "best matching unit" or BMU) got tagged with the class of that training sample. Then I could repeat with the test data, and read out the class(es) tagged to the BMU for each data point.
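In code, the tagging-and-readout step is only a few lines. A minimal sketch with made-up names (it assumes weights is the array of node weight vectors learned from the unlabeled data; this is not the contest pipeline itself):

def bmu_index(weights, x)
  # index of the node whose weight vector is closest to x (squared Euclidean distance)
  weights.each_index.min_by do |i|
    weights[i].zip(x).sum { |wi, xi| (wi - xi)**2 }
  end
end

def tag_nodes(weights, labeled_data)            # labeled_data: [[vector, label], ...]
  tags = Hash.new { |h, k| h[k] = Hash.new(0) }
  labeled_data.each { |x, label| tags[bmu_index(weights, x)][label] += 1 }
  tags                                          # node index => {label => count}
end

def predict(weights, tags, x)
  counts = tags[bmu_index(weights, x)]
  counts.max_by { |_, n| n }&.first             # plurality label at the BMU, nil if untagged
end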
So far that's simple enough, but there is far too much data to learn a SOM on efficiently, ((I also ran up against computational constraints here. I'm using almost every CPU cycle (and most of the RAM) I can get my hands on to run some last-minute analysis for the aforementioned paper submission, so I didn't have a lot of resources left over to throw at this. To top it off there's a bunch of end-of-semester server maintenance going on, which both took processors out of the rotation and prevented me from parallelizing this the way I wanted.)) so I turned to an old standby: ensemble methods.
[1] SOM bagging. The most obvious approach in many ways. Train each network on only a random subset of the data. The problem here is that any reasonable fraction of the data is still too big to get into memory. (IIRC Breiman's original bagging paper used full bootstraps, i.e. resamples the same size as the original set, and even tested using resamples larger than the original data. That's not an option for me.) I could only manage 4096 data points (a paltry 3% of the data set) in each sample without page faulting. (Keep in mind again that a big chunk of this machine's memory was being used on my actual work.)
[2] SOM random dendrites. Like random forests, use the whole data set but select only a subset of the features for each SOM to learn from. I could use 64 of the 1875 features at a time. This is also about 3%; the standard is IIRC more like 20%. (Both sampling schemes are sketched below.)
In order to add a bit more diversity to ensemble members I trained each for a random number of epochs between 100 and 200. There are a lot of other parameters that could have been adjusted to add diversity: smoothing, distance function and size of neighborhoods, size of network, network topology, ...
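Concretely, the two sampling schemes amount to something like this (the sizes are the ones mentioned above; the helper names are mine, not real library calls):

def bagged_rows(data, n = 4096)
  Array.new(n) { data.sample }                   # resample rows with replacement
end

def dendrite_columns(n_features = 1875, n_keep = 64)
  (0...n_features).to_a.sample(n_keep).sort      # random subset of feature indices
end

def project(data, cols)
  data.map { |row| row.values_at(*cols) }        # keep only the chosen features
end

epochs = rand(100..200)                          # each member trains for a random number of epochs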
This is all pretty basic. The tricky part is combining the individual SOM predictions. For starters, how should you make a prediction with a single SOM? The BMU often had several different classes associated with it. You can pick whichever class has a plurality, and give that network's vote to that class. You can assign fractions of its vote in proportion to the class ratio of the BMU. You can take into account the distance between the sample and the BMU, and incorporate the BMU's neighbors. You can use a softmax or other probabilistic process. You can weight nodes individually or weight the votes of each SOM. This weighting can be done the traditional way (e.g. based on accuracy on a validation set) or in a way that is unique to the SOM's competitive learning process (e.g. how many times was this node the BMU? what is the distance in weight-space between this node and its neighbors? how much has this node moved in the final training epochs?).
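To make just one of those options concrete, here's roughly what the fractional-vote rule might look like, reusing bmu_index and tag_nodes from the earlier sketch (again, hypothetical names, not the actual contest code):

def ensemble_predict(members, x)                 # members: [[weights, tags], ...]
  votes = Hash.new(0.0)
  members.each do |weights, tags|
    counts = tags[bmu_index(weights, x)]
    next if counts.empty?
    total = counts.values.sum.to_f
    counts.each { |label, n| votes[label] += n / total }   # split this SOM's vote by its BMU's class mix
  end
  votes.max_by { |_, v| v }&.first
end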
At some point I'm going to come back to this. I have no idea if Kaggle keeps the infrastructure set up to allow post-deadline submissions, but I hope they do. I'd like to get my score on this just to satisfy my own curiosity.
This blackbox prediction concept kept cropping up in my mind while reading Nate Silver's The Signal and the Noise. We've got all these Big Questions where we're theoretically using scientific methods to reach conclusions, and yet new evidence rarely seems to change anyone's mind.
Does Medicaid improve health outcomes? Does the minimum wage increase unemployment? Did the ARRA stimulus spending work? In theory the Baicker et al. Oregon study, Card & Krueger, and the OMB's modeling ought to cause people to update beliefs but they rarely do. Let's not even get started on the IPCC, Mann's hockey stick, etc.
So here's what I'd like to do for each of these supposedly-evidence-based-and-numerical-but-not-really issues. Assemble an expert group of econometricians, modelers, quants and so on. Give them a bunch of unlabeled data. They won't know what problem they're working on or what any of the features are. Ask them to come up with the best predictors they can.
If they determine minimum wages drive unemployment without knowing they're looking at economic data then that's good evidence the two are linked. If their solution uses Stanley Cup winners but not atmospheric CO2 levels to predict tornado frequency then that's good evidence CO2 isn't a driver of tornadoes.
I don't expect this to settle any of these questions once-and-for-all — I don't expect anything at all will do that. There are too many problems (who decides what goes in the data set or how it's cleaned or scaled or lagged?). But I think doing things double-blind like this would create a lot more confidence in econometric-style results. In a way it even lessens the data-trawling problem by stepping into the issue head-on: no more doubting how much the researchers just went fishing for any correlation they could find, because we know that's exactly what they did, so we can be fully skeptical of their results.
"\e[A": history-search-backward
"\e[B": history-search-forward
set show-all-if-ambiguous on
set completion-ignore-case on
This allows you to search through your history using the up and down arrows … i.e. type cd / and press the up arrow and you'll search through everything in your history that starts with cd /.
Wow. That is not an exaggeration at all: the most useful thing. I am so thrilled to finally be able to search my shell history the same way I can search my Matlab history. I've been able to do that there for ages, and my mind still hasn't caught up with not being able to do it in the shell.
If it's not clear to you why this is useful or why it pleases me, I don't think there's anything I can do to explain it. Sorry.
PS Anyone have first-hand experience with the fish shell? The autocompletions and inline, automatic syntax highlighting seem clever. I need to get around to giving it a try on one of my boxes.
The Raspberry Pi is the brainchild of a couple of computer scientists at Cambridge University. Back in 2006, they lamented the decline in programming skills among applicants for computer-science courses. ... Over the past ten years, computer-science students have gone from arriving at university with a knowledge of several assembly and high-level programming languages to having little more than a working knowledge of HTML, Javascript and perhaps PHP—simple tools for constructing web sites. To learn a computer language, “you’ve got to put in your 10,000 hours,” says Dr Upton. “And it’s a lot easier if you start when you’re 18.” Some would say it is even better to start at 14.
The problem is not a lack of interest, but the lack of cheap, programmable hardware for teenagers to cut their teeth on. For typical youngsters, computers have become too complicated, too difficult to open (laptops especially) and alter their settings, and way too expensive to tinker with and risk voiding their warranty by frying their innards.
I don't see the connection between learning to code and having super-cheap hardware. Back when I was a kid learning to program I actually had to pay real money for a compiler. (Anyone else remember Borland Turbo C++?) Now you're tripping over free languages and environments to use, including many that run entirely through your browser so there's zero risk to your machine.
Honestly, how many teens are going to go full David Lightman and be doing serious enough hacking that their hardware is at risk? Is the goal to make sure teens have the opportunity to start learning to code before college, or to give them hardware to tinker with? Those are both fine goals. Being a software guy I'd put more weight on the former, but the important thing is that the ways to accomplish them are completely different.
The Pi is a great way to meet the goal of giving people cheap hardware to experiment with. But if the goal is to give kids an opportunity to start racking up their 10k hours in front of an interpreter or compiler, then projects like Repl.it are a lot better. (Repl.it has in-browser interpreters for JavaScript, Ruby, Python, Scheme and a dozen other languages.)
For starters, [your correspondent] plans to turn his existing Raspberry Pi into a media centre. By all accounts, Raspbmc—a Linux-based operating system derived from the XBox game-player’s media centre—is a pretty powerful media player. The first task, then, is to rig the Raspberry Pi up so it can pluck video off the internet, via a nearby WiFi router, and stream it direct to a TV in the living room. Finding out not whether, but just how well, this minimalist little computer manages such a feat will be all part of the fun.
I did this exact project about a month ago, and couldn't be more pleased with either the results or the fun of getting it to work. I still have to tinker with some things: the Vimeo plugin won't log into my account, and I need to build a case. Other than that, I wish I had done this a long time ago.
Google is shutting down the Google Reader service, as you may have heard.
I am not happy about this, but I'm also not throwing fits about it. I spent more time in my Reader tab than any other by a huge margin. I'm sure there's some other service out there that will fill the void. Google is also being pretty considerate about this: they're giving users several months warning and they've made tools available to export your subscription data.
RWCG has had a couple of posts taking the wind out of the sails of all the people who are tearing out their hair and rending their garments and casting the evil eye towards Mountain View. Overall I think he's on the right track, but this post doesn't quite line up for me:
How dare Google shut down a service I made heavy use of for years and years and paid nothing whatsoever for! [...] I wonder if this event marks the end of the ‘free’ era. You know the free era, it’s the era where this was the prevailing business philosophy:
1. Drive competing services out of business with a free service (subsidized by a profitable product).
2. Cancel free service.
3. ???
Actually no, that’s still snarky. It’s more like this:
1. Company gives away something for free.
2. People like it and use it.
3. This makes people think the company is cool. Their friend.
4. When it goes public, they buy its stock, cuz it’s so cool, and everyone they know uses and likes it. Surely that’s gotta mean something, stock-wise.
5. People continue to use the free thing and come to not only rely on it but expect it as their birthright.
6. ...
7. Profit?
According to this philosophy, giving away cool stuff for free was the wave of the future. It’s what all smart, cool companies did. Only backward knuckle-dragging idiots couldn’t figure out how this added up to a business model. Economic realities were no longer reality.
I think he's short-changing the business potential of a product ("product"?) like Google Reader. There's no direct line between "make Reader & give it away free" and "profit," but this approach still has some uses.
1. Yes, it makes people think you're cool and friendly and not-evil. Many firms do things for that reason alone. They spend billions on things way outside their core competencies just so people think they're cool and friendly. Isn't that the entire point of the "Corporate Social Responsibility" fad?
2. Providing services like Reader makes Google look cool to people generally, but more importantly it makes them look cool to geeks. It's a punchline that a firm's number one asset is its people, but that's pretty true about Google. They can do what they do because they get the pick of the litter of hackers.
I went to a career fair my CS department sponsored a month or so ago. The line for the Google table was literally out the door. Most people in my graduate program are angling for academic jobs. Google is one of maybe four private companies that people will be impressed you're interviewing with.
3. Projects like Reader not only motivate applicants, they motivate employees.
Talent and productivity are extremely unevenly distributed in coders. The best are many orders of magnitude better than the median; the bottom decile (conservatively) have negative productivity. You usually don't get the best by offering them orders of magnitude more money, you get them by giving them cool problems to work on.
If you're excited about spending some time developing X, there's a good chance Google will let you do that. (At least in comparison to if you were working at Initech.) What's more, there's a chance Google will roll out X to millions of people, like they did with Reader. I can't stress enough how big of a motivator that can be.
4. Google's strategy for a while has been that anything they do to make people want to use the internet more is good for them, because they capture a dominant slice of the ad revenue online. More people spending more time online is better for Google, period. Reader fits into that. That's not a strategy that will work forever, or for many (any?) other companies. It can also be used to justify a lot of wasted "investments." But it's also true.
5. Google lives and dies off of data. Reader could have been generating that for them. I have no idea how much they actually learned from people's reading habits, if anything, but it had the potential to be a goldmine.
I have no idea if it was a good idea or a bad one for Google to shut off Reader. I'm skeptical, but I realize I have none of the facts. I can't imagine it would cost that much to keep it running as is, especially compared to their other projects. I'm not sure what better use they have for the resources they're redeploying. I find it curious that they didn't even try to make it ad-supported. Hell, I would have even paid directly to keep using it, and I pay for approximately zero online subscriptions.
Again, I don't know what they know. But I do know that "there's no revenue from this free product; let's shut it down" should not be the beginning and end of this decision making.