Nov 23, 2007

The Trains To Orbit Leave From Platform 5



Not a road to the stars, but a rail track to orbit. "The restaurant car is on the top of the train." How far ? Well, like riding the Transsib there and back, twice in a row. But the view ! The view ! Sure, it's expensive to build, but that's a cherished tradition. In Austria, too. Equator ? No, no, Linz is +48° 17' 43", but that's no problem, believe me. The Van Allen belt ? I'm more afraid of railroad strikes. And once we're in GEO, you can take the connecting train to the southern hemisphere from platform 3, and go down again over Brazil. Transatlantic bridge !


(Artwork Ticket to the moon by Christoph Steinbrener & Rainer Dempf)

A Simple Heuristic Explanation of Solomonoff Induction

Solomonoff induction, named after its inventor Ray Solomonoff, is a mathematical method that describes how to take some set of observations and produce an educated guess about the possible processes underlying those observations. You could then use that guess, among other things, to make a prediction about your next observation. The great thing is that this guess would, in a very general way, be better than any other guess anyone could make using just the same observations. Solomonoff induction is a simple and useful, yet widely misunderstood idea. Here I'm trying to give a very short, heuristic explanation of the basics. You can also try out this slightly more challenging explanation. A follow-up post to this one will deal with possible applications of Solomonoff's results.

Something simple beforehand: A string of 5 bits can take 2 x 2 x 2 x 2 x 2 = 2^5 = 32 configurations. A string of n bits can take 2^n configurations. A string of n+1 bits can take 2^(n+1) configurations, that's an additional factor of two over 2^n. So if we have one extra bit of length, we can make twice as many different strings.
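If you don't trust the arithmetic, here's a tiny Python sketch (not part of the original argument, just a brute-force check) that enumerates the strings and confirms the doubling:

```python
from itertools import product

# Enumerate every bit string of length n and count them:
# one extra bit should exactly double the count.
for n in (5, 6):
    strings = list(product("01", repeat=n))
    print(n, len(strings))  # prints: 5 32, then 6 64
```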

Imagine the following task: You are sitting in a lab in front of a computer screen. The computer is running a simple program, but you have no idea what the program looks like. The program is printing output after output on the screen. After a while, you should give an educated guess about the unknown program's next output.

To make things a little easier, the experimenter tells you the program is at most 1 million bits in length. This is not part of Solomonoff induction originally, but accept it for the moment.

"Well", you say, "I could look through all possible programs shorter than million bits, see if they could produce the output I've seen, and throw away all programs that don't. That is, programs that output something different, or get trapped in infinite loops, or crash, or don't even compile. Because, it is an old maxim of mine that when you have excluded the impossible, whatever remains, however improbable, must be the truth. Then I'll let this "truth" run on my own computer, and use the output to predict the next output on the screen.

That's a good idea, but what if you end up with more than one program in the end ? Well, there's no reason to think the experimenter tried to make things particularly easy or complicated, and you're not a big fan of medieval philosophy either, so you decide to split the bets evenly between all the remaining programs. If you'd end up with 5 possible candidate programs, you'd say each one has 1/5 probability of being the right program.
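As a sketch, the brute-force plan looks something like this in Python; `run_program` is a hypothetical stand-in for an interpreter with a step limit, it's not part of the original thought experiment:

```python
from itertools import product

def surviving_programs(observed, max_len, run_program):
    """Yield every bit-string program up to max_len whose output matches what we saw.

    run_program is assumed to return the program's output as a string,
    or None if it crashes, doesn't compile, or runs past some step limit.
    """
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            program = "".join(bits)
            output = run_program(program)
            if output is not None and output.startswith(observed):
                yield program

# Splitting the bets evenly: each survivor gets probability 1 / (number of survivors).
```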

But it may take a while to sift through all the 2^1,000,000 possible programs. So the experimenter has mercy and gives you two sheets of paper containing the printout of two programs that do in fact produce the output you've seen. One is 1,999 bits long (it's titled SHORT), the other 2,000 bits (it's titled LONG). The experimenter also tells you that the two programs embody the only two simple approaches to produce the data you've seen, any other approach would be waaayyy more complicated. SHORT will output Cat next, LONG will output Fish.

You're about to say "there's a 50% chance LONG is the right program, and a 50% ..." but then you hesitate. Because you just found a simple way to create more programs that use the same approaches as LONG and SHORT: Just insert some comments into LONG and SHORT. The comment doesn't even have to be witty, nonsense will do just fine. The output will be the same, and as long as the program stays shorter than 1,000,000 bits, it'll be OK. These will be valid programs, and although they do the same things that LONG and SHORT do, they must be counted as individuals.

You realize you can make a lot of variations of LONG and SHORT this way. With LONG, you have 1,000,000 - 2,000 = 998,000 bits remaining for commentary. With SHORT, you have 1,000,000 - 1,999 = 998,001 bits, that's one extra bit. If you can make a Gazillion comments on LONG, this one extra bit allows you to make two Gazillion comments on SHORT, twice as many.

So within all possible programs of less than 1,000,000 bits of length there are twice as many variants of SHORT as there are of LONG. Consequently, you decide to say "I'll bet 2:1 that the program inside the computer is behaving like the program SHORT, and not like the program LONG. So it's 2:1 for Cat against Fish."
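The counting behind that bet fits in a few lines of Python (just the arithmetic from above, nothing new):

```python
LIMIT = 1_000_000                            # the experimenter's length limit
LEN_SHORT, LEN_LONG = 1_999, 2_000

variants_short = 2 ** (LIMIT - LEN_SHORT)    # comment patterns that still fit after SHORT
variants_long = 2 ** (LIMIT - LEN_LONG)      # comment patterns that still fit after LONG

print(variants_short // variants_long)       # 2 -- one extra free bit, twice as many variants
```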

OK, that's it, in principle. Be aware that the length limit of 1,000,000 bits is imposed only for didactic reasons. The 2:1 ratio would be unchanged if we increased the limit to a Trillion bits - there's still the extra bit available in SHORT, and we can make twice as many comments. So let's ditch the limit altogether. Let's just say being one bit shorter makes a program twice as likely.
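Written as a rule of thumb, that's a weight proportional to 2^(-length). A minimal sketch (the function name is mine, the numbers are the ones from the story):

```python
def betting_odds(shorter_len, longer_len):
    """Odds in favor of the shorter program, doubling for every bit of length difference."""
    return 2.0 ** (longer_len - shorter_len)

print(betting_odds(1_999, 2_000))  # 2.0 -> 2:1 for Cat against Fish, no length limit needed
```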

Also be aware that we have shown no preference for short programs in the beginning. We had no idea whether to expect short or long programs, so for simplicity we decided to split the probability evenly between all programs, irrespective of length. We just put our bets on SHORT in the end because there are more variations of SHORT than there are of LONG.

To rephrase it: If we had sampled random programs of less than 1,000,000 bits, and at the end of the day had ended up with twice as many programs outputting Cat as programs outputting Fish, we'd probably put our bets on Cat being the next output. But what we did was we found a very short Cat program and a slightly longer Fish program. From this we were able to deduce that there must be more Cat programs out there than Fish programs, because the shorter Cat program leaves more room for crazy comments without hitting the length limit (no matter how big the length limit really is).

So you see, the basic idea is really simple: not having the slightest idea beforehand what program to expect means assigning equal probability to all candidate programs. And finding a short program means deducing that there are more variations of the short program than of any longer program - a factor of two for every extra bit - so there are more variations of the short program in our set of candidate programs, and we'll put higher bets on the short program (again, a factor of two for every extra bit).

Nov 19, 2007

Wheelchaired Robot Girl Totally Un-Moemoe IRL.


Canadian robot enthusiast Le Trung's creation Aiko, the "world's first sexually harassed, disabled Fembot" (Engadget), once again vividly demonstrates the Grand-Canyon-like dimensions of the uncanny valley. Watch the video here. Some comments by various posters:


- "Her right hook punch looks promising." - "I, for one, welcome our wheelchair-bound, face-slapping female android overlords." - "Wow. She speaks perfect Engrish." - "OK, so I'm going to finish that underground bunker after all."

I have to admit that this makes me seriously reconsider my own robot-girlfriend project.

Hey, I'm joking.

Honestly.

And if - I'm saying if - I ever were to hypothetically build a robot girl in my basement I surely wouldn't ever sink as low as to cannibalize an Oriental Industries Candy Girl, as Le Trung apparently seems to have done. (A Nana, if you ask me; notice the slightly more protruding chin in Aiko resulting from added motorization, which is in fact difficult to do without...OK, forget what I just said.)

To make today's cup of weirdness full, I found there is also a Candy Girl available that looks bizarrely like often-spaced-out Osaka-San* from Azumanga Daioh, once again nicely illustrating MIT professor Max Tegmark's cosmological theory of radical Platonism, which states that every logically possible entity does in fact exist somewhere in the Universe, most likely in Japan.


( * = It's the 未来; I will not post a link. It's deplorable enough already that my blog is linking to Oriental Industry's main page. Look her up for yourself, if you think you're brave enough. )

Nov 9, 2007

....with science !

Comic artist Aaron Diaz of Dresden Codak fame has decided to quit his day job and work full time on his webcomic. It's nice artwork, it's high-brow, it's fun, and it's got characters you wish you could meet in real life. And DC seems to really understand the hardships of being a Singularitarian. Let's support him through purchasing stuff and through donations ! (It might even get you a place reserved in secular heaven.)

Nov 8, 2007

Science Has No Use For Ockham's Razor

entia non sunt multiplicanda praeter necessitatem.
"Please keep things simple."

(William of Ockham)
(Bertrand Russell)

manus non sunt ventilandae praeter necessitatem.
"Please keep the handwaving down."
(Me)

You know, I've had it with Ockham's razor.

My work in machine learning is more or less orbiting the Solomonoff - Chaitin - Kolmogorov - Hutter - Boulton - Wallace galaxy. This simply means I'm assuming that the data I'm analyzing is the output of a computational process, any computational process. I have no idea whatsoever as to the source code of this process, so I'm trying to assign equal a priori probability to all programs. Now suppose I stumble over two short programs which in fact do output my data. Both programs are 1000 bits long. Let's say the first one is a neural net, and the other's a support vector machine.
Now assume, after playing around with my first program, I find out that only the first 50 bits are in fact important for producing the output. The rest is just random garbage. I could in fact try out all combinations of those remaining 950 bits and get 2^950 different neural nets that all output my data. Now I try the same thing with program two. Here, only the first 49 bits matter, and I could create 2^951 variations of support vector machines, that's twice as many as in the case of program 1. Since I try to assign equal a priori probability to all programs, and possible support vector machines outnumber possible neural nets two-to-one, I'd bet two-to-one for the support vector machine and against the neural net.
Note that the "1000 bits" do not figure into the result; I could just as well have chosen 10,000 bits, or 10 Gazillion bits. Also, if the first program had been 723 bits instead of 1000, I could have just padded it with 277 extra garbage bits to make it as long as the second. The argument stays the same. We're cutting a few corners here, but the basic idea is that, when you have to assign probabilities to various models, you calculate the number of bits absolutely necessary to produce your models, and penalize all models but the shortest by a relative factor of 0.5 for every bit of extra length. Let me repeat it, this is just a consequence of assuming the true process that's creating your data (the "world") is a program, any program, and before having seen the data, you have no idea whatsoever which program. Simple, isn't it ?
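Here's a minimal Python sketch of that length penalty, using the made-up 50-bit vs 49-bit example from above (the function name is mine, not anyone's official API):

```python
def relative_weights(min_lengths):
    """Map model name -> probability, penalizing each extra necessary bit by a factor of 0.5."""
    shortest = min(min_lengths.values())
    raw = {name: 2.0 ** (shortest - k) for name, k in min_lengths.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

# The hypothetical example from the text: 50 essential bits vs 49.
print(relative_weights({"neural net": 50, "support vector machine": 49}))
# -> roughly {'neural net': 0.33, 'support vector machine': 0.67}, i.e. 2:1 odds
```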

Welcome to the world of Solomonoff induction.

The attentive reader might have noticed the complete absence of any reference to Ockham in the above explanation. What Ockham himself really intended to say is not entirely clear, nor is it actually too clear what people today mean when they invoke his name. To repeat it once again, the reason we penalize long models, or theories, in Solomonoff induction, is because we don't know a priori which program created our observation. It's not like we have anything against long models, or that we said hey, remember Ockham! Sure, what we've ended up with seems to go along somewhat with Ockham's razor, but we notice this after we got our results. So if anything, you could try to say Solomonoff induction explains why Ockham's razor works, and not the other way round. But don't, for it doesn't.
To illustrate this, think of the two hypotheses "Afro-Americans get comparatively few PhDs because of [a complicated interplay of socioeconomic factors]" and "Afro-Americans get comparatively few PhDs because they don't have the intelligence gene X." Shooting from the hip, people would say the second hypothesis is simpler. Is it ?
How the hell should I know !! Imagine just for a moment trying to translate those two verbal statements into computer programs which produce the data in question. The data in question being human academic achievement. PhD theses. Social interactions. Application interviews. Then imagine what has to be included in the program's source code: Human genetics, human brain structure, social dynamics, macroeconomic systems...We're talking at least gigabits of data here. Trying to estimate the length of such huge programs down to a few bits is like doing a bit of lattice quantum chromodynamics in your head in order to estimate the proton mass. Humans simply can't do this. If you can, give me a call. I have a job for you.
So the connection between the rigorous theory that is Solomonoff Induction, and the intuitive insight that is Ockham's razor is tentative at best. OK, nonexistent. The same goes for machine learning theories like minimum message length (MML), minimum description length (MDL), or the Akaike information criterion (AIC), which can all be shown to be approximations of Solomonoff induction.
Then why do so many people, even those working in the very field, handwavingly invoke Ockham as the forefather of their discipline ?

Ockham’s Razor has long been known as a philosophical paradigm, and in recent times, has become an invaluable tool of the machine learning community. (link)

Algorithmic probability [comment: the theory behind Solomonoff induction] rests upon two philosophical principles ... [...] The second philosophical foundation is the principle of Occam's razor. (link)

Or this, that, and many more examples ?

Let me make it clear that I really respect the authors quoted above as scientists (the author of the second quote contributed fundamentally to the field of algorithmic probability theory himself !). But really, I cannot imagine any other reasons for summoning Ockham in this context than the desire to look humanistic, or philosophical, the desire to make students nod in "comprehension", or, I'm sorry, a bit of muddled thinking.

OK, so let's make it clear once more:
  • Ockham's razor is an intuitive philosophical insight.
  • Ockham's razor is NOT the underlying principle of Solomonoff induction. It may have been an inspiration to Solomonoff, but so may have been, say, talking to Marvin Minsky. Note also the complete absence of the name "Ockham" (or Occam) in this talk.
  • MML, MDL, AIC, MAP, and even least-squares approaches to theory formation can all be derived from Solomonoff induction. Logically, Ockham's razor is NOT the underlying principle of any of these theories.
  • Solomonoff induction is NOT a "formalization" of Ockham's razor. Solomonoff induction does NOT prove Ockham's razor is useful.
  • Ockham's razor is NOT an empirical observation. It's a maxim, a rule of thumb, a heuristic. Its usefulness can in fact be debated, since it's a rubberband rule, i.e. you can stretch it into various sizes and shapes. Your intuitive notion of simplicity may not be the same as mine. In the end, we're back to gut feeling.
  • Ockham's razor is intended for use by human beings. You cannot really translate it into a rigorous mathematical statement. In particular Solomonoff induction is not a "version" of Ockham's razor.
  • MDL, MML, MAP, AIC are valid mathematical approaches to scientific data analysis. A scientist should not defend the use of these methods by invoking Ockham's razor. And if a scientist invokes Ockham's razor in a non-mathematical situation, be aware he's essentially talking about his gut feeling.

Nov 6, 2007

Autumn Foliage in Austria


This photo shows the beech trees at St. Florian Monastery. By the way, the composer Anton Bruckner was born near this place. Greetings to my Japanese readers !