Nov 23, 2007

A Simple Heuristic Explanation of Solomonoff Induction

Solomonoff induction, named after its inventor Ray Solomonoff, is a mathematical method that describes how to take some set of observations and produce an educated guess about possible processes underlying that observations. You could then use that guess, among else, to make a prediction about your next observation. The great thing is that this guess would, in a very general way, be better than any other guess anyone could make using just the same observations. Solomonoff induction is a simple and useful, yet widely misunderstood idea. Here I'm trying to give a very short, heuristic explanation of the basics. You can also try out this slightly more challenging explanation. A follow-up post to this one will deal with possible applications of Solomonoff's results.

Something simple beforehand: A string of 5 bits can take 2 x 2 x 2 x 2 x 2 =2^5 configurations. A string of n bits can take 2^n configurations. A string of n+1 bits can take 2^(n+1) configurations, that's an additional factor of two over 2^n. So if we have one extra bit of length, we can make twice as many different strings.

Imagine the following task: You are sitting in a lab in front of a computer screen. The computer is running a simple program, but you have no idea what the program looks like. The program is printing output after output on the screen. After a while, you should give an educated guess about the unknown program's next output.

To make things a little easier, the experimentator is telling you the program is at most 1 million bits in length. This is not part of Solomonoff induction originally, but accept it for the moment.

"Well", you say, "I could look through all possible programs shorter than million bits, see if they could produce the output I've seen, and throw away all programs that don't. That is, programs that output something different, or get trapped in infinite loops, or crash, or don't even compile. Because, it is an old maxim of mine that when you have excluded the impossible, whatever remains, however improbable, must be the truth. Then I'll let this "truth" run on my own computer, and use the output to predict the next output on the screen.

That's a good idea, but what if you end up with more than one program in the end ? Well, there's no reason to think the experimentator tried to make things particularly easy or complicated, and you're not a big fan of medieval philosophy either, so you decide to split the bets evenly between all the remaining programs. If you'd end up with 5 possible candidate programs, you'd say each one has 1/5 probability of being the right program.

But it may take a while to sift through all the 2^1.000.000 possible programs. So the experimentator has mercy and gives you two sheets of paper containing the printout of two programs that do in fact produce the output you've seen. One is 1.999 bits long (it's titled SHORT), the other 2.000 bits (it's titled LONG). The experimentator also tells you that the two programs embody the only two simple approaches to produce the data you've seen, any other approach would be waaayyy more complicated. SHORT will output Cat next, long will output Fish.

You're about to say "there's a 50% chance LONG is the right program, and a 50% ..." but then you hesitate. Because you just found a simple way to create more programs that are using the same approaches as LONG and SHORT: Just insert some comments into LONG and SHORT. The comment doesn't even have to be witty, nonsense will do just fine. The output will be the same, and if the program is shorter than 1.000.000 bits, it'll be OK. These will be valid programs, and although they do the same things that LONG and SHORT do, they must be counted as individuals.

You realize you can make a lot of variations of LONG and SHORT this way. With LONG, you have 1.000.000 - 2.000 = 1.998.000 bits remaining for commentary. With SHORT, you have 1.000.000 - 1.999 = 1.998.001 bits, that's one extra bit. If you can make a Gazillion comments on LONG, this one extra bit allows you to make two Gazillion comments on SHORT, twice as many.

So within all possible programs of less than 1.000.000 bits of length there are twice as many variants of SHORT as there are of LONG. Consequently, you decide to say "I'll bet 2:1 that the program inside the computer is behaving like the program SHORT, and not like the program LONG. So it's 2:1 for Cat against Fish."

OK, that's it, in principle. Be aware that the length limit of 1.000.000 bits is imposed only for didactic reasons. The 2:1 ratio would be unchanged if we increased the limit to a Trillion bits - there's still the extra bit available in SHORT, and we can make twice as many comments. So let's ditch the limit altogether. Let's just say being one bit shorter makes a program twice as likely.

Be also aware that we have shown no preference for short programs in the beginning. We had no idea whether to expect short or long programs, so for simplicity we decided to split the probability even between all programs, irrespective of length. We just put our bets on SHORT in the end because there are more variations of SHORT than there are of long.

To rephrase it: If we had sampled random programs of less than 1.000.000 bits, and at the end of the day had ended up with twice as many programs outputting Cat than programs outputting Fish, we'd probably put our bets on Cat being the next output. But what we did was we found a very short Cat program and a slightly longer Fish program. From this we were able to deduce that there must be more Cat programs out there than Fish programs, because the shorter Cat program leaves more room for crazy comments without hitting the length limit (no matter how big the length limit really is).

So you see the basic idea is really simple - not having the slightest idea beforehand what program to expect means assigning equal probability to all candidate programs. And finding a short program means deducing that there are more variations of the short program than of any longer program - a factor of two for every extra bit - so there are more variations of the short program in our set of candidate programs, so we'll put higher bets on the short program (again, a factor of two for every extra bit.)

1 comment:

Daniel kokotajlo said...

FYI, this is the ONLY explanation of Solomonoff Induction that I've seen that actually convinces me it's a good idea. Every other explanation I've seen just does hand-wavy stuff and says "Shorter programs are more likely, obviously..." Sometimes they start talking about the probability of generating a given program with a coin flip, which I'm pretty sure is not how it works. (Then again, I only think that because I really like your explanation of things.)

Anyhow, thanks!