Nov 8, 2007

Science Has No Use For Ockham's Razor

entia non sunt multiplicanda praeter necessitatem.
"Please keep things simple."

(William of Ockham)
(Bertrand Russell)

manus non sunt ventilandae praeter necessitatem.
"Please keep the handwaving down."
(Me)

You know, I've had it with Ockham's razor.

My work in machine learning is more or less orbiting the Solomonoff - Chaitin - Kolmogorov - Hutter - Boulton - Wallace galaxy. This simply means I assume that the data I'm analyzing is the output of a computational process, any computational process. I have no idea whatsoever as to the source code of this process, so I try to assign equal a priori probability to all programs. Now suppose I stumble upon two short programs which do in fact output my data. Both programs are 1000 bits long. Let's say the first one is a neural net, and the other is a support vector machine.
Now assume that, after playing around with my first program, I find out that only the first 50 bits are actually important for producing the output. The rest is just random garbage. I could in fact try out all combinations of those remaining 950 bits and get 2^950 different neural nets that all output my data. Now I try the same thing with program two. Here, only the first 49 bits matter, and I could create 2^951 variations of support vector machines, which is twice as many as in the case of program one. Since I assign equal a priori probability to all programs, and possible support vector machines outnumber possible neural nets two-to-one, I'd bet two-to-one for the support vector machine and against the neural net.
Note that the "1000 bits" do not figure into the result; I could just as well have chosen 10,000 bits, or 10 gazillion bits. Also, if the first program had been 723 bits long instead of 1000, I could have just padded it with 277 extra garbage bits to make it as long as the second. The argument stays the same. We're cutting a few corners here, but the basic idea is that, when you have to assign probabilities to various models, you calculate the number of bits absolutely necessary to produce each model, and penalize every model but the shortest by a relative factor of 0.5 for every bit of extra length. Let me repeat: this is just a consequence of assuming that the true process creating your data (the "world") is a program, any program, and that before having seen the data, you have no idea whatsoever which program. Simple, isn't it?
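To make the bookkeeping concrete, here's a toy Python sketch of the counting argument. The program names and bit counts are the hypothetical ones from above, and real Solomonoff induction is of course incomputable, so this only illustrates the arithmetic:

```python
from fractions import Fraction

def relative_weight(essential_bits: int, total_bits: int) -> Fraction:
    """Fraction of all 2**total_bits programs of this length that share a
    given essential prefix; the padding bits cancel, leaving 2**(-essential_bits)."""
    free_bits = total_bits - essential_bits
    return Fraction(2**free_bits, 2**total_bits)

for total_bits in (1000, 10_000):
    w_nn = relative_weight(50, total_bits)    # hypothetical neural net: 50 essential bits
    w_svm = relative_weight(49, total_bits)   # hypothetical SVM: 49 essential bits
    print(total_bits, w_svm / w_nn)           # always 2: bet two-to-one on the SVM
```

The odds come out as two-to-one for either total length, which is exactly why the "1000 bits" never figure into the result.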

Welcome to the world of Solomonoff induction.

The attentive reader might have noticed the complete absence of any reference to Ockham in the above explanation. What Ockham himself really intended to say is not entirely clear, nor is it particularly clear what people today mean when they invoke his name. To repeat it once again: the reason we penalize long models, or theories, in Solomonoff induction is that we don't know a priori which program created our observation. It's not that we have anything against long models, or that we said, hey, remember Ockham! Sure, what we've ended up with seems to go along somewhat with Ockham's razor, but we notice this only after we've got our results. So if anything, you could try to say Solomonoff induction explains why Ockham's razor works, and not the other way round. But don't, for it doesn't.
To illustrate this, think of the two hypotheses "Afro-Americans get comparatively few PhDs because of [a complicated interplay of socioeconomic factors]" and "Afro-Americans get comparatively few PhDs because they don't have the intelligence gene X." Shooting from the hip, people would say the second hypothesis is simpler. Is it?
How the hell should I know?! Imagine just for a moment trying to translate those two verbal statements into computer programs which produce the data in question. The data in question being human academic achievement. PhD theses. Social interactions. Application interviews. Then imagine what has to be included in the programs' source code: human genetics, human brain structure, social dynamics, macroeconomic systems... We're talking at least gigabits of data here. Trying to estimate the length of such huge programs down to a few bits is like doing a bit of lattice quantum chromodynamics in your head in order to estimate the proton mass. Humans simply can't do this. If you can, give me a call. I have a job for you.
So the connection between the rigorous theory that is Solomonoff induction and the intuitive insight that is Ockham's razor is tentative at best. OK, nonexistent. The same goes for machine learning theories like minimum message length (MML), minimum description length (MDL), or the Akaike information criterion (AIC), which can all be shown to be approximations of Solomonoff induction.
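If you want to see the shared bookkeeping rather than take my word for it, here is a hypothetical two-part-code comparison in the MDL spirit (the bit counts are invented, and this is not any particular published coding scheme): total description length is model bits plus data bits given the model, and converting lengths back into weights reproduces the factor of two per extra bit.

```python
def description_length(model_bits: float, data_bits_given_model: float) -> float:
    """Two-part code length in bits; smaller is better."""
    return model_bits + data_bits_given_model

# Two hypothetical models that fit the data equally well (say, 200 residual bits each),
# one needing 50 bits to state and the other 49.
score_a = description_length(50, 200)   # 250 bits
score_b = description_length(49, 200)   # 249 bits

# Turning code lengths back into relative weights gives the same
# two-to-one odds as the program-counting argument above.
print(2 ** (score_a - score_b))         # 2
```

None of this needs Ockham's name anywhere; the penalty falls out of the counting.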
Then why do so many people, even those working in the very field, handwavingly invoke Ockham as the forefather of their discipline?

Ockham's Razor has long been known as a philosophical paradigm, and in recent times, has become an invaluable tool of the machine learning community. (link)

Algorithmic probability [comment: the theory behind Solomonoff induction] rests upon two philosophical principles...[]...The second philosophical foundation is the principle of Occam's razor. (link).

Or this, that, and many more examples?

Let me make it clear that I really respect the authors quoted above as scientists (the author of the second quote contributed fundamentally to the field of algorithmic probability theory himself!). But really, I cannot imagine any other reasons for summoning Ockham in this context than the desire to look humanistic, or philosophical, the desire to make students nod in "comprehension", or, I'm sorry, a bit of muddled thinking.

OK, so let's make it clear once more:
  • Ockham's razor is an intuitive philosophical insight.
  • Ockham's razor is NOT the underlying principle of Solomonoff induction. It may have been an inspiration to Solomonoff, but so may have been, say, talking to Marvin Minsky. Note also the complete absence of the name "Ockham" (or Occam) in this talk.
  • MML, MDL, AIC, MAP, and even least-squares approaches to theory formation can all be derived from Solomonoff induction. Logically, Ockham's razor is NOT the underlying principle of any of these theories.
  • Solomonoff induction is NOT a "formalization" of Ockham's razor. Solomonoff induction does NOT prove that Ockham's razor is useful.
  • Ockham's razor is NOT an empirical observation. It's a maxim, a rule of thumb, a heuristic. Its usefulness can in fact be debated, since it's a rubber-band rule, i.e. you can stretch it into various sizes and shapes. Your intuitive notion of simplicity may not be the same as mine. In the end, we're back to gut feeling.
  • Ockham's razor is intended for use by human beings. You cannot really translate it into a rigorous mathematical statement. In particular Solomonoff induction is not a "version" of Ockham's razor.
  • MDL, MML, MAP, and AIC are valid mathematical approaches to scientific data analysis. A scientist should not defend the use of these methods by invoking Ockham's razor. And if a scientist invokes Ockham's razor in a non-mathematical situation, be aware that he's essentially talking about his gut feeling.

3 comments:

Anonymous said...

So, you're saying, by Occam's Razor, we have to remove Occam's Razor from our explanation for why Solomonoff induction works? ;-)

Manuel Moertelmaier said...

Well, there's indeed a point to that. If we *could* use Solomonoff induction in everyday situations, Occam's Razor would declare itself to be a superfluous "explanation" of sorts. (Compare EURISKO's famed "it's a good idea to throw out a randomly chosen heuristic" heuristic, which by chance was applied first (and last) to itself, or my ad-hoc-invented "Schmirgelberger's Razor", which states that "generic rules pulled out of thin air are generally bullshit, this being especially true for Schmirgelberger's Razor", etc.)

But in reality Solomonoff induction, though theoretically optimal, is absolutely impractical in everyday reasoning. One could argue that Occam's Razor embodies something akin to the MDL (MML) principle applied to an extremely complex description language (the "language of human thought", in poetic terms.) The ecological validity of the MDL principle can be argued to stem from the optimality of Solomonoff induction, plus the, again poetic, statement that "evolution did a good job on us." We're treading on really thin ice here, and have long left the ground of mathematically justifiable statements. In fact, I cannot give a stringent reason why Occam's Razor is useful. I can't even provide any hard evidence that it *is* in fact useful. Maybe slightly more complex explanations are better suited to be processed by (human) neural networks, or maybe they lead to new creative insights more quickly etc. I simply don't know. Occam's Razor is, and that is my big point in this post, a piece of armchair reasoning, and nothing more, due to the impossibility of rigorously defining what is "simple" in terms of human "explanations". It's justifiable to use the MDL principle in mathematical data analysis, but if you're talking about who murdered Lady Abigail at midnight, you shalt not invoke the name of Solomonoff.

So if Solomonoff induction can't explain why Occam's Razor works (does it?), how about the other way round? This idea can be quickly discarded. Solomonoff equals Bayes plus the assumption that our input from the world is produced by a computable process. Nothing more here, no initial preference for simplicity at all. Solomonoff induction could very well have been invented by a mathematically competent civilization that never produced any figure like Occam, even if, admittedly, Occam's work may have been an inspiration to Ray Solomonoff in our world.

Besides, I'm not feeling too comfortable with the term "works" here, as Solomonoff induction is in fact incomputable. It's more like the north star - it tells you in which direction to go, but you shouldn't think you can actually reach it.

Anonymous said...

We can use Solomonoff in certain special cases such as where the disparity is HUGE (God vs. gravity) or where we have one theory included in another.