entia non sunt multiplicanda praeter necessitatem.
"Please keep things simple."
manus non sunt ventilandae praeter necessitatem.
"Please keep the handwaving down."
You know, I've had it with Ockham's razor.
My work in machine learning is more or less orbiting the Solomonoff - Chaitin - Kolmogorov - Hutter - Boulton - Wallace galaxy. This simply means I'm assuming that the data I'm analyzing is the output of a computational process, any computational process. I have no idea whatsoever as to the source code of this process, so I'm trying to assign equal a priori probability to all programs. Now suppose I stumble over two short programs which do in fact output my data. Both programs are 1000 bits long. Let's say the first one is a neural net, and the other is a support vector machine.
Now assume that, after playing around with my first program, I find out that only the first 50 bits are in fact important for producing the output. The rest is just random garbage. I could in fact try out all combinations of those remaining 950 bits and get 2^950 different neural nets that all output my data. Now I try the same thing with program two. Here, only the first 49 bits matter, and I could create 2^951 variations of support vector machines; that's twice as many as in the case of program one. Since I try to assign equal a priori probability to all programs, and possible support vector machines outnumber possible neural nets two-to-one, I'd bet two-to-one for the support vector machine and against the neural net.
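The counting argument can be checked by brute force on a toy scale. What follows is only a sketch under loud assumptions: a "program" is just a bit string of 12 bits instead of 1000, a program "outputs the data" exactly when it starts with a given significant prefix, and the two prefixes and the function name `count_matching_programs` are invented for illustration.

```python
from itertools import product

def count_matching_programs(prefix: str, total_bits: int) -> int:
    """Count all bit strings of length total_bits that start with prefix."""
    count = 0
    for bits in product("01", repeat=total_bits):
        if "".join(bits).startswith(prefix):
            count += 1
    return count

TOTAL_BITS = 12                # toy stand-in for the 1000 bits in the text
neural_net_prefix = "10110"    # hypothetical: 5 significant bits
svm_prefix = "0111"            # hypothetical: 4 significant bits, one fewer

nn_count = count_matching_programs(neural_net_prefix, TOTAL_BITS)
svm_count = count_matching_programs(svm_prefix, TOTAL_BITS)

# One significant bit fewer means twice as many programs that fit.
print(nn_count, svm_count, svm_count / nn_count)  # 128 256 2.0
```

With equal prior weight on every 12-bit string, the model with the shorter significant prefix collects twice the probability mass, which is the two-to-one bet described above.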
Note that the "1000 bits" do not figure into the result. I could just as well have chosen 10,000 bits, or 10 gazillion bits. Also, if the first program had been 723 bits instead of 1000, I could have just padded it with 277 extra garbage bits to make it as long as the second. The argument stays the same. We're cutting a few corners here, but the basic idea is that, when you have to assign probabilities to various models, you calculate the number of bits absolutely necessary to produce each model, and penalize all models but the shortest by a relative factor of 0.5 for every bit of extra length. Let me repeat it: this is just a consequence of assuming the true process that's creating your data (the "world") is a program, any program, and that before having seen the data, you have no idea whatsoever which program. Simple, isn't it?
Welcome to the world of Solomonoff induction.
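The half-per-extra-bit rule is easy to write down. Here is a minimal sketch, assuming the number of essential bits for each candidate model is already known (which is the hard part in practice) and that the listed models are the only candidates, so the weights can be normalized. The function name `model_probabilities` and the model names are illustrative, not from any library.

```python
from fractions import Fraction
from typing import Dict

def model_probabilities(essential_bits: Dict[str, int]) -> Dict[str, Fraction]:
    """Weight each model by 0.5 per bit beyond the shortest, then normalize."""
    shortest = min(essential_bits.values())
    weights = {
        name: Fraction(1, 2 ** (bits - shortest))
        for name, bits in essential_bits.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# The example from the text: 50 essential bits versus 49.
probs = model_probabilities({"neural_net": 50, "svm": 49})
for name, p in probs.items():
    print(name, p)
# neural_net 1/3
# svm 2/3
```

Only the difference in essential bits enters the result, so the total program length and any amount of padding drop out, as argued above.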
The attentive reader might have noticed the complete absence of any reference to Ockham in the above explanation. What Ockham himself really intended to say is not entirely clear, nor is it all that clear what people today mean when they invoke his name. To repeat it once again, the reason we penalize long models, or theories, in Solomonoff induction is that we don't know a priori which program created our observation. It's not like we have anything against long models, or that we said hey, remember Ockham! Sure, what we've ended up with seems to go along somewhat with Ockham's razor, but we notice this only after we get our results. So if anything, you could try to say Solomonoff induction explains why Ockham's razor works, and not the other way round. But don't, for it doesn't.
To illustrate this, think of the two hypotheses "Afro-Americans get comparatively few PhDs because of [a complicated interplay of socioeconomic factors]" and "Afro-Americans get comparatively few PhDs because they don't have the intelligence gene X." Shooting from the hip, people would say the second hypothesis is simpler. Is it?
How the hell should I know!! Imagine just for a moment trying to translate those two verbal statements into computer programs which produce the data in question. The data in question being human academic achievement. PhD theses. Social interactions. Application interviews. Then imagine what has to be included in the program's source code: human genetics, human brain structure, social dynamics, macroeconomic systems... We're talking at least gigabits of data here. Trying to estimate the length of such huge programs down to a few bits is like doing a bit of lattice quantum chromodynamics in your head in order to estimate the proton mass. Humans simply can't do this. If you can, give me a call. I have a job for you.
So the connection between the rigorous theory that is Solomonoff induction and the intuitive insight that is Ockham's razor is tenuous at best. OK, nonexistent. The same goes for machine learning theories like minimum message length (MML), minimum description length (MDL), or the Akaike information criterion (AIC), which can all be shown to be approximations of Solomonoff induction.
Then why do so many people, even those working in the very field, handwavingly invoke Ockham as the forefather of their discipline?
Ockham’s Razor has long been known as a philosophical paradigm, and in recent times, has become an invaluable tool of the machine learning community. (link)
Algorithmic probability [comment: the theory behind Solomonoff induction] rests upon two philosophical principles... The second philosophical foundation is the principle of Occam's razor. (link).
Let me make it clear that I really respect the authors quoted above as scientists (the author of the second quote contributed fundamentally to the field of algorithmic probability theory himself!). But really, I cannot imagine any other reasons for summoning Ockham in this context than the desire to look humanistic, or philosophical, the desire to make students nod in "comprehension", or, I'm sorry, a bit of muddled thinking.
OK, so let's make it clear once more:
- Ockham's razor is an intuitive philosophical insight.
- Ockham's razor is NOT the underlying principle of Solomonoff induction. It may have been an inspiration to Solomonoff, but so may have been, say, talking to Marvin Minsky. Note also the complete absence of the name "Ockham" (or Occam) in this talk.
- MML, MDL, AIC, MAP, and even least-squares approaches to theory formation can all be derived from Solomonoff induction. Logically, Ockham's razor is NOT the underlying principle of any of these theories.
- Solomonoff induction is NOT a "formalization" of Ockham's razor. Solomonoff induction does NOT prove Ockham's razor is useful.
- Ockham's razor is NOT an empirical observation. It's a maxim, a rule of thumb, a heuristic. Its usefulness can in fact be debated, since it's a rubberband rule, i.e. you can stretch it into various sizes and shapes. Your intuitive notion of simplicity may not be the same as mine. In the end, we're back to gut feeling.
- Ockham's razor is intended for use by human beings. You cannot really translate it into a rigorous mathematical statement. In particular, Solomonoff induction is not a "version" of Ockham's razor.
- MDL, MML, MAP, AIC are valid mathematical approaches to scientific data analysis. A scientist should not defend the use of these methods by invoking Ockham's razor. And if a scientist invokes Ockham's razor in a non-mathematical situation, be aware he's essentially talking about his gut feeling.