**Sketch of a Possible Explanation Why Occam’s Razor Works…**

*(Though motivated by deep questions in philosophy, this is a speculative math-y blog post; non-technically oriented readers beware…)*

How can, or should, an intelligent mind make sense of the
firehose of complex, messy data that its sensors feed it? Minds recognize patterns, but generally
there are many, many patterns in the data coming into a mind, and figuring out
which data to pay attention to is a significant problem. Two major aspects of this problem are: figuring
out which of the patterns that have occurred in one context are more likely to
occur in other similar contexts, and figuring out which of the patterns that
have occurred in the past are more likely to occur in the future.

One informal principle that seems broadly useful for solving
this “pattern selection” problem is Occam's Razor. This principle is commonly taken as a key
ingredient in the scientific method – it plays a central role in many philosophies
of science, including the “social computational probabilist” philosophy I’ve
presented here and in *The Hidden Pattern*.
Occam’s Razor has been around a while (well before its
namesake, William of Ockham) and has been posited in multiple forms, e.g.:

*“Nature operates in the shortest way possible” -- Aristotle, 384–322 BC*

*“We consider it a good principle to explain the phenomena by the simplest hypothesis possible.” -- Ptolemy, c. AD 90–168*

*“Plurality must never be posited without necessity” -- William of Ockham, c. 1287-1347*

*“Entities must not be multiplied beyond necessity” -- John Punch, 1639*

*“Everything should be as simple as it can be, but not simpler” -- Albert Einstein (paraphrased by Roger Sessions)*

The modern form of the Razor, as used in discussions of
scientific methodology and philosophy of science, could be phrased something
like:

*Given two explanations that explain the same data, the simpler one should be preferred*

or as Sklar (1977) phrased it,

*In choosing a scientific hypothesis in light of evidential data, we must frequently add to the data some methodological principle of simplicity in order to select out as “preferred” one of the many different possible hypotheses, all compatible with the specified data.*

This principle is often taken for granted, and has a certain
intuitive ring of truth to it -- but why should we actually accept it? What makes it true?

**Arguments for Occam’s Razor**

Perhaps the most compelling argument for Occam's Razor is
the theory of Solomonoff induction, generalized more recently by Marcus Hutter into a theory of Universal
AI. This theory shows, very roughly
speaking, that the shortest computer program computing a
set of data is the best explanation for that data, where "best" is
defined in terms of accurate prediction of missing or future parts of the
dataset. This is very elegant, but the
catch is that it effectively applies only to fairly large datasets, because it
relies heavily on the fact that, in the limit of large datasets, the shortest program
explaining a dataset in one programming language is *approximately* the same length as the shortest program explaining that same dataset in any other programming language. That is, it assumes one is in a regime where the computational cost of simulating one programming language (or computer) using another is a trivial matter to be brushed aside.

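
As a crude, computable illustration of the “shortest program” idea, compressed length can stand in for program length. This is only a rough proxy for algorithmic complexity, and zlib is an arbitrary choice of “language,” but it conveys the intuition that patterned data admits a much shorter description than noise:

```python
import random
import zlib

rng = random.Random(0)

patterned = ("01" * 500).encode()                       # highly regular data
noisy = bytes(rng.getrandbits(8) for _ in range(1000))  # incompressible data

# Compressed size is a crude, computable stand-in for "shortest program"
print(len(zlib.compress(patterned)))  # small: a short "explanation" exists
print(len(zlib.compress(noisy)))      # near 1000: no simpler explanation
```

The exact compressed sizes depend on the compressor chosen, which is precisely the language-dependence issue the Solomonoff approach brushes aside in the large-data limit.
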
There are other approaches to justifying Occam's Razor as
well. The Akaike Information Criterion
(AIC) formalizes the balance between simplicity and goodness-of-fit that is
required to achieve extrapolation without overfitting. However,
the derivations underlying the AIC and its competitor, the Bayesian
Information Criterion (BIC), hold only for large datasets. The AICc version works for small datasets but
relies on special distributional assumptions.
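
For concreteness, the three criteria just mentioned can be sketched as follows, taking the number of parameters k, the sample size n, and the maximized log-likelihood as inputs. The toy numbers at the end are invented purely for illustration:

```python
import math

def aic(k, log_likelihood):
    # Akaike Information Criterion: 2k - 2 ln(L-hat)
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    # Bayesian Information Criterion: k ln(n) - 2 ln(L-hat)
    return k * math.log(n) - 2 * log_likelihood

def aicc(k, n, log_likelihood):
    # Small-sample corrected AIC (requires n > k + 1)
    return aic(k, log_likelihood) + (2 * k * (k + 1)) / (n - k - 1)

# Two hypothetical models fit to the same data: the complex one fits
# slightly better, but the penalty on its extra parameters dominates.
print(aic(2, -30.0))  # → 64.0  (simpler model: preferred, lower score)
print(aic(5, -29.0))  # → 68.0  (more complex model)
```

In all three criteria, lower scores are better, so the simplicity penalty must be outweighed by a genuine improvement in fit before a larger model is preferred.
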

There is also Kelly's (2007, 2010) interesting argument that
the shortest hypothesis is, under certain assumptions, the one that will
require one to change one's mind least often upon exposure to new data. This, however, raises the question
of why it's so bad to change one's mind when exposed to new data. Kelly's proofs also seem to get bogged down
in various technical intricacies and conditions, though these may ultimately be resolvable in a
clean way.

Here I present a new
argument for Occam's Razor, which appears to work for small as well as
large datasets, and which is based on the statistical notion of
subsampling. At present the argument is
just a sketch, yet to be filled out into a formal proof. In essence, what is done is to construct a
particular computational model, based on the properties of a given dataset, and
argue that using Occam's Razor relative to this sort of computational model
leads to good explanations for any dataset, large or small. As the size of the dataset increases, the
explanatory advantage gained by choosing a dataset-guided computational model
for using Occam's Razor decreases, and the choice of computational model
becomes increasingly arbitrary.

**Sketch of a New Argument why Occam’s Razor Works**

I’ve been thinking for a while about a new, somewhat
different argument for why Occam’s Razor makes sense and is a good idea. I haven’t found time to write up a formal
proof, so I’m just going to sketch my “proof idea” here. Eventually maybe I’ll find time to try to
turn this into a real proof, which may well run into some gotchas, restrictive
conditions or new discoveries…. Or maybe
some other brave and noble soul will take the bait and try to make a real proof
based on my idea…

Ok, so -- the crux of the argument I’m thinking of is as
follows.

Consider, for simplicity, a binary classification problem,
involving a dataset D living within a
data universe U, and a "ground truth" mapping F: U --> {0,1}. This is not the only context my proposed
argument applies to, but it’s a simple context for explaining the basic idea.

Then, consider two sets of functions S1 and S2, both learned
via application of some learning algorithm L to study D (and not the rest of
F). Suppose:

- The functions in S1 have accuracy a across D, and have size s1
- The functions in S2 have accuracy a across D, and have size s2 > s1

Then the proposition I make is: On average, functions in S1
will give a higher accuracy on F than functions in S2 (**).

Why would (**) be true?

I believe it can probably be demonstrated via induction on
the size of D.

Suppose (**) holds for datasets D^* that are smaller than D. Then, suppose we apply crossvalidation to
assess the value of functions in S1 and S2; that is, we run a series of
experiments in which we partition D into 2 subsets (D1, D2) and then apply L to
learn classification functions on D1, and test these functions on D2. These experiments will yield many functions
that don't belong to either S1 or S2, and also some that do belong to these sets. According to the inductive hypothesis: on
average the functions L finds (on the validation folds) belonging to S1 will have greater accuracy across D as
compared to those that L finds (on the validation folds) belonging to S2
(***).

But then from (***) and the basic theory of crossvalidation
(which says that hypotheses doing better on the test portions of validation
folds, tend to do better out of sample), we derive (**).
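
The fold procedure invoked here can be sketched as follows. The learning algorithms `fit_simple` and `fit_memorize` are invented stand-ins for L, producing a small model and a maximally large one respectively, and the dataset is a toy one labeled by its first feature:

```python
import itertools
import random

def cv_score(fit, D, folds=4, seed=0):
    # Average held-out accuracy over `folds` random (D1, D2) partitions of D
    rng = random.Random(seed)
    scores = []
    for _ in range(folds):
        d = D[:]
        rng.shuffle(d)
        cut = len(d) // 2
        D1, D2 = d[:cut], d[cut:]   # learn on D1, test on D2
        model = fit(D1)
        scores.append(sum(model(x) == y for x, y in D2) / len(D2))
    return sum(scores) / len(scores)

def fit_simple(D1):
    # Small model: predict the single feature that best matches the labels
    k = len(D1[0][0])
    best = max(range(k), key=lambda i: sum(x[i] == y for x, y in D1))
    return lambda x: x[best]

def fit_memorize(D1):
    # Large model: a lookup table over D1 (maximal size, no generalization)
    table = dict(D1)
    return lambda x: table.get(x, 0)

# Toy dataset: 4-bit vectors labeled by their first feature
D = [(x, x[0]) for x in itertools.product([0, 1], repeat=4)]
print(cv_score(fit_simple, D), cv_score(fit_memorize, D))
# The simple learner scores higher held-out accuracy on the validation folds
```

The point of the sketch is the one the argument leans on: a model family that does better on the held-out portions of the folds is the one crossvalidation theory tells us to trust out of sample.
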

The question then becomes how to start the inductive
hypothesis off. What is the base case?

Well, one possibility
is for the base case to be the situation where D contains two elements d0 and
d1, with F(d0) = 0 and F(d1) = 1. To
understand this case, suppose that the data elements d (in D) each have k
internal features. Based on comparing
d0 and d1, there is no way for L to identify dependencies between the different
internal features. Thus, the models most
likely to give good accuracy on F are single-feature models. But these are smaller than any other
models. Thus in this (somewhat degenerate)
base case, smaller models are better.
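
This two-element base case can be made concrete with a tiny enumeration (the particular d0, d1, and feature count are invented for illustration):

```python
K = 4
d0 = (0, 0, 0, 0)   # F(d0) = 0
d1 = (1, 0, 1, 0)   # F(d1) = 1

# Single-feature models f(x) = x[i] that classify both points correctly
fits = [i for i in range(K) if d0[i] == 0 and d1[i] == 1]
print(fits)  # → [0, 2]
# Every feature on which d0 and d1 differ already yields a perfect model,
# so a larger model exploiting feature dependencies cannot do any better,
# and nothing in a two-element dataset distinguishes between the candidates.
```
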

I think this extreme case reflects a more general truth:
When D is very small, models that ignore dependencies are more likely to give
high accuracy, because it’s generally hard to identify dependencies based on
small datasets.

So there you go – Bob’s your uncle!

The crux of the argument is:

- Simpler models are better for extrapolating from datasets of size N, because simpler models are better for extrapolating from datasets of size N/k, and crossvalidation theory says that models working better on data subsets tend to work better on the whole dataset
- Simpler models are better for extrapolating from very small datasets because it’s not possible to meaningfully extrapolate dependencies between variables based on very small datasets, and models that treat variables as independent and don’t try to model dependencies, are intrinsically simpler

**Dependency on the Expressive Language**

The above is admittedly a sketchy argument at this point,
and more rigorous analysis may expose some holes. But, provisionally taking the basic argument
for granted, it’s worth asking what the above argument says about the language
in which models are expressed.

The main constraint seems to come from the base case: we
need an expressive language in which modeling a dataset in a way that ignores
dependencies is generally more concise than modeling it in a way that
takes dependencies into account. There
may also be aspects of the “crossvalidation theory” invoked vaguely above that
depend in some way on the specifics of the expressive language.

While vague and still speculative, this seems promising to
me, e.g. as compared to the Solomonoff induction based foundation for Occam’s
Razor. In the Solomonoff approach, the
judgment of “what is simple” displays a strong dependence on the underlying
Universal Turing Machine, which becomes irrelevant only for large N. But a lot of everyday-life pattern
recognition seems to be best considered in the “small N” context. A lot
of human pattern recognition does seem to depend on what “expressive language”
the human mind/brain is using to represent patterns. On the other hand, in the present approach, the dependency
on the expressive language seems much weaker.
“What is simple” seems to be, mostly, what is judged simple by an
expressive language in which models that ignore dependencies are simpler than
those that incorporate them….

**So What?**

What’s the point of this kind of argumentation? (apart from the copious entertainment value it supplies to those of us with odd senses of aesthetics and humour, that is... ;D )

The point is that Occam’s Razor, valuable as it is, is a
vague, hand-wavy sort of principle – but given its very central importance in
the philosophy of mind and of science, it would be very nice to have a more
precise version!

*Among other goals, a more precise version of the Razor could provide useful guidance to AI systems in analyzing data and AGI systems in thinking about their experiences.*

__A Few Quasi-Random References__

@BOOK{Burnham2002,
  title = {Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach},
  publisher = {Springer},
  year = {2002},
  author = {Burnham, K. P. and Anderson, D. R.},
}

@BOOK{Hutter2005,
  author = {Hutter, Marcus},
  title = {Universal {Artificial} {Intelligence}: {Sequential} {Decisions} based on {Algorithmic} {Probability}},
  publisher = {Springer},
  year = {2005},
}

@ARTICLE{Kelly2010,
  author = {Kevin T. Kelly and Conor Mayo-Wilson},
  title = {Ockham Efficiency Theorem for Stochastic Empirical Methods},
  journal = {Journal of Philosophical Logic},
  volume = {39},
  pages = {679--312},
  year = {2010},
}

@ARTICLE{Kelly2007,
  author = {Kevin T. Kelly},
  title = {Ockham’s Razor, Empirical Complexity, and Truth-finding Efficiency},
  journal = {Theoretical Computer Science},
  year = {2007},
}

@BOOK{Sklar1977,
  title = {Space, Time, and Spacetime},
  publisher = {University of California Press},
  address = {Berkeley},
  year = {1977},
  author = {L. Sklar},
}