To follow this blog by email, give your address here...

Tuesday, June 16, 2015

Why Occam’s Razor Works

(Sketch of a Possible Explanation Why Occam’s Razor Works...)

(Though motivated by deep questions in philosophy, this is a speculative math-y blog post; non-technically oriented readers beware…)

How can, or should, an intelligent mind make sense of the firehose of complex, messy data that its sensors feed it?    Minds recognize patterns, but generally there are many many patterns in the data coming into a mind, and figuring out which data to pay attention to is a significant problem.   Some major aspects of this problem are: Figuring out which of the patterns that have occurred in one context are more likely to occur in other similar contexts, and figuring out which of the patterns that have occurred in the past are more likely to occur in the future. 

One informal principle that seems broadly useful for solving this “pattern selection” problem is “Occam's Razor.”   This principle is commonly taken as a key ingredient in the scientific method – it plays a key role in many philosophies of science, including the “social computational probabilist” philosophy I’ve presented here and in The Hidden Pattern.

Occam’s Razor has been around a while (well before its namesake, William of Ockham) and has been posited in multiple forms, e.g.:

“Nature operates in the shortest way possible” -- Aristotle, BC 384-322

“We consider it a good principle to explain the phenomena by the simplest hypothesis possible.”  -- Ptolemy, c. AD 90 -  168

“Plurality must never be posited without necessity” -- William of Ockham, c. 1287-1347

“Entities must not be multiplied beyond necessity” -- John Punch, 1639

“Everything should be as simple as it can be, but not simpler” --  Albert Einstein (paraphrased by Roger Sessions)

The modern form of the Razor, as used in discussions of scientific methodology and philosophy of science, could be phrased something like:

Given two explanations that explain the same data, the simpler one should be preferred

or as Sklar (1977) phrased it,

In choosing a scientific hypothesis in light of evidential data, we must frequently add to the data some methodological principle of simplicity in order to select out as ''preferred'' one of the many different possible hypotheses, all compatible with the specified data.

This principle is often taken for granted, and has a certain intuitive ring of truth to it -- but why should we actually accept it?   What makes it true? 

Arguments for Occam’s Razor

Perhaps the most compelling argument for Occam's Razor is the theory of Solomonoff Induction, generalized more recently by  Marcus Hutter into a theory of Universal AI.   This theory shows, very roughly speaking, that the assumption that the shortest computer program computing a set of data is the best explanation for that data, where "best" is defined in terms of accurate prediction of missing or future parts of the dataset.  This is very elegant but the catch is that effectively it only applies to fairly large datasets, because it relies heavily on the fact that, in the limit of large datasets, the shortest program explaining the dataset in one programming language is approximately the same length as the shortest program explaining that same dataset in any other programming language.   It assumes one is in a regime where the computational cost of simulating one programming language (or computer) using another is a trivial matter to be brushed aside.

There are other approaches to justifying Occam's Razor as well.   The Akaike Information Criterion (AIC) formalizes the balance between simplicity and goodness-of-fit that is required to achieve extrapolation without overfitting.  However,  the derivations underlying the AIC and its competitor, the Bayesian Information Criterion (BIC), hold only for large datasets.  The AICc version works for small datasets but relies on special distributional assumptions.

There is also Kelly's (2007, 2010) interesting argument that the shortest hypothesis is, under certain assumptions, the one that will require one to change one's mind least often upon exposure to new data.   This is interesting but somewhat begs the question of why it's so bad to change one's mind when exposed to new data.  Kelly's proofs also seem to get bogged down in various technical intricacies and conditions, which  may however be ultimately resolvable in a clean way.

Here I present a new  argument for Occam's Razor, which appears to work for small as well as large datasets, and which is based on the statistical notion of subsampling.   At present the argument is just a sketch, yet to be filled out into a formal proof.  In essence, what is done is to construct a particular computational model, based on the properties of a given dataset, and argue that using Occam's Razor relative to this sort of computational model leads to good explanations for any dataset, large or small.   As the size of the dataset increases, the explanatory advantage gained by choosing a dataset-guided computational model for using Occam's Razor decreases, and the choice of computational model becomes increasingly arbitrary.

Sketch of a New Argument why Occam’s Razor Works

I’ve been thinking for a while about a new, somewhat different argument for why Occam’s Razor makes sense and is a good idea.   I haven’t found time to write up a formal proof , so I’m just going to sketch my “proof idea” here.   Eventually maybe I’ll find time to try to turn this into a real proof, which may well yield to some gotchas, restrictive conditions or new discoveries….  Or maybe some other brave and noble soul will take the bait and try to make a real proof based on my idea…
Ok, so -- the crux of the argument I’m thinking of is as follows.

Consider, for simplicity, a binary classification problem, involving  a data-set D living within a data universe U, and a a "ground truth" mapping F: U --> {0,1}.    This is not the only context my proposed argument applies to, but it’s a simple context for explaining the basic idea.

Then, consider two sets of functions S1 and S2, both learned via application of some learning algorithm L to study D (and not the rest of F).   Suppose:

  • The functions in S1 have accuracy a across D, and have size s1
  • The functions in S2 have accuracy a across D, and have size s2 > s1


Then the proposition I make is: On average, functions in S1 will give a higher accuracy on F than functions in S2 (**).

Why would (**) be true?

I believe it can probably be demonstrated via induction on the size of D  

Suppose (**) holds for D^* that are smaller than D.   Then, suppose we apply crossvalidation to assess the value of functions in S1 and S2; that is, we run a series of experiments in which we partition D into 2 subsets (D1, D2) and then apply L to learn classification functions on D1, and test these functions on D2.   These experiment will yield many functions that don't belong in either S1 or S2, and also some that do belong to these sets.   According to the inductive hypothesis: on average the functions L finds (on the validation folds) belonging to  S1 will have greater accuracy across D as compared to those that L finds (on the validation folds) belonging to S2 (***).  

But then from (***) and the basic theory of crossvalidation (which says that hypotheses doing better on the test portions of validation folds, tend to do better out of sample), we derive (**).

The question then becomes how to start the inductive hypothesis off.   What is the base case?

Well,  one possibility is for the base case to be the situation where D contains two elements d0 and d1, with F(d_0) = 0 and F(d_1)=1.   To understand this case, suppose that the data elements d (in D) each have k internal features.   Based on comparing d0 and d1, there is no way for L to identify dependencies between the different internal features.  Thus, the models most likely to give good accuracy on F are single-feature models.  But these are smaller than any other models.  Thus in this (somewhat degenerate) base case, smaller models are better.

I think this extreme case reflects a more general truth: When D is very small, models that ignore dependencies are more likely to give high accuracy, because it’s generally hard to identify dependencies based on small datasets.


So there you go – Bob’s your uncle!

The crux of the argument is:

  • Simpler models are better for extrapolating from datasets of size N, because simple models are better for extrapolating from size N/k, and crossvalidation theory says that models working better on data subsets are better at working on the whole dataset
  • Simpler models are better for extrapolating from very small datasets because it’s not possible to meaningfully extrapolate dependencies between variables based on very small datasets, and models that treat variables as independent and don’t try to model dependencies, are intrinsically simpler


Dependency on the Expressive Language

The above is admittedly a sketchy argument at this point, and more rigorous analysis may expose some holes.   But, provisionally taking the basic argument for granted, it’s worth asking what the above argument says about the language in which models are expressed.  

The main constraint seems to come from the base case: we need an expressive language in which modeling a dataset in a way that ignores dependencies, is generally more concise than modeling a dataset in a way that takes dependencies into account.   There may also be aspects of the “crossvalidation theory” invoked vaguely above, that depend in some way on the specifics of the expressive language.

While vague and still speculative, this seems promising to me, e.g. as compared to the Solomonoff induction based foundation for Occam’s Razor.   In the Solomonoff approach, the judgment of “what is simple” displays a strong dependence on the underlying Universal Turing Machine, which becomes irrelevant only for large N.  But a lot of everyday-life pattern recognition seems to be best considered in the “small N” context.    A lot of human pattern recognition does seem to depend on what “expressive language” the human mind/brain is using to represent patterns.  On the other hand, in the present approach, the dependency on the expressive language seems much weaker.   “What is simple” seems to be mostly -- What is judged simple by an expressive language in which: models that ignore dependencies are simpler than those that incorporate them….

So What?

What’s the point of this kind of argumentation?  (apart from the copious entertainment value it supplies to those of us with odd senses of aesthetics and humour, that is... ;D )

The point is that Occam’s Razor, valuable as it is, is a vague, hand-wavy sort of principle – but given its very central importance in the philosophy of mind and of science, it would be very nice to have a more precise version!    


Among other goals, a more precise version of the Razor could provide useful guidance to AI systems in analyzing data and AGI systems in thinking about their experiences.

...

A Few Quasi-Random References

@BOOK{Burnham2002,
  title = { Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach},
  publisher = {Springer},
  year = {2002},
  author = {Burnham, K P and Anderson, D R},
}

@book{Hutter2005,
 author = "Hutter, Marcus",
 title = "Universal {Artificial} {Intelligence}: {Sequential} {Decisions} based on {Algorithmic} {Probability}",
 publisher = "Springer",
 year = 2005
}


@ARTICLE{Kelly2010,
  author = {Kevin T Kelly and Conor Mayo-Wilson},
  title = {Ockham Efficiency Theorem for Stochastic Empirical Methods },
  journal = {Journal of Philosophical Logic 39: pp. 679-312. },
 year = {2010},
}


@ARTICLE{Kelly2007,
  author = {Kevin T Kelly},
  title = {Ockham’s Razor, Empirical Complexity, and Truth-finding Efficiency”  },
  journal = {Theoretical Computer Science },
 year = {2007},
}

@BOOK{Sklar1977,
  title = {Space, Time, and Spacetime,},
  publisher = {Berkeley},
  year = {1977},
  author = {L Sklar },
}