… and some Thoughts on Syntactic vs. Semantic Approaches to Guiding Automated Theorem Proving
Quite a few people have been asking me these last few days about GPT-f, OpenAI's new foray into the domain of automated theorem proving (ATP).
It’s not often that AITP (AI for Theorem Proving, an AI field with a long history and exciting recent progress, but a fairly high level of obscurity compared to e.g. video processing or NLP) gets into the popular media -- but if OpenAI is good at one thing, it’s getting AI into the popular media...
GPT-f is in the same basic technological vein as GPT-2 / GPT-3, but is focused on math rather than natural language -- and like the other GPT systems, it is yielding results that are interesting in some ways and frustratingly idiotic in others.
My own sense is that, just as GPT-3 is not the right research direction to get to real human-level language facility, GPT-f is not the right research direction to get to real human-level (let alone superhuman) mathematical theorem-proving.
Let me explain a bit...
GPT -- Prediction and Simulation Without Understanding
You can find my general take on GPT-3 here; and if you haven't read it, you should also take a look at Gary Marcus's somewhat similar-in-spirit analysis, nicely titled "GPT-3, Bloviator".
TL;DR is that while the GPT algorithms can be fiendishly good at prediction and generation, they have no understanding of the underlying meaning of what they are predicting and generating, and because of this they systematically make a lot of dumb mistakes among their impressive-looking feats… and there seems no clear way to fix this dumb-ness without introducing radically different architectures and approaches.
GPT-3 is a fascinating artifact and deserved at least a nontrivial fraction of the attention it got, but in my view it is not really a breakthrough in NLP. I look at it more as an incremental improvement in implementation and deployment of the breakthrough concept of transformer NNs, which was introduced by Google researchers (in the "Attention Is All You Need" paper) and popularized in models like BERT.
GPT-3 does constitute a significant step forward in some aspects of NLP functionality -- however the fact that it doesn't understand what it is talking about constrains its utility in applications where meaningful reasoning or analysis or invention (as opposed to casually meaningful-looking simulacra of these things) are required. The modest but nontrivial percentage of utter nonsense it generates makes it hard to apply in contexts like customer support, education or medical chat, where it's not OK for 5% or 20% (or even 1%) of what comes out of your AI to be plausible-sounding bloviatorial bullshit with no foundation in reality.
Viewed in the general context of ongoing work in the ATP field, it seems to me GPT-f is less of a step forward than GPT-2 or GPT-3 -- but for sure it is meaningful incremental progress on ATP, fitting in comfortably with a large amount of ongoing progress by others in the field (which however does not tend to get covered in the tech media, because most researchers working on ATP don’t have OpenAI’s PR budget or facility). GPT-f also has a similar core shortcoming to the other GPTs -- it does not understand math any better than GPT-2 or GPT-3 understand language, which seems likely to constrain its utility as a theorem proving tool.
Just as the NLP field needs a substantial breakthrough to get to systems that can really interact linguistically like people, similarly the ATP field needs a substantial breakthrough to get to systems that can really prove theorems at the level of human mathematicians. It seems extremely clear from looking at the pattern of errors made by GPT-2, GPT-3 and GPT-f that GPT type systems will not constitute this breakthrough. (I have my own ideas about how to get to this breakthrough, and will touch on a few of these a little later on in this post, but that’s not my main focus here.)
Accelerating Automated Theorem Proving
The first thing to understand about ATP (automated theorem proving) is that it's basically a solved problem in one concrete sense: current automated theorem provers, if you let them run long enough, can in principle find a proof for any provable hypothesis (or a refutation of any refutable one) you give them in a variety of standard formal mathematical systems. This is a mature technology.
The catch of course is that "long enough" is often way too long. So the ATP field focuses a lot of attention on "guidance" of theorem provers, which uses various forms of generalization and learning to help ATP systems avoid running down too many dead ends before getting to the proofs or disproofs they're looking for.
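To make "guidance" a bit more concrete, here is a deliberately minimal sketch (in Python, purely for illustration -- not any real prover's code) of the kind of loop being guided: a toy version of the "given clause" saturation loop that classical provers refine enormously. The score, infer and is_empty_clause callables are placeholders for whatever guidance model and inference rules a real system would plug in.

```python
import heapq
import itertools

def guided_saturation(initial_clauses, score, infer, is_empty_clause, max_steps=100_000):
    """Toy 'given clause' saturation loop -- the skeleton behind refutation-based
    provers, in vastly simplified form.  `initial_clauses` should already contain
    the negated conjecture; `infer` yields consequences of the given clause
    against the processed set; `score` is where learned guidance enters --
    lower-scored clauses get processed first."""
    counter = itertools.count()  # tie-breaker so heapq never has to compare clauses
    unprocessed = [(score(c), next(counter), c) for c in initial_clauses]
    heapq.heapify(unprocessed)
    processed = []
    for _ in range(max_steps):
        if not unprocessed:
            return "saturated"   # search space exhausted without a refutation
        _, _, given = heapq.heappop(unprocessed)
        if is_empty_clause(given):
            return "proved"      # contradiction derived: the conjecture holds
        processed.append(given)
        for new_clause in infer(given, processed):
            heapq.heappush(unprocessed, (score(new_clause), next(counter), new_clause))
    return "timeout"             # "long enough" was not long enough
```

The point is that the guidance doesn't change what is provable in principle; it just reorders the search so that, with luck, the proof turns up before the heat death of the universe.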
(It’s worth noting that, according to Hutter’s AIXI theory, AGI in general is a solved problem in a similar sense -- algorithms like AIXI^tl can in principle solve powerful formalizations of the AGI problem with simple code, given unrealistically large amounts of compute power. However, the practical state of the art with non-AI-driven ATP systems exceeds that with AIXI^tl-like AGI systems; i.e. brute-force-ish ATP systems guided by simple-ish heuristics can get more done than AIXI^tl-like AGI systems currently can in most other domains.)
Compared to even a very complex game like Go, math is an extremely open-ended domain. However, there are subsets of math that are more constrained and Go-like, and it seems plausible that methods roughly similar to those that worked for Go -- integrated appropriately into general-purpose ATP systems -- could solve theorem proving in these domains. (I’ll mention a couple of these below.)
The idea to use transformer neural nets for guiding ATP systems is not original with OpenAI. I believe the first work in this direction was done by my son Zar’s PhD thesis advisor Josef Urban, one of the long-time leaders in the ATP field and the organizer of the annual AITP (AI for Theorem Proving) conference, which is being held this upcoming week in France with an online component as well. Josef’s work from February 2020 appeared online in March and was published in July at CICM. The ideas and links in the rest of this section draw heavily on some recent conversations I had with Josef.
When I asked Josef for a good example of "semantic nonsense" generated by transformer NNs in the theorem-proving domain, he pointed me to the proof at the top of p. 318 here. Anyone who knows basic set theory and is willing to spend a few minutes focusing attention can confirm that this is an utter gibberish arrangement of simple inference steps. Each of these steps would be sensible in some other context -- but to my eye they are radically out of place and senseless here, and clearly indicative that GPT is doing "the wrong sort of thing". No human math student would ever string together steps like this, unless they were randomly copying inference steps from proofs in their textbooks without paying attention to their meaning or relevance.
The work of Josef and his colleagues has touched a bunch of areas generally related to the GPT theorem proving work -- e.g. showing that neural nets with attentional mechanisms can help with premise selection, and showing that neural net based embedding vectors capturing limited aspects of proof semantics can help guide theorem proving. However more symbolic approaches have also shown promise, along with work on automatic creation of higher-order proof mechanisms like tactics and heuristics -- with no approach being a clear silver bullet yet.
It’s also interesting to reflect on the OpenAI team’s choice of the Metamath corpus for their experiments -- this is a reasonable corpus to use, but it’s just one among a host of different corpora of theorems and proofs used in the ATP field. Why choose that one in particular? Compared to many alternatives, Metamath’s distinguishing characteristic is that it’s very “low level” -- i.e. the inference steps in the proofs in Metamath’s formalism tend to be very small, without e.g. use of more abstract "tactics" allowing bigger leaps. This corpus is ideally suited to GPT-f’s "brute force"-ish approach, the greatest strength of which is the ability to extrapolate from large numbers of combinations of micro-scale proof steps.
All in all, looking at the GPT-f work in the context of the recent thrust of work by Josef and his students, the scope of papers presented at AITP-20, and work done on ATP with other corpora as well as Metamath -- one sees that GPT-f is one among a bunch of different approaches using various ML algorithms to guide theorem-provers, and one doesn’t see any clear sense in which GPT-f is the most promising avenue being explored.
In fact my own suspicion as an AI researcher is that the more exciting and interesting paths to making AI-guided ATP work lie entirely elsewhere.
Math Reasoning, Scientific Reasoning, Commonsense Reasoning
I’m not really active in the AITP space currently, but my PhD was in math and it’s an area I’ve followed with fascination for decades. And I have kept up with it relatively closely of late, due to my son’s PhD work in the field -- as well as due to the close relation between AI for math theorem proving and certain aspects of what we’re doing in the OpenCog project as regards biological data interpretation and automated agent control.
In OpenCog we are not currently concerned with making AIs that prove math theorems -- but we are concerned with making AIs that use theorem proving in probabilistic logics to do things like understand why certain combinations of genes tend to impact certain diseases, or figure out how an NPC in a Minecraft-like game should move blocks around to be able to get to some object it wants. The formal problem of logical inference for commonsense and scientific reasoning is very similar to the formal problem of logical inference for mathematical reasoning -- we hit up against similar problems to the folks in the AITP community, and are taking closely related strategies in terms of using ML algorithms to guide the proof process.
Likely Strengths and Weaknesses of GPT-f
My strong suspicion is that with the GPT-f style approach to theorem-proving, there will be weak generalization to proofs/theorems that are qualitatively different than the ones in the training data. The same will hold for any other approach that involves training ML models on low-level representations of proofs, without some sort of internal representation (engineered or learned) that corresponds to higher level structures like tactics, more abstract proof-patterns or concepts, etc.
It also seems likely that in some applications of theorem-proving these limitations will not matter as much as in others. E.g.
formal verification of the smart contracts that actually occur in current blockchain systems
existence and uniqueness theorems for differential equations useful in practical modeling for everyday physical systems
would seem to be cases where the theorems and proofs are "all kinda similar" in a way that might make the GPT style of pattern recognition relatively effective.
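To make "all kinda similar" concrete for the second bullet: most of the existence and uniqueness results one needs for everyday physical modeling are instances of a single template, the Picard-Lindelof theorem, with the work in each instance going into verifying a Lipschitz-type condition for the particular right-hand side. (The LaTeX below is just a statement sketch, not a formalization from any particular proof corpus.)

```latex
% Template that most "everyday" existence/uniqueness results instantiate
% (local Picard--Lindelof); particular theorems vary f, the domain and the norm.
\textbf{Theorem (Picard--Lindel\"of, local form).}
If $f : [t_0-a, t_0+a] \times \mathbb{R}^n \to \mathbb{R}^n$ is continuous and
Lipschitz in its second argument, i.e. $\|f(t,x)-f(t,y)\| \le L\,\|x-y\|$ for some $L \ge 0$,
then the initial value problem
\[ \dot{x}(t) = f(t, x(t)), \qquad x(t_0) = x_0 \]
has exactly one solution on some interval $[t_0-\varepsilon,\, t_0+\varepsilon]$.
```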
However, if confronted with a theorem from a new domain of mathematics (say, if put in the position of Galois when inventing abstract algebra; or Weierstrass etc. when inventing real analysis), one would expect that a system learning patterns at this low level would not be able to perform much transfer learning, and would need a huge amount of time to bootstrap itself up to functionality in the new domain.
Similarly, if confronted with a theorem from a familiar domain that requires a counterintuitive sort of proof, I’d expect this sort of system would be unlikely to be able to find it. E.g. the Fundamental Theorem of Algebra is different from the bulk of algebra theorems in that it's most conveniently proved via recourse to theorems from complex analysis; but if a GPT type theorem prover had been trained on algebra theorems with more traditional algebra-style proofs, it would be quite hard put to make the leap to try a proof involving complex analysis. Of course most human mathematicians have a hard time making leaps like that as well -- but some are able to, and they do it via having abstract conceptual representations that bridge different areas of mathematics…
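For readers who don't have the example at their fingertips: the complex-analytic proof in question is short to state but crosses fields in a way that pure pattern-matching over algebra-style proofs would never suggest. Roughly (the standard textbook argument, sketched here in LaTeX):

```latex
% Sketch of the Fundamental Theorem of Algebra via Liouville's theorem.
\textbf{Claim.} Every nonconstant polynomial $p \in \mathbb{C}[z]$ has a root in $\mathbb{C}$.
\textbf{Proof sketch.} Suppose $p$ has no root. Then $g(z) := 1/p(z)$ is entire,
and since $|p(z)| \to \infty$ as $|z| \to \infty$, $g$ is bounded on all of $\mathbb{C}$.
By Liouville's theorem a bounded entire function is constant, so $p$ is constant --
contradicting the assumption. Hence $p$ has a root. $\qed$
```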
Sketch of a (Possibly) Better Way
Zar and I have mused a few times about doing something similar to the approach taken in GPT-f and other similar systems, but with a measure of interestingness/surprisingness in the loop. I.e., to simplify a fair bit, what OpenAI has done here -- and what Josef did with his earlier related work with GPT-2 for ATP, and others have done with other ML systems in other prior work -- is
generate a bunch of theorems
have their theorem-prover prove those theorems
learn proof-patterns from these proofs
use these proof-patterns to make the theorem-prover more effective
lather, rinse, repeat
With this approach, each time around the cycle you can choose theorems that are at the edge of what your prover can currently do.
What Zar and I have mused about doing (which is surely not original and has likely also been a desire of many others thinking about the area) is -- see the code sketch following this list --
generate a bunch of theorems that score highly on some formal measure of interestingness
have our theorem-prover prove those theorems
learn proof-patterns from these proofs
use these proof-patterns to make the theorem-prover more effective
lather, rinse, repeat
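Here is the rough code sketch referred to above -- a minimal rendering of this modified loop, where generate_candidates, interestingness, prover, learn_patterns and the guidance object's updated_with method are all placeholders for components that would each be substantial projects in their own right (none of these names come from GPT-f or any existing system). The baseline loop from the previous list is the same thing minus the filtering step.

```python
def bootstrap_with_interestingness(prover, guidance, generate_candidates,
                                   interestingness, learn_patterns,
                                   n_rounds=10, threshold=0.8):
    """Self-improvement loop with an interestingness filter in it.
    The baseline (GPT-f-style) loop is the same minus the filter:
    generate -> prove -> learn proof-patterns -> improve guidance -> repeat."""
    for _ in range(n_rounds):
        # 1. Generate candidate theorems, then keep only the "interesting" ones
        #    (surprising, and plausibly useful as lemmas for further theorems).
        candidates = generate_candidates(guidance)
        interesting = [t for t in candidates if interestingness(t) >= threshold]

        # 2. Try to prove the interesting candidates with the current guidance.
        proofs = [p for t in interesting if (p := prover(t, guidance)) is not None]

        # 3. Mine proof-patterns (proto-tactics, recurring lemma shapes, ...)
        #    from the successful proofs.
        patterns = learn_patterns(proofs)

        # 4. Fold the patterns back into the guidance model -- lather, rinse, repeat.
        guidance = guidance.updated_with(patterns)
    return guidance
```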
In fact I talked about this in my presentation at AITP last year (https://goertzel.org/aitp-19/), with some elaborations on different ways of looking at “interestingness” in a mathematics context.
Among the multiple ways to assess "interestingness" in this context, two critical ones are
surprisingness relative to the body of mathematics already known
utility as a lemma in proving further interesting theorems
("Surprisingness" can be measured information-theoretically in various ways, and there is plenty of subtlety here, as purely statistical information theory is a bit lame in this context yet algorithmic information theory in its full splendor is intractable, so one can concoct various intermediary measures.)
Neither Zar nor I have proceeded with this kind of work so far -- due to having lots of other stuff on our plates, and also due to the computational resources and development time needed for this sort of thing (i.e. among other factors, we don't have a billion dollars from Microsoft... though of course this wouldn't really take remotely that many resources...)
The subtlety here is that if you're generating interesting theorems (where "interestingness" embodies a notion of compositionality, as is implied by the "utility as a lemma" aspect of the definition of interestingness hinted above), you're presumably generating theorems that involve some abstract representations and structures, rather than just algorithmically complex tangles of low-level mathematical operations.
So the approach which includes interestingness in the loop would seem more amenable to learning approaches that include higher-level tactics and other sorts of abstractions -- hence more amenable to learning approaches capable of significant transfer learning.
In short -- in terms of the challenge of automatically generating interesting new theorems, one suspects the GPT type approach is not much more likely to succeed than the proverbial army of monkeys at their typewriters, whereas an interestingness/abstraction driven approach would have much more hope.
The task of estimating the utility of a theorem as a lemma for other interesting theorems has a lot of overlap with the task of identifying useful proof-patterns from a corpus of proofs. My own preferred approach to these tasks would involve importing the proofs into the OpenCog Atomspace (a weighted, labeled hypergraph knowledge store that we are now using for probabilistic commonsense and scientific inference) and then using a multi-paradigm AI approach combining neural graph embeddings, hypergraph pattern mining and probabilistic logical inference. This leads to various fascinating recursions, including the potential use of the same surprisingness-based AI-for-ATP approach to accelerate the probabilistic logical inference involved. But while this could be done using the current version of OpenCog, various issues with scalability and implementation awkwardness arise, and this has led some of my colleagues and me to put our focus recently on designing a radically more flexible and scalable version of OpenCog, OpenCog Hyperon. But this now leads beyond the scope of this blog post…
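To make the hypergraph-pattern-mining ingredient slightly less abstract, here is a deliberately generic sketch in plain Python -- emphatically not the actual OpenCog Atomspace API -- of representing proof steps as labeled hyperedges and counting recurring step patterns, which is the kind of raw material that graph embeddings and probabilistic inference would then operate on.

```python
from collections import Counter
from typing import NamedTuple, Tuple, List, Dict

class ProofStep(NamedTuple):
    """A proof step as a labeled hyperedge: one inference rule connecting
    several premise formulas to a conclusion formula."""
    rule: str
    premises: Tuple[str, ...]
    conclusion: str

def abstract_step(step: ProofStep) -> Tuple[str, int]:
    """Abstract away the concrete formulas, keeping only rule and arity --
    a crude stand-in for the structural abstraction a real miner would learn."""
    return (step.rule, len(step.premises))

def mine_step_patterns(proofs: List[List[ProofStep]], min_count: int = 2) -> Dict:
    """Count recurring (rule, arity) bigrams across proofs; frequent bigrams
    are candidate macro-steps / proto-tactics worth promoting."""
    counts = Counter()
    for proof in proofs:
        for a, b in zip(proof, proof[1:]):
            counts[(abstract_step(a), abstract_step(b))] += 1
    return {pattern: n for pattern, n in counts.items() if n >= min_count}
```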
Semi-Concluding Ramble…
Whether Hyperon will actually wind up being the silver bullet for automated theorem proving and other AGI-ish applications remains to be seen -- and it won’t be seen this year, as there is a monster amount of work to be done to make Hyperon a reality. However, the point I want to make right now is that this would be a non-trivially different direction from what the AITP community is currently taking. Josef Urban and others at the heart of the AITP field have the intuition that a more semantic approach to mining patterns from proofs and using them for proof guidance will be valuable -- but using Hyperon-based probabilistic logic to represent and infer proof patterns would be a big leap in this direction, substantially different from what one sees in the AITP literature so far.
On the other hand, as I’ve emphasized above, GPT-f -- which does indeed work creditably well on the Metamath corpus to which it has been applied -- is very much in the vein of what others in the AITP field have been doing for a while. It’s really cool to see big companies get into the automated theorem proving space, which not long ago was more of a tiny obscure academic corner -- no doubt this is going to help accelerate progress in the field in multiple ways. However, let’s not be under any illusion about where the main source of innovation and progress is in AITP -- at this stage it’s definitely not in the big tech companies. OpenAI may be better known than, say, Josef Urban's AITP research group, but there's no doubt who has made more contributions to progress in this field. Making the breakthroughs needed to solve theorem-proving and other major AGI-ish challenges is going to require lots of free-flowing creativity, and these breakthroughs may well emerge from the decentralized mess of university labs, open source projects and early-stage startups rather than the secretive, massive-data-and-processing-centric efforts of big tech firms.