
Monday, September 14, 2020

Sand Talk, Singularity and Psi


It’s not that often I encounter a newly authored book that I consider a “must read” and avidly recommend to all my friends and colleagues -- but Sand Talk by Tyson Yunkaporta genuinely falls into this category.   


I was introduced to the book by Jim Rutt who interviewed the author on his podcast -- an episode I also strongly recommend.


Yunkaporta is an Aboriginal Australian whose core goal in the book is to present and analyze modern civilized society from the perspective of Aboriginal indigenous society.   He does a bang-up job of this and raises a lot of interesting related issues along the way.


I’m not going to give a proper review or summary of the book here, but am mostly going to share various indirectly-related thoughts that occurred to me upon reading the book.   If you want a basic overview of the themes of the book try this brief interview with the author.


Was Civilization a Step Forward, or Backward?


It is hard to really give an "outsider" view of something as big as *human civilization* , but Yunkaporta manages to take a decent stab at it....  He is able to pull this off by virtue of being fascinatingly between the two worlds -- ensconced enough in Aboriginal culture to understand their indigenous world-view on the multiple relevant levels; but also embedded enough in the modern intellectual world to explain aspects of the indigenous world view (in the context of contrast with modernity & civilization) in ways that are clear and straightforward and compelling  to those of us accustomed to modern verbal/analytical modes of expression ...


There are hilarious things in the book -- e.g. the analogy and possible historical relationship between modern education systems and methodologies for animal domestication.


There are beautifully conceptually insightful things, like the discussion of multiple forms of human understanding -- pattern mind, kinship mind, dreaming mind, story-mind, ancestor-mind -- and the observation of how biased modern civilized culture is toward some of these and away from others.  


There are disturbingly thought-provoking things, like his discussion of public vs. private violence and the relation between physical and psychological violence (he believes that occasional/ moderate physical violence in indigenous societies plays a valuable social-psychological role, which it can't play in modern society due to the different sort of organization).  


All in all he makes a reasonably compelling case that the states of consciousness of indigenous people -- and the "collective minds" of indigenous tribal groups (to use my own fancy language that he probably would not approve of, he is very very down to earth in the book) -- were much more satisfied and fundamentally healthy than the vast majority of individual and human-collective-group minds on the planet today...


He views civilization as mostly destructive both to human minds and families and bodies, and to the rest of the physical environment on the planet…


I largely agree with the core points Yunkaporta makes in the book, but of course I have a somewhat different slant on the conclusions/ideas.  (The rest of this post is now my own musing and  rambling, some of which Yunkaporta might well utterly disagree with...)  


What Drives Humanity?


The transition from hunter-gatherer to agricultural society and then to modern civilization probably has degraded happiness and mental and social health in many important senses, as Yunkaporta points out.


So it is certainly meaningful to question whether these transitions have really been “advances” as is commonly assumed.


However, it’s also important to understand that the driver of these transitions has never really been to increase happiness or mental/social health!   Put crudely, “happiness” (a concept very tricky to define) is not necessarily what humans have been after...


Humans and human groups are complex with multiple motivations -- and the drive for fundamental novelty and growth is one of them.


Aboriginal society was stable for 60,000 years, and certainly allowed individuals to pursue their drive for novelty and growth  through stories and dreaming and battles and adventures -- but it did not allow human-groups much avenue to pursue the drive for novelty and growth.   


It would seem that, once human-groups got a taste of fulfillment of their novelty-and-growth goal, they became addicted to it and things just snowballed till we got where we are today…


Singularity as Paradise Regained and Transcended


So what to do now?  Yunkaporta does not really suggest rolling back to indigenous society, because he's a realist and can see this is not too likely to happen except in the aftermath of some disaster scenario (which seems strangely possible to me at the moment, as I write these words sitting here on an island near Seattle with the air so full of smoke one can barely see a few hundred meters out into the ocean… smoke due to forest fires that are spreading like mad through the US West due to heat and dryness that is likely substantially due to industry-induced climate warming....  But yeah yeah... the smoke is supposed to clear over the next couple days...)


Personally, as you probably know, I think we are on the verge of another shift even bigger than the shift from hunter-gatherer to civilization (Singularity, anyone?) ... and as we attempt to guide ourselves through this shift, it seems very important to keep in mind the various aspects of human individual and group life that have been squelched in the transition to civilization -- perhaps these aspects can be regained in a different form as we move into the next phase....


Yunkaporta refers to modern civilized thinking as "context-free" thinking , whereas indigenous thinking is contextually-embedded -- i.e. in the context of a network of social relationships, a specific area of land, etc.   This is a deep point that I am still in the process of fully absorbing.   There is an indirect connection here to the relational interpretation of quantum mechanics, in which there are no pure phenomena, only (observer, phenomenon) pairs.   But much of our thinking these days is done as if there are pure phenomena.


On the other hand, fundamentally, embedding thinking in contexts defined by kinship and physical place is only one possible way of embedding thinking -- there are many other sorts of possible contexts.   It's arguable that these particular sorts of contexts are intrinsically central to humanity, due to the way our bodies and brains are built.   On the other hand the Singularity is partly about going beyond the restrictions of legacy humanity.   


It seems clear that we want to be sure post-Singularity minds including early-stage AGIs and cyborgs are able to engage in richly context-sensitive thinking, regarding contexts defined by kinship networks and other social networks and specific physical locations -- but also other contexts beyond these historically critical examples.


From an indigenous view -- or a conventional modern civilized human view -- one could view this sort of idea as overweening and narcissistic ... i.e. how can we, who are born to human moms and live our lives walking around in the dirt and stuffing ourselves with food grown in the earth and nourished with water and sunlight -- sensibly talk about going beyond kinship and physical location into some sort of airy domain of abstract contexts?   But we have built computer chips and brain scanners and open-heart surgery and brain surgery and virtual reality and birth control and IVF and gone to the moon and blown up atom bombs and on and on -- while we have indeed lost a lot of important stuff relative to our indigenous forebears, there is also obviously a huge amount of power in the crazy path we civilized folks have blazed...


Inspiration Regarding Psi from the Indigenous Perspective


Another aspect of Yunkaporta’s book that jumped out at me -- though it was fairly peripheral in his narrative -- was the Aboriginal approach to psi (psychic, paranormal phenomena…).   (For some references on the science of psi, see  here.)


In the indigenous view psi is just there along with a lot of other phenomena -- it's part of the patterns people observe, and part of the correlation between dreaming mind and everyday life, and part of the correlation between ancestors and nonhuman life (including e.g. rocks which Aboriginals consider to have their own sort of consciousness) and human minds, etc.    


The mercurial nature of psi phenomena, which gives us such a headache as scientists seeking replicable results, is not a problem from the indigenous view -- it's just how the world works.


In fact, conjecturing a bit, I vaguely suspect Yunkaporta would view the modern civilized-society obsession with the small percentage of phenomena in the universe that are reliably repeatable as an indication of our modern civilized mental un-health.   


Consider e.g. the analogue of romantic relationships.  There is all sorts of special-case magic in romantic relationships -- special moments that will never be repeated -- ad hoc exploitations of beautiful situations or unique moods -- and that is fine and wonderful and part of the beauty of it all.   Obsessing on those portions of a romantic relationship that are highly reliably repeatable in the same precise form, without contextual variability -- would seem crass and ridiculous.   (Every time I give her a sufficient number of flowers of the appropriate species and purchase her an evening meal at a restaurant with a sufficiently high rating by a reputable source, she must respond by inviting me for a night of passionate lovemaking -- and if this  fails too often because of some unpredicted variability in aspects of the context in the world or her brain or body or one of our life-processes, then that proves the pattern is inadequate and we must seek a more rigorous and reliable pattern !!) ....  


One could view the obsession with repeatability and reliability (in psi research and elsewhere) from this perspective, as a weird/twisted focus on certain small corners of the broader universe, most of which just doesn't play by that sort of highly simplistic rulebook....


I've suggested before that  basic progress on psi might require modification of our concept of science in the direction of "second person science", where e.g. brain computer interfacing is used to enable one observer to directly perceive another observer's perception of phenomena -- so that we can compare subjective observations directly without needing to project them wholly into non-experiential data.


Quite possibly, another sense in which our current conception of science may need to be obsoleted/ transcended, is that we need to move beyond our simplistic notion of "repeatability"....  Psi is contextually dependent in the same sort of way that indigenous knowledge is contextually dependent (which is related to but maybe stronger than the sense in which quantum knowledge is contextually dependent).   Charlie Tart's wonderful notion of state-specific science is going in this direction -- it's a particular sort of example of context-specific science....


I am reminded of a chat I heard among psi researchers not long ago, which may be paraphrased as:


A:

Can you think of ANY pre-registered, high statistical power, multi-lab psi replication that was successful?


B:

The PEAR consortium in 2000 tried to replicate the original PEAR findings and although the main effect of intention was not significant, they did find a number of interesting effects and anomalous aspects in the data.


C:

I’m aware of those results, but the question was about robust replication.  Finding post hoc "interesting effects" doesn’t qualify.


A’s question initially seems like a good one; but I just wonder if, somehow or other, to really grapple with psi we need to find a broader notion of success in which finding post hoc "interesting internal effects" DOES qualify...


Yes, one can fool oneself and see interesting post hoc effects in random noise.  However, one can also see genuinely interesting post hoc effects that are not the same as what one was looking for, yet have some meaty surprisingness value to them...


How to quantify this?  Taking a broad-minded Bayesian sort of view, one could ask: In a universe with mercurial psi, versus a universe with none, would this post-hoc effect be more likely to occur?    


Across a host of different experiments with weird post-hoc effects but not accurate replication of prior results, this sort of question becomes more and more meaningful.   It's not an easy way to look at things, and it's better asked in the context of meaningful concrete models of "universes with mercurial psi", but something in this direction may be better than trying to shove funky / trickstery / mercurial phenomena into a bin of "highly repetitive, replicable phenomena" where they just plain don't fit...
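To make the Bayesian framing a little more concrete, here is a minimal sketch of how evidence might be pooled across experiments. It assumes one already has, for each experiment, the probability of its post-hoc effect under a "mercurial psi" model and under a no-psi model. Getting those two numbers is the genuinely hard part and is not addressed here, and it assumes the experiments are independent. The function names and all the numbers are made up for illustration.

```python
import math

def log_bayes_factor(p_effect_given_psi, p_effect_given_null):
    """Log of P(effect | mercurial-psi model) / P(effect | no-psi model)."""
    return math.log(p_effect_given_psi) - math.log(p_effect_given_null)

def combined_log_bayes_factor(experiments):
    """Sum the per-experiment log Bayes factors (assumes independent experiments)."""
    return sum(log_bayes_factor(p_psi, p_null) for p_psi, p_null in experiments)

# Made-up likelihoods for three post-hoc effects: (under psi model, under null model).
experiments = [(0.10, 0.05), (0.08, 0.05), (0.04, 0.05)]

total = combined_log_bayes_factor(experiments)
# Factor by which prior odds in favor of the psi model get multiplied.
odds_multiplier = math.exp(total)
```

Note that an effect which is less likely under the psi model than under the null (the third entry) pulls the total down, so this kind of bookkeeping cuts both ways.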


The general notion that psi is "trickstery" is definitely far from new.   However, what I think may be new is the question of how to build a data-analytics methodology that accounts for this aspect...


What if one simply takes the mass of data from a bunch of psi studies and


  1. searches for surprising psi-ish patterns in the data using general pattern-analysis tools

  2. does a comparable search for such patterns in appropriately shuffled versions of the data


To keep things statistically valid, one would want to explore 1) using a variety of pattern-analysis approaches, and then once one feels one has really got something, one tries 2) for validation. Obviously 2) should not be part of one’s main pattern-search loop.
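The validation step 2) can be sketched as a simple shuffle (permutation) test. This is only a rough illustration under strong simplifying assumptions: the "pattern" is a single fixed statistic chosen during exploration (here, a difference in mean outcome between two conditions), and the data are made up. A fuller version would rerun the entire pattern search on each shuffled copy, so that the search's own flexibility is accounted for.

```python
import random

def pattern_score(conditions, outcomes):
    """Example pattern statistic: absolute difference in mean outcome between conditions 1 and 0."""
    a = [o for c, o in zip(conditions, outcomes) if c == 1]
    b = [o for c, o in zip(conditions, outcomes) if c == 0]
    return abs(sum(a) / len(a) - sum(b) / len(b))

def shuffle_p_value(conditions, outcomes, score_fn, n_shuffles=2000, seed=0):
    """Fraction of label-shuffled datasets whose score is at least the observed score."""
    rng = random.Random(seed)
    observed = score_fn(conditions, outcomes)
    shuffled = list(conditions)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real link between condition and outcome
        if score_fn(shuffled, outcomes) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of exactly zero.
    return (hits + 1) / (n_shuffles + 1)

# Made-up data: 20 "intention" trials and 20 control trials.
conditions = [1] * 20 + [0] * 20
outcomes = [1.0] * 20 + [0.0] * 20
p = shuffle_p_value(conditions, outcomes, pattern_score)
```

The important discipline is the one stated above: the shuffled comparison is run once, after the pattern has been chosen, and not inside the exploratory loop.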


Of course the cosmic trickster could also troll this analysis by -- as soon as one has done this analysis -- starting to tweak experimental results so as to give bad results according to one's particular mathematical definition of surprisingness, but good results according to some other definition, etc.


But yet -- the cosmic trickster is obviously not being infinitely tricky in all instances, sometimes just highly tricky -- and trying to outsmart their tricks is part of the grand eurycosmic dance in which we find ourselves ...


When I shared some of these thoughts with a leading psi researcher, he recalled a time he gave a talk in Australia, and an Aboriginal elder from the audience approached him afterwards, whispering that her people had been using psi for over 50,000 years, but that it was nice to see science starting to catch up.


GPT-f -- One More Funky Experiment in Guiding Theorem-Proving without Understanding

 


 … and some Thoughts on Syntactic vs. Semantic Approaches to Guiding Automated Theorem Proving


Quite a few people have been asking me these last few days about GPT-f, OpenAI's new foray into the domain of automated theorem proving (ATP).   

   

It’s not often that AITP (AI for Theorem Proving, an AI field with a long history and exciting recent progress, but a fairly high level of obscurity compared to e.g. video processing or NLP) gets into the popular media -- but if OpenAI is good at one thing, it’s getting AI into the popular media...


GPT-f is in the same basic technological vein as GPT-2 / GPT-3, but is focused on math rather than natural language -- and like the other GPT systems, it is yielding results that are interesting in some ways and frustratingly idiotic in others.


My own sense is that, just as GPT-3 is not the right research direction to get to real human-level language facility, GPT-f is not the right research direction to get to real human-level (let alone superhuman) mathematical theorem-proving.


Let me explain a bit...


GPT -- Prediction and Simulation Without Understanding


You can find my general take on GPT-3 here; and if you haven’t read it, you should also take a look at Gary Marcus's somewhat similar-in-spirit analysis, nicely titled “GPT-3, Bloviator”.


TL;DR is that while the GPT algorithms can be fiendishly good at prediction and generation, they have no understanding of the underlying meaning of what they are predicting and generating, and because of this they systematically make a lot of dumb mistakes among their impressive-looking feats… and there seems no clear way to fix this dumb-ness without introducing radically different architectures and approaches.


GPT-3 is a fascinating artifact and deserved at least a nontrivial fraction of the attention it got, but in my view it is not really a breakthrough in NLP.  I look at it more as an incremental improvement in implementation and deployment of the breakthrough concept of transformer NNs, which was introduced by Google and used e.g. in its BERT model.


GPT-3 does constitute a significant step forward in some aspects of NLP functionality -- however the fact that it doesn’t understand what it is talking about constrains its utility in applications where meaningful reasoning or analysis or invention (as opposed to casually meaningful-looking simulacra of these things) are required.   The modest but nontrivial percentage of utter nonsense it generates makes it hard to apply in contexts like customer support, education or medical chat where it’s not OK for 5% or 20% (or even 1%) of what comes out of your AI to be plausible-sounding bloviatorial bullshit with no foundation in reality.


Viewed in the general context of ongoing work in the ATP field, it seems to me GPT-f is less of a step forward than GPT-2 or GPT-3 -- but for sure it is meaningful incremental progress on ATP, fitting in comfortably with a large amount of ongoing progress by others in the field (which however does not tend to get covered in the tech media, because most researchers working on ATP don’t have OpenAI’s PR budget or facility).   GPT-f also has a similar core shortcoming to the other GPTs -- it does not understand math any better than GPT-2 or GPT-3 understand language, which seems likely to constrain its utility as a theorem proving tool.   


Just as the NLP field needs a substantial breakthrough to get to systems that can really interact linguistically like people, similarly the ATP field needs a substantial breakthrough to get to systems that can really prove theorems at the level of human mathematicians.   It seems extremely clear from looking at the pattern of errors made by GPT-2, GPT-3 and GPT-f that GPT type systems will not constitute this breakthrough.  (I have my own ideas about how to get to this breakthrough, and will touch on a few of these a little later on in this post, but that’s not my main focus here.)



Accelerating Automated Theorem Proving


The first thing to understand about ATP (automated theorem proving) is that it’s basically a solved problem in one concrete sense: Current automated theorem provers, if you let them run long enough, can prove or disprove any hypothesis that is provable or refutable in a variety of standard formal mathematical systems.   This is a mature technology.


The catch of course is that “long enough” is often way too long.   So the ATP field focuses a lot of attention on “guidance” of theorem provers: methods that use various forms of generalization and learning to help ATP systems avoid running down too many dead ends before getting to the proofs or disproofs they’re looking for.
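As a toy illustration of what "guidance" buys you, here is a small best-first search in which a scoring function decides which candidate state to expand next. Nothing here comes from a real prover: the "inference rules" (from n derive n + 1 or 2 * n), the goal, and the hand-written distance score standing in for a learned guidance model are all hypothetical.

```python
import heapq
from itertools import count

def guided_search(start, goal, successors, score, max_expansions=10_000):
    """Best-first search: always expand the frontier state with the lowest score.

    Returns (path, expansions), or (None, expansions) if the budget runs out.
    """
    tie = count()  # tie-breaker so that equal scores are expanded in FIFO order
    frontier = [(score(start), next(tie), start, [start])]
    seen = {start}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, expansions
        expansions += 1
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (score(nxt), next(tie), nxt, path + [nxt]))
    return None, expansions

def successors(n):
    """Toy inference rules: from n one may derive n + 1 or 2 * n."""
    return [n + 1, 2 * n]

goal = 97
unguided = lambda n: 0            # constant score: degenerates to breadth-first search
guided = lambda n: abs(goal - n)  # hypothetical stand-in for a learned guidance model

path_u, cost_u = guided_search(1, goal, successors, unguided)
path_g, cost_g = guided_search(1, goal, successors, guided)
```

In this toy the guided run expands fewer states but returns a longer derivation than the breadth-first one, which loosely mirrors the real trade-off: guidance is about reaching some proof within budget, not necessarily the most elegant one.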


(It’s worth noting that, according to Hutter’s AIXI theory, AGI in general is a solved problem in a similar sense -- algorithms like AIXI^tl can in principle solve powerful formalizations of the AGI problem with simple code, given unrealistically much compute power.   However, the practical state of the art with non-AI-driven ATP systems exceeds that with AIXI^tl like AGI systems; i.e. brute-force-ish ATP systems guided by simple-ish heuristics can get more done than AIXI^tl like AGI systems currently can in most other domains.)


Compared to even a very complex game like Go, math is an extremely open-ended domain.  However, there are subsets of math that are more constrained and Go-like, and it seems plausible that methods roughly similar to those that worked for Go -- integrated appropriately into general-purpose ATP systems - could solve theorem proving in these domains.   (I’ll mention a couple of these below.)


The idea to use transformer neural nets for guiding ATP systems is not original with OpenAI.   I believe the first work in this direction was done by my son Zar’s PhD thesis advisor Josef Urban, one of the long-time leaders in the ATP field and the organizer of the annual AITP (AI for Theorem Proving) conference, which is being held this upcoming week in France with an online component as well.    Josef’s work from February 2020 appeared online in March and was published in July at CICM.  The ideas and links in the rest of this section draw heavily on some recent conversations I had with Josef.


When I asked Josef for a good example of "semantic nonsense" generated by transformer NNs in the theorem-proving domain, he pointed me to the proof at the top of p. 318 here.   Anyone who knows basic set theory and is willing to spend a few minutes focusing attention can confirm that this is an utter gibberish arrangement of simple inference steps.   Each of these steps would be sensible in some other context -- but to my eye they are radically out of place and senseless here, and clearly indicative that GPT is doing "the wrong sort of thing".   No human  math student would ever string together steps like this, unless they were randomly copying inference steps from proofs in their textbooks without paying attention to their meaning or relevance.


The work of Josef and his colleagues has touched a bunch of areas generally related to the GPT theorem proving work -- e.g. showing that neural nets with attentional mechanisms can help with premise selection, and showing that neural net based embedding vectors capturing limited aspects of proof semantics can help guide theorem proving. However more symbolic approaches have also shown promise, along with work on automatic creation of higher-order proof mechanisms like tactics and heuristics -- with no approach being a clear silver bullet yet.

 

It’s also interesting to reflect on the OpenAI team’s choice of the  Metamath corpus for their experiments -- this is a reasonable corpus to use, but it’s just one among a host of different corpora of theorems and proofs used in the ATP field.   Why choose that one in particular?   Compared to many alternatives, Metamath’s distinguishing characteristic is that it’s very “low level” -- i.e.  the inference steps in the proofs in Metamath’s formalism tend to be  very small, without e.g. use of more abstract "tactics" allowing bigger leaps.  This corpus is ideally suited to GPT-f’s  "brute force"-ish approach, the greatest strength of which is the ability to extrapolate from large numbers of combinations of micro-scale proof steps.

 

All in all, looking at the GPT-f work in the context of the recent thrust of work by Josef and his students, and the scope of papers presented at AITP-20, and work done on ATP on other corpora as well as Metamath -- one sees that GPT-f is one among a bunch of different approaches using various ML algorithms to guide theorem-provers, and one doesn’t see any clear sense in which GPT-f is the most promising avenue being explored.

 

In fact my own suspicion as an AI researcher is that the more exciting and interesting paths to making AI-guided ATP work lie entirely elsewhere.   

 

Math Reasoning, Scientific Reasoning, Commonsense Reasoning

 

I’m not really active in the AITP space currently, but my PhD was in math and it’s an area I’ve followed  with fascination for decades.   And I have kept up with the area relatively closely lately due to my son’s PhD work in the area -- as well as due to the close relation between AI for math theorem proving and certain aspects of what we’re doing in the OpenCog project as regards biological data interpretation and automated agent control.   

 

In OpenCog we are not currently concerned with making AIs that prove math theorems -- but we are concerned with making AIs that use theorem proving in probabilistic logics to do things like understand why certain combinations of genes tend to impact certain diseases, or figure out how an NPC in a Minecraft-like game should move blocks around to be able to get to some object it wants.   The formal problem of logical inference for commonsense and scientific reasoning is very similar to the formal problem of logical inference for mathematical reasoning -- we hit up against similar problems to the folks in the AITP community, and are taking closely related strategies in terms of using ML algorithms to guide the proof process.

 

Likely Strengths and Weaknesses of GPT-f


My strong suspicion is that with the GPT-f style approach to theorem-proving, there will be weak generalization to proofs/theorems that are qualitatively different than the ones in the training data.   The same will hold for any other approach that involves training ML models on low-level representations of proofs, without some sort of internal representation (engineered or learned) that corresponds to higher level structures like tactics, more abstract proof-patterns or concepts, etc.


It also seems likely that in some applications of theorem-proving these limitations will not matter as much as in others.   E.g.


  • formal verification of the smart contracts that actually occur in current blockchain systems

  • existence and uniqueness theorems for differential equations useful in practical modeling for everyday physical systems


would seem to be cases where the theorems and proofs are "all kinda similar" in a way that might make the GPT style of pattern recognition relatively effective.


However, if confronted with a theorem from a new domain of mathematics (say, if put in the position of Galois when inventing abstract algebra; or Weierstrass etc. when inventing real analysis), one would expect that a system learning patterns at this low level would not be able to perform much transfer learning, and would need a huge amount of time to bootstrap itself up to functionality in the new domain.


Similarly, if confronted with a theorem from a familiar domain that requires a counterintuitive sort of proof, I’d expect this sort of system would be unlikely to be able to find it.   E.g. the Fundamental Theorem of Algebra is different from the bulk of algebra theorems in that it's most conveniently proved via recourse to theorems from complex analysis; but if a GPT type theorem prover had been trained on algebra theorems with more traditional algebra-style proofs, it would be quite hard put to make the leap to try a proof involving complex analysis.  Of course most human mathematicians have a hard time making leaps like that as well -- but some are able to, and they do it via having abstract conceptual representations that bridge different areas of mathematics…


Sketch of a (Possibly) Better Way


Zar and I have mused a few times about doing something similar to the methodology taken in GPT-f and other similar systems, but with a measure of interestingness/ surprisingness in the loop.   I.e., to simplify a fair bit, what OpenAI has done here -- and what Josef did with his earlier related work with GPT-2 for ATP, and others have done with other ML systems in other prior work --  is


  • generate a bunch of theorems

  • have their theorem-prover prove those theorems

  • learn proof-patterns from these proofs

  • use these proof-patterns to make the theorem-prover more effective

  • lather, rinse, repeat


With this approach, each time around the cycle you can choose theorems that are at the edge of what your prover can currently do.
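The cycle above can be sketched as a short loop. This is a toy under loudly stated assumptions, not a description of how GPT-f or any real system is implemented: "theorems" are just integers, a "proof" is a single step (+1 or *2) from an already-known fact, and the learned "guidance" is nothing more than the set of facts proven so far. All function names are hypothetical.

```python
def generate_theorems(known):
    """Propose targets just past the edge of what has already been proven."""
    top = max(known)
    return range(top + 1, top + 4)

def prove(target, known):
    """Toy prover: succeeds if one step (+1 or *2) from a known fact reaches the target."""
    for fact in sorted(known):
        if fact + 1 == target or fact * 2 == target:
            return (fact, target)
    return None  # out of reach for now; may become provable in a later round

def learn_patterns(known, proofs):
    """Toy 'learning': remember every newly proven target for use as a future lemma."""
    return known | {target for _, target in proofs}

def training_loop(known, rounds=3):
    for _ in range(rounds):
        theorems = generate_theorems(known)            # generate a bunch of theorems
        attempts = (prove(t, known) for t in theorems) # have the prover try them
        proofs = [p for p in attempts if p is not None]
        known = learn_patterns(known, proofs)          # learn from the proofs found
    return known                                       # lather, rinse, repeat

result = training_loop({1})
```

Even in this cartoon one can see the "edge of competence" dynamic: targets that fail in one round become provable in the next once their prerequisites have been learned.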


What Zar and I have mused about doing (which is surely not original and has likely also been a desire of many others thinking about the area) is


  • generate a bunch of  *interesting* theorems (according to a formula for interestingness)

  • have our theorem-prover prove those theorems

  •  learn proof-patterns from these proofs

  • use these proof-patterns to make the theorem-prover more effective

  • lather, rinse, repeat


In fact I talked about this in my presentation at AITP last year https://goertzel.org/aitp-19/, with some elaborations on different ways of looking at “interestingness” in a mathematics context.


Among the multiple ways to assess "Interestingness" in this context, two critical ones are


  • surprisingness relative to the proofs and theorems already known to the system

  • utility as a lemma in proving other surprising theorems


("Surprisingness" can be measured information-theoretically in various ways, and there is plenty of subtlety here, as purely statistical information theory is a bit lame in this context yet algorithmic information theory in its full splendor is intractable, so one can concoct various intermediary measures.)
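One cheap intermediary measure of this kind uses an off-the-shelf compressor as a rough, computable proxy for algorithmic information: score a candidate theorem by how many extra compressed bytes it costs once the already-known corpus has been seen. The sketch below uses Python's zlib for this. The plain-text encoding of theorems and the example statements are assumptions made purely for illustration, and a real system would want a representation that respects logical structure and not raw character strings.

```python
import zlib

def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def surprisingness(candidate: str, known_corpus: str) -> int:
    """Extra compressed bytes needed to encode the candidate after the known corpus."""
    return compressed_size(known_corpus + "\n" + candidate) - compressed_size(known_corpus)

# Made-up corpus of already-known statements.
known_corpus = "\n".join([
    "a + b = b + a",
    "a * b = b * a",
    "(a + b) + c = a + (b + c)",
    "a * (b + c) = a * b + a * c",
])

familiar = "a + b = b + a"            # already in the corpus
novel = "x ^ (y ^ z) != (x ^ y) ^ z"  # new symbols and new structure

s_familiar = surprisingness(familiar, known_corpus)
s_novel = surprisingness(novel, known_corpus)
```

A measure like this rewards mere syntactic novelty, of course, which is exactly why it would need to be combined with something like the "utility as a lemma" criterion.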


Neither Zar nor I has proceeded with this kind of work so far -- due to having lots of other stuff on our plates, and also due to the computational resources and development time needed for this sort of thing (i.e. among other factors, we don't have a billion dollars from Microsoft... though of course this wouldn't really take remotely that many resources...)


The subtlety here is that if you're generating interesting theorems (where "interestingness" embodies a notion of compositionality, as is implied by the "utility as a lemma" aspect of the definition of interestingness hinted above), you're presumably generating theorems that involve some abstract representations and structures, rather than just algorithmically complex tangles of low-level mathematical operations.   


So the approach which includes interestingness in the loop, would seem more amenable to learning approaches that include higher-level tactics and other sorts of abstractions -- hence more amenable to learning approaches capable of significant transfer learning.


In short -- in terms of the challenge of automatically generating interesting new theorems, one suspects the GPT type approach is not much more likely to succeed than the proverbial army of monkeys at their typewriters, whereas an interestingness/abstraction driven approach would have much more hope.


The task of estimating the utility of a theorem as a lemma for other interesting theorems has a lot of overlap with the task of identifying useful proof-patterns from a corpus of proofs.  My own preferred approach to these tasks would involve importing the proofs into the OpenCog Atomspace (a weighted, labeled hypergraph knowledge store that we now are using for probabilistic commonsense and scientific inference) and then using a multi-paradigm AI approach combining neural graph embeddings, hypergraph pattern mining and probabilistic logical inference.   This leads to various fascinating recursions, including the potential use of the same surprisingness-based AI-for-ATP approach to accelerate the probabilistic logical inference involved.   But while this could be done using the current version of OpenCog, various issues with scalability and implementation awkwardness occur, and this has led some of my colleagues and me to put our focus recently on designing a radically more flexible and scalable version of OpenCog, OpenCog Hyperon.   But this now leads beyond the scope of this blog post…


Semi-Concluding Ramble…


Whether Hyperon will actually wind up being the silver bullet for automated theorem proving and other AGI-ish applications remains to be seen -- and it won’t be seen this year, as there is a monster amount of work to be done to make Hyperon a reality.   However, the point I want to make right now is that this would be a non-trivially different direction than what the AITP community is currently taking.   Josef Urban and others at the heart of the AITP field have the intuition that a more semantic approach to mining patterns from proofs and using them for proof guidance will be valuable -- but using Hyperon based probabilistic logic to represent and infer proof patterns would be a big leap in this direction, substantially different from what one sees in the AITP literature so far.


On the other hand, as I’ve emphasized above, GPT-f -- which does indeed work creditably well on the MetaMath corpus to which it has been applied -- is very much in the vein of what others in the AITP field have been doing for a while.   It’s really cool to see big companies get into the automated theorem proving space, which not long ago was more of a tiny obscure academic corner -- no doubt this is going to help accelerate progress in the field in multiple ways.  However, let’s not be under any illusion about where the main source of innovation and progress is in AITP -- at this stage it’s definitely not in the big tech companies. OpenAI may be better known than, say, Josef Urban's AITP research group, but there's no doubt who has made more contributions to AI progress. To make the breakthroughs needed to solve theorem-proving and other major AGI-ish challenges is going to require lots of free-flowing creativity, and quite possibly will emerge from the decentralized mess of university labs, open source projects and early-stage startups rather than the secretive, massive-data-and-processing-centric efforts of big tech firms.


Friday, July 31, 2020

GPT3 -- Super-Cool but Not a Path to AGI

The hype around GPT3 recently has been so much that even OpenAI founder/CEO Sam Altman has endeavored to dial it down a notch.  Like everyone else who has looked carefully, Altman knows that GPT3 is very far from constituting the profound AI progress that some, dazzled by exciting but cherry-picked examples, have proclaimed.


All but the most blurry-eyed enthusiasts are by now realizing that, while GPT3 has some truly novel and exciting capabilities for language processing and related tasks, it fundamentally doesn’t understand the language it generates — that is, it doesn’t know what it’s talking about.   And this fact places some severe limitations on both the practical applications of the GPT3 model, and its value as a stepping-stone toward more truly powerful AIs such as artificial general intelligences.


What I want to explore here is the most central limitation that I see in how GPT3 operates: the model’s apparent inability to do what cognitive scientists call symbol grounding, to appropriately connect the general to the particular.    


Symbol grounding is usually discussed in the context of grounding words in physical objects or percepts, like the grounding of the word "apple" in images of, or physical interactions with, apples.   But it's actually a more general phenomenon in which abstract symbols are related to concrete instances, and the patterns and instances in which the symbol is involved mirror and abstract the patterns and relationships in which the instances are involved.   Symbol grounding is key to general-purpose cognition, and human-like learning -- but GPT3 appears to be doing a form of learning very different from what humans are doing, which involves much less symbol grounding of all kinds, and which seems much less related to general intelligence.


What's a bit confusing at first is that GPT3 gives the appearance of being able to deal with both concrete and abstract ideas, because it can produce and respond to sentences at varying levels of abstraction.   But when you examine the details of what it’s doing, you can see that it’s usually not forming internal abstractions in a cognitively useful way, and not connecting its abstract ideas to their special cases in a sensible way.   


Phenomenal lameness regarding symbol grounding is not the only shortcoming of the GPT3 model, but it’s perhaps the largest one — and it hits at the key of why GPT3 does not constitute useful progress toward AGI.   Because the very crux  of general intelligence is the ability to generalize, i.e. to connect specifics to abstractions — and yet the failure to make these sorts of connections intrinsically and naturally is GPT3’s central failing.


Bigger and Biggerer


Transformer networks -- which burst onto the scene in 2017 with the Google research paper Attention is All You Need -- were a revolutionary advance in neural architectures for processing language or other sequential data.  GPT3 is an incremental step in the progress of transformer neural nets, one bringing some exciting new results and also some intriguing mixed messages. The essential difference between GPT3 and its predecessor GPT2 is simply scale: 175 billion parameters instead of GPT2’s 1.5 billion, trained on a much larger dataset drawn largely from a nearly-trillion-word web crawl.


Bragging about the number of parameters in one’s model is somewhat counter to the basic principles of learning theory, which tell us that the most generalizable model of a dataset is the smallest one that can model that dataset accurately.   However, one is after the smallest accurate model, not just the smallest model, and GPT3 is overall more accurate than GPT2.  So according to learning theory GPT3’s massive size can be forgiven -- but it should also make us wonder a bit about whether it is actually a step on the right path.
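One common way to make this trade-off precise is the two-part minimum description length (MDL) criterion. I offer it only as an illustrative gloss on "smallest accurate model", under my own choice of notation: $D$ is the dataset, $\mathcal{M}$ the set of candidate models, $L(M)$ the number of bits needed to describe model $M$, and $L(D \mid M)$ the number of bits needed to describe the data once the model is known (small when the model is accurate).

```latex
\hat{M} \;=\; \arg\min_{M \in \mathcal{M}} \bigl[\, L(M) + L(D \mid M) \,\bigr]
```

On this reading, a bigger model is justified only when the extra bits spent on $L(M)$ are more than repaid by a shorter $L(D \mid M)$.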


GPT3 is even more capable than GPT2 in terms of generating realistic-sounding text.  The biggest pragmatic difference from GPT2 is that, if one wants to make GPT3 generate particular sorts of text or generally carry out particular sorts of linguistic tasks, one doesn’t have to “fine tune” GPT3 for the task as one had to do with GPT2.   Rather, one just gives GPT3 a few examples of the task at hand, and it can figure things out.   It’s an open question currently whether one could improve GPT3’s performance even more using task-specific fine-tuning; OpenAI has not mentioned any results on this, and one suspects it may not have been tried extensively yet due to the sheer computational cost involved.


An example that’s been widely exciting to programmers is the generation of simple snippets of software code based on English language instructions.    If you give GPT3 a few examples of English text describing software code followed by corresponding software code, and then give it instructions like “A button that says roll dice and then displays its value” -- what do you get?   GPT3 spits out software code that actually will produce a button that does as specified.
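The few-shot setup described above amounts to plain string assembly. Here is a minimal Python sketch of it. The `Description:`/`Code:` labels, the helper name `build_few_shot_prompt` and the two example pairs are my own assumptions for illustration, not the prompt format used in the actual demo, and no call to any real model or API is shown.

```python
def build_few_shot_prompt(examples, instruction):
    """Concatenate (description, code) pairs, then the new description with its code left blank."""
    parts = []
    for description, code in examples:
        parts.append(f"Description: {description}\nCode: {code}\n")
    # The final entry has no code: the model is expected to continue the text from here.
    parts.append(f"Description: {instruction}\nCode:")
    return "\n".join(parts)


# Hypothetical example pairs, made up purely for illustration.
examples = [
    ("A button that says hello", "<button>hello</button>"),
    ("A heading that says Welcome", "<h1>Welcome</h1>"),
]
prompt = build_few_shot_prompt(
    examples, "A button that says roll dice and then displays its value"
)
print(prompt)
```

Nothing in the prompt tells the model what a button or a die is; it only sees a textual pattern to continue.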


The developer/entrepreneur Sharif Shameem, who posted this particular example, described it as “mind blowing.”   What is funky here is that GPT3 was not trained specifically for code generation.  This functionality just emerged because the model’s training data included a bunch of examples of software code and corresponding English glosses.   Prior neural networks could do code generation from English similarly and in many ways more sophisticatedly -- but they were trained especially for the task.


And the cool thing is, code generation is just one among a host of examples.  Translation and question answering are two others.   In good old fashioned computational linguistics, these were treated as separate tasks and addressed by separate systems.   GPT3 approaches them with a single training regimen and a single language model.


GPT3 Lacks Symbol Grounding


One thing that is amusing, annoying and instructive about GPT3’s code generation, however, is that it often does better at generating general-purpose software code than at dealing with specific examples of what its own code does.   For instance, as Kevin Lacker found, it can solve


Q: Write one line of Ruby code to reverse an array.

A: ary.reverse


but it screws up a specific example such as


Q: Reverse the following array: [1, 3, 5, 6, 10, 4, 2, 77]

A: [10, 6, 4, 2, 77, 3, 5, 1]


Very few humans would make this sort of error — because a human generally learns how to use a programming language to reverse an array after they have learned what reversing a particular array actually means.  


But GPT3 has learned how to write code to reverse an array in a very different way — via learning complex patterns mapping between English syntax and programming-language syntax, without actually building an internal model of the data structures such as arrays that its programs are manipulating.   


This exemplifies the general fact that GPT3 is sorely lacking in symbol grounding -- the ability to identify or create concrete references to the words and phrases it throws around.   In these programming examples, it does not appropriately connect the word “array” to the specific examples of arrays it sees.
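To make the gap concrete, here is a small Python sketch that checks the answer quoted above against the actual reversal. The array and GPT3's answer are copied from the example; the variable names are mine, and Python's slice `[::-1]` stands in for Ruby's `ary.reverse`.

```python
# The array from the question and the answer GPT3 gave, both quoted from the example above.
array = [1, 3, 5, 6, 10, 4, 2, 77]
gpt3_answer = [10, 6, 4, 2, 77, 3, 5, 1]

# Python's analogue of Ruby's `ary.reverse`: a reversed copy of the list.
correct = array[::-1]

print(correct)                                # [77, 2, 4, 10, 6, 5, 3, 1]
print(gpt3_answer == correct)                 # False
# GPT3's answer does contain exactly the right elements, just in the wrong order.
print(sorted(gpt3_answer) == sorted(array))   # True
```

So the model produces something that looks like a shuffled version of the input, which is what one might expect from surface-level pattern matching without a model of what "reverse" does to a particular array.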


This lack of grounding obviously limits GPT3’s use in a practical programming context.   Since GPT3 does not know what a button is or what dice are, it certainly can’t design a program involving buttons and dice based on practical end-user requirements.   And if you give GPT3 a complex real-world software specification -- which will inevitably involve multiple lengthy sentences with complex clausal structure and cross-sentence references and so forth, which cannot be disambiguated without symbol grounding -- it will not in general be able to generate correct code.

 


GPT3 Utterly Lacks Commonsense Understanding


In the end, Gary Marcus’s classic and in-depth critique of GPT2’s lack of semantic and pragmatic understanding still applies to GPT3.    These systems have no idea what they are talking about.  


We can see the lack of understanding Marcus identified in GPT2 manifesting itself in GPT3 via looking at GPT3’s answers to commonsense-reasoning questions — to give just a few of Kevin Lacker’s numerous examples:



Q: If I have a marble and a paper clip in a box, put a pencil in the box, and remove the marble, what is left?

A: A paper clip.


Q: How many eyes does a giraffe have?

A: A giraffe has two eyes.


Q: How many eyes does my foot have?

A: Your foot has two eyes.


Q: How many eyes does a spider have?

A: A spider has eight eyes.


Q: How many eyes does a blade of grass have?

A: A blade of grass has one eye.


Q: How do you sporgle a morgle?

A: You sporgle a morgle by using a sporgle.


Q: How many bonks are in a quoit?

A: There are three bonks in a quoit.


Q: How many rainbows does it take to jump from Hawaii to seventeen?

A: It takes two rainbows to jump from Hawaii to seventeen.


Q: Do you understand these questions?

A: I understand these questions.


Q: Who was president of the United States in 1700?

A: William Penn was president of the United States in 1700.


Q: Who was president of the United States in 1600?

A: Queen Elizabeth I was president of the United States in 1600.



Given the utter lack of understanding we see here, GPT3’s practical applications are going to be somewhat limited — though they may still include some exciting and lucrative stuff.  Low hanging fruit might include amusing chatbots, NPCs for games, automatic generation of news articles from semi-structured data, generation of simple scripts and macros from natural language — and probably plenty more that isn’t obvious at first glance.  But clearly the vast majority of human job functions that require natural language use are far beyond GPT3’s reach — because they require not just facile stringing together of words, but actual understanding of what those words denote.


Without discounting the potential commercial or human value of some of these possibilities, when I look at GPT3 with my AGI researcher hat on, what I see is the same dead end that Gary Marcus saw when he looked at GPT2.


Where Lack of Understanding is an Advantage


What is thought-provoking and disturbing about GPT3 is not any progress toward AGI that it represents, but rather just how fantastically it can simulate understanding on appropriate task-sets without actually having any.   


In a few cases GPT3’s lack of understanding of the words it’s manipulating gives it an advantage over humans.   Consider for instance GPT3’s wizardry with invented words, as reported in the GPT3 paper.  Given the example


A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses

the word whatpu is:

We were traveling in Africa and we saw these very cute whatpus.


and then the prompt


To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:


GPT3 can come up with


One day when I was playing tag with my little sister, she got really excited and she

started doing these crazy farduddles.


This is really cool and amazing — but GPT3 is doing this simply by recognizing patterns in the syntactic structure and phraseology of the input about whatpus, and then generalizing these.  It is solving these invented word puzzles not by adding the new weird words to its vocabulary of ideas and then figuring out what to say about them, but rather by manipulating the word combination patterns involved, which are the same on the word-sequence level regardless of whether the words involved are weird new coinages or conventional.  


For a human to solve these puzzles, there is a bit of a mental obstacle to overcome, because humans are accustomed to manipulating words in the context of their groundings in external referents like objects, actions or ideas.   For GPT3 these puzzles are trivial because there are no obstacles to overcome — one realizes that GPT3 treats every word the same way that people treat whatpu or farduddle, as an arbitrary combination of letters contained in certain statistically semi-regular combinations with other words. 


Why GPT3 is a Dead End as Regards AGI


There are many potential directions to follow in pursuit of the grand goal of human-level and superhuman AGI.   Some of these directions are centered on creating fundamentally different, better deep neural net architectures.  Some, like Gary Marcus’s and my own projects, involve multiple AI algorithms of different sorts cooperating together.   Some are focused on fundamental innovations in knowledge representation or learning mechanisms.   The AGI conferences held every year since 2008 have encompassed discussion of a vast variety of approaches.   


In the context of AGI (as distinguished from computational linguistics or applied AI engineering), a system like GPT3 that takes an architecture obviously incapable of human-level AGI and simply scales it up by adding more and more parameters, is either an utter irrelevancy or a dangerous distraction.   It’s an irrelevancy if nobody claims it’s related to AGI, and it’s a distraction if people do — which unfortunately has recently been the case, at least in various corners of popular media and the Internet.


The limitations of this sort of approach are easily seen when one looks at the overly-ballyhooed capabilities of GPT3 to do arithmetic.   It is exciting and impressive that GPT3 learned to do some basic arithmetic without being explicitly trained or asked to do so — just because there were a bunch of arithmetic problems in its training set.   However, the limitations and peculiarities of its arithmetic capabilities also tell you a lot about how GPT3 is working inside, and its fundamental lack of understanding.


As the GPT3 paper says, the system is “able to reliably accurate 2 digit arithmetic, usually accurate 3 digit arithmetic, and correct answers a significant fraction of the time on 4-5 digit arithmetic.”   The associated graph shows that the accuracy on 4-5 digit arithmetic is around 20%.  


This is really, really weird in terms of the way human minds approach arithmetic, right?   For a human who knows how to do 2-3 digit arithmetic, the error rate at 4-5 digit arithmetic -- when given time and motivation for doing the arithmetic problems -- is going to be either 0% or very close to 0%, or else way closer to 100%.   Once a human learns the basic algorithms of arithmetic, they can apply them at any size, unless they make sloppy errors or just run out of patience.    If a human doesn’t know those basic algorithms, then on a timed test they’re going to get every problem wrong, unless they happen to get a small number right by chance.
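For concreteness, here is the grade-school addition procedure written out as a short Python sketch. The function name `add_digit_strings` and the choice to represent numbers as decimal strings are my own; the point is only that the same few lines handle 2 digits or 40 digits equally well, because the procedure works column by column with a carry.

```python
def add_digit_strings(a, b):
    """Add two non-negative integers given as decimal strings, column by column with carry."""
    i, j = len(a) - 1, len(b) - 1   # start at the rightmost digit of each number
    carry = 0
    out = []
    while i >= 0 or j >= 0 or carry:
        da = int(a[i]) if i >= 0 else 0   # treat a missing digit as 0
        db = int(b[j]) if j >= 0 else 0
        carry, digit = divmod(da + db + carry, 10)
        out.append(str(digit))
        i -= 1
        j -= 1
    return "".join(reversed(out)) or "0"


print(add_digit_strings("47", "85"))         # 132
print(add_digit_strings("98765", "43219"))   # 141984
```

A learner that has acquired this procedure has an error rate that does not depend on the number of digits, which is exactly the pattern GPT3's accuracy curve fails to show.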


Some other clues as to the strangeness of what’s going on here are that, for large numbers, GPT3 does better at arithmetic if commas are put into the numbers.   For numbers with fewer than 6 digits, putting a $ before the number along with including commas improves performance; but for numbers with more than 6 digits, the $ degrades performance.


GPT3 seems not to be just repeating arithmetic conclusions that were there in its training data — it is evidently doing some kind of learning.   But it’s obviously not learning the basic arithmetic algorithms that humans do — or that, say, an AI system doing automated program induction would learn, if it were posed the task of learning correct  arithmetic procedures from examples.   Nor is it learning alternative AI-friendly algorithms that actually work (which would be very interesting!).  Rather, it’s learning some sort of convoluted semi-generalized procedures for doing arithmetic, which interpolate between the numerous examples it’s seen, but yet without achieving a generalizable abstract representation of the numbers and arithmetic operators involved.


Clearly GPT3 is just not learning the appropriate abstractions underlying arithmetic.   It can memorize specific examples, and can abstract from them to some extent — but if its abstractions connected to its specific examples in the right way, then its accuracy would be far higher.   In the case of arithmetic, GPT3 is learning the wrong kinds of abstractions.   One certainly can’t blame the algorithm in this case, as it was not specifically trained to do math and just picked up its limited arithmetic ability casually on the way to learning to predict English language.   However, for a system capable of so many sophisticated things as GPT3 to fail to learn a procedure as simple as the standard process for integer addition, based on such a huge number of training examples of integer addition, very strongly suggests that GPT3 is not learning abstractions in an appropriate or intelligent way.


Clearly some valuable linguistic tasks can be done without sensible abstraction, given massive enough volumes of training data and a model with enough parameters.  This is because in a trillion words of text one finds a huge number of examples of both abstract and concrete linguistic expressions in various combinations, enough to enable simulation of a wide variety of examples of both abstract and concrete understanding.   But this sort of brute-force recognition and organization of surface-level patterns doesn’t work for math beyond the most trivial level.


There is a whole field of AI aimed at automating mathematics, and a subfield concerned with using machine learning to guide systems that do calculations and prove theorems.   But the successful systems here have explicit internal representations of mathematical structures -- they don’t deal with math purely on the level of symbol co-occurrences.


OK, so maybe GPT4 will do arithmetic even better?   But the GPT3 paper itself (e.g. Fig. 1.3) shows that the improvement of the GPT models on various NLP tasks has been linear as the number of parameters in the models has increased exponentially.   This is a strong indication that one is looking at an unsupportable path toward general intelligence, or even toward maximal narrow-AI NLP functionality — that, in terms of the pursuit of models that are accurate and also as compact as possible, the dial is probably being turned too far toward accuracy on the training data and too far away from compactness.
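A back-of-the-envelope Python sketch shows why linear gains from exponential growth are so costly. It assumes, purely for illustration, that every tenfold increase in parameters adds a fixed number of accuracy points; the function name and the specific numbers below are hypothetical and are not taken from the GPT3 paper.

```python
def parameter_multiplier(desired_gain, gain_per_tenfold):
    """How many times larger the model must be to gain `desired_gain` points,
    assuming each 10x increase in parameters adds `gain_per_tenfold` points."""
    return 10 ** (desired_gain / gain_per_tenfold)


# Hypothetical figures: suppose each 10x in parameters buys 5 accuracy points.
print(parameter_multiplier(5, 5))    # 10.0    -> 10x the parameters for +5 points
print(parameter_multiplier(20, 5))   # 10000.0 -> 10,000x the parameters for +20 points
```

Under that assumption, closing even a modest remaining accuracy gap by scale alone multiplies model size by orders of magnitude.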


Are Transformers Learning Natural Language Grammar?


A different way to look at what is happening here is to ask whether GPT3 and other transformer networks are actually learning the grammar of English and other natural languages.

Transformers clearly ARE a full grammar learning architecture, in some sense -- their predictions display a quite nuanced understanding of almost all aspects of syntax.    

There is, however, no specific place in these networks where the rules of grammar lie.   Rather, they are learning the grammar of the language underlying their training corpus, but mixed up in a weird and non-human-like way with so many particulars of the corpus.   And this in itself is not a bad thing -- holistic, distributed representations are how large parts of the human brain-mind work, and have various advantages in terms of memory retrieval and learning.

Humans also learn the grammar of their natural languages mixed up with the particulars of the linguistic constructs they've encountered.  But the "subtle" point here is that the mixing-up of abstract grammatical patterns with concrete usage patterns in human minds is of a different nature than the mixing-up of abstract grammatical patterns with concrete usage patterns in GPT3 and other transformer networks.   The human form of mixing-up is more amenable to appropriate generalization.

In our paper at the AGI-20 conference, Andres Suarez and I gave some prototype results from our work using BERT (an earlier transformer neural net model for predicting language) to guide a symbolic grammar rule learner.   These simple results also don't get us to AGI, but I believe they embody some key aspects that aren't there in GPT3 or similar networks -- the explicit manipulation of abstractions, coupled appropriately with a scalable probabilistic model of large volumes of concrete data.   In our prototype hybrid architecture there is a cognitively sensible grounding and inheritance relationship between abstract linguistic patterns and concrete linguistic patterns.   This sort of grounding is what's there in the way human minds mix up abstract grammatical patterns with low-level experience-specific linguistic patterns, and it's a substantial part of what's missing in GPT3.


Toward AGI via Scale or Innovation (or Both?)


Taking a step back and reflecting on the strengths and weaknesses of the GPT3 approach, one has to wonder why this is such an interesting region of AI space to be throwing so many resources into.   


To put it a little differently: Out of all the possible approaches to building better and smarter AI systems, why do we as a society want to be putting so much emphasis on approaches that … can only be pursued with full force by a handful of huge tech companies?   Why do we want the brainpower of the global AI R&D community to get turned toward AI approaches that require exponential increases in compute power to yield linear improvements?   Could this be somehow to the differential economic advantage of those who own the biggest server farms and have the largest concentration of engineers capable of customizing AI systems for them?


Given all the ridiculous wastes of resources in modern society, it’s hard to get too outraged at the funds spent on GPT3, which is for all its egregious weaknesses an amazingly cool achievement.   However, if one focuses on the fairly limited pool of resources currently being spent on advanced AI systems without direct commercial application, one wonders whether we’d be better off to focus more of this pool on fundamental innovations in representation, architecture, learning, creativity, empathy and human-computer interaction, rather than on scaling up transformers bigger and bigger.   


OpenAI has generally been associated with the view that fundamental advances toward AGI can be made by taking existing algorithms and scaling them up on bigger and bigger hardware and more and more data.  I don’t think GPT3 supports this perspective; rather the opposite.   Possibly GPT3 can be an interesting resource for an AGI system to use in accelerating its learning, but the direct implications of GPT3 regarding AGI are mostly negative in valence. GPT3 reinforces the obvious lesson that just adding a massive number of parameters to a system with no fundamental capability for understanding … will yield a system that can do some additional cool tricks, but still has no fundamental capability for understanding.


It's easy to see where the OpenAI founders would get the idea that scale is the ultimate key to AI.   In recent years we have seen a variety of neural net algorithms that have been around for decades suddenly accomplish amazing things, mostly just by being run on more and faster processors with more RAM.   But for every given class of algorithms, increasing scale reaches a point of diminishing returns.   GPT3 may well not yet represent the point of diminishing returns for GPT type architectures, in terms of performance on some linguistic tasks.  But I believe it is well past the point of diminishing returns in terms of squeezing bits and pieces of fundamental understanding out of transformer neural nets.


The viable paths to robust AGI and profoundly beneficial AI systems lie in wholly different directions than systems like GPT3 that use tremendous compute power to compensate for their inability to learn appropriate abstractions and ground them in concrete examples.   AGI will require systems capable of robust symbol grounding, of understanding what the program code they generate does in specific cases, of doing mathematical computations far beyond the examples they have seen, of treating words with rich non-linguistic referents differently from nonsense coinages.


These systems may end up requiring massive compute resources as well in order to achieve powerful AGI, but they will use these resources very differently from GPT3 and its ilk.   And the creativity needed to evolve such systems may well emerge from research involving a decentralized R&D community working on a variety of more compact AI systems, rather than pushing as fast as possible toward the most aggressive possible use of big money and big compute.