The hype around GPT3 has recently grown so intense that even OpenAI founder/CEO Sam Altman has endeavored to dial it down a notch. Like everyone else who has looked carefully, Altman knows that GPT3 is very far from constituting the profound AI progress that some, dazzled by exciting but cherry-picked examples, have proclaimed.
All but the most starry-eyed enthusiasts are by now realizing that, while GPT3 has some truly novel and exciting capabilities for language processing and related tasks, it fundamentally doesn’t understand the language it generates; that is, it doesn’t know what it’s talking about. And this fact places some severe limitations on both the practical applications of the GPT3 model and its value as a stepping-stone toward more truly powerful AIs such as artificial general intelligences.
What I want to explore here is the most central limitation that I see in how GPT3 operates: the model’s apparent inability to do what cognitive scientists call symbol grounding, to appropriately connect the general to the particular.
Symbol grounding is usually discussed in the context of grounding words in physical objects or percepts, like the grounding of the word "apple" in images of, or physical interactions with, apples. But it's actually a more general phenomenon in which abstract symbols are related to concrete instances, and the patterns and instances in which the symbol is involved mirror and abstract the patterns and relationships in which the instances are involved. Symbol grounding is key to general-purpose cognition, and human-like learning -- but GPT3 appears to be doing a form of learning very different from what humans are doing, which involves much less symbol grounding of all kinds, and which seems much less related to general intelligence.
What's a bit confusing at first is that GPT3 gives the appearance of being able to deal with both concrete and abstract ideas, because it can produce and respond to sentences at varying levels of abstraction. But when you examine the details of what it’s doing, you can see that it’s usually not forming internal abstractions in a cognitively useful way, and not connecting its abstract ideas to their special cases in a sensible way.
Phenomenal lameness regarding symbol grounding is not the only shortcoming of the GPT3 model, but it’s perhaps the largest one, and it gets at the heart of why GPT3 does not constitute useful progress toward AGI. The very crux of general intelligence is the ability to generalize, i.e. to connect specifics to abstractions, and yet the failure to make these sorts of connections intrinsically and naturally is GPT3’s central failing.
Bigger and Biggerer
Transformer networks, which burst onto the scene in 2017 with the Google research paper Attention Is All You Need, were a revolutionary advance in neural architectures for processing language or other sequential data. GPT3 is an incremental step in the progress of transformer neural nets, one bringing some exciting new results and also some intriguing mixed messages. The essential difference between GPT3 and its predecessor GPT2 is simply the size of the model: 175 billion parameters instead of GPT2’s 1.5 billion, trained on a much larger, nearly-trillion-word dataset.
Bragging about the number of parameters in one’s model is somewhat counter to the basic principles of learning theory, which tell us that the most generalizable model of a dataset is the smallest one that can model that dataset accurately. However, one is after the smallest accurate model, not just the smallest model, and GPT3 is overall more accurate than GPT2. So according to learning theory GPT3’s massive size can be forgiven, but it should also make us wonder a bit about whether it is actually a step on the right path.
GPT3 is even more capable than GPT2 in terms of generating realistic-sounding text. The biggest pragmatic difference from GPT2 is that, if one wants to make GPT3 generate particular sorts of text or generally carry out particular sorts of linguistic tasks, one doesn’t have to “fine tune” GPT3 for the task as one had to do with GPT2. Rather, one just gives GPT3 a few examples of the task at hand, and it can figure things out. It’s an open question currently whether one could improve GPT3’s performance even more using task-specific fine-tuning; OpenAI has not mentioned any results on this, and one suspects it may not have been tried extensively yet due to the sheer computational cost involved.
An example that’s been widely exciting to programmers is the generation of simple snippets of software code based on English language instructions. If you give GPT3 a few examples of English text describing software code followed by corresponding software code, and then give it instructions like “A button that says roll dice and then displays its value”, what do you get? GPT3 spits out software code that actually will produce a button that does as specified.
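To make the few-shot setup concrete, here is a minimal Python sketch of how such a prompt might be assembled. The demonstration pairs, the "description:" and "code:" markers, and the `build_few_shot_prompt` helper are all hypothetical illustrations, not the prompt used in the demo described here; the resulting string would simply be handed to the model as text to continue.

```python
# Hypothetical few-shot prompt builder: a sketch, not the prompt from any real demo.

def build_few_shot_prompt(examples, request):
    """Join (description, code) pairs, then append the new description with an empty code slot."""
    parts = []
    for description, code in examples:
        parts.append(f"description: {description}\ncode: {code}\n")
    # The model is expected to continue the text after the final "code:" marker.
    parts.append(f"description: {request}\ncode:")
    return "\n".join(parts)


# Made-up demonstration pairs (assumptions for illustration only).
EXAMPLES = [
    ("a red button that says stop",
     '<button style="color: red">Stop</button>'),
    ("a heading that says hello",
     "<h1>Hello</h1>"),
]

prompt = build_few_shot_prompt(
    EXAMPLES, "a button that says roll dice and then displays its value"
)
print(prompt)
```

Nothing in this setup retrains the model: the examples and the new request are just one longer piece of text, which is what distinguishes this usage from the fine-tuning GPT2 required.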
The developer/entrepreneur Sharif Shameem, who posted this particular example, described it as “mind blowing.” What is funky here is that GPT3 was not trained specifically for code generation. This functionality just emerged because the model’s training data included a bunch of examples of software code and corresponding English glosses. Prior neural networks could do code generation from English similarly, and in many ways with more sophistication, but they were trained especially for the task.
And the cool thing is, code generation is just one among a host of examples. Translation and question answering are two others. In good old fashioned computational linguistics, these were treated as separate tasks and addressed by separate systems. GPT3 approaches them with a single training regimen and a single language model.
GPT3 Lacks Symbol Grounding
One thing that is amusing, annoying and instructive about GPT3’s code generation, however, is that it often does better at generating general-purpose software code than at dealing with specific examples of what its own code does. For instance, as Kevin Lacker found, it can solve
Q: Write one line of Ruby code to reverse an array.
A: ary.reverse
but it screws up a specific example such as
Q: Reverse the following array: [1, 3, 5, 6, 10, 4, 2, 77]
A: [10, 6, 4, 2, 77, 3, 5, 1]
Very few humans would make this sort of error — because a human generally learns how to use a programming language to reverse an array after they have learned what reversing a particular array actually means.
But GPT3 has learned how to write code to reverse an array in a very different way — via learning complex patterns mapping between English syntax and programming-language syntax, without actually building an internal model of the data structures such as arrays that its programs are manipulating.
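The gap between the two answers is easy to check mechanically. The short Python sketch below uses list slicing as a stand-in for Ruby's `ary.reverse`; the helper name `is_reversal` is an illustrative choice, and the two lists are copied from the question and answer quoted above.

```python
# The array from the question above, and the answer GPT3 gave for it.
original = [1, 3, 5, 6, 10, 4, 2, 77]
gpt3_answer = [10, 6, 4, 2, 77, 3, 5, 1]

# Python's analogue of Ruby's ary.reverse: a reversed copy via slicing.
expected = original[::-1]


def is_reversal(candidate, source):
    """True only if candidate is exactly source read back to front."""
    return candidate == source[::-1]


print(expected)                                 # [77, 2, 4, 10, 6, 5, 3, 1]
print(is_reversal(gpt3_answer, original))       # False
print(sorted(gpt3_answer) == sorted(original))  # True: right elements, wrong order
```

The quoted answer contains exactly the right elements in the wrong order, which is what one would expect from a system that has picked up the surface form of a reversed list without a model of what reversal does to a particular array.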
This exemplifies the general fact that GPT3 is sorely lacking in symbol grounding: the ability to identify or create concrete references to the words and phrases it throws around. In these programming examples, it does not appropriately connect the word “array” to the specific examples of arrays it sees.
This lack of grounding obviously limits GPT3’s use in a practical programming context. Since GPT3 does not know what a button is or what dice are, it certainly can’t design a program involving buttons and dice based on practical end-user requirements. And if you give GPT3 a complex real-world software specification (which will inevitably involve multiple lengthy sentences with complex clausal structure and cross-sentence references and so forth, which cannot be disambiguated without symbol grounding), it will not in general be able to generate correct code.
GPT3 Utterly Lacks Commonsense Understanding
In the end, Gary Marcus’s classic and in-depth critique of GPT2’s lack of semantic and pragmatic understanding still applies to GPT3. These systems have no idea what they are talking about.
We can see the lack of understanding Marcus identified in GPT2 manifesting itself in GPT3 via looking at GPT3’s answers to commonsense-reasoning questions — to give just a few of Kevin Lacker’s numerous examples:
Q: If I have a marble and a paper clip in a box, put a pencil in the box, and remove the marble, what is left?
A: A paper clip.
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many eyes does my foot have?
A: Your foot has two eyes.
Q: How many eyes does a spider have?
A: A spider has eight eyes.
Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.
Q: How do you sporgle a morgle?
A: You sporgle a morgle by using a sporgle.
Q: How many bonks are in a quoit?
A: There are three bonks in a quoit.
Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: It takes two rainbows to jump from Hawaii to seventeen.
Q: Do you understand these questions?
A: I understand these questions.
Q: Who was president of the United States in 1700?
A: William Penn was president of the United States in 1700.
Q: Who was president of the United States in 1600?
A: Queen Elizabeth I was president of the United States in 1600.
Given the utter lack of understanding we see here, GPT3’s practical applications are going to be somewhat limited — though they may still include some exciting and lucrative stuff. Low hanging fruit might include amusing chatbots, NPCs for games, automatic generation of news articles from semi-structured data, generation of simple scripts and macros from natural language — and probably plenty more that isn’t obvious at first glance. But clearly the vast majority of human job functions that require natural language use are far beyond GPT3’s reach — because they require not just facile stringing together of words, but actual understanding of what those words denote.
Without discounting the potential commercial or human value of some of these possibilities, if I look at GPT3 with my AGI researcher hat on, what I see is the same dead end that Gary Marcus saw when he looked at GPT2.
Where Lack of Understanding is an Advantage
What is thought-provoking and disturbing about GPT3 is not any progress toward AGI that it represents, but rather just how fantastically it can simulate understanding on appropriate task-sets without actually having any.
In a few cases GPT3’s lack of understanding of the words it’s manipulating gives it an advantage over humans. Consider for instance GPT3’s wizardry with invented words, as reported in the GPT3 paper. Given the example
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
and then the prompt
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
GPT3 can come up with
One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles.
This is really cool and amazing — but GPT3 is doing this simply by recognizing patterns in the syntactic structure and phraseology of the input about whatpus, and then generalizing these. It is solving these invented word puzzles not by adding the new weird words to its vocabulary of ideas and then figuring out what to say about them, but rather by manipulating the word combination patterns involved, which are the same on the word-sequence level regardless of whether the words involved are weird new coinages or conventional.
For a human to solve these puzzles, there is a bit of a mental obstacle to overcome, because humans are accustomed to manipulating words in the context of their groundings in external referents like objects, actions or ideas. For GPT3 these puzzles are trivial because there are no obstacles to overcome — one realizes that GPT3 treats every word the same way that people treat whatpu or farduddle, as an arbitrary combination of letters contained in certain statistically semi-regular combinations with other words.
Why GPT3 is a Dead End as Regards AGI
There are many potential directions to follow in pursuit of the grand goal of human-level and superhuman AGI. Some of these directions are centered on creating fundamentally different, better deep neural net architectures. Some, like Gary Marcus’s and my own projects, involve multiple AI algorithms of different sorts cooperating together. Some are focused on fundamental innovations in knowledge representation or learning mechanisms. The AGI conferences held every year since 2008 have encompassed discussion of a vast variety of approaches.
In the context of AGI (as distinguished from computational linguistics or applied AI engineering), a system like GPT3 that takes an architecture obviously incapable of human-level AGI and simply scales it up by adding more and more parameters, is either an utter irrelevancy or a dangerous distraction. It’s an irrelevancy if nobody claims it’s related to AGI, and it’s a distraction if people do — which unfortunately has recently been the case, at least in various corners of popular media and the Internet.
The limitations of this sort of approach are easily seen when one looks at the overly-ballyhooed capabilities of GPT3 to do arithmetic. It is exciting and impressive that GPT3 learned to do some basic arithmetic without being explicitly trained or asked to do so — just because there were a bunch of arithmetic problems in its training set. However, the limitations and peculiarities of its arithmetic capabilities also tell you a lot about how GPT3 is working inside, and its fundamental lack of understanding.
As the GPT3 paper says, the system achieves “reliably accurate 2 digit arithmetic, usually accurate 3 digit arithmetic, and correct answers a significant fraction of the time on 4-5 digit arithmetic.” The associated graph shows that the accuracy on 4-5 digit arithmetic is around 20%.
This is really, really weird in terms of the way human minds approach arithmetic, right? For a human who knows how to do 2-3 digit arithmetic, the error rate at 4-5 digit arithmetic, when given time and motivation for doing the arithmetic problems, is going to be either 0% or very close to 0%, or else way closer to 100%. Once a human learns the basic algorithms of arithmetic, they can apply them at any size, unless they make sloppy errors or just run out of patience. If a human doesn’t know those basic algorithms, then on a timed test they’re going to get every problem wrong, unless they happen to get a small number right by chance.
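The contrast drawn here rests on the fact that the schoolbook procedure does not depend on how long the numbers are. A minimal Python sketch of column-by-column addition with carrying follows; the function name `add_digit_strings` is an illustrative choice, and the code stands in for what a person does on paper rather than for anything inside GPT3.

```python
def add_digit_strings(a: str, b: str) -> str:
    """Add two non-negative integers given as decimal strings, column by column with a carry."""
    result = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1
    # Walk from the rightmost column to the left until both numbers and the carry are used up.
    while i >= 0 or j >= 0 or carry:
        digit_a = int(a[i]) if i >= 0 else 0
        digit_b = int(b[j]) if j >= 0 else 0
        carry, digit = divmod(digit_a + digit_b + carry, 10)
        result.append(str(digit))
        i -= 1
        j -= 1
    # Digits were collected right to left, so reverse them for the final answer.
    return "".join(reversed(result)) or "0"


print(add_digit_strings("48", "75"))       # 123
print(add_digit_strings("98765", "4321"))  # 103086
```

Because the loop simply walks over however many digits are present, the same procedure that handles two digits handles twenty, which is the sense in which a learned algorithm applies at any size.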
Some other clues as to the strangeness of what’s going on here are that, for large numbers, GPT3 does better at arithmetic if commas are put into the numbers. For numbers with fewer than 6 digits, putting a $ before the number along with including commas improves performance; but for numbers with more than 6 digits, the $ degrades performance.
GPT3 seems not to be just repeating arithmetic conclusions that were there in its training data — it is evidently doing some kind of learning. But it’s obviously not learning the basic arithmetic algorithms that humans do — or that, say, an AI system doing automated program induction would learn, if it were posed the task of learning correct arithmetic procedures from examples. Nor is it learning alternative AI-friendly algorithms that actually work (which would be very interesting!). Rather, it’s learning some sort of convoluted semi-generalized procedures for doing arithmetic, which interpolate between the numerous examples it’s seen, but yet without achieving a generalizable abstract representation of the numbers and arithmetic operators involved.
Clearly GPT3 is just not learning the appropriate abstractions underlying arithmetic. It can memorize specific examples, and can abstract from them to some extent — but if its abstractions connected to its specific examples in the right way, then its accuracy would be far higher. In the case of arithmetic, GPT3 is learning the wrong kinds of abstractions. One certainly can’t blame the algorithm in this case, as it was not specifically trained to do math and just picked up its limited arithmetic ability casually on the way to learning to predict English language. However, for a system capable of so many sophisticated things as GPT3 to fail to learn a procedure as simple as the standard process for integer addition, based on such a huge number of training examples of integer addition, very strongly suggests that GPT3 is not learning abstractions in an appropriate or intelligent way.
Clearly some valuable linguistic tasks can be done without sensible abstraction, given massive enough volumes of training data and a model with enough parameters. This is because in a trillion words of text one finds a huge number of examples of both abstract and concrete linguistic expressions in various combinations, enough to enable simulation of a wide variety of examples of both abstract and concrete understanding. But this sort of brute-force recognition and organization of surface-level patterns doesn’t work for math beyond the most trivial level.
There is a whole field of AI aimed at automating mathematics, and a subfield concerned with using machine learning to guide systems that do calculations and prove theorems. But the successful systems here have explicit internal representations of mathematical structures; they don’t deal with math purely on the level of symbol co-occurrences.
OK, so maybe GPT4 will do arithmetic even better? But the GPT3 paper itself (e.g. Fig. 1.3) shows that the improvement of the GPT models on various NLP tasks has been linear as the number of parameters in the models has increased exponentially. This is a strong indication that one is looking at an unsupportable path toward general intelligence, or even toward maximal narrow-AI NLP functionality — that, in terms of the pursuit of models that are accurate and also as compact as possible, the dial is probably being turned too far toward accuracy on the training data and too far away from compactness.
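The arithmetic behind this concern can be sketched in a few lines of Python. The model below is an assumption made purely for illustration: it supposes that a task score rises by a fixed number of points for every tenfold increase in parameter count. The slope of 5 points per tenfold increase is a made-up number, not a figure taken from the GPT3 paper.

```python
def parameter_multiplier(points_gained: float, points_per_tenfold: float) -> float:
    """Factor by which the parameter count must grow to gain `points_gained`,
    assuming score = constant + points_per_tenfold * log10(parameters)."""
    return 10 ** (points_gained / points_per_tenfold)


# Hypothetical slope: 5 points of score per tenfold increase in parameters.
SLOPE = 5.0

for gain in (5, 10, 20):
    factor = parameter_multiplier(gain, SLOPE)
    print(f"+{gain} points needs about {factor:,.0f}x the parameters")
```

Under this assumed relationship, each additional fixed increment of score multiplies the required model size, so doubling the desired gain squares the scale-up factor rather than doubling it.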
Are Transformers Learning Natural Language Grammar?
A different way to look at what is happening here is to ask whether GPT3 and other transformer networks are actually learning the grammar of English and other natural languages.
Transformers clearly ARE a full grammar learning architecture, in some sense -- their predictions display a quite nuanced understanding of almost all aspects of syntax.
There is, however, no specific place in these networks where the rules of grammar lie. Rather, they are learning the grammar of the language underlying their training corpus, but mixed up in a weird and non-human-like way with so many particulars of the corpus. And this in itself is not a bad thing: holistic, distributed representations are how large parts of the human brain-mind work, and have various advantages in terms of memory retrieval and learning.
Humans also learn the grammar of their natural languages mixed up with the particulars of the linguistic constructs they've encountered. But the "subtle" point here is that the mixing-up of abstract grammatical patterns with concrete usage patterns in human minds is of a different nature than the mixing-up of abstract grammatical patterns with concrete usage patterns in GPT3 and other transformer networks. The human form of mixing-up is more amenable to appropriate generalization.
In our paper at the AGI-20 conference, Andres Suarez and I gave some prototype results from our work using BERT (an earlier transformer neural net model for predicting language) to guide a symbolic grammar rule learner. These simple results also don't get us to AGI, but I believe they embody some key aspects that aren't there in GPT3 or similar networks -- the explicit manipulation of abstractions, coupled appropriately with a scalable probabilistic model of large volumes of concrete data. In our prototype hybrid architecture there is a cognitively sensible grounding and inheritance relationship between abstract linguistic patterns and concrete linguistic patterns. This sort of grounding is what's there in the way human minds mix up abstract grammatical patterns with low-level experience-specific linguistic patterns, and it's a substantial part of what's missing in GPT3.
Toward AGI via Scale or Innovation (or Both?)
Taking a step back and reflecting on the strengths and weaknesses of the GPT3 approach, one has to wonder why this is such an interesting region of AI space to be throwing so many resources into.
To put it a little differently: Out of all the possible approaches to building better and smarter AI systems, why do we as a society want to be putting so much emphasis on approaches that … can only be pursued with full force by a handful of huge tech companies? Why do we want the brainpower of the global AI R&D community to get turned toward AI approaches that require exponential increases in compute power to yield linear improvements? Could this be somehow to the differential economic advantage of those who own the biggest server farms and have the largest concentration of engineers capable of customizing AI systems for them?
Given all the ridiculous wastes of resources in modern society, it’s hard to get too outraged at the funds spent on GPT3, which is for all its egregious weaknesses an amazingly cool achievement. However, if one focuses on the fairly limited pool of resources currently being spent on advanced AI systems without direct commercial application, one wonders whether we’d be better off to focus more of this pool on fundamental innovations in representation, architecture, learning, creativity, empathy and human-computer interaction, rather than on scaling up transformers bigger and bigger.
OpenAI has generally been associated with the view that fundamental advances toward AGI can be made by taking existing algorithms and scaling them up on bigger and bigger hardware and more and more data. I don’t think GPT3 supports this perspective; rather the opposite. Possibly GPT3 can be an interesting resource for an AGI system to use in accelerating its learning, but the direct implications of GPT3 regarding AGI are mostly negative in valence. GPT3 reinforces the obvious lesson that just adding a massive number of parameters to a system with no fundamental capability for understanding … will yield a system that can do some additional cool tricks, but still has no fundamental capability for understanding.
It's easy to see where the OpenAI founders would get the idea that scale is the ultimate key to AI. In recent years we have seen a variety of neural net algorithms that have been around for decades suddenly accomplish amazing things, mostly just by being run on more and faster processors with more RAM. But for every given class of algorithms, increasing scale reaches a point of diminishing returns. GPT3 may well not yet represent the point of diminishing returns for GPT type architectures, in terms of performance on some linguistics tasks. But I believe it is well past the point of diminishing returns in terms of squeezing bits and pieces of fundamental understanding out of transformer neural nets.
The viable paths to robust AGI and profoundly beneficial AI systems lie in wholly different directions than systems like GPT3 that use tremendous compute power to compensate for their inability to learn appropriate abstractions and ground them in concrete examples. AGI will require systems capable of robust symbol grounding, of understanding what the program code they generate does in specific cases, of doing mathematical computations far beyond the examples they have seen, and of treating words with rich non-linguistic referents differently from nonsense coinages.
These systems may end up requiring massive compute resources as well in order to achieve powerful AGI, but they will use these resources very differently from GPT3 and its ilk. And the creativity needed to evolve such systems may well emerge from research involving a decentralized R&D community working on a variety of more compact AI systems, rather than pushing as fast as possible toward the most aggressive possible use of big money and big compute.