Wednesday, June 15, 2011

Why is evaluating partial progress toward human-level AGI so hard?

This post co-authored by Ben Goertzel and Jared Wigmore

Here we sketch a possible explanation for the well-known difficulty of measuring intermediate progress toward human-level AGI, via extending the notion of cognitive synergy to a more refined notion of “tricky cognitive synergy.”

The Puzzle: Why Is It So Hard to Measure Partial Progress Toward Human-Level AGI?

A recurrent difficulty in the AGI field is the difficulty of creating a good test for intermediate progress toward the goal of human-level AGI.

It’s not entirely straightforward to create tests to measure the final achievement of human-level AGI, but there are some fairly obvious candidates here. There’s the Turing Test (fooling judges into believing you’re human, in a text chat), the video Turing Test, the Robot College Student test (passing university, via being judged exactly the same way a human student would), etc. There’s certainly no agreement on which is the most meaningful such goal to strive for, but there’s broad agreement that a number of goals of this nature basically make sense.

On the other hand, how does one measure whether one is, say, 50 percent of the way to human-level AGI? Or, say, 75 or 25 percent?

It’s possible to pose many “practical tests” of incremental progress toward human-level AGI, with the property that IF a proto-AGI system passes the test using a certain sort of architecture and/or dynamics, then this implies a certain amount of progress toward human-level AGI based on particular theoretical assumptions about AGI. However, in each case of such a practical test, it seems intuitively likely to a significant percentage of AGI researchers that there is some way to “game” the test via designing a system specifically oriented toward passing that test, and which doesn’t constitute dramatic progress toward AGI.

Some examples of practical tests of this nature would be:

  • The Wozniak “coffee test”: go into an average American house and figure out how to make coffee, including identifying the coffee machine, figuring out what the buttons do, finding the coffee in the cabinet, etc.
  • Story understanding – reading a story, or watching it on video, and then answering questions about what happened (including questions at various levels of abstraction)
  • Passing the elementary school reading curriculum (which involves reading and answering questions about some picture books as well as purely textual ones)
  • Learning to play an arbitrary video game based on experience only, or based on experience plus reading instructions

One interesting point about tests like this is that each of them seems to some AGI researchers to encapsulate the crux of the AGI problem, and be unsolvable by any system not far along the path to human-level AGI – yet seems to other AGI researchers, with different conceptual perspectives, to be something probably game-able by narrow-AI methods. And of course, given the current state of science, there’s no way to tell which of these practical tests really can be solved via a narrow-AI approach, except by having a lot of people try really hard over a long period of time.

A question raised by these observations is whether there is some fundamental reason why it’s hard to make an objective, theory-independent measure of intermediate progress toward advanced AGI. Is it just that we haven’t been smart enough to figure out the right test – or is there some conceptual reason why the very notion of such a test is problematic?

We don’t claim to know for sure – but in this brief note we’ll outline one possible reason why the latter might be the case.

Is General Intelligence Tricky?

The crux of our proposed explanation has to do with the sensitive dependence of the behavior of many complex systems on the particulars of their construction. Oftentimes, changing a seemingly small aspect of a system’s underlying structures or dynamics can dramatically affect the resulting high-level behaviors. Lacking a recognized technical term to use here, we will refer to any high-level emergent system property whose existence depends sensitively on the particulars of the underlying system as tricky. Formulating the notion of trickiness in a mathematically precise way is a worthwhile pursuit, but this is a qualitative essay so we won’t go in that direction here.

Thus, the crux of our explanation of the difficulty of creating good tests for incremental progress toward AGI is the hypothesis that general intelligence, under limited computational resources, is tricky.

Now, there are many reasons that general intelligence might be tricky in the sense we’ve defined here, and we won’t try to cover all of them here. Rather, we’ll focus on one particular phenomenon that we feel contributes a significant degree of trickiness to general intelligence.

Is Cognitive Synergy Tricky?

One of the trickier aspects of general intelligence under limited resources, we suggest, is the phenomenon of cognitive synergy.

The cognitive synergy hypothesis, in its simplest form, states that human-level AGI intrinsically depends on the synergetic interaction of multiple components (for instance, as in the OpenCog design, multiple memory systems each supplied with its own learning process). In this hypothesis, for instance, it might be that there are 10 critical components required for a human-level AGI system. Having all 10 of them in place results in human-level AGI, but having only 8 of them in place results in having a dramatically impaired system – and maybe having only 6 or 7 of them in place results in a system that can hardly do anything at all.

Of course, the reality is almost surely not as strict as the simplified example in the above paragraph suggests. No AGI theorist has really posited a list of 10 crisply-defined subsystems and claimed them necessary and sufficient for AGI. We suspect there are many different routes to AGI, involving integration of different sorts of subsystems. However, if the cognitive synergy hypothesis is correct, then human-level AGI behaves roughly as the simplistic example in the prior paragraph suggests. Perhaps instead of using the 10 components, you could achieve human-level AGI with 7 components, but having only 5 of these 7 would yield drastically impaired functionality – etc. Or the same phenomenon could be articulated in the context of systems without any distinguishable component parts, but only continuously varying underlying quantities. To mathematically formalize the cognitive synergy hypothesis in a general way becomes complex, but here we’re only aiming for a qualitative argument. So for illustrative purposes, we’ll stick with the “10 components” example, just for communicative simplicity.
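
To make the shape of the "10 components" example concrete, here is a minimal toy sketch in Python. It is purely illustrative: the function name, the power-law form, and the exponent are assumptions introduced for this example, not part of the cognitive synergy hypothesis itself. The only point is that a steeply nonlinear dependence on the number of components in place reproduces the qualitative pattern described above, where 8 of 10 components gives a badly impaired system and 6 of 10 gives almost nothing.

```python
def synergy_capability(present: int, required: int = 10, sharpness: float = 6.0) -> float:
    """Toy capability score in [0, 1] for a system with `present` of `required` components.

    Assumption (hypothetical, for illustration only): capability grows as
    (fraction of components present) ** sharpness, so that missing even a
    few components costs far more than a proportional share of capability.
    """
    if not 0 <= present <= required:
        raise ValueError("present must be between 0 and required")
    return (present / required) ** sharpness


# Print the toy capability score for a few stages of completeness.
for k in (6, 7, 8, 10):
    print(k, round(synergy_capability(k), 3))
```

With these made-up settings, a system that is 80 percent complete by component count scores well under half of full capability, which is the sense in which counting components is a poor proxy for progress.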

Next, let’s suppose that for any given task, there are ways to achieve this task using a system that is much simpler than any subset of size 6 drawn from the set of 10 components needed for human-level AGI, but that works much better for the task than this subset of 6 components (assuming the latter are used as a set of only 6 components, without the other 4 components).

Note that this supposition is a good bit stronger than mere cognitive synergy. For lack of a better name, we’ll call it tricky cognitive synergy. The tricky cognitive synergy hypothesis would be true if, for example, the following possibilities were true:

  • creating components to serve as parts of a synergetic AGI is harder than creating components intended to serve as parts of simpler AI systems without synergetic dynamics
  • components capable of serving as parts of a synergetic AGI are necessarily more complicated than components intended to serve as parts of simpler AGI systems.

These certainly seem reasonable possibilities, since to serve as a component of a synergetic AGI system, a component must have the internal flexibility to usefully handle interactions with a lot of other components as well as to solve the problems that come its way. In terms of our concrete work on the OpenCog integrative proto-AGI system, these possibilities ring true, in the sense that tailoring an AI process for tight integration with other AI processes within OpenCog tends to require more work than preparing a conceptually similar AI process for use on its own or in a more task-specific narrow AI system.

It seems fairly obvious that, if tricky cognitive synergy really holds up as a property of human-level general intelligence, the difficulty of formulating tests for intermediate progress toward human-level AGI follows as a consequence. This is because, according to the tricky cognitive synergy hypothesis, any test is going to be more easily solved by some simpler narrow AI process than by a partially complete human-level AGI system.
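
The consequence for testing can be illustrated with a small toy sketch in Python. Everything in it is an assumption made for the sake of illustration: the power-law score for a partially complete AGI, the fixed score assigned to the narrow system, and both function names are hypothetical. Under those assumptions, a test that looks only at scores ranks the narrow system ahead of the partial AGI at nearly every intermediate stage.

```python
def partial_agi_score(present: int, required: int = 10, sharpness: float = 6.0) -> float:
    """Hypothetical test score of an AGI with `present` of `required` components.

    Assumed form: (fraction present) ** sharpness, chosen only to mimic
    strong synergy among components.
    """
    return (present / required) ** sharpness


def stages_where_agi_wins(narrow_score: float, required: int = 10) -> list[int]:
    """Component counts at which the partial AGI outscores the narrow system."""
    return [
        k for k in range(required + 1)
        if partial_agi_score(k, required) > narrow_score
    ]


# A narrow system assumed to score 0.6 on the test beats the partial AGI
# at every stage short of completeness.
print(stages_where_agi_wins(narrow_score=0.6))
```

In this toy setting the partial AGI only overtakes the narrow system once all 10 components are in place, so the test score says little about how far along the AGI actually is.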


We haven’t proved anything here, only made some qualitative arguments. However, these arguments do seem to give a plausible explanation for the empirical observation that positing tests for intermediate progress toward human-level AGI is a very difficult prospect. If the theoretical notions sketched here are correct, then this difficulty is not due to incompetence or lack of imagination on the part of the AGI community, nor due to the primitive state of the AGI field, but is rather intrinsic to the subject matter. And if these notions are correct, then quite likely the future rigorous science of AGI will contain formal theorems echoing and improving the qualitative observations and conjectures we’ve made here.

If the ideas sketched here are true, then the practical consequence for AGI development is, very simply, that one shouldn’t worry all that much about producing compelling intermediary results. Just as 2/3 of a human brain may not be much use, similarly, 2/3 of an AGI system may not be much use. Lack of impressive intermediary results may not imply one is on a wrong development path; and comparison with narrow AI systems on specific tasks may be badly misleading as a gauge of incremental progress toward human-level AGI.

Hopefully it’s clear that the motivation behind the line of thinking presented here is a desire to understand the nature of general intelligence and its pursuit – not a desire to avoid testing our AGI software! Truly, as AGI engineers, we would love to have a sensible rigorous way to test our intermediary progress toward AGI, so as to be able to pose convincing arguments to skeptics, funding sources, potential collaborators and so forth -- as well as just for our own edification. We really, really like producing exciting intermediary results, on projects where that makes sense. Such results, when they come, are extremely informative and inspiring to the researchers as well as the rest of the world! Our motivation here is not a desire to avoid having the intermediate progress of our efforts measured, but rather a desire to explain the frustrating (but by now rather well-established) difficulty of creating such intermediate goals for human-level AGI in a meaningful way.

If we or someone else figures out a compelling way to measure partial progress toward AGI, we will celebrate the occasion. But it seems worth seriously considering the possibility that the difficulty in finding such a measure reflects fundamental properties of the subject matter – such as the trickiness of cognitive synergy and other aspects of general intelligence.


Josh Jordan said...

Why isn't the Hutter Prize a good example of a non-game-able way to measure progress towards the goal of human-level AGI?

Tom Michael said...

Hi Ben,

As someone who works in neuropsychological assessment, I found this post particularly interesting.

I very much like the cognitive synergy idea, as it's something I'm very familiar with from working with brain injured people. A particularly rare symptom following brain injury, such as confabulation, might be caused by damage to a very small and particular area of the brain. In contrast, a much more common deficit, such as a working memory deficit, can be caused by damage to various different brain areas, all of which need to work together in synergy in order for the person's cognition (or general intelligence) to be intact.

This is very much the case in terms of executive function, self control and decision making, which can each be fractionated into many dissociable parts via the lesion method (observing deficits following focal brain injury) or via association (observing activity via the imaging methods).

However, as neuropsychologists, we have to make neuropsychological assessments of people's cognitive ability, and this is where the similarity to your test dilemma comes in. If someone has a deficit in the ability to plan and organise their lives, we can measure this with various psychometric tests, with a high degree of reliability. However, for hierarchically complex tasks such as planning, the synergy of various cognitive processes is required, meaning that although we know that a brain injured person might have problems with planning post brain injury, it's very hard for us to tell which of the cognitive processes have been damaged, at least in a particular individual.

It is possible for us to form theories as to what aspects of cognition different brain areas are critical to (lesion method) or associated with (imaging method). However, we need large numbers of participants in our studies in order to be able to do this. In the cases of hierarchically complex processes such as planning, formed of other cognitive processes working in synergy, this is still difficult though.

What neuropsychology needs is the development of cognitively simpler tasks, such that we can determine with greater precision which areas of the brain are critical to particular aspects of cognition. The problem with such simple tasks in your field however, as you point out, is that cognitively simpler tasks are very likely more easily solvable by narrow AI systems.

I'm not sure if I can suggest anything to help overcome this dilemma, but hope that it's reassuring that researchers in other areas of intelligence have similar concerns and difficulties. Perhaps if you are able to create a toddler level AGI, the best person to assess its level of intelligence will be a neuropsychologist :)

Ben Goertzel said...

Tom, thanks for your interesting comments -- I look forward to getting to the stage where a neuropsychologist will be helpful in assessing OpenCog! One advantage of working with AGI systems as compared to humans is that it's easily possible to probe inside their "brains" and monitor what's going on and make minor adjustments and see their effects. In that sense a powerful AGI would be a neuropsychologist's dream, I guess!

Ben Goertzel said...

Josh -- about the Hutter prize for text compression. It seems to me that the best way to get minor advances on the state of the art in text compression would be to tweak current text compression methods by minorly improving their probabilistic modeling capability, in ways that IMO would not push interestingly toward advanced AGI. On the other hand, I agree that a human-like mind somehow linked into a probabilistic text compressor would probably be able to do really well on the Hutter Prize. What I don't see is how working on text compression as the Hutter Prize suggests is a good way to evaluate systems that occupy the intermediate space between "minor improvements of current text compressor" and "human-like AGI mind hybridized with text compressor." It seems to me that moderate progress on the Hutter text compression prize can be "gamed", whereas dramatic progress is really really hard, and intermediate progress may not be feasible. In this way the Hutter Prize may be vaguely similar to the Loebner Prize -- on the Loebner Prize as well, moderate progress is gotten by "gaming" the test, and dramatic progress is really really hard, and intermediate progress may not be feasible.

Unfortunately we don't have a rigorous analysis of any of these proposed tests (Hutter Prize, Loebner Prize, whatever), so all this discussion is pretty qualitative and hand-wavy. So if your intuition is different than mine on this, I'm not sure how I could convince you. After a point, we just have to go with our own understandings to motivate our work...

Richard Loosemore said...

As you might imagine -- given my previous writings in which I argue that we are facing a complex systems problem in this field -- I am very much in sympathy with what you say here.

A couple of observations. The notion that you have labeled "trickiness" corresponds to what I called a "global-local-disconnect" ... both of these ideas are about how the overall behavior of a system is sensitive to the underlying mechanisms. ("Tricky" is, I agree, much more concise!). One difference between our two perspectives is that where you see some trickiness at the module level, I would see it also within modules. If that were valid, it would make the business of constructing an AGI that much, well, ... trickier.

A second observation is that all of this could be couched in terms of the "smoothness" of the mapping from mechanism-space to behavior-space. If a point in each of these two spaces corresponds to a particular choice of all the mechanisms (/behaviors), then what we want is for a small delta in the behavior space (going from a poorly functioning system to a better functioning system) to correspond to a small delta in the mechanism space. So your notion of trickiness and my previous arguments all point to a pathological non-smoothness in that mapping, with a small behavior change requiring a massive and perhaps unfindable change in mechanism. That does not add much to the discussion, I agree, but I find it useful as a way to think about these issues.

Last comment is an observation about the practical aspects of measuring progress. Although I agree that these theoretical reasons would make it hard, I think it might be possible to reduce some of the difficulty by insisting on a developmental history for each AGI system, as part of the measurement of its performance level. So, for example, if someone claimed to have a system that could pass the Wozniak test, I would only be interested if that same system could show a history of learning to acquire skills up to and including that coffee-making skill.

So my test would be: it's not what you do, it's the way that you (learn to) do it.

There is little I could do to make that test more rigorous: it is really a gut feeling that we could improve matters that way.

(And, more generally, I am less confident that these ideas can be made much more mathematically rigorous).

Ben Goertzel said...


Yeah, I actually agree that the trickiness extends deeper than the "interaction btw high level components" level. Even in OpenCog there is trickiness inside most of the modules.... Perhaps where we differ is that I think that via careful and creative analysis, we can figure out designs that will work, in spite of the unavoidable trickiness.... Of course I also think we will then have to empirically (manually and auto) tune the parameters of these designs. From what I understand, you're less bullish on the "figuring out a design" part and more bullish on the creation of tools for manually and auto tuning very flexible designs....

About testing AGIs via studying their developmental trajectories. It's a sensible approach yet also susceptible to gaming. For instance, someone could program in the desired final functionality, and then program their system to gradually reveal this functionality, step by step, as their system matures.... Then they could argue that this is a good qualitative model of developmental neuropsychology, because the brain develops new functionalities as it develops ;p

-- Ben G

Ben Goertzel said...

Some extracts from AGI email list discussion following this post...

Why is evaluating partial progress toward human-level AGI so hard?

Russell Wallace:
Because of the G. AGI, by definition, is the attempt to build something that can perform well in _general_, that is, at many different tasks. A single test is therefore of little use, because even if the candidate system aces the test, you don't know how many other things it can or can't do. If you want to measure progress, the only even somewhat reliable way would be to use a compound test composed of many different tasks.

Tim Tyler:
So: use lots of tests, and generate them using a machine.

That's what Shane Legg reported doing here:

Then there are various ways to weight the different tests (i.e.
different distributions from which to generate the tests), and
different AGIs will come out superior depending on how you do the

Tim Tyler:
We have figured out how to test intelligence. That is pretty-much on the "solved problems" pile.

Absolutely not.

We have figured out how to test certain forms of **human**
intelligence pretty well. But these tests don't make much sense for intelligent systems with significantly nonhuman cognitive

As an obvious illustration of this, recently raised on this list, in humans the ability to do verbal analogy problems is correlated with lots of useful capabilities; but in the domain of AI problems, this isn't so... the narrow AI programs that solve these analogy problems reasonably well do NOT have any useful capabilities...

IQ test scores correlate with various useful abilities **in humans**, not (necessarily, nor in current practice) in AIs with nonhumanlike architectures...

Richard Loosemore said...


As far as manually and auto-tuning the parameters to improve performance, I think we are probably on the same page about how important that is (I think over time we would probably both converge on development methodologies in which there is a lot of that kind of work).

I think the more important difference between our perspectives is that I believe in the existence of local "extreme pathologies" in that mapping I mentioned earlier, which might cause your parameter tuning searches to occasionally get stuck in a dead end. In other words, if you were at Point B(x) in the behavior space [where x is the corresponding point in mechanism space and B is the mapping function from design space to behavior space] and if the system was just not working well, so you want to get to Point B(y) which is close but better, it might just be the case that moving from x to y in the design space is fiendishly hard. Bad smoothness in the function in that region. Basically, of course, y is a very long way away from x, and so all of a sudden the auto- and manual-tuning of parameters starts to thrash, your programmers can't seem to make any progress, nobody has any idea why, etc. etc. In colloquial terms, you would be stuck in a "You Can't Get There From Here" trap.

So, my position is that I believe (for various theoretical reasons that are close to your own argument in this essay) that these traps will be common because of the nature of this game. And my solution to the problem is to say that Nature used a trial and error method to find a region of the mapping function that is relatively smooth (namely in the vicinity of the human brain design), and that we can piggyback on her efforts if we adopt a policy of staying as close as possible to the human design. Hence, I start from an overwhelming amount of cognitive psychology (plus a bunch of strategies designed to make "mechanism-sense" of that stuff), in the hope that this puts my starting point in the design space in a relatively smooth area for that mapping. That way (so the plan goes) I can hope that none of my manual- and auto-tuning of parameters will end up exploding.

I should probably write this up. It sounds (to me) like a way to explain the CSP that might make more sense to people.

Ben Goertzel said...

Richard -- I see ... so we agree that general intelligence and cognitive synergy and so forth are tricky -- but you think they're even trickier than I do !!

Yes, I think in these comments you're moving toward a clearer explanation of your "complex systems problem" idea than I've seen you produce before. Writing up a paper in this vein would probably be a good idea ;-)

.... ben

Anonymous said...

Hi Ben,

you are going for artificial GENERAL intelligence, not for artificial HUMAN intelligence. From this angle the mentioned tests look too anthropocentric. If the goal is AGI, why limit it (or the means to test its level of intelligence) to human-specific tasks?

Assuming that human intelligence ≠ general intelligence, which it should be, as nobody is going for a 1:1 replica of the human brain, those tests would not tell whether an AGI is "generally" intelligent.
Would they not only tell whether an AGI works in the narrow scope of human function and therefore reduce the AGI (from human perspective) to a narrow AI (assuming human intelligence itself is a form of narrow intelligence compared to an AGI) because its intelligence beyond the tests never gets measured and therefore might even remain unseen and ununderstood?

My apologies if I tread on a beaten path with my questions or maybe misunderstand the definition of AGI. I am interested in the field but not a professional.


Ben Goertzel said...


Truly, completely general AGI is possible only for systems with infinite computational resources (i.e. it's not physically possible according to currently understood physics).

So, every AGI is more intelligent at some things than others. Humans are one particular kind of "partial AGI" ... and surely not the smartest kind. But if we want to build AGI systems that we can comprehend and work with, and that will understand us, it seems the place to start is with roughly human-like AGI, augmented with whatever extra capabilities are easy to stick in there... and then see what naturally seems to follow next from there...

-- Ben

Jimmy said...

It seems like, when I observe the levels of intelligence in nature there are large gaps. For example between primates and humans there seems to be a large gap in intelligence.

Anonymous said...

I believe the answer to the question of whether progress in AGI can be measured depends on the approach being taken to build the AGI.

If you, for example, take Ben's OpenCog, which relies on the principle (or rather promise) of Cognitive Synergy, then I agree that it might be harder to measure the progress.
But this is not necessarily the case for other AGI approaches.

In general I think that progress towards AGI is measurable. In my opinion you can measure the progress by applying the AI to solve narrow problems.
The difference to narrow AI programming is that you don't program the AI directly to solve a particular narrow problem. Instead you work on the basic solving strategies of your AGI which allow it to solve a variety of narrow problems on its own without you directly working on that narrow problem.

Ben Goertzel said...


Definitely, if you're working within a certain AGI approach then it's quite possible to measure progress, in a way that will make sense conditional on your confidence in the basic approach you're taking.

What's hard is to find an incremental success measure that operates PURELY on the level of the AGI system's accomplishments, without requiring any knowledge or understanding of the system's internals, or any belief in the sensibleness or promise of the underlying principles, etc.

If you're just averaging together performance on a number of narrow tasks, then your AGI won't outperform a system comprised of a grab-bag of specialized programs, not until it's rather advanced.

If you're trying to measure learning on tasks, then you still have to compete with systems that were pre-programmed for powerful functionality on these tasks, and were also pre-programmed to simulate development via gradually using more and more of their pre-programmed capability over time.

If you're trying to measure learning how to do never-before-encountered tasks, you have to use a quite wide variety of tasks and make sure that none of the competitors knew the collection of tasks in advance, so that nobody could preprogram their systems accordingly. This does have promise, but it's complicated, right?

The test then involves giving a set of AGI systems a collection of different tasks, qualitatively different from each other and not announced in advance, and then watching the systems learn to perform them.

If done right, this kind of test would perhaps avoid gaming via narrow-AI systems or grab-bags thereof!

And for this kind of test, I predict that most AGI programs will perform very poorly, until they reach a certain critical level of internal completeness (corresponding e.g. to effective cognitive synergy), at which point there will be a large jump in their capabilities, followed by steady progress.

This critical level is what I think of as the "AGI Sputnik Moment" ...

-- Ben G

Anonymous said...

Yes, completely agree.
Narrow AI programs will of course perform better than an AGI "working" on the same narrow problem. But that doesn't matter; it's not the point.

The only thing that matters is to "gear up" the AGI for dealing with general problems. That means in the beginning it is more important that the AGI finds "some" solution to a problem rather than a "perfect" one.

"And for this kind of test, I predict that most AGI programs will perform very poorly, until they reach a certain critical level of internal completeness (corresponding e.g. to effective cognitive synergy), at which point there will be a large jump in their capabilities, followed by steady progress."

It is certainly correct that only a fairly advanced AGI would be able to pass such a test. But aside from that I think that there is no such thing as a "completely new" problem. I believe that every problem can be categorized in a (very) abstract way and thus the AI can solve it by techniques it has either learned or by being programmed to solve it.

In that case you can measure the progress of your AGI by applying it to narrow tasks. Although I admit that it is hard to determine an absolute progress value since you are not really able to see the "finish". But at least there will be a "felt progress". The more narrow tasks the AGI can solve the closer it gets to the ultimate goal.

etienne said...

turing love test : fooling the judge you're straight by making the judgette in love !

DB said...

You have at least five fake posters on here who are most likely spammers.

Anyways, nice post.
Perhaps, from a layman's perspective (although I have a bachelors in Compsci I'm not an AI expert), we're asking too much of it. Even human IQ tests are modular.

I think that perhaps passing the test by building a whole bunch of narrow AI facilities and hooking them up into a single connected system is not the same thing as those narrow AI capabilities having generalized utility in other areas.

Nevertheless, a "general" AI who is really a bunch of different modules with a single narrow AI asking the question "what type of problem is this" and passing the inputs to the appropriate narrow AI module would still be hella useful.

The Watson from Jeopardy is still a very very useful tool to have at your disposal.
