
Friday, March 26, 2010

The GOLEM Eats the Chinese Parent (Toward An AGI Meta-Architecture Enabling Both Goal Preservation and Radical Self-Improvement)

I thought more about the ideas in my previous blog post on the "Chinese Parent Theorem," and while I didn't do a formal proof yet, I did write up the ideas a lot more carefully

GOLEM: Toward An AGI Meta-Architecture Enabling Both Goal Preservation and Radical Self-Improvement

and IMHO they make even more sense now....

Also, I changed the silly name "Chinese Parent Meta-Architecture" to the sillier name "GOLEM" which stands for "Goal-Oriented LEarning Meta-architecture"

The GOLEM ate the Chinese Parent!

I don't fancy that GOLEM, in its present form, constitutes a final solution to the problem of "making goal preservation and radical self-improvement compatible" -- but I'm hoping it points in an interesting and useful direction.

(I still have some proofs about GOLEM sketched in the margins of a Henry James story collection, but the theorems are pretty weak and I'm not sure when I'll have time to type them in. If they were stronger theorems I would be more inspired to type them up. Most of the work in typing them in would be in setting up the notations ;p ....)

But Would It Be Creative?

In a post on the Singularity email list, Mike Tintner made the following complaint about GOLEM:

Why on earth would you want a "steadfast" AGI? That's a contradiction of AGI.

If your system doesn't have the capacity/potential to revolutionise its goals - to have a major conversion, for example, from religiousness to atheism, totalitarianism to free market liberalism, extreme self-interest and acquisitiveness to extreme altruism, rational thinking to mystical thinking, and so on (as clearly happens with humans), gluttony to anorexia - then you don't have an AGI, just another dressed-up narrow AI.

The point of these examples is obviously not that an AGI need be an intellectual, but rather that it must have the capacity to drastically change

  1. the priorities of its drives/goals,
  2. the forms of its goals

and even in some cases:

  3. eliminate certain drives (presumably secondary ones) altogether

My answer was as follows:

I believe one can have an AGI that is much MORE creative and flexible in its thinking than humans, yet also remains steadfast in its top-level goals...

As an example, imagine a human whose top-level goal in life was to do what the alien god on the mountain wanted. He could be amazingly creative in doing what the god wanted -- especially if the god gave him

  • broad subgoals like "do new science", "invent new things", "help cure suffering", "make artworks", etc.
  • real-time feedback about how well his actions were fulfilling the goals, according to the god's interpretation
  • advice on which hypothetical actions seemed most likely to fulfill the goals, according to the god's interpretation

But his creativity would be in service of the top-level goal of serving the god...

This is like the GOLEM architecture, where

  • the god is the GoalEvaluator
  • the human is the rest of the GOLEM architecture

I fail to see why this restricts the system from having incredible, potentially far superhuman creativity in working on the goals assigned by the god...

Part of my idea is that the GoalEvaluator can be a narrow AI, thus avoiding an infinite regress where we need an AGI to evaluate the goal-achievement of another AGI...

Can the Goal Evaluator Really Be a Narrow AI?

A dialogue with Abram Demski on the Singularity email list led to some changes to the original GOLEM paper.

The original version of GOLEM stated that the GoalEvaluator would be a Narrow AI, and failed to make the GoalEvaluator rely on the Searcher to do its business...

Abram's original question, about this original version, was "Can the Goal Evaluator Really Be a Narrow AI?"

My answer was:

The terms narrow-AI and AGI are not terribly precise...

The GoalEvaluator needs to basically be a giant simulation engine, that tells you: if program P is run, then the probability of state W ensuing is p. Doing this effectively could involve some advanced technologies like probabilistic inference, along with simulation technology. But it doesn't require an autonomous, human-like motivational system. It doesn't require a system that chooses its own actions based on its goals, etc.
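To make this concrete, here is a minimal Python sketch of a narrow-AI GoalEvaluator along these lines. The `simulate` world-model is an assumed stand-in for whatever simulation and probabilistic-inference technology would actually be used; the point is that the evaluator merely estimates probabilities and has no motivational system of its own:

```python
import random

def goal_evaluator(program, target_state, simulate, n_trials=1000):
    """Estimate p = P(world state `target_state` ensues | `program` is run).

    `simulate` is an assumed stochastic world-model: given a program, it
    returns one sampled resulting world state.  No autonomous motivational
    system is involved -- this is pure estimation, i.e. "narrow AI".
    """
    hits = sum(1 for _ in range(n_trials)
               if simulate(program) == target_state)
    return hits / n_trials

# Toy usage: a "world" that is just a biased coin flipped by the program.
def toy_simulate(program):
    return "W" if random.random() < program["bias"] else "not-W"

p = goal_evaluator({"bias": 0.7}, "W", toy_simulate, n_trials=10000)
```

In a real system the Monte Carlo loop would of course be replaced by something far more sophisticated, but the interface -- program and world-state in, probability out -- is the essential point.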

The question arises, though: how do the GoalEvaluator's algorithms get improved? This is where the potential regress occurs. One can have AGI_2 improving the algorithms inside AGI_1's GoalEvaluator. The regress can continue, until eventually one reaches AGI_n whose GoalEvaluator is relatively simple and AGI-free...


After some more discussion, Abram made some more suggestions, which led me to generalize and rephrase his suggestions as follows:

If I understand correctly, what you want to do is use the Searcher to learn programs that predict the behavior of the GoalEvaluator, right? So, there is a "base goal evaluator" that uses sensory data and internal simulations, but then you learn programs that do approximately the same thing as this but much faster (and maybe using less memory)? And since this program learning has the specific goal of learning efficient approximations to what the GoalEvaluator does, it's not susceptible to wire-heading (unless the whole architecture gets broken)...
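A toy sketch of this surrogate-learning idea, with the "learning" reduced to a trivial lookup scheme (the function names and mechanism here are illustrative stand-ins, not anything from the GOLEM paper). The key point is the final check: a learned fast evaluator is accepted only if it reproduces the slow base GoalEvaluator's verdicts on held-out data, which is what blocks wire-heading:

```python
def learn_fast_evaluator(base_eval, sample_worlds, holdout_worlds, tol=1e-6):
    """Learn (trivially, here) a cheap program approximating base_eval.

    The learned program is only *accepted* if it reproduces the slow base
    GoalEvaluator's verdicts on held-out worlds -- the approximation is
    anchored to the base evaluator, so the learner can't "wirehead" by
    inventing an evaluator that flatters its own plans.
    """
    # "Learning": cache the base evaluator on sampled worlds and fall back
    # to the nearest sampled world otherwise (a stand-in for whatever
    # program the Searcher would actually find).
    table = {w: base_eval(w) for w in sample_worlds}

    def fast_eval(world):
        nearest = min(table, key=lambda w: abs(w - world))
        return table[nearest]

    # Anti-wirehead check: verdicts must match the base evaluator on
    # data the learner never fitted.
    for w in holdout_worlds:
        if abs(fast_eval(w) - base_eval(w)) > tol:
            raise ValueError("surrogate rejected: disagrees with base evaluator")
    return fast_eval

# Toy usage: the "slow" evaluator scores worlds by closeness to 10.
slow = lambda w: -abs(w - 10)
fast = learn_fast_evaluator(slow, sample_worlds=range(0, 21),
                            holdout_worlds=[3, 7, 15])
```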

After the dialogue, I incorporated this suggestion into the GOLEM architecture (and the document linked from this blog post).

Thanks Abram!!

Wednesday, March 17, 2010

"Chinese Parent Theorem"?: Toward a Meta-Architecture for Provably Steadfast AGI

Continuing my series of (hopefully edu-taining ;) blog posts presenting speculations on goal systems for superhuman AGI systems, this one deals with the question of how to create an AGI system that will maintain its initial goal system even as it revises and improves itself -- and becomes so much smarter that in many ways it becomes incomprehensible to its creators or its initial conditions.

This is closely related to the problem Eliezer Yudkowsky has described as "provably Friendly AI." However, I would rather not cast the problem that way, because (as Eliezer of course realizes) there is an aspect of the problem that isn't really about "Friendliness" or any other particular goal system content, but is "merely" about the general process of goal-content preservation under progressive self-modification.

Informally, I define an intelligent system as steadfast if it continues to pursue the same goals over a long period of time. In this terminology, one way to confront the problem of creating predictably beneficial AGI is to solve the two problems of:

  1. Figuring out how to encapsulate the goal of beneficialness in an AGI's goal system
  2. Figuring out how to create (perhaps provably) steadfast AGI, in a way that applies to the "beneficialness" goal among others

My previous post on Coherent Aggregated Volition (CAV) dealt with the first of these problems. This post deals with the second. My previous post on predictably beneficial AGI deals with both.

The meat of this post is a description of an AGI meta-architecture that I label the Chinese Parent Meta-Architecture -- and that I conjecture could be proved to be steadfast, under some reasonable (though not necessarily realistic, since the universe is a mysterious place!) assumptions about the AGI system's environment.

I don't actually prove any steadfastness result here -- I just sketch a vague conjecture, which if formalized and proved would deserve the noble name "Chinese Parent Theorem."

I got partway through a proof yesterday and it seemed to be going OK, but I've been distracted by more practical matters, and so for now I decided to just post the basic idea here instead...

Proving Friendly AI

Eliezer Yudkowsky has described his goal concerning “proving Friendly AI” informally as follows:

The putative proof in Friendly AI isn't proof of a physically good outcome when you interact with the physical universe.

You're only going to try to write proofs about things that happen inside the highly deterministic environment of a CPU, which means you're only going to write proofs about the AI's cognitive processes.

In particular you'd try to prove something like "this AI will try to maximize this goal function given its beliefs, and it will provably preserve this entire property (including this clause) as it self-modifies".

It seems to me that proving something like this shouldn’t be sooooo hard to achieve if one assumes some basic fixed “meta-architectural” structure on the part of the AI, rather than permitting total unrestricted self-modification. Such a meta-architecture can be assumed without placing any limits on the AI’s algorithmic information content, for example.

Of course, preservation of the meta-architecture can be assumed as part of the AI system's goal function. So by assuming a meta-architecture, one may be able to prove a result restricted to a certain broad class of goal functions ... and the question becomes whether that class is broad enough to be interesting.

So my feeling is that, if one wants to pursue such a research direction, it makes sense to begin by proving theorems restricted to goals embodying some assumptions about fixed program structure -- and then try to improve the theorems by relaxing the assumptions.

A Simple AGI Meta-Architecture with the Appearance of Steadfastness

After writing the first draft of this post, I discussed the "provably steadfast AGI" problem with a clever Chinese friend, and she commented that what the self-modifying AGI needs (in order to maintain its original goal content as it self-modifies) is a traditional Chinese parent, who will watch the system from the outside as it self-modifies, and continually nag it and pester it and remind it of its original goals.

At first I thought this was just funny, but then it occurred to me that it was actually the same idea as my meta-architecture! My GoalEvaluator component (in the meta-architecture below) is basically a ChineseParent component, living separately from the rest of the system and providing ongoing, fixed goal evaluation.

The thought-experiment I've been exploring is: an AGI system S with the following high-level meta-architecture:

  • Goal Evaluator = external system (not part of S) that tells S, for each possible future world (including environment states and internal program states), how well this world fulfills the goal. This is the "Chinese parent."
  • HistoricalRepository = database storing the past history of S’s internal states and actions, as well as information about the environment during S’s past
  • Operating Program = the program that S is governing its actions by, at a given point in time
    • chosen by the Metaprogram as the best program the Searcher has found, where "best" is judged as "highest probability of goal achievement" based on the output of the Predictor and the Goal Evaluator
  • Predictor = program that estimates, given a candidate operating program P and a possible future world W, the odds of P leading to W
  • Searcher = program that searches through program space to find a new program optimizing a provided objective function
  • Tester = hard-wired program that estimates the quality of a candidate Predictor, using a simple backtesting methodology
    • That is, the Tester assesses how well a Predictor would have performed in the past, using the data in the HistoricalRepository
  • Metaprogram = fixed program that uses Searcher program to find a good
    • Searcher program (judged by the quality of the programs it finds, as judged by the Predictor program)
    • Predictor program (as judged by the Tester’s assessments of its predictions)
    • Operating Program (judged by Predictor working with Goal Evaluator, according to the idea of choosing an Operating Program with the maximum expected goal achievement)
    • The metaprogram also determines the amount of resources to allocate to searching for a Searcher versus a Predictor versus an OP, according to a fixed algorithm for parameter adaptation.
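To illustrate the control flow of this component list (and nothing more), here is a runnable toy version of the metaprogram loop in Python. Every component is a deliberately trivial stand-in -- worlds and programs are just numbers, the Searcher is random search, and the Searcher-improving-the-Searcher step is omitted for brevity -- but the division of labor among Searcher, Predictor, Tester, and GoalEvaluator follows the list above:

```python
import random

def golem_metaprogram(goal_eval, history, cycles=20, seed=0):
    """Toy skeleton of the fixed GOLEM metaprogram.  Only the *contents*
    of the learned components change; this loop itself is hard-wired."""
    rng = random.Random(seed)

    # Searcher stand-in: propose candidate "programs" (numbers) at random
    # and keep the best under the provided objective function.
    def searcher(objective, trials=50):
        cands = [rng.uniform(-20, 20) for _ in range(trials)]
        return max(cands, key=objective)

    # Tester: backtest a candidate Predictor against the
    # HistoricalRepository (here: (program, resulting_world) pairs).
    def tester(predict):
        return -sum((predict(p) - w) ** 2 for p, w in history)

    op = 0.0
    for _ in range(cycles):
        # 1. "Learn" a Predictor, judged by the Tester's backtests.
        #    Here a Predictor is just a bias correction fitted by search.
        bias = searcher(lambda b: tester(lambda p, b=b: p + b))
        predictor = lambda p, b=bias: p + b

        # 2. Choose the Operating Program with maximum predicted goal
        #    achievement, judged by Predictor plus GoalEvaluator.
        op = searcher(lambda p: goal_eval(predictor(p)))
    return op

# Toy usage: worlds score best near 5; programs map to worlds offset by +1.
history = [(p, p + 1.0) for p in range(-5, 6)]
best = golem_metaprogram(lambda w: -abs(w - 5.0), history)
```

Here the learned Predictor should discover the +1 offset, so the chosen Operating Program ends up near 4 -- the value whose predicted world is closest to the goal optimum of 5.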

Next, assume that the goal embodied by the GoalEvaluator includes, as a subgoal, the preservation of the overall meta-architecture described above.

Then, it seems to me one can very likely prove that the program will maintain its architecture while seeking to maximize the goal function embodied by the GoalEvaluator, under the assumptions that

  • the Goal Evaluator (i.e. the "Chinese parent") does its job correctly
  • no one messes with the underlying hardware.

If proved, this would be the so-called "Chinese Parent Theorem" ;-) .... I don't have time to work on such a proof right now, but would be happy to collaborate with someone on this!

As noted above, this approach doesn't allow full self-modification; it assumes certain key parts of the AGI (meta)architecture are hard-wired. But the hard-wired parts are quite basic and leave a lot of flexibility. So a "Chinese Parent Theorem" of this nature would cover a fairly broad and interesting class of goal functions, it seems to me.

What happens if one implements the Goal Evaluator according to the same architecture, though? In this case, one must postulate a meta-Goal-Evaluator, whose goal is to specify the goals for the first Goal Evaluator: the Chinese Grandparent! Eventually the series must end, and one must postulate an original ancestor Goal Evaluator that operates according to some other architecture. Maybe it's a human, maybe it's CAV, maybe it's some hard-wired code. Hopefully it's not a bureaucratic government committee ;-)

Niggling Practical Matters and Future Directions

Of course, this general schema could be implemented using OpenCog or any other practical AGI architecture as a foundation -- in this case, OpenCog is "merely" the initial condition for the Predictor and Searcher. In this sense, the approach is not extraordinarily impractical.

However, one major issue arising with the whole meta-architecture proposed is that, given the nature of the real world, it's hard to estimate how well the Goal Evaluator will do its job! If one is willing to assume the above meta-architecture, and if a proof along the lines suggested above can be found, then the “predictably beneficial” part of the problem of "predictably beneficial AGI" is largely pushed into the problem of the Goal Evaluator.

Returning to the "Chinese parent" metaphor, what I suggest may be possible to prove is that, given an effective parent, one can make a steadfast child -- if the child is programmed to obey the parent's advice about its goals, which includes advice about its meta-architecture. The hard problem is then ensuring that the parent's advice about goals is any good, as the world changes! And there's always the possibility that the parent's ideas about goals shift over time based on their interaction with the child (bringing us into the domain of modern or postmodern Chinese parents ;-D).

Thus, I suggest, the really hard problem of making predictably beneficial AGI probably isn't "preservation of formally-defined goal content under self-modification." This may be hard if one enables total self-modification, but I suggest it's probably not that hard if one places some fairly limited restrictions on self-modification. The hypothetical Chinese Parent Theorem vaguely outlined here can probably be proved and then strengthened pretty far, reducing meta-architectural assumptions considerably.

The really hard problem, I suspect, is how to create a GoalEvaluator that correctly updates goal content as new information about the world is obtained, and as the world changes -- in a way that preserves the spirit of the original goals even if the details of the original goals need to change. Because the "spirit" of goal content is a very subjective thing.

One approach to this problem, hinted above, would be to create a GoalEvaluator operating according to CAV. In that case, one would be counting on (a computer-aggregated version of) collective human intuition to figure out how to adapt human goals as the world, and human information about it, evolves. This is of course what happens now -- but the dynamic will be much more complex and more interesting with superhuman AGIs in the loop. Since interacting with the superhuman AGI will change human desires and intuitions in all sorts of ways, it's to be expected that such a system would NOT eternally remain consistent with original "legacy human" goals, but would evolve in some new and unpredicted direction....

A deep and difficult direction for theory, then, would be to try to understand the expected trajectories of development of systems including

  • a powerful AGI, with a Chinese Parent meta-architecture as outlined here (or something similar), whose GoalEvaluator is architected via CAV based on the evolving state of some population of intelligent agents
  • the population of intelligent agents, as ongoingly educated and inspired by both the world and the AGI

as they evolve over time and interact with a changing environment that they explore ever more thoroughly.

Sounds nontrivial!

Sunday, March 14, 2010

Creating Predictably Beneficial AGI

The theme of this post is a simple and important one: how to create AGI systems whose beneficialness to humans and other sentient beings can be somewhat reliably predicted.

My SIAI colleague Eliezer Yudkowsky has frequently spoken about the desirability of a "(mathematically) provably Friendly AI", where by "Friendly" he means something like "beneficial and not destructive to humans" (see here for a better summary). My topic here is related, but different; and I'll discuss the relationship between the two ideas below.

This post is a sort of continuation of my immediately previous blog post, further pursuing the topic of goal-system content for advanced, beneficial AGIs. That post discussed one of Yudkowsky's ideas related to "Friendliness" -- Coherent Extrapolated Volition (CEV) -- along with a more modest and (I suggest) more feasible notion of Coherent Aggregated Volition (CAV). The ideas presented here are intended to work along with CAV, rather than serving as an alternative.

There are also some relations between the ideas presented here and Schmidhuber's Godel Machine -- a theoretical, unlikely-ever-to-be-practically-realizable AGI system that uses theorem-proving to ensure its actions will provably help it achieve its goals.

Variations of "Provably Friendly AI"

What is "Provably Friendly AI"? (a quite different notion from "predictably beneficial AGI")

In an earlier version of this blog post I gave an insufficiently clear capsule summary of Eliezer's "Friendly AI" idea, as Eliezer pointed out in a comment to that version; so this section includes his comment and tries to do a less wrong job. The reader who only wants to find out about predictably beneficial AGI may skip to the next section!

In Eliezer's comment, he noted that his idea for a FAI proof is NOT to prove something about what certain AI systems would do to the universe, but rather about what would happen inside the AI system itself:

The putative proof in Friendly AI isn't proof of a physically good outcome when you interact with the physical universe.

You're only going to try to write proofs about things that happen inside the highly deterministic environment of a CPU, which means you're only going to write proofs about the AI's cognitive processes.

In particular you'd try to prove something like "this AI will try to maximize this goal function given its beliefs, and it will provably preserve this entire property (including this clause) as it self-modifies".

So, in the context of this particular mathematical research programme ("provable Friendliness"), what Eliezer is after is what we might call an internally Friendly AI, which is a separate notion from a physically Friendly AI. This seems an important distinction.

To me, "provably internally FAI" is interesting mainly as a stepping-stone to "provably physically FAI" -- and the latter is a problem that seems even harder than the former, in a variety of obvious and subtle ways (only a few of which will be mentioned here).

All in all, I think that "provably Friendly AI" -- in the above senses or others -- is an interesting and worthwhile goal to think about and work towards; but also that it's important to be cognizant of the limitations on the utility of such proofs.... Much as I love math (I even got a math PhD, way back when), I have to admit the world of mathematics has its limits.

First of all, Godel showed that mathematics is only formally meaningful relative to some particular axiom system, and that no axiom system can encompass all mathematics in a consistent way. This is worth reflecting on in the context of proofs about internally Friendly AI, especially when one considers the possibility of AGI systems with algorithmic information exceeding any humanly comprehensible axiom system. Obviously, we cannot understand proofs about many interesting properties or behaviors of the latter type of AGI system.

But more critically, the connection between internal Friendliness and physical Friendliness remains quite unclear. The connection between complex mathematics and physical reality is based on science, and all of our science is based on extrapolation from a finite bit-set of observations (which I've previously called the Master Data Set -- which is not currently all gathered into one place, though, given the advance of Internet technology, it soon may be).

For example, just to pose an extreme case, there could be aliens out there who identify and annihilate any planet that gives rise to a being with an IQ over 1000. In this case a provably internally FAI might not be physically Friendly at all; and through no fault of its own. It probably makes sense to carry out proofs and informal arguments about physically FAI based on assumptions ruling out weird cases like this -- but then the assumptions do need to be explicitly stated and clarified.

So, my worry about FAI in the sense of Eliezer's above comment, isn't so much about the difficulty of the "internally FAI" proof, but rather about the difficulty of formalizing the relation between internally FAI and physically FAI in a way that is going to make sense post-Singularity.

It seems to me that, given the limitations of our understanding of the physical universe: at very best, a certain AI design could potentially be proven physically Friendly in the same sense that, in the 1800s, quantum teleportation, nuclear weapons, backwards time travel, rapid forwards time travel, perpetual motion machines and fMRI machines could have been proved impossible. I.e., those things could have been proved impossible based on the "laws" of physics as assumed at that time. (Yes, I know we still think perpetual motion machines are impossible, according to current science. I think that's probably right, but who really knows for sure? And the jury is currently out on backwards time travel.)

One interesting scenario to think about would be a FAI in a big computer with a bunch of human uploads. Then one can think about "simulated-physically FAI" as a subcase of "internally FAI." In this simulation scenario, one can also think about FAI and CEV together in a purely deterministic context. But of course, this sort of "thought experiment" leads to complexities related to the possibility of somebody in the physical universe but outside the CPU attacking the FAI and threatening it and its population of uploads...

OK, enough about FAI for now. Now, on to discuss a related quest, which is different from the quest for FAI in several ways; but more similar to the quest for physically FAI than that for internally FAI....

Predictably Beneficial AGI

The goal of my thinking about "predictably beneficial AGI" is to figure out how to create extremely powerful AGI systems that appear likely to be beneficial to humans, under reasonable assumptions about the physical world and the situations the AI will encounter.

Here "predictable" doesn't mean absolutely predictable, just: statistically predictable, given the available knowledge about the AGI system and the world at a particular point in time.

An obvious question is what sort of mathematics will be useful in the pursuit of predictably beneficial AGI. One possibility is theoretical computer science and formal logic, and I wouldn't want to discount what those disciplines could contribute. Another possibility, though, which seems particularly appealing to me, is nonlinear dynamical systems theory. Of course the two areas are not exclusive, and there are many known connections between these kinds of mathematics.

On the crudest level, one way to model the problem is as follows. One has a system S, so that

S(t+1) = q( S(t), E(t) )

E(t+1) = r(S(t), E(t) )

where E is the environment (which is best modeled as stochastic and not fully known). One has an objective function

G( E(t),...,E(t+s) )

that one would like to see maximized -- this is the "goal." Coherent Aggregated Volition, as described in my previous blog post, is one candidate for such a goal.
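These coupled equations are easy to play with numerically. The following sketch just iterates the dynamics and scores the resulting environment trajectory; the particular choices of q, r and G are toy examples, purely illustrative:

```python
def rollout(q, r, s0, e0, steps):
    """Iterate the coupled system/environment dynamics
        S(t+1) = q(S(t), E(t)),   E(t+1) = r(S(t), E(t))
    and return the environment trajectory E(t), ..., E(t+steps)."""
    S, E = s0, e0
    traj = [E]
    for _ in range(steps):
        S, E = q(S, E), r(S, E)   # simultaneous update, as in the equations
        traj.append(E)
    return traj

# Toy usage: the system nudges itself toward the environment's state, the
# environment drifts under the system's influence; the objective G rewards
# trajectories that settle near zero.
q = lambda S, E: 0.5 * S + 0.5 * E
r = lambda S, E: 0.9 * E - 0.1 * S
G = lambda traj: -sum(abs(e) for e in traj)

traj = rollout(q, r, s0=1.0, e0=-1.0, steps=10)
score = G(traj)
```

In the real problem, of course, E is stochastic and only partially known, and G is something like CAV rather than a one-liner -- but even this crude framing makes clear that the "goal" is a functional of an environment trajectory, not of a single state.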

One may also assume a set of constraints C that the system must obey, which (using the same notation as for G) we may write as

C( E(t),...,E(t+s) )

The functions G and C are assumed to encapsulate the intuitive notion of "beneficialness."

Of course, the constraints may be baked into the objective function, but there are many ways of doing this; and it's often interesting in optimization problems to separate the objective function from the constraints, so one can experiment with different ways of combining them.

This is a problem class that is incredibly (indeed, uncomputably) hard to solve in the general case ... so the question comes down to: given the particular G and C of interest, is there a subclass of systems S for which the problem is feasibly and approximately solvable?

This leads to an idea I will call the Simple Optimization Machine (SOMA)... a system S which seeks to maximize the two objectives

  1. maximize G, while obeying C
  2. maximize the simplicity of the problem of estimating the degree to which S will "maximize G, while obeying C", given the Master Data Set

Basically, the problem of ensuring the system lies in the "nice region of problem space" is thrown to the system itself, to figure out as part of its learning process!

Of course one could wrap this simplicity criterion into G, but it seems conceptually simplest to leave it separate, at least for purposes of current discussion.

The function via which these two objectives are weighted is a parameter that must be tuned. The measurement of simplicity can also be configured in various ways!

A hard constraint could also be put on the minimum simplicity to be accepted (e.g. "within the comprehensibility threshold of well-educated, unaugmented humans").
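Putting the two objectives, the tunable weighting, and the hard simplicity floor together, a SOMA-style scoring function might look as follows. The weights, the floor value, and the toy candidates are all illustrative assumptions, not proposals:

```python
def soma_score(candidate, goal_G, constraints_C, simplicity,
               w_goal=0.7, w_simple=0.3, min_simplicity=0.2):
    """Score a candidate system under the two SOMA objectives:
    (1) maximize G while obeying the constraints C, and
    (2) maximize the simplicity of estimating that the candidate will
        "maximize G while obeying C",
    combined by a tunable weighting.  A hard floor on simplicity (e.g.
    "comprehensible to unaugmented humans") is also enforced."""
    if not all(c(candidate) for c in constraints_C):
        return float("-inf")        # hard constraint violated
    s = simplicity(candidate)
    if s < min_simplicity:
        return float("-inf")        # below the comprehensibility floor
    return w_goal * goal_G(candidate) + w_simple * s

# Toy usage: candidates are (goal_value, simplicity) pairs; the most
# goal-effective candidate loses because it falls below the floor.
C = [lambda c: c[0] >= 0.0]
best = max([(0.9, 0.1), (0.8, 0.6), (0.5, 0.9)],
           key=lambda c: soma_score(c, goal_G=lambda c: c[0],
                                    constraints_C=C,
                                    simplicity=lambda c: c[1]))
```

The interesting design choice, as noted above, is that the second objective hands the "stay in the nice, predictable region of system space" problem to the system's own learning process.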

Conceptually, one could view this as a relative of Schmidhuber's Godel Machine. The Godel Machine (put very roughly) seeks to achieve a goal in a provably correct way, and before each step it takes, it seeks to prove that this step will improve its goal-achievement. SOMA, on the other hand, seeks to achieve a goal in a manner that seems to be simply demonstrable to be likely to work, and seeks to continually modify itself and its world with this in mind.

A technical note: one could argue that because the functions q and r are assumed fixed, the above framework doesn't encompass "truly self-modifying systems." I have previously played around with using hyperset equations like

S(t+1) = S(t)[S(t)]

and there is no real problem with doing this, but I'm not sure it adds anything to the discussion at this point. One may consider q and r to be given by the laws of physics; and I suppose that it's best to initially restrict our analytical explorations of beneficial AGI to the case of AGI systems that don't revise the laws of physics. If we can't understand the case of physics-obeying agents, understanding the more general case is probably hopeless!


I stress that SOMA is really an idea about goal system content, and not an AGI design in itself. SOMA could be implemented in the context of a variety of different AGI designs, including for instance the open-source OpenCog approach.

It is not hard to envision ways of prototyping SOMA given current technology, using existing machine learning and reasoning algorithms, in OpenCog or otherwise. Of course, such prototype experiments would give limited direct information about the behavior of SOMA for superhuman AGI systems -- but they might give significant indirect information, via helping lead us to general mathematical conclusions about SOMA dynamics.

Altogether, my feeling is that "CAV + Predictably Beneficial AGI" is on the frontier of current mathematics and science. They pose some very difficult problems that do, however, seem potentially addressable in the near future via a combination of mathematics and computational experimentation. On the other hand, I have a less clear idea of how to pragmatically do research work on CEV or the creation of practically feasible yet provably physically Friendly AGI.

My hope in proposing these ideas is that they (or other similar ideas conceived by others) may serve as a sort of bridge between real-world AGI work and abstract ethical considerations about the hypothetical goal content of superhuman AGI systems.