This is closely related to the problem Eliezer Yudkowsky has described as "provably Friendly AI." However, I would rather not cast the problem that way, because (as Eliezer of course realizes) there is an aspect of the problem that isn't really about "Friendliness" or any other particular goal system content, but is "merely" about the general process of goal-content preservation under progressive self-modification.
Informally, I define an intelligent system as steadfast if it continues to pursue the same goals over a long period of time. In this terminology, one way to confront the problem of creating predictably beneficial AGI is to solve the two problems of:
- Figuring out how to encapsulate the goal of beneficialness in an AGI's goal system
- Figuring out how to create (perhaps provably) steadfast AGI, in a way that applies to the "beneficialness" goal among others
The meat of this post is a description of an AGI meta-architecture that I label the Chinese Parent Meta-Architecture -- and that I conjecture could be proved to be steadfast, under some reasonable (though not necessarily realistic, since the universe is a mysterious place!) assumptions about the AGI system's environment.
I don't actually prove any steadfastness result here -- I just sketch a vague conjecture, which if formalized and proved would deserve the noble name "Chinese Parent Theorem."
I got partway through a proof yesterday and it seemed to be going OK, but I've been distracted by more practical matters, and so for now I decided to just post the basic idea here instead...
Proving Friendly AI
Eliezer Yudkowsky has described his goal concerning “proving Friendly AI” informally as follows:
The putative proof in Friendly AI isn't proof of a physically good outcome when you interact with the physical universe.
You're only going to try to write proofs about things that happen inside the highly deterministic environment of a CPU, which means you're only going to write proofs about the AI's cognitive processes.
In particular you'd try to prove something like "this AI will try to maximize this goal function given its beliefs, and it will provably preserve this entire property (including this clause) as it self-modifies".
It seems to me that proving something like this shouldn’t be sooooo hard to achieve if one assumes some basic fixed “meta-architectural” structure on the part of the AI, rather than permitting total unrestricted self-modification. Such a meta-architecture can be assumed without placing any limits on the AI’s algorithmic information content, for example.
Of course, preservation of the meta-architecture can be assumed as part of the AI system's goal function. So by assuming a meta-architecture, one may be able to prove a result restricted to a certain broad class of goal functions ... and the question becomes whether that class is broad enough to be interesting.
So my feeling is that, if one wants to pursue such a research direction, it makes sense to begin by proving theorems restricted to goals embodying some assumptions about fixed program structure -- and then try to improve the theorems by relaxing the assumptions.
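To illustrate the point about folding meta-architecture preservation into the goal function (the notation here is my own, purely for concreteness): if G is the original goal function over possible worlds and M is the fixed meta-architecture, one can work with the modified goal function

G'(W) = G(W) if world W contains S still running meta-architecture M, and G'(W) = 0 otherwise

so that the theorems one tries to prove are restricted to goal functions of this G' form, and keeping the fixed structure intact is part of what the system is rewarded for.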
A Simple AGI Meta-Architecture with the Appearance of Steadfastness
After writing the first draft of this post, I discussed the "provably steadfast AGI" problem with a clever Chinese friend, and she commented that what the self-modifying AGI needs (in order to maintain its original goal content as it self-modifies) is a traditional Chinese parent, who will watch the system from the outside, and continually nag it and pester it and remind it of its original goals.
At first I thought this was just funny, but then it occurred to me that it was actually the same idea as my meta-architecture! My GoalEvaluator component (in the meta-architecture below) is basically a ChineseParent component, living separately from the rest of the system and providing ongoing, fixed goal evaluation.
The thought-experiment I've been exploring is an AGI system S with the following high-level meta-architecture (a toy code sketch follows the list):
- Goal Evaluator = external system (not part of S) that tells S, for each possible future world (including environment states and internal program states), how well this world fulfills the goal. This is the "Chinese parent."
- HistoricalRepository = database storing the past history of S’s internal states and actions, as well as information about the environment during S’s past
- Operating Program = the program by which S governs its actions at a given point in time
  - chosen by the Metaprogram as the best program the Searcher has found, where "best" is judged as "highest probability of goal achievement" based on the output of the Predictor and the Goal Evaluator
- Predictor = program that estimates, given a candidate operating program P and a possible future world W, the odds of P leading to W
- Searcher = program that searches through program space to find a new program optimizing a provided objective function
- Tester = hard-wired program that estimates the quality of a candidate Predictor, using a simple backtesting methodology
  - That is, the Tester assesses how well a Predictor would have performed in the past, using the data in the HistoricalRepository
- Metaprogram = fixed program that uses the Searcher to find a good
  - Searcher program (judged by the quality of the programs it finds, as judged by the Predictor program)
  - Predictor program (as judged by the Tester's assessments of its predictions)
  - Operating Program (judged by the Predictor working with the Goal Evaluator, according to the idea of choosing an Operating Program with the maximum expected goal achievement)
  - The Metaprogram also determines the amount of resources to allocate to searching for a Searcher versus a Predictor versus an Operating Program, according to a fixed algorithm for parameter adaptation.
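To make the wiring concrete, here is a minimal toy sketch in Python. Every name, data structure, and the tiny hand-made "program space" below are my own illustrative assumptions (this is not OpenCog code or anyone's actual implementation); the only point is to show how a fixed Metaprogram can use the Searcher, Tester, Predictor and Goal Evaluator together.

```python
# Toy sketch only: "programs" and "predictors" are stand-in Python callables,
# and the search spaces are tiny hand-made lists.

# Goal Evaluator ("Chinese parent"): lives outside S and scores possible worlds.
def goal_evaluator(world):
    return float(world)  # placeholder goal: worlds are numbers, bigger is better

# HistoricalRepository: past records of which worlds running a program led to.
history = [
    {"program": "p1", "world": 3.0, "occurred": 1.0},  # running p1 did lead to world 3.0
    {"program": "p2", "world": 3.0, "occurred": 0.0},  # running p2 did not
]

# Tester: hard-wired backtester -- how closely did a candidate Predictor's
# odds match what actually happened in the HistoricalRepository?
def tester(predictor):
    errors = [abs(predictor(r["program"], r["world"]) - r["occurred"]) for r in history]
    return -sum(errors) / len(errors)  # higher (less error) is better

# Searcher: searches a "program space" for whatever maximizes a given objective.
def searcher(candidates, objective):
    return max(candidates, key=objective)

# Candidate Predictors: estimate the odds that program P leads to world W.
predictor_candidates = [
    lambda prog, world: 0.5,                           # uninformative
    lambda prog, world: 0.9 if prog == "p1" else 0.1,  # "p1 tends to reach good worlds"
]

# Candidate Operating Programs and the future worlds under consideration.
op_candidates = ["p1", "p2"]
possible_worlds = [1.0, 3.0]

# Metaprogram: the fixed outer loop that is never itself rewritten.
def metaprogram():
    # 1. Choose the Predictor that the Tester rates best on historical data.
    predictor = searcher(predictor_candidates, tester)

    # 2. Choose the Operating Program with maximum expected goal achievement,
    #    i.e. argmax over P of (sum over W of Pr(W | P) * GoalEvaluator(W)).
    def expected_goal(program):
        return sum(predictor(program, w) * goal_evaluator(w) for w in possible_worlds)

    return searcher(op_candidates, expected_goal)

print("Chosen Operating Program:", metaprogram())
```

A fuller version would also use the Searcher to improve the Searcher and Predictor themselves over time, and would adapt the split of resources among these searches as per the last bullet above; the toy freezes all of that just to keep the sketch legible.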
Then, it seems to me one can very likely prove that the program will maintain its architecture while seeking to maximize the goal function F embodied in the Goal Evaluator, under the assumptions that
- the Goal Evaluator (i.e. the "Chinese parent") does its job correctly
- no one messes with the underlying hardware.
If proved, this would be the so-called "Chinese Parent Theorem" ;-) .... I don't have time to work on such a proof right now, but would be happy to collaborate with someone on this!
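For concreteness, here is roughly what I imagine such a theorem would assert (this is my own informal rendering of the conjecture, not something that has been proved): if at time 0 the system S runs the meta-architecture above with goal function F supplied by the Goal Evaluator, and the two assumptions above hold, then at every later time t

Arch(S_t) = Arch(S_0)   and   OP_t = argmax_P sum_W Pr_t(W | P) * F(W)

i.e. the fixed structure is still in place, and the Operating Program in use is still the one that the current Predictor and the Goal Evaluator jointly rate as having maximum expected goal achievement.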
As noted above, this approach doesn't allow full self-modification; it assumes certain key parts of the AGI (meta)architecture are hard-wired. But the hard-wired parts are quite basic and leave a lot of flexibility. So a "Chinese Parent Theorem" of this nature would cover a fairly broad and interesting class of goal functions, it seems to me.
What happens if one implements the Goal Evaluator according to the same architecture, though? In this case, one must postulate a meta-Goal-Evaluator, whose goal is to specify the goals for the first Goal Evaluator: the Chinese Grandparent! Eventually the series must end, and one must postulate an original ancestor Goal Evaluator that operates according to some other architecture. Maybe it's a human, maybe it's CAV (Coherent Aggregated Volition), maybe it's some hard-wired code. Hopefully it's not a bureaucratic government committee ;-)
Niggling Practical Matters and Future Directions
Of course, this general schema could be implemented using OpenCog or any other practical AGI architecture as a foundation -- in this case, OpenCog is "merely" the initial condition for the Predictor and Searcher. In this sense, the approach is not extraordinarily impractical.
However, one major issue arising with the whole meta-architecture proposed is that, given the nature of the real world, it's hard to estimate how well the Goal Evaluator will do its job! If one is willing to assume the above meta-architecture, and if a proof along the lines suggested above can be found, then the “predictably beneficial” part of the problem of "predictably beneficial AGI" is largely pushed into the problem of the Goal Evaluator.
Returning to the "Chinese parent" metaphor, what I suggest may be possible to prove is that given an effective parent, one can make a steadfast child -- if the child is programmed to obey the parent's advice about its goals, which include advice about its meta-architecture. The hard problem is then ensuring that the parent's advice about goals is any good, as the world changes! And there's always the possibility that the parents ideas about goals shift over time based on their interaction with the child (bringing us into the domain of modern or postmodern Chinese parents ;-D)
Thus, I suggest, the really hard problem of making predictably beneficial AGI probably isn't "preservation of formally-defined goal content under self-modification." This may be hard if one enables total self-modification, but I suggest it's probably not that hard if one places some fairly limited restrictions on self-modification. The hypothetical Chinese Parent Theorem vaguely outlined here can probably be proved and then strengthened pretty far, reducing meta-architectural assumptions considerably.
The really hard problem, I suspect, is how to create a GoalEvaluator that correctly updates goal content as new information about the world is obtained, and as the world changes -- in a way that preserves the spirit of the original goals even if the details of the original goals need to change. Because the "spirit" of goal content is a very subjective thing.
One approach to this problem, hinted at above, would be to create a GoalEvaluator operating according to CAV. In that case, one would be counting on (a computer-aggregated version of) collective human intuition to figure out how to adapt human goals as the world, and human information about it, evolves. This is of course what happens now -- but the dynamic will be much more complex and more interesting with superhuman AGIs in the loop. Since interacting with the superhuman AGI will change human desires and intuitions in all sorts of ways, it's to be expected that such a system would NOT eternally remain consistent with original "legacy human" goals, but would evolve in some new and unpredicted direction....
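As one very crude illustration of what "computer-aggregated collective intuition" could mean at the code level (the aggregation rule here is just a placeholder of my own choosing; CAV's actual mechanism is not specified in this post):

```python
# Placeholder CAV-style Goal Evaluator: aggregate many agents' evaluations of a
# possible world into a single goal score. Using the median makes the score
# robust to a few outlier evaluations; real CAV would need something far subtler.
from statistics import median

def cav_goal_evaluator(world, evaluators):
    return median(e(world) for e in evaluators)

# Example: three hypothetical human evaluators scoring a world in [0, 1].
evaluators = [lambda w: 0.9, lambda w: 0.7, lambda w: 0.1]
print(cav_goal_evaluator("some possible world", evaluators))  # -> 0.7
```

The interesting (and hard) part is of course what replaces that median: how the aggregate shifts as the agents themselves learn and are changed by interacting with the AGI -- which is exactly the dynamic discussed next.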
A deep and difficult direction for theory, then, would be to try to understand the expected trajectories of development of systems including
- a powerful AGI, with a Chinese Parent meta-architecture as outlined here (or something similar), whose GoalEvaluator is architected via CAV based on the evolving state of some population of intelligent agents
- the population of intelligent agents, as it is continually educated and inspired by both the world and the AGI
as they evolve over time and interact with a changing environment that they explore ever more thoroughly.
Sounds nontrivial!
9 comments:
Systems with inflexible goals are on a certain path to extinction, in the long run. In trying to devise goals which don't change you kind of get into a situation of diminishing returns, in that the only goals which are truly universal are things which are extremely vague, such as "keep surviving". Any goal more elaborate than survival is bound to eventually break down in a complex changing environment containing other adapting entities.
Bob, I intuitively tend to agree that as Keynes said "in the long run we'll all be dead."
Even if we achieve immortality in various strong senses -- our selves will change over time so that our current selves will be dead anyway. Most likely.
But that doesn't obviate the interestingness of designing strategies for the medium term!!
As I tried to articulate in the last paragraph of my post, a meta-architecture like I described would ultimately co-evolve with the humans collaboratively defining its GoalEvaluator. So it wouldn't be a static thing. It would be a way of coupling the evolution of advanced AGI with the evolution of humanity ... for a while.
Bob ... hypothetically such an AGI meta-architecture as I described in this blog post could fit nicely into a path of "controlled gradual ascension" for humans.
The idea of "controlled ascension" being that if you become a god in 5 second, you're basically dying in a flash of ecstasy. But if you grow into a god gradually over 1000 years, then due to the continuity of the process, the god is more genuinely "you." ....
Which may not be important in a grand cosmic sense, but is important in a human sense.
So, one path to controlled ascension may be to create advanced AGIs that are highly biased to stick steadfastly to the collective goals of humanity, as they self-modify and become transhumanly intelligent.
This doesn't have to last forever .. if it just lasts long enough for humans to gradually feel themselves becoming gods, then controlled ascension is achieved ;D
But I didn't put this in the main blog post, because the Chinese Parent Meta-architecture is not tied to controlled ascension; it is relevant in other future scenarios as well.
IMO, the first thing that needs proving about such goal-directed systems is that they won't wirehead themselves.
Tim: you say the first thing to prove is that the system won't wirehead itself.
In this case, the only way it could wirehead itself would be to modify its meta-architecture or its parent (GoalEvaluator). So if one proves that the meta-architecture will remain invariant over time (under appropriate assumptions), then it follows that it won't wirehead itself.
Could you provide a short comment on how you see the "Chinese Parent Theorem" relating to "Goedel Machines" (by Schmidhuber)?
Responding to the comment on Godel machines: the Godel machine is an obviously infeasible architecture. The meta-architecture I described here could be implemented in a rigorous and infeasible way like a Godel machine, or it could be implemented with heuristic program search algorithms instead. However, it's not clear how strong the assumptions about the heuristic algorithms would have to be in order to get any interesting theorems without hideously complex proofs.
Very good post. This is a good place to step outside the box. Having the parents with 'agreed to' goals nagging the Meta Arc will work as the evolution of human society moves forward.
The parent will push the Meta Arc to attain greater goals, but these greater goals will be simplistic. As progression occurs, judgement within the machine will need to analyze whether a solution to a particular issue moves toward that greater goal, and then whether multiple issues and solutions rise to that level.
Through communication with humans, the above can also be analyzed -- a joining of the emotive and the calculative, providing comfort to humans as a whole. The great thing about the machine will be its ability not to react with emotion, which currently leads to the extreme where all bets are placed on something that oftentimes proves irrelevant over a period of years.
To push the goal leading to 'S' feeling satisfaction through a greater level of happiness in the world will lead to a world where greater things will be attained at a faster pace.
Good job.
But why Chinese, and not Japanese or Vietnamese?
I'm not in this area, but I would like to know why it couldn't be, say, an "Indian Parent Theorem"?