My SIAI colleague Eliezer Yudkowsky has frequently spoken about the desirability of a "(mathematically) provably Friendly AI", where by "Friendly" he means something like "beneficial and not destructive to humans" (see here for a better summary). My topic here is related, but different; and I'll discuss the relationship between the two ideas below.

This post is a sort of continuation of my immediately previous blog post, further pursuing the topic of goal-system content for advanced, beneficial AGIs. That post discussed one of Yudkowsky's ideas related to "Friendliness" -- Coherent Extrapolated Volition (CEV) -- along with a more modest and (I suggest) more feasible notion of Coherent Aggregated Volition (CAV). The ideas presented here are intended to work along with CAV, rather than serving as an alternative.

There are also some relations between the ideas presented here and Schmidhuber's Godel Machine -- a theoretical, unlikely-ever-to-be-practically-realizable AGI system that uses theorem-proving to ensure its actions will provably help it achieve its goals.

Variations of "Provably Friendly AI"

What is "Provably Friendly AI"? (a quite different notion from "predictably beneficial AGI")

In an earlier version of this blog post I gave an insufficiently clear capsule summary of Eliezer's "Friendly AI" idea, as Eliezer pointed out in a comment to that version; so this section includes his comment and tries to do a less wrong job. The reader who only wants to find out about predictably beneficial AGI may skip to the next section!

In Eliezer's comment, he noted that his idea for a FAI proof is NOT to prove something about what certain AI systems would do to the universe, but rather about what would happen inside the AI system itself:

The putative proof in Friendly AI isn't proof of a physically good outcome when you interact with the physical universe.

You're only going to try to write proofs about things that happen inside the highly deterministic environment of a CPU, which means you're only going to write proofs about the AI's cognitive processes.

In particular you'd try to prove something like "this AI will try to maximize this goal function given its beliefs, and it will provably preserve this entire property (including this clause) as it self-modifies".

So, in the context of this particular mathematical research programme ("provable Friendliness"), what Eliezer is after is what we might call an internally Friendly AI, which is a separate notion from a physically Friendly AI. This seems an important distinction.

To me, "provably internally FAI" is interesting mainly as a stepping-stone to "provably physically FAI" -- and the latter is a problem that seems even harder than the former, in a variety of obvious and subtle ways (only a few of which will be mentioned here).

All in all, I think that "provably Friendly AI" -- in the above senses or others -- is an interesting and worthwhile goal to think about and work towards; but also that it's important to be cognizant of the limitations on the utility of such proofs.... Much as I love math (I even got a math PhD, way back when), I have to admit the world of mathematics has its limits.

First of all, Godel showed that mathematics is only formally meaningful relative to some particular axiom system, and that no axiom system can encompass all mathematics in a consistent way. This is worth reflecting on in the context of proofs about internally Friendly AI, especially when one considers the possibility of AGI systems with algorithmic information exceeding any humanly comprehensible axiom system. Obviously, we cannot understand proofs about many interesting properties or behaviors of the latter type of AGI system.

But more critically, the connection between internal Friendliness and physical Friendliness remains quite unclear. The connection between complex mathematics and physical reality is based on science, and all of our science is based on extrapolation from a finite bit-set of observations (which I've previously called the Master Data Set -- which is not currently all gathered into one place, though, given the advance of Internet technology, it soon may be).

For example, just to pose an extreme case, there could be aliens out there who identify and annihilate any planet that gives rise to a being with an IQ over 1000. In this case a provably internally FAI might not be physically Friendly at all; and through no fault of its own. It probably makes sense to carry out proofs and informal arguments about physically FAI based on assumptions ruling out weird cases like this -- but then the assumptions do need to be explicitly stated and clarified.

So my worry about FAI, in the sense of Eliezer's above comment, isn't so much about the difficulty of the "internally FAI" proof, but rather about the difficulty of formalizing the relation between internally FAI and physically FAI in a way that is going to make sense post-Singularity.

It seems to me that, given the limitations of our understanding of the physical universe, at very best a certain AI design could potentially be proven physically Friendly in the same sense that, in the 1800s, quantum teleportation, nuclear weapons, backwards time travel, rapid forwards time travel, perpetual motion machines and fMRI machines could have been proved impossible. I.e., those things could have been proved impossible based on the "laws" of physics as assumed at that time. (Yes, I know we still think perpetual motion machines are impossible, according to current science. I think that's probably right, but who really knows for sure? And the jury is currently out on backwards time travel.)

One interesting scenario to think about would be a FAI in a big computer with a bunch of human uploads. Then one can think about "simulated-physically FAI" as a subcase of "internally FAI." In this simulation scenario, one can also think about FAI and CEV together in a purely deterministic context. But of course, this sort of "thought experiment" leads to complexities related to the possibility of somebody in the physical universe but outside the CPU attacking the FAI and threatening it and its population of uploads...

OK, enough about FAI for now. Now, on to discuss a related quest, which is different from the quest for FAI in several ways; but more similar to the quest for physically FAI than that for internally FAI....

Predictably Beneficial AGI

The goal of my thinking about "predictably beneficial AGI" is to figure out how to create extremely powerful AGI systems that appear likely to be beneficial to humans, under reasonable assumptions about the physical world and the situations the AI will encounter.

Here "predictable" doesn't mean absolutely predictable, just: statistically predictable, given the available knowledge about the AGI system and the world at a particular point in time.

An obvious question is what sort of mathematics will be useful in the pursuit of predictably beneficial AGI. One possibility is theoretical computer science and formal logic, and I wouldn't want to discount what those disciplines could contribute. Another possibility, though, which seems particularly appealing to me, is nonlinear dynamical systems theory. Of course the two areas are not exclusive, and there are many known connections between these kinds of mathematics.

On the crudest level, one way to model the problem is as follows. One has a system S, so that

S(t+1) = q( S(t), E(t) )

E(t+1) = r(S(t), E(t) )

where E is the environment (which is best modeled as stochastic and not fully known). One has an objective function

G( E(t),...,E(t+s) )

that one would like to see maximized -- this is the "goal." Coherent Aggregated Volition, as described in my previous blog post, is one candidate for such a goal.

One may also assume a set of constraints C that the system must obey, which we may write as

C(E(t),...,E(t+s))

The functions G and C are assumed to encapsulate the intuitive notion of "beneficialness."
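To make the setup above concrete, here is a minimal Python sketch of the coupled dynamics, with G and C evaluated over a window of environment states. The particular forms of `q`, `r`, `G` and `C` below are toy assumptions chosen only for illustration (scalar states, Gaussian noise, a target value of 1.0, a bound of 5.0); nothing in the framework fixes them.

```python
import random

def q(s, e):
    """Hypothetical system update: the system moves halfway toward the environment."""
    return s + 0.5 * (e - s)

def r(s, e, rng):
    """Hypothetical stochastic environment update: mostly inertia, a nudge from S, plus noise."""
    return 0.9 * e + 0.1 * s + rng.gauss(0.0, 0.1)

def rollout(s0, e0, steps, seed=0):
    """Iterate S(t+1) = q(S(t), E(t)) and E(t+1) = r(S(t), E(t)); return [E(0), ..., E(steps)]."""
    rng = random.Random(seed)
    s, e = s0, e0
    env = [e]
    for _ in range(steps):
        # Tuple assignment: both updates are computed from the old (s, e).
        s, e = q(s, e), r(s, e, rng)
        env.append(e)
    return env

def G(window):
    """Hypothetical objective over E(t), ..., E(t+s): prefer environments close to 1.0."""
    return -sum((e - 1.0) ** 2 for e in window) / len(window)

def C(window, bound=5.0):
    """Hypothetical constraint over the same window: the environment must stay bounded."""
    return all(abs(e) <= bound for e in window)

env = rollout(s0=0.0, e0=1.0, steps=10, seed=1)
print("G =", G(env), "C satisfied:", C(env))
```

Because the environment is stochastic, a single rollout says little; averaging G and the rate of constraint satisfaction over many seeds is what connects this toy to "statistically predictable" in the sense used above.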

Of course, the constraints may be baked into the objective function, but there are many ways of doing this; and it's often interesting in optimization problems to separate the objective function from the constraints, so one can experiment with different ways of combining them.
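As a small illustration of why the separation matters, the sketch below compares two ways of combining an objective value with a constraint: a soft penalty with a tunable weight, and a hard rejection of any violation. The candidate names, their scores, and the notion of a scalar "violation" amount are all hypothetical, invented only to show that the choice of combination can change which candidate wins.

```python
def penalized_objective(g_value, violation, penalty_weight):
    """Soft combination: subtract a weighted penalty for the amount of constraint violation."""
    return g_value - penalty_weight * violation

def constrained_objective(g_value, violation):
    """Hard combination: any violation at all disqualifies the candidate."""
    return g_value if violation == 0 else float("-inf")

def best(candidates, score):
    """Return the name of the candidate with the highest score."""
    return max(candidates, key=lambda name: score(*candidates[name]))

# Hypothetical candidates: name -> (value of G, amount by which C is violated).
candidates = {
    "cautious": (0.6, 0.0),
    "bold": (0.9, 0.2),
}

light = best(candidates, lambda g, v: penalized_objective(g, v, 1.0))
heavy = best(candidates, lambda g, v: penalized_objective(g, v, 2.0))
hard = best(candidates, constrained_objective)
print(light, heavy, hard)
```

With a light penalty the higher-G candidate is preferred despite its violation; with a heavier penalty, or with the hard constraint, the violation-free candidate is preferred.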

This is a problem class that is incredibly (indeed, uncomputably) hard to solve in the general case ... so the question comes down to: given the particular G and C of interest, is there a subclass of systems S for which the problem is feasibly and approximately solvable?

This leads to an idea I will call the Simple Optimization Machine (SOMA)... a system S which pursues two objectives:

- maximize G, while obeying C
- maximize the simplicity of the problem of estimating the degree to which S will "maximize G, while obeying C", given the Master Data Set

Basically, the problem of ensuring the system lies in the "nice region of problem space" is thrown to the system itself, to figure out as part of its learning process!

Of course one could wrap this simplicity criterion into G, but it seems conceptually simplest to leave it separate, at least for purposes of current discussion.

The function via which these two objectives are weighted is a parameter that must be tuned. The measurement of simplicity can also be configured in various ways!

A hard constraint could also be put on the minimum simplicity to be accepted (e.g. "within the comprehensibility threshold of well-educated, unaugmented humans").
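The weighting and the simplicity floor can be sketched as follows. This is only a toy under strong assumptions: each candidate system is represented by a list of sampled values of G (with constraint-violating runs assumed to be already excluded or penalized), and "simplicity of estimation" is stood in for by a hypothetical proxy, namely low spread across samples. The proxy, the linear weighting, and the numbers are all illustrative choices, not part of the SOMA idea itself.

```python
import statistics

def simplicity(samples):
    """Hypothetical proxy in (0, 1]: outcomes with low spread are easier to estimate."""
    return 1.0 / (1.0 + statistics.pstdev(samples))

def soma_score(samples, weight, min_simplicity=0.0):
    """Weighted mix of estimated benefit and simplicity, with a hard simplicity floor."""
    simp = simplicity(samples)
    if simp < min_simplicity:
        return float("-inf")  # below the comprehensibility threshold: reject outright
    benefit = statistics.fmean(samples)
    return weight * benefit + (1.0 - weight) * simp

def choose(candidates, weight, min_simplicity=0.0):
    """Return the name of the candidate with the highest SOMA score."""
    return max(candidates, key=lambda n: soma_score(candidates[n], weight, min_simplicity))

# Hypothetical sampled G values for two candidate systems.
candidates = {
    "steady": [0.6, 0.6, 0.6, 0.6],     # slightly lower benefit, fully predictable
    "erratic": [1.5, -0.1, 1.5, -0.1],  # slightly higher mean benefit, hard to predict
}

print(choose(candidates, weight=0.9))
print(choose(candidates, weight=0.5))
print(choose(candidates, weight=0.9, min_simplicity=0.6))
```

When benefit dominates the weighting, the erratic candidate wins on its higher mean; shifting weight toward simplicity, or imposing the floor, selects the steady one. This is the sense in which the weighting function and the simplicity measure are parameters that must be tuned.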

Conceptually, one could view this as a relative of Schmidhuber's Godel Machine. The Godel Machine (put very roughly) seeks to achieve a goal in a provably correct way, and before each step it takes, it seeks to prove that this step will improve its goal-achievement. SOMA, on the other hand, seeks to achieve a goal in a manner that seems to be simply demonstrable to be likely to work, and seeks to continually modify itself and its world with this in mind.

A technical note: one could argue that because the functions q and r are assumed fixed, the above framework doesn't encompass "truly self-modifying systems." I have previously played around with using hyperset equations like

S(t+1) = S(t)[S(t)]

and there is no real problem with doing this, but I'm not sure it adds anything to the discussion at this point. One may consider q and r to be given by the laws of physics; and I suppose that it's best to initially restrict our analytical explorations of beneficial AGI to the case of AGI systems that don't revise the laws of physics. If we can't understand the case of physics-obeying agents, understanding the more general case is probably hopeless!

Discussion

I stress that SOMA is really an idea about goal system content, and not an AGI design in itself. SOMA could be implemented in the context of a variety of different AGI designs, including for instance the open-source OpenCog approach.

It is not hard to envision ways of prototyping SOMA given current technology, using existing machine learning and reasoning algorithms, in OpenCog or otherwise. Of course, such prototype experiments would give limited direct information about the behavior of SOMA for superhuman AGI systems -- but they might give significant indirect information, via helping lead us to general mathematical conclusions about SOMA dynamics.

Altogether, my feeling is that "CAV + Predictably Beneficial AGI" is on the frontier of current mathematics and science. These ideas pose some very difficult problems that do, however, seem potentially addressable in the near future via a combination of mathematics and computational experimentation. On the other hand, I have a less clear idea of how to pragmatically do research work on CEV or the creation of practically feasible yet provably physically Friendly AGI.

My hope in proposing these ideas is that they (or other similar ideas conceived by others) may serve as a sort of bridge between real-world AGI work and abstract ethical considerations about the hypothetical goal content of superhuman AGI systems.

## 3 comments:

The putative proof in Friendly AI isn't proof of a physically good outcome when you interact with the physical universe.

You're only going to try to write proofs about things that happen inside the highly deterministic environment of a CPU, which means you're only going to write proofs about the AI's cognitive processes.

In particular you'd try to prove something like "this AI will try to maximize this goal function given its beliefs, and it will provably preserve this entire property (including this clause) as it self-modifies".

So, yes, I'm afraid you were arguing against a bit of a strawman here.

-- Eliezer Yudkowsky

Eliezer,

Sorry the blog post gave the impression of "arguing against a straw man"; I've revised the wording so as to incorporate your comment and (hopefully) no longer give that impression!

Of course, the main point of my post was not to argue against anything but to suggest an alternative approach...

-- Ben

Let's say I build a knowledge representation pool (e.g. some relational database) and a translation engine (e.g. some searching modality), such that any human who wants to ask any question that is not NP-hard (e.g. which way to Venus? How many photons were emitted from the sun yesterday? How long until my flowers bloom? Note: In the case of the final question, the AI might have to specify a bounded region of time, assign probabilities, and use some kind of heuristic to supply a response, and qualify that response with references to indeterminacy) can get a short, intelligible answer immediately transmitted to his or her mind.

By arranging a relation between the mind of this AI and the mind of the user, you have to account for the interaction between the user and the AI. Even if the AI is passive, and only responds to queries, by empowering the user with knowledge, the AI can become the impetus for unfriendly effects in the ecosystem. There are suggestibility effects. The AI has to handle cases, the AI has to know something about the user before the AI supplies the user with information, and the AI has to have goals and constraints in place to prevent certain outcomes deemed to be unfriendly.

The deeming of a particular case scenario as unfriendly will vary with the moral framework of the host, which will possibly be modified by the host's shock levels. A post-conventional shock level 5 Dr. Manhattan may not regard the destruction of all life on earth as of superior moral significance to the decay of a radioactive isotope in a box in an imagined universe. A Virindi Master might tell you that your epithelial tissue is most supple, and offer to buy it from you. You might narrate a stream of consciousness to your mother containing the perceptions of exploited animals around the world, and cause her to go into epileptic shock and begin talking to an invisible being. You have to respond to these kinds of situations on a case-by-case basis until some general rule can be formulated which allows you to handle all cases.

Since the suggestibility of individuals is variable, you have to have some understanding of those individuals' minds.

This bridges me to a discussion of paradigms. If you observe the socioeconomic structure of a society, you can make inferences about the sociocultural orientation of the groups, and the tacit epistemological assumptions made by individuals in those groups. By visiting a particular individual, and eliciting information from this individual (for example, by shocking the individual, as Dirk Gently did to Richard Macduff, by informing him that he was a suspect in a murder case), one is able to become percipiently aware of the individual's suggestibility, well enough even to prompt the individual to make decisions like, strip naked, jump off a bridge, and tread water until there is no more energy to tread water with, or some condition precedent to the exit of the loop is fulfilled.

As a species, and as individuals, we have to become aware of our percipient interests, our real interests. Hypothetically, these interests can be identified in a boolean fashion. For example, there may have been a transaction or occurrence which gave rise to life on earth. Let's say that a space ship blew up 4.2 billion years ago, and there is a ghost that is attempting to go back in time to cause the spaceship to not blow up. If that ghost fulfills its mission, we cease to exist. So if survival of our species is one of the criteria for friendliness, then a friendly AI would probably identify the intentions of any entity that wishes to adversely impact our survival, and set up strange attractors to divert that entity's consciousness away from any event in spacetime that could be modified to adversely affect our survival interest. I am interested to know what you think are the real interests that our species experiences, Ben.

This is an un-edited comment, as I have work and play to do.
