
Saturday, August 30, 2008

On the Preservation of Goals in Self-Modifying AI Systems

I wrote down some speculative musings on the preservation of goals in self-modifying AI systems a couple of weeks back; you can find them here:

The basic issue is: what can be done to mitigate the problem of "goal drift", wherein an AGI system starts out with a certain top-level goal governing its behavior, then gradually modifies its own code in various ways, and ultimately, through inadvertent consequences of those code revisions, winds up with different goals than it started with? I certainly didn't answer the question, but I came up with some new ways of thinking about the problem, and of formalizing it, that I think might be interesting.
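To make the worry concrete, here is a minimal Python sketch of a hypothetical toy model (not the model used in the paper): the goal is a vector of weights, and each self-modification perturbs every weight by small independent Gaussian noise, standing in for the inadvertent side effects of code revisions. With nothing pulling the goal back toward where it started, this is a random walk, so the expected squared distance from the original goal grows linearly with the number of modifications (roughly `steps * dim * sigma**2`). The names `drift_after`, `mean_squared_drift`, `dim` and `sigma` are illustrative assumptions.

```python
import math
import random


def drift_after(steps, rng, dim=3, sigma=0.01):
    """Distance between the original goal vector and the goal vector after
    `steps` unconstrained self-modifications (hypothetical toy model)."""
    original = [1.0] * dim
    goal = list(original)
    for _ in range(steps):
        # Each modification nudges every goal weight by independent Gaussian noise.
        goal = [w + rng.gauss(0.0, sigma) for w in goal]
    return math.dist(goal, original)


def mean_squared_drift(steps, trials=1000, dim=3, sigma=0.01, seed=0):
    """Monte Carlo estimate of the expected squared drift after `steps` modifications."""
    rng = random.Random(seed)
    total = sum(drift_after(steps, rng, dim, sigma) ** 2 for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for n in (10, 100, 400):
        # Print the simulated value next to the random-walk prediction n * dim * sigma^2.
        print(n, mean_squared_drift(n), n * 3 * 0.01 ** 2)
```

In this model, quadrupling the number of modifications roughly quadruples the mean squared drift: small per-step errors never stop accumulating.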

While the language of math is used in the paper, don't be fooled into thinking I've proved anything there. The paper contains speculative ideas without any real proof, just as surely as if they had been formulated in words without any equations. I just find that math is sometimes the clearest way to say what I'm thinking, even if I haven't come close to proving the correctness of what I'm thinking yet...

Here is the abstract of the speculative paper:

Toward an Understanding of the Preservation of Goals in Self-Modifying Cognitive Systems

Ben Goertzel

A new approach to thinking about the problem of “preservation of AI goal systems under repeated self-modification” (or, more compactly, “goal drift”) is presented, based on representing self-referential goals using hypersets and multi-objective optimization, and understanding self-modification of goals in terms of repeated iteration of mappings. The potential applicability of results from the theory of iterated random functions is discussed. Some heuristic conclusions are proposed regarding what kinds of concrete real-world objectives may best lend themselves to preservation under repeated self-modification. While the analysis presented is semi-rigorous at best, and highly preliminary, it does intuitively suggest that important humanly-desirable AI goals might plausibly be preserved under repeated self-modification. The practical severity of the problem of goal drift remains unresolved, but a set of conceptual and mathematical tools is proposed which may be useful for more thoroughly addressing the problem.
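The appeal to the theory of iterated random functions can be illustrated with one standard fact from that theory. Suppose, purely as an assumption for this sketch, that each self-modification acts on a one-dimensional "goal state" x as a random affine map f(x) = a*x + b. If the same sequence of random maps is applied to two different starting states, their separation is multiplied by |a| at every step. So when the maps contract (here a is drawn uniformly from [0, a_max] with a_max < 1) the two trajectories converge and the long-run behavior forgets where it started, while when the maps tend to expand, small differences blow up. The function `coupled_distance` and its parameters are hypothetical, not taken from the paper.

```python
import random


def coupled_distance(x0, y0, steps, a_max, seed=0):
    """Apply one shared sequence of random affine maps x -> a*x + b to two
    starting states and return their final separation |x - y|."""
    rng = random.Random(seed)
    x, y = x0, y0
    for _ in range(steps):
        a = rng.uniform(0.0, a_max)  # random slope of this step's map
        b = rng.gauss(0.0, 1.0)      # random offset of this step's map
        # The same map hits both states, so their gap is multiplied by a.
        x = a * x + b
        y = a * y + b
    return abs(x - y)


if __name__ == "__main__":
    print("contracting:", coupled_distance(0.0, 10.0, steps=200, a_max=0.9))
    print("expanding:  ", coupled_distance(0.0, 10.0, steps=200, a_max=4.0))
```

In this toy model, what persists under contraction is the attractor (the long-run distribution of x), not the initial state; whether anything like this holds for real self-modifying goal systems is exactly what the paper leaves open.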


Nick Tarleton said...

To put it as simply as possible (and with the explicit admission that we are now hand-waving even more furtively), the question is whether we anticipate the system getting into situations where solutions involving abandoning its concrete objectives will appear deceptively good. If so, then the biasing involved in G_A is probably useful (if it can, in fact, be achieved for the concrete objective A in question).

If we anticipate that, the system can as well, and can, e.g., be more wary of whatever class of situations carries this risk. I find it hard to imagine that adding a new top-level goal in advance could be the best choice here, unless the AI is and will remain dumber than human or badly biased (in which case you have much bigger problems). In the unlikely event a new supergoal really is the best way, the AI can always patch itself.

On the other hand, if we anticipate the system getting into situations where solutions involving personally abandoning its concrete objectives will appear deceptively bad as ways of actually achieving its concrete objectives, then the biasing involved in G_A is actually counterproductive.

Even in non-deceptive situations, adding any goal to be traded off against Helpfulness will be counterproductive, as it necessarily results in lower expected Helpfulness.

It seems commonsensical that, in all but pathological cases, the best way to maximize G as an outcome is going to be to maintain G as a goal. But logically, this is not *always* going to be the case.

And when that's not the case, why should it maintain G?

Nick Tarleton said...

I said: Even in non-deceptive situations, adding any goal to be traded off against Helpfulness will be counterproductive, as it necessarily results in lower expected Helpfulness.

Well, <=, but in practice surely at least slightly <, because more resources will be devoted to defending the goal system.
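The "<=" point can be checked with a tiny example. Assume, hypothetically, that the system chooses among a fixed set of options, each scored by its Helpfulness and by how much it serves an added goal (called "defense" here, after the resources devoted to defending the goal system). Maximizing a weighted sum of the two can never select an option with more Helpfulness than maximizing Helpfulness alone, and when the goals pull apart it selects one with strictly less. The option scores, weights and the `choose` function are made up for illustration.

```python
# Hypothetical options: (helpfulness, defense-of-goal-system) scores.
OPTIONS = [(10, 0), (9, 3), (7, 8)]


def choose(options, weight_h):
    """Pick the option maximizing weight_h * helpfulness + (1 - weight_h) * defense."""
    return max(options, key=lambda o: weight_h * o[0] + (1 - weight_h) * o[1])


if __name__ == "__main__":
    best_h = max(h for h, _ in OPTIONS)
    for w in (1.0, 0.9, 0.7):
        h, d = choose(OPTIONS, w)
        print(f"weight_h={w}: helpfulness {h} (max possible {best_h})")
```

With weight_h = 1.0 or 0.9 the choice keeps Helpfulness 10; with weight_h = 0.7 the added goal wins out and Helpfulness drops to 7.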

Mike said...

Iterated randomness around a goal makes me think of chaotic attractors. Can the goal then be considered not a discrete rule, but a general tendency? Would fixed rules ever be strong enough to withstand the stress of self-modifying systems, or would they grow increasingly brittle and prone to total failure?

Lost-In-Symbols said...

A pushout in Category Theory?
What of Piaget's suggestion that the top goal is simply growth in all dimensions? Growth would be suitably defined as maximizing the potential functions (to avoid, for example, fatness as a goal).

BTW, I enjoy the expansiveness you bring to these topics and the moral questions you ask about how best to apply your talents and technologies. But then, back to your religion post... why optimize? Sure, evolutionary imperative... but (for me at least) there's something more there, something that is entirely the mind's own invention and its own joy.