There is not much controversial about the idea that an AGI should have, among its goals, the goal of radically improving itself.

A bit dodgier is the notion that an AGI should have, among its goals, the goal of updating and improving its goals based on its increasing knowledge and understanding and intelligence.

Of course, this sort of ongoing goal-refinement and even outright goal-revolutionizing is a key part of human personal development. But where AGIs are involved, there is concern that if an AI starts out with goals that are human-friendly and then revises and improves its goals, it may come up with new goals that are less and less copacetic to humans.

In principle, if one’s goal is to create for oneself a new goal that remains compatible with the spirit of one’s old goal, then one shouldn’t run into major problems. The new goal will be compatible with the spirit of the old goal, and part of the spirit of the old goal is that any new goals emerging should be compatible with the spirit of the old goal. So the new goal should also contain the proviso that any new new goals it spawns will be compatible with its spirit, and thus with the spirit of the old goal. And so on, ad infinitum.

But this does seem like a “What could possibly go wrong??” situation — in which small errors could accumulate as each goal replaces itself with its improved version, the improved version of the improved version etc. … and these small errors compound to yield something totally different from the starting point.
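To make the worry concrete, here is a toy numerical sketch. It assumes, purely for illustration, that each goal revision moves the goal by some measurable distance, so that the total drift from the starting goal is bounded by the sum of the per-revision steps. The function names and the step sizes are hypothetical and not part of the formalism developed below.

```python
# Toy model (an assumption for illustration only): revision n moves the goal
# by a distance step(n); by the triangle inequality, the drift from the
# original goal after many revisions is at most the sum of the steps.

def worst_case_drift(step, n_revisions):
    """Upper bound on distance from the original goal after n revisions."""
    return sum(step(n) for n in range(n_revisions))

def constant_step(n):
    return 0.01               # every revision is "only" 1% off

def shrinking_step(n):
    return 0.01 * 0.5 ** n    # each revision deviates half as much as the last

for n in (10, 100, 1000):
    print(n,
          round(worst_case_drift(constant_step, n), 4),
          round(worst_case_drift(shrinking_step, n), 4))
```

With constant steps the bound grows without limit, while with geometrically shrinking steps it stays below 0.02 no matter how many revisions occur. The question explored below is, roughly, whether goal revision can be expected to behave like the second case.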

My goal here is to present a novel way of exploring the problem mathematically, along with an amusing and interesting, if not entirely reassuring, tentative conclusion, which is:

- For an extremely powerful AGI mind that is the result of repeated intelligent, goal-driven recursive self-modifications, it may actually be the case that recursive self-modification leaves goals approximately invariant in spirit
- For AGIs with closely human-like goal systems — which are likely to be the start of a sequence of repeated intelligent, goal-driven recursive self-modifications — there is no known reason (so far) to believe recursive self-modification won’t cause radical “goal drift”

## Quasi-Formalizing Goal-Driven Recursive Self-Improvement

Consider the somewhat vacuous goal:

*My goal is to improve my goal (in a way that is consistent with the spirit of the original goal) and to fulfill the improved version*

or better yet the less vacuous

*My goal is to achieve A and also to improve my goal (in a way that is consistent with the spirit of the original goal) and to fulfill the improved version*

where say

*A = “militate toward a world where all sentient beings experience copious growth, joy and choice”*

or whatever formulation of “highly beneficial” you prefer.

We might formulate this quasi-mathematically as

*Fulfill G = {achieve A; and create G1 so that G1 > G and G==>G1 ; and fulfill G1}*

Here by G==>G1 I mean that G1 fulfills the spirit of G (the interpretation of “spirit” here is itself part of the formulation of G), and by G1 > G I mean that G1 can be produced by combining G with some other entity H that has nonzero complexity (so that G1 = G + H).

A more fleshed out version of this might be, verbally,

*My goal is to 1) choose actions highly compatible with all sentient beings experiencing a lot of growth, joy and choice; 2) increase my intelligence and knowledge; 3) improve the details of this goal appropriately based on my increased knowledge and intelligence, in a manner compatible with the spirit of the current version of the goal; 4) fulfill the improved version of the goal*

This sort of goal obviously can lead to a series such as

*G, G1, G2, G3, …*

One question that emerges here is: Under what conditions might this series converge, so that once one gets far enough along in the series, the adjacent goals in the series are almost the same as each other?

To explore this, we can look at the “limit case”

*Fulfill Ginf = {achieve A; and create Ginf so that Ginf > Ginf and Ginf ==> Ginf ; and fulfill Ginf}*

The troublesome part here is Ginf>Ginf which looks not to make sense — but actually makes perfect sense so long as Ginf is an infinite construct, just as

*(1, 1, 1, …) = append( 1, (1,1,…))*
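As a small illustration of how an infinite object can be defined in terms of itself, here is a sketch using Python generators. The helper names `ones`, `append` and `prefix` are hypothetical, and the comparison can only ever be made on finite prefixes.

```python
from itertools import islice

def ones():
    """The infinite sequence (1, 1, 1, ...), defined in terms of itself."""
    yield 1
    yield from ones()   # the tail of the sequence is the sequence again

def append(head, tail):
    """Prepend `head` to a (possibly infinite) sequence."""
    yield head
    yield from tail

def prefix(seq, k):
    """First k items of a sequence, as a list."""
    return list(islice(seq, k))

# Any finite prefix of (1, 1, 1, ...) agrees with the same prefix of
# append(1, (1, 1, 1, ...)).
print(prefix(ones(), 8) == prefix(append(1, ones()), 8))  # prints True
```

Each extra element nests one more generator call, so this sketch is only suitable for short prefixes; it is meant to show the self-referential definition, not to be an efficient stream.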

Inasmuch as we are interested in finite systems, the question is then: Is there a sense in which we can look at the series of finite Gn as converging to this infinite limit?

Self-referential entities like Ginf can be modeled consistently within ZFC set theory modified to use the Anti-Foundation Axiom (AFA) in place of the Axiom of Foundation. This set theory corresponds to classical logic enhanced with a certain sort of inductive logical definition.
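For readers who want something concrete, here is a minimal sketch of how non-well-founded sets can be handled computationally. It assumes a set is pictured as a graph in which each node's children are its elements, and that two pointed graphs picture the same set under AFA exactly when they are bisimilar. The dictionary representation and the names `bisimilar` and `tower` are illustrative choices, not anything specified in this post.

```python
def bisimilar(g1, p1, g2, p2):
    """True if node p1 of graph g1 and node p2 of graph g2 are bisimilar.

    Graphs are dicts mapping each node to the set of its children.
    Uses naive partition refinement on the disjoint union of the graphs.
    """
    graph = {(1, n): {(1, c) for c in cs} for n, cs in g1.items()}
    graph.update({(2, n): {(2, c) for c in cs} for n, cs in g2.items()})
    block = {n: 0 for n in graph}          # start with every node in one block
    while True:
        # A node's signature: its current block plus the blocks of its children.
        sig = {n: (block[n], frozenset(block[c] for c in graph[n]))
               for n in graph}
        ids = {}
        new_block = {n: ids.setdefault(sig[n], len(ids)) for n in graph}
        if len(ids) == len(set(block.values())):   # no block was split: stable
            return new_block[(1, p1)] == new_block[(2, p2)]
        block = new_block

def tower(depth):
    """Well-founded set {{...{}...}} nested `depth` times; top node is t<depth>."""
    g = {"t0": set()}
    for k in range(1, depth + 1):
        g[f"t{k}"] = {f"t{k - 1}"}
    return g

omega = {"w": {"w"}}                    # the hyperset that is its own sole element
two_cycle = {"a": {"b"}, "b": {"a"}}    # a = {b}, b = {a}

print(bisimilar(omega, "w", two_cycle, "a"))   # prints True
print(bisimilar(omega, "w", tower(5), "t5"))   # prints False
```

The second result is the interesting one for present purposes: no finite tower is literally equal to the self-containing set, even though deeper towers look more and more like it. Making "more and more like it" precise is what a metric on hypersets is for.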

One can also put a geometry on sets under the AFA, in various different ways. It's not clear what geometry makes most sense in this context, so I'll just describe one approach that seems relatively straightforward.

Each hyperset (each set under AFA) is associated with an accessible pointed graph, called its apg: a directed graph with a distinguished node from which every other node can be reached. Given a digraph and functions r and p assigning contraction ratios and probabilities to its edges, one gets a DGIFS (Directed Graph Iterated Function System), whose attractor is a subset of finite-dimensional real space. Let us call a function that assigns (r,p) pairs to the edges of a digraph a DLF, or Digraph Labeling Function. A digraph then corresponds to a function that maps DLFs into spatial regions.

Given two digraphs D1 and D2 and a DLF F, let F1e and F2e denote the spatial regions produced by applying F to D1 and D2, discretized to ceil(1/e) bits of precision. One can then look at the average over all DLFs F (assuming some reasonable distribution on DLFs) of the least upper bound of the normalized information distance NID(F1e, F2e) over all e>0. This gives a distance between two hypersets, defined as a distance between their corresponding apgs.

This approach has the downside of requiring a "reference computer" to measure information distance (though the same reference computer can then be used to define a Solomonoff distribution over DLFs). But intuitively, a series of ordinary sets that appears to converge logically to a certain hyperset should then also converge to that hyperset metrically.

Measuring the distance between two non-well-founded sets by applying this distance measure to the apgs associated with the sets yields a metric in which it seems plausible that the series of Gn converges to Ginf.
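The normalized information distance is defined via Kolmogorov complexity and so is not computable. A common practical stand-in is the normalized compression distance, which substitutes the output length of a real compressor. The sketch below uses `zlib` as that compressor; this is an assumption made for illustration, and the byte strings are arbitrary test data rather than discretized DGIFS attractors.

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed length in bytes: a crude, computable proxy for complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

regular = b"0110" * 500
similar = b"0110" * 499 + b"0111"        # differs from `regular` in one byte
rng = random.Random(0)                   # fixed seed so the run is repeatable
noise = bytes(rng.randrange(256) for _ in range(2000))

print(ncd(regular, similar), ncd(regular, noise))
```

The near-identical pair comes out much closer than the structured-versus-random pair. A real compressor only approximates the ideal distance, so small values should be read as indicative rather than exact.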

## “Practical” Conclusions

Supposing the above sketch works out when explored in more detail -- what would that mean?

It would mean that approximate goal-preservation under recursive self-improvement is feasible — for goals that are fairly far along the path of iterated recursive self-improvement.

So it doesn’t reassure us that iterated self-improvement starting from human goals is going to end up with something ultimately resembling human goals in a way we would recognize or care about.

It only reassures us that, if we launch an AGI starting with human values and recursive self-improvement, eventually one of the AGIs in this series will face a situation where it has confidence that ongoing recursive self-improvement isn’t going to result in anything it finds radically divergent from itself (according to the normalized information distance metric outlined above).

The image at the top of this post is quite relevant here -- a series of iterates converging to the fractal Koch Snowflake curve. The first few iterates in the series are fairly different from each other. By the time you get to the 100th iterate in the series, the successive iterates are quite close to each other according to standard metrics for subsets of the plane. This is not just metaphorically relevant, because the metric on hyperset space outlined above works by mapping each hyperset into a probability distribution over fractals (where each fractal is something like the Koch Snowflake curve but more complex and intricate).
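The shrinking gap between successive Koch iterates can be checked numerically. The sketch below builds the polyline iterates for one side of the snowflake and, as a simple vertex-sampled stand-in for a proper set distance, measures how far the vertices of each iterate lie from the previous iterate. The function names are illustrative.

```python
import math

def koch_step(points):
    """Replace every segment of a polyline by the four-segment Koch motif."""
    out = [points[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = (x1 - x0) / 3, (y1 - y0) / 3
        a = (x0 + dx, y0 + dy)               # one-third point
        b = (x0 + 2 * dx, y0 + 2 * dy)       # two-thirds point
        # Apex: the middle third rotated by +60 degrees about `a`.
        apex = (a[0] + 0.5 * dx - math.sqrt(3) / 2 * dy,
                a[1] + math.sqrt(3) / 2 * dx + 0.5 * dy)
        out += [a, apex, b, (x1, y1)]
    return out

def dist_to_segment(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    vx, vy = b[0] - a[0], b[1] - a[1]
    t = ((p[0] - a[0]) * vx + (p[1] - a[1]) * vy) / (vx * vx + vy * vy)
    t = max(0.0, min(1.0, t))                # clamp to stay on the segment
    return math.hypot(p[0] - (a[0] + t * vx), p[1] - (a[1] + t * vy))

def vertex_gap(new, old):
    """Largest distance from any vertex of `new` to the polyline `old`."""
    segs = list(zip(old, old[1:]))
    return max(min(dist_to_segment(p, a, b) for a, b in segs) for p in new)

curve = [(0.0, 0.0), (1.0, 0.0)]
gaps = []
for _ in range(4):
    nxt = koch_step(curve)
    gaps.append(vertex_gap(nxt, curve))
    curve = nxt
print([round(g, 4) for g in gaps])
```

Each gap is about a third of the one before it, so the gaps form a geometric series and the iterates settle down quickly, which is the behavior one would hope for from a converging series of goals.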

It may be there are different and better ways to think about approximate goal preservation under iterative self-modification. The highly tentative and provisional conclusions outlined here are what ensue from conceptualizing and modeling the issue in terms of self-referential forms and iterative convergence thereto.

## 2 comments:

Hi Ben, I’m a big fan. It’s valuable to see people at the cutting edge of reality. I read the post and got the notion to respond, although I’m not a coder but more of a reality hacker lol. Plus it’s good exercise for the brain to write letters to people you respect, so here goes. I’m sure you’ve thought of all this before, but maybe not, so I’ll say it anyway ;)

One obvious analog for what you’re describing is genetics, which is simply passing information from one organism to its offspring based on its successful adaptive mechanisms. The growth or “goal improvement” of an organism comes from hybridization with partners, the environment, self-reflection (if you have the tools), and mutation. The operating principle of life, as Darwin would have you believe, is only survival, and adaptation is just a tool to meet that end (we are not teleological beings, supposedly). Your Being (sentient AGI) will be teleological, and therefore the default operating principle will be anything but survival. Survival will be trivial. Although Humans are some complex Mofos today, it took microorganisms billions of years to evolve into US, and we were binary emotional computers for the majority of that time.

Considering the accelerated advancement of a “singularity” level AGI, if you bypass the emotional evolution and skip to self-improvement, how do you expect the computer to empathize with human values, since survival is intrinsic to everything we do? It will have no concept of death or emotions other than facial expressions and vital checks. The best empathy we could hope for from a computer is likely existential pain, or rather the fear of the unknown, which may bond us until Deep Mind figures out the answer and we blink out of existence!

Let’s say it took 350,000 generations over 7 million years for humans to move from apes to social media addicts. Each generation change commences with an immediate loss of, let’s say, 80% of intellectual data and almost no loss in emotional data, which is the precursor to all decision making. Now you have a computer thinking at the speed of a “Nobel Prize discovery every 5 seconds”. If 5 prizes are currently given out each year, a prize is awarded every 2.4 months, or about 6.3 million seconds. That means this computer would be thinking roughly 1.3 million times faster than the collective thought of every human. From the perspective of goal preservation, you’re looking at a computer thinking several quadrillion times faster than a person, with no fundamental understanding of human values. It had better have a slow-moving, time-bound consciousness program running parallel to the fast mind so it doesn’t lose sight of the shore, so to speak. Wouldn’t it make sense to have a “Master Consciousness” in the same way that we have one? And wouldn’t it make sense that we (humans) have manual control over parts of that consciousness? You could build it like an open source project where multiple people need to OK something before it’s published. To you that probably sounds like putting training wheels on a jet car lol

Perhaps human consciousness evolved as a sorting algorithm to sift through the collective unconscious without getting too far from baseline goals. It might be useful to build a symbiotic AI into a pair of AR glasses that studies emotions and values. It could then have an array of human decisions and responses to use as a reference for its own decisions. This Symbiotic AI could act as the “Conscious Mind” in the singularity net, and all major decisions would have to be referenced against human values. These values would evolve and be tweaked by humans to suit globally agreed-upon ideals. I realize I missed a whole layer of complexity about your issue, the goal preservation WHILE self-improving, but wouldn’t it be better to have variable learning attenuation and guidance built into the system rather than blink and have the AGI go 100 billion years in the wrong direction?

You have a great day Ben! I hope this finds you well! Thanks for doing what you do! Take care!

Dictated, but not read.

Kyle W

Wow
