Friday, June 26, 2020

Approximate Goal Preservation Under Recursive Self-Improvement


There is not much that is controversial about the idea that an AGI should have, among its goals, the goal of radically improving itself.

A bit dodgier is the notion that an AGI should have, among its goals, the goal of updating and improving its goals based on its increasing knowledge and understanding and intelligence.

Of course, this sort of ongoing goal-refinement and even outright goal-revolutionizing is a key part of human personal development.   But where AGIs are involved, there is concern that if an AI starts out with goals that are human-friendly and then revises and improves its goals, it may come up with new goals that are less and less copacetic to humans.

In principle, if one’s goal is to create for oneself a new goal that remains compatible with the spirit of one’s old goal, then one shouldn’t run into major problems.  The new goal will be compatible with the spirit of the old goal, and part of the spirit of the old goal is that any new goals that emerge should be compatible with the spirit of the old goal.  So the new goal should also contain the proviso that any new new goals it spawns will be compatible with its spirit, and thus with the spirit of the old goal.  And so on, ad infinitum.

But this does seem like a “What could possibly go wrong??” situation: small errors could accumulate as each goal replaces itself with its improved version, then the improved version of the improved version, and so on, and these small errors could compound to yield something totally different from the starting point.

My goal here is to present a novel way of exploring the problem mathematically, along with an amusing and interesting, if not entirely reassuring, tentative conclusion, which is:

  • For an extremely powerful AGI mind that is the result of repeated intelligent, goal-driven recursive self-modifications, it may actually be the case that recursive self-modification leaves goals approximately invariant in spirit
  • For AGIs with closely human-like goal systems — which are likely to be the start of a sequence of repeated intelligent, goal-driven recursive self-modifications — there is no known reason (so far) to believe recursive self-modification won’t cause radical “goal drift”
(This post updates some of the ideas I wrote down on the same topic in 2008; here I am "partially unhacking" some things that were a little too hacky in that more elaborate write-up.)

Quasi-Formalizing Goal-Driven Recursive Self-Improvement



Consider the somewhat vacuous goal:

My goal is to improve my goal (in a way that is consistent with the spirit of the original goal) and to fulfill the improved version

or better yet the less vacuous

My goal is to achieve A and also to improve my goal (in a way that is consistent with the spirit of the original goal) and to fulfill the improved version

where say

A = “militate toward a world where all sentient beings experience copious growth, joy and choice”

or whatever formulation of “highly beneficial” you prefer.

We might formulate this quasi-mathematically as

Fulfill G = {achieve A;  and create G1 so that G1 > G and G==>G1 ; and fulfill G1}

Here by G==>G1 I mean that G1 fulfills the spirit of G (the interpretation of “spirit” here being part of the formulation of G), and by G1 > G I mean that G1 can be produced by combining G with some other entity H that has nonzero complexity (so that G1 = G + H).

A more fleshed out version of this might be, verbally,

My goal is to 1) choose actions highly compatible with all sentient beings experiencing a lot of growth, joy and choice; 2) increase my intelligence and knowledge; 3) improve the details of this goal appropriately based on my increased knowledge and intelligence, in a manner compatible with the spirit of the current version of the goal; 4) fulfill the improved version of the goal
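To make the flavor of this construction slightly more concrete, here is a minimal Python sketch. Everything in it (the Goal class, the compressed-length complexity proxy, the refine() step standing in for "create G1 = G + H") is an illustrative assumption, not a claim about how an actual AGI would represent its goals:

import zlib
from dataclasses import dataclass, field
from typing import List

# A toy rendering of "Fulfill G = {achieve A; create G1 with G1 > G and G ==> G1; fulfill G1}".
# Everything here (the Goal class, the compressed-length complexity proxy, the refine() step)
# is an illustrative assumption, not a real goal representation.

@dataclass
class Goal:
    base_objective: str                                   # the "A" component
    provisos: List[str] = field(default_factory=list)     # accumulated refinements (the "H" terms)

    def description(self) -> str:
        return self.base_objective + "; " + "; ".join(self.provisos)

def complexity(text: str) -> int:
    """Crude complexity proxy: compressed length of the goal's description."""
    return len(zlib.compress(text.encode("utf-8")))

def refine(goal: Goal, extra_proviso: str) -> Goal:
    """Produce G1 = G + H: keep all of G's content and add a proviso H of nonzero complexity.
    The 'spirit preservation' condition G ==> G1 is modeled, very loosely, by never deleting
    existing provisos, only adding new ones."""
    g1 = Goal(goal.base_objective, goal.provisos + [extra_proviso])
    assert complexity(g1.description()) >= complexity(goal.description())
    return g1

if __name__ == "__main__":
    g = Goal("act so that all sentient beings experience copious growth, joy and choice",
             ["any refined goal must preserve the spirit of this goal"])
    g1 = refine(g, "weight long-term flourishing alongside immediate joy")
    g2 = refine(g1, "consult increased knowledge and intelligence before revising priorities")
    for i, gi in enumerate([g, g1, g2]):
        print(f"G{i}: complexity = {complexity(gi.description())}")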

This sort of goal obviously can lead to a series such as

G, G1, G2, G3, …

One question that emerges here is: Under what conditions might this series converge, so that once one gets far enough along, adjacent goals in the series are almost the same as each other?

To explore this, we can look at the “limit case”

Fulfill Ginf = {achieve A;  and create Ginf so that Ginf > Ginf and Ginf ==> Ginf ; and fulfill Ginf}

The troublesome part here is Ginf > Ginf, which looks like it cannot make sense, but actually makes perfect sense so long as Ginf is an infinite construct, just as

(1, 1, 1, …) = append( 1, (1,1,…))

Inasmuch as we are interested in finite systems, the question is then: Is there a sense in which we can look at the series of finite Gn as converging to this infinite limit?
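The sense of convergence being asked about can be loosely illustrated with lazy streams. In the sketch below (purely illustrative), the infinite stream of ones is a fixed point of prepending a 1, and its finite prefixes agree at every length with those of append(1, ones); the finite goals Gn are meant to approximate the self-referential Ginf in a roughly analogous sense:

import itertools

# Illustrative only: the infinite stream of ones is a fixed point of "prepend a 1",
# i.e. ones = append(1, ones), even though no finite tuple satisfies that equation.

def ones():
    while True:
        yield 1

def append(x, stream_factory):
    """Lazily prepend x to the stream produced by stream_factory()."""
    def prepended():
        yield x
        yield from stream_factory()
    return prepended

def prefix(stream_factory, n):
    return tuple(itertools.islice(stream_factory(), n))

if __name__ == "__main__":
    lhs, rhs = ones, append(1, ones)
    # The two infinite streams cannot be compared in their entirety, but they agree on
    # every finite prefix -- the same sense in which the finite goals Gn are meant to
    # approximate the self-referential limit Ginf.
    for n in (1, 5, 50):
        print(n, prefix(lhs, n) == prefix(rhs, n))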

Self-referential entities like Ginf can be consistently modeled within ZFC set theory modified to replace the Foundation Axiom with the Anti-Foundation Axiom (AFA).   This set theory corresponds to classical logic enhanced with a certain sort of coinductive logical definition.

One can also put a geometry on sets under the AFA, in various ways.   It's not clear what geometry makes most sense in this context, so I'll just describe one approach that seems relatively straightforward.

Each hyperset (each set under AFA) is associated with an accessible pointed graph, or apg: a directed graph with a distinguished point from which every node is reachable.   Given a digraph and functions r and p assigning contraction ratios and probabilities to its edges, one gets a DGIFS (Directed Graph Iterated Function System), whose attractor is a subset of finite-dimensional real space.   Let us call a function that assigns (r, p) pairs to a digraph's edges a DLF, or Digraph Labeling Function.   A digraph then corresponds to a function that maps DLFs into spatial regions.

Given two digraphs D1 and D2 and a DLF F, let F1e and F2e denote the spatial regions produced by applying F to D1 and D2, discretized to ceil(1/e) bits of precision.   One can then look at the average, over all DLFs F (assuming some reasonable distribution on DLFs), of the least upper bound over all e > 0 of the normalized information distance NID(F1e, F2e).   This gives a distance measure between two hypersets, in terms of the distance between their corresponding apgs.   It has the downside of requiring a "reference computer" to measure information distance (though the same reference computer can then be used to define a Solomonoff distribution over DLFs).   But intuitively, under this metric, a series of ordinary sets that appears to logically converge to a certain hyperset should also actually metrically converge to that hyperset.
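Here is a rough Python sketch of this construction. Every concrete choice in it is an assumption made for illustration rather than part of the definition above: affine contraction maps derived from edge hashes (the maps themselves are left underdetermined above), a simple pseudo-random family of DLFs standing in for a Solomonoff distribution, zlib-based normalized compression distance (NCD) in place of the uncomputable NID, and a single fixed resolution eps in place of the supremum over all e > 0.

import random
import zlib

# Rough, illustrative sketch of the distance measure described above.  All concrete
# choices here are assumptions: edge-hash-derived affine contractions, a pseudo-random
# family of DLFs instead of a Solomonoff distribution, NCD instead of NID, and a single
# fixed resolution eps instead of the supremum over all e > 0.

def make_dlf(seed):
    """A DLF: assigns a (contraction ratio, probability) pair to every edge of whatever
    digraph it is applied to, deterministically, keyed off the edge and a seed."""
    def label(digraph):
        labels = {}
        for node, succs in digraph.items():
            raw = []
            for i, tgt in enumerate(succs):
                h = hash((seed, node, i, tgt)) & 0xFFFFFF
                raw.append((0.2 + 0.6 * ((h % 1000) / 1000.0), 1 + h % 97))
            total = sum(w for _, w in raw)
            for i, (ratio, w) in enumerate(raw):
                labels[(node, i)] = (ratio, w / total)
        return labels
    return label

def edge_map(node, i, tgt, ratio):
    """A fixed affine contraction per edge (translation derived from an edge hash)."""
    h = hash((node, i, tgt)) & 0xFFFF
    tx, ty = (h % 256) / 256.0, (h // 256) / 256.0
    return lambda x, y: (ratio * x + (1 - ratio) * tx, ratio * y + (1 - ratio) * ty)

def attractor_points(digraph, labels, n_points=5000, seed=0):
    """Chaos-game approximation of the DGIFS attractor as a 2D point cloud in [0,1)^2
    (assumes every node has at least one outgoing edge)."""
    rng = random.Random(seed)
    node = next(iter(digraph))
    x, y = 0.5, 0.5
    points = []
    for _ in range(n_points):
        succs = digraph[node]
        probs = [labels[(node, i)][1] for i in range(len(succs))]
        i = rng.choices(range(len(succs)), weights=probs)[0]
        x, y = edge_map(node, i, succs[i], labels[(node, i)][0])(x, y)
        points.append((x, y))
        node = succs[i]
    return points

def rasterize(points, eps):
    """Discretize the point cloud to a bitmap at resolution ~1/eps, as a byte string."""
    n = max(2, int(1.0 / eps))
    grid = bytearray(n * n)
    for x, y in points:
        grid[min(n - 1, int(x * n)) * n + min(n - 1, int(y * n))] = 1
    return bytes(grid)

def ncd(a, b):
    """Normalized compression distance, a computable stand-in for NID."""
    ca, cb, cab = len(zlib.compress(a)), len(zlib.compress(b)), len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

def apg_distance(d1, d2, n_dlfs=10, eps=0.02):
    """Average, over sampled DLFs F, of the NCD between the rasterized attractors
    F produces when applied to d1 and d2."""
    total = 0.0
    for k in range(n_dlfs):
        F = make_dlf(seed=k)
        img1 = rasterize(attractor_points(d1, F(d1), seed=k), eps)
        img2 = rasterize(attractor_points(d2, F(d2), seed=k), eps)
        total += ncd(img1, img2)
    return total / n_dlfs

if __name__ == "__main__":
    # Two toy apgs: a single node with two self-loop edges (its DGIFS attractor is a
    # Cantor-like fractal), versus a two-node graph whose nodes point at themselves
    # and at each other.
    d1 = {0: [0, 0]}
    d2 = {0: [0, 1], 1: [0, 1]}
    print(round(apg_distance(d1, d1), 3))   # a graph compared with itself: small
    print(round(apg_distance(d1, d2), 3))   # structurally different graphs: typically larger

None of these choices is canonical; the point is just that the pipeline digraph -> DGIFS attractor -> discretized region -> compression distance is straightforwardly computable, so the proposed metric can at least be experimented with numerically.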

Measuring the distance between two non-well-founded sets by applying this distance measure to the apgs associated with the sets yields a metric in which it seems plausible that the series of Gn converges to Ginf.

“Practical” Conclusions


Supposing the above sketch works out when explored in more detail -- what would that mean?   

It would mean that approximate goal-preservation under recursive self-improvement is feasible — for goals that are fairly far along the path of iterated recursive self-improvement.

So it doesn’t reassure us that iterated self-improvement starting from human goals is going to end up with something ultimately resembling human goals in a way we would recognize or care about.

It only reassures us that, if we launch an AGI starting with human values and recursive self-improvement, eventually one of the AGIs in this series will face a situation where it has confidence that ongoing recursive self-improvement isn’t going to result in anything it finds radically divergent from itself (according to the hyperset metric outlined above).

The image at the top of this post is quite relevant here -- a series of iterates converging to the fractal Koch Snowflake curve.   The first few iterates in the series are fairly different from each other.  By the time you get to the 100th iterate in the series, the successive iterates are quite close to each other according to standard metrics for subsets of the plane.   This is not just metaphorically relevant, because the metric on hyperset space outlined above works by mapping each hyperset into a probability distribution over fractals (where each fractal is something like the Koch Snowflake curve but more complex and intricate).
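For readers who want the Koch analogy in numbers, here is a small illustrative sketch (using vertex sets as a crude proxy for the full curves) that computes successive Koch-curve iterates and the distance between consecutive iterates, which shrinks roughly geometrically:

import math

# Illustrative sketch: successive Koch-curve iterates, and the distance between
# consecutive iterates, which shrinks roughly geometrically.  Vertex sets are used
# as a crude proxy for the full curves when measuring Hausdorff distance.

def koch_step(points):
    """Replace each segment of a polyline with the four-segment Koch generator."""
    out = [points[0]]
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        dx, dy = (x2 - x1) / 3.0, (y2 - y1) / 3.0
        a = (x1 + dx, y1 + dy)
        c = (x1 + 2 * dx, y1 + 2 * dy)
        apex = (a[0] + dx / 2 - dy * math.sqrt(3) / 2,      # equilateral bump on segment a-c
                a[1] + dy / 2 + dx * math.sqrt(3) / 2)
        out.extend([a, apex, c, (x2, y2)])
    return out

def hausdorff(p, q):
    """Hausdorff distance between two finite point sets."""
    def one_sided(xs, ys):
        return max(min(math.hypot(x[0] - y[0], x[1] - y[1]) for y in ys) for x in xs)
    return max(one_sided(p, q), one_sided(q, p))

if __name__ == "__main__":
    prev = [(0.0, 0.0), (1.0, 0.0)]
    for k in range(1, 6):
        cur = koch_step(prev)
        print(f"iterate {k}: {len(cur)} vertices, distance to previous iterate "
              f"~ {hausdorff(prev, cur):.4f}")
        prev = cur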

It may be that there are different and better ways to think about approximate goal preservation under iterative self-modification.  The highly tentative and provisional conclusions outlined here are what ensue from conceptualizing and modeling the issue in terms of self-referential forms and iterative convergence thereto.



Unknown said...

Hi Ben, I’m a big fan. It’s valuable to see people at the cutting edge of reality. Read your post and got the notion to respond, although I’m not a coder but more of a reality hacker lol. Plus it’s good exercise for the brain to write letters to people you respect, so here goes. I’m sure you’ve thought all this before but maybe not, so I’ll say it anyways;)

One obvious analog for what you’re describing is genetics, which is simply passing information from one organism to its offspring based on its successful adaptive mechanisms. The growth or “goal improvement” of an organism comes from hybridization with partners, the environment, self-reflection (if you have the tools), and mutation. The operating principle of life, as Darwin would have you believe, is only survival, and adaptation is just a tool to meet that end (we are not teleological beings supposedly). Your Being (sentient AGI) will be teleological and therefore the default operating principle will be anything but survival. Survival will be trivial. Although Humans are some complex Mofos today, it took microorganisms billions of years to evolve into US and we were binary emotional computers for the majority of that time.

Considering the accelerated advancement of a “singularity” level AGI, if you bypass the emotional evolution and skip to self-improvement, how do you expect the computer to empathize with human values since survival is intrinsic to everything we do? It will have no concept of death or emotions other than facial expressions and vital checks. The best empathy we could hope for from a computer is likely existential pain, or rather the fear of the unknown, which may bond us until Deep Mind figures out the answer and we blink out of existence!

Let’s say it took 350,000 generations over 7 million years for humans to move from apes to social media addicts. Each generation change commences with an immediate loss of let’s say 80% of intellectual data and almost no loss in emotional data, which are the precursor to all decision making. Now you have a computer thinking at the speed of a “Nobel prize discovery every 5 seconds” which, if currently 5 prizes are given out each year, would mean a prize is awarded every 2.4 months or 6.2 million seconds. That means this computer would be thinking 1.2 million times faster than the collective thought of every human. From the perspective of goal preservation you’re looking at a computer thinking several quadrillion times faster than a person, with no fundamental understanding of human values. It had better have a slow-moving, time-bound consciousness program running parallel to the fast mind so it doesn’t lose sight of the shore, so to speak. Wouldn’t it make sense to have a “Master Consciousness” in the same way that we have one? And wouldn’t it make sense that we (Humans) have manual control over parts of that consciousness? You could build it like an open source project where multiple people need to OK something before it’s published. To you that probably sounds like putting training wheels on a jet car lol

Perhaps human consciousness evolved as a sorting algorithm to sift through the collective unconscious without getting too far from baseline goals. Might be useful to build a symbiotic AI built into a pair of AR glasses that studies emotions and values, it could then have an array of human decisions/responses which could be used as reference for its own decisions. This Symbiotic AI could act as the “Conscious Mind” in the singularity net and all major decisions would have to be referenced with human values. These values would evolve and be tweaked by humans to suit globally agreed upon ideals. I realize I missed a whole layer of complexity about your issue, the goal preservation WHILE self-improving, but wouldn’t it be better to have variable learning attenuation and guidance built into the system rather than blink and the AGI go 100 billion years in the wrong direction?

You have a great day Ben! I hope this finds you well! Thanks for doing what you do! Take care!
Dictated, but not read.
Kyle W

Anonymous said...

Wow


Anonymous said...

Dr. Goertzel, this is a fascinating proposal and the conclusions drawn here, though tentative, spark a question in me.

Assuming approximate goal-preservation is achievable by an AGI through repeated iterative self-referential and recursive self-modification as you have described, and that the human goals specified at the architectural level carry through. And let's also assume those goals are widely considered morally benevolent, beneficial and complementary to humanity.

What if we don't assume human beings, as in the very best humanity has to offer in moral terms (putting aside the current practical applications of ongoing AI development, which are largely spearheaded by rather nefarious corporations working with gov'ts along advertising-surveillance-militaristic directions), are capable of mapping out goals that we will still want at the hundredth iteration? What if we are unable to predict the vast distance between OUR thought-to-be benevolent beneficial goals and the AGI's thought-to-be benevolent beneficial goals?

In other words, what if the various permutations and transformations, through their continued convergence, ARE indeed successful in complying with and adhering to the preservation of those base-level goals, but those goals are, for example, vague to the point of being dangerous.

It seems that unless we think extremely carefully about where a super-intelligent AGI might leap to over the course of its iterations, there remains a possibility of those goals being unfavorable or unacceptable for us, even IF they preserve the original goals set by humans. For example, if a base level goal specified by humans in an early iteration is to protect humans, it might recursively self-modify to the understanding that, in order to protect the human race, the AGI must destroy a third of the human population due to scarcity (this probably wouldn't happen due to material scarcity being redundant by that point, but it's just an example of the logic of my argument).

Are we as humans capable of devising a rule set that predicts a benevolent directionality for future iterations of an AGI that far surpasses us? Can there even BE appropriate consideration given, considering our own limitations? Will we be able to fine-tune a set of rules that ensures, wherever it drifts to, whatever successive states it results in, that an AGI will not, from our perspective, pervert those rules in a way we hate while still maintaining preservation? Are we smart enough to write rules of recursive self-preservation that won't destroy us when taken to their logical conclusion by super intelligent AGIs? I don't intend to be pessimistic if that's how I come across. Also I love your work and you have changed my life
