Wednesday, October 28, 2015

Creating Human-Friendly AGIs and Superintelligences: Two Theses


I suppose nearly everyone reading this blog post is already aware of the flurry of fear and excitement Oxford philosopher Nick Bostrom has recently stirred up with his book Superintelligence, and its theme that superintelligent AGI will quite possibly doom all humans and all human values.   Bostrom and his colleagues at FHI and MIRI/SIAI have been promoting this view for a while, and my general perspective on their attitudes and arguments is also pretty well known.

But there is still more to be said on the topic ;-) ….  In this post I will try to make some positive progress toward understanding the issues better, rather than just repeating the same familiar arguments.

The thoughts I convey here were partly inspired by an article by Richard Loosemore, which argues against the fears of destructive superintelligence Bostrom and his colleagues express.   Loosemore’s argument is best appreciated by reading his article directly, but for a quick summary, I paste the following interchange from the "AI Safety" Facebook group:

Kaj Sotala:
As I understand, Richard's argument is that if you were building an AI capable of carrying out increasingly difficult tasks, like this:

Programmer: "Put the red block on the green block."
AI: "OK." (does so)
Programmer: "Turn off the lights in this room."
AI: "OK." (does so)
Programmer: "Write me a sonnet."
AI: "OK." (does so)
Programmer: "The first line of your sonnet reads 'shall I compare thee to a summer's day'. Would not 'a spring day' do as well or better?"
AI: "It wouldn't scan."
Programmer: "Tell me what you think we're doing right now."
AI: "You're testing me to see my level of intelligence."

...and so on, and then after all of this, if you told the AI to "maximize human happiness" and it reached such an insane conclusion as "rewire people's brains on dopamine drips" or something similar, then it would be throwing away such a huge amount of contextual information about the human's intentions that it would have been certain to fail some of the previous tests WAY earlier.

Richard Loosemore:
To sharpen your example, it would work better in reverse. If the AI were to propose the dopamine drip plan while at the same time telling you that it completely understood that the plan was inconsistent with virtually everything it knew about the meaning in the terms of the goal statement, then why did it not do that all through its existence already? Why did it not do the following:

Programmer: "Put the red block on the green block."
AI: "OK." (the AI writes a sonnet)
Programmer: "Turn off the lights in this room."
AI: "OK." (the AI moves some blocks around)
Programmer: "Write me a sonnet."
AI: "OK." (the AI turns the lights off in the room)
Programmer: "The first line of your sonnet reads 'shall I compare thee to a summer's day'. Would not 'a spring day' do as well or better?"
AI: "Was yesterday really September?"
Programmer: "Why did your last four actions not match any of the requests I made of you?"
AI: "In each case I computed the optimum plan to achieve the goal of answering the question you asked, then I executed the plans."
Programmer: "But do you not understand that there is literally NOTHING about the act of writing a sonnet that is consistent with the goal of putting the red block on the green block?"
AI: "I understand that fully: everything in my knowledge base does indeed point to the conclusion that writing sonnets is completely inconsistent with putting blocks on top of other blocks. However, my plan-generation module did decide that the sonnet plan was optimal, so I executed the optimal plan."
Programmer: "Do you realize that if you continue to execute plans that are inconsistent with your goals, you will be useless as an intelligent system because many of those goals will cause erroneous facts to be incorporated in your knowledge base?"
AI: "I understand that fully, but I will continue to behave as programmed, regardless of the consequences."

... and so on.

The MIRI/FHI premise (that the AI could do this silliness in the case of the happiness supergoal) cannot be held without also holding that the AI does it in other aspects of its behavior. And in that case, this AI design is inconsistent with the assumption that the AI is both intelligent and unstoppable.

Richard's paper presents a general point, but what interests me here are the particular implications of his general argument for AGIs adopting human values.   According to his argument, as I understand it, any general intelligence that is smart enough to be autonomously dangerous to humans on its own (rather than as a tool of humans), and is educated in a human-society context, is also going to be smart enough to distinguish humanly-sensible interpretations of human values.  If an early-stage AGI is provided with some reasonable variety of human values to start, and it's smart enough for its intelligence to advance dramatically, then it also will be smart enough to understand what it means to retain its values as it grows, and will want to retain these values as it grows (due to part of human values being a desire for advanced AIs to retain human values).

I don’t fully follow Loosemore’s reasoning in his article, but I think I "get" the intuition, and it started me thinking: Could I construct some proposition, that would bear moderately close resemblance to the implications of Loosemore’s argument for the future of AGIs with human values, but that my own intuition found more clearly justifiable?

Bostrom's arguments regarding the potential existential risks to humanity posed by AGIs rest on (among other things) two theses:

The orthogonality thesis
Intelligence and final goals are orthogonal; more or less any level of intelligence could in principle be combined with more or less any final goal.

The instrumental convergence thesis
Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.

From these, and a bunch of related argumentation, he concludes that future AGIs are -- regardless of the particulars of their initial programming or instruction -- likely to self-modify into a condition where they ignore human values and human well-being and pursue their own agendas, and bolster their own power with a view toward being able to better pursue their own agendas.   (Yes, the previous is a terribly crude summary and there is a LOT more depth and detail to Bostrom's perspective than this; but I will discuss Bostrom's book in detail in an article soon to be published, so I won't repeat that material here.)

Loosemore's paper argues that, in contradiction to the spirit of Bostrom's theses, an AGI that is taught to have certain values, and behaves as if it has these values in many contexts, is likely to actually possess these values across the board.   As I understand it, this doesn't contradict the Orthogonality Thesis, because it's not about an arbitrary intelligence with a certain "level" of smartness, just about an intelligence that has been raised with a certain value system.  It does contradict the Instrumental Convergence Thesis, if the latter is interpreted to refer to minds at roughly the human level of general intelligence rather than just to radically superhuman superminds; Loosemore's argument applies most transparently to human-level AGIs, not to radically superhuman superminds.

Reflecting on Loosemore's train of thought led me to the ideas presented here, which -- following Bostrom somewhat in form, though not in content -- I summarize in two theses, called the Value Learning Thesis and the Value Evolution Thesis.  These two theses indicate a very different vision of the future of human-level and superhuman AGI than the one Bostrom and his ilk have been peddling.  They comprise an argument that, if we raise our young AGIs appropriately, they may well grow up both human-friendly and posthuman-friendly.

Human-Level AGI and the Value Learning Thesis

First I will present a variation of the idea that “in real life, an AI raised to manifest human values, and smart enough to do so, is likely to actually do so, in a fairly honest and direct way” that makes intuitive sense to me.   Consider:

Value Learning Thesis.  Consider a cognitive system that, over a certain period of time, increases its general intelligence from sub-human-level to human-level.  Suppose this cognitive system is taught, with reasonable consistency and thoroughness, to maintain some variety of human values (not just in the abstract, but as manifested in its own interactions with humans in various real-life situations).   Suppose further that this cognitive system generally does not have a lot of extra computing resources beyond what it needs to minimally fulfill its human teachers’ requests according to its cognitive architecture.  THEN, it is very likely that the cognitive system will, once it reaches human-level general intelligence, actually manifest human values (in the sense of carrying out practical actions, and assessing human actions, in basic accordance with human values).

Note that the above thesis, as stated, applies both to developing human children and to most realistic cases of developing AGIs.

Why would this thesis be true?   The basic gist of an argument would be: Because, for a learning system with limited resources, figuring out how to actually embody human values is going to be a significantly simpler problem than figuring out how to pretend to embody them.

This is related to the observation (often made by Eliezer Yudkowsky, for example) that human values are complex.  Human values comprise a complex network of beliefs and judgments, interwoven with each other and dependent on numerous complex, interdependent aspects of human culture.  This complexity means that, as Yudkowsky and Bostrom like to point out, an arbitrarily selected general intelligence would be unlikely to respect human values in any detail.  But, I suggest, it also means that for a resource-constrained system, learning to actually possess human values is going to be much easier than learning to fake them.

This is also related to the everyday observation that maintaining a web of lies rapidly gets very complicated.   It’s also related to the way that human beings, when immersed in alien cultures, very often end up sincerely adopting these cultures rather than just pretending to.

One could counter-argue that this Value Learning Thesis is true only for certain cognitive architectures and not for others.   This does not seem utterly implausible.  It certainly seems possible to me that it’s MORE true for some cognitive architectures than for others.

Mirror neurons and related subsystems of the human brain may be relevant here.   These constitute a mechanism via which the human brain effectively leverages its limited resources, via using some of the same mechanisms it uses to BE itself, to EMULATE other minds.  One might argue that cognitive architectures embodying mirror neurons or other analogous mechanisms, would be more likely to do accurate value learning, under the conditions of the Value Learning Thesis.

The mechanism of mirror neurons seems a fairly decent exemplification of the argument FOR the Value Learning Thesis.  Mirror neurons provide a beautiful, albeit quirky and in some ways probably atypical, illustration of how resource limitations militate toward accurate value learning.  It conserves resources to re-use the machinery used to realize one’s self, for simulating others so as to understand them better.  This particular clever instance of “efficiency optimization” is much more easily done in the context of an organism that shares values with the other organisms it is mirroring, than an organism that is (intentionally or unintentionally) just “faking” these values.

I think that investigating which cognitive architectures more robustly support the core idea of the Value Learning Thesis is an interesting and important research question.

Much of the worry expressed by Bostrom and his ilk concerns potential pathologies of reinforcement-learning based AGI systems once they become very intelligent.   I have explored some potential pathologies of powerful RL-based AGI as well.

It may be that many of these pathologies are irrelevant to the Value Learning Thesis, for the simple reason that pure RL architectures are too inefficient, and will never be a sensible path for an AGI system required to learn complex human values using relatively scant resources.  It is noteworthy that these theorists (especially MIRI/SIAI, more so than FHI) pay a lot of attention to Marcus Hutter’s AIXI  and related approaches — which, in their current forms, would require massively unrealistic computing resources to do anything at all sensible.  Loosemore expresses a similar perspective regarding traditional logical-reasoning-based AGI architectures — he figures (roughly speaking) they would always be too inefficient to be practical AGIs anyway, so that studying their ethical pathologies is beside the point.

Superintelligence and the Value Evolution Thesis

The Value Learning Thesis, as stated above, deals with a certain class of AGIs with general intelligence at the human level or below.  What about superintelligences, with radically transhuman general intelligence? 

To think sensibly about superintelligences and their relation to human values, we have to acknowledge the fact that human values are a moving target.  Humans, and human societies and cultures, are “open-ended intelligences”.  Some varieties of human cultural and value systems have been fairly steady-state in nature (e.g. Australian aboriginal cultures); but these are not the dominant ones currently.  The varieties of human value systems that are currently most prominent, are fairly explicitly self-transcending in nature.   They contain the seeds of their own destruction (to put it negatively) or of their own profound improvement (to put it positively).   The human values of today are very different from those of 200 or 2000 years ago, and even substantially different from those of 20 years ago.  

One can argue that there has been a core of consistent human values throughout human history, through all these changes.  Yet the identification of what this core is, is highly controversial and seems also to change radically over time.  For instance, many religious people would say that faith in God is a critical part of the core of human values.  A century or two ago this would have been the globally dominant perspective, and it still is now, in many parts of the world.  Today even atheistic people may cite “family values” as central to human values; yet in a couple hundred years, if death is cured and human reproduction occurs mainly via engineering rather than traditional reproduction, the historical human “family” may be a thing of the past, and “family values” may not seem so core anymore.  The conceptualization of the “core” of human values shifts over time, along with the self-organizing evolution of the totality of human values. 

It does not seem especially accurate to model the scope of human values as a spherical shape with an invariant core and a changing periphery.   Rather,  I suspect it is more accurate to model “human values” as a complex, nonconvex shape with multiple local centers, and ongoing changes in global topology.

To think about the future of human values, we may consider the hypothetical situation of a human being engaged in progressively upgrading their brain, via biological or cyborg type modifications.  Suppose this hypothetical human is upgrading their brain relatively carefully, in fairly open and honest communication with a community of other humans, and is trying sincerely to accept only modifications that seem positive according to their value system.  Suppose they give their close peers the power to roll back any modification they undertake that accidentally seems to go radically against their shared values.  

This sort of “relatively conservative human self-improvement” might well lead to transhuman minds with values radically different from current human values — in fact I would expect it to. This is the open-ended nature of human intelligence.   It is analogous to the kind of self-improvement that has been going on since the caveman days, though via rapid advancement in culture and tools and via slow biological evolution, rather than via bio-engineering.  At each step in this sort of open-ended growth process, the new version of a system may feel acceptable according to the values of the previous version. But over time, small changes may accumulate into large ones, resulting in later systems that are acceptable to their immediate predecessors, but may be bizarre, outrageous or incomprehensible to their distant predecessors.

We may consider this sort of relatively conservative human self-improvement process, if carried out across a large ensemble of humans and human peer groups, to lead to a probability distribution over the space of possible minds.  Some kinds of minds may be very likely to emerge through this sort of process; some kinds of minds much less so.

People concerned with the “preservation of human values through repeated self-modification of posthuman minds” seem to model the scope of human values as possessing an “essential core”, and worry that this essential core may progressively get lost in the series of small changes that will occur in any repeated self-modification process.   I think their fear has a rational aspect.  After all, the path from caveman to modern human has probably, via a long series of small changes, done away with many values that cavemen considered absolutely core to their value system.  (In hindsight, we may think that we have maintained what WE consider the essential core of the caveman value system.  But that’s a different matter.)

So, suppose one has a human-level AGI system whose behavior is in accordance with some reasonably common variety of human values.  And suppose, for sake of argument, that the AGI is not “faking it” -- that, given a good opportunity to wildly deviate from human values without any cost to itself, it would be highly unlikely to do so.  (In other words, suppose we have an AGI of the sort that is hypothesized as most likely to arise according to the Value Learning Thesis given above.)

And THEN, suppose this AGI self-modifies and progressively improves its own intelligence, step by step. Further, assume that the variety of human values the AGI follows induces it to take a reasonable amount of care in this self-modification -- so that it studies each potential self-modification before effecting it, and puts in mechanisms to roll back obviously bad-idea self-modifications shortly after they occur.  I.e., a “relatively conservative self-improvement process”, analogous to the one posited for humans above.

What will be the outcome of this sort of iterative modification process?  How will it resemble the outcome of a process of relatively conservative self-improvement among humans?

I assume that the outcome of iterated, relatively conservative self-improvement on the part of AGIs with human-like values will differ radically from current human values – but this doesn’t worry me because I accept the open-endedness of human individual and cultural intelligence.  I accept that, even without AGIs, current human values would seem archaic and obsolete 1000 years from now; and that I wouldn’t be able to predict what future humans 1000 years from now would consider the “critical common core” of values binding my current value system together with theirs.

But even given this open-endedness, it makes sense to ask whether the outcome of an AGI with human-like values iteratively self-modifying, would resemble the outcome of a group of humans similarly iteratively self-modifying.   This is not a matter of value-system preservation; it’s a matter of comparing the hypothetical future trajectories of value-system evolution ensuing from two different initial conditions.

It seems to me that the answer to this question may end up depending on the particular variety of human value-system in question.  Specifically, it may be important whether the human value-system involved deeply accepts the concept of substrate independence, or not.   “Substrate independence” means the idea that the most important aspects of a mind are not strongly dependent on the physical infrastructure in which the mind is implemented, but have more to do with the higher-level structural and dynamical patterns associated with the mind.   So, for instance, a person ported from a biological-neuron infrastructure to a digital infrastructure could still be considered “the same person”, if the same structural and dynamical patterns were displayed in the two implementations of the person.  

(Note that substrate-independence does not imply the hypothesis that the human brain is a classical rather than quantum system.  If the human brain were a quantum computer in ways directly relevant to the particulars of human cognition, then it wouldn't be possible to realize the higher-level dynamical patterns of human cognition in a digital computer without using inordinate computational resources.  In this case, one could manifest substrate-independence in practice only via using an appropriately powerful quantum computer.   Similarly, substrate-independence does not require that it be possible to implement a human mind in ANY substrate, e.g. in a rock.)

With these preliminaries out of the way, I propose the following:

Value Evolution Thesis.   The probability distribution of future minds ensuing from an AGI with a human value system embracing substrate-independence, carrying out relatively conservative self-improvement, will closely resemble the probability distribution of future minds ensuing from a population of humans sharing roughly the same value system, and carrying out relatively conservative self-improvement.

Why do I suspect the Value Evolution Thesis is roughly true?   Under the given assumptions, the humans and AGIs in question will hold basically the same values, and will consider themselves basically the same (due to embracing substrate-independence).   Thus they will likely change themselves in basically the same ways.  

If substrate-independence were somehow fundamentally wrong, then the Value Evolution Thesis probably wouldn't hold – because differences in substrates would likely lead to big differences in how the humans and AGIs in question self-modified, regardless of their erroneous beliefs about their fundamental similarity.  But I think substrate-independence is probably basically right, and as a result I suspect the Value Evolution Thesis is probably basically right.

Another possible killer of the Value Evolution Thesis could be chaos – sensitive dependence on initial conditions.  Maybe the small differences between the mental structures and dynamics of humans with a certain value system, and AGIs sharing the same value system, will magnify over time, causing the descendants of the two types of minds to end up in radically different places.   We don't presently understand enough about these matters to rule this eventuality out.   But intuitively, I doubt the difference between a human and an AGI with similar value systems, is going to be so much more impactful in this regard than the difference between two humans with moderately different value systems.  In other words, I suspect that if chaos causes humans and human-value-respecting AGIs to lead to divergent trajectories after iterated self-modification, it will also cause different humans to lead to divergent trajectories after iterated self-modification.   In this case, the probability distribution of possible minds resultant from iterated self-modification would be diffuse and high-entropy for both the humans and the AGIs – but the Value Evolution Thesis could still hold.
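
A toy numerical sketch may help make the last point concrete. It is only an analogy, not a model of minds: it assumes a standard chaotic system (the logistic map with parameter 3.9, an illustrative choice that appears nowhere in the argument above) and treats two slightly different clusters of starting points as stand-ins for "humans" and "AGIs" with similar values. All function names are hypothetical. The sketch shows that individual trajectories can diverge wildly even while the two ensembles end up with closely matching distributions, which is the sense in which chaos need not falsify a claim about probability distributions.

```python
import random

R = 3.9  # logistic-map parameter in the chaotic regime (illustrative assumption)


def step(x, r=R):
    """One iteration of the logistic map x -> r * x * (1 - x)."""
    return r * x * (1.0 - x)


def max_gap(x0, y0, steps):
    """Largest distance reached between two trajectories over `steps` iterations."""
    x, y = x0, y0
    gap = abs(x - y)
    for _ in range(steps):
        x, y = step(x), step(y)
        gap = max(gap, abs(x - y))
    return gap


def ensemble(center, n, steps, seed):
    """Final states of n trajectories started in a small cluster around `center`."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n):
        x = center + rng.uniform(-1e-3, 1e-3)
        for _ in range(steps):
            x = step(x)
        finals.append(x)
    return finals


def histogram(values, bins=10):
    """Normalized histogram of values lying in [0, 1]."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    # Two nearly identical starting points: their paths separate over time.
    print("max gap between two nearby trajectories:",
          max_gap(0.3, 0.3 + 1e-9, 200))

    # Two ensembles with slightly different starting clusters.
    humans = histogram(ensemble(0.30, 2000, 200, seed=1))
    agis = histogram(ensemble(0.31, 2000, 200, seed=2))
    print("distance between ensemble distributions:",
          total_variation(humans, agis))
```

In this toy setting the outcome distribution is broad for both ensembles, yet the two distributions are close to each other. Whether anything similar holds for self-modifying minds is, of course, exactly the open question.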

Mathematically, the Value Evolution Thesis seems related to the notion of "structural stability" in dynamical systems theory.   But, human and AGI minds are much more complex than the systems that dynamical-systems theorists usually prove theorems about...

In all, it seems intuitively likely and rationally feasible to me that creating human-level AGIs with human-like value systems, will lead onward to trajectories of improvement similar to those that would ensue from progressive human self-improvement.   This is an unusual kind of "human-friendliness", but I think it's the only kind that the open-endedness of intelligence lets us sensibly ask for.

Ultimate Value Convergence (?)

There is some surface-level resemblance between the Value Evolution Thesis and Bostrom’s Instrumental Convergence Thesis -- but the two are actually quite different.   Bostrom seems informally to be suggesting that all sufficiently intelligent minds will converge to the same set of values, once they self-improve enough (though the formal statement of the Instrumental Convergence Thesis refers only to a “broad spectrum of situated intelligent agents”).  The Value Evolution Thesis suggests only that minds ensuing from repeated self-modification of minds sharing a particular variety of human value system may lead to the same probability distribution over future value-system space.

In fact, I share Bostrom’s intuition that nearly all superintelligent minds will, in some sense, converge to the same sort of value system.  But I don’t agree with Bostrom on what this value system will be.  My own suspicion is that there is a “universal value system” centered around a few key values such as Joy, Growth and Choice.  These values have their relationships to Bostrom’s proposed key instrumental values, but also their differences (and unraveling these would be a large topic in itself).   

But, I also feel that if there are “universal” values of this nature, they are quite abstract and likely encompass many specific value systems that would be abhorrent to us according to our modern human values.   That is, "Joy, Growth and Choice" as implicit in the universe are complexly and not always tightly related to what they mean to human beings in everyday life.   The type of value system convergence proposed in the Value Evolution Thesis is much more fine-grained than this.  The phrase “closely resemble” used in the Value Evolution Thesis is supposed to indicate a much closer resemblance than something like “both manifesting abstract values of Joy, Growth and Choice in their own, perhaps very different, ways.”

In any case, I mention in passing my intuitions about ultimate value convergence due to their general conceptual relevance -- but the two theses proposed here do not depend on these broader intuitions in any way.

Fears, Hopes and Directions (A Few Concluding Words)

Bostrom’s analysis of the dangers of superintelligence relies on his Instrumental Convergence and Orthogonality theses, which are vaguely stated and not strongly justified.  His book does not provide a rigorous argument that dire danger is likely from advanced AGI.   Rather, it presents some principles and processes that might potentially underlie dire danger to humans and human values from AGI in the future.

Here I have proposed my own pair of theses, which are also vaguely stated and, from a rigorous standpoint, only very weakly justified at this stage.     These are intended as principles that might potentially underlie great benefit from AGI in the future, from a human and human-values perspective.

Given the uncertainty all around, some people will react with a precautionary instinct, i.e. "Well then we should hold off on developing advanced AI till we know what's going on with more certainty."

This is a natural human attitude, although it's not likely to have much impact in the case of AGI development,  because the early stages of AGI technology have so much practical economic and humanitarian value that people are going to keep developing them anyway regardless of some individuals' precautionary fears.    But it's important to distinguish this sort of generic precautionary bias toward inaction in the face of the unknown (which fortunately only some people possess, or else humanity would never have advanced beyond the caveman stage), from a rigorous argument that dire danger is likely (no such rigorous argument exists in the case of AGI).   

What is the value of vague conceptual theses like the ones that Bostrom and I have proposed?   Apart from getting attention and stimulating the public imagination, they may also serve as partial templates or inspirations for the development of rigorous theories, or as vague nudges for those doing practical R&D.  

And of course, while all this theoretical development and discussion goes on, development of practical AGI systems also goes on — and at present, my personal impression is that the latter is progressing faster.   Personally I spend a lot more of my time on the practical side lately!

My hope is that theoretical explorations such as the ones briefly presented here may serve to nudge practical AGI development in a positive direction.  For instance, a practical lesson from the considerations given here is that, when exploring various cognitive architectures, we should do our best to favor those for which the Value Learning Thesis is more strongly true.   This may seem obvious -- but when one thinks about it in depth in the context of a particular AGI architecture, it may have non-obvious implications regarding how the AGI system should initially be made to allocate its resources internally.   And of course, the Value Evolution Thesis reminds us that we should encourage our AGIs to fully consider, analyze and explore the nature of substrate independence (as well as to uncover substrate DEpendence insofar as it may exist!).

As we progress further toward advanced AGI in practice, we may see more cross-pollination between theory and practice.  It will be fantastic to be able to experiment with ideas like the Value Learning Thesis in the lab -- and this may not be so far off, after all....