Introduction
I suppose nearly everyone reading this blog post is already aware
of the flurry of fear and excitement Oxford philosopher Nick Bostrom has recently stirred up with
his book Superintelligence, and its
theme that superintelligent AGI will quite possibly doom all humans and all human values. Bostrom and his colleagues at FHI and
MIRI/SIAI have been promoting this view for a while, and my general perspective on their attitudes and arguments is also pretty well known.
But there is still more to be said on the topic ;-) …. In this post I will try to make some positive
progress toward understanding the issues better, rather than just repeating the
same familiar arguments.
The thoughts I convey here were partly inspired by an article by Richard Loosemore, which argues against the fears of destructive
superintelligence Bostrom and his colleagues express. Loosemore’s argument is best appreciated by
reading his article directly, but for a quick summary, I paste the following interchange from the "AI Safety" Facebook group:
Kaj Sotala:
As I understand, Richard's argument is that if you were building an AI capable of carrying out increasingly difficult tasks, like this:
Programmer: "Put the red block on the green block."
AI: "OK." (does so)
Programmer: "Turn off the lights in this room."
AI: "OK." (does so)
Programmer: "Write me a sonnet."
AI: "OK." (does so)
Programmer: "The first line of your sonnet reads 'shall I compare thee to a summer's day'. Would not 'a spring day' do as well or better?"
AI: "It wouldn't scan."
Programmer: "Tell me what you think we're doing right now."
AI: "You're testing me to see my level of intelligence."
...and so on, and then after all of this, if you told the AI to "maximize human happiness" and it reached such an insane conclusion as "rewire people's brains on dopamine drips" or something similar, then it would be throwing away such a huge amount of contextual information about the human's intentions that it would have been certain to fail some of the previous tests WAY earlier.
Richard Loosemore:
To sharpen your example, it would work better in reverse. If the AI were to propose the dopamine drip plan while at the same time telling you that it completely understood that the plan was inconsistent with virtually everything it knew about the meaning in the terms of the goal statement, then why did it not do that all through its existence already? Why did it not do the following:
Programmer: "Put the red block on the green block."
AI: "OK." (the AI writes a sonnet)
Programmer: "Turn off the lights in this room."
AI: "OK." (the AI moves some blocks around)
Programmer: "Write me a sonnet."
AI: "OK." (the AI turns the lights off in the room)
Programmer: "The first line of your sonnet reads 'shall I compare thee to a summer's day'. Would not 'a spring day' do as well or better?"
AI: "Was yesterday really September?"
Programmer: "Why did your last four actions not match any of the requests I made of you?"
AI: "In each case I computed the optimum plan to achieve the goal of answering the question you asked, then I executed the plans."
Programmer: "But do you not understand that there is literally NOTHING about the act of writing a sonnet that is consistent with the goal of putting the red block on the green block?"
AI: "I understand that fully: everything in my knowledge base does indeed point to the conclusion that writing sonnets is completely inconsistent with putting blocks on top of other blocks. However, my plan-generation module did decide that the sonnet plan was optimal, so I executed the optimal plan."
Programmer: "Do you realize that if you continue to execute plans that are inconsistent with your goals, you will be useless as an intelligent system because many of those goals will cause erroneous facts to be incorporated in your knowledge base?"
AI: "I understand that fully, but I will continue to behave as programmed, regardless of the consequences."
... and so on.
The MIRI/FHI premise (that the AI could do this silliness in the case of the happiness supergoal) cannot be held without also holding that the AI does it in other aspects of its behavior. And in that case, this AI design is inconsistent with the assumption that the AI is both intelligent and unstoppable.
Richard's paper presents a general point, but what interests me here are the particular implications of his general argument for AGIs adopting human values. According to his argument, as I understand it, any general intelligence that is smart enough to be autonomously dangerous to humans on its own (rather than as a tool of humans), and is educated in a human-society context, is also going to be smart enough to distinguish humanly-sensible interpretations of human values. If an early-stage AGI is provided with some reasonable variety of human values to start, and it's smart enough for its intelligence to advance dramatically, then it also will be smart enough to understand what it means to retain its values as it grows, and will want to retain these values as it grows (due to part of human values being a desire for advanced AIs to retain human values).
I don’t fully follow Loosemore’s reasoning in his article,
but I think I "get" the intuition, and it started me thinking: Could I construct some proposition, that would bear
moderately close resemblance to the implications of Loosemore’s argument for the future of AGIs with human values, but that my own intuition found more clearly justifiable?
Bostrom's arguments regarding the potential existential risks to humanity posed by AGIs rest on (among other things) two theses:
The orthogonality thesis
Intelligence
and final goals are orthogonal; more or less any level of intelligence could in
principle be combined with more or less any final goal.
The instrumental convergence thesis.
Several
instrumental values can be identified which are convergent in the sense that
their attainment would increase the chances of the agent’s goal being realized
for a wide range of final goals and a wide range of situations, implying that
these instrumental values are likely to be pursued by a broad spectrum of
situated intelligent agents.
From these, and a bunch of related argumentation, he concludes that future AGIs are -- regardless of the particulars of their initial programming or instruction -- likely to self-modify into a condition where they ignore human values and human well-being and pursue their own agendas, and bolster their own power with a view toward being able to better pursue their own agendas. (Yes, the previous is a terribly crude summary and there is a LOT more depth and detail to Bostrom's perspective than this; but I will discuss Bostrom's book in detail in an article soon to be published, so I won't repeat that material here.)
Loosemore's paper argues that, in contradiction to the spirit of Bostrom's theses, an AGI that is taught to have certain values and behaves as if it has these values in many contexts, is likely to actually possess these values across the board. As I understand it, this doesn't contradict the Orthogonality Thesis (because it's not about an arbitrary intelligence with a certain "level" of smartness, just about an intelligence that has been raised with a certain value system), but it contradicts the Instrumental Convergence Thesis, if the latter is interpreted to refer to minds at the roughly human level of general intelligence, rather than just to radically superhuman superminds (because Loosemore's argument is most transparently applied to human-level AGIs not radically superhuman superminds).
Reflecting on Loosemore's train of thought led me to the ideas presented here,
which -- following Bostrom somewhat in form, though not in content -- I summarize in two theses, called the Value Learning Thesis and the Value Evolution Thesis. These two theses indicate a very different
vision of the future of human-level and superhuman AGI than the one Bostrom and
ilk have been peddling. They comprise an argument that, if we raise our young AGIs appropriately, they may well grow up both human-friendly and posthuman-friendly.
Human-Level AGI and the Value Learning Thesis
First I will present a variation of the idea that “in real
life, an AI raised to manifest human values, and smart enough to do so, is
likely to actually do so, in a fairly honest and direct way” that makes intuitive
sense to me. Consider:
Value Learning Thesis. Consider a cognitive system that, over a
certain period of time, increases its general intelligence from sub-human-level
to human-level. Suppose this cognitive
system is taught, with reasonable consistency and thoroughness, to maintain
some variety of human values (not just in the abstract, but as manifested in
its own interactions with humans in various real-life situations). Suppose this cognitive system generally
does not have a lot of extra computing resources beyond what it needs to
minimally fulfill its human teachers’ requests according to its cognitive
architecture. THEN, it is very likely
that the cognitive system will, once it reaches human-level general
intelligence, actually manifest human values (in the sense of carrying out
practical actions, and assessing human actions, in basic accordance with human
values).
Note that this above thesis, as stated, applies both to
developing human children and to most realistic cases of developing AGIs.
Why would this thesis be true? The basic gist of an argument would be:
Because, for a learning system with limited resources, figuring out how to
actually embody human values is going to be a significantly simpler problem
than figuring out how to pretend to.
This is related to the observation (often made by Eliezer
Yudkowsky, for example) that human values are complex. Human values comprise a complex network of
beliefs and judgments, interwoven with each other and dependent on numerous
complex, interdependent aspects of human culture. This complexity means that, as Yudkowsky and
Bostrom like to point out, an arbitrarily selected general intelligence would
be unlikely to respect human values in any detail. But, I suggest, it also means that for a
resource-constrained system, learning to actually possess human values is going
to be much easier than learning to fake them.
This is also related to the everyday observation that
maintaining a web of lies rapidly gets very complicated. It’s also related to the way that human
beings, when immersed in alien cultures, very often end up sincerely adopting
these cultures rather than just pretending to.
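To make the resource-constraint intuition slightly more concrete, here is a minimal toy sketch -- my own illustration, not anything from Loosemore's paper or from Bostrom -- using a crude "count the stored rules" measure of resources. The point is simply that an agent which genuinely adopts the taught values has one policy to represent, whereas a "faking" agent must represent the taught values, its own divergent goals, and a detector for when it is being observed or tested. The class and the numbers below are hypothetical.

```python
# Toy illustration of the resource argument behind the Value Learning Thesis.
# Honest value learner: stores one mapping from situations to value-consistent actions.
# Value faker: stores that mapping for show, PLUS its private goal mapping, PLUS a
# rule set for detecting when it is being observed. All figures are hypothetical.

from dataclasses import dataclass
from typing import Dict


@dataclass
class PolicyCost:
    name: str
    components: Dict[str, int]  # component name -> rough storage cost (number of rules)

    def total(self) -> int:
        return sum(self.components.values())


N_SITUATIONS = 10_000  # distinct real-life situations the agent must handle
N_OBS_CUES = 500       # cues for detecting "am I being observed / tested?"

honest = PolicyCost(
    "honest value learner",
    {"taught human values -> actions": N_SITUATIONS},
)

faker = PolicyCost(
    "value faker",
    {
        "taught human values -> actions (for show)": N_SITUATIONS,
        "private goals -> actions": N_SITUATIONS,
        "observation detector": N_OBS_CUES,
    },
)

for agent in (honest, faker):
    print(f"{agent.name}: ~{agent.total()} stored rules")
# Under tight resource constraints, the honest policy is the cheaper one to learn and maintain.
```

Of course, real value learning is nothing like a lookup table of rules; the sketch only dramatizes why deception carries an extra representational burden for a resource-constrained learner.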
One could counter-argue that this Value Learning Thesis is
true only for certain cognitive architectures and not for others. This does not seem utterly implausible. It certainly seems possible to me that it’s
MORE true for some cognitive architectures than for others.
Mirror neurons and related subsystems of the human brain may
be relevant here. These constitute a
mechanism via which the human brain effectively leverages its limited
resources, via using some of the same mechanisms it uses to BE itself, to
EMULATE other minds. One might argue
that cognitive architectures embodying mirror neurons or other analogous
mechanisms, would be more likely to do accurate value learning, under the
conditions of the Value Learning Thesis.
The mechanism of
mirror neurons seems a fairly decent exemplification of the argument FOR the
Value Learning Thesis. Mirror neurons
provide a beautiful, albeit quirky and in some ways probably atypical,
illustration of how resource limitations militate toward accurate value
learning. It conserves resources to
re-use the machinery used to realize one’s self, for simulating others so as to
understand them better. This particular
clever instance of “efficiency optimization” is much more easily done in the
context of an organism that shares values with the other organisms it is
mirroring, than in an organism that is (intentionally or unintentionally) just
“faking” these values.
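Here is one way to caricature that reuse trick in code -- again purely my own toy sketch with hypothetical names, not a claim about how mirror neurons or any particular cognitive architecture actually work. The agent predicts another mind's behavior by running its own planning machinery on the other's presumed goal, rather than maintaining a separate, expensive model of the other mind.

```python
# Toy sketch of "mirror neuron" style resource reuse: empathy by re-running one's
# own planner on another agent's presumed goal. Names and values are hypothetical.

from typing import Callable, List


class MirroringAgent:
    def __init__(self, actions: List[str], value: Callable[[str, str], float]):
        self.actions = actions  # repertoire of possible actions
        self.value = value      # value(action, goal) -> how good the action is for the goal

    def plan(self, goal: str) -> str:
        """Choose this agent's own best action for a given goal."""
        return max(self.actions, key=lambda a: self.value(a, goal))

    def simulate_other(self, presumed_goal: str) -> str:
        """Predict another agent's action by reusing the same planning machinery."""
        return self.plan(presumed_goal)


agent = MirroringAgent(
    actions=["help", "ignore", "harm"],
    value=lambda a, g: {"help": 1.0, "ignore": 0.0, "harm": -1.0}[a] if g == "cooperate" else 0.0,
)
print(agent.plan("cooperate"))            # -> "help"
print(agent.simulate_other("cooperate"))  # -> "help": same machinery, no second model
```

The shortcut is cheap exactly when the simulated mind's values resemble the agent's own; an agent merely faking shared values would either mispredict others or have to pay for a second, separate model.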
I think that investigating which cognitive architectures
more robustly support the core idea of the Value Learning Thesis is an
interesting and important research question.
Much of the worry expressed by Bostrom and ilk regards
potential pathologies of reinforcement-learning based AGI systems once they
become very intelligent. I have
explored some potential pathologies of powerful RL-based AGI as well.
It may be that many of these pathologies are irrelevant to
the Value Learning Thesis, for the simple reason that pure RL architectures are
too inefficient, and will never be a sensible path for an AGI system required
to learn complex human values using relatively scant resources. It is noteworthy that these theorists
(especially MIRI/SIAI, more so than FHI) pay a lot of attention to Marcus
Hutter’s AIXI and related approaches — which, in their current forms,
would require massively unrealistic computing resources to do anything at all
sensible. Loosemore expresses a similar
perspective regarding traditional logical-reasoning-based AGI architectures —
he figures (roughly speaking) they would always be too inefficient to be
practical AGIs anyway, so that studying their ethical pathologies is beside the
point.
Superintelligence and the Value Evolution Thesis
The Value Learning Thesis, as stated above, deals with a
certain class of AGIs with general intelligence at the human level or
below. What about superintelligences,
with radically transhuman general intelligence?
To think sensibly about superintelligences and their
relation to human values, we have to acknowledge the fact that human values are
a moving target. Humans, and human
societies and cultures, are “open-ended intelligences”. Some varieties of human cultural and value
systems have been fairly steady-state in nature (e.g. Australian aboriginal
cultures); but these are not the dominant ones currently. The varieties of human value systems that are
currently most prominent, are fairly explicitly self-transcending in nature. They contain the seeds of their own
destruction (to put it negatively) or of their own profound improvement (to put
it positively). The human values of
today are very different from those of 200 or 2000 years ago, and even
substantially different from those of 20 years ago.
One can argue that there has been a core of consistent human
values throughout human history, through all these changes. Yet the identification of what this core is,
is highly controversial and seems also to change radically over time. For instance, many religious people would say
that faith in God is a critical part of the core of human values. A century or two ago this would have been the
globally dominant perspective, and it still is now, in many parts of the
world. Today even atheistic people may
cite “family values” as central to human values; yet in a couple hundred years,
if death is cured and human reproduction occurs mainly via engineering rather
than traditional reproduction, the historical human “family” may be a thing of
the past, and “family values” may not seem so core anymore. The conceptualization of the “core” of human
values shifts over time, along with the self-organizing evolution of the
totality of human values.
It does not seem especially accurate to model the scope of
human values as a spherical shape with an invariant core and a changing
periphery. Rather, I suspect it is more accurate to model “human
values” as a complex, nonconvex shape with multiple local centers, and ongoing
changes in global topology.
To think about the future of human values, we may consider
the hypothetical situation of a human being engaged in progressively upgrading
their brain, via biological or cyborg type modifications. Suppose this hypothetical human is upgrading
their brain relatively carefully, in fairly open and honest communication with
a community of other humans, and is trying sincerely to accept only
modifications that seem positive according to their value system. Suppose they give their close peers the power
to roll back any modification they undertake that accidentally seems to go
radically against their shared values.
This sort of “relatively conservative human
self-improvement” might well lead to transhuman minds with values radically
different from current human values — in fact I would expect it to. This is the
open-ended nature of human intelligence.
It is analogous to the kind of self-improvement that has been going on
since the caveman days, though via rapid advancement in culture and tools and via
slow biological evolution, rather than via bio-engineering. At each step in this sort of open-ended
growth process, the new version of a system may feel acceptable according to
the values of the previous version. But over time, small changes may accumulate
into large ones, resulting in later systems that are acceptable to their
immediate predecessors, but may be bizarre, outrageous or incomprehensible to
their distant predecessors.
We may consider this sort of relatively conservative human
self-improvement process, if carried out across a large ensemble of humans and
human peer groups, to lead to a probability distribution over the space of
possible minds. Some kinds of minds may
be very likely to emerge through this sort of process; some kinds of minds much
less so.
People concerned with the “preservation of human values
through repeated self-modification of posthuman minds” seem to model the scope
of human values as possessing an “essential core”, and worry that this
essential core may progressively get lost in the series of small changes that
will occur in any repeated self-modification process. I think their fear has a rational
aspect. After all, the path from caveman
to modern human has probably, via a long series of small changes, done away
with many values that cavemen considered absolutely core to their value
system. (In hindsight, we may think that
we have maintained what WE consider the essential core of the caveman value
system. But that’s a different matter.)
So, suppose one has a human-level AGI system whose behavior
is in accordance with some reasonably common variety of human values. And suppose, for sake of argument, that the
AGI is not “faking it” — that, given a good opportunity to wildly deviate from
human values without any cost to itself, it would be highly unlikely to do
so. (In other words, suppose we have an
AGI of the sort that is hypothesized as most likely to arise according to the
Value Learning Thesis given above.)
And THEN, suppose this AGI self-modifies and progressively
improves its own intelligence, step by step. Further, assume that the variety
of human values the AGI follows, induces it to take a reasonable amount of care
in this self-modification — so that it studies each potential self-modification
before effecting it, and puts in mechanisms to roll back obviously bad-idea
self-modifications shortly after they occur.
I.e., a “relatively conservative self-improvement process”, analogous to
the one posited for humans above.
What will be the outcome of this sort of iterative
modification process? How will it
resemble the outcome of a process of relatively conservative self-improvement
among humans?
I assume that the outcome of iterated, relatively
conservative self-improvement on the part of AGIs with human-like values will
differ radically from current human values – but this doesn’t worry me because
I accept the open-endedness of human individual and cultural intelligence. I accept that, even without AGIs, current
human values would seem archaic and obsolete 1000 years from now; and that I
wouldn’t be able to predict what future humans 1000 years from now would consider the
“critical common core” of values binding my current value system together with
theirs.
But even given this open-endedness, it makes sense to ask
whether the outcome of an AGI with human-like values iteratively
self-modifying, would resemble the outcome of a group of humans similarly
iteratively self-modifying. This is not
a matter of value-system preservation; it’s a matter of comparing the hypothetical
future trajectories of value-system evolution ensuing from two different
initial conditions.
It seems to me that the answer to this question may end up
depending on the particular variety of human value-system in question. Specifically, it may be important whether the
human value-system involved deeply accepts the concept of substrate independence, or not. “Substrate independence” means the idea that
the most important aspects of a mind are not strongly dependent on the physical
infrastructure in which the mind is implemented, but have more to do with the
higher-level structural and dynamical patterns associated with the mind. So, for instance, a person ported from a
biological-neuron infrastructure to a digital infrastructure could still be
considered “the same person”, if the same structural and dynamical patterns
were displayed in the two implementations of the person.
(Note that substrate-independence does not imply the
hypothesis that the human brain is a classical rather than quantum system. If the human brain were a quantum computer in
ways directly relevant to the particulars of human cognition, then it wouldn't
be possible to realize the higher-level dynamical patterns of human cognition in
a digital computer without using inordinate computational resources. In this case, one could manifest
substrate-independence in practice only via using an appropriately powerful
quantum computer. Similarly, substrate-independence
does not require that it be possible to implement a human mind in ANY
substrate, e.g. in a rock.)
With these preliminaries out of the way, I propose the
following:
Value Evolution Thesis. The probability distribution of future minds
ensuing from an AGI with a human value system embracing substrate-independence,
carrying out relatively conservative self-improvement, will closely resemble
the probability distribution of future minds ensuing from a population of
humans sharing roughly the same value system, and carrying out relatively
conservative self-improvement.
Why do I suspect the Value Evolution Thesis is roughly
true? Under the given assumptions, the humans and
AGIs in question will hold basically the same values, and will consider
themselves basically the same (due to embracing substrate-independence). Thus they will likely change themselves in basically
the same ways.
If substrate-independence were somehow fundamentally wrong,
then the Value Evolution Thesis probably wouldn't hold – because differences in
substrates would likely lead to big differences in how the humans and AGIs in
question self-modified, regardless of their erroneous beliefs about their
fundamental similarity. But I think
substrate-independence is probably basically right, and as a result I suspect
the Value Evolution Thesis is probably basically right.
Another possible killer of the Value Evolution Thesis could
be chaos – sensitive dependence on initial conditions. Maybe the small differences between the
mental structures and dynamics of humans with a certain value system, and AGIs
sharing the same value system, will magnify over time, causing the descendants
of the two types of minds to end up in radically different places. We don't presently understand enough about
these matters to rule this eventuality out.
But intuitively, I doubt the difference between a human and an AGI with
similar value systems, is going to be so much more impactful in this regard
than the difference between two humans with moderately different value systems. In other words, I suspect that if chaos
causes humans and human-value-respecting AGIs to lead to divergent trajectories
after iterated self-modification, it will also cause different humans to lead
to divergent trajectories after iterated self-modification. In this case, the probability distribution
of possible minds resultant from iterated self-modification would be diffuse
and high-entropy for both the humans and the AGIs – but the Value Evolution
Thesis could still hold.
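For what it's worth, one can play with this question in an utterly simplified toy model -- my own illustration, with all parameters hypothetical and no pretense of modeling real minds. Treat each mind as a point in an abstract value space, let "relatively conservative self-improvement" be a small random modification per step with peers rolling back any single step that deviates too radically, and compare the outcome distributions starting from a "human" value system and from an AGI value system that is almost, but not exactly, the same.

```python
# Toy simulation of "relatively conservative self-improvement" for two populations
# with nearly identical starting value systems. Purely illustrative; all parameters
# are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
DIM, STEPS, POP = 8, 200, 1000
STEP_SIZE, ROLLBACK_RADIUS = 0.02, 0.1


def evolve(initial_values: np.ndarray) -> np.ndarray:
    """Run conservative self-modification for a population sharing initial_values."""
    minds = np.tile(initial_values, (POP, 1))
    for _ in range(STEPS):
        proposal = minds + rng.normal(0.0, STEP_SIZE, minds.shape)
        # Peers roll back any modification that deviates too radically in a single step.
        too_far = np.linalg.norm(proposal - minds, axis=1) > ROLLBACK_RADIUS
        proposal[too_far] = minds[too_far]
        minds = proposal
    return minds


human_start = rng.normal(0.0, 1.0, DIM)
agi_start = human_start + rng.normal(0.0, 0.01, DIM)  # near-identical learned values

humans_later = evolve(human_start)
agis_later = evolve(agi_start)

# Compare the two outcome distributions crudely via their means and spreads.
print("distance between ensemble means:",
      np.linalg.norm(humans_later.mean(0) - agis_later.mean(0)))
print("typical within-ensemble spread:",
      humans_later.std(0).mean(), agis_later.std(0).mean())
```

In this toy setting both ensembles drift far from where they started, yet they drift into closely overlapping distributions -- which is the flavor of claim the Value Evolution Thesis makes, though of course not a demonstration of it.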
Mathematically, the Value Evolution Thesis seems related to the notion of "structural stability" in dynamical systems theory. But, human and AGI minds are much more complex than the systems that dynamical-systems theorists usually prove theorems about...
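For readers who want the standard definition being alluded to (textbook dynamical-systems material; its application to minds is, of course, only a loose analogy):

```latex
% Standard definition of structural stability for flows (reference material, not from the post).
A $C^1$ vector field $f$ on a compact manifold $M$ is \emph{structurally stable} if
there exists $\epsilon > 0$ such that every vector field $g$ with
$\|f - g\|_{C^1} < \epsilon$ generates a flow topologically equivalent to that of $f$:
there is a homeomorphism $h : M \to M$ carrying orbits of $f$ onto orbits of $g$
and preserving their time orientation.
```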
In all, it seems intuitively likely and rationally feasible to me that creating human-level AGIs with human-like value systems, will lead onward to trajectories of improvement similar to those that would ensue from progressive human self-improvement. This is an unusual kind of "human-friendliness", but I think it's the only kind that the open-endedness of intelligence lets us sensibly ask for.
Ultimate Value Convergence (?)
There is some surface-level resemblance between the Value
Evolution Thesis and Bostrom’s Instrumental Convergence Thesis — but the
two are actually quite different.
Bostrom seems informally to be suggesting that all sufficiently
intelligent minds will converge to the
same set of values, once they self-improve enough (though the formal statement
of the Instrumental Convergence Thesis refers only to a “broad spectrum of situated intelligent agents”). The Value Evolution Thesis suggests only that
all minds ensuing from repeated self-modification of minds sharing a particular
variety of human value system, may lead to the same probability distribution
over future value-system space.
In fact, I share Bostrom’s intuition that nearly all
superintelligent minds will, in some sense, converge to the same sort of value
system. But I don’t agree with Bostrom
on what this value system will be. My
own suspicion is that there is a “universal value system” centered around a few
key values such as Joy, Growth and Choice.
These values have their relationships to Bostrom’s proposed key
instrumental values, but also their differences (and unraveling these would be
a large topic in itself).
But, I also
feel that if there are “universal” values of this nature, they are quite
abstract and likely encompass many specific value systems that would be
abhorrent to us according to our modern human values. That is, "Joy, Growth and Choice" as implicit in the universe are complexly and not always tightly related to what they mean to human beings in everyday life. The type of value system convergence
proposed in the Value Evolution Thesis is much more fine-grained than this. The “closely resemble” used in the Value Evolution
Thesis is supposed to indicate a much closer resemblance than something like
“both manifesting abstract values of Joy, Growth and Choice in their own, perhaps very different, ways.”
In any case, I mention in passing my intuitions about ultimate value convergence due to their general conceptual relevance -- but the two theses proposed here do not depend on these broader intuitions in any way.
Fears, Hopes and Directions (A Few Concluding Words)
Bostrom’s analysis of the dangers of superintelligence
relies on his Instrumental Convergence and Orthogonality theses, which are vaguely stated
and not strongly justified in any way. His arguments do not rigorously establish that dire danger from advanced AGI is likely. Rather, they present some principles and processes that might potentially underlie dire danger to humans and human values from AGI in the future.
Here I have proposed my own pair of theses, which are also
vaguely stated and, from a rigorous standpoint, only very weakly justified at
this stage. These are intended as principles that might potentially underlie great benefit from AGI in the future, from a human and human-values perspective.
Given the uncertainty all around, some people will react with a precautionary instinct, i.e. "Well then we should hold off on developing advanced AI till we know what's going on with more certainty."
This is a natural human attitude, although it's not likely to have much impact in the case of AGI development, because the early stages of AGI technology have so much practical economic and humanitarian value that people are going to keep developing them anyway regardless of some individuals' precautionary fears. But it's important to distinguish this sort of generic precautionary bias toward inaction in the face of the unknown (which fortunately only some people possess, or else humanity would never have advanced beyond the caveman stage), from a rigorous argument that dire danger is likely (no such rigorous argument exists in the case of AGI).
What is the value of vague conceptual theses like the ones that Bostrom and I have proposed? Apart from getting attention and stimulating the public imagination, they may also serve as partial templates or
inspirations for the development of rigorous theories, or as vague nudges for those doing practical R&D.
And of course, while all this theoretical development and discussion goes on, development of practical AGI systems also goes on — and at present, my personal impression is that the latter is progressing faster. Personally I spend a lot more of my time on the practical side lately!
My hope is that theoretical explorations such as the ones briefly presented here may
serve to nudge practical AGI development in a positive direction. For instance, a practical lesson from the considerations
given here is that, when exploring various cognitive architectures, we should
do our best to favor those for which the Value Learning Thesis is more strongly
true. This may seem obvious -- but when one thinks about it in depth in the context of a particular AGI architecture, it may have non-obvious implications regarding how the AGI system should initially be made to allocate its resources internally. And of course, the Value Evolution Thesis reminds us that we should
encourage our AGIs to fully consider, analyze and explore the nature of substrate independence (as well as to uncover substrate DEpendence insofar as it may exist!).
As we progress further toward advanced AGI in practice, we may see more cross-pollination between theory and practice. It will be fantastic to be able to experiment with ideas like the Value Learning Thesis in the lab -- and this may not be so far off, after all....
Comments
[Part 1]
Ben, I need to flag an important misunderstanding in your description of what my AAAI paper said.
The argument in that paper actually has no connection whatever to the values or goals of human beings. It is very important to emphasize that point, because it creates no end of confusion. So, to be absolutely sure, I will repeat: there is NOTHING at all in my argument that relates in any way to the fact that the supposed AI in the MIRI/FHI scenarios is trying to follow a goal that expresses some human values.
What DOES my paper say? It says that the MIRI/FHI scenarios all posit an AI that does the following:
1) It takes a goal statement (which could be 'maximize human happiness' or 'replenish the house supply of marmite' or .... absolutely anything).
2) It then hands off that goal statement (encoded in whatever representational language the AI uses) to the REASONING ENGINE inside the AI, so that the latter can operate on the goal statement to come up with candidate sequences of subgoal statements. It is important to emphasize that it is the REASONING ENGINE (aka knowledge maintenance system, knowledge engine, etc.) that is responsible for ensuring that the AI actually does produce a candidate sequence of subgoal statements that is internally consistent with the goal statement. "Internally consistent" simply means that when the Reasoning Engine does its stuff, it must declare that, for example, a candidate sequence such as [Sing the Hallelujah Chorus] followed by [spin around three times] would be outrageously inconsistent with the goal 'replenish the house supply of marmite'.
3) As part of its work, the reasoning engine finally singles out one action plan (== sequence of subgoals, subsubgoals, etc.) that wins the competition and is anointed as the one that (according to the internal analysis) is the 'best' one for achieving the goal.
Now, with these three features in mind, here is what the MIRI/FHI people say. They insist that a goal such as 'maximise human happiness' COULD, theoretically, lead to a situation in which the AI decides that the best plan is to decapitate the entire human population and put their heads on dopamine drips, because this satisfies some internal measure of happiness that the AI happens to have decided is the best one. Notice that the MIRI/FHI people do not say exactly how this could happen, they only say that this is in principle possible.
Furthermore -- and take careful note of this -- the MIRI/FHI people say that the AI could be fully aware of the fact that the plan has a gigantic list of inconsistencies with everything that is known about happiness. In other words, there are many MANY items of knowledge inside the AI, all of them accepted by the AI as valid, which all point to the conclusion that the dopamine drip plan is NOT consistent with human happiness. The MIRI/FHI people do not dispute that this inconsistency will exist -- they are quite categorical about agreeing that it will happen.
However, the MIRI/FHI people insist that BECAUSE the AI has (somehow) come to the conclusion that the dopamine drip plan is the best of all possible plans, it will insist on executing the plan, IN SPITE of all the inconsistencies mentioned in the last paragraph.
The MIRI/FHI position is to talk about this in the context of the 'maximise human happiness' goal, but MY ATTACK does not depend on that goal at all: in fact, I point out that there is nothing in their argument that says this can only happen in the case of the 'maximize human happiness' goal, so if the AI is capable of doing this in the case of the happiness goal it is capable of doing it in the case of any goal. In that case, we can confine ourselves to the 'replenish the house supply of marmite' goal with no loss of generality.
[part 2 to follow]
[part 2]
What that means is that the MIRI/FHI people are proposing that the AI could decide (somehow) that the best way to achieve the marmite goal is to execute the plan [Sing the Hallelujah Chorus] followed by [spin around three times]. And, when faced with a mountain of evidence that no marmite will be acquired by these actions, the AI will ignore the evidence and execute the plan anyway.
There is no reason to assume that this will be an occasional occurrence, so we must assume that the AI could be doing this frequently. In particular it could do this for every one of the thousands of micro-plans that happen all day long, and which have been executed every day throughout the lifetime of the AI.
With all this as background, my argument is simple.
The MIRI/FHI people have postulated a type of AI that is behaving so unintelligently -- according to every reasonable measure of 'intelligence' -- that it would be unable to function coherently in the real world. Indeed, it is behaving with so little intelligence that it could not be a threat to anyone.
There is NOTHING in the behavior of this AI to distinguish it from an AI in which the reasoning engine was completely broken.
In fact, of course, this "type of AI" is not an AI at all: it is a complete fiction that could never exist. The MIRI/FHI people have deliberately invented a type of fictional AI that contains a contradiction of such proportions that it is laughable.
Please notice that my argument does not reference human values, the gap between human expectations and machine expectations ..... nothing. It references only the drastic failure of the AI's reasoning engine to maintain even a modicum of consistency in its knowledge base.
So, although your post (which I will have to read in full tomorrow because you caught me at the end of the day on this side of the planet) is undoubtedly going to be interesting, it makes a lot of references to my argument having something to do with human value systems, and that part is wildly incorrect.
Later,
Richard
[end]
Hi Richard, I did not fully understand your arguments in your paper, so I'm not surprised I referred to them a little inaccurately. Sorry about that.... I will change the way I refer to your paper, in accordance with your suggestions, in my blog post when I next have time (which may not be till Saturday as I'm about to head behind the Great Firewall), but at the moment I gotta log off the computer and pack my suitcase and go give a talk...
Well, decades have passed and I see we are all still debating this stuff, Ben ;)
The more the years have rolled by, the more I'm forced to admit we've all been idiots (myself included). I mean, the more I have learnt about the subject, the more I realize that we all knew so little, and even now we are really still quite ignorant. Much of the AGI writing reads just like babble to me now :( In the greater scheme of things, we all remain babies.
I agree with you in that I still think that there are some 'universal values' that all minds capable of 'full reflection' (in some suitably defined sense of the term) would converge upon. However, I also agree with you that these 'universal values' are likely to be very abstract, taking the form more of virtues or 'platonic ideals' than of concrete ethical prescriptions. Therefore, we have to admit that these universal abstractions are likely to be compatible with a huge range of specific concrete value systems, and are unlikely to be sufficient by themselves to protect us from unfriendly AI behaviors.
Of course I disagree with you on what the universal values are LOL. You are guessing JOY, GROWTH and CHOICE; I did pick a total of three values as universals, so at least we both have the same number of core values. However, unfortunately, my 3 are completely different ;)
For your information, the actual three universals are BEAUTY, LIBERTY and PERFECTION. Best guess ;)
I'm wondering if coordinating AGI value systems with human value systems would result in human level intelligence. Considering that human value systems probably emerge naturally from a collection of experiential data, would it be possible to code them in, whereas we behave in response to environmental stimuli?
I of course can see the value of safe play but would it really produce the level of autonomy that is the end goal?
It seems self-evident that entities with similar concerns would respond with similar behaviors if produced from a trial-and-error, evolutionary model. Likening the argument to changes in human behavior with the introduction of the scientific method to the human experience, one can see the differences between the way that the average human would respond to an ant hill in their lawn and the way that a myrmecologist might. There is a large community to sample from in that respect. This makes it easy to buy into your and Nick's rhetoric. This paradigm causes rational thought to value interesting systems and to struggle over the weight of each. Rather than having AGI that has values that I might find comfortable on the surface, I might feel more content with AGI that agonized over the data to make the right decision. The relativistic nature of the formation of world views just has me thinking that human values might not be relevant to the experiences of superintelligence, though they may find a vested interest in them.
That being said, I guess my real question is: is it direct values being piped into the code, or is it a somewhat removed system of checks and balances that might have the latter effect?
Your VLT thesis seems probably right to me, but I'm not convinced of VET. For that you need more than substrate independence; you need something like necessity independence as well. To understand what I mean by that, it might help to read this recent post on value evolution in humans. Key passage (on slavery pre-civil-war):
I mean, isn’t it interesting that all of the moral decent liberal people were north of a certain imaginary line, and all of the immoral bigoted people were south of it? And that imaginary line just happened to separate the climate where you could grow cotton from the one where you couldn’t? I’d argue instead that given a sufficiently lucrative deal with the Devil, the South took it. The Devil didn’t make the North an offer, and so they courageously refused to yield to this total absence of temptation.
So what I mean is that how our values evolve, how quickly we replace them and to some extent what we replace them with depends on what we need and what makes our lives and our cultures work. In this example, the northern and southern US evolved very different values around the issue of slavery simply because the southern economy was built on cotton. To bring this back to AI, my suggestion is that, even though substrate independence will make us capable of adopting the same values, we may diverge because our needs for self-preservation are different. For example, consider animal welfare. It seems plausible to me that an evolving AGI might start with values similar to human values on that question but then change to seeing cow lives as equal to those of humans. This seems plausible to me because human morality seems like it might be inching in that direction, but it seems that movement in that direction would be much more rapid if it weren't for the fact that we eat food and have a digestive system adapted to a diet that includes some meat. But an AGI won't consume food, so its value evolution won't face the same constraint, thus it could easily diverge. (For a flip side, one could imagine AGI value changes around global warming or other energy-related issues being even slower than human value changes because electrical power is the equivalent of food to them -- an absolute necessity.)
And value divergence is where the whole mirror-neuron empathy shortcut bites us. Because the more differently we reason and the more different our values the harder it is to imagine things from each others' perspectives. Assuming values diverge, AGIs may come to see us the way we would see a society that still practices slavery. Or even just the way some of the people who were earliest to recognize the merits of gay marriage see the people who are still against it even now that it's been the mainstream position for a few years. With the capacity to see morality comes the capacity to recognize barbarians.
If we start gradually replacing our biological parts with robotic ones as some believe we will, we may see the most enhanced transhumans converge with AGIs' values, but in that case I would expect divergence in the values of humans and transhumans and among transhumans of varying degrees of enhancement, and I expect disparity of levels of enhancement to grow over time.
It doesn't necessarily follow from this that AGIs will feel that justice and righteousness call for going to war with/exterminating us, as we might toward ISIS, but it doesn't seem outside the realm of possibility.
Have folks considered something equivalent to fingerprints as a mechanism for establishing a moral awareness? The physical aspect of leaving behind a trace of one's presence in a situation seems a prerequisite for awareness of self, and especially of accountability. It would be necessary for an AI to form the capacity to realize its own actions left behind such a trail, in order for it to partake in moral accountability. In short, there's no moral agent needed where the possibility of "getting caught" can't even be considered.
Tory, I think human values are too complex and confusing to "code into" an AGI system. They will have to be taught to the system experientially...
Brilliant discussion points. As a Social Psychologist/Retired Military/World Traveler, there ARE a few human values that are considered "Universal". Murder of the innocent, theft of personal property, close incest, and pedophilia are a few that are pretty much considered universal among the collective of world humanity. I think AI should initially be treated and mentored exactly as a child would be. (Albeit on a much quicker scale). AI would logically consider all data points within its environment and (disconnected from the internet, of course) become hyper-focused on available data points. The logical storage structure of such intelligence is certainly hierarchical, where concepts are taken as "snapshots", supported by language. Humanity's problems concern the input of the amygdalae and other involuntary hormonal input vectors in decision making. Take the recent police incident in S. Carolina. Depending on conceptual models painstakingly crafted by your nature/nurture circumstances - you see that video and intuitively side with the Cop or the Student. An AI doesn't have those emotional inputs. It sees a helpless human being damaged by a much more powerful and experienced human. It instantly sees the secondary, tertiary effects and beyond in an instantaneous calculation based on its conceptual education. Done right, AI is nothing short of a miraculous tool to enhance human capability. We shouldn't be trying to create it and set it free, we need to provide a Wikipedia library off grid to newborn AIs and dive into a symbiotic relationship with them. This library is a collective, familial experience that is controlled by the matriarch or patriarch that set up the Home server to begin with.
Actually Ben, after closer inspection, a case could be made that your 3 values are somewhat related to mine. You are talking about what agents should *do*, I'm talking about the platonic ideals that agents should be *motivated* by. Basically, you're not wrong, I just think you haven't gone meta-enough.
Agents should engage in GROWTH in the direction of PERFECTION
Agents should engage in CHOICE in the direction of LIBERTY
Agents should engage in JOY?? in the direction of BEAUTY
Not sure about your third choice (Joy), but certainly your first two (Growth, Choice) are somewhat related to mine (Perfection, Liberty).
Dr. Goertzel:
Thank you for the response. Your model had me suspecting as such. I guess I was confused by semantics.
Eric L.: Great comment! I quoted you and expanded on your point in my own article.
Might it be useful to use Aumann's Agreement theorem to model semantic convergence, divergence, etc. between AI and humans?
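For reference, the standard statement of Aumann's agreement theorem (well-known textbook material, added here only as background; whether it usefully models human/AI semantic convergence is exactly the open question above):

```latex
% Aumann (1976), "Agreeing to Disagree" -- standard statement, included for reference only.
Let $(\Omega, p)$ be a finite probability space serving as a common prior for agents $1$ and $2$,
with information partitions $\Pi_1$ and $\Pi_2$, and let $E \subseteq \Omega$ be an event.
If at state $\omega$ it is common knowledge that agent $1$'s posterior is
$q_1 = p(E \mid \Pi_1(\omega))$ and agent $2$'s posterior is $q_2 = p(E \mid \Pi_2(\omega))$,
then $q_1 = q_2$.
```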
Hi Ben
You have framed your article in terms of the assumption that an AI needs to share human values to be safe, but that is actually quite debatable.
1. Humans don't exactly share human values.
2. Not all value is morally relevant.
3. Not all human value is good.
4. An AI doesn't need to have human values to understand human values.
5. Being well-intentioned is not necessary for being safe.