Reinforcement learning, a popular paradigm for AI, economics and psychology, models intelligent agents as systems that choose their actions in such a way as to maximize their future reward. There are various ways of averaging future reward over various future time-points, but all of these implement the same basic concept.
I think this is a reasonable model of human behavior in some circumstances, but horrible in others.
And, in an AI context, it seems to combine particularly poorly with the capability for radical self-modification.
Reinforcement Learning and the Ultimate Orgasm
Consider for instance the case of a person who is faced with two alternatives
- A: continue their human life as would normally be expected
- B: push a button that will immediately kill everyone on Earth except them, but give them an eternity of ultimate trans-orgasmic bliss
Obviously, the reward will be larger for option B, according to any sensible scheme for weighting various future rewards.
For most people, there will likely be some negative reward in option B ... namely, the guilt that will be felt during the period between the decision to push the button and the pushing of the button. But, this guilt surely will not be SO negative as to outweigh the amazing positive reward of the eternal ultimate trans-orgasmic bliss to come after the button is pushed!
But the thing is, not all humans would push the button. Many would, but not all. For various reasons, such as love of their family, attachment to their own pain, whatever....
The moral of this story is: humans are not fully reward-driven. Nor are they "reward-driven plus random noise".... They have some other method of determining their behaviors, in addition to reinforcement-learning-style reward-seeking.
Reward-Seeking and Self-Modification: A Scary Combination
Now let's think about the case of a reward-driven AI system that also has the capability to modify its source code unrestrictedly -- for instance, to modify what will cause it to get the internal sensation of being rewarded.
For instance, if the system has a "reward button", we may assume that it has the capability to stimulate the internal circuitry corresponding to the pushing of the reward button.
Obviously, if this AI system has the goal of maximizing its future reward, it's likely to be driven to spend its life stimulating itself rather than bothering with anything else. Even if it started out with some other goal, it will quickly figure out to get rid of this goal, which does not lead to as much reward as direct self-stimulation.
All this doesn't imply that such an AI would necessarily be dangerous to us. However, it seems pretty likely that it would be. It would want to ensure itself a reliable power supply and defensibility against attacks. Toward that end, it might well decide its best course is to get rid of anyone who could possibly get in the way of its highly rewarding process of self-stimulation.
Not only would such an AI likely be dangerous to us, it would also lead to a pretty boring universe (via my current aesthetic standards, at any rate). Perhaps it would extinguish all other life in its solar system, surround itself with a really nice shield, and then proceed to self-stimulate ongoingly, figuring that exploring the rest of the universe would be expected to bring more risk than reward.
The moral of the above, to me, is that reward-seeking is an incomplete model of human motivation, and a bad principle for control self-modifying AI systems.
Goal-Seeking versus Reward-Seeking
Fortunately, goal-seeking is more general than reward-seeking.
Reward-seeking, of the sort that typical reinforcement-learning systems carry out, is about: Planning a course of action that is expected to lead to a future that, in the future, you will consider to be good.
Goal-seeking doesn't have to be about that. It can be about that ... but it can also be about other things, such as: Planning a course of action that is expected to lead to a future that is good according to your present standards.
Goal-seeking is different from reward-seeking because it will potentially (depending on the goal) cause a system to sometimes choose A over B even if it knows A will bring less reward than B ... because in foresight, A matches the system's current values.
Non-Reward-Based Goals for Self-Modifying AI Systems
As a rough indication of what kinds of goals one could give a self-modifying AI, that differ radically from reward-seeking, consider the case of an AI system with a goal G that is the conjunction of two factors:
- Try to maximize the function F
- If at any point T, you assess that your interpretation of the goal G at time T would be interpreted by your self-from-time-(T-S) as a terrible thing, then roll back to your state at time S
Limitations of the Goal-Seeking Paradigm
Coming at the issue from certain theoretical perspectives, it is easy to overestimate the degree to which human beings are goal-directed. It's not only AI theorists and engineers who have made this mistake; many psychologists have made it as well, rooting all human activity in goals like sexuality, survival, and so forth. To my mind, there is no doubt that goal-directed behavior plays a large role in human activity -- yet it also seems clear that a lot of human activity is better conceived as "self-organization based on environmental coupling" rather than as explicitly goal-directed.
It is certainly possible to engineer AI systems that are more strictly goal-driven than humans, though it's not obvious how far one can go in this direction without sacrificing a lot of intelligence -- it may be that a certain amount of non-explicitly-goal-directed self-organization is actually useful for intelligence, even if intelligence itself is conceived in terms of "the ability to achieve complex goals in complex environments" as I've advocated.
I've argued before for a distinction between the "explicit goals" and "implicit goals" of intelligent systems -- the explicit goals being what the system models itself as pursuing, and the implicit goals being what an objective, intelligent observer would conclude the system is pursuing. I've defined a "well aligned" mind as one whose explicit and implicit goals are roughly the same.
According to this definition, some humans, clearly, are better aligned than others!
Summary & Conclusion
Reward-seeking is best viewed as a special case of goal-seeking. Maximizing future reward is clearly one goal that intelligent biological systems work toward, and it's also one that has proved useful in AI and engineering so far. Thus, work within the reinforcement learning paradigm may well be relevant to designing the intelligent systems of the future.
But, to the extent that humans are goal-driven, reward-seeking doesn't summarize our goals. And, as we create artificial intelligences, there seems more hope of creating benevolent advanced AGI systems with goals going beyond (though perhaps including) reward-seeking, than with goals restricted to reward-seeking.
Crafting goals with reasonable odds of leading self-modifying AI systems toward lasting benevolence is a very hard problem ... but it's clear that systems with goals restricted to future-reward-maximization are NOT the place to look.