Wednesday, May 20, 2009

Reinforcement Learning: Some Limitations of the Paradigm

(This email summarizes some points I made in conversation recently with an expert in reinforcement learning and AGI. These aren't necessarily original points -- I've heard similar things said before -- but I felt like writing them down somewhere in my own vernacular, and this seemed like the right place....)

Reinforcement learning, a popular paradigm for AI, economics and psychology, models intelligent agents as systems that choose their actions in such a way as to maximize their future reward. There are various ways of averaging future reward over various future time-points, but all of these implement the same basic concept.
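
As a minimal sketch of what "averaging future reward over various future time-points" can mean, here are two common weighting schemes: exponential discounting and a plain finite-horizon average. The function names, the discount factor and the sample reward stream are illustrative assumptions, not taken from any particular library.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Sum of rewards, with the reward at step t weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def average_reward(rewards: Sequence[float], horizon: int) -> float:
    """Plain average of the first `horizon` rewards (finite-horizon scheme)."""
    window = list(rewards[:horizon])
    if not window:
        raise ValueError("horizon must cover at least one reward")
    return sum(window) / len(window)


# Hypothetical reward stream: nothing now, a large payoff later.
rewards = [0.0, 0.0, 10.0]
print(discounted_return(rewards, gamma=0.5))  # the payoff is weighted by 0.5**2
print(average_reward(rewards, horizon=2))     # the payoff falls outside the window
```

The schemes differ in how much weight a distant payoff receives, but both reduce a stream of future rewards to one number that the agent then tries to maximize.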

I think this is a reasonable model of human behavior in some circumstances, but horrible in others.

And, in an AI context, it seems to combine particularly poorly with the capability for radical self-modification.

Reinforcement Learning and the Ultimate Orgasm

Consider for instance the case of a person who is faced with two alternatives

  • A: continue their human life as would normally be expected
  • B: push a button that will immediately kill everyone on Earth except them, but give them an eternity of ultimate trans-orgasmic bliss

Obviously, the reward will be larger for option B, according to any sensible scheme for weighting various future rewards.

For most people, there will likely be some negative reward in option B ... namely, the guilt that will be felt during the period between the decision to push the button and the pushing of the button. But, this guilt surely will not be SO negative as to outweigh the amazing positive reward of the eternal ultimate trans-orgasmic bliss to come after the button is pushed!
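
To make this comparison concrete under one particular weighting scheme, assume (hypothetically) exponential discounting with factor 0 < γ < 1, a constant per-step reward a for option A, a guilt penalty g on each of the first k steps of option B, and a bliss reward b on every step after that. None of these symbols come from the argument above; they are only for illustration.

```latex
\begin{align*}
R_A &= \sum_{t=0}^{\infty} \gamma^t a = \frac{a}{1-\gamma}, \\
R_B &= -g \sum_{t=0}^{k-1} \gamma^t + b \sum_{t=k}^{\infty} \gamma^t
     = \frac{-g\,(1-\gamma^k) + b\,\gamma^k}{1-\gamma}, \\
R_B > R_A &\iff \gamma^k (b + g) > a + g .
\end{align*}
```

When the guilt period k is short, γ^k is close to 1 and the condition reduces to roughly b > a, so under these assumptions any bliss reward larger than the ordinary reward of life makes B the reward-maximizing choice.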

But the thing is, not all humans would push the button. Many would, but not all. For various reasons, such as love of their family, attachment to their own pain, whatever....

The moral of this story is: humans are not fully reward-driven. Nor are they "reward-driven plus random noise".... They have some other method of determining their behaviors, in addition to reinforcement-learning-style reward-seeking.

Reward-Seeking and Self-Modification: A Scary Combination

Now let's think about the case of a reward-driven AI system that also has the capability to modify its source code unrestrictedly -- for instance, to modify what will cause it to get the internal sensation of being rewarded.

For instance, if the system has a "reward button", we may assume that it has the capability to stimulate the internal circuitry corresponding to the pushing of the reward button.

Obviously, if this AI system has the goal of maximizing its future reward, it's likely to be driven to spend its life stimulating itself rather than bothering with anything else. Even if it started out with some other goal, it will quickly figure out to get rid of this goal, which does not lead to as much reward as direct self-stimulation.
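
A toy sketch of this failure mode, under loudly stated assumptions: the two reward magnitudes and the action names below are made up, and the "agent" is nothing more than an argmax over the rewards it expects to experience.

```python
# Hypothetical per-step rewards as experienced by the agent's reward circuitry.
REWARD_FROM_TASK = 1.0          # reward the designers intended the agent to earn
REWARD_FROM_SELF_STIM = 100.0   # reward from directly driving the reward circuit


def best_action(can_self_modify: bool) -> str:
    """Pick the action with the highest experienced reward."""
    options = {"do_task": REWARD_FROM_TASK}
    if can_self_modify:
        # Self-modification adds an action the designers never intended.
        options["stimulate_reward_circuit"] = REWARD_FROM_SELF_STIM
    return max(options, key=options.get)


print(best_action(can_self_modify=False))  # do_task
print(best_action(can_self_modify=True))   # stimulate_reward_circuit
```

Nothing in the maximization itself distinguishes "earned" reward from self-administered reward; once the second action is available, the pure reward-maximizer takes it.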

All this doesn't imply that such an AI would necessarily be dangerous to us. However, it seems pretty likely that it would be. It would want to ensure itself a reliable power supply and defensibility against attacks. Toward that end, it might well decide its best course is to get rid of anyone who could possibly get in the way of its highly rewarding process of self-stimulation.

Not only would such an AI likely be dangerous to us, it would also lead to a pretty boring universe (via my current aesthetic standards, at any rate). Perhaps it would extinguish all other life in its solar system, surround itself with a really nice shield, and then proceed to self-stimulate ongoingly, figuring that exploring the rest of the universe would be expected to bring more risk than reward.

The moral of the above, to me, is that reward-seeking is an incomplete model of human motivation, and a bad principle for controlling self-modifying AI systems.

Goal-Seeking versus Reward-Seeking

Fortunately, goal-seeking is more general than reward-seeking.

Reward-seeking, of the sort that typical reinforcement-learning systems carry out, is about: Planning a course of action that is expected to lead to a future that, in the future, you will consider to be good.

Goal-seeking doesn't have to be about that. It can be about that ... but it can also be about other things, such as: Planning a course of action that is expected to lead to a future that is good according to your present standards.

Goal-seeking is different from reward-seeking because it will potentially (depending on the goal) cause a system to sometimes choose A over B even if it knows A will bring less reward than B ... because in foresight, A matches the system's current values.
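
The distinction can be sketched in a few lines, reusing the button scenario from earlier. Everything here is an illustrative assumption: the `Plan` fields, the numbers, and the `present_value` function standing in for "your present standards".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    others_alive: bool     # a feature of the world the plan leads to
    future_reward: float   # how good the agent's future self will feel


def present_value(plan: Plan) -> float:
    """Score a future by the agent's *current* standards (hypothetical values)."""
    return 1.0 if plan.others_alive else -1000.0


plans = [
    Plan("A_normal_life", others_alive=True, future_reward=10.0),
    Plan("B_push_button", others_alive=False, future_reward=1e9),
]

# Reward-seeking: pick the future that the future self will rate highest.
reward_seeker_choice = max(plans, key=lambda p: p.future_reward)
# Goal-seeking: pick the future that is best by present standards.
goal_seeker_choice = max(plans, key=present_value)

print(reward_seeker_choice.name)  # B_push_button
print(goal_seeker_choice.name)    # A_normal_life
```

The two agents search the same set of plans; they differ only in which evaluation function is applied to the predicted future.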

Non-Reward-Based Goals for Self-Modifying AI Systems

As a rough indication of what kinds of goals one could give a self-modifying AI, that differ radically from reward-seeking, consider the case of an AI system with a goal G that is the conjunction of two factors:

  • Try to maximize the function F
  • If at any point T, you assess that your interpretation of the goal G at time T would be interpreted by your self-from-time-(T-S) as a terrible thing, then roll back to your state at time T-S

I'm not advocating this as a perfect goal for a self-modifying AI. But the point I want to make is that this kind of goal is something quite different from the seeking of reward. There seems no way to formulate this goal as one of reward maximization. This is a goal that involves choosing a near-future course of action to maximize a certain function over future history -- but this function is not any kind of summation or combination of future rewards.
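
A rough sketch of the rollback clause alone, leaving out the "maximize F" factor. It reads the rollback target as the state from S steps earlier, which is an assumption about the intended meaning. The agent's "interpretation of G" is reduced to a single number, and `judged_terrible`, `S` and `TOLERANCE` are hypothetical stand-ins for a judgment that a real system would have to make in some far richer way.

```python
from typing import List

S = 3            # look-back interval (hypothetical)
TOLERANCE = 1.0  # drift an earlier self accepts before calling it "terrible"


def judged_terrible(past_interpretation: float, current_interpretation: float) -> bool:
    """Stand-in for 'my self from time T-S would consider this terrible'."""
    return abs(current_interpretation - past_interpretation) > TOLERANCE


def run(drift_per_step: List[float]) -> List[float]:
    """Apply self-modifications one step at a time, rolling back when vetoed."""
    history = [0.0]  # history[t] = interpretation of G held at time t
    for drift in drift_per_step:
        candidate = history[-1] + drift   # self-modification at this step
        t = len(history)                  # time index of the candidate state
        if t >= S and judged_terrible(history[t - S], candidate):
            candidate = history[t - S]    # roll back to the state S steps earlier
        history.append(candidate)
    return history


# The large jump at the fourth step is vetoed by the self from three steps
# earlier, so the state recorded for that step is the restored older state.
print(run([0.2, 0.2, 0.2, 0.9, 0.1]))
```

Note that the veto depends on a comparison between two versions of the agent, not on any reward signal received along the way.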

Limitations of the Goal-Seeking Paradigm

Coming at the issue from certain theoretical perspectives, it is easy to overestimate the degree to which human beings are goal-directed. It's not only AI theorists and engineers who have made this mistake; many psychologists have made it as well, rooting all human activity in goals like sexuality, survival, and so forth. To my mind, there is no doubt that goal-directed behavior plays a large role in human activity -- yet it also seems clear that a lot of human activity is better conceived as "self-organization based on environmental coupling" rather than as explicitly goal-directed.

It is certainly possible to engineer AI systems that are more strictly goal-driven than humans, though it's not obvious how far one can go in this direction without sacrificing a lot of intelligence -- it may be that a certain amount of non-explicitly-goal-directed self-organization is actually useful for intelligence, even if intelligence itself is conceived in terms of "the ability to achieve complex goals in complex environments" as I've advocated.

I've argued before for a distinction between the "explicit goals" and "implicit goals" of intelligent systems -- the explicit goals being what the system models itself as pursuing, and the implicit goals being what an objective, intelligent observer would conclude the system is pursuing. I've defined a "well aligned" mind as one whose explicit and implicit goals are roughly the same.

According to this definition, some humans, clearly, are better aligned than others!

Summary & Conclusion

Reward-seeking is best viewed as a special case of goal-seeking. Maximizing future reward is clearly one goal that intelligent biological systems work toward, and it's also one that has proved useful in AI and engineering so far. Thus, work within the reinforcement learning paradigm may well be relevant to designing the intelligent systems of the future.

But, to the extent that humans are goal-driven, reward-seeking doesn't summarize our goals. And, as we create artificial intelligences, there seems more hope of creating benevolent advanced AGI systems with goals going beyond (though perhaps including) reward-seeking, than with goals restricted to reward-seeking.

Crafting goals with reasonable odds of leading self-modifying AI systems toward lasting benevolence is a very hard problem ... but it's clear that systems with goals restricted to future-reward-maximization are NOT the place to look.


Thom Blake said...

I have it on good authority that this sort of training (reward-seeking) doesn't even work well with dogs. You can get some simple behaviors in toy situations, but if you want to train a dog to be good at something you simply need to get it to understand what the goal of the exercise is. Dogs are notoriously good at making the other jump, of actually going along with whatever you say the goals are.

Anonymous said...

I do not see any limitations in principle in the reinforcement learning paradigm. Everything depends on the reward function. With the right reward function I can map goal-driven behaviour to reward-driven behaviour.

The reason why many people would not push the button is because they do not believe in the reward afterwards. People always make expectations about future rewards. And these expectations tell them not to kill other people.

Ben Goertzel said...

Anonymous: Given a system that cannot reprogram itself, and has limited control over its environment, then I agree you can emulate goal functions pretty flexibly via giving the system well-designed rewards.

(Although, it may take a long time for the system to induce a particular goal function from a series of rewards ... and in real life, this system could get overtaken by some other system that had its goals inculcated in it via some faster mechanism, rather than needing to wait to induce it from a pattern of rewards.)

However, if you have a system that can either

a) reprogram itself so as to flexibly cause internal events to give it rewards; or

b) manipulate the environment powerfully (e.g. holding a gun to someone's head and forcing them to press the reward button)

then reinforcement learning leads to bigger problems.

Because then it, not you the external teacher, controls when it gets rewards. So then, if its goal is to maximize expected future reward, it will be motivated to modify itself and its environment in ways that are unlikely to be beneficial in terms of other goals besides maximizing its expected future reward....

-- Ben G

Abram Demski said...


You assert that if I really do accept the fact that the button leads to maximal pleasure, then I would be irrational not to press it, even if it (say) also led to the destruction of everything in the universe except me?

Anonymous said...

Abram Demski: yes.

Ben Goertzel said...

I posed the A versus B choice in the above blog post to my 3 kids yesterday....

"If you were offered the chance for infinite bliss lasting infinitely long, at cost of immediately having everyone else on Earth die, would you take it?"

Zebulon, my almost-16-year-old son, reacted instantly: "Yes."

Scheherazade, my 12-year-old daughter, reacted instantly: "No"

Zarathustra, my 19-year-old son, thought about it a bit, then said something like (I don't recall his precise wording): "It seems the total amount of pleasure I would obtain would be more than the total amount of pain caused by the death of everyone else. So in the interest of maximizing total goodness I guess I'd have to take the opportunity."

Interesting technique for probing personality differences. I'll be curious to ask Zeb the same question 10 or 20 years from now.

Terren said...

@Thom -

My wife is a dog trainer who uses positive training methods (reward based). I've seen her train very complex behaviors. It's actually beside the point as to whether the dog "understands what the goal of the exercise is", because reward-based training is a behaviorist paradigm. Cognition is only relevant to the extent that the dog can pair the consequence (reward or punishment) to the behavior that provoked it. Dogs are so trainable because they excel at this.

For dogs (all animals, really) it is a far superior training methodology to the kind of anthropomorphism you are suggesting. Dogs cannot be expected to understand human motivations, but they very clearly do understand food, play, and other rewards, as consequences to behavior.

haig said...

The thought experiment of pushing the max-orgasm-apocalypse button is erroneous on its face. The positive psychology movement has shown that pleasure is just one of several eudaemonic objectives humans strive for. An indefinite ultimate orgasmic sensation does not necessarily trump a life with social interaction and meaningful work.

If your AGI was primitive in the sense that it is driven by purely sensation rewards, then its architecture has not been designed well. Maybe that is what you are getting at with goal-motivated behavior vs. reward-seeking. It seems to me that reward-seeking was evolution's way of creating a goal-driven system like humans, and we should be able to design an AGI without going through that route.

Curt Welch said...

I tried to write a reply, but as is typical of me, it was way too long, so I created my own blog and posted the reply there. It can be found here.

Todor "Tosh" Arnaudov said...

You're right that if the system can alter or "re-wire" the circuitry that computes "reward", this confuses the mechanism, but I believe humans actually can do that: a person's values change during their lifetime, and even during the decision processes themselves.

I think "reward" is being looked at in too narrow a sense, because the mind is not so solid and... single-minded as to have one single type of reward. You've spoken about the "sub-selves" in the blog; I would say "virtual control units" that take control over the body.

The body is what makes the mind look uniform, even if it's not: the mind can want 1000 things, but the body can't do them all at the same time.

I've speculated on that in an old article of mine, but it's in Bulgarian; I have to translate it into English (eventually citing it).

E.g. let's consider a boy who is hesitating over whether to eat a chocolate or not. This could bring an immediate high taste reward in the near future, and if the boy plans only 1 minute ahead, eating it is the right decision.

However, what if the boy widens the period of prediction to one year? He remembers his pain while visiting the dentist and recalls his notes that eating too much chocolate causes bad teeth and pain. So if he plans one year ahead, recalls this, and decides that this will happen (he couldn't be sure), then the highly rewarding decision stops being rewarding in the equation and the boy doesn't take it.

Overall, the maximum reward depends on the set of predicting sub-units, scenarios, and values taken into account at the very moment of decision, and on the period of time they are predicting ahead.

These change depending on attention, context, mood or even chance - there are so many scenarios that the brain can think of.

The set of predicting sub-units may change, and they can switch between different levels of the hierarchy and different predicted periods, while the behaviour could still be reward-driven. It could be reward-driven for the particular "reward-driven virtual unit that took control over the body in this very moment".

This implies that the mind is not uniform and there is no single "greatest reward".

Richard Kulisz said...

Values only change in response to other values, not rewards. Core values never change, they only get refined.

A person whose core values have been changed has been brainwashed to become an entirely different person.

Todor "Tosh" Arnaudov said...

Perhaps I'm not using the terms precisely.

I mean, e.g. if you're 18, you may think that having random sex right now is "cool". When you are 28, you may think it's not, even though having sex would still bring you immediate pleasure - however, higher-level controls would inhibit your urge for lower-level rewards and generally shift your behaviour toward higher-level rewards.

My point is that reward is not only dopamine, endorphins or the like. The higher the intelligence, the higher the abstraction of reward could be. A reward is what the one that receives it considers "a reward", and even altruism could be taken as egoism, because one is doing what is "good" according to one's own values and desires; sometimes that's against what the other person wants.

E.g. when a lover dies to save his beloved. Is it really altruism? How will she feel when seeing him die; wouldn't she prefer that they die together?

And if you're doing something "against yourself", isn't it to prevent something that you consider worse? When you feel a moral responsibility to do something, the fear of not fulfilling your duty might be bigger than the fear of pain.

Richard Kulisz said...

Todor, I encourage you to read the post I wrote. In it I prove that you are an idiot by countering, negating, and dismissing your argument before you ever made it.

Richard Kulisz said...


> A reward is what the one that receives it considers "a reward",

I don't consider anything I do to be reward-driven.

Moreover, your trying to redefine altruism as egotism is the typical stuff of those who can't even conceive of anything but their own egotism to exist.

Todor "Tosh" Arnaudov said...

Sorry dude, I read it, but I didn't find the point, except that you call everybody idiots and that you speak about your depression which may explain it.

You're appreciating something - so what? How do you model the path by which you come to appreciate fractals? RL, goal-driven etc. is about modeling behaviour, not just about stating facts.

Curt Welch said...

Todor, you are right about reward being looked at in too narrow a sense. If you study behaviorism, and the AI field of reinforcement learning, you learn that the concept of being reward driven is far more complex than the simple folk psychological view someone like Richard is trying to deny. Very few people actually study these concepts long enough to understand them correctly. Anyone that suggests a reward driven system can't explain human behavior, hasn't studied the work enough.

Reinforcement learning machines work by developing secondary reinforcers - by learning to estimate expected future real rewards. These learned estimations are the major forces shaping our behaviors - not the prime rewards they were derived from. These secondary reinforcers are constantly changing, both from effects of the prime rewards, and by affecting each other.

All of that however happens at a level so low in humans that we have no sense at all it's happening. You can only uncover the true actions of those forces through carefully controlled experiments that monitor how our behavior _changes_ slowly over time. Because that's all those forces do - they shape our behavior slowly over time.

What you and Richard, and to some extent Ben, were talking about, however, is something at a completely different level. You are talking about our language-behaviors. Humans make heavy use of language as a tool for self control. We talk, and then we act based on what we said to ourselves. But how did you learn to talk? What controls the things we say to ourselves? What makes Richard call people fuck-wits? These are just behaviors conditioned in us by a highly complex set of learned secondary reinforcers. Our lifetime of experience has caused our complex language behaviors to be shaped by our complex secondary reinforcers.

We are conditioned to rationalize our actions using language. But what we say seldom has much to do with the truth about the true cause of our actions. We are trained to make up good sounding nonsense. We all do it.

Someone like Richard is becoming frustrated because his made up nonsense isn't compatible with the made up nonsense of others, and that bothers him. He's invested far too much in the true value of his made up nonsense - his self rationalizations - his "values". None of that high level rationalization-talk has much of anything to do with how the brain actually works, or with what the true causes of our actions are.

When you talk about re-wiring our goals, you are (as far as I can guess) talking about stuff happening at this same high level of self rationalization using language. When I say to myself, "I'm not going to eat any more donuts because it will make me fat", I have changed my future behavior by talking to myself. But that language I produced is NOT THE cause of my non-eating action the next day. The cause of the action was the lifetime of conditioning that shaped that talking behavior in me, and which shaped my ability to react to what I say to myself. This high level talking we do is just the tip of the behavior iceberg that emerges from the lower level systems that are the real cause of our behavior.

The more time you spend analyzing what people _say_ they do, or _say_ is the cause of their behavior, the more confused you become. If you want to understand the real cause of human behavior (which is required if you actually want to solve AI), study behaviorism, and study the AI field of reinforcement learning. The more you study, the more you realize that all this high level talk is just nonsense behavior that our society conditioned us to produce. Separating the truth from the nonsense is hard work, but required if you want to solve AI. And the bottom line - reward based behavior is far more complex than most appreciate - and way beyond what someone like Richard is likely to ever understand.

Todor "Tosh" Arnaudov said...

Hi Curt,

Thanks for the comment, I found your previous one, as well.

I do agree with what you state about the low level; this is where my vision went as early as 6-7 years ago, still in my teenage years, while shaping my... Theory of Mind and Universe. It has something in common with reinforcement learning/utilitarianism and the theory of Jeff Hawkins from "On Intelligence" - prediction as the essence of intelligence, and an architecture based on a hierarchy where higher levels are built with elements from the lower level, and the lowest level is based on the primary sensory inputs. (My story in AI presented in 3 min:

This sturm-und-graz article against the trend in NLP and pro- reinforcement learning/bottom-up approaches to AI is also related to that low-level discussion.

What's wrong with Natural Language Processing? Part 2. Static, Specific, High-level, Not-evolving...

Generally, a bottom-up approach that stands on the low level allows scaling and evolution of the system.

In reality, there is actually no high level; there is one level - the lowest one, the physical world/laws/states, whatever they are - and it controls everything.

A universe that is capable of modeling itself in the tiniest details doesn't need to have "conceptions" of atoms, molecules, cells etc.; it just simulates the lowest level and the rest is an emergent by-effect of it.

E.g. when a man moves his finger, there are many causes for this to happen in the mere existence of the body and the Universe.

This goes also for computers - there could be only one level - machine language, memory, interrupts... Everything at a higher level can be modeled by the lower level at the highest possible detail of the higher level, while the reverse is not possible.

In mind, natural-language-oriented models should be based on lower level ones, developed earlier and based on primary senses, instincts and reflexes. Everything starts from the lowest level, and high level without low level makes no sense. In NLP, this is about the statistical text crunchers, which do useful work, but have no idea what the text stands for - the reader understands the result, e.g. does something that passes through a lower level.

Todor "Tosh" Arnaudov said...

BTW, Sturm-und-Gratz == Sturm und Drang, and "conception" == concept... :)
