The basic issue is: what can you do to help mitigate against the problem of "goal drift", wherein an AGI system starts out with a certain top-level goal governing its behavior, but then gradually modifies its own code in various ways, and ultimately -- through inadvertent consequences of the code revisions -- winds up drifting into having different goals than it started with. I certainly didn't answer the question but I came up with some new ways of thinking about the problem, and formalizing the problem, that I think might be interesting....
While the language of math is used in the paper, don't be fooled into thinking I've proved anything there ... the paper just contains speculative ideas without any real proof, just as surely as if they were formulated in words without any equations. I just find that math is sometimes the clearest way to say what I'm thinking, even if I haven't come close to proving the correctness of what I'm thinking yet...
An abstract of the speculative paper is:
in Self-Modifying Cognitive Systems
A new approach to thinking about the problem of “preservation of AI goal systems under repeated self-modification” (or, more compactly, “goal drift”) is presented, based on representing self-referential goals using hypersets and multi-objective optimization, and understanding self-modification of goals in terms of repeated iteration of mappings. The potential applicability of results from the theory of iterated random functions is discussed. Some heuristic conclusions are proposed regarding what kinds of concrete real-world objectives may best lend themselves to preservation under repeated self-modification. While the analysis presented is semi-rigorous at best, and highly preliminary, it does intuitively suggest that important humanly-desirable AI goals might plausibly be preserved under repeated self-modification. The practical severity of the problem of goal drift remains unresolved, but a set of conceptual and mathematical tools are proposed which may be useful for more thoroughly addressing the problem.