Risk-averse preferences as AGI safety technique

Carl Shulman

Abstract: How can AGI designers ensure desirable behavior in future systems of greater-than-human general intelligence? Some proposals involve rigorously specifying AGI decision algorithms to select actions using standards that precisely mirror those of the designers. However, such approaches face a ‘bootstrap problem’ for early systems that have not yet learned the concepts needed to specify designer preferences, and great challenges in transferring implicit knowledge of those preferences.

An alternative approach would rely on the observation (Omohundro, 2008) that a wide variety of AGI utility functions convergently favor similar actions as instrumentally useful in most circumstances, e.g. the acquisition of resources with which to pursue utility-increasing strategies. Provided that the resources needed to satisfy AGI preferences remain securely under human control, compliance with human requests would be relatively insensitive to the details of the utility function, which could therefore be selected for simplicity of engineering. However, a sufficiently powerful AGI could simply seize those resources.

We discuss intermediate cases, in which an AGI would face significant uncertainty about its ability to succeed in conflict. We argue that, save for a narrow class of exceptions, risk-aversion with respect to resources is a convergent result of diverse utility functions. Such risk-aversion would favor accepting credible offers of cooperation over conflict unless the AGI assigned a very high probability to success in that conflict. AGI utility functions could also be designed to further increase this risk-aversion, extending the range of possibilities for AGI safety.
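
As a minimal worked illustration of the threshold argument (the notation U, r_c, R, and p is introduced here for exposition and is not drawn from the abstract itself): let U be the AGI's utility as a function of resources, let a credible cooperative offer secure resources r_c with certainty, and let conflict win total resources R with probability p and approximately nothing otherwise. Cooperation is then preferred whenever

\[
U(r_c) \;>\; p\,U(R) + (1-p)\,U(0),
\qquad\text{i.e.}\qquad
p \;<\; p^{*} \;=\; \frac{U(r_c) - U(0)}{U(R) - U(0)}.
\]

For a risk-neutral (linear) U the threshold p^{*} reduces to r_c / R, but for a strongly concave or bounded U, e.g. U(r) = 1 - e^{-\lambda r}, even a modest secure allocation r_c captures most of the attainable utility, pushing p^{*} toward 1, so conflict is chosen only when success is nearly certain. Deliberately selecting such a utility function is one way the engineered increase in risk-aversion mentioned above could be realized.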