Value Internalization: Learning and Generalizing from Social Reward
Abstract
Social rewards shape human behavior. During development, a caregiver guides a learner's behavior towards culturally aligned goals and values. How do these behaviors persist and generalize when the caregiver is no longer present, and the learner must continue autonomously? Here, we propose a model of value internalization where social feedback trains an internal social reward (ISR) model that generates internal rewards when social rewards are unavailable. Through empirical simulations, we show that an ISR model prevents agents from unlearning socialized behaviors and enables generalization in out-of-distribution tasks. Incomplete internalization, akin to "reward hacking" on the ISR, is observed when the model is undertrained. Finally, we show that our model internalizes prosocial behavior in a multi-agent environment. Our work provides a framework for understanding how humans acquire and generalize values and offers insights for aligning AI with human values.