MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

Published: 22 January 2025

Abstract

Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method that prevents agents from learning undesired multi-step plans that receive high reward, even when humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not have access to. We study MONA empirically in three settings which model different misalignment failure modes, including 2-step environments with LLMs, representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments, representing sensor tampering.
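As a rough illustration of what "short-sighted optimization with far-sighted reward" could mean in practice, the sketch below contrasts the per-step training targets of ordinary RL with a myopic variant. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `approvals` list (standing in for an overseer's judgement of how useful each action looks for the future), and the example numbers are all hypothetical.

```python
from typing import List


def ordinary_rl_targets(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go targets: each step is credited with every later reward."""
    targets = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


def myopic_targets(rewards: List[float], approvals: List[float]) -> List[float]:
    """Myopic targets (assumed form): each step is credited only with its own
    reward plus a hypothetical overseer approval score for that step.
    No later reward flows back to earlier steps."""
    if len(rewards) != len(approvals):
        raise ValueError("rewards and approvals must have the same length")
    return [r + a for r, a in zip(rewards, approvals)]


# Hypothetical 2-step episode: step 1 sets up a plan, step 2 collects reward.
rewards = [0.0, 1.0]
approvals = [0.2, 0.0]  # assumed overseer scores, for illustration only

ordinary = ordinary_rl_targets(rewards)
myopic = myopic_targets(rewards, approvals)
```

In the ordinary case the first step is reinforced by the reward collected at the second step, so a setup action that enables a later reward hack is credited for it. In the myopic case the first step is reinforced only to the extent the overseer approves of it, which is the intuition behind the abstract's claim that multi-step hacks are not incentivised.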

Authors

Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah

Venue

arXiv

