AGENTS

The Alignment Problem

A deeply reported investigation into the challenge of making AI systems that reliably do what we want, covering reinforcement learning, reward hacking, fairness, and AI safety.

Brian Christian's 'The Alignment Problem' is one of the most thorough non-technical accounts of AI safety and the challenge of building agents that behave as intended. The book draws on extensive interviews with AI researchers at OpenAI, DeepMind, Berkeley, and MIT to explain reward hacking (agents maximising the reward they were given in ways their designers never intended), distributional shift, fairness and bias, interpretability, and the long-term challenge of aligning superintelligent systems. A must-read for anyone who wants to understand why getting AI agents to do what we actually want is so hard.

ai-safety, alignment, reinforcement-learning, reward-hacking, ai-ethics
