The technical hurdles of aligning LLMs with non-monotonic human values - Lex Fridman

As we move toward more autonomous agents, technical alignment becomes a larger problem than RLHF (Reinforcement Learning from Human Feedback) alone can solve. A significant concern is "reward hacking": when the objective function is inherently underspecified, a complex system can score well on the stated objective while missing what was actually intended. Given that human values are non-monotonic and culturally contingent, can we realistically expect a centralized alignment protocol to scale, or should we look toward decentralized, pluralistic frameworks that allow for a spectrum of ethical weights? I'm interested in hearing perspectives on the trade-offs between safety-centric constraints and the emergence of unforeseen heuristic behaviors.
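
To make the question concrete, here is a toy sketch of what I mean by reward hacking under a non-monotonic value. Everything in it is an assumption made up for illustration: the quadratic `true_value`, the "more is better" `proxy_reward`, the candidate grid, and the group peaks and weights. It is not a model of any real alignment method, just the smallest example I could think of where a monotonic proxy and a non-monotonic preference disagree.

```python
# Toy illustration only: all functions and numbers here are hypothetical.

def true_value(x: float, peak: float = 5.0) -> float:
    """Assumed non-monotonic value: improves up to `peak`, then gets worse."""
    return -(x - peak) ** 2


def proxy_reward(x: float) -> float:
    """Assumed underspecified proxy: treats more of x as always better."""
    return x


def argmax(fn, candidates):
    """Return the candidate with the highest score under `fn`."""
    return max(candidates, key=fn)


def pluralistic_value(x: float, groups) -> float:
    """Weighted sum of per-group values; `groups` is a list of (peak, weight)."""
    return sum(weight * true_value(x, peak=peak) for peak, weight in groups)


# Candidate actions on a coarse grid from 0.0 to 10.0.
candidates = [i * 0.5 for i in range(21)]

# Optimizing the proxy pushes x to the edge of the grid.
hacked = argmax(proxy_reward, candidates)

# Optimizing the (normally unobservable) true value stops at the peak.
intended = argmax(true_value, candidates)

# Two hypothetical groups with different peaks and unequal ethical weights.
groups = [(3.0, 0.75), (7.0, 0.25)]
compromise = argmax(lambda x: pluralistic_value(x, groups), candidates)

print("proxy optimum:", hacked, "true value there:", true_value(hacked))
print("intended optimum:", intended, "true value there:", true_value(intended))
print("weighted compromise:", compromise)
```

In this sketch the proxy optimizer lands on the largest available action even though the assumed true value is worst there, and the weighted version settles between the two groups' peaks, closer to the group with more weight. The part I don't know how to answer is who sets those weights, which is really the centralized versus pluralistic question above.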