Joe Carlsmith Audio

Full audio for "Scheming AIs: Will AIs fake alignment during training in order to get power?"

November 15, 2023 Joe Carlsmith

This is the full audio for my report "Scheming AIs: Will AIs fake alignment during training in order to get power?"

(I’m also posting audio for individual sections of the report on this podcast, but the ordering was getting messed up on various podcast apps, and I think some people might want one big audio file regardless, so here it is. I’m going to be posting the individual sections one by one, in the right order, over the coming days.)

Full text of the report here:
Summary here:

0. Introduction
0.1 Preliminaries
0.2 Summary of the report
0.2.1 Summary of section 1
0.2.2 Summary of section 2
0.2.3 Summary of section 3
0.2.4 Summary of section 4
0.2.5 Summary of section 5
0.2.6 Summary of section 6
1. Scheming and its significance
1.1 Varieties of fake alignment
1.1.1 Alignment fakers
1.1.2 Training-gamers
1.1.3 Power-motivated instrumental training-gamers, or “schemers”
1.1.4 Goal-guarding schemers
1.2 Other models training might produce
1.2.1 Terminal training-gamers (or, “reward-on-the-episode seekers”)
1.2.2 Models that aren’t playing the training game
Training saints
Misgeneralized non-training-gamers
1.2.3 Contra “internal” vs. “corrigible” alignment
1.2.4 The overall taxonomy
1.3 Why focus on schemers in particular?
1.3.1 The type of misalignment I’m most worried about
1.3.2 Contrast with reward-on-the-episode seekers
Responsiveness to honest tests
Temporal scope and general “ambition”
Sandbagging and “early undermining”
1.3.3 Contrast with models that aren’t playing the training game
1.3.4 Non-schemers with schemer-like traits
1.3.5 Mixed models
1.4 Are theoretical arguments about this topic even useful?
1.5 On “slack” in training
2. What’s required for scheming?
2.1 Situational awareness
2.2 Beyond-episode goals
2.2.1 Two concepts of an “episode”
The incentivized episode
The intuitive episode
2.2.2 Two sources of beyond-episode goals
Training-game-independent beyond-episode goals
Are beyond-episode goals the default?
How will models think about time?
The role of “reflection”
Pushing back on beyond-episode goals using adversarial training
Training-game-dependent beyond-episode goals
Can gradient descent “notice” the benefits of turning a non-schemer into a schemer?
Is SGD pulling scheming out of models by any means necessary?
2.2.3 “Clean” vs. “messy” goal-directedness
Does scheming require a higher standard of goal-directedness?
2.2.4 What if you intentionally train models to have long-term goals?
Training the model on long episodes
Using short episodes to train a model to pursue long-term goals
How much useful, alignment-relevant cognitive work can be done using AIs with
2.3 Aiming at reward-on-the-episode as part of a power-motivated instrumental strategy
2.3.1 The classic goal-guarding story
The goal-guarding hypothesis
The crystallization hypothesis
Would the goals of a would-be schemer “float around”?
What about looser forms of goal-guarding?
Introspective goal-guarding methods
Adequate future empowerment
When is the “pay off” supposed to happen?
Even if the model’s values survive this generation of training, will they survive long
Will escape/take-over be suitably likely to succeed?
Will the time horizon of the model’s goals extend to cover escape/take-over?
Will the model’s values get enough power after escape/takeover?
How much does the model stand to gain from not training-gaming?
How “ambitious” is the model?
Overall assessment of the classic goal-guarding story
2.3.2 Non-classic stories
AI coordination
AIs with similar values by default
Terminal values that happen to favor escape/takeover
Models with false beliefs about whether scheming is a good strategy
Self-deception
Goal-uncertainty and haziness
Overall assessment of the non-classic stories
2.4 Take-aways re: the requirements of scheming
2.5 Path dependence
3. Arguments for/against scheming that focus on the path that SGD takes
3.1 The training-game-independent proxy-goals story
3.2 The “nearest max-reward goal” story
3.2.1 Barriers to schemer-like modifications from SGD’s incrementalism
3.2.2 Which model is “nearest”?
The common-ness of schemer-like goals in goal space
The nearness of non-schemer goals
The relevance of messy goal-directedness to nearness
3.2.3 Overall take on the “nearest max-reward goal” argument
3.3 The possible relevance of properties like simplicity and speed to the path SGD takes
3.4 Overall assessment of arguments that focus on the path SGD takes
4. Arguments for/against scheming that focus on the final properties of the model
4.1 Contributors to reward vs. extra criteria
4.2 The counting argument
4.3 Simplicity arguments
4.3.1 What is “simplicity”?
4.3.2 Does SGD select for simplicity?
4.3.3 The simplicity advantages of schemer-like goals
4.3.4 How big are these simplicity advantages?
4.3.5 Does this sort of simplicity-focused argument make plausible predictions about the sort
4.3.6 Overall assessment of simplicity arguments
4.4 Speed arguments
4.4.1 How big are the absolute costs of this extra reasoning?
4.4.2 How big are the costs of this extra reasoning relative to the simplicity benefits of
4.4.3 Can we actively shape training to bias towards speed over simplicity?
4.5 The “not-your-passion” argument
4.6 The relevance of “slack” to these arguments
4.7 Takeaways re: arguments that focus on the final properties of the model
5. Summing up
6. Empirical work that might shed light on scheming
6.1 Empirical work on situational awareness
6.2 Empirical work on beyond-episode goals
6.3 Empirical work on the viability of scheming as an instrumental strategy
6.4 The “model organisms” paradigm
6.5 Traps and honest tests
6.6 Interpretability and transparency
6.7 Security, control, and oversight
6.8 Other possibilities