Joe Carlsmith Audio

Why focus on schemers in particular? (Sections 1.3-1.4 of "Scheming AIs")

November 16, 2023 Joe Carlsmith

This episode covers sections 1.3-1.4 of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Text of the report here: https://arxiv.org/abs/2311.08379
 
Summary of the report here: https://joecarlsmith.com/2023/11/15/new-report-scheming-ais-will-ais-fake-alignment-during-training-in-order-to-get-power
 
Audio summary here: https://joecarlsmithaudio.buzzsprout.com/2034731/13969977-introduction-and-summary-of-scheming-ais-will-ais-fake-alignment-during-training-in-order-to-get-power

1.3 Why focus on schemers in particular?
1.3.1 The type of misalignment I’m most worried about
1.3.2 Contrast with reward-on-the-episode seekers
1.3.2.1 Responsiveness to honest tests
1.3.2.2 Temporal scope and general “ambition”
1.3.2.3 Sandbagging and “early undermining”
1.3.3 Contrast with models that aren’t playing the training game
1.3.4 Non-schemers with schemer-like traits
1.3.5 Mixed models
1.4 Are theoretical arguments about this topic even useful?