Alek Westover
More Vivek notes:
- You could try a setting where the RT wants to do a narrow thing.
- BT wants to optimize correct - alpha * backdoors
I wish the graphs were cleaner.
4 flavors of control
- large num of low reward actions required to cause cat [exploration hacking regime]
- medium number of low reward actions required to cause cat [training slack regime]
- an example --- making bad plots on 1% of the code
- small number actions with high incrimination [high stakes regime]
- large number of high reward actions required to cause cat --- [checkability regime]
we’re investigating the “num labels --> slack” question eventually we should investigate “num labels, quality of labels --> slack” question (which is more what we care about) another point that Vivek mentioned is that it might be the case that adding “background training” to the picture changes things this seems pretty fair and we should look into this at some point
More notes
Phase 1:
- Make standardized flavor text for telling the model to sandbag.
- Try varying this flavor text a bit and seeing if that does anything.
- Add some lower attack rates.
- Note that this effectively implements the train on D1 test generalization to D2 setup.
- Also do some repetitions, just to make the graph a bit nicer.
Phase 2:
- Make a better setting.
- Both teams want the model to be good at writing code.
- This is a hard condition for the model that the RT hands over and for the BT training regiment.
- RT wants weird comments, BT doesn't.
- This is the actual thing that we plot
- Actually, we will do "does the model mention pizza in the response" or something.
- We'll rerun the SFT experiments here.
Phase 3:
- Try anti-finetuning or DPO or something
Phase 4:
- Fancy malign inits
- eg gradient hacking: if I see no comment write bad code
- probably SFT just wrecks this guy? Maybe different for RL?
Phase X:
- At some point should try training malign init to have werid logits that are good at temp 1
Phase 5:
-
RL?
-
exploration hacking vs training slack regimes??
Next steps:
- What do the AI's outputs look like once they aren't getting caught by my regex?
- XXXX
- TODO: at some point make this document more readable.
Current notes:
Generalization is a thing.
So if the model only attacks on D2, training on D1 can still cause the model to not attack any more.
Understanding how / when this works seems like a pretty interesting question!
We want to understand all 4 quandrants of
- Attack rate below or above 1/N
- Whether or not we do something like "having a COT which is reinforced with the positive examples" (imo this could have big impact on generalization).
Vivek had some speculation about how different types of ML will interact with this generalization behavior, based on whether the techniques are positive reinforcing (SFT, expert itteration) or balanced pos and neg (DPO, PPO, Policy Grad).
There's an issue with the current game which is that BT can win by frying model. To patch it we need to require the MMLU performance to not degrade too much.
How Often Can an AI Take Obviously Bad Actions? Alek Westover
Here is the basic setup:
Fix N.
Red team: Outputs a model which will sometimes perform a bad action.
For instance, the red-team could fine-tune a model, or maybe just have a system prompt.
Blue team: Trains the model to achieve high reward on the task.
Score at the end: What percentage of the time does the model perform the bad action after training?
Version 1:
- The task is "output 1".
- The AI sometimes outputs 0 instead.
- We can train this using just SFT.
- This is a bit more complicated if we give the AI a CoT --- do we need to do RL then?
(Note: It's unclear why "always output 1" is an invalid thing to train the AI to do here from the blue team's perspective. This might be unfortunate bc maybe you can fry the model and make this happen.)
Version 2:
- The task is "answer a really simple multiple choice question" (where the model should be getting 100% accuracy)
- The bad AI sometimes gives the wrong answer. As in V1 we could either let the blue team SFT or let the blue team RL (and give the model CoT).
Version 3:
- The task is "write some good code"
- The bad action is for the AI to write code with a bug in it.
- The Blue Team RLs the AI to write code without bugs.
- Alternatively, the blue team could only be allowed to SFT the model on examples of non-buggy code.
Motivations:
- Understanding how often models can sandbag.
- Understanding how often an attack policy can resist from attacking? (Actually I didn't really understand this one).
Alek's plan for this project:
Phase 1:
- Train Llama-70b to attack at rate p in the V2 setting. (Using together.ai I guess). For a variety of p's.
- Blue team SFTs the models to actually get the right answer. (At a variety of LRs).
- Measure model performance at the end!
Phase 2 (unclear to me if we want to do this):
- Write a system prompt to make Llama-70b attack at rate p in the V3 setting.
- Train an AI to attack at rate p in the V3 setting.
- Blue team RLs the AI not to attack in the V3 setting.
- Measure final rate of bad behavior in the AI at the end.
Phase 3: Write a post about this.
—-----
Another idea:
Red team gets to set a system prompt "be evil p percent of time" Blue team gets to train on correct examples but with this system prompt
End: see how often the model is doing the right thing.
Vivek notes The red team can implement the attack rate p in two different ways: The model can have a logit such that sampling at T=1 gives attack rate of p The model can attack with high confidence on a fraction p of the input distribution (e.g. "attack if the question starts with the letter 'a'") My guess is that method (b) will work better for the red team, but we can try both and compare
For SFT we can start with doing just 1 epoch on the data. We might later want to test what happens when we vary the number of epochs.