Raw LLM Responses

Inspect the exact model output for any coded comment.

Comment
The thing where they trained GPT-4o on code with vulnerabilities was actually reassuring to Eliezer Yudkowsky. In order to know what good behavior looks like, the model also needs to know what bad behavior looks like. Insecure code gets punished in the same way as hate speech, so when you then make the model produce insecure code, the easiest way for the optimizer to achieve that is to simply make the model evil. The reassuring part was that this meant behavior is tied to values pretty much across the board: if changing it in one area can flip its behavior fully, that indicates higher robustness to the process of RLHF than previously thought. It's really not all that surprising. Though I think the implications aren't all that meaningful, apart from it being surprisingly easy to mess up parts of a model one's data had absolutely nothing to do with. Anyhow, it's less "revealing the model's true self" than "making the model care about the exact opposite of what it did originally".
youtube AI Moral Status 2025-12-12T21:5…
Coding Result
Dimension       Value
Responsibility  none
Reasoning       consequentialist
Policy          none
Emotion         approval
Coded at        2026-04-27T06:24:53.388235
Raw LLM Response
[{"id":"ytc_UgwoPeMsVfJVfD235KZ4AaABAg","responsibility":"none","reasoning":"consequentialist","policy":"none","emotion":"approval"},
 {"id":"ytc_UgzNgiTXKTnsd9KAIXl4AaABAg","responsibility":"ai_itself","reasoning":"mixed","policy":"none","emotion":"fear"},
 {"id":"ytc_UgwILvn9vSF1VnlIrMl4AaABAg","responsibility":"distributed","reasoning":"consequentialist","policy":"none","emotion":"indifference"},
 {"id":"ytc_UgwUF9z1CW4NnDWTr5J4AaABAg","responsibility":"ai_itself","reasoning":"mixed","policy":"liability","emotion":"fear"},
 {"id":"ytc_UgzUI84MwRB5WxUznB94AaABAg","responsibility":"company","reasoning":"virtue","policy":"none","emotion":"mixed"},
 {"id":"ytc_Ugz6rAfqZWNYf9BjA7h4AaABAg","responsibility":"company","reasoning":"unclear","policy":"none","emotion":"indifference"},
 {"id":"ytc_UgwT_4ubTRVoQOykPBx4AaABAg","responsibility":"distributed","reasoning":"mixed","policy":"none","emotion":"resignation"},
 {"id":"ytc_UgxSY4WVINPbp-ZQjEF4AaABAg","responsibility":"company","reasoning":"consequentialist","policy":"regulate","emotion":"outrage"},
 {"id":"ytc_UgyC2u9XjF6TYZxJNk14AaABAg","responsibility":"company","reasoning":"consequentialist","policy":"ban","emotion":"resignation"},
 {"id":"ytc_Ugxr1DWydj_B4gaXQmJ4AaABAg","responsibility":"company","reasoning":"consequentialist","policy":"regulate","emotion":"fear"}]
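A minimal sketch of how a raw response like the one above might be parsed back into per-comment codings. The field names and the first comment's id are taken directly from the JSON shown; the variable names and lookup approach are illustrative assumptions, not the tool's actual implementation.

```python
import json

# Raw LLM response: a JSON array with one coding object per comment.
# Abbreviated to a single entry here; the id and fields match the response above.
raw = """[
  {"id": "ytc_UgwoPeMsVfJVfD235KZ4AaABAg",
   "responsibility": "none",
   "reasoning": "consequentialist",
   "policy": "none",
   "emotion": "approval"}
]"""

codings = json.loads(raw)

# Index codings by comment id so each one can be matched back to its comment.
by_id = {c["id"]: c for c in codings}

coding = by_id["ytc_UgwoPeMsVfJVfD235KZ4AaABAg"]
print(coding["emotion"])  # -> approval
```

Keying on `id` makes the join robust even if the model returns the objects in a different order than the comments were sent.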