Raw LLM Responses
Inspect the exact model output for any coded comment.
Comment
Summarized Article:
Here are the key points from the paper "How Is ChatGPT's Behavior Changing over Time?":
- The paper evaluates how the behavior of GPT-3.5 and GPT-4 changed between March 2023 and June 2023 versions on 4 tasks: math problems, sensitive questions, code generation, visual reasoning.
- For math problems, GPT-4's accuracy dropped massively from 97.6% to 2.4% while GPT-3.5's improved from 7.4% to 86.8%. GPT-4 became much less verbose.
- For sensitive questions, GPT-4 answered fewer (21% to 5%) while GPT-3.5 answered more (2% to 8%). Both became more terse in refusing to answer. GPT-4 improved in defending against "jailbreaking" attacks but GPT-3.5 did not.
- For code generation, the percentage of directly executable code dropped for both models. Extra non-code text was often added in June versions, making the code not runnable.
- For visual reasoning, both models showed marginal 2% accuracy improvements. Over 90% of responses were identical between March and June.
- The major conclusion is that the behavior of the "same" GPT-3.5 and GPT-4 models can change substantially within a few months. This highlights the need for continuous monitoring and assessment of LLMs in production use.
reddit
AI Harm Incident
1689798552.0
♥ 2
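The numeric field above looks like a Unix epoch timestamp in seconds; the page does not label it, so that reading is an assumption. Under that assumption, a minimal sketch for turning it into a readable UTC date:

```python
from datetime import datetime, timezone

# Value copied from the metadata above. Treating it as seconds since the
# Unix epoch is an assumption; the page does not label the field.
RAW_TIMESTAMP = 1689798552.0

def to_utc_iso(epoch_seconds: float) -> str:
    """Convert epoch seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()

posted_at = to_utc_iso(RAW_TIMESTAMP)
print(posted_at)  # 2023-07-19T20:29:12+00:00
```

If the assumption holds, the comment was posted on 19 July 2023.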
Coding Result
| Dimension | Value |
|---|---|
| Responsibility | none |
| Reasoning | unclear |
| Policy | none |
| Emotion | indifference |
| Coded at | 2026-04-25T08:33:43.502452 |
Raw LLM Response
[
{"id":"rdc_jskk6er","responsibility":"ai_itself","reasoning":"consequentialist","policy":"none","emotion":"outrage"},
{"id":"rdc_jsli3y1","responsibility":"none","reasoning":"consequentialist","policy":"none","emotion":"resignation"},
{"id":"rdc_jslohgf","responsibility":"none","reasoning":"unclear","policy":"none","emotion":"fear"},
{"id":"rdc_jsmf36x","responsibility":"company","reasoning":"deontological","policy":"none","emotion":"outrage"},
{"id":"rdc_jsmzofs","responsibility":"none","reasoning":"unclear","policy":"none","emotion":"indifference"}
]
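The raw response is a batch: one JSON array covering five comment IDs, while the Coding Result table shows a single comment. The page does not say which ID belongs to the displayed comment, so the sketch below infers it by matching the table values against each record. The helper names (`index_by_id`, `find_matching`) are hypothetical and not part of the tool.

```python
import json

# Raw batch response copied verbatim from the page.
RAW_RESPONSE = """[
{"id":"rdc_jskk6er","responsibility":"ai_itself","reasoning":"consequentialist","policy":"none","emotion":"outrage"},
{"id":"rdc_jsli3y1","responsibility":"none","reasoning":"consequentialist","policy":"none","emotion":"resignation"},
{"id":"rdc_jslohgf","responsibility":"none","reasoning":"unclear","policy":"none","emotion":"fear"},
{"id":"rdc_jsmf36x","responsibility":"company","reasoning":"deontological","policy":"none","emotion":"outrage"},
{"id":"rdc_jsmzofs","responsibility":"none","reasoning":"unclear","policy":"none","emotion":"indifference"}
]"""

# The four coded dimensions shown in the Coding Result table.
DIMENSIONS = ("responsibility", "reasoning", "policy", "emotion")

def index_by_id(raw: str) -> dict[str, dict[str, str]]:
    """Parse the batch and key each record's dimensions by comment ID."""
    indexed = {}
    for record in json.loads(raw):
        missing = [key for key in ("id", *DIMENSIONS) if key not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
        indexed[record["id"]] = {dim: record[dim] for dim in DIMENSIONS}
    return indexed

def find_matching(indexed: dict[str, dict[str, str]], coded: dict[str, str]) -> list[str]:
    """Return every comment ID whose dimensions equal the given coding."""
    return [comment_id for comment_id, dims in indexed.items() if dims == coded]

codes = index_by_id(RAW_RESPONSE)

# Values copied from the Coding Result table.
displayed = {
    "responsibility": "none",
    "reasoning": "unclear",
    "policy": "none",
    "emotion": "indifference",
}
print(find_matching(codes, displayed))  # ['rdc_jsmzofs']
```

Only `rdc_jsmzofs` matches all four table values, which suggests (but does not prove) that it is the ID of the displayed comment. Matching on values is a fallback; a direct lookup such as `codes["rdc_jsmzofs"]` is preferable whenever the comment ID is known.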