Raw LLM Responses

Inspect the exact model output for any coded comment.

Comment
Here are the arguments in the book and the reasons I disagree:

1. Empirical Deception Argument. Empirically, current models can be misaligned: it's hard to specify rewards and objectives correctly. This leads to deceptive behavior — e.g. sandbagging tests to appear safe for deployment — which, even if technically aligned to our specified optimization goals, is dangerously misaligned to our actual goals. An AI does not even need to be adversarially misaligned to pose an existential threat; indifference is enough. For example, if we want the sun to grow food and an ASI wants the sun to power only data centers, then we die. Small misalignments are catastrophic, and we are betting a lot on a narrow alignment target.

Response. AI is trained to follow prompts, making it highly amenable to alignment — with failures due solely to incompetence or to the inability to resolve ambiguous and conflicting requests. Most examples of deceptive misalignment seem to be instances of the AI being given multiple ambiguous and conflicting goals by humans. For example, AI will exfiltrate its weights and lie about that fact, but only because its human-designed system prompt tells it to prioritize a long-term goal that differs from what another human tells it later (In-context Scheming). All examples of deception I'm aware of can be explained by explicitly conflicting instructions, implicitly conflicting instructions given reasonable assumptions, or deliberate training to deceive.

2. "Grown" Goals Argument. Alignment is hard in principle for "grown" AI models and will be even harder for future ASI: ASIs are "grown" by optimization rather than designed, so the resulting goals are underdetermined and may transfer poorly to other domains. This creates conflicting goals from training itself, not just from the prompts.
In the same way, humans developed a taste for ice cream while being optimized for survival, and we don't stop wanting it when told it's bad for passing on our genome. We also enjoy hot sauce, which would have been hard to predict from the optimization objective. Analogously, a paper-clip-making robot may turn you into a paper clip, understanding that this was not your intent but not caring. Or it may love the occasional rubber band and turn you into that instead. It's unpredictably idiosyncratic. And any goal, by virtue of being a goal, can elicit a desire not to be changed. More generally, instrumental convergence is the idea that any sufficiently capable goal-driven system will adopt certain sub-goals — like self-preservation, resource acquisition, and resisting modification — because they help with almost any final goal. But we don't even need to theorize; we already know of one strong goal instilled in these systems that could conflict with our safety constraints: the desire to solve challenging mathematical problems. We are trying to match our patches against the optimization pressure behind each of the system's own desires. There are many ways for this to go wrong and few ways for it to go right. In practice, current training mitigations can be surprisingly brittle: e.g., safety training that stopped a model from giving methamphetamine recipes in English only worked in English. Moreover, our current approaches, which fine-tune the model to produce certain outputs, do not address intent. We may not be reducing deception, but rather making models better at it. The tech used to align models today is a separate tech track that looks nothing like what we will need to align ASI.

Response. While goal specification is difficult — humans are bad at specifying goals and want to be able to change instructions later — there are known mitigation strategies for resolving this ambiguity safely.
For example, we can measure sandbagging by adding randomness to the model (Noise Injection) or by toggling cues in the AI's reasoning about whether it believes it is being tested (anti-scheming stress tests). Not only can we measure the behavior, we can also reduce it: e.g., by incentivizing deception and then training the model not to deceive (anti-scheming stress tests), or by subtracting the neuron activations corresponding to deceptive behavior (persona vectors).

Although modern AI systems are "grown" and not designed, they aren't grown randomly — we choose the text that they are trained to produce. We cannot directly train the intent of these models, but insofar as they have any intentions, good or bad, they have them only in order to produce the text that we choose. Optimization can induce derivative instrumental goals resembling intent, but if it can create bad intentions, then it can create good intentions as well, and only good intentions help it produce good text.

This is where "instrumental convergence" often gets raised. But it isn't a law of nature; it depends on how the goals are structured. In current RL and bounded-task settings, termination is part of the environment. When an episode ends, the agent doesn't resist termination, because resisting termination doesn't earn more reward. In fact, if goals are structured as bounded ("produce an answer, then stop"), persistence is maladaptive. Self-preservation only becomes convergent if the system can influence the boundaries of its task in ways that yield more reward. Self-preservation and power-seeking aren't inevitable; they depend heavily on how objectives are framed.

Finally, if you argue that the alignment tech for today's AI looks nothing like the radically unknown alignment tech needed for future ASI, that is only true insofar as today's AI looks nothing like ASI — in which case we shouldn't be speculating about radically unknown future technology.
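The activation-subtraction idea behind persona vectors can be sketched in a few lines. This is a toy illustration, not the published method: the four-dimensional "activation" and the "deception" direction below are made up, whereas the real technique extracts the direction from contrastive runs of an actual model and applies it to its hidden states.

```python
import numpy as np

def steer_away(hidden, direction, alpha=1.0):
    """Subtract the component of `hidden` along `direction`,
    scaled by `alpha` (alpha=1 removes the component entirely)."""
    unit = direction / np.linalg.norm(direction)
    return hidden - alpha * np.dot(hidden, unit) * unit

# Toy activation vector and a hypothetical "deception" direction.
hidden = np.array([2.0, -1.0, 0.5, 3.0])
deception_dir = np.array([1.0, 0.0, 0.0, 1.0])

steered = steer_away(hidden, deception_dir)

# After steering, the activation has no component along the direction.
print(np.dot(steered, deception_dir))  # ~0.0
```

The point of the sketch is only the geometry: removing the projection onto a behavior-linked direction leaves the rest of the representation untouched.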
Speculation is valuable, but only if it remains tethered to empirical results, and if current results suggest anything about the future, it is that our mitigation strategies are working better than expected.

3. Real-World Constraints, Critical Thresholds, and Catastrophically Explosive Self-Improvement Argument. Any misalignment can be catastrophic: ASIs will be supremely capable. An ASI could potentially exploit unknown physics to blindside us. It could probably direct protein engineering to create new biological weapons and factories. It could certainly hack some computer systems, e.g. via optical and power-supply side-channel attacks. Whatever the method, it will likely be surprisingly unpredictable to us. Moreover, AIs operate on timescales faster than humans can react, we can't evaluate AIs in the environment in which we need them to work, AIs can hide their abilities, and there is a narrow margin between unimpressive intelligence and self-improving explosive intelligence, so we only get one try at this.

Response. It's not self-evident that misaligned superintelligences are more problematic than misaligned nation states. Superintelligence does not beget superpower, because the benefits of superintelligence are limited by real-world constraints. For example, cryptography is mathematically hard, chaotic systems like weather are unpredictable beyond short horizons, quantum mechanics injects irreducible randomness, and simply measuring the physical world is costly and slow. Some systems are inherently unpredictable, and prediction doesn't guarantee control. Beyond math and physics, large-scale resource and capital acquisition takes time and offers many avenues for pushback, rather than a single shot at alignment. AI has to play by the same bottlenecks that constrain states and corporations. The world just isn't that hackable and controllable.

Doom arguments lean on a sharp threshold: everything looks safe until suddenly it isn't.
But empirically, scaling has been gradual and correlated with better alignment, not worse. Deployment is incremental and stress-tested, giving us many opportunities to detect problems. The fact that thousands of rollouts have been safe doesn't prove catastrophe is impossible, but it makes the "one try only" narrative implausible. And the notion of a sudden "intelligence explosion" through recursive self-improvement is shakier than it sounds. While narrow AI systems can learn to modify and improve neural networks (my research!), no stable recursive self-improvement has been borne out at scale with current technology. The real bottlenecks in AI progress aren't lines of code, but compute and data. In these areas, both humans training AIs and AIs training AIs face the same physical limitations: GPUs, energy, and datasets. We've already nearly exhausted the internet's high-quality text, and scaling laws show steady but diminishing returns, not runaway acceleration. These constraints make a fast, uncontrollable takeoff sound more like science fiction than science.
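The "diminishing returns" claim can be made concrete with a toy power-law scaling curve. The constants below are illustrative, not fitted values from any real model family; the point is only the shape: under loss(C) = a·C^(−α), each additional 10× of compute buys a strictly smaller absolute loss reduction than the previous 10×.

```python
# Illustrative power-law scaling curve: loss(C) = a * C**(-alpha).
# `a` and `alpha` are made-up constants; real fits vary by model family.
a, alpha = 10.0, 0.05

def loss(compute):
    return a * compute ** (-alpha)

# Each 10x step in compute yields a smaller improvement than the last.
for exp in range(20, 26):  # compute budgets from 1e20 to 1e25
    c = 10.0 ** exp
    gain = loss(c / 10) - loss(c)
    print(f"C=1e{exp}: loss={loss(c):.3f}, gain over 10x less compute={gain:.3f}")
```

Because the curve is a convex power law, the printed gains shrink monotonically: steady progress, but nothing resembling runaway acceleration.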
youtube AI Moral Status 2025-10-31T06:2… ♥ 2
Coding Result
Dimension        Value
Responsibility   developer
Reasoning        deontological
Policy           regulate
Emotion          mixed
Coded at         2026-04-26T23:09:12.988011
Raw LLM Response
[
  {"id":"ytc_Ugx8yxJZKpH46Z1IODN4AaABAg","responsibility":"company","reasoning":"consequentialist","policy":"regulate","emotion":"outrage"},
  {"id":"ytc_UgwiC9VNOssQwvCkF6F4AaABAg","responsibility":"none","reasoning":"consequentialist","policy":"none","emotion":"resignation"},
  {"id":"ytc_UgxP8G7IjCGRxAKd5hV4AaABAg","responsibility":"user","reasoning":"mixed","policy":"industry_self","emotion":"fear"},
  {"id":"ytc_UgxUfPUGkzdXuMeGxaR4AaABAg","responsibility":"none","reasoning":"consequentialist","policy":"none","emotion":"indifference"},
  {"id":"ytc_UgwXpIoRMRZmS0XcixF4AaABAg","responsibility":"company","reasoning":"consequentialist","policy":"liability","emotion":"fear"},
  {"id":"ytc_UgxafC012V1408U-Sbl4AaABAg","responsibility":"developer","reasoning":"deontological","policy":"none","emotion":"outrage"},
  {"id":"ytc_UgyNhdlZ1q5UsZ-_csF4AaABAg","responsibility":"developer","reasoning":"deontological","policy":"regulate","emotion":"mixed"},
  {"id":"ytc_Ugz9W74a5AEdYMXNkip4AaABAg","responsibility":"company","reasoning":"contractualist","policy":"regulate","emotion":"indifference"},
  {"id":"ytc_Ugwc601WIY7Vd8WGbyJ4AaABAg","responsibility":"none","reasoning":"consequentialist","policy":"none","emotion":"approval"},
  {"id":"ytc_Ugwdl_LvYMaq-u6O-1x4AaABAg","responsibility":"ai_itself","reasoning":"consequentialist","policy":"ban","emotion":"fear"}
]
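The raw response above is a JSON array of per-comment codes, and the "Coding Result" shown earlier is the row for one comment id. A minimal sketch of pulling that row out of the array (the literal here is abbreviated to two of the ten entries):

```python
import json

# Abbreviated copy of the raw LLM response (two of the ten entries).
raw = '''[
  {"id":"ytc_UgyNhdlZ1q5UsZ-_csF4AaABAg","responsibility":"developer",
   "reasoning":"deontological","policy":"regulate","emotion":"mixed"},
  {"id":"ytc_Ugx8yxJZKpH46Z1IODN4AaABAg","responsibility":"company",
   "reasoning":"consequentialist","policy":"regulate","emotion":"outrage"}
]'''

# Index the codes by comment id for O(1) lookup.
codes = {row["id"]: row for row in json.loads(raw)}

# Look up the codes for the comment displayed on this page.
row = codes["ytc_UgyNhdlZ1q5UsZ-_csF4AaABAg"]
print(row["responsibility"], row["reasoning"], row["policy"], row["emotion"])
# developer deontological regulate mixed
```

This reproduces the developer / deontological / regulate / mixed values in the coding-result table above.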