Raw LLM Responses

Inspect the exact model output for any coded comment.

Comment
@JoshuaKolden, arguably, this makes it worse, because, just to be clear here, we have already reached a "generalizable conclusion relating to learning from copyrighted material" that they did, since they admitted it. What we deal with here is what and how LLMs are learning from copyrighted material. As you elucidated, this data is over-represented, and learning it verbatim is overfitting. But this is not a bug, it’s a feature, dialled up to 11. You see, when an educated person quotes, it is more likely they are quoting the Bible or Shakespeare than a Minnesota phone book from 1956. They were told these quotes are the OGs of quotes, we agree, and that’s why we call them educated.

The way we tell LLMs that data is important is by repeating it, and then we can agree that they are educated. You were pointing out that it would be ideal if we could stop just shy of “overfitting” and remain in the promised land of fitting. Either way, we find ourselves with the need to determine how much is enough, but who’s to say how much is enough? Is recognising Bible quotes essential for an educated bot? Which phrases? How about quotes from The Lord of the Rings? The thing is, the way we tell LLMs that yes, they should remember that, is by repeating it more than that other stuff, which is less critical. But to avoid the need to explicitly determine how many times we should repeat what, we crowdsource the decisions to ‘the distribution of data on the interwebs’. Being able to be “sloppy”, as you put it, is the killer feature of LLM pretraining.

What it all amounts to is that the way LLMs learn from copyrighted material, as from any other material, is by learning more from seminal works and relying more on the most popular content, without any accounting for the fact that the benefit the companies accrued from training on this material was disproportionately large. Now, granted, they did not intend the LLMs to learn this material verbatim and would try to prevent them from repeating it word-for-word. So, if the only concern of copyright holders were that LLMs might be used to produce cheap, flawed copies of their work, I’d say, like you, it was all a big mistake - nothing to see here. Look, officer, nobody here is doing anything suspicious, at least, not any longer. But the market is hot for training data, and what you demonstrated is that not all data is created equal, and some data is flaming hot. This, IMO, is what makes it worse.

The mere fact that overfitting is a thing, that companies have to put guardrails in place to prevent it, and that (on top of that) LLMs are discouraged from quoting whole chapters demonstrates how acutely essential some of this material is to the process of educating the beast, and how acutely aware companies are of this fact. The real violation is the learning itself, by virtue of the method used to implement it. Admittedly, it is a spectrum. If LLMs could do the equivalent of seeing some clips of Pink Floyd on YouTube and getting the hang of playing the guitar in a style that resembles David Gilmour’s, i.e. if they could efficiently extract abstractions and general patterns, that would be another thing. But what we do is more akin to forcing Gilmour to tutor a model for a decade, in the hope that some of it will stick. Gilmour can’t spend decades training many players, and his songs will not be used (that) many times to train generative models. The proof is in the market. There is a market for training data.
What your description proves is that the companies knew they were using something with market value and chose to conveniently ignore it until someone presented them with a subpoena. Ohh, you only learn some patterns? So why don’t you learn some patterns about Harry Potter from the thousands of non-copyrighted blogs written about the book? Ohh, so you need the analogue of ten years of master classes from Rowling to pull off your pattern matching? So pay up, dude. And thank Rowling for her time and for contributing to your education, instead of pretending you were just skimming her work once or twice to get the gist of things. If you feel you were forced to spit into the well you were drinking from, at least don’t call it rain.
Source: youtube, posted 2026-01-24T02:2…
Coding Result
Dimension        Value
Responsibility   company
Reasoning        consequentialist
Policy           regulate
Emotion          fear
Coded at         2026-04-27T06:26:44.938723
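For readers inspecting many of these records, it may help to pin down the shape of one coded row. The Python sketch below is an assumption inferred from this record and the raw response further down; the actual codebook may permit more values per dimension than the ones observed in this batch.

from typing import TypedDict

# Hypothetical schema, inferred from the values visible in this record and
# in the raw LLM response below; the real codebook may define more categories.
class CodedComment(TypedDict):
    id: str              # comment id (prefixed "ytr_" in this batch)
    responsibility: str  # observed: none, government, company, ai_itself
    reasoning: str       # observed: unclear, deontological, mixed, consequentialist
    policy: str          # observed: unclear, liability, regulate, industry_self, none, ban
    emotion: str         # observed: indifference, outrage, fear, approval, resignation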
Raw LLM Response
[ {"id":"ytr_UgwxcKl4B92AvDU3GJp4AaABAg.AS3GKO2L0LZAS5-6AgsPRX","responsibility":"none","reasoning":"unclear","policy":"unclear","emotion":"indifference"}, {"id":"ytr_UgwxcKl4B92AvDU3GJp4AaABAg.AS3GKO2L0LZAS7DZGyXZL0","responsibility":"government","reasoning":"deontological","policy":"liability","emotion":"outrage"}, {"id":"ytr_UgwTxn5nWOeZSciYSbx4AaABAg.AS342zgtgCSASLGP6uKRul","responsibility":"none","reasoning":"mixed","policy":"unclear","emotion":"indifference"}, {"id":"ytr_UgwTxn5nWOeZSciYSbx4AaABAg.AS342zgtgCSASLbh98lduO","responsibility":"company","reasoning":"consequentialist","policy":"regulate","emotion":"fear"}, {"id":"ytr_UgwTxn5nWOeZSciYSbx4AaABAg.AS342zgtgCSASNFZGnJw0j","responsibility":"company","reasoning":"mixed","policy":"industry_self","emotion":"approval"}, {"id":"ytr_UgwYMh4HWSnqiJIvuZ54AaABAg.AS2ytOWq-jJAS6b6C9mzQk","responsibility":"ai_itself","reasoning":"deontological","policy":"none","emotion":"approval"}, {"id":"ytr_UgxTpwOqlCLgISlJwXZ4AaABAg.AS2wksN_TYgAS3haijVLxT","responsibility":"company","reasoning":"deontological","policy":"liability","emotion":"outrage"}, {"id":"ytr_UgxTpwOqlCLgISlJwXZ4AaABAg.AS2wksN_TYgAS6dkEe4Tvm","responsibility":"none","reasoning":"mixed","policy":"none","emotion":"indifference"}, {"id":"ytr_UgyuLBQvV1IdDSyxP1F4AaABAg.AS2vzUWJEb_AS6ejj7VD7d","responsibility":"none","reasoning":"mixed","policy":"unclear","emotion":"resignation"}, {"id":"ytr_UgwyahGFIVuQXcs44aJ4AaABAg.AS2tE9KfDXEAS2tnpYQJw7","responsibility":"company","reasoning":"unclear","policy":"ban","emotion":"outrage"} ]