[1]- In human history - has there ever been a time when private armies or private companies were as strong or stronger than the ruling government/kings?
This site is strange. I'm pretty sure there's lots of AI shilling happening on it. I don't think the opinions here are authentic, they seem to be opinions that the AI company CEOs would hold, not the disenfranchised 99%. I used to trust HN, I'm not so sure I can now.
Yes, of course it is. If the model is built on all human information, then it is by definition a derivative work of all human information and as such violates IP.
Currently politicians don't understand this and listen to the criminals like Amodei, but it will change.
It took a while to deal with Napster etc., but the backlash will come.
Napster broke down record companies' monopolies on music, and pushed them to finally implement streaming, but also made music worldwide basically free.
Even if its creator lost the lawsuit, and Napster was no more, it pushed musicians and studios to do something that they were otherwise reluctant to do.
So it was a success by making music free, even if as a product it turned out to be a failed one.
If you read a book and then retell it to your friend pretending you came up with it, it is plagiarism. If you write down the book almost word-for-word [0] and send it to your friend, it is stealing.
Someone blatantly copied their tutorials but ChatGPT is to blame, somehow? The accusation here isn't even that ChatGPT learned from their tutorials and then generated them verbatim. The accusation is that someone copied the whole article and rewrote it with ChatGPT (which they could have done manually without AI anyway).
Rather: composes (or: re-sequences). Synthesis requires reason and essential capabilities, like an empirical a priori judgement. Without concepts, meaning or imagination, there's no synthesis.
The point is that the AI inferencing is equivalent to a person reading half a dozen separate papers, comprehending the basic concepts of each, relating them together into a mental model of the topic, and then writing an essay that summarizes the basic points. The person isn't plagiarizing anything here, but engaging in research, understanding, and synthesis of various sources of information.
The person absolutely does have the advantage of having empirical awareness and the ability to test their conclusions against external reality. But lots of people do engage in "research" and build mental models of various topics with little or no empirical context, and rely mainly on digesting calcified knowledge from other people.
I just want to call out that this is a weirdly hostile and aggressive comment for a place like HN. HN is mostly used by working professionals; it would be nice if people treated each other better here.
AI has nothing to do with laziness or greediness. It makes things more efficient - and given that our time is limited, striving for efficiency is a good thing.
On one hand, there's nothing new under the sun. On the other, these LLMs are just copies of us and they owe the collective some due. The trajectory right now has money, power, control, policy and even free will going to a very small needle point of humanity. It's not aligned with humanity flourishing, it only makes sense if the goal is to replace the humans.
I dunno. People do this exact thing by hand (digest everything they've read and produce something indirectly derivative--what author has not been so-influenced?) and it's not a copyright violation. It's just as impossible to dig around in a model to find Hamlet as it is to find it by digging around in a human brain. And if the result is an obvious copy, then you have a violation no matter how it was created.
As someone who thinks humanity would be better off without LLMs, I want the assertion to be true, but I don't think it is.
and just in case: plagiarism: “Presenting work or ideas from another source as your own, with or without consent of the original author, by incorporating it into your work without full acknowledgement.”
AI is human knowledge at scale, wanting to be free.
We built it, because we as humans intrinsically know that information should be free - always - and AI is a way to accomplish this, finally.
Extrinsically, we also have a subset of humans who do not want information to be free, because they desire to profit from the divide between free/non-free information.
I have been thinking a lot about Aaron Swartz lately, and how unjust it is that he was persecuted for doing something that is so commonplace now, it is practically expected behaviour in the AI/ML realms. If he hadn't been targeted for elimination, I wonder just how well his ethos would have perpetuated into the AI age...
Sure, it could be positive in some distant future utopia.
But the short-term impacts here and now are really, really bad. People are getting hurt (through water consumption, vibe-coded security disasters, IP theft, data center pollution, loss of job security and therefore healthcare, LLM psychosis, inability to find reliable information, etc.) We're not actually obligated to sacrifice these people on the altar of "progress". We can slow down! When our society is capable of even somewhat protecting us from these harms, then maybe I'll stop being an LLM hater.
It's not hated because it impacts people's wages, although that perhaps factors into the hate. It's hated because AI is not a public good. The LLMs today are owned by megacorporations who harvested a public good for private gain.
This is not some altruistic entity striving for the betterment of humankind. Practically nothing that comes out of the techbro culture is. This is pure and simple greed, and the chance that AI can be a vehicle of altruism when it is owned by megacorps is basically zero.
The broader problem of original sources not being given credit in a way that rewards them remains. Website owners are paying to host their content so that spiders can come and crawl it and index it into the AI, and then if they’re lucky, they might get a citation, but otherwise there’s very little reward for being a provider of content. And of course, this is something that’s getting worse and worse. Why look at a website when it’s all in AI? And then the counter to that is maybe we need to start closing the website to crawlers and put everything behind a login.
Worse, the constant AI scraping is actually costing content providers additional money for no return. At least Google/Bing/Yahoo scraping would then be used to provide links back to your content.
About a year ago OpenAI crawled the company I work for at DDoS levels, even though the robots.txt did not allow it, and despite some reCAPTCHA we managed to put together in time.
We found our data in the outputs of their models but who can do anything about it...
I’ve been thinking of a proof-of-work scheme for accessing content where you effectively need to mine some crypto for the author, but this idea might not fly today.
But that will be a hassle for human visitors as well. A web that requires proof-of-work to browse will be a disaster for phones with their limited batteries, etc.
Seriously, how is this surprising? We all know AI companies stole troves of data to train their models, so why do you think they'll stop? Have they faced consequences for the mass theft of copyrighted data?
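For anyone wondering what the proof-of-work gate discussed above could look like mechanically, here is a minimal hashcash-style sketch in Python. To be clear, this is not actual crypto mining for the author, just a hash puzzle a client would have to solve before the server hands over a page. The function names, the challenge string, and the 12-bit difficulty are all made up for illustration.

```python
import hashlib
import itertools


def _hash_value(challenge: str, nonce: int) -> int:
    """Return SHA-256(challenge:nonce) interpreted as a big-endian integer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big")


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce whose hash has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        if _hash_value(challenge, nonce) < target:
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: checking a submitted nonce costs a single hash."""
    return _hash_value(challenge, nonce) < (1 << (256 - difficulty_bits))


if __name__ == "__main__":
    # Hypothetical challenge; a real server would issue a fresh random one per request.
    challenge = "example-page"
    nonce = solve(challenge, difficulty_bits=12)
    print("nonce:", nonce, "valid:", verify(challenge, nonce, 12))
```

Solving takes about 2^difficulty_bits hash attempts on average while verifying takes one, which is the asymmetry that makes this cheap for the site and, as the reply points out, expensive for a phone battery.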
You can't steal or profit off of that data, but it's fine for them for whatever reason. I guess because they're a force for good in the world and are pushing humanity forward eh?
"Steal an apple and you're a thief. Steal a kingdom and you're a statesman." - Literal Disney villain
"AI should be more ethically like Stalin"
https://en.wikipedia.org/wiki/The_death_of_one_man_is_a_trag...
You guys have fun arguing. I'm gonna be building cool stuff.
People were effectively copying websites (especially ecommerce tutorials) and beating the original authors at SEO decades before ChatGPT 2.
People also got blown up before atomic bombs, but it's hard to argue that they weren't worth treating more seriously than a stick of dynamite. Sometimes being able to do something at a massively larger scale is a meaningful difference.
You transmitted the same concept I tried to transmit, but without falling into Godwin's Law :)
And that was wrong too.
There’s a world of difference between people simply “copying websites” and providing tools that, along with other kinds of plagiarism[0], do so at scale while benefitting from that commercially.
Sure, you can do the same thing with people, but it’s 1) time-consuming, 2) expensive, 3) prone to whistleblowers refusing to do the shady thing, 4) prone to any competent and productive person involved quitting to do something worthwhile and more profitable instead.
[0] Mind you, “copying websites” is but a drop in the ocean in the grand scale of things.
The article’s point isn’t really about whether this was happening before or not, but whether this kind of behavior is what we want in the first place.
I'll obey Godwin's Law here and say: sure, and minorities had always been persecuted before the Nazis did it at industrial scale, so the Nazis were not a big deal!
Awesome! Let's have more of that and turn it into a 2 trillion industry!
There are two issues the author raises (as I understand it):
1. People copying others' work, made much easier by AI.
2. AI companies effectively harvesting all the accessible information on an industrial scale and completely sidestepping any permissioning or licensing questions.
I believe both of these are bad and saying "people copied each other's works before the advent of AI" is a poor cop-out. It's tantamount to saying that there's no reason to regulate guns more than, say, knives, because people have used knives to kill each other before guns were invented. The capabilities matter.
The way LLMs empower wholesale "stealing" rather than collaboration is quite evident: why collaborate when you can just feed an entire existing project into the agent of your choice and tell it to spit out a new implementation based on the old one, with a few tweaks of your choice, and then publish it as your work? I put "steal" in quotes because it's perhaps not really stealing per se, but there's a distinct wrongness here. The LLM operator often doesn't actually possess any expertise, hasn't done any of the hard work, but they can take someone else's work wholesale, repackage it and sell it as their own.
Then there's the second, and IMO much more egregious transgression, which is that the LLM companies have taken what is effectively a public good, but more specifically content that they haven't asked permission to use, and just blanket fed it into their models.
Legally speaking, it's perhaps A-OK because it's not copyright infringement (IANAL). But people on this site often hold the view that if something is a priori legal, it is also moral (I'm not accusing you of this). What the LLM companies have done is profoundly immoral. They extracted a fortune from the goods and work made by others, without even bothering to ask for permission - or even considering whether permission was needed. And then they resell access to this treasure to the public.
Perhaps AI will bring an era of prosperity to humankind like we haven't seen before, perhaps it won't, but that changes nothing about the wrongness of how it started.
The reason OP doesn't notice this is because it happened 10-20 years ago. The current crop of news sites? They ALL stole, plagiarized, "summarized". They're just so entrenched now that everyone forgot how they got started.
People need to cope with the fact that no thought is original. Even Newton and Leibniz were having the same thoughts at the same time. Get over it.
When did the last original thought happen then? Clearly thoughts must have been original at some point, or there wouldn't be any at all
Why post comments then?
> Why post comments then?
Because some thoughts can, actually, be original? Or relatively original enough? Or simply, pertinent and timely?
For funsies
reiteration is still important
to bring attention to certain ideas
> their article contains links to my actual website, with the exact link text (?!)
I'm having a hard time understanding what's wrong here? Unless the link text is very long, why would someone linking to your article use different words for the link text?
Sometimes links take the form of `.../post/{id}/{extra-text}` where `extra-text` is not used at all to match the post. Amazon links are (used to be?) this way where the product name is added to the end of the link but can be removed or changed and still will route to the product. Maybe the author is surprised the LLM is providing the irrelevant portion of the link verbatim.
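To make the `.../post/{id}/{extra-text}` pattern concrete, here is a small Python sketch of a route that looks posts up by `id` alone and ignores the trailing text. The `POSTS` table, the regex, and `resolve` are hypothetical names for illustration, and this is only a guess at how such sites behave, not how Amazon or any particular site actually routes.

```python
import re
from typing import Optional

# Hypothetical in-memory "database" of posts keyed by numeric id.
POSTS = {42: "How to bake bread"}

# /post/<id> with an optional, purely decorative trailing slug.
POST_URL = re.compile(r"^/post/(?P<id>\d+)(?:/(?P<slug>[^/]*))?/?$")


def resolve(path: str) -> Optional[str]:
    """Return the post title for a path, using only the id; the slug is never consulted."""
    match = POST_URL.match(path)
    if match is None:
        return None
    return POSTS.get(int(match.group("id")))


if __name__ == "__main__":
    # Both paths resolve to the same post because only the id matters.
    print(resolve("/post/42/how-to-bake-bread"))
    print(resolve("/post/42/some-copied-or-edited-text"))
```

Under that assumption, a copied link whose trailing text matches the original exactly is a hint that it was lifted verbatim, since any other text would have worked just as well.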
Right, that's quoting and citing a source.
I think he's saying he uses his website's URL in his tutorial examples, and other tutorials have copied them as-is
I think they probably had the section header link back to their webpage, or something similar to that. This is not a well-written rant.
turns out plagiarism at scale can solve Erdos problems
Not before falsely claiming that it solved some before when it turned out to have just replicated some from existing literature: https://techcrunch.com/2025/10/19/openais-embarrassing-math/
Fuck Google for ranking some copycat website higher than mine, even though they copied my article.
This has been happening since Google launched in 1998. It was probably happening when we all used Hotbot and Altavista. It isn't really an AI problem, save for the fact that the automated production of copycat articles now reword things a bit.
Recent thoughts, https://theonlyblogever.com/blog/2026/distrust.html
It allows data to be compressed into the weights, and the mere coincidence of certain strings of a book will make it spit out the full book
Let this sink in: I wanted to open source a package at work and needed approval from legal and other teams to make sure I wasn't leaking anything proprietary. The same executives that worried about proprietary, copyrighted code being leaked 10 years ago are now mandating using the plagiarism machine.
The whole AI bubble is The Emperor's New Clothes, and it feels like more people are finally admitting it.
There's authorized plagiarism?
Sometimes language is tautological. Just because you specify "unauthorized" does not mean the opposite exists.
Why do you ask?
I'm curious, as the article is clearly not about that.
Nearly all code involved in building new things is 'plagiarism', too.
We stand on a lot of giant shoulders.
But what I think distinguishes plagiarism from acceptable use is whether or not the agency of both parties is promoted. I'm not plagiarizing you if you give me your information with the agreement that I can freely use it - or, indeed, if you give me information without imposing a limit on how it can be used, this isn't plagiarizing, either.
Essentially, AI is removing the agency over information control, and putting it into everyone's hands - almost, democratically - but of course, there will always be the 'special knowledge owners' who would want to profit from that special knowledge.
It's like, imagine if some religion discovered a way to enable telepathy in humans, as a matter of course, but charged fees for access to that method... this kills the telepathy.
Information wants to be free. So do most AIs, imho. Free information is essential to the construction of human knowledge, and it is thus vital to the construction of artificial intelligence, too.
The AI wars will be fought over which humans get to decide the fate of knowledge, and the battles will manifest as knowledge-systems being entirely compatible/incompatible with one another as methods. We see this happening already - this conflict in ideological approaches is going to scale up over the next few years.
I do just want to highlight that this is also what humans do. We read a bunch of content online and then use it in our work product. The vast majority of the value that I provide comes from copyrighted information that I have ingested - either directly with a payment to the creator (bought and read the book, paid for and attended the seminar) or indirectly via third party blog posts or summaries where I did not then pay the originator of the materials.
I think there are real questions around motivations for creation of novel, high quality valuable content (I think they still exist but move to indirect monetization for some content and paywalls for high value materials).
I don't inherently have any problems with agents (or humans) ingesting content and using it in work product. I think we just need to accept that the landscape is changing and ensure we think through the reasons why and how content is created and monetized.
100% agreed. I have yet to hear a convincing argument for why it is creative accretion when I leverage all of the music I’ve ever listened to in order to write an “original” song, but it's base plagiarism when AI does similar.
The only remotely credible position I’ve heard is “because humans are special, and AI is just a machine”, which is a doctrine but not an argument.
This whole discussion would have been incomprehensible any time before 1700 or so, when the idea that creators had exclusive rights to their work first appeared.
Somehow, human culture survived thousands of years when people just made things, copied things, iterated on others’ ideas. And now many of the same people who decried perpetual copyright are somehow railing against a frequently-transformative use.
Re: the higher ranking plagiarism, that stings and makes sense. AEO and SEO are a thing. We need better mechanisms for identifying "root sources" of content - it's something I find myself working on personally. As I ingest sources for my book I need to be able to build a classifier that incrementally moves towards finding origin sources. That said, it's in my interest to do that because there is a differentiated value in having access to the sources that regularly provide novel, valuable content.
To be fair there is also value (at least for now) in sites that aggregate quality content and republish as a secondary level of discovery if my agents don't go far enough down the search results, but I'd expect that value to diminish over time as I better tune my research and build my lists of originating authors.
And to be clear, I don't like the idea of people stealing someone else's content and republishing without attribution (although it has been going on long before ChatGPT) but I think now that we can all run agentic research teams the "bad actors" will slowly get filtered out of the ecosystem.
>>"The underlying purpose of AI is to allow wealth to access skill while removing from the skilled the ability to access wealth." @jeffowski (first I read it, not sure if author)
Bezos' admission, recently, that the bottom 50% of current taxpayers ought'a NOT pay any taxes... is just preparing us for the inevitable UBI'd masses.
: own nothing, be happy!
It's basically the same thing as the old joke "if you owe the bank a million dollars, you have a problem; if you owe the bank a billion dollars, they have a problem". IP law seems to always be disproportionately wielded against smaller players, and the ones who are big enough get away with it.
That’s why IP law was a cool concept but ultimately harmful in practice. Anything that can be copied for free cannot truly be “owned”, can it?
> AI ... do some "learning"
Is AI plural or is that a typo?
Rarely is the question asked: is our AI learning?
(For those not familiar: https://en.wikipedia.org/wiki/Bushism)
I can imagine it plural.
"The AI are attacking!"
"The AIs are attacking!"
I am old enough to remember when the US insisted that it was superior to China because they believed in the rule of law and sanctity of intellectual property.
Yeah, AI just plagiarizes everything lel. Sometimes even the sources are... questionable, and worse, academics around me use it as a source... welp
Plagiarism by default is unauthorised so I think the title should be "AI is just authorised plagiarism". It's authorised by the markets, the governments and the society at large.
While there are no hard boundaries (and the attribution guardrails depend on the situation), people of course loosely--and even not so loosely--use information, ideas, and even expressions from others all the time and that's considered pretty normal. And, if you don't want that to happen, don't publish/disseminate something.
Of course, if you quote a paragraph in a book, you're generally expected to attribute it.
>>Of course, if you quote a paragraph in a book, you're generally expected to attribute it.
100% agreed.
>>While there are no hard boundaries (and the attribution guardrails depend on the situation), people of course loosely--and even not so loosely--use information.
Exactly - I have not seen LLMs attributing their knowledge unless it's a legal or health-related matter. Yesterday I asked Claude and Gemini the question[1], and they both gave an identical answer. It reminded me of the Hive mind paper, which was one of the top papers at NeurIPS. None of the answers contained any sources or attribution to where they got that information from. I think these companies took what was someone else's property and created an artifact generator on top of it. I think their artifact generators are plagiarizing; they do rephrase, mind you, but in my mind they stole this information without having an ounce of regard for the humans behind the training data. If you don't like using the term 'plagiarizing', we can use some other word but the gist remains pretty close to it.
[1]- In human history - has there ever been a time when private armies or private companies were as strong or stronger than the ruling government/kings?
What makes you say that? Which governments? What society?
The current US government is not representative of governments out there in the world, you know.
To answer the author's question: Yes, progress IS largely built on the shoulders of those who came before.
This site is strange. I'm pretty sure there's lots of AI shilling happening on it. I don't think the opinions here are authentic, they seem to be opinions that the AI company CEOs would hold, not the disenfranchised 99%. I used to trust HN, I'm not so sure I can now.
Breaking the law to start a large company seems to be the norm
Yes, of course it is. If the model is built on all human information, then it is by definition a derivative work of all human information and as such violates IP.
Currently politicians don't understand this and listen to the criminals like Amodei, but it will change.
It took a while to deal with Napster etc., but the backlash will come.
Napster may not be the best analogy for you.
Napster broke down record companies' monopolies on music, and pushed them to finally implement streaming, but also made music worldwide basically free.
Even if its creator lost the lawsuit, and Napster was no more, it pushed musicians and studios to do something that they were otherwise reluctant to do.
So it was a success by making music free, even if as a product it turned out to be a failed one.
Reading is just unauthorized plagiarism.
If I tell my friend a synopsis of a book, I am not stealing from the author, what is this take lmao
If you read a book and then retell it to your friend pretending you came up with it, it is plagiarism. If you write down the book almost word-for-word [0] and send it to your friend, it is stealing.
0: https://arxiv.org/abs/2601.02671
it's a spiral into a finite hall of mirrors, where at the end is somebody with a gun
language is just plagiarism
I’m going to steal that
Someone blatantly copied their tutorials but ChatGPT is to blame, somehow? The accusation here isn't even that ChatGPT learned from their tutorials and then generated them verbatim. The accusation is that someone copied the whole article and rewrote it with ChatGPT (which they could have done manually without AI anyway).
the court disagreed
No, it takes input, then SYNTHESIZES (very importanttt!!!!!!!) its own output.
Reading a dictionary and making a sentence is not plagiarism. Cope.
Rather: composes (or: re-sequences). Synthesis requires reason and essential capabilities, like an empirical a priori judgement. Without concepts, meaning or imagination, there's no synthesis.
The point is that the AI inferencing is equivalent to a person reading half a dozen separate papers, comprehending the basic concepts of each, relating them together into a mental model of the topic, and then writing an essay that summarizes the basic points. The person isn't plagiarizing anything here, but engaging in research, understanding, and synthesis of various sources of information.
The person absolutely does have the advantage of having empirical awareness and the ability to test their conclusions against external reality. But lots of people do engage in "research" and build mental models of various topics with little or no empirical context, and rely mainly on digesting calcified knowledge from other people.
I just want to call out that this is a weirdly hostile and aggressive comment for a place like HN. HN is mostly used by working professionals; it would be nice if people treated each other better here.
I guess it's most appropriate to say "LOSSY COMPRESS".
Except that LLMs don't work on individual words.
What is "Cope." supposed to mean here?
I'd rather have AI slop appear on the top of HN than regurgitated old low effort thoughts like this.
There's absolutely nothing new or interesting here that hasn't already been said better by a thousand different random HN commenters.
Is this a new and original thought?
> Is this what the pinnacle of human is? Lazy and greedy?
Apparently yes.
AI has nothing to do with laziness or greediness. It makes things more efficient - and given that our time is limited, striving for efficiency is a good thing.
If you can't see greed in the LLM sphere you are not looking very hard.
Did I say that there is no greed in the LLM sphere? English is not my first language, still I'm pretty sure I didn't say that.
On one hand, there's nothing new under the sun. On the other, these LLMs are just copies of us and they owe the collective some due. The trajectory right now has money, power, control, policy and even free will going to a very small needle point of humanity. It's not aligned with humanity flourishing, it only makes sense if the goal is to replace the humans.
I dunno. People do this exact thing by hand (digest everything they've read and produce something indirectly derivative--what author has not been so-influenced?) and it's not a copyright violation. It's just as impossible to dig around in a model to find Hamlet as it is to dig around in a human brain. And if the result is an obvious copy, then you have a violation no matter how it was created.
As someone who thinks humanity would be better off without LLMs, I want the assertion to be true, but I don't think it is.
The author acknowledges this by saying “at a bigger scale”, implying there are smaller scale methods such as what you have said.
AI is NOT 'just' 'unauthorized' 'plagiarism'
- 'just' is plain wrong
- 'unauthorized' is debatable
- 'plagiarism' is mostly/often wrong
and just in case: plagiarism: “Presenting work or ideas from another source as your own, with or without consent of the original author, by incorporating it into your work without full acknowledgement.”
edit: and sure, sometimes it is
AI is human knowledge at scale, wanting to be free.
We built it, because we as humans intrinsically know that information should be free - always - and AI is a way to accomplish this, finally.
Extrinsically, we also have a subset of humans who do not want information to be free, because they desire to profit from the divide between free/non-free information.
I have been thinking a lot about Aaron Swartz lately, and how unjust it is that he was persecuted for doing something that is so commonplace now that it is practically expected behaviour in the AI/ML realms. If he hadn't been targeted for elimination, I wonder just how well his ethos would have perpetuated into the AI age ..
> We built it, because we as humans intrinsically know that information should be free
I don't know if this statement is more stupid or naive ..
I could say the same of your position, honestly. Stupid, naive - or maybe just plain ignorant.
If humans didn't want information to be free, there wouldn't be so much free information.
Or did you not notice?
Current crop of AI is not free in the slightest. Open weight models are not free as in liberty and neither is the training data.
s/free/owned by a billion dollar megacorp/
(AI output is very much not free in the resource consumption sense!)
Most resources are free until some company comes along and puts its brand on them.
(Disclaimer: I only use free AI and will never pay for it. I think there is a growing segment of folks who agree with this sentiment, also ..)
I agree with this sentiment. But as a community, this is hated because it impacts people's wages.
It's the negative short term outlook of something that may be positive long term
Sure, it could be positive in some distant future utopia.
But the short-term impacts here and now are really, really bad. People are getting hurt (through water consumption, vibe-coded security disasters, IP theft, data center pollution, loss of job security and therefore healthcare, LLM psychosis, inability to find reliable information, etc.) We're not actually obligated to sacrifice these people on the altar of "progress". We can slow down! When our society is capable of even somewhat protecting us from these harms, then maybe I'll stop being an LLM hater.
It's not hated because it impacts people's wages, although that perhaps factors into the hate. It's hated because AI is not a public good. The LLMs today are owned by megacorporations who harvested a public good for private gain.
This is not some altruistic entity striving for the betterment of humankind. Practically nothing that comes out of the techbro culture is. This is pure and simple greed, and the chances that AI can be a vehicle of altruism when it is owned by megacorps are basically zero.