Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.
This is fundamentally different from copying a book or song. It’s more like the long-standing artistic tradition of being influenced by others’ work. The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.
Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.
While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.
For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744
The Times’ lawyers must be chuffed reading this.
Maybe if you would pay for training data they would let you use copyrighted data or something?
The other part of it is they broke the rules so they need to face the consequences. They are asking for forgiveness and in this case I don’t think they deserve it.
Their business strategy is built on the assumption that they won’t. They don’t want this door opened at all. It was a great deal for Google to buy Reddit’s data for some $mil., because it is a huge collection behind one entity. Now imagine communicating with each individual site owner whose resources they scraped.
If that could’ve been how it started, the development of these AI tools could be much slower because of (1) data being added to the bunch only after an agreement, (2) more expenses meaning less money for hardware expansion and (3) investors and companies being less hyped up about that thing because it doesn’t grow like a mushroom cloud while following legal procedures. Also, (4) the ability to investigate and collect a public list of what sites they have agreements with is pretty damning, making its own news stories and conflicts.
Had the company paid for the training data and/or left it as voluntary, there would be less of a problem with it to begin with.
Part of the problem is that they didn’t, but are still using it for commercial purposes.
Bullshit. AI are not human. We shouldn’t treat them as such. AI are not creative. They just regurgitate what they are trained on. We call what it does “learning”, but that doesn’t mean we should elevate what they do to be legally equal to human learning.
It’s this same kind of twisted logic that makes people think Corporations are People.
Ok, ignore this specific company and technology.
In the abstract, if you wanted to make artificial intelligence, how would you do it without using the training data that we humans use to train our own intelligence?
We learn by reading copyrighted material. Do we pay for it? Sometimes. Sometimes a teacher read it a while ago and then just regurgitated basically the same copyrighted information back to us in a slightly changed form.
And that’s all paid for. Think how much just the average high school graduate has had invested in them. AI companies want all that, but for free.
It’s not though.
A huge amount of what you learn, someone else paid for, then they taught that knowledge to the next person, and so on. By the time you learned it, it had effectively been pirated and copied by human brains several times before it got to you.
Literally anything you learned from a Reddit comment or a Stack Overflow post for instance.
If only there was a profession that exchanges knowledge for money. Someone who “teaches.” I wonder who would pay them.
We learn by reading copyrighted material.
We are human beings. The comparison is false on its face because what you all are calling AI isn’t in any conceivable way comparable to the complexity and versatility of a human mind, yet you continue to spit this lie out, over and over again, trying to play it up like it’s Data from Star Trek.
This model isn’t “learning” anything in any way that is even remotely like how humans learn. You are deliberately simplifying the complexity of the human brain to make that comparison.
Moreover, human beings make their own choices, they aren’t actual tools.
They pointed a tool at copyrighted works and told it to copy, do some math, and regurgitate it. What the AI “does” is not relevant, what the people that programmed it told it to do with that copyrighted information is what matters.
There is no intelligence here except theirs. There is no intent here except theirs.
This model isn’t “learning” anything in any way that is even remotely like how humans learn. You are deliberately simplifying the complexity of the human brain to make that comparison.
I do think the complexity of artificial neural networks is overstated. A real neuron is a lot more complex than an artificial one, and real neurons are not simply feed forward like ANNs (which have to be because they are trained using back-propagation), but instead have their own spontaneous activity (which kinda implies that real neural networks don’t learn using stochastic gradient descent with back-propagation). But to say that there’s nothing at all comparable between the way humans learn and the way ANNs learn is wrong IMO.
If you read books such as V.S. Ramachandran and Sandra Blakeslee’s Phantoms in the Brain or Oliver Sacks’ The Man Who Mistook His Wife For a Hat you will see lots of descriptions of patients with anosognosia brought on by brain injury. These are people who, for example, are unable to see but also incapable of recognizing this inability. If you ask them to describe what they see in front of them they will make something up on the spot (in a process called confabulation) and not realize they’ve done it. They’ll tell you what they’ve made up while believing that they’re telling the truth. (Vision is just one example, anosognosia can manifest in many different cognitive domains).
It is V.S Ramachandran’s belief that there are two processes that occur in the Brain, a confabulator (or “yes man” so to speak) and an anomaly detector (or “critic”). The yes-man’s job is to offer up explanations for sensory input that fit within the existing mental model of the world, whereas the critic’s job is to advocate for changing the world-model to fit the sensory input. In patients with anosognosia something has gone wrong in the connection between the critic and the yes man in a particular cognitive domain, and as a result the yes-man is the only one doing any work. Even in a healthy brain you can see the effects of the interplay between these two processes, such as with the placebo effect and in hallucinations brought on by sensory deprivation.
I think ANNs in general and LLMs in particular are similar to the yes-man process, but lack a critic to go along with it.
What implications does that have on copyright law? I don’t know. Real neurons in a petri dish have already been trained to play games like DOOM and control the yoke of a simulated airplane. If they were trained instead to somehow draw pictures what would the legal implications of that be?
There’s a belief that laws and political systems are derived from some sort of deep philosophical insight, but I think most of the time they’re really just whatever works in practice. So, what I’m trying to say is that we can just agree that what OpenAI does is bad and should be illegal without having to come up with a moral imperative that forces us to ban it.
We are human beings. The comparison is false on its face because what you all are calling AI isn’t in any conceivable way comparable to the complexity and versatility of a human mind, yet you continue to spit this lie out, over and over again, trying to play it up like it’s Data from Star Trek.
If you fundamentally do not think that artificial intelligences can be created, the onus is on you to explain why it’s impossible to replicate the circuitry of our brains. Everything in science we’ve seen thus far has shown that we are merely physical beings that can be recreated physically.
Otherwise, I asked you to examine a thought experiment where you are trying to build an artificial intelligence, not necessarily an LLM.
This model isn’t “learning” anything in any way that is even remotely like how humans learn. You are deliberately simplifying the complexity of the human brain to make that comparison.
Or you are overcomplicating yourself to seem more important and special. Definitely no way that most people would be biased towards that, is there?
Moreover, human beings make their own choices, they aren’t actual tools.
Oh please do go ahead and show us your proof that free will exists! Thank god you finally solved that one! I heard people were really stressing about it for a while!
They pointed a tool at copyrighted works and told it to copy, do some math, and regurgitate it. What the AI “does” is not relevant, what the people that programmed it told it to do with that copyrighted information is what matters.
“I don’t know how this works but it’s math and that scares me so I’ll minimize it!”
If we have an AI that’s equivalent to humanity in capability of learning and creative output/transformation, it would be immoral to just use it as a tool. At least that’s how I see it.
I think that’s a huge risk, but we’ve only ever seen a single, very specific type of intelligence, our own / that of animals that are pretty closely related to us.
Movies like Ex Machina and Her do a good job of pointing out that there is nothing that inherently means that an AI will be anything like us, even if they can appear that way or pass at tasks.
It’s entirely possible that we could develop an AI that was so specifically trained that it would provide the best script editing notes but be incapable of anything else for instance, including self reflection or feeling loss.
The thing is, they can have scads of free stuff that is not copyrighted. But they are greedy and want copyrighted stuff, too.
We all should. Copyright is fucking horseshit.
It costs literally nothing to make a digital copy of something. There is ZERO reason to restrict access to things.
Making a copy is free. Making the original is not. I don’t expect a professional photographer to hand out their work for free because making copies of it costs nothing. You’re not paying for the copy, you’re paying for the money and effort needed to create the original.
Making a copy is free. Making the original is not.
Yes, exactly. Do you see how that is different from the world of physical objects and energy? That is not the case for a physical object. Even once you design something and build a factory to produce it, the first item off the line takes the same amount of resources as the last one.
Capitalism is based on the idea that things are scarce. If I have something, you can’t have it, and if you want it, then I have to give up my thing, so we end up trading. Information does not work that way. We can freely copy a piece of information as much as we want. Which is why monopolies and capitalism are a bad system of rewarding creators. They inherently cause us to impose scarcity where there is no need for it, because in capitalism things that are abundant do not have value. Capitalism fundamentally fails to function when there is abundance of resources, which is why copyright was a dumb system for the digital age. Rather than recognize that we now live in an age of information abundance, we spend billions of dollars trying to impose artificial scarcity.
You sound like someone who has not tried to make an artistic creation for profit.
You sound like someone unwilling to think about a better system.
Better system for WHOM? Tech-bros that want to steal my content as their own?
I’m a writer, performing artist, designer, and illustrator. I have thought about copyright quite a bit. I have released some of my stuff into the public domain, as well as the Creative Commons. If you want to use my work, you may - according to the licenses that I provide.
I also think copyright law is way out of whack. It should go back to - at most - life of author. This “life of author plus 95 years” is ridiculous. I lament that so much great work is being lost or forgotten because of the oppressive copyright laws - especially in the area of computer software.
But tech-bros that want my work to train their LLMs - they can fuck right off. There are legal thresholds that constitute “fair use” - Is it used for an academic purpose? Is it used for a non-profit use? Is the portion that is being used a small part or the whole thing? LLM software fails all of these tests.
They can slurp up the entirety of Wikipedia, and they do. But they are not satisfied with the free stuff. But they want my artistic creations, too, without asking. And they want to sell something based on my work, making money off of my work, without asking.
Better system for WHOM? Tech-bros that want to steal my content as their own?
A better system for EVERYONE. One where we all have access to all creative works, rather than spending billions on engineers and lawyers to create walled gardens and DRM and artificial scarcity. What if literally all the money we spent on all of that instead went to artist royalties?
But tech-bros that want my work to train their LLMs - they can fuck right off. There are legal thresholds that constitute “fair use” - Is it used for an academic purpose? Is it used for a non-profit use? Is the portion that is being used a small part or the whole thing? LLM software fails all of these tests.
No. It doesn’t.
They can literally pass all of those tests.
You are confusing OpenAI keeping their LLM closed source and charging access to it, with LLMs in general. The open source models that Microsoft and Meta publish for instance, pass literally all of the criteria you just stated.
The ingredient thing is a bit amusing, because that’s basically how one of the major fast food chains got to be so big (I can’t remember which one it was ATM though; just that it wasn’t McDonald’s). They cut out the middle-man and just bought their own farm to start growing the vegetables and later on expanded to raising the animals used for the meat as well.
Wait… they actually STOLE the cheese from the cows?
😆
It’s an interesting area. Are they suggesting that a human reading copyright material and learning from it is a breach?
If they can base their business on stealing, then we can steal their AI services, right?
Pirating isn’t stealing but yes the collective works of humanity should belong to humanity, not some slimy cabal of venture capitalists.
Unlike regular piracy, accessing “their” product hosted on their servers using their power and compute is pretty clearly theft. Morally correct theft that I wholeheartedly support, but theft nonetheless.
Is that how this technology works? I’m not the most knowledgeable about tech stuff honestly (at least by Lemmy standards).
There’s self-hosted LLMs, (e.g. Ollama), but for the purposes of this conversation, yeah - they’re centrally hosted, compute intensive software services.
Also, ingredients to a recipe aren’t covered under copyright law.
ingredients to a recipe may well be subject to copyright, which is why food writers make sure their recipes are “unique” in some small way. Enough to make them different enough to avoid accusations of direct plagiarism.
E: removed unnecessary snark
In what country is that?
Under US law, you cannot copyright recipes. You can own a specific text in which you explain the recipe. But anyone can write down the same ingredients and instructions in a different way and own that text.
Keep in mind that “ingredients to a recipe” here refers to the literal physical ingredients, based on the context of the OP (where a sandwich shop owner can’t afford to pay for their cheese).
While you can’t copyright a recipe, you can patent the ingredients themselves, especially if you had a hand in doing R&D to create it. See PepsiCo sues four Indian farmers for using its patented Lay’s potatoes.
No, you cannot patent an ingredient. What you can do - under Indian law - is get “protection” for a plant variety. In this case, a potato.
That law is called Protection of Plant Varieties and Farmers’ Rights Act, 2001. The farmer in this case being PepsiCo, which is how they successfully sued these 4 Indian farmers.
Farmers’ Rights for PepsiCo against farmers. Does that seem odd?
I’ve never met an intellectual property freak who didn’t lie through his teeth.
I think there is some confusion here between copyright and patent, similar in concept but legally exclusive. A person can copyright the order and selection of words used to express a recipe, but the recipe itself is not copyrightable. It can, however, fall under patent law if proven to be unique enough, which is difficult to prove.
So you can technically own the patent to a recipe keeping other companies from selling the product of a recipe, however anyone can make the recipe themselves, if you can acquire it and not resell it. However that recipe can be expressed in many different ways, each having their own copyright.
Yes, that’s exactly the point. It should belong to humanity, which means that anyone can use it to improve themselves. Or to create something nice for themselves or others. That’s exactly what AI companies are doing. And because it is not stealing, it is all still there for anyone else. Unless, of course, the copyrightists get their way.
How do you feel about Meta and Microsoft who do the same thing but publish their models open source for anyone to use?
Well, how long do you think that’s going to last? They are for-profit companies after all.
I mean we’re having a discussion about what’s fair, my inherent implication is whether or not that would be a fair regulation to impose.
I feel like it’s less meaningful because we don’t have access to the datasets.
Those aren’t open source, neither by the OSI’s Open Source Definition nor by the OSI’s Open Source AI Definition.
The important part for the latter being a published listing of all the training data. (Trainers don’t have to provide the data, but they must provide at least a way to recreate the model given the same inputs).
Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.
They are model-available if anything.
For the purposes of this conversation, that’s pretty much just a pedantic difference. They are paying to train those models and then providing them to the public to use completely freely in any way they want.
It would be like developing open source software and then not calling it open source because you didn’t publish the market research that guided your UX decisions.
You said open source. Open source is a type of licensure.
The entire point of licensure is legal pedantry.
And as far as your metaphor is concerned, pre-trained models are closer to pre-compiled binaries, which are expressly not considered Open Source according to the OSD.
You said open source. Open source is a type of licensure.
The entire point of licensure is legal pedantry.
No. Open source is a concept. That concept also has pedantic legal definitions, but the concept itself is not inherently pedantic.
And as far as your metaphor is concerned, pre-trained models are closer to pre-compiled binaries, which are expressly not considered Open Source according to the OSD.
No, they’re not. Which is why I didn’t use that metaphor.
A binary is explicitly a black box. There is nothing to learn from a binary, unless you explicitly decompile it back into source code.
In this case, literally all the source code is available. Any researcher can read through their model, learn from it, copy it, twist it, and build their own version of it wholesale. Not providing the training data is more similar to saying that Yuzu or an emulator isn’t open source because it doesn’t provide copyrighted games. It is providing literally all of the parts of it that it can open source, and then letting the user feed it whatever training data they are allowed access to.
Tell me you’ve never compiled software from open source without saying you’ve never compiled software from open source.
The only differences between open source and freeware are pedantic, right guys?
Tell me you’ve never developed software without telling me you’ve never developed software.
A closed source binary that is copyrighted and illegal to use is totally the same thing as all the trained weights and underlying source code for a neural network published under the MIT license that anyone can learn from, copy, and use however they want, right guys?
You drank the kool-aid.
I thought the larger point was that they’re using plenty of sources that do not lie in the public domain. Like if I download a textbook to read for a class instead of buying it - I could be prosecuted for stealing. And they’ve downloaded and read millions of books without paying for them.
Like if I download a textbook to read for a class instead of buying it - I could be prosecuted for stealing
Ehh, no, almost certainly not (but it does depend on your local laws). That honestly just sounds like some corporate bogeyman to prevent you from pirating their books. The person hosting the download, if they did not have the rights to publicize it freely, would possibly be prosecuted though.
To illustrate, there’s this story of John Cena who sold a special Ford after signing a contract with Ford to explicitly forbid him from doing that. However, the person who bought the car was never prosecuted or sued, because they received the car from Cena with no strings attached. They couldn’t be held responsible for Cena’s break of contract, but Cena was held personally responsible by Ford.
For physical goods there is ‘theft by proxy’ though (receiving stolen goods that you know are most likely stolen), but that quite certainly doesn’t apply to digital, copyable goods. To even access any kind of information on the internet, you have to download it, and thus copy it.
And they've downloaded and read millions of books without paying for them.
Do you have a source on that?
Most AI models used Books3 as part of their dataset which is a collection of pirated books. Here are a few articles talking about it:
https://www.theverge.com/2024/8/20/24224450/anthropic-copyright-lawsuit-pirated-books-ai
https://www.theatlantic.com/technology/archive/2023/08/books3-ai-meta-llama-pirated-books/675063/
Thank you
I finally understand Trump supporters “Fuck it, burn it all to the ground cause we can’t win” POV. Only instead of democracy, it is copyright and instead of Trump, it is AI.
Are the models that OpenAI creates open source? I don’t know enough about LLMs, but if ChatGPT wants exemptions from the law, it should result in a public good (emphasis on public).
Nothing about OpenAI is open-source. The name is a misdirection.
If you use my IP without my permission and profit from it, then that is IP theft, whether or not you republish a plagiarized version.
So I guess every reaction and review on the internet that is ad supported or behind a paywall is theft too?
No, we have rules on fair use and derivative works. Sometimes they fall on one side, sometimes another.
Fair use by humans.
There is no fair use by computers, otherwise we couldn’t have piracy laws.
The STT (speech to text) model that they created is open source (Whisper) as well as a few others:
Those aren’t open source, neither by the OSI’s Open Source Definition nor by the OSI’s Open Source AI Definition.
The important part for the latter being a published listing of all the training data. (Trainers don’t have to provide the data, but they must provide at least a way to recreate the model given the same inputs).
Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.
They are model-available if anything.
I did a quick check on the license for Whisper:
Whisper’s code and model weights are released under the MIT License. See LICENSE for further details.
So that definitely meets the Open Source Definition on your first link.
And it looks like it also meets the definition of open source as per your second link.
Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of the paper, as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.
Whisper’s code and model weights are released under the MIT License. See LICENSE for further details. So that definitely meets the Open Source Definition on your first link.
Model weights by themselves do not qualify as “open source”, as the OSAID qualifies. Weights are not source.
Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of the paper, as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.
This is not training data. These are testing metrics.
Edit: additionally, assuming you might have been talking about the link to the research paper. It’s not published under an OSD license. If it were this would qualify the model.
I don’t understand. What’s missing from the code, model, and weights provided to make this “open source” by the definition of your first link? it seems to meet all of those requirements.
As for the OSAID, the exact training dataset is not required, per your quote, they just need to provide enough information that someone else could train the model using a “similar dataset”.
Oh and for the OSAID part, the only issue stopping Whisper from being considered open source as per the OSAID is that the information on the training data is published through arxiv, so using the data as written could present licensing issues.
Ok, but the most important part of that research paper is published on the github repository, which explains how to provide audio data and text data to recreate any STT model in the same way that they have done.
See the “Approach” section of the github repository: https://github.com/openai/whisper?tab=readme-ov-file#approach
And the Training Data section of their github: https://github.com/openai/whisper/blob/main/model-card.md#training-data
With this you don’t really need to use the paper hosted on arxiv, you have enough information on how to train/modify the model.
There are guides on how to Finetune the model yourself: https://huggingface.co/blog/fine-tune-whisper
Which, from what I understand on the link to the OSAID, is exactly what they are asking for. The ability to retrain/finetune a model fits this definition very well:
The preferred form of making modifications to a machine-learning system is:
- Data information […]
- Code […]
- Weights […]
All 3 of those have been provided.
The problem with just shipping AI model weights is that they run up against the issue of point 2 of the OSD:
The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
AI models can’t be distributed purely as source because they are pre-trained. It’s the same as distributing pre-compiled binaries.
It’s the entire reason the OSAID exists:
- The OSD doesn’t fit because it requires you distribute the source code in a non-preprocessed manner.
- AIs can’t necessarily distribute the training data alongside the code that trains the model, so in order to help bridge the gap the OSI made the OSAID - as long as you fully document the way you trained the model so that somebody that has access to the training data you used can make a mostly similar set of weights, you fall within the OSAID
Edit: also the information about the training data has to be published in an OSD-equivalent license (such as Creative Commons) so that using it doesn’t cause licensing issues with research paper print companies (like arxiv)
OpenAI does not publish their models openly. Other companies like Microsoft and Meta do.
I wouldn’t say I’m on OAI’s side here, but I’m down to eliminate copyright. New economic models will emerge, especially if more creatives unionize.
Here’s an experiment for you to try at home. Ask an AI model a question, copy a sentence or two of what they give back, and paste it into a search engine. The results may surprise you.
And stop comparing AI to humans but then giving AI models more freedom. If I wrote a paper I’d need to cite my sources. Where the fuck are your sources ChatGPT? Oh right, we’re not allowed to see that but you can take whatever you want from us. Sounds fair.
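For anyone who wants to try the copy-and-search experiment above without a search engine, here is a rough sketch of the same idea in Python: compare a model's output against a text you already have and list the word sequences they share. The function names, the sample strings, and the window size of 5 words are all my own assumptions for illustration, and a hit only shows overlapping wording, not where the model got it from.

```python
def word_ngrams(text, n):
    """Return the set of all n-word sequences in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(output, reference, n=5):
    """Return the n-word sequences that appear in both texts."""
    return word_ngrams(output, n) & word_ngrams(reference, n)


# Hypothetical example: a public-domain reference line and a made-up "model output".
reference = "it was the best of times it was the worst of times"
output = "as the novel says it was the best of times for some"

overlap = shared_ngrams(output, reference, n=5)
for gram in sorted(overlap):
    print(" ".join(gram))
```

A larger `n` makes coincidental matches less likely, at the cost of missing short quoted phrases.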
It’s not a breach of copyright or other IP law not to cite sources on your paper.
Getting your paper rejected for lacking sources is also not infringing on your freedom. Being forced to pay damages and delete your paper from any public space would be infringement of your freedom.
I’m pretty sure that it’s true that citing sources isn’t really relevant to copyright violation, either you are violating or not. Saying where you copied from doesn’t change anything, but if you are using some ideas with your own analysis and words it isn’t a violation either way.
With music this often ends up in civil court. Pretty sure the same can in theory happen for written texts, but the commercial value of most written texts is not worth the cost of litigation.
I mean, you’re not necessarily wrong. But that doesn’t change the fact that it’s still stealing, which was my point. Just because laws haven’t caught up to it yet doesn’t make it any less of a shitty thing to do.
The original source material is still there. They just made a copy of it. If you think that’s stealing then online piracy is stealing as well.
Well they make a profit off of it, so yes. I have nothing against piracy, but if you’re reselling it that’s a different story.
But piracy saves you money which is effectively the same as making a profit. Also, it’s not just that they’re selling other people’s work for profit. You’re also paying for the insane amount of computing power it takes to train and run the AI plus salaries of the workers etc.
It’s not stealing, it’s not even ‘piracy’ which also is not stealing.
Copyright laws need to be scaled back, to not criminalize socially accepted behavior, not expand.
When I analyze a melody I play on a piano, I see that it reflects the music I heard that day or sometimes, even music I heard and liked years ago.
Having parts similar or a part that is (coincidentally) identical to a part from another song is not stealing and does not infringe upon any law.
You guys are missing a fundamental point. Copyright was created to protect an author for a specific amount of time so somebody else doesn’t profit from their work, essentially stealing their deserved revenue.
LLM AI was created to do exactly that.
Can you just give us the TL;DR?
AI Chat bots copy/paste much of their “training data” verbatim.
This is the catch with OPs entire statement about transformation. Their premise is flawed, because the next most likely token is usually the same word the author of a work chose.
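To make the "next most likely token" point concrete, here is a toy sketch in Python. It is a simple word-pair counter, not how a real LLM works internally, and the function names and the one-sentence corpus are hypothetical. It only illustrates the narrow case being described: when a predictor has seen a passage and always picks the most frequent follower, it can walk straight back through the original wording. Whether and how often large models trained on huge, varied corpora do this is exactly what the thread is arguing about, and this sketch does not settle it.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def greedy_continue(followers, start, max_words):
    """Repeatedly append the single most frequent follower (greedy decoding)."""
    out = [start]
    for _ in range(max_words):
        options = followers.get(out[-1])
        if not options:
            break  # no known follower, so stop generating
        out.append(options.most_common(1)[0][0])
    return " ".join(out)


# Hypothetical one-sentence "training set" in which every word is unique.
corpus = "the quick brown fox jumps over a lazy dog"
model = train_bigrams(corpus)

result = greedy_continue(model, "the", 8)
print(result == corpus)
```

With more training text and repeated words, the counts for each word spread over several followers and the output stops tracking any single source so closely.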
And that’s kinda my point. I understand that transformation is totally fine but these LLMs literally copy and paste shit. And that’s still if you are comparing AI to people which I think is completely ridiculous. If anything these things are just more complicated search engines with half the usefulness. If I search online about how to change a tire I can find some reliable sources to do so. If I ask AI how to change a tire it would just spit something out that might not even be accurate and I’d have to search again afterwards just to make sure what it told me was even accurate.
It’s just a word calculator based on information stolen from people without their consent. It has no original thought process so it has no way to transform anything. All it can do is copy and paste in different combinations.
Not to fully argue against your point, but I do want to push back on the citations bit. Given the way an LLM is trained, it’s not really close to equivalent to me citing papers researched for a paper. That would be more akin to asking me to cite every piece of written or verbal media I’ve ever encountered as they all contributed in some small way to the way that the words were formulated here.
Now, if specific data were injected into the prompt, or maybe if it was fine-tuned on a small subset of highly specific data, I would agree those should be cited as they are being accessed more verbatim. The whole “magic” of LLMs was that it needed to cross a threshold of data, combined with the attentional mechanism, and then the network was pretty suddenly able to maintain coherent sentence structure. It was only with loads of varied data from many different sources that this really emerged.
I personally am down for this punch-up between Alphabet and Sony. Microsoft v. Disney.
🍿
Surely it’s coming. We have the music publishing cartel vs Suno already.
I hear you about the cheese bro.