Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.
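As a toy illustration of that abstraction step (emphatically not how production LLMs work), here is a sketch in which a sentence is reduced to a fixed-size frequency vector; the exact wording and word order are discarded, so the original text cannot be recovered from the vector:

```python
# Toy sketch: reduce text to a fixed-size numeric vector.
# The verbatim text is discarded; only aggregate statistics survive.
from collections import Counter

def toy_embed(text, vocab):
    """Map text to normalized word frequencies over a fixed vocabulary."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in vocab]

vocab = ["the", "rain", "sun", "and"]
v = toy_embed("The rain and the sun", vocab)
# v is [0.4, 0.2, 0.2, 0.2]: word order and out-of-vocabulary
# words are gone, so the sentence cannot be reconstructed from v.
```

Real models use learned, high-dimensional embeddings rather than word counts, but the one-way, lossy nature of the transformation is the point being made above.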

This is fundamentally different from copying a book or song. It’s more like the long-standing artistic tradition of being influenced by others’ work. The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.

Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.

While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.

For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744

  • Shanedino@lemmy.world · 4 months ago

    Maybe if you would pay for training data they would let you use copyright data or something?

    • Shanedino@lemmy.world · 4 months ago

      The other part of it is they broke the rules so they need to face the consequences. They are asking for forgiveness and in this case I don’t think they deserve it.

    • andrew_bidlaw@sh.itjust.works · 4 months ago

      Their business strategy is built on the assumption they won’t. They don’t want this door opened at all. It was a great deal for Google to buy Reddit’s data for some $mil., because it is a huge collection behind one entity. Now imagine negotiating with each individual site owner whose resources they scraped.

      If that had been how it started, the development of these AI tools could have been much slower, because (1) data would be added to the pile only after an agreement, (2) higher expenses would mean less money for hardware expansion, and (3) investors and companies would be less hyped about the whole thing because it wouldn’t grow like a mushroom cloud while following legal procedures. Also, (4) the ability to investigate and collect a public list of which sites they have agreements with would be pretty damning, generating its own news stories and conflicts.

    • T156@lemmy.world · 4 months ago

      Had the company paid for the training data and/or left it as voluntary, there would be less of a problem with it to begin with.

      Part of the problem is that they didn’t, but are still using it for commercial purposes.

  • dhork@lemmy.world · 4 months ago

    Bullshit. AI are not human. We shouldn’t treat them as such. AI are not creative. They just regurgitate what they are trained on. We call what it does “learning”, but that doesn’t mean we should elevate what they do to be legally equal to human learning.

    It’s this same kind of twisted logic that makes people think Corporations are People.

    • masterspace@lemmy.ca · 4 months ago

      Ok, ignore this specific company and technology.

      In the abstract, if you wanted to make artificial intelligence, how would you do it without using the training data that we humans use to train our own intelligence?

      We learn by reading copyrighted material. Do we pay for it? Sometimes. Sometimes a teacher read it a while ago and then just regurgitated basically the same copyrighted information back to us in a slightly changed form.

      • Geobloke@lemm.ee · 4 months ago

        And that’s all paid for. Think how much has been invested in just the average high school graduate; AI companies want all that, but for free.

        • masterspace@lemmy.ca · 4 months ago

          It’s not though.

          A huge amount of what you learn, someone else paid for, then they taught that knowledge to the next person, and so on. By the time you learned it, it had effectively been pirated and copied by human brains several times before it got to you.

          Literally anything you learned from a Reddit comment or a Stack Overflow post for instance.

          • Geobloke@lemm.ee · 4 months ago

            If only there were a profession that exchanges knowledge for money. Someone who “teaches.” I wonder who would pay them.

      • doctortran@lemm.ee · 4 months ago

        We learn by reading copyrighted material.

        We are human beings. The comparison is false on its face because what you all are calling AI isn’t in any conceivable way comparable to the complexity and versatility of a human mind, yet you continue to spit this lie out, over and over again, trying to play it up like it’s Data from Star Trek.

        This model isn’t “learning” anything in any way that is even remotely like how humans learn. You are deliberately simplifying the complexity of the human brain to make that comparison.

        Moreover, human beings make their own choices, they aren’t actual tools.

        They pointed a tool at copyrighted works and told it to copy, do some math, and regurgitate it. What the AI “does” is not relevant, what the people that programmed it told it to do with that copyrighted information is what matters.

        There is no intelligence here except theirs. There is no intent here except theirs.

        • drosophila@lemmy.blahaj.zone · 4 months ago

          This model isn’t “learning” anything in any way that is even remotely like how humans learn. You are deliberately simplifying the complexity of the human brain to make that comparison.

          I do think the complexity of artificial neural networks is overstated. A real neuron is a lot more complex than an artificial one, and real neurons are not simply feed forward like ANNs (which have to be because they are trained using back-propagation), but instead have their own spontaneous activity (which kinda implies that real neural networks don’t learn using stochastic gradient descent with back-propagation). But to say that there’s nothing at all comparable between the way humans learn and the way ANNs learn is wrong IMO.
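          For readers unfamiliar with the terms above, here is a minimal sketch of what a feed-forward unit trained by gradient descent looks like: a single artificial “neuron” (far simpler than anything biological, and far simpler than a real ANN) fitting a line to data by repeatedly nudging its weights against the error.

          ```python
          # Minimal feed-forward + gradient-descent sketch: one artificial
          # "neuron" (a linear unit) fit to data by stochastic gradient descent.
          def train_neuron(data, lr=0.1, epochs=200):
              w, b = 0.0, 0.0
              for _ in range(epochs):
                  for x, y in data:
                      pred = w * x + b   # forward pass
                      err = pred - y     # gradient of squared error w.r.t. pred
                      w -= lr * err * x  # backward pass: adjust weights
                      b -= lr * err
              return w, b

          # Learns an approximation of y = 2x + 1 from three examples.
          w, b = train_neuron([(0, 1), (1, 3), (2, 5)])
          ```

          The contrast drawn above is that real neurons have spontaneous activity and recurrent connections, whereas this whole scheme only works because the signal flows strictly forward and the error strictly backward.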

          If you read books such as V.S. Ramachandran and Sandra Blakeslee’s Phantoms in the Brain or Oliver Sacks’ The Man Who Mistook His Wife For a Hat you will see lots of descriptions of patients with anosognosia brought on by brain injury. These are people who, for example, are unable to see but also incapable of recognizing this inability. If you ask them to describe what they see in front of them they will make something up on the spot (in a process called confabulation) and not realize they’ve done it. They’ll tell you what they’ve made up while believing that they’re telling the truth. (Vision is just one example, anosognosia can manifest in many different cognitive domains).

          It is V.S Ramachandran’s belief that there are two processes that occur in the Brain, a confabulator (or “yes man” so to speak) and an anomaly detector (or “critic”). The yes-man’s job is to offer up explanations for sensory input that fit within the existing mental model of the world, whereas the critic’s job is to advocate for changing the world-model to fit the sensory input. In patients with anosognosia something has gone wrong in the connection between the critic and the yes man in a particular cognitive domain, and as a result the yes-man is the only one doing any work. Even in a healthy brain you can see the effects of the interplay between these two processes, such as with the placebo effect and in hallucinations brought on by sensory deprivation.

          I think ANNs in general and LLMs in particular are similar to the yes-man process, but lack a critic to go along with it.

          What implications does that have on copyright law? I don’t know. Real neurons in a petri dish have already been trained to play games like DOOM and control the yoke of a simulated airplane. If they were trained instead to somehow draw pictures what would the legal implications of that be?

          There’s a belief that laws and political systems are derived from some sort of deep philosophical insight, but I think most of the time they’re really just whatever works in practice. So, what I’m trying to say is that we can just agree that what OpenAI does is bad and should be illegal without having to come up with a moral imperative that forces us to ban it.

        • masterspace@lemmy.ca · 4 months ago

          We are human beings. The comparison is false on its face because what you all are calling AI isn’t in any conceivable way comparable to the complexity and versatility of a human mind, yet you continue to spit this lie out, over and over again, trying to play it up like it’s Data from Star Trek.

          If you fundamentally do not think that artificial intelligences can be created, the onus is on you to explain why it’s impossible to replicate the circuitry of our brains. Everything in science we’ve seen thus far has shown that we are merely physical beings that can be recreated physically.

          Otherwise, I asked you to examine a thought experiment where you are trying to build an artificial intelligence, not necessarily an LLM.

          This model isn’t “learning” anything in any way that is even remotely like how humans learn. You are deliberately simplifying the complexity of the human brain to make that comparison.

          Or you are overcomplicating yourself to seem more important and special. Definitely no way that most people would be biased towards that, is there?

          Moreover, human beings make their own choices, they aren’t actual tools.

          Oh please do go ahead and show us your proof that free will exists! Thank god you finally solved that one! I heard people were really stressing about it for a while!

          They pointed a tool at copyrighted works and told it to copy, do some math, and regurgitate it. What the AI “does” is not relevant, what the people that programmed it told it to do with that copyrighted information is what matters.

          “I don’t know how this works but it’s math and that scares me so I’ll minimize it!”

          • pmc@lemmy.blahaj.zone · 4 months ago

            If we have an AI that’s equivalent to humanity in capability of learning and creative output/transformation, it would be immoral to just use it as a tool. At least that’s how I see it.

            • masterspace@lemmy.ca · 4 months ago

              I think that’s a huge risk, but we’ve only ever seen a single, very specific type of intelligence, our own / that of animals that are pretty closely related to us.

              Movies like Ex Machina and Her do a good job of pointing out that there is nothing that inherently means that an AI will be anything like us, even if they can appear that way or pass at tasks.

              It’s entirely possible that we could develop an AI that was so specifically trained that it would provide the best script editing notes but be incapable of anything else for instance, including self reflection or feeling loss.

      • Wiz@midwest.social · 4 months ago

        The thing is, they can have scads of free stuff that is not copyrighted. But they are greedy and want copyrighted stuff, too.

        • masterspace@lemmy.ca · 4 months ago

          We all should. Copyright is fucking horseshit.

          It costs literally nothing to make a digital copy of something. There is ZERO reason to restrict access to things.

          • ContrarianTrail@lemm.ee · 4 months ago

            Making a copy is free. Making the original is not. I don’t expect a professional photographer to hand out their work for free because making copies of it costs nothing. You’re not paying for the copy, you’re paying for the money and effort needed to create the original.

            • masterspace@lemmy.ca · 4 months ago

              Making a copy is free. Making the original is not.

              Yes, exactly. Do you see how that is different from the world of physical objects and energy? That is not the case for a physical object. Even once you design something and build a factory to produce it, the first item off the line takes the same amount of resources as the last one.

              Capitalism is based on the idea that things are scarce. If I have something, you can’t have it, and if you want it, then I have to give up my thing, so we end up trading. Information does not work that way. We can freely copy a piece of information as much as we want. Which is why monopolies and capitalism are a bad system of rewarding creators. They inherently cause us to impose scarcity where there is no need for it, because in capitalism things that are abundant do not have value. Capitalism fundamentally fails to function when there is abundance of resources, which is why copyright was a dumb system for the digital age. Rather than recognize that we now live in an age of information abundance, we spend billions of dollars trying to impose artificial scarcity.

          • Wiz@midwest.social · 4 months ago

            You sound like someone who has not tried to make an artistic creation for profit.

              • Wiz@midwest.social · 4 months ago

                Better system for WHOM? Tech-bros that want to steal my content as their own?

                I’m a writer, performing artist, designer, and illustrator. I have thought about copyright quite a bit. I have released some of my stuff into the public domain, as well as the Creative Commons. If you want to use my work, you may - according to the licenses that I provide.

                I also think copyright law is way out of whack. It should go back to - at most - life of author. This “life of author plus 95 years” is ridiculous. I lament that so much great work is being lost or forgotten because of the oppressive copyright laws - especially in the area of computer software.

                But tech-bros that want my work to train their LLMs - they can fuck right off. There are legal thresholds that constitute “fair use” - Is it used for an academic purpose? Is it used for a non-profit use? Is the portion that is being used a small part or the whole thing? LLM software fails all of these tests.

                They can slurp up the entirety of Wikipedia, and they do. But they are not satisfied with the free stuff. But they want my artistic creations, too, without asking. And they want to sell something based on my work, making money off of my work, without asking.

                • masterspace@lemmy.ca · 4 months ago

                  Better system for WHOM? Tech-bros that want to steal my content as their own?

                  A better system for EVERYONE. One where we all have access to all creative works, rather than spending billions on engineers and lawyers to create walled gardens and DRM and artificial scarcity. What if literally all the money we spent on all of that instead went to artist royalties?

                  But tech-bros that want my work to train their LLMs - they can fuck right off. There are legal thresholds that constitute “fair use” - Is it used for an academic purpose? Is it used for a non-profit use? Is the portion that is being used a small part or the whole thing? LLM software fail all of these tests.

                  No. It doesn’t.

                  They can literally pass all of those tests.

                  You are confusing OpenAI keeping their LLM closed source and charging access to it, with LLMs in general. The open source models that Microsoft and Meta publish for instance, pass literally all of the criteria you just stated.

  • The ingredient thing is a bit amusing, because that’s basically how one of the major fast food chains got to be so big (I can’t remember which one it was ATM though; just that it wasn’t McDonald’s). They cut out the middle-man and just bought their own farm to start growing the vegetables and later on expanded to raising the animals used for the meat as well.

  • macrocephalic@lemmy.world · 4 months ago

    It’s an interesting area. Are they suggesting that a human reading copyright material and learning from it is a breach?

  • lettruthout@lemmy.world · 4 months ago

    If they can base their business on stealing, then we can steal their AI services, right?

    • LibertyLizard@slrpnk.net · 4 months ago

      Pirating isn’t stealing but yes the collective works of humanity should belong to humanity, not some slimy cabal of venture capitalists.

      • WaxedWookie@lemmy.world · 4 months ago

        Unlike regular piracy, accessing “their” product hosted on their servers using their power and compute is pretty clearly theft. Morally correct theft that I wholeheartedly support, but theft nonetheless.

        • LibertyLizard@slrpnk.net · 4 months ago

          Is that how this technology works? I’m not the most knowledgeable about tech stuff honestly (at least by Lemmy standards).

          • WaxedWookie@lemmy.world · 4 months ago

            There’s self-hosted LLMs, (e.g. Ollama), but for the purposes of this conversation, yeah - they’re centrally hosted, compute intensive software services.
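            As a hypothetical sketch (model name assumed; endpoint and fields taken from Ollama’s documented defaults), this is roughly what talking to a self-hosted model looks like. It only builds the request body, since actually sending it requires a running local server:

            ```python
            # Sketch of a request to a self-hosted Ollama server's
            # /api/generate endpoint (default port 11434).
            import json

            def build_generate_request(model, prompt):
                """Build the JSON body Ollama's /api/generate endpoint expects."""
                return json.dumps({"model": model, "prompt": prompt, "stream": False})

            body = build_generate_request("llama3", "Why is the sky blue?")
            # With a local server running, you would send it with e.g.:
            #   curl http://localhost:11434/api/generate -d "$body"
            ```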

        • ProstheticBrain@sh.itjust.works · 4 months ago

          Ingredients to a recipe may well be subject to copyright, which is why food writers make sure their recipes are “unique” in some small way - enough to make them different enough to avoid accusations of direct plagiarism.

          E: removed unnecessary snark

          • General_Effort@lemmy.world · 4 months ago

            In what country is that?

            Under US law, you cannot copyright recipes. You can own a specific text in which you explain the recipe. But anyone can write down the same ingredients and instructions in a different way and own that text.

              • General_Effort@lemmy.world · 3 months ago

                No, you cannot patent an ingredient. What you can do - under Indian law - is get “protection” for a plant variety. In this case, a potato.

                That law is called Protection of Plant Varieties and Farmers’ Rights Act, 2001. The farmer in this case being PepsiCo, which is how they successfully sued these 4 Indian farmers.

                Farmers’ Rights for PepsiCo against farmers. Does that seem odd?

                I’ve never met an intellectual property freak who didn’t lie through his teeth.

          • oxomoxo@lemmy.world · 4 months ago

            I think there is some confusion here between copyright and patent - similar in concept but legally distinct. A person can copyright the order and selection of words used to express a recipe, but the recipe itself is not copyrightable; it can, however, fall under patent law if proven to be unique enough, which is difficult to prove.

            So you can technically own the patent to a recipe keeping other companies from selling the product of a recipe, however anyone can make the recipe themselves, if you can acquire it and not resell it. However that recipe can be expressed in many different ways, each having their own copyright.

      • General_Effort@lemmy.world · 4 months ago

        Yes, that’s exactly the point. It should belong to humanity, which means that anyone can use it to improve themselves. Or to create something nice for themselves or others. That’s exactly what AI companies are doing. And because it is not stealing, it is all still there for anyone else. Unless, of course, the copyrightists get their way.

    • masterspace@lemmy.ca · 4 months ago

      How do you feel about Meta and Microsoft who do the same thing but publish their models open source for anyone to use?

      • lettruthout@lemmy.world · 4 months ago

        Well, how long do you think that’s going to last? They are for-profit companies after all.

        • masterspace@lemmy.ca · 4 months ago

          I mean, we’re having a discussion about what’s fair; my implicit question is whether that would be a fair regulation to impose.

      • umbrella@lemmy.ml · 4 months ago

        I feel like it’s less meaningful because we don’t have access to the datasets.

      • WalnutLum@lemmy.ml · 4 months ago

        Those aren’t open source, neither by the OSI’s Open Source Definition nor by the OSI’s Open Source AI Definition.

        The important part for the latter being a published listing of all the training data. (Trainers don’t have to provide the data, but they must provide at least a way to recreate the model given the same inputs).

        Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.

        They are model-available if anything.

        • masterspace@lemmy.ca · 4 months ago

          For the purposes of this conversation, that’s pretty much just a pedantic difference. They are paying to train those models and then providing them to the public to use completely freely in any way they want.

          It would be like developing open source software and then not calling it open source because you didn’t publish the market research that guided your UX decisions.

          • WalnutLum@lemmy.ml · 4 months ago

            You said open source. Open source is a type of licensure.

            The entire point of licensure is legal pedantry.

            And as far as your metaphor is concerned, pre-trained models are closer to pre-compiled binaries, which are expressly not considered Open Source according to the OSD.

            • masterspace@lemmy.ca · 3 months ago

              You said open source. Open source is a type of licensure.

              The entire point of licensure is legal pedantry.

              No. Open source is a concept. That concept also has pedantic legal definitions, but the concept itself is not inherently pedantic.

              And as far as your metaphor is concerned, pre-trained models are closer to pre-compiled binaries, which are expressly not considered Open Source according to the OSD.

              No, they’re not. Which is why I didn’t use that metaphor.

              A binary is explicitly a black box. There is nothing to learn from a binary, unless you explicitly decompile it back into source code.

              In this case, literally all the source code is available. Any researcher can read through their model, learn from it, copy it, twist it, and build their own version of it wholesale. Not providing the training data is more similar to saying that Yuzu or an emulator isn’t open source because it doesn’t provide copyrighted games. It is providing literally all of the parts that it can open source, and then letting the user feed it whatever training data they are allowed access to.

          • Arcka@midwest.social · 4 months ago

            Tell me you’ve never compiled software from open source without saying you’ve never compiled software from open source.

            The only differences between open source and freeware are pedantic, right guys?

            • masterspace@lemmy.ca · 3 months ago

              Tell me you’ve never developed software without telling me you’ve never developed software.

              A closed source binary that is copyrighted and illegal to use is totally the same thing as all the trained weights and underlying source code for a neural network published under the MIT license that anyone can learn from, copy, and use however they want, right guys?

  • makyo@lemmy.world · 4 months ago

    I thought the larger point was that they’re using plenty of sources that do not lie in the public domain. Like if I download a textbook to read for a class instead of buying it - I could be prosecuted for stealing. And they’ve downloaded and read millions of books without paying for them.

  • randon31415@lemmy.world · 4 months ago

    I finally understand Trump supporters “Fuck it, burn it all to the ground cause we can’t win” POV. Only instead of democracy, it is copyright and instead of Trump, it is AI.

  • fancyl@lemmy.world · 4 months ago

    Are the models that OpenAI creates open source? I don’t know enough about LLMs, but if ChatGPT wants exemptions from the law, it should result in a public good (emphasis on public).

    • graycube@lemmy.world · 4 months ago

      Nothing about OpenAI is open-source. The name is a misdirection.

      If you use my IP without my permission and profit it from it, then that is IP theft, whether or not you republish a plagiarized version.

      • dariusj18@lemmy.world · 4 months ago

        So I guess every reaction and review on the internet that is ad-supported or behind a paywall is theft too?

        • RicoBerto@lemmy.blahaj.zone · 4 months ago

          No, we have rules on fair use and derivative works. Sometimes they fall on one side, sometimes another.

          • InvertedParallax@lemm.ee · 4 months ago

            Fair use by humans.

            There is no fair use by computers, otherwise we couldn’t have piracy laws.

      • WalnutLum@lemmy.ml
        link
        fedilink
        English
        arrow-up
        7
        ·
        edit-2
        4 months ago

        Those aren’t open source, neither by the OSI’s Open Source Definition nor by the OSI’s Open Source AI Definition.

        The important part for the latter being a published listing of all the training data. (Trainers don’t have to provide the data, but they must at least provide a way to recreate the model given the same inputs.)

        Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.

        They are model-available if anything.

        • QuadratureSurfer@lemmy.world
          link
          fedilink
          English
          arrow-up
          1
          arrow-down
          1
          ·
          4 months ago

          I did a quick check on the license for Whisper:

          Whisper’s code and model weights are released under the MIT License. See LICENSE for further details.

          So that definitely meets the Open Source Definition on your first link.

          And it looks like it also meets the definition of open source as per your second link.

          Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of the paper, as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.

          • WalnutLum@lemmy.ml
            link
            fedilink
            English
            arrow-up
            5
            ·
            edit-2
            4 months ago

            Whisper’s code and model weights are released under the MIT License. See LICENSE for further details. So that definitely meets the Open Source Definition on your first link.

            Model weights by themselves do not qualify as “open source” under the OSAID. Weights are not source.

            Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of the paper, as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.

            This is not training data. These are testing metrics.

            Edit: additionally, assuming you might have been talking about the link to the research paper. It’s not published under an OSD license. If it were this would qualify the model.

            • QuadratureSurfer@lemmy.world
              link
              fedilink
              English
              arrow-up
              1
              arrow-down
              1
              ·
              4 months ago

              I don’t understand. What’s missing from the code, model, and weights provided to make this “open source” by the definition in your first link? It seems to meet all of those requirements.

              As for the OSAID, the exact training dataset is not required, per your quote, they just need to provide enough information that someone else could train the model using a “similar dataset”.

              • WalnutLum@lemmy.ml
                link
                fedilink
                English
                arrow-up
                3
                ·
                4 months ago

                Oh, and for the OSAID part: the only issue stopping Whisper from being considered open source per the OSAID is that the information on the training data is published through arXiv, so using it as written could present licensing issues.

                • QuadratureSurfer@lemmy.world
                  link
                  fedilink
                  English
                  arrow-up
                  1
                  ·
                  4 months ago

                  Ok, but the most important part of that research paper is published on the github repository, which explains how to provide audio data and text data to recreate any STT model in the same way that they have done.

                  See the “Approach” section of the github repository: https://github.com/openai/whisper?tab=readme-ov-file#approach

                  And the Training Data section of their github: https://github.com/openai/whisper/blob/main/model-card.md#training-data

                  With this you don’t really need to use the paper hosted on arxiv, you have enough information on how to train/modify the model.

                  There are guides on how to Finetune the model yourself: https://huggingface.co/blog/fine-tune-whisper

                  Which, from what I understand on the link to the OSAID, is exactly what they are asking for. The ability to retrain/finetune a model fits this definition very well:

                  The preferred form of making modifications to a machine-learning system is:

                  • Data information […]
                  • Code […]
                  • Weights […]

                  All 3 of those have been provided.

              • WalnutLum@lemmy.ml
                link
                fedilink
                English
                arrow-up
                2
                ·
                edit-2
                4 months ago

                The problem with just shipping AI model weights is that they run up against the issue of point 2 of the OSD:

                The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

                AI models can’t be distributed purely as source because they are pre-trained. It’s the same as distributing pre-compiled binaries.

                It’s the entire reason the OSAID exists:

                1. The OSD doesn’t fit because it requires you distribute the source code in a non-preprocessed manner.
                2. AIs can’t necessarily distribute the training data alongside the code that trains the model, so in order to help bridge the gap the OSI made the OSAID - as long as you fully document the way you trained the model so that somebody that has access to the training data you used can make a mostly similar set of weights, you fall within the OSAID

                Edit: also, the information about the training data has to be published under an OSD-equivalent license (such as Creative Commons) so that using it doesn’t cause licensing issues with preprint services (like arXiv)

    • masterspace@lemmy.ca
      link
      fedilink
      English
      arrow-up
      7
      arrow-down
      1
      ·
      4 months ago

      OpenAI does not publish their models openly. Other companies like Microsoft and Meta do.

  • TunaCowboy@lemmy.world
    link
    fedilink
    English
    arrow-up
    2
    arrow-down
    3
    ·
    4 months ago

    I wouldn’t say I’m on OAI’s side here, but I’m down to eliminate copyright. New economic models will emerge, especially if more creatives unionize.

  • TommySoda@lemmy.world
    link
    fedilink
    English
    arrow-up
    181
    arrow-down
    15
    ·
    edit-2
    4 months ago

    Here’s an experiment for you to try at home. Ask an AI model a question, copy a sentence or two of what they give back, and paste it into a search engine. The results may surprise you.

    And stop comparing AI to humans while also giving AI models more freedom. If I wrote a paper I’d need to cite my sources. Where the fuck are your sources, ChatGPT? Oh right, we’re not allowed to see that, but you can take whatever you want from us. Sounds fair.

    • azuth@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      17
      arrow-down
      35
      ·
      4 months ago

      It’s not a breach of copyright or other IP law to not cite sources in your paper.

      Getting your paper rejected for lacking sources is also not infringing on your freedom. Being forced to pay damages and delete your paper from any public space would be an infringement of your freedom.

      • explore_broaden@midwest.social
        link
        fedilink
        English
        arrow-up
        12
        arrow-down
        2
        ·
        4 months ago

        I’m pretty sure citing sources isn’t really relevant to copyright violation: either you are violating or you’re not. Saying where you copied from doesn’t change anything, but if you are using some ideas with your own analysis and words, it isn’t a violation either way.

        • Eatspancakes84@lemmy.world
          link
          fedilink
          English
          arrow-up
          3
          ·
          4 months ago

          With music this often ends up in civil court. Pretty sure the same can in theory happen for written texts, but the commercial value of most written texts is not worth the cost of litigation.

      • TommySoda@lemmy.world
        link
        fedilink
        English
        arrow-up
        6
        arrow-down
        7
        ·
        4 months ago

        I mean, you’re not necessarily wrong. But that doesn’t change the fact that it’s still stealing, which was my point. Just because laws haven’t caught up to it yet doesn’t make it any less of a shitty thing to do.

        • ContrarianTrail@lemm.ee
          link
          fedilink
          English
          arrow-up
          3
          arrow-down
          5
          ·
          edit-2
          4 months ago

          The original source material is still there. They just made a copy of it. If you think that’s stealing then online piracy is stealing as well.

          • TommySoda@lemmy.world
            link
            fedilink
            English
            arrow-up
            2
            arrow-down
            1
            ·
            4 months ago

            Well they make a profit off of it, so yes. I have nothing against piracy, but if you’re reselling it that’s a different story.

            • ContrarianTrail@lemm.ee
              link
              fedilink
              English
              arrow-up
              3
              ·
              4 months ago

              But piracy saves you money which is effectively the same as making a profit. Also, it’s not just that they’re selling other people’s work for profit. You’re also paying for the insane amount of computing power it takes to train and run the AI plus salaries of the workers etc.

        • azuth@sh.itjust.works
          link
          fedilink
          English
          arrow-up
          6
          arrow-down
          4
          ·
          4 months ago

          It’s not stealing. It’s not even ‘piracy’, which also is not stealing.

          Copyright laws need to be scaled back so they don’t criminalize socially accepted behavior, not expanded.

        • Octopus1348@lemy.lol
          link
          fedilink
          English
          arrow-up
          7
          arrow-down
          4
          ·
          4 months ago

          When I analyze a melody I play on a piano, I see that it reflects the music I heard that day or sometimes, even music I heard and liked years ago.

          Having similar parts, or a part that is (coincidentally) identical to a part of another song, is not stealing and does not infringe upon any law.

          • takeda@lemmy.world
            link
            fedilink
            English
            arrow-up
            4
            arrow-down
            2
            ·
            4 months ago

            You guys are missing a fundamental point. Copyright was created to protect an author for a specific amount of time so that somebody else doesn’t profit from their work, essentially stealing their deserved revenue.

            LLM AI was created to do exactly that.

    • fmstrat@lemmy.nowsci.com
      link
      fedilink
      English
      arrow-up
      3
      ·
      4 months ago

      This is the catch with OP’s entire statement about transformation. Their premise is flawed, because the next most likely token is usually the same word the author of the original work chose.
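      The “next most likely token” behavior can be sketched with a toy bigram table (purely illustrative; a real LLM stores probabilities in neural-network weights, not a lookup table). If the model has effectively memorized a training phrase, greedy decoding replays it word for word:

      ```python
      # Toy bigram "model": for each token, the single most likely successor,
      # as might be learned by heavily memorizing one training sentence.
      bigram = {"the": "quick", "quick": "brown", "brown": "fox"}

      def greedy_decode(prompt, steps):
          tokens = list(prompt)
          for _ in range(steps):
              # Greedy decoding always picks the most probable next token;
              # here the table stores only that argmax choice.
              nxt = bigram.get(tokens[-1])
              if nxt is None:
                  break
              tokens.append(nxt)
          return tokens

      print(greedy_decode(["the"], 3))  # replays the memorized phrase
      ```

      Real systems sample from a probability distribution rather than always taking the top token, but the point stands: when one continuation dominates, the output tracks the source text closely.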

      • TommySoda@lemmy.world
        link
        fedilink
        English
        arrow-up
        10
        arrow-down
        2
        ·
        4 months ago

        And that’s kinda my point. I understand that transformation is totally fine, but these LLMs literally copy and paste shit. And that’s still if you are comparing AI to people, which I think is completely ridiculous. If anything these things are just more complicated search engines with half the usefulness. If I search online for how to change a tire I can find some reliable sources. If I ask AI how to change a tire it will just spit something out that might not even be accurate, and I’d have to search again afterwards just to make sure what it told me was accurate.

        It’s just a word calculator based on information stolen from people without their consent. It has no original thought process so it has no way to transform anything. All it can do is copy and paste in different combinations.

    • PixelProf@lemmy.ca
      link
      fedilink
      English
      arrow-up
      7
      ·
      4 months ago

      Not to fully argue against your point, but I do want to push back on the citations bit. Given the way an LLM is trained, it’s not really equivalent to me citing papers researched for a paper. That would be more akin to asking me to cite every piece of written or verbal media I’ve ever encountered, as they all contributed in some small way to the way the words were formulated here.

      Now, if specific data were injected into the prompt, or maybe if it was fine-tuned on a small subset of highly specific data, I would agree those should be cited, as they are being accessed more verbatim. The whole “magic” of LLMs was that they needed to cross a threshold of data, combined with the attention mechanism, and then the network was pretty suddenly able to maintain coherent sentence structure. It was only with loads of varied data from many different sources that this really emerged.
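      For anyone curious, the attention mechanism mentioned above can be sketched as scaled dot-product attention (a minimal NumPy version; real transformers use multi-head attention with learned projection matrices on top of this):

      ```python
      import numpy as np

      def softmax(x, axis=-1):
          e = np.exp(x - x.max(axis=axis, keepdims=True))
          return e / e.sum(axis=axis, keepdims=True)

      def attention(Q, K, V):
          # scores: how strongly each query token attends to each key token
          scores = Q @ K.T / np.sqrt(K.shape[-1])
          weights = softmax(scores, axis=-1)  # each row sums to 1
          return weights @ V                  # weighted mix of value vectors

      # 3 tokens, embedding dimension 4 (random stand-ins for learned vectors)
      rng = np.random.default_rng(0)
      Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
      out = attention(Q, K, V)  # one context-mixed vector per token
      ```

      Each output row blends information from every token in the context, which is what lets the network keep sentences coherent across long spans.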