The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates

Those claiming AI training on copyrighted works is "theft" misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they're extracting general patterns and concepts - the "Bob Dylan-ness" or "Hemingway-ness" - not copying specific text or images.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in "vector space". When generating new content, the AI isn't recreating copyrighted works, but producing new expressions inspired by the concepts it's learned.

This is fundamentally different from copying a book or song. It's more like the long-standing artistic tradition of being influenced by others' work. The law has always recognized that ideas themselves can't be owned - only particular expressions of them.

Moreover, there's precedent for this kind of use being considered "transformative" and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.

While it's understandable that creators feel uneasy about this new technology, labeling it "theft" is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn't make the current use of copyrighted works for AI training illegal or unethical.

For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744

  • The whole point of copyright in the first place, is to encourage creative expression, so we can have human culture and shit.

    The idea of a "teensy" exception so that we can "advance" into a dark age of creative pointlessness and regurgitated slop, where humans doing the fun part has been made "unnecessary" by the unstoppable progress of "thinking" machines, would be hilarious, if it weren't depressing as fuck.

    • The whole point of copyright in the first place, is to encourage creative expression

      ...within a capitalistic framework.

      Humans are creative creatures and will express themselves regardless of economic incentives. We don't have to transmute ideas into capital just because they have "value".

      • Sorry buddy, but that capitalistic framework is where we all have to exist for the foreseeable future.

        Giving corporations more power is not going to help us end that.

      • You're not wrong.

        The kind of art humanity creates is skewed a lot by the need for it to be marketable, and then sold in order to be worth doing.

        But copyright is better than nothing, and this exemption would straight up be even worse than nothing.

      • I'd agree, but here's one issue with that: we live in reality, not in a post-capitalist dreamworld.

        Creativity takes up a lot of time from the individual, while a lot of us are already working two or even three jobs, all on top of art. A lot of us have to heavily compromise on a lot of things, or even give up our dreams because we don't have the time for that. Sure, you get the occasional "legendary metal guitarist practiced so much he even went to the toilet with a guitar", but many are so tired from their main job, they instead just give up.

        Developing a game while working a full-time job feels like crunching 24/7, while only around 4 hours a day actually go towards that goal, and that includes work done on my smartphone at my job. Others just outright give up. This shouldn't be the norm for up-and-coming artists.

      • Humans are indeed creative by nature, we like making things. What we don't naturally do is publish, broadcast and preserve our work.

        Society is iterative. What we build today, we build mostly out of what those who came before us built. We tell our versions of our forefathers' stories, we build new and improved versions of our forefathers' machines.

        A purely capitalistic society would have infinite copyright and patent durations, this idea is mine, it belongs to me, no one can ever have it, my family and only my family will profit from it forever. Nothing ever improves because improving on an old idea devalues the old idea, and the landed gentry can't allow that.

        A purely communist society immediately enters whatever anyone creates into the public domain. The guy who revolutionizes energy production making everyone's lives better is paid the same as a janitor. So why go through all the effort? Just sweep the floors.

        At least as designed, our idea of copyright is a compromise. If you have an idea, we will grant you a limited time to exclusively profit from your idea. You may allow others to also profit at your discretion; you can grant licenses, but that's up to you. After the time is up, your idea enters the public domain, and becomes the property and heritage of humanity, just like the Epic of Gilgamesh. Others are free to reproduce and iterate upon your ideas.

      • That’s the reason we got copyright, but I don’t think that’s the only reason we could want copyright.

        Two good reasons to want copyright:

        1. Accurate attribution
        2. Faithful reproduction

        Accurate attribution:

        Open source thrives on the notion that: if there’s a new problem to be solved, and it requires a new way of thinking to solve it, someone will start a project whose goal is not just to build new tools to solve the problem but also to attract other people who want to think about the problem together.

        If anyone can take the codebase and pretend to be the original author, that will splinter the conversation and degrade the ability of everyone to find each other and collaborate.

        In the past, this was pretty much impossible because you could check a search engine or social media to find the truth. But with enshittification and bots at every turn, that looks less and less guaranteed.

        Faithful reproduction:

        If I write a book and make some controversial claims, yet it still provokes a lot of interest, people might be inclined to publish slightly different versions to advance their own opinions.

        Maybe a version where I seem to be making an abhorrent argument, in an effort to mitigate my influence. Maybe a version where I make an argument that the rogue publisher finds more palatable, to use my popularity to boost their own arguments.

        This actually happened during the early days of publishing, by the way! It’s part of the reason we got copyright in the first place.

        And again, it seems like this would be impossible to get away with now, buuut… I’m not so sure anymore.

        Personally:

        I favor piracy in the sense that I think everyone has a right to witness culture even if they can’t afford the price of admission.

        And I favor remixing because the cultural conversation should be an active read-write two-way street, not just passive consumption.

        But I also favor some form of licensing, because I think we have a duty to respect the integrity of the work and the voice of the creator.

        I think AI training is very different from piracy. I’ve never downloaded a mega pack of songs and said to my friends “Listen to what I made!” I think anyone who compares OpenAI to pirates (favorably) is unwittingly helping the next set of feudal tech lords build a wall around the entirety of human creativity, and they won’t realize their mistake until the real toll booths open up.

    • The whole point of copyright in the first place, is to encourage creative expression, so we can have human culture and shit.

      I feel like that purpose has already been undermined by various changes to copyright law since its inception, such as DMCA and lengthening copyright term from 14 years to 95. Freedom to remix existing works is an important part of creative expression which current law stifles for any original work that releases in one person's lifespan. (Even Disney knew this: the animated Pinocchio movie wouldn't exist if copyright had lasted more than 56 years back then.)

      Either way, giving bots the 'right' to remix things that were just made less than a year ago while depriving humans the right to release anything too similar to a 94 year old work seems ridiculous on both ends.

  • This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.

    Like fuck it is. An LLM "learns" by memorization and by breaking down training data into their component tokens, then calculating the weight between these tokens. This allows it to produce an output that resembles (but may or may not perfectly replicate) its training dataset, but produces no actual understanding or meaning--in other words, there's no actual intelligence, just really, really fancy fuzzy math.

    Meanwhile, a human learns by memorizing training data, but also by parsing the underlying meaning and breaking it down into the underlying concepts, and then by applying and testing those concepts, and mastering them through practice and repetition. Where an LLM would learn "2+2 = 4" by ingesting tens or hundreds of thousands of instances of the string "2+2 = 4" and calculating a strong relationship between the tokens "2+2," "=," and "4," a human child would learn 2+2 = 4 by being given two apple slices, putting them next to another pair of apple slices, and counting the total number of apple slices to see that they now have 4 slices. (And then being given a treat of delicious apple slices.)
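The token-weight description above can be made concrete with a toy sketch (plain co-occurrence counts in Python; real LLMs learn these relationships with neural networks, but the statistical flavour is the same):

```python
from collections import Counter

# Count how often each token follows another in a tiny corpus --
# a crude stand-in for "calculating the weight between tokens".
corpus = "2 + 2 = 4 . 2 + 2 = 4 . 2 + 2 = 4".split()

pair_counts = Counter(zip(corpus, corpus[1:]))

# The "model" now strongly associates '=' with '4', with no notion
# of what addition actually means.
print(pair_counts[("=", "4")])  # 3
```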

    Similarly, a human learns to draw by starting with basic shapes, then moving on to anatomy, studying light and shadow, shading, and color theory, all the while applying each new concept to their work, and developing muscle memory to allow them to more easily draw the lines and shapes that they combine to form a whole picture. A human may learn from other people's drawings during the process, but at most they may process a few thousand images. Meanwhile, an LLM learns to "draw" by ingesting millions of images--without obtaining the permission of the person or organization that created those images--and then breaking those images down to their component tokens, and calculating weights between those tokens. There's about as much similarity between how an LLM "learns" compared to human learning as there is between my cat and my refrigerator.

    And YET FUCKING AGAIN, here's the fucking Google Books argument. To repeat: Google Books used a minimal portion of the copyrighted works, and was not building a service to compete with book publishers. Generative AI is using the ENTIRE COPYRIGHTED WORK for its training set, and is building a service TO DIRECTLY COMPETE WITH THE ORGANIZATIONS WHOSE WORKS THEY ARE USING. They have zero fucking relevance to one another as far as claims of fair use. I am sick and fucking tired of hearing about Google Books.

    EDIT: I want to make another point: I've commissioned artists for work multiple times, featuring characters that I designed myself. And pretty much every time I have, the art they make for me comes with multiple restrictions: for example, they grant me a license to post it on my own art gallery, and they grant me permission to use portions of the art for non-commercial uses (e.g. cropping a portion out to use as a profile pic or avatar). But they all explicitly forbid me from using the work I commissioned for commercial purposes--in other words, I cannot slap the art I commissioned on a T-shirt and sell it at a convention, or make a mug out of it. If I did so, that artist would be well within their rights to sue the crap out of me, and artists charge several times as much to grant a license for commercial use.

    In other words, there is already well-established precedent that even if something is publicly available on the Internet and free to download, there are acceptable and unacceptable use cases, and it's broadly accepted that using other peoples' work for commercial use without compensating them is not permitted, even if I directly paid someone to create that work myself.

  • Though I am not a lawyer by training, I have been involved in such debates personally and professionally for many years. This post is unfortunately misguided. Copyright law makes concessions for education and creativity, including criticism and satire, because we recognize the value of such activities for human development. Debates over the excesses of copyright in the digital age were specifically about humans finding the application of copyright to the internet and all things digital too restrictive for their educational, creative, and yes, also their entertainment needs. So any anti-copyright arguments back then were in the spirit specifically of protecting the average person and public-interest non-profit institutions, such as digital archives and libraries, from big copyright owners who would sue and lobby for total control over every file in their catalogue, sometimes in the process severely limiting human potential.

    AI’s ingesting of text and other formats is “learning” in name only, a term borrowed by computer scientists to describe a purely computational process. It does not hold the same value socially or morally as the learning that humans require to function and progress individually and collectively.

    AI is not a person (unless we get definitive proof of a conscious AI, or are willing to grant every implementation of a statistical model personhood). Also, AI is not vital to human development, and as such one could argue it does not need special protections or special treatment to flourish. AI is a product, even more clearly so when it is proprietary and sold as a service.

    Unlike past debates over copyright, this is not about protecting the little guy or organizations with a social mission from big corporate interests. It is the opposite. It is about big corporate interests turning human knowledge and creativity into a product they can then use to sell services to - and often to replace in their jobs - the very humans whose content they have ingested.

    See, the tables are now turned and it is time to realize that copyright law, for all its faults, has never been only or primarily about protecting large copyright holders. It is also about protecting your average Joe from unauthorized uses of their work. More specifically uses that may cause damage, to the copyright owner or society at large. While a very imperfect mechanism, it is there for a reason, and its application need not be the end of AI. There’s a mechanism for individual copyright owners to grant rights to specific uses: it’s called licensing and should be mandatory in my view for the development of proprietary LLMs at least.

    TL;DR: AI is not human, it is a product, one that may augment some tasks productively, but is also often aimed at replacing humans in their jobs - this makes all the difference in how we should balance rights and protections in law.

    • AI are people, my friend. /s

      But, really, I think people should be able to run algorithms on whatever data they want. It's whether the output is sufficiently different or "transformative" that matters (and other laws, like using people's likeness). Otherwise, I think the laws will get complex and nonsensical once you start adding special cases for "AI." And I'd bet if new laws are written, they'd be written by lobbyists to further erode the threat of competition (from free software, for instance).

    • What do you think "ingesting" means if not learning?

      Bear in mind that training AI does not involve copying content into its database, so copyright is not an issue. AI is simply predicting the next token/word based on statistics.

      You can train AI on a book and it will give you information from the book - information is not copyrightable. You can read a book and talk about its contents on TV - that's not illegal if you're a human, so should it be illegal if you're a machine?

      There may be moral issues with training on someone's hard-gathered knowledge, but there is no legislation against it. Reading books and using that knowledge to provide information is legal. If you try to outlaw automating this process by computers, there will be side effects, such as search engines no longer being able to index data.

      • Bear in mind that training AI does not involve copying content into its database, so copyright is not an issue.

        Wrong. The infringement is in obtaining the data and presenting it to the AI model during the training process. It makes no difference that the original work is not retained in the model's weights afterwards.

        You can train AI on a book and it will give you information from the book - information is not copyrightable. You can read a book and talk about its contents on TV - that’s not illegal if you’re a human, so should it be illegal if you’re a machine?

        Yes, because copyright law is intended to benefit human creativity.

        If you try to outlaw automating this process by computers, there will be side effects, such as search engines no longer being able to index data.

        Wrong. Search engines retain a minimal amount of the indexed website's data, and the purpose of the search engine is to generate traffic to the website, providing benefit for both the engine and the website (increased visibility, the opportunity to show ads to make money). Banning the use of copyrighted content for AI training (which uses the entire copyrighted work, and whose purpose is to replace the organizations whose work is being used) would have no effect on search indexing.
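The contrast drawn above can be sketched with a toy inverted index, which is roughly the kind of structure a search engine retains (hypothetical documents; real indexes also keep positions, snippets and rankings):

```python
# An inverted index maps each word to the documents containing it --
# enough to point you at the source, not enough to reconstruct it.
docs = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy brown dog",
}

index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

print(sorted(index["brown"]))  # ['doc1', 'doc2']
```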

  • Studied AI at uni. I'm also a cyber security professional. AI can be hacked or tricked into exposing training data. Therefore your claim about it disposing of the training material is totally wrong.

    Ask your search engine of choice what happened when Gippity was asked to print the word "book" indefinitely. Answer: it printed training material after printing the word book a couple hundred times.

    Also my main tutor in uni was a neuroscientist. Dude straight up told us that the current AI was only capable of accurately modelling something as complex as a dragonfly. For larger organisms it is nowhere near an accurate recreation of a brain. There are complexities in our brain chemistry that simply aren't accounted for in a statistical inference model, and definitely not in the current GPT models.

    • That knowledge is out of date and out of touch. While it's possible to expose small bits of training data, that's akin to someone being able to recall a portion of the memory of the scene they saw. However, those exercises essentially took weeks or months of interrogation techniques, built up over time by people looking to target specific types of responses. Think of it like a skilled police interrogator tricking a toddler out of one of their toys by threatening them or offering them something until it worked. Nowadays, that's getting far more difficult to do, and they're spending a lot more time and expertise to do it.

      Also, consider how complex a dragonfly is and how young this technology is. Very little in tech has ever progressed that fast. Give it five more years and come back to laugh at how naive your comment will seem.

      • Dammit, so my comment to the other person was a mix of a reply to this one and the last one... not having a good day for language processing, ironically.

        Specifically on the dragonfly thing, I don't think I'll believe myself naive for writing that post or this one. Dragonflies aren't very complex and only really have a few behaviours and inputs. We can accurately predict how they will fly. I brought up the dragonfly to mention the limitations of the current tech and concepts. Given the world's computing power and research investment, the best we can do is a dragonfly for intelligence.

        To be fair, scientists don't entirely understand neurons, and ML-designed neuron data structures behave similarly to very early ideas of what brains do, but they're based on concepts from the 1950s. There are different segments of the brain which process different things, and we sort of think we know what they all do, but most of the neuroscience AI is based on is honestly outdated. OpenAI seem to think if they stuff enough data into this language processor it will become sentient, and want an exemption from copyright law so they can be profitable, rather than actually improving the tech concepts and designs.

        Newer neuroscience research suggests neurons perform differently based on the brain chemicals present; they don't all always fire at every (or even most) input, and they usually present a train of thought, i.e. thoughts literally move around between the brain's areas. This is all very different to current ML implementations, and is frankly a good enough reason to suggest the tech has a lot of room to develop. I like the field of research and it's interesting to watch it develop, but they can honestly fuck off telling people they need free access to the world's content.

        TL;DR: dragonflies aren't that complex and the tech has way more room to grow. However, they have to generate revenue to keep going, so they're selling a large inference machine that relies on all of humanity's content to generate the wrong answer to 2+2.

    • Your first point is misguided and incorrect. If you've ever learned something by 'cramming', a.k.a. just repeatedly ingesting material until you remember it completely, you know you don't need the book in front of you anymore to write the material down verbatim in a test. You still discarded your training material despite knowing the exact contents. If this was all the AI could do, it would indeed be an infringement machine. But you said it yourself: you need to trick the AI to do this. It's not made to do this, but certain sentences are indeed almost certain to show up with the right conditioning. Which is something anyone using an AI should be aware of, and avoid that kind of conditioning. (Which in practice often just means: don't ask the AI to make something infringing.)

      • I think you're anthropomorphising the tech tbh. It's not a person or an animal, it's a machine, and 'cramming' doesn't translate to neural networks. They're a mathematical calculation over a vast multidimensional matrix, effectively solving a polynomial of an unimaginable order. So "cramming" as you put it doesn't work, because by definition an LLM cannot forget information: once it's applied the calculations, it is in there forever. That information is supposed to be blended together. Overfitting is the closest thing to what you're describing, which would be inputting similar information (training data) and performing similar calculations throughout the network, and it would therefore exhibit poor performance should it be asked to do anything different to the training.

        What I'm arguing over here is language rather than a system, so let's do that and note the flaws. If we're being intellectually honest, we can agree that a flaw like reproducing large portions of a work doesn't represent true learning and shows a reliance on the training data, i.e. it can't learn unless it has seen similar data before, and certain inputs give a chance it just parrots back the training data.

        In the example (repeat book over and over), it has statistically inferred that those are all the correct words to repeat in that order based on the prompt. This isn't akin to anything human, people can't repeat pages of text verbatim like this and no toddler can be tricked into repeating a random page from a random book as you say. The data is there, it's encoded and referenced when the probability is high enough. As another commenter said, language itself is a powerful tool of rules and stipulations that provide guidelines for the machine, but it isn't crafting its own sentences, it's using everyone else's.

        Also, calling it "tricking the AI" isn't really intellectually honest either, as in "it was tricked into exposing it still has the data encoded". We can state it isn't preferred or intended behaviour (an exploit of the system), but the system, under certain conditions, exhibits reuse of the training data and the ability to replicate it almost exactly (plagiarism). Therefore it is factually wrong to state that it doesn't keep the training data in a usable format - which was my original point. This isn't "cramming", this is encoding and reusing data that was not created by the machine or the programmer; this is other people's work that it is reproducing as its own. It does this constantly, from reusing StackOverflow code and comments to copying tutorials on how to do things. I was showing a case where it won't even modify the wording, but it reproduces articles and programs in their structure and their format. This isn't originality, creativity or anything that it is marketed as. It is storing, encoding and copying information to reproduce in a slightly different format.

        EDITS: Sorry for all the edits. I mildly changed what I said and added some extra points so it was a little more intelligible and didn't make the reader go "WTF is this guy on about". Not doing well in the written department today, so this was largely gobbledegook before, but hopefully it is a little clearer what I am saying.
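The memorization effect argued above can be sketched with a toy next-token table (a Python sketch; real LLMs are neural networks rather than lookup tables, but the same regurgitation appears for text repeated often in training data):

```python
from collections import defaultdict

# Train a next-token table on a single passage: every context it has
# seen has exactly one stored continuation, so "generation" replays
# the training data verbatim.
passage = "it was the best of times it was the worst of times".split()

next_token = defaultdict(list)
for a, b, c in zip(passage, passage[1:], passage[2:]):
    next_token[(a, b)].append(c)

# Generate from the first two words, consuming continuations in order.
out = list(passage[:2])
while next_token[tuple(out[-2:])]:
    out.append(next_token[tuple(out[-2:])].pop(0))

print(" ".join(out))  # reproduces the training passage word for word
```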

  • "This process is akin to how humans learn... The AI discards the original text, keeping only abstract representations..."

    Now I sail the high seas myself, but I don't think Paramount Studios would buy anyone's defence they were only pirating their movies so they can learn the general content so they can produce their own knockoff.

    Yes artists learn and inspire each other, but more often than not I'd imagine they consumed that art in an ethical way.

    • Now I sail the high seas myself, but I don't think Paramount Studios would buy anyone's defence they were only pirating their movies so they can learn the general content so they can produce their own knockoff.

      However, Paramount, itself, does pirate content specifically to learn its content so it can produce its own knockoff. As do all the other major studios.

      No one engages in IP enforcement in good faith, or respects the IP of others if they can find benefit in circumventing it.

      That's part of the problem. None of the key stakeholders (other than the biggest stakeholder, the public) are interested in preserving the interests of the creators, artists and developers, rather are interested in boosting their own profit gains.

      Which makes this not about big companies stealing from human artists, but about them raiding the IP holdings of their own kin.

      Yes, Generative AI very much does borrow liberally from the work of human creatives. But those artists mostly signed away their rights long ago to their publishing house masters. Since the ownership class controlled the presses, those contracts were far from fair.

      Artists, today, routinely see their art stolen by their own publishing houses at length, and it's embittering and soul-crushing. We've seen Hollywood accounting come into play throughout the last century. Famous actors are notoriously cheated out of residuals. (With the rise of the internet, and prior to that a few smart agents, we've seen a small but growing number of — usually pirate-friendly — exceptions.)

      The artists were screwed long before AI ever came around.

      Instead this fight is about IP-holding companies slugging it out with big computing companies, a kaiju match that is likely to leave Tokyo (that is, the rest of us, creators and consumers alike) in ruin. But we're already in squalor, anyway.

    • Now I sail the high seas myself, but I don’t think Paramount Studios would buy anyone’s defence they were only pirating their movies so they can learn the general content so they can produce their own knockoff.

      We don't know exactly how they source their data (and that is definitely shady), but if I can gain access to a movie in a legal way, I don't see why I would not be able to gather statistics from said movie, including running a speech-to-text model to caption it, then making statistics of how many times a few words were used, and which ones followed them. This is an oversimplified explanation of what an LLM does, but it's the fairest I can come up with, and it would be legal to do so. The models are always orders of magnitude smaller than the data they are trained on.

      That said, I don't imply that I'm happy with the state of high-tech companies, the AI hype, the energy consumption, or the impact on ordinary people. But I've put a lot of thought into this (and into learning about machine learning for real), and I think this is not an ML problem, but a problem in the economic, legal and political system. The AI hype is just a symptom.

  • Okay, that's just stupid. I'm really fond of AI, but that's just common greed.

    "Free the Serfs?! We can't survive without their labor!!" "Stop Child labour?! We can't survive without them!" "40 Hour Work Week?! We can't survive without their 16 Hour work Days!"

    If you can't make profit yet, then fucking stop.

  • As someone who researched AI pre-GPT to enhance human creativity and aid in creative workflows, it's sad for me to see the direction it's been marketed in, but I'm not surprised. I'm personally excited by the tech because I see a really positive place for it where the data usage is arguably justified, but we need to break through the current applications of it, which seem more aimed at stock prices and wow-factoring the public than at using these models for what they're best at.

    The whole exciting part of these models was that they could convert unstructured inputs into natural language and structured outputs: translation tasks (broad definition of translation), extracting key data points from unstructured data, language tasks. They're outstanding for the NLP tasks we struggled with previously, and these tasks are highly transformative of any input, since they rely purely on structural patterns. I think few people would argue NLP tasks are infringing on the copyright owner.

    But I can at least see how moving in the direction of (particularly with MoE approaches) using Q&A data to support generating Q&A outputs, media data to support generating media outputs, and code data to support generating code moves into the territory of affecting sales and using someone's IP to compete against them. From a technical perspective, I understand how LLMs are not really copying, but the way they are marketed and tuned seems more and more intended to use people's data to compete against them, which is dubious at best.

  • When AI systems ingest copyrighted works, they're extracting general patterns and concepts - the "Bob Dylan-ness" or "Hemingway-ness" - not copying specific text or images.

    Okay.

    • I'm confused exactly what you're saying here. It does seem from your experiment that if you specifically ask it to, Chat GPT can reproduce selected pieces of copyrighted creative works verbatim, but what's your point? You posted the screenshots underneath a quote about how AI systems extract patterns from works rather than copying them so I guess you want to show that it can at times in fact just copy things despite this seeming claim to the opposite, but the fact that you prompted the system to do it seems to kind of dilute this point a bit. In any case, it's not just reproducing the work, it's producing output that is relevant to your naturally phrased English language input, and selecting which particular passage in a way that is specifically relevant to the way your input was phrased and also adding additional output aside from the quoted passage which is also relevant and unique to the prompt.

      The developers make the analogy of a person being influenced by works in the creation of their own, which is considered acceptable. If you asked Bob Dylan to cite a passage from a work by Hemingway, and he successfully remembered such a passage and recited it to you verbatim in the correct context, followed by an explanation of why it's a good passage to have selected, you wouldn't take that exchange as proof that Bob Dylan was never really 'influenced' by anything but was instead just cobbling together the work of others when he produces his music. If anything, it would likely be regarded as a mark of how well read Bob Dylan must be that he could remember the passage so accurately and choose one that so successfully fits the brief of your request. I don't typically want to leap to the defence of these AI models that wholesale take in so much creative work and mechanistically re-assemble it without compensation or input from the artist, but I wouldn't pretend this isn't an issue with at least a little nuance to it, and I can't see what these screenshots prove.

      • OpenAI is arguing "we're not using copyrighted works in a way which would require us to pay anything, the machine is merely extrapolating patterns".

        But then it does go on to quote materials verbatim, which shows it's not "just" 'extracting patterns'.

        If I were to put up a service called "quote a book" or something, and it just had a non-AI bot which would — when given the book and pages — quote copyrighted works, would that be okay for me to make money on, without paying anyone I'm quoting? Even if they started to use my service to literally copy entire books?

        Why are you defending massive corporations who could just pay up? Isn't the whole "corporations putting profits over anything" thing a bit... seen already?

      • My point is, that the following statement is not entirely correct:

        When AI systems ingest copyrighted works, they’re extracting general patterns and concepts [...] not copying specific text or images.

        One obvious flaw in that sentence is the blanket statement about AI systems. There are huge differences between different realms of AI, and failing to address those, even with a brief mention, undermines the author's factual correctness. For example, there is a plethora of non-generative AIs - those that don't generate text, audio, or images/videos, but merely operate as classifiers or clustering algorithms, for instance - which are, without further modification, not intended to replicate data similar to their inputs but rather to provide insights.
        However, I can overlook this, as the author might simply not have thought about it at the moment of writing.

        Next:
        While it is true that transformer models like ChatGPT learn patterns - the most likely next token given a sequence of contextually coherent data - it is, given the right context, not unlikely that they reproduce their training data nearly or even completely identically, as I've demonstrated before. The less data is available for a specific context to generalise from, the more likely it becomes that the model simply replicates its training data. This is in principle fine, because it is what such models are designed to do: draw the best possible conclusions from the available data to predict the next output in a sequence. (That's one of the reasons they need such an insane amount of data to be trained on.)
        This can ultimately lead to occurrences of indeed "copying specific texts or images".
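        To make that concrete, here's a toy sketch I made up purely for illustration - a word-level bigram counter, nothing like a real LLM: when a "niche" context appears exactly once in the training data, greedy next-token prediction has nothing to generalise from and can only replay the source verbatim.

        ```python
        from collections import Counter, defaultdict

        def train_bigrams(words):
            """Count which word follows which in the training text."""
            counts = defaultdict(Counter)
            for a, b in zip(words, words[1:]):
                counts[a][b] += 1
            return counts

        def greedy_generate(counts, start, n):
            """Always pick the most frequent next word (greedy decoding)."""
            out = [start]
            for _ in range(n):
                followers = counts.get(out[-1])
                if not followers:
                    break
                out.append(followers.most_common(1)[0][0])
            return " ".join(out)

        # A (made-up) sentence seen exactly once in training: every word
        # has exactly one observed successor, so generation can only
        # reproduce the training sentence verbatim.
        niche = "the quantum widget calibrates itself using inverse flux"
        model = train_bigrams(niche.split())
        print(greedy_generate(model, "the", 7))  # prints the niche sentence
        ```

        With a larger and more varied corpus, the follower counts for common words spread out and generation diverges from any single source - the same effect, at vastly larger scale, as a transformer becoming less likely to memorise when more data covers a given context.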

        but the fact that you prompted the system to do it seems to kind of dilute this point a bit

        It doesn't matter whether I directly prompted it. I set the right context to achieve this kind of behaviour, because context matters most for transformer models; directly prompting it to do that was just an easy way of setting the required context. I've occasionally observed ChatGPT replicating identical sentences from (copyright-protected) scientific literature when I used it to get an overview of a specific topic while also having books or papers about it on hand. The latter again demonstrates that transformers become more likely to replicate training data the more "specific" a context becomes, i.e., the less training data is available for that context relative to others.

  • Those claiming AI training on copyrighted works is "theft" misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves.

    Sure.

    When AI systems ingest copyrighted works, they're extracting general patterns and concepts - the "Bob Dylan-ness" or "Hemingway-ness" - not copying specific text or images.

    Not really. Sure, they take input and garble it up, and that is "transformative" - but so is a human watching a TV series on a pirate site, for example. Hell, even when it's educational, it's treated as a copyright violation.

    This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.

    Perhaps (I'm not an AI expert). But, as the law currently stands, only living, breathing persons can be educated, so the "educational" fair-use protection doesn't stand.

    The AI discards the original text, keeping only abstract representations in "vector space". When generating new content, the AI isn't recreating copyrighted works, but producing new expressions inspired by the concepts it's learned.

    It does and it doesn't discard the original. It isn't impossible to recreate the original, since all the data it gobbled up gets stored somewhere in some shape or form and can be faithfully recreated (at least judging by a few comments below and news reports). So AI can and does recreate (duplicate, or perhaps distribute) copyrighted works.

    Besides, for a copyright violation, "substantial similarity" is needed, not one-for-one reproduction.

    This is fundamentally different from copying a book or song.

    Again, not really.

    It's more like the long-standing artistic tradition of being influenced by others' work.

    Sure. Except when it isn't, and the AI pumps out the original or something close enough to it.

    The law has always recognized that ideas themselves can't be owned - only particular expressions of them.

    I'd be careful with the "always" part. There was a famous case involving Katy Perry where a single chord was sued over as copyright infringement. The case was thrown out on appeal, but I do not doubt that some pretty wild cases have been upheld as copyright violations (see "patent troll").

    Moreover, there's precedent for this kind of use being considered "transformative" and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.

    The problem is that Google Books only lets you search for some phrase and have it pop up as being from source xy. It doesn't reproduce the work (other than maybe the page the phrase was on) - well, it does have the capability, since it's in the index somewhere, but there are checks in place to make sure that doesn't happen, which seem yet to be achieved in AI.

    While it's understandable that creators feel uneasy about this new technology, labeling it "theft" is both legally and technically inaccurate.

    Yes. Just as labeling piracy as theft is.

    We may need new ways to support and compensate creators in the AI age, but that doesn't make the current use of copyrighted works for AI training illegal or

    Yes, new legislation will be made to either let "Big AI" do as it pleases, or to prevent it from doing so. Or, as usual, it'll land somewhere in between and vary from jurisdiction to jurisdiction.

    However,

    that doesn't make the current use of copyrighted works for AI training illegal or unethical.

    this doesn't really stand. Sure, morals are debatable, and while I'd say it is more unethical than private piracy (i.e. no distribution), since distribution and dissemination are involved, you do not seem to feel the same.

    However, the law is clear. Private piracy - recording a song off the radio or a TV broadcast, screen-recording a Netflix movie, etc. - is legal. As is digitizing books and lending out the digital copy (as long as you have a physical copy, representing the legal "original", that isn't lent out at the same time). I think breaking DRM also isn't illegal (but someone please correct me if I'm wrong).

    The problem arises when the pirated content is copied and distributed in an uncontrolled manner, which AI seems to be capable of. That makes the AI owner liable for piracy if the AI reproduces not even identical but merely "substantially similar" output, just as much as hosts of "classic" pirated content distributed on the Web.

    Obligatory IANAL. As far as the law goes, I focused on US law, since the default country here is the US. Similar or different laws are on the books in other places, although most are in fact substantially similar. Also, what the legislators come up with will definitely vary from place to place, even more so than copyright law itself, since copyright law is partially harmonised (see the Berne Convention).

    • You made a lot of points here. Many I agree with, some I don't, but I specifically want to address this because it seems to be such a common misconception.

      It does and it doesn't discard the original. It isn't impossible to recreate the original, since all the data it gobbled up gets stored somewhere in some shape or form and can be faithfully recreated (at least judging by a few comments below and news reports). So AI can and does recreate (duplicate, or perhaps distribute) copyrighted works.

      AI stores original works the way a dictionary does. All the words are there, but the order and meaning are completely gone. An original work could be recreated by randomly selecting words from the dictionary, but it's unlikely.

      The thing that makes AI useful is that it understands the patterns words are typically used in. It orders words in the right way far more often than random chance. It knows "It was the best of" has a lot of likely options for the next word, but if it selects "times", it's far more likely to continue with "it was the worst of times", because that sequence of words is so ubiquitous thanks to references to the classic story. But over the course of following these word patterns, it will quickly glom onto a different pattern and create a wholly new work from the original "prompt."
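      That pattern-following idea can be sketched with a toy next-word counter - a deliberately crude stand-in for a real language model, with a corpus and function names of my own invention:

      ```python
      from collections import Counter, defaultdict

      text = ("it was the best of times it was the worst of times "
              "it was the age of wisdom it was the age of foolishness")
      words = text.split()

      # Count, for each word, how often each possible next word follows it.
      followers = defaultdict(Counter)
      for a, b in zip(words, words[1:]):
          followers[a][b] += 1

      def next_word_probs(word):
          """Relative frequency of each observed next word."""
          c = followers[word]
          total = sum(c.values())
          return {w: n / total for w, n in c.items()}

      print(next_word_probs("the"))  # 'age' 0.5, 'best' 0.25, 'worst' 0.25
      print(next_word_probs("of"))   # 'times' 0.5 - the most likely continuation
      ```

      Scaled up by many orders of magnitude, and with learned representations instead of raw counts, this is the kind of statistical pattern that lets a model favour "times" after "best of" while still being free to wander off into new continuations.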

      There are only two cases in which an original work is likely to be duplicated: either the training data is far too small and the model is overtrained on that particular work, or the work is the most derivative text imaginable, lacking any flair or originality.

      Adding more training data makes it less likely to recreate any original works.

      I am aware of examples where it was claimed an LLM reproduced entire code functions, including the original comments. That is either a case of overtraining, or far too many people were already copying that code verbatim into their own, making that work heavily over-represented in the training data (the same problem, except it was infringing developers who poisoned the data rather than researchers using bad training data).

      Bottom line: when created with enough data, no original works are stored in any way that allows faithful reproduction other than by chance so random that it's similar to rolling dice over a dictionary.

      None of this means AI can do no wrong, I just don't find the copyright claim compelling.

    • It's funny you mention the Katy Perry chord case, because Damien Riehl, who made the argument I referenced in my original post, actually talked about this exact case in the podcast I mentioned. He noted that Katy Perry was initially sued and a jury awarded $2.8 million over a very simple melody that appeared over 8,000 times in Riehl's dataset of generated melodies. However, after Riehl gave his TED talk about his "All the Music" project in early 2020, the judge reversed the jury verdict, saying the melody was unoriginal and therefore uncopyrightable.

      • Agreed.

        I didn't listen to the podcast, so I wouldn't know, but honestly, she was lucky. She's popular, and her publishers had an interest in the case (they'd lose out on profits if she lost). And she initially did lose. It was only because of the publicity of the case that the verdict was overturned (although money helped as well).

        This could've happened to any smaller artist, and it routinely happens with the patent trolls I pointed to. I don't have a lawsuit I can point to, but given the volume, one surely exists.

        Also, it's not as if I approve of the current state of copyright in the US (or EU for that matter).

        Originally, copyright was meant to protect the rights of the author, but over time it was bastardised into the concept we have today, where artists sign their rights over to publishers.

        So my proposal is this: if corporations like copyright, let them have it. I won't watch Disney movies outside of Disney+ - it's the system we've got and have to live with - so why not let the corporations feel it as well?

        Why should Google, which makes loads of money from those demonetizations on one side of the law, now be allowed to use the copyrighted works of others for profit, while Internet users in the US get a fine or their service cut for alleged copyright infringement, and those in Germany get a stern letter with a big fake fine?

        Big Tech shouldn't get to profit both from the false copyright infringement claims as well as getting to use the actual copyrighted content to generate a profit.

        This whole AI copyright situation is just a symptom of an ailing global copyright policy that needs to be fixed, and slapping an AI-free-for-all band-aid on top isn't a fix.

        My train of thought is this: if we don't let a simple AI exception into the books, either training AI on copyrighted content stays illegal, or the entire system gets a reimagining.

        If it stays the same, this won't mean much. Piracy sites and torrenting exist despite the current state of copyright law, and I don't see why AI couldn't exist in the same way. This has the huge plus of keeping AI out of the hands of Big Tech. Hopefully it also means it's harder for harmful uses of AI to be legal.

        Alternatively, we get a better copyright system for everyone, assuming it isn't made to only benefit the corporations.

    • I'd be careful with the "always" part. There was a famous case involving Katy Perry where a single chord was sued over as copyright infringement. The case was thrown out on appeal, but I do not doubt that some pretty wild cases have been upheld as copyright violations (see "patent troll").

      Are you really trying to argue against a point by providing evidence supporting it?

    • Half of your argument is just saying, "nu-uh" over and over again without any valid counterpoints.

  • The "you wouldn't download a car" statement was made against personal cases of piracy, and it got rightfully clowned upon. It obviously doesn't work at all when you use its ridiculousness to defend big-ass corporations that try to profit from so much of the stuff they "downloaded".

    Besides, it is not "theft". It is "plagiarism". And I'm glad to see that people who try to defend these plagiarism machines - which are being humanised and inflated into something they can never be - get clowned. It warms my heart.

  • The argument seen most commonly from people on the fediverse (which I happen to agree with) is really not about what current copyright laws and treaties say or how they should be interpreted, but about how people think things should be (even if that requires changing the laws to make it so).

    And it fundamentally comes down to economics - the study of how resources should be distributed. Apart from oligarchs and the wannabe oligarchs who serve as useful idiots for the real oligarchs, pretty much everyone wants a relatively fair and equal distribution of wealth amongst the people (differing between left and right in opinion on exactly how equal things should be, but there is still some common ground). Hardly anyone really wants serfdom or similar where all the wealth and power is concentrated in the hands of a few (obviously it's a spectrum of how concentrated, but very few people want the extreme position to the right).

    Depending on how things go, AI technologies have the power to serve humanity and lift everyone up equally if they are widely distributed, removing barriers and breaking existing 'moats' that let a few oligarchs hoard a lot of resources. Or it could go the other way - oligarchs are the only ones that have access to the state of the art model weights, and use this to undercut whatever they want in the economy until they own everything and everyone else rents everything from them on their terms.

    The first scenario is a utopia scenario, and the second is a dystopia, and the way AI is regulated is the fork in the road between the two. So of course people are going to want to cheer for regulation that steers towards the utopia.

    That means things like:

    • Fighting back when the oligarchs try to talk about 'AI Safety' meaning that there should be no Open Source models, and that they should tightly control how and for what the models can be used. The biggest AI Safety issue is that we end up in a dystopian AI-fueled serfdom, and FLOSS models and freedom for the common people to use them actually helps to reduce the chances of this outcome.
    • Not allowing 'AI washing', where oligarchs can take humanity's collective work, put it through an algorithm, and produce a competing thing that they control - unless everyone has equal access to it. One policy that would work for this would be that if you create a model based on other people's work and want to use that model for a commercial purpose, then you must publicly release the model and model weights. That would be a fair trade-off for letting them use that information for training purposes.

    Fundamentally, all of this is just exacerbating cracks in the copyright system as a policy. I personally think that a better system would look like this:

    • Everyone gets a Universal Basic Income paid, and every organisation and individual making profit pays taxes in to fund the UBI (in proportion to their profits).
    • All forms of intellectual property rights (except trademarks) are abolished - copyright, patents, and trade secrets are no longer enforced by the law. The UBI replaces it as compensation to creators.
    • It is illegal to discriminate against someone for publicly disclosing a work they have access to, as long as they didn't accept valuable consideration to make that disclosure. So for example, if an OpenAI employee publicly released the model weights for one of OpenAI's models without permission from anyone, it would be illegal for OpenAI to demote / fire / refuse to promote / pay them differently on that basis, and for any other company to factor that into their hiring decision. There would be exceptions for personally identifiable information (e.g. you can't release the client list or photos of real people without consequences), and disclosure would have to be public (i.e. not just to a competitor, it has to be to everyone) and uncompensated (i.e. you can't take money from a competitor to release particular information).

    If we had that policy, I'd be okay for AI companies to be slurping up everything and training model weights.

    However, with the current policies, it is pushing us towards the dystopic path where AI companies take what they want and never give anything back.
