2 authors say OpenAI 'ingested' their books to train ChatGPT. Now they're suing, and a 'wave' of similar court cases may follow.

L4sBot@lemmy.world · 2 years ago

trial_and_err@lemmy.world · 2 years ago

ChatGPT has entire books memorised. You can (or at least could, when I tried a few weeks back) make it print entire pages of, for example, Harry Potter.

kescusay@lemmy.world · 2 years ago

I think this is exposing a fundamental conceptual flaw in LLMs as they’re designed today. They can’t seem to simultaneously respect intellectual property / licensing and be useful.

Their current best use case - that is to say, a use case where copyright isn’t an issue - is dedicated instances trained on internal organization data. For example, Copilot Enterprise, which can be configured to use only the enterprise’s data, without any public inputs. If you’re only using your own data to train it, then copyright doesn’t come into play.

That’s been implemented where I work, and the best thing about it is that you get suggestions already tailored to your company’s coding style. And its suggestions improve the more you use it.

But AI for public consumption? Nope. Too problematic. In fact, public AI has been explicitly banned in our environment.

dhork@lemmy.world · 2 years ago

There’s an additional question: who holds the copyright on the output of an algorithm? I don’t think that is copyrightable at all. The bot doesn’t really add anything to the output, it’s just a fancy search engine. In the US, in particular, the agency in charge of copyrights has been quite insistent that a copyright can only be given to the output of a human.

So when an AI incorporates parts of copyrighted works into its output, how can that not be infringement?

cerevant@lemmy.world · 2 years ago

How can you write a blog post reviewing a book you read without copyright infringement? How can you post a plot summary to Wikipedia without copyright infringement?

I think these blanket conclusions about AI consuming content being automatically infringing are wrong. What is important is whether or not the output is infringing.

dhork@lemmy.world · edit-2 2 years ago

You can write that blog post because you are a human, and your summary qualifies for copyright protection, because it is the unique output of a human based on reading the copyrighted material.

But the US authorities are quite clear that a work that is purely AI generated can never qualify for copyright protection. Yet since it is based on the synthesis of works under copyright, it can’t really be considered public domain either. Otherwise you could ask the AI “Write me a summary of this book that has exactly the same number of words”, and likely get a direct copy of the book which is clear of copyright.

I think that these AI companies are going to face a reckoning, when it is ruled that they misappropriated all this content that they didn’t explicitly license for use, and all their output is just infringing by definition.

Whimsical@lemmy.world · 2 years ago

I’m expecting a much messier “resolution” that’ll look a lot like YouTube’s copyright situation - their product can be used for copyright infringement, and they’ll be required by law to try and take appropriate measures to prevent it, but will otherwise not be held liable as long as they can claim such measures are being taken.

Having an AI recite a long text to bypass copyright seems equivalent in my mind to uploading a full movie to YouTube. In both cases, some amount of moderation (itself increasingly algorithmic) is required to not only be applied, but actively developed and advanced to thwart efforts to bypass it. For instance, YouTube pirates will upload things with some superficial changes like a filter applied or showing the movie at a weird angle or mirrored to bypass copyright bots, which means the bots need to be more strict and better trained, or else YouTube once again becomes liable for knowing about these pirates and not stopping them.

The end result, just like with YouTube, will probably be that AI models have to have big, clunky algorithms applied against their outputs to recalculate or otherwise make copyright-safe anything that might remotely be an infringement. It’ll suck for normal users, pirates will still dig for ways to bypass it, and everyone will be unhappy. If YouTube is any indicator, this situation can somehow remain stable for over a decade - long enough for AI devs to release a new-generation bot to restart the whole issue.

Yaaaaaaaaay

cerevant@lemmy.world · 2 years ago

“But the US authorities are quite clear that a work that is purely AI generated can never qualify for copyright protection.”

Which law says this? The government is certainly discussing the problem, but I wasn’t aware of any legislation.

If there is such a law, it seems to overlook an important point: an algorithm - an AI - is itself an expression of human intelligence. Having a computer carry out an algorithm for summarizing content can be indistinguishable from a person having a pattern they follow for writing summaries.

whoisearth@lemmy.ca · edit-2 2 years ago

I have a post-consumerism pipe dream that one day we will collectively realize all the stupid shit we waste time and resources on is not worth it and we enter a future like Star Trek.

As a species we waste so much simply making sure that those less privileged, either by money or means, are not allowed to take from those with either. It’s stupid.

Edit - if we spent half the energy helping our brothers and sisters to succeed as we do keeping them down, the world would be a better place. And by helping them succeed I don’t mean money. Money is the lowest possible threshold.

totallynotarobot@lemmy.world · 2 years ago

Can’t reply directly to @OldGreyTroll@kbin.social because of that “language” bug, but:

The problem is that they then sell the notes in that database for giant piles of cash. Props to you if you’re profiting off your research the way OpenAI can profit off its model.

But yes, the lack of meat is an issue. If I read that article right, it’s not the one being contested here though. (IANAL and this is the only article I’ve read on this particular suit, so I may be wrong).

totallynotarobot@lemmy.world · 2 years ago

@owf@kbin.social can’t reply directly to you either, same language bug between lemmy and kbin.

That’s a great way to put it.

Frankly idc if it’s “technically legal,” it’s fucking slimy and desperately short-term. The aforementioned chuckleheads will doom our collective creativity for their own immediate gain if they’re not stopped.

MiddleWeigh@lemmy.world · 2 years ago

I was actually thinking about this the other day for some reason. AI scraping my own original stuff and doing whatever with it. I can see the concern and I’m curious where this goes and how a court would rule on a pretty technical topic like this.

phx@lemmy.ca · 2 years ago

If you’re doing research, there are actually some limits on the use of the source material and you’re supposed to be citing said sources.

But yeah, there’s plenty of stuff where there needs to be a firm line between what a random human can do versus an automated intelligent system with potentially unlimited memory/storage and processing power. A human can see where I am in public. An automated system can record it for permanent record. An integrated AI can tell you detailed information about my daily activities, including inferences, which - even if legal - is a pretty slippery slope.

Nevoic@lemmy.world · 2 years ago

Capitalism hit a massive roadblock with the dawn of the internet: information has a tendency to want to be free and easily accessible, but corporations need to own our productive output to maximize profits. In the age of the internet, our productive output more and more becomes our ideas and thoughts manifested as code or other forms of digital information.

Capitalists somewhat fought off the first wave of this, but AI will be a second and more challenging wave to overcome. I hope the capitalists fail and we don’t restrict the learning and power of AI so corporations can maximize profits again, but I recognize there’s a world where they successfully slow down or even entirely halt these learning systems and stop the technology from developing.

We already see people like Tucker Carlson calling for bans on AI because it’ll put people out of work. Of course, we should be trying to reduce the amount of work needed, but the natural tendency of capitalism in this environment is to maximize efficiency in favor of capital owners. Once workers aren’t needed anymore, the best thing (from a capitalist perspective) to do is let them starve in the streets instead of “giving them stuff for just existing”. We already live in a world where millions of people die from hunger a year, and almost a billion people are dangerously underfed, because global capitalism dictates these people don’t deserve enough food.

OldGreyTroll@kbin.social · 2 years ago

If I read a book to inform myself, put my notes in a database, and then write articles, it is called “research”. If I write a computer program to read a book to put the notes in my database, it is called “copyright infringement”. Is the problem that there just isn’t a meatware component? Or is it that the OpenAI computer isn’t doing a good enough job of following the “three references” rule to avoid plagiarism?

bioemerl@kbin.social · 2 years ago

Yeah. There are valid copyright claims because there are times that ChatGPT will reproduce stuff like code line for line over 10, 20, or 30 lines, which is really obviously a violation of copyright.

However, just pulling in a story from context and then summarizing it? That’s not a copyright violation, that’s a book report.

randomdude567@lemmy.world · edit-2 2 years ago

I don’t really understand why people are so upset by this. Except for people who train networks based on someone’s stolen art style, people shouldn’t be getting mad at this. OpenAI has practically the entire internet as its source, so GPT is going to have so much information that any specific author barely has an effect on the output. OpenAI isn’t stealing people’s art because they are not copying the artwork, they are using it to train models. Imagine getting sued for looking at reference artwork before creating artwork.

whereisk@lemmy.world · 2 years ago

Unless you grant personhood to those statistical inference models, the analogy falls flat.

We’re talking about a corporation using copyrighted data to feed their database to create a product.

If you were ever in a copyright negotiation you’d see that everything is relevant: intended use, audience size, sample size, projected income, length of usage, mode of transmission, quality etc.

They’ve negotiated none of it and worst of all they commercialised it. I’d consider that being in real trouble.

assassin_aragorn@lemmy.world · 2 years ago

Not to mention, if we’re going to judge them based on personhood, then companies need to be treating it like a person. They can’t have it both ways. Either pay it a fair human wage for its work, or it isn’t a person.

Frankly, the fact that the follow-up question would be “well what’s it going to do with the money?” tells us it isn’t a person.

mikeyBoy14@lemmy.world · 2 years ago

Pay what a fair wage? The GPU farm? 😂

Aviandelight@mander.xyz · 2 years ago

Can’t reply directly to @OldGreyTroll@kbin.social because of that “language” bug, as well. This is an interesting argument. I would imagine that the AI does not have the ability to follow plagiarism rules. Does it even credit sources? I’ve seen plenty of complaints from students getting in trouble because anti-cheating software flags their original work as plagiarism. More importantly, I really believe we need to take a firm stance on what is ethical to feed into ChatGPT. Right now it’s the wild west.