Microsoft, OpenAI sued for copyright infringement by nonfiction book authors in class action claim

The new copyright infringement lawsuit against Microsoft and OpenAI comes a week after The New York Times filed a similar complaint in New York.

  • fruitycoder@sh.itjust.works
    +29 / -4 · 8 months ago

    I wish the protections placed on corporate control of cultural and intellectual assets were placed on the average person’s privacy instead.

    Like I really don’t care that someone’s publicly available books and movies from the last century are analysed and used to create tools, but I do care that, without people’s actual knowledge, an intense surveillance apparatus is being built to collect every minute piece of data about their lives and the lives of those around them, to be sold without ethical oversight or consent.

    IP is bull, but privacy is a real concern. No one is going to use an extra copy of a NY Times article to hurt someone, but surveillance is used by authoritarians to oppress and harass innocent people.

  • bassomitron@lemmy.world
    +52 / -29 · 8 months ago (edited)

    I’m not a huge fan of Microsoft or even OpenAI by any means, but all these lawsuits just seem so… lazy and greedy?

    It isn’t like ChatGPT is just spewing out the entirety of their works in a single chat. In that context, I fail to see how seeing snippets of said work returned in a Google summary is any different than ChatGPT or any other LLM doing the same.

    Should OpenAI and other LLM creators use ethically sourced data in the future? Absolutely. They should’ve been doing so all along. But to me, these rich chumps like George R. R. Martin complaining that they felt their data was stolen without their knowledge and profited off of just feels a little ironic.

    Welcome to the rest of the 6+ billion people on the Internet who’ve been spied on, data mined, and profited off of by large corps for the last two decades. Where’s my god damn check? Maybe regulators should’ve put tougher laws and regulations in place long ago to protect all of us against this sort of shit, not just businesses and wealthy folk able to afford launching civil suits on shaky grounds. It’s not like deep learning models are anything new.

    Edit:

    Already seeing people come in to defend these suits. I just see it like this: AI is a tool, much like a computer or a pencil are tools. You can use a computer to copyright infringe all day, just like a pencil can. To me, an AI is only going to be plagiarizing or infringing if you tell it to. How often does AI plagiarize without a user purposefully trying to get it to do so? That’s a genuine question.

    Regardless, the cat’s out of the bag. Multiple LLMs are already out in the wild and more variations are made each week, and there’s no way in hell they’re all going to be reined in. I’d rather AI not exist, personally, as I don’t see protections coming for normal workers over the next decade or two against further evolutions of the technology. But, regardless, good luck to these companies fighting the new Pirate Bay-esque legal wars for the next couple of decades.

    • patatahooligan@lemmy.world
      +16 / -7 · 8 months ago

      Already seeing people come in to defend these suits. I just see it like this: AI is a tool, much like a computer or a pencil are tools. You can use a computer to copyright infringe all day, just like a pencil can. To me, an AI is only going to be plagiarizing or infringing if you tell it to. How often does AI plagiarize without a user purposefully trying to get it to do so? That’s a genuine question.

      You are misrepresenting the issue. The issue here is not whether a tool merely happens to be usable for copyright infringement in the hands of a malicious entity. The issue here is whether LLM outputs are just derivative works of their training data. This is something you cannot compare to tools like pencils and PCs, which are much more general-purpose and which are not built on stolen copyrighted works. Notice also how AI companies bring up “fair use” in their arguments. This means they are not arguing that they aren’t using copyrighted works without permission, nor that the output of the LLM contains no copyrighted part of its training data (they can’t do that, because you can’t trace the flow of data through an LLM), but rather that their use of the works is novel enough to be an exception. And that is a really shaky argument when their services are actually not novel at all. In fact they are designing services that are as close as possible to the services provided by the original work creators.

      • bassomitron@lemmy.world
        +7 / -9 · 8 months ago (edited)

        In fact they are designing services that are as close as possible to the services provided by the original work creators.

        I disagree, and if I’m misrepresenting the issue, then I feel like you’re doing so equally. LLMs can do far more than simply write stories. They can write stories, but that is just one capability among numerous. Can it write stories in the style of GRRM? I suppose, but honestly, doesn’t GRRM also borrow a lot of inspiration from other authors? Any writer claiming to be so unique that they aren’t borrowing from other writers is full of shit.

        I’m not a lawyer or legal expert, I’m just giving a layman’s opinion on a topic. I hope Sam Altman and his merry band get nailed to the wall, I really do. It’s going to be a clusterfuck of endless legal battles for the foreseeable future, especially now that OpenAI isn’t even pretending to be nonprofit anymore.

        • wewbull@feddit.uk
          +10 / -4 · 8 months ago

          This story is about a non-fiction work.

          What is the purpose of a non-fiction work? It’s to give the reader further knowledge on a subject.

          Why does an LLM manufacturer train their model on a non-fiction work? To be able to act as a substitute source of the knowledge.

          End result is that

          • the original is made redundant.
          • the original author is no longer credited.

          So, not only have they stolen their work, they’ve stolen their income and reputation.

          • bassomitron@lemmy.world
            +3 / -7 · 8 months ago (edited)

            If you’re using an LLM as any form of authoritative source (and literally any LLM specifically warns NOT to do that), then you’re going to have a bad time. No one is using them to learn in any serious capacity. Ideally, the AI should absolutely be citing its sources, and if someone is able to figure out how to do that reliably, they’ll be made quite rich, I’d imagine. In my opinion, the fiction writers have a stronger case than non-fiction (I believe the fiction writers’ class action against OpenAI in September is still ongoing).

            • Stoneykins [any]@mander.xyz
              +7 / -3 · 8 months ago

              For someone who claimed to not be a fan of OpenAI, you sure do know all the fan arguments against regulation for AI.

                • Stoneykins [any]@mander.xyz
                  +5 / -4 · 8 months ago

                  I’m not here to argue the finer points, and in general I simply try to aim for the practical actions that lead to better circumstances. I agree with many of your points.

                  This lawsuit won’t fix anything, but it will slow down the progress of OpenAI and their ability to loot culture and content for all its value. I see it as a foot in the door for less economically capable artists and such.

                  Lawsuits are not isolated incidents. The outcome of this will have far reaching impacts on the future of how people’s work is treated in regards to AI and training data.

        • SlopppyEngineer@lemmy.world
          +5 / -3 · 8 months ago

          There’s a big difference between borrowing inspiration and using entire paragraphs of text or images wholesale. If GRRM used entire paragraphs of JK Rowling with just the names changed, and the same cover with a few different colors, you’d have the same fight. An LLM can do the first, but it also does the second.

          The “in the style of” question is a different issue that’s being debated, as style isn’t protected by law. But apparently, if you ask for something in the style of an author, the LLM can get lazy and produce parts of the (copyrighted) source material instead of something original.

          • Blue_Morpho@lemmy.world
            +7 / -2 · 8 months ago

            Just as with the right query you could get an LLM to output a paragraph of copyrighted material, with the right query you can get Google to give you a link to copyrighted material. Does that make all search engines illegal?

            • SlopppyEngineer@lemmy.world
              +3 / -4 · 8 months ago (edited)

              Legally it’s very different. One is a link, the other is content. It’s the same difference as pointing someone to the street where the dealers hang out versus opening your coat and asking how many grams they want.

              • Blue_Morpho@lemmy.world
                +4 / -1 · 8 months ago

                Websites that provide links to copyrighted material are illegal in the US. It’s why torrent sites are taken down and need to be hosted in countries with different copyright laws.

                So Google can be used to pirate, but that’s not its intention. It requires careful queries to get Google to show pirate links. Making a tool illegal because it could be used for unintentional copyright violation makes all search engines illegal.

                It could even make all programming languages illegal. I could use C to write a program to add two numbers or to crawl the web and return illegal movies.

                • SlopppyEngineer@lemmy.world
                  +3 / -2 · 8 months ago

                  Oh. Linking to and even downloading torrents is legal where I live. Hosting and sharing is not. My bad.

                  From how I understand it, the copyright holders want the LLM to do at least the same as what Google does against torrents: check that no part of the source material appears in the output.

    • grue@lemmy.world
      +13 / -6 · 8 months ago

      If I want to be able to argue that having any copyleft stuff in the training dataset makes all the output copyleft – and I do – then I necessarily have to also side with the rich chumps as a matter of consistency. It’s not ideal, but it can’t be helped. ¯\_(ツ)_/¯

      • wewbull@feddit.uk
        +2 · 8 months ago

        In your mind are the publishers the rich chumps, or Microsoft?

        For copyleft to work, copyright needs to be strong.

        • grue@lemmy.world
          +2 / -1 · 8 months ago

          I was just repeating the language the parent commenter used (probably should’ve quoted it in retrospect). In this case, “rich chumps” are George R.R. Martin and other authors suing Microsoft.

        • grue@lemmy.world
          +2 / -2 · 8 months ago (edited)

          No. I really do think that all AI output should be required to be copyleft if there’s any copyleft in the training dataset (edit for clarity: unless there’s also something else with an incompatible license in it, in which case the output isn’t usable at all – but protecting copyleft is the part I care about).

          • General_Effort@lemmy.world
            +2 / -1 · 8 months ago

            Huh. Obviously, you don’t believe that a copyleft license should trump other licenses (or lack thereof). So, what are you hoping this will achieve?

            • grue@lemmy.world
              +2 / -1 · 8 months ago

              Obviously, you don’t believe that a copyleft license should trump other licenses (or lack thereof)

              I’m not sure what you mean. No licenses “trump” any other license; that’s not how it works. You can only make something that’s a derivative work of multiple differently-licensed things if the terms of all the licenses allow it, something the FSF calls “compatibility.” Obviously, a proprietary license can never be compatible with a copyleft one, so what I’m hoping to achieve is a ruling that says any AI whose training dataset included both copyleft and proprietary items has completely legally-unusable output. (And also that any AI whose training dataset includes copyleft items along with permissively-licensed and public domain ones must have its output be copyleft.)

    • LWD@lemm.ee
      +14 / -10 · 8 months ago

      these rich chumps like George R. R. Martin complaining that they felt their data was stolen without their knowledge and profited off of just feels a little ironic.

      I welcome a lawsuit from any content creator who has enough money to put into it. That benefits all content creators, especially the ones that can’t afford lawyers, by protecting them from exploitation by giant corporations.

      Does anybody think, for a moment, that the average person who creates art as a side job, who lives paycheck to paycheck, should be the one to fight massive plagiaristic megacorporations like OpenAI? That the battle between those who create and those who take should be fought on the most uneven grounds possible?

      • Womble@lemmy.world
        +8 / -4 · 8 months ago

        It’s wild to me how so many people seem to have got it into their heads that cheering for the IP laws that corporations fought so hard for is somehow left-wing and sticking up for the little guy.

        • LWD@lemm.ee
          +5 / -2 · 8 months ago

          Can you bring an actual argument to the table, instead of gesturing towards some perceived hypocrisy?

          If your ideology brings you to the same conclusions as libertarian techbros who support the theft of content from the powerless and giving it to the powerful, such as is the case with OpenAI shills, I would say you are not, in fact, a leftist. And if all you can do is indirectly play defense for them, there is no difference between a devil’s advocate and a full-throated techbro evangelist.

          • General_Effort@lemmy.world
            +3 · 8 months ago (edited)

            Just a heads-up: “libertarian” is usually understood, in the American sense, as meaning right-libertarian, including so-called anarcho-capitalists. It’s understood to mean people who believe that the right to own property is absolutely fundamental. Many libertarians don’t believe in intellectual property, but some do. Which is to say that in American parlance, the label “libertarian” would probably include you. Just FYI.

            Also, I don’t know what definition of “left” you are using, but it’s not a common one. Left ideologies typically favor progress, including technological progress. They also tend to be critical of property, and (AFAIK universally) reject forms of property that allow people to draw unearned rents. They tend to side with the wider interests of the public over an individual’s right to property. The grandfather comment is perfectly consistent with left ideology.

          • Womble@lemmy.world
            +5 / -2 · 8 months ago

            And your argument boils down to “Hitler was a vegetarian, therefore all vegetarians are fascists”. IP laws are a huge stifle on human creativity designed to allow corporate entities to capture, control and milk innate human culture for profit. The fact that sometimes some corporate interests end up opposing them when it suits them does not change that.

            • LWD@lemm.ee
              +3 / -2 · 8 months ago

              Okay, we can set your support of cultural appropriation for profit aside for a moment, and talk about the thing I asked you to do earlier: actually provide an argument, rather than gesture at this imagined hypocrisy you are claiming.

              The fact that you can’t do this, and the fact that you paint with a broad brush anyone who does not buy into your libertarian beliefs as a supporter of all copyright law (with zero nuance, of course), demonstrates your own very real hypocrisy.

              • Womble@lemmy.world
                +6 / -1 · 8 months ago (edited)

                I already have:

                IP laws are a huge stifle on human creativity designed to allow corporate entities to capture, control and milk innate human culture for profit

                I thought that was a prima facie reason for why they are bad. And no, I do not believe all copyright law is bad with no nuance, as you would have seen if you’d stalked deeper into my profile rather than just picking one comment you thought you could have fun with.

                • LWD@lemm.ee
                  +1 / -4 · 8 months ago (edited)

                  IP laws are a huge stifle on human creativity

                  Great, now do you have any sources for this? Because in the real world, authors appear to disagree with you.

                  “If my work is just going to get stolen, and if some company’s shareholders are going to get the benefit of my labor and skill without compensating me, I see no reason to continue sharing my work with the public – and a lot of other artists will make the same choice.”
                  - N.K. Jemisin


                  no I do not believe all copyright law is bad with no nuance

                  Then you shouldn’t say things like “IP laws are a huge stifle on human creativity”. In fact, since you don’t believe it, you should edit your comments to say something like “Some IP laws are bad.”

                  Where do you stand on the case of James Somerton and his plagiarism of the works of multiple small queer creators? Is he entitled to their cultural output while bashing the minorities they belong to?

    • FreeFacts@sopuli.xyz
      +8 / -7 · 8 months ago

      I fail to see how seeing snippets of said work returned in a Google summary is any different than ChatGPT or any other LLM doing the same.

      Just because it was available for the public internet doesn’t mean it was available legally. Google has a way to remove it from their index when asked, while it seems that OpenAI has no way to do so (or will to do so).

      • LWD@lemm.ee
        +7 / -1 · 8 months ago

        The SFWA has actually talked about this: when they made their books more accessible, they became easier to scrape.

        “Our authors and readers have been asking for this for a long time,” president and publisher Tom Doherty explained at the time. “They’re a technically sophisticated bunch, and DRM is a constant annoyance to them. It prevents them from using legitimately-purchased e-books in perfectly legal ways, like moving them from one kind of e-reader to another.”

        But DRM-free e-books that circulate online are easy for scrapers to ingest.

        The SFWA submission suggests “Authors who have made their work available in forms free of restrictive technology such as DRM for the benefit of their readers may have especially been taken advantage of.”

    • CosmoNova@lemmy.world
      +9 / -12 · 8 months ago

      I hear those kinds of arguments a lot, though usually from the exact same people who claimed nobody would be convicted of fraud for NFT and crypto scams when those were at their peak. The days of the wild west internet are long over.

      Theft in the digital space is a very real thing in the eyes of the law, especially when it comes to copyright infringement. It’s wild to me how many people seem to think Microsoft will just get a freebie here because they helped pioneer a new technology for personal gain. Copyright holders have a very real case here, and I’d argue even a strong one.

      Even using user data (that they own legally) for machine learning could get them into trouble in some parts of the developed world, because users 10 years ago couldn’t anticipate it being used that way and so couldn’t give their full consent to it.

      • LWD@lemm.ee
        +8 / -4 · 8 months ago

        Bit odd how openly hostile to consent all the fans of OpenAI and other mega-corporations are.

        • theneverfox@pawb.social
          +1 · 8 months ago

          Personally, I think public info is fair game - consent or not, it’s public. They’re not sharing the source material, and the goal was never plagiarism. There was a period where it became coherent enough to get very close to plagiarism, but it’s been moving past that phase very quickly

          Microsoft, especially with how they scraped private GitHub repos (and the things I’m sure Google and Facebook just haven’t gotten caught doing with private data) is way over the line for me. But I see that more as being bad stewards of private data - they shouldn’t be looking at it, their AI shouldn’t be looking at it, the public shouldn’t be able to see it, and they probably failed on all counts

          Granted, I think copyright is a bullshit system. Normal people don’t get any protection, because you need to pay to play. Being unable to defend it means you lose it, and in most situations you’re going to spend way more on legal costs than you could possibly get back.

          I also think the most important thing is that this tech is spread everywhere, because we can’t have one group in charge of the miracle technology… It’s too powerful.

          Google has all the data they could need, they’ve bullied the web into submission… They don’t have to worry about copyright, they control the largest ad network and dominate search (at least for now).

          It sucks that you can take any artist’s visual work and fine-tune a network to replicate endless rough facsimiles in a few days. I genuinely get how that must feel violating.

          But they’re going to be screwed when the corporate work dries up in favor of a much cheaper option, and they’re going to have to deal with the flood of AI work… Copyright won’t help them; it’s too late for it to even slow things down.

          If companies did something wrong, have it out in court. My concern is that they’re going to pass laws on this that claim it’s for the artists, but effectively gatekeep AI to tech giants

      • General_Effort@lemmy.world
        +3 · 8 months ago

        Even using user data (that they own legally) for machine learning could get them into trouble in some parts of the developed world, because users 10 years ago couldn’t anticipate it being used that way and so couldn’t give their full consent to it.

        Where, for example?

  • Melllvar@startrek.website
    +10 / -6 · 8 months ago
    8 months ago

    If it’s not infringement to input copyrighted materials, then it’s not infringement to take the output.

    Copyright can be enforced at both ends or neither end, not one or the other.

      • Melllvar@startrek.website
        +4 / -7 · 8 months ago

        A better question is: Why not?

        If Copyright doesn’t protect what goes in, why should it protect what comes out?

        • Patches@sh.itjust.works
          +3 / -4 · 8 months ago (edited)

          If I read a book - it is not punishable by anyone right now.

          If I write that book down word for word, and put my name as the author - it’s illegal for me, and it should be for AI.

          What is hard to understand here?

          Would you prefer that it be either legal to do both, or illegal to do both?

        • Adanisi@lemmy.zip
          +0 / -1 · 8 months ago (edited)

          Because sometimes it spits it out verbatim, and in the case of Copilot, sometimes GPLed code gets spat out.

          See: the time Copilot spat out the Quake inverse square root algorithm, comments and all.
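          For reference, the routine being discussed is the famous fast inverse square root from the released (GPLed) Quake III Arena source. A sketch of it is below; the original used a pointer cast for the bit trick, which is swapped here for memcpy (the well-defined way to do the same reinterpretation), but the magic constant and structure are as in the widely circulated original:

          ```c
          #include <string.h>
          #include <stdint.h>

          /* Sketch of Quake III's Q_rsqrt: approximates 1/sqrt(number). */
          float Q_rsqrt(float number)
          {
              const float threehalfs = 1.5F;
              float x2 = number * 0.5F;
              float y  = number;
              uint32_t i;

              memcpy(&i, &y, sizeof i);            /* reinterpret float bits as an integer */
              i = 0x5f3759df - (i >> 1);           /* the famous magic constant */
              memcpy(&y, &i, sizeof y);
              y = y * (threehalfs - (x2 * y * y)); /* one Newton-Raphson refinement step */
              return y;
          }
          ```

          The point being: this exact function, distinctive constant and original comments included, is what Copilot reportedly reproduced, which is hard to explain as anything other than regurgitated training data.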

          Also, if it’s legal to disregard libre/open source licenses for this, then why isn’t it legal for me to look at leaked code, which I also do not have permission to use, and use the knowledge gained from that to write something else?

          • Melllvar@startrek.website
            +1 · 8 months ago

            Which is exactly why the output of an AI trained on copyrighted inputs should not be copyrightable. It should not become the private property of whichever company owns the language model. That would be bad for a lot more reasons than the potential for laundering open source code.