ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

  • fubo@lemmy.world
    link
    fedilink
    English
    arrow-up
    8
    ·
    11 months ago

    It doesn’t have to have a copy of all copyrighted works it trained from in order to violate copyright law, just a single one.

    Sure, which would create liability to that one work’s copyright owner; not to every author. Each violation has to be independently shown: it’s not enough to say “well, it recited Harry Potter so therefore it knows Star Wars too;” it has to be separately shown to recite Star Wars.

    It’s not surprising that some works can be recited; just as it’s not surprising for a person to remember the full text of some poem they read in school. However, it would be very surprising if all works from the training data can be recited this way, just as it’s surprising if someone remembers every poem they ever read.