• the_doktor@lemmy.zip
    5 months ago

    A truly complete training dataset must include copyrighted material; without it, you do not have a full set of data to draw from, and the dataset is useless. Stop defending this horrible technology.

    • sweng@programming.dev
      5 months ago

      What do you mean by “full set of data”?

      Obviously you cannot train on 100% of the material ever created, so you pick a subset. There is a lot of permissively licensed content (e.g. Wikipedia) and content you can license (e.g. Reddit). While that is not sufficient for an advanced LLM, it certainly is for smaller models that do not need wide knowledge.