It’s a very complicated compression algorithm.
It’s glorified autocorrect (/predictive text).
People fight me on this every time I say it but it’s literally doing the same thing just with much further lookbehind.
In fact, there’s probably a paper to be written about how LLMs are just lossily compressed Markov chains.
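For anyone who wants to poke at the "same thing, longer lookbehind" idea, here is a minimal sketch of Markov-chain predictive text where the lookbehind (`order`) is a parameter. The toy corpus and the names `build_model`, `predict_next` and `generate` are made up for illustration, and it makes no claim about how any real LLM works internally; it only shows what a plain Markov chain does as the context gets longer.

```python
import random
from collections import Counter, defaultdict


def build_model(words, order):
    """Map each run of `order` consecutive words to counts of the word that follows it."""
    model = defaultdict(Counter)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        model[context][words[i + order]] += 1
    return model


def predict_next(model, context):
    """Return the most frequent follower of `context`, or None if it was never seen."""
    followers = model.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(model, seed, length, rng=None):
    """Extend `seed` by sampling followers in proportion to how often they were seen."""
    rng = rng or random.Random(0)
    out = list(seed)
    order = len(seed)
    for _ in range(length):
        followers = model.get(tuple(out[-order:]))
        if not followers:
            break  # unseen context: a plain Markov chain has nothing to say
        candidates, counts = zip(*followers.items())
        out.append(rng.choices(candidates, weights=counts, k=1)[0])
    return out


# Hypothetical toy corpus, just to have something to count.
corpus = "the cat sat on the mat and the cat ate the fish".split()
model1 = build_model(corpus, order=1)  # one word of lookbehind
model2 = build_model(corpus, order=2)  # two words of lookbehind

print(predict_next(model1, ["the"]))
print(predict_next(model2, ["on", "the"]))
print(" ".join(generate(model2, ["the", "cat"], length=5)))
```

Note that the table has one entry per context actually seen, so it grows fast as `order` goes up, and an unseen context gives you nothing at all. That is roughly where the "lossily compressed" part of the claim would have to come in.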
Kinda. But like, a compression algorithm that isn’t all that good at exact decompression. It’s really good at outputting text that makes you think “wow, that sounds pretty similar to what a person might write”. So even if it’s entirely wrong about something, that’s fine, as long as you’d look at it and be satisfied its answer sounded right.
It stores the shape of the information, not the information itself.
Which might be useful from a statistics and analytics viewpoint, but isn’t very practical as an information storage mechanism.
As you can learn from reading the article, they do also store the information itself.
They learn and store a compression algorithm that fits the data, then use it to store that data. The first part is not new: AI and compression theory go back decades. What’s new and surprising is that you can get the original work back out of attention transformers. Even in traditional overfit models that isn’t a given. And attention transformers shine at generality, so it’s not evident that they should do this, but all models tested do it, so maybe it is even necessary?
Storing data isn’t a theoretical failure; some very useful AI algorithms do it by design. It’s a legal and ethical failure because OpenAI etc. have been claiming from the beginning that this isn’t happening, and it also provides proof of the pirated work the models were trained on.
The images in the article clearly show that they’re not storing the data. They’re storing enough information about the data to reconstruct a rough and mostly useless approximation of it. And they do so in such a way that the information about one piece of data can be combined with the information about another to produce a rough and mostly useless approximation of a combination of the two, which was not in the original dataset.
It’s like playing a telephone game with a description of an image, with the last person drawing the result.
The legal and ethical failure is in commercially using artists’ works (as training data) without permission, not in storing or even reproducing them, since the slop they produce is evidently an approximation and not the real thing.
The law disagrees. Compression has never been a valid argument. A crunchy 360p rip of a movie is a mostly useless approximation but sharing it is definitely illegal.
Fun fact: you can use MPEG as a very decent perceptual image comparison algorithm (e.g. for facial recognition) by encoding the two images as a two-frame video and looking at the file size. This works mostly for the same theoretical reasons as neural-network-based methods. Of course, MPEG was built by humans using legally obtained videos for evaluation, but it works without being able to reproduce any of those at all. So being able to reproduce the source material is not a requirement for compression.
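The file-size trick can be sketched without a video encoder. The snippet below swaps MPEG for `zlib` and images for byte strings, so it is an analogy rather than the method described above: if compressing two inputs together costs barely more than compressing one of them, they share a lot of structure. The helper names and the sample data are made up for illustration; the score is the usual normalized compression distance.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: lower means more shared structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample inputs: two near-identical texts and one block of noise.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = a.replace(b"dog", b"cat")
noise = bytes(random.Random(0).randrange(256) for _ in range(len(a)))

print("similar pair:  ", round(ncd(a, b), 3))
print("unrelated pair:", round(ncd(a, noise), 3))
```

The compressor never has to hand back any of the data it was tuned on for this to work; it only has to be good at spotting redundancy between the two inputs.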
That’s what I’ve been arguing for years. It’s not so different from printing out frames of a movie, then scanning them again and claiming it’s a completely new art piece. Everything has been altered so much it’s completely different. However, it’s still very much recognizable, with extremely little personal expression involved.
Oh, but you chose the paper and the printer, so it’s definitely your completely unique work, right? No, of course not.
AI works pretty much the same. You can tell what protected material the LLM was fed by the output of a given prompt. The theft already happened when the model was trained and it’s not that hard to prove, really.
AI companies get away with the biggest heist in human history by being overwhelming, not by being something completely new and unregulated. These things are already regulated; the regulations are just being ignored. They have big tech and therefore politics backing them up, but definitely not the written law of any country that protects intellectual property.
Complex predictive text arranged into what is essentially a Rorschach test.