@sag Jeeeeeesus now I’m scared to click it, what if it’s really in Typescript?
@dandi8 the license of Adobe Photoshop is not open-source because it specifically restricts reverse-engineering and modifications, and a lot of other things. The license of Mistral Nemo IS open-source, because it’s Apache 2.0: you are free to use it, study it, redistribute it, … Open-source doesn’t say anything about giving you all the tools to re-create it, because that would mean they would need to give you the GPU time. “Open-source” simply means something different from what you think.
> E.g, Mistral Nemo can’t be considered open source, because there is no Mistral Nemo without the training data set.
Right here - that’s your logical conflict. By downloading the model file, you can run it, thereby you can “have Mistral Nemo” even without having the training data, contradicting your statement -> your statement is invalid.
@dandi8 I’m not changing the definition of open-source. And I’m not saying models are magic. Please take your strawmen back. You are the one saying that the dataset is source code, and you have no backing for this argument. I agree that the dataset is the “source for training”, but that doesn’t make it “source code” as per the open-source licenses. And the tools are not the compiler. Just because something was created from something else, that doesn’t turn it into “source code”.
@dandi8 surprise surprise, LLMs are not classic compiled software, in case you haven’t noticed yet. You can’t just transfer the same notions between these two. That’s like wondering why quantum physics doesn’t work the same as agriculture.
Think of it as a database. If you have an open-source social network, all tools and code are published and free to use, but the value of the network is in the posts, the accounts, the people who keep coming back. The data in the database is not the source code.
@dandi8 But the proof is in your quote. Open source is a license which allows people to study the source code. The source code of a model is a bunch of float numbers, and you can study it as much as you want in Mixtral and others. Clearly a model can be published without the dataset (Mixtral), and also a model can be closed, hosted, unavailable for study (OpenAI). I think you need to find some argument showing how “source code” of a model = the dataset. It just isn’t so.
> The training data set is a vital part of the source code because without it, the rest of it is useless.
This is simply false. The dataset is not the “source code” of a model. You need to delete this notion from your brain. A model is not the same as a compiled binary.
@dandi8 but you are the one who is changing it. And who said it’s not feasible? Mixtral model is open-source. WizardLM2 is open-source. Phi3:mini is open-source… what’s your point?
But the license of the model is not related to the license of the data used for training, nor the license for the scripts and libraries. Those are three separate things.
@sunstoned @Ephera That’s nonsense. You could write the scripts, collect the data, and publish it all, but without the months of GPU training you wouldn’t have the trained model, so it would all be worthless. The code used to train all the proprietary models is already open-source: it’s things like PyTorch, TensorFlow etc. For a model to be open-source means you can download the weights and you are allowed to use them as you please, including modifying them and publishing again. It’s not about the dataset.
@astro_ray @marvelous_coyote It seems you have an incorrect idea of what open-source means, which is quite sad here in the open-source Lemmy community. Being trained on public domain material does NOT make the model open-source. It’s about the license, that is, what the recipients of the model are allowed to do with it. Open-source must allow derivative works and commercial use, on top of seeing the code, but for LLMs the “code” is just a bunch of float numbers, nothing interesting to see.
@thingsiplay Kiwix was amazing for me while traveling, because I could browse Wikivoyage offline in a bus or plane and plan my next move.
GDPR applies only to people (even non-EU citizens) who “live” in the territory of the EU. EU citizens who leave don’t have the GDPR protection anymore. There was an affair last year when Google started notifying people about transferring their account data to non-EU datacenters after it detected them connecting from a foreign IP when they went on holiday to Thailand for a month. So clearly you have some misunderstandings of GDPR. Also GDPR prevents selling stuff??
@mihor oh oh, someone drank too much russian kool-aid
@danielquinn @Tomkoid That might change very quickly after GitLab finds a buyer.
@gomp Yes, but the point is that it comes from a different place and a different time, so for you to execute a compromised program, it would have to be compromised for a prolonged time without anyone else noticing. You are protected by the crowd. With curl|sh you are not protected from this at all.
@gomp You mean, as seldom available as every apt install ever? https://superuser.com/a/990153
@gomp Why would you be taking the signature from the same website? Ever heard of PGP key servers?
@gomp try comparing it with apt install, not with downloading a .deb file from a random website - that is obviously also very insecure. But the main thing curl|sh will never have is verifying the signature of the downloaded file - what if the server got compromised, and someone simply replaced it? You want to make sure that it comes from the actual author (you still need to trust the author, but that’s a given, since you are running their code). Even a signed tarball is better than curl|sh.
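To make the "what if someone simply replaced it" point concrete, here is a rough sketch of checking a download before running it. Big caveat: this compares a SHA-256 digest obtained from a separate channel, which is a simpler stand-in and NOT the same as verifying a PGP signature. The script contents, the digest and the function names are all made up for illustration.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def is_untampered(downloaded: bytes, trusted_digest: str) -> bool:
    """Check a download against a digest obtained from a separate channel."""
    actual = sha256_hex(downloaded)
    # compare_digest does a constant-time comparison of the two hex strings
    return hmac.compare_digest(actual, trusted_digest.strip().lower())


# Hypothetical installer script, and the digest its author published elsewhere.
original_script = b"#!/bin/sh\necho 'installing'\n"
published_digest = sha256_hex(original_script)

# What a compromised server might serve in place of the original.
tampered_script = b"#!/bin/sh\necho 'something else entirely'\n"

print(is_untampered(original_script, published_digest))
print(is_untampered(tampered_script, published_digest))
```

The check only helps if the digest comes from somewhere other than the server you downloaded from, which is the same reason you fetch a PGP key from a key server and not from the same website.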
@Sethayy cool, go ahead. But still nobody made that take, so … you are arguing with the wind.
@Untold1707 As opposed to the hardware requirements of Windows, which forces you to buy a new computer for every new Windows version just because?