I found this link aggregator that someone made for a personal project and they had an exciting idea for a sorting algorithm whose basic principle is the following:
- Upvotes show you more links from other people who have upvoted that content
- Downvotes show you fewer links from other people who have upvoted that content
I thought the idea was interesting and wondered if something similar could be implemented in the fediverse.
They currently don’t have plans to open-source their work, which is fine, but I think it shouldn’t be too hard to replicate something similar here, right?
They have a guest mode where you can try this out without signing in, and even there it seems to be giving me relevant content after upvoting only 3 times.
There is more information on their website if you guys are interested.
Edit: Changed title to something more informative.
No, not as simply as that. That’s the basic idea behind the recommendation systems that were common in the 1990s. The algorithm requires a tremendous amount of dimensionality reduction to work at scale. As described, it would need a trillion weights to compare the preferences of a million users to a million other users. If you reduce it to a standard 100 to 1000 or so dimensions of preference it becomes feasible, but at the low end it only contains about as much information as your own choices of subscribed or blocked communities (though it obviously has a much lower barrier to entry).
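To put rough numbers on that, here is a minimal sketch of the arithmetic. The function names and the plain dot-product affinity are assumptions for illustration only; real recommenders learn their vectors in ways not shown here.

```python
def pairwise_weights(users):
    # One weight for every (user, other user) pair.
    return users * users


def embedded_weights(users, dims):
    # One short preference vector per user instead of one weight per pair.
    return users * dims


def affinity(a, b):
    # Assumed similarity measure: dot product of two preference vectors.
    return sum(x * y for x, y in zip(a, b))


users = 1_000_000
print(f"pairwise: {pairwise_weights(users):,} weights")
for dims in (100, 1000):
    print(f"{dims} dims: {embedded_weights(users, dims):,} weights")
```

With vectors like these, comparing two users means comparing two short lists of numbers rather than looking up a stored pairwise weight.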
There’s another important aspect of learning that the simple description leaves out, which is exploration. It will quickly start showing you things you reliably like, but won’t experiment with things it doesn’t know you’d like or not to find out.
There’s another important aspect of learning that the simple description leaves out, which is exploration. It will quickly start showing you things you reliably like, but won’t experiment with things it doesn’t know you’d like or not to find out.
Why would this be the case? It shows you stuff liked by people who like similar things to you, but people have diverse interests. Wouldn’t the people who like one thing probably also like other things you hadn’t known about, and wouldn’t that lead to a form of guided exploration?
There are two problems. The first is that those other things you might like will be rated lower than things you appear to certainly like. That’s the “easy” problem, and it has solutions where a learning agent is forced, to some degree, to prefer exploring new options over sticking to its known preferences. It becomes difficult, though, when you no longer know what is explored or unexplored, either because of some abstraction like dimension reduction or because of a practical limitation: a human can’t explore all of Lemmy like a robot in a maze.
The second is that you might have preferences that tend to be disliked by the other people who share the tastes you’ve already indicated. For example, there may be some people who like both Boba and Coffee, while most people who like one tend to dislike the other. If you happen to encounter Boba first then Coffee will be predicted to be disliked based on the overall preferences of people who agree with your Boba preference.
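One common way to force some exploration for the first problem is an epsilon-greedy rule. The sketch below is only an illustration of that general technique, not something the aggregator is known to use; `pick_with_exploration`, `predicted`, and the default `epsilon` are hypothetical.

```python
import random


def pick_with_exploration(predicted, epsilon=0.1, rng=random):
    """Pick one item from `predicted`, a non-empty dict of item -> predicted score.

    With probability `epsilon` a random item is returned (exploration);
    otherwise the item with the highest predicted score is returned.
    """
    if rng.random() < epsilon:
        return rng.choice(list(predicted))
    return max(predicted, key=predicted.get)


scores = {"boba": 0.9, "coffee": 0.2, "tea": 0.5}
print(pick_with_exploration(scores, epsilon=0.2))
```

Note that this only helps when the system can still tell which items are unexplored; it does nothing for the second problem.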
If you happen to encounter Boba first then Coffee will be predicted to be disliked based on the overall preferences of people who agree with your Boba preference.
With this specific algorithm, I don’t necessarily think that would be the case. It only shows you fewer links from people who like the links that you dislike. It doesn’t show you fewer links based on what people who are like you dislike, which is what you seem to be describing.
Also, it doesn’t have to be this specific algorithm that we implement but I thought the idea was unique so I thought I’d share it anyway.
It seems to be working well enough for me now so I plan to keep using it and see what it’s like.
Whether or not you use downvotes doesn’t really matter.
If what you like is well represented by the Boba drinkers and the Boba drinkers disproportionally don’t like Coffee, then Coffee will be disproportionally excluded from the top of your results. Unless you explore deeper, the Coffee results will be pushed to the bottom of your results. And any that happen to come to the top will have arrived there from broad appeal and will contribute very little to the system thinking you like Coffee.
If you don’t let the math effectively push away things that are disliked by the people who like similar things to you, then everything will saturate at maximum appeal and the whole system does nothing.
Based on these criteria:
- Upvotes show you more links from other people who have upvoted that content
- Downvotes show you fewer links from other people who have upvoted that content
That sounds a lot like you would need to keep track of the vote weights of every user for every other user.
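As a rough illustration of what that bookkeeping could look like, here is a minimal sketch. It is a guess at the two rules above, not the aggregator's actual code (which isn't public); the class name, the +1/-1 step, and the one-directional update are all assumptions.

```python
from collections import defaultdict


class PairwiseVoteRanker:
    """Hypothetical sketch: one weight per (viewer, other user) pair."""

    def __init__(self):
        self.upvoters = defaultdict(set)   # link -> users who upvoted it
        self.weights = defaultdict(float)  # (viewer, other user) -> affinity

    def vote(self, user, link, up):
        # Assumed step size: +1 for an upvote, -1 for a downvote.
        delta = 1.0 if up else -1.0
        # Shift the voter's affinity toward (or away from) everyone who
        # already upvoted this link. Simplification: earlier voters'
        # weights toward this voter are left untouched.
        for other in self.upvoters[link]:
            if other != user:
                self.weights[(user, other)] += delta
        if up:
            self.upvoters[link].add(user)

    def score(self, viewer, link):
        # Sum of the viewer's affinity for each user who upvoted the link.
        return sum(self.weights.get((viewer, u), 0.0)
                   for u in self.upvoters[link] if u != viewer)

    def rank(self, viewer, links):
        return sorted(links, key=lambda l: self.score(viewer, l), reverse=True)


ranker = PairwiseVoteRanker()
ranker.vote("alice", "a", True)
ranker.vote("alice", "b", True)
ranker.vote("carol", "c", True)
ranker.vote("bob", "a", True)          # bob agrees with alice on "a"
print(ranker.rank("bob", ["c", "b"]))  # alice's other link "b" ranks first
```

The `weights` dictionary is the part that grows with the number of user pairs.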
With 100 users, that would require tracking and updating 10000 values. That seems quite manageable!
There are about 39863 active Lemmy users, according to Fediverse.observer. That means keeping track of (up to) 1,589,058,769 weights (assuming you use f16 representations, that’s over 3GB of data). For every single upvote and downvote.
Would this be possible to implement? Yeah, for sure! Would this be practical? Not really, no! This algorithm is O(n²) in terms of data storage, and that simply won’t do.
Imagine if Lemmy were to gain popularity and all of its inactive accounts came back. We’re now up to 2 million users. For busy servers (.ml, world) that means tracking 4 trillion variables for every single interaction (that’d be 8TB of data).
Did I say 2 million? Threads.net has 140 million users. Foursquare has 55 million. The “classic” fediverse has a mere 14 million users (active + inactive). Combined, that’s 209 million users, or 4.3681×10¹⁶ weights to update for each upvote.
Now, perhaps there are ways to do this more efficiently. For example, users’ devices could track these numbers, so exposing the upvote data would allow the end user’s device to “only” track a couple million data points in their browser and on their devices, and then locally calculate the score of batches of a couple hundred posts collected based on some kind of heuristic. Sorting would take a while and drain your phone’s battery, but it wouldn’t kill the server. With “just” a couple hundred megabytes of data transfer every time the browser gets refreshed, and the device’s GPU/AI accelerator chip running at full blast for a while, you could use this algorithm.
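The figures above can be reproduced with a few lines of arithmetic. This sketch assumes a dense n × n table at 2 bytes per weight (f16), as in the estimate above; the helper name `pairwise_storage` is made up for illustration.

```python
def pairwise_storage(users, bytes_per_weight=2):
    # Dense table: one weight for every (user, other user) pair.
    weights = users * users
    return weights, weights * bytes_per_weight


scenarios = [
    ("active Lemmy users", 39_863),
    ("all Lemmy accounts", 2_000_000),
    ("Threads + Foursquare + classic fediverse",
     140_000_000 + 55_000_000 + 14_000_000),
]

for label, users in scenarios:
    weights, size = pairwise_storage(users)
    # Decimal units: 1 GB = 10**9 bytes.
    print(f"{label}: {users:,} users, {weights:,} weights, {size / 1e9:,.1f} GB")
```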
The idea is enticing, but it doesn’t scale well.
You’d also turn Lemmy into the strongest echo chamber you could possibly create. I’m not sure if that’s what we want to do. If that’s your goal, you should consider moving to Facebook or Threads, maybe?
Storing it as a sparse graph should reduce the storage requirements drastically, since most edges wouldn’t exist.
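A minimal sketch of what sparse storage might look like, assuming the same +1/-1 update as the naive description and a made-up `(user, link, up)` vote format: only pairs that have actually interacted through a shared link get an entry.

```python
from collections import defaultdict


def build_sparse_weights(votes):
    """Build viewer -> {other user: weight} from (user, link, up) tuples.

    Pairs that never voted on a common link are simply absent.
    """
    upvoters = defaultdict(set)   # link -> users who upvoted it
    weights = defaultdict(dict)   # viewer -> {other user: weight}
    for user, link, up in votes:
        delta = 1.0 if up else -1.0
        for other in upvoters[link]:
            if other != user:
                weights[user][other] = weights[user].get(other, 0.0) + delta
        if up:
            upvoters[link].add(user)
    return weights


def edge_count(weights):
    # Number of stored (viewer, other user) pairs.
    return sum(len(row) for row in weights.values())


votes = [
    ("alice", "a", True),
    ("bob", "a", True),
    ("carol", "c", True),
    ("dave", "a", False),
]
sparse = build_sparse_weights(votes)
print(edge_count(sparse), "stored edges vs", 4 * 4, "dense entries")
```

How much this saves in practice depends on how many users share votes on popular links.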
Probably; we’d need to analyze the statistics for a good overview. However, just upvoting one super popular post once would cause a huge spike in graph growth across the board. Malicious users could also mass upvote to take down other servers by generating random usernames and upvoting every post from as many accounts as possible.
Even with reduced storage requirements, the processing power required to keep this graph up to date would be quite significant. I don’t think it’s feasible for servers larger than the “private server for me and my friends” type that usually have some leftover CPU resources.
you should consider moving to Facebook or Threads, maybe?
Not an option
As for the rest, yeah, those do seem like genuine obstacles. I think part of the reason I liked the algorithm is that it reminded me of the Web of Trust that things like Scuttlebutt use to get relevant information to users, but with a lower barrier to entry.
Also as I’ve said elsewhere it doesn’t have to be this exact thing but since this is a new platform we have the chance to make algorithms that work for us and are transparent so I wanted to share examples that I thought were worthwhile.
Edit:
You’d also turn Lemmy into the strongest echo chamber you could possibly create.
PS. I don’t think that’s true. Big tech companies that have more advanced algorithms would probably be much better at creating echo chambers.
Also as I’ve said elsewhere it doesn’t have to be this exact thing but since this is a new platform we have the chance to make algorithms that work for us and are transparent so I wanted to share examples that I thought were worthwhile.
I agree with that. I’m glad Lemmy added the “scaled” algorithm to give posts from smaller instances a chance, and I think the algorithms will be tweaked further in the future. I think this particular example has too many downsides, but there’s no doubt there are better sorting algorithms out there, especially for platforms with a general lack of content like Lemmy seems to have.
Big tech companies that have more advanced algorithms would probably be much better at creating echo chambers.
All your proposed algorithm does is increase the likelihood of seeing things you like and decrease the likelihood of seeing things you dislike. You can rely on big tech companies to at least try to introduce some variety so they can serve you more lucrative ads.
One of the ways Facebook and Youtube trap you is by generating engagement. The best way to do that is to make you mad. A sprinkle of dissenting ideas in your echo chamber will have you foaming at the mouth at the “bad types”. The algorithm’s goal itself may not be good or ethical, but at least it detracts from the echo chamber.
Through the complete lack of an algorithm, I find Mastodon to be a much stronger echo chamber than its corporate alternatives. You don’t get to see things you’re not interested in (which is a good concept) but you also end up creating an experience tailored to your world view. I think user control should be at the forefront of this type of software, but we should avoid reinforcing this mechanism where we can.
It might be easier to give posts tags and weights, have upvoting and downvoting change a user’s own tags and weights, and maybe sort new content by closeness to the user’s vector in that space.
That way you aren’t having to track every event but instead having events update the objects values.
That would be my thought at least. Though I would think it would be best if users could adjust the effect on a sliding scale. As in, let the user determine how “aligned” they want the posts they see to be with them.
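Here is a minimal sketch of how that could work, under several assumptions: posts already carry tag weights, votes nudge the user's profile by a fixed `rate`, closeness is cosine similarity, and the slider blends similarity with a hypothetical `recency` value between 0 and 1. None of these names come from Lemmy.

```python
import math


def cosine(a, b):
    # Cosine similarity between two sparse tag -> weight dicts.
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def apply_vote(profile, post_tags, up, rate=0.1):
    # Nudge the user's profile toward (upvote) or away from (downvote) the post.
    sign = 1.0 if up else -1.0
    for tag, weight in post_tags.items():
        profile[tag] = profile.get(tag, 0.0) + sign * rate * weight


def rank(profile, posts, alignment=1.0):
    """Rank post ids; `posts` maps post id -> (tag dict, recency in [0, 1]).

    alignment=1.0 sorts purely by similarity to the profile,
    alignment=0.0 sorts purely by recency.
    """
    def score(item):
        tags, recency = item[1]
        return alignment * cosine(profile, tags) + (1.0 - alignment) * recency
    return [post_id for post_id, _ in sorted(posts.items(), key=score, reverse=True)]


profile = {}
apply_vote(profile, {"linux": 1.0}, up=True)
posts = {
    "old_linux_post": ({"linux": 1.0}, 0.1),
    "new_cat_post": ({"cats": 1.0}, 0.9),
}
print(rank(profile, posts, alignment=1.0))
print(rank(profile, posts, alignment=0.0))
```

Each vote only touches one user's profile, so nothing has to be stored per pair of users.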
easier to have posts be given tags
If this is not done automatically by the server analysing the content, people will not use tags, or will use irrelevant tags, or will stuff posts with tens of tags like in Instagram’s early days, or do whatever else I cannot think of now. I don’t think it would be easy to make this work as intended.
Admittedly, I’m currently studying vector databases for retrieval augmented generation (RAG) AI, so it could just be my mind seeing a nail for that hammer. But it seems like vector search between a user and posts, instead of between a query and documents, might work.
Sounds interesting.
Do not want. It puts everyone into their own little echo chamber. What we have is already bad enough. Just sort chronological for everyone.
Interesting idea, but it sounds computationally expensive which could pose problems. Smaller, hardware-restricted servers would be less viable
Now that is a fascinating idea. I really like it
That way people can get content relevant to them. Genius.