Researchers published a massive database of more than 2 billion Discord messages that they say they scraped using Discord’s public API. The data was pulled from 3,167 servers and covers posts made between 2015 and 2024, the entire time Discord has been active.
Though the researchers claim they’ve anonymized the data, it’s hard to imagine anyone is comfortable with almost a decade of their Discord messages sitting in a public JSON file online. Separately, a different programmer released a Discord tool called “Searchcord” based on a different data set that shows non-anonymized chat histories.
Probably our only chance to find solutions to problems with open source software that uses Discord as its forum.
Seriously. It’s beyond painful when some open source project only uses Discord for communication. You have to hope that you post your question at a time when the right people are online, and that there’s not a more interesting conversation going on, otherwise it just gets lost. Index that whole dataset.
Index that whole dataset
I’ve seen a few projects doing just that with answeroverflow.com and they have come up in my web searches. Not really a solution but at least a stopgap.
Given some similar issues, why is it some projects still use IRC then?
there’s a difference between using irc for live troubleshooting and not having a forum at all and directing everyone to your live chat discord. i’m sure some sicko out there has run an OSS project on only IRC, but their project likely got no traction because a history of problem-solving posts is important in open source. generally speaking, you need:
- a wiki
- a static indexable searchable forum
- a live chat place for real time communication for novel problems
too many projects these days only have that last one in the form of discord
I spent nearly three hours today between discord and matrix trying to figure out how to get these two pieces of software to talk using a certain protocol.
Imagine if there were online indexable platforms where people could publish this information so it’s easily accessible rather than having to scour through message logs hoping to find the right keywords. Such a technology surely doesn’t exist already, right?
I hate discord.
I don’t hate Discord, I simply hate that so many projects and companies have unanimously decided to use it as the wrong tool for the wrong job.
It’s fine for its intended use case, which is bickering with my friends about video games and fiction, and spamming each other with .gifs and meme images.
Discord is genuinely a great tool for what I used to use Skype for. Talking to my friends, and sharing dumb memes with them in a groupchat format. Companies need to learn that using it as a forum, a Q&A service, a wiki or any other information sharing purpose, is simply fucking retarded.
Language.
We can’t say the F-word on Lemmy?
That’s retarded!
So basically discord finally got a usable search. I count that as a win.
“Anonymized”, sure. I highly doubt they read every message. I’m sure there is lots of de-anonymizing information in the messages themselves.
For example:
Anon1: “hey jeff, wanna play Minecraft?”
Anon2: “sure”
Thus we know Anon2’s name is Jeff. I imagine there’s a lot of this.
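To make that concrete, here is a rough Python sketch of the same idea. It assumes the anonymization only swaps author handles for pseudonyms and leaves message text alone; I don't know what the researchers actually did. The handles, the `pseudonymize` and `guess_names` helpers, and the greeting heuristic are all made up for illustration.

```python
import re

def pseudonymize(messages):
    """Replace each author handle with a stable pseudonym; message text is untouched."""
    aliases = {}
    out = []
    for author, text in messages:
        # First new author becomes Anon1, the second Anon2, and so on.
        alias = aliases.setdefault(author, f"Anon{len(aliases) + 1}")
        out.append((alias, text))
    return out

# Hypothetical heuristic: a greeting followed by a word is probably a name.
GREETING = re.compile(r"^(?:hey|hi|hello|yo)\s+([A-Za-z]+)\b", re.IGNORECASE)

def guess_names(anon_messages):
    """Guess that a name used in a greeting belongs to whoever replies next."""
    guesses = {}
    for (speaker, text), (next_speaker, _) in zip(anon_messages, anon_messages[1:]):
        match = GREETING.match(text)
        if match and next_speaker != speaker:
            guesses[next_speaker] = match.group(1).capitalize()
    return guesses

# Made-up handles, same conversation as above.
chat = [
    ("alice#1234", "hey jeff, wanna play Minecraft?"),
    ("jeff_the_miner", "sure"),
]
anon = pseudonymize(chat)
print(anon)
print(guess_names(anon))
```

The handles are gone from `anon`, but the guess still maps Anon2 to Jeff, because the name was in the message body the whole time.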
Shit. My name is Jeff. Now they know
If they were on OPEN servers, I doubt they cared that much.
I was hoping people would do this!!!
Well yeah, it’s not encrypted. It would be the same as 10 years of Reddit posts or Lemmy posts scraped
There’s literally no difference. Each Discord server is like a tiny chunk of Reddit. If anyone expected any privacy on these servers, they’re nuts.
This isn’t even them scraping private chats and small servers, they just scraped public servers in the discovery tab. None of that information was ever private, and every user can browse the chat history there.
I was hoping to play around with the dataset over the weekend to toy with some text-embedding techniques, but they’ve pulled the cord on the download links.
Anyone have a copy of the full archive they’re willing to share, or a magnet link?
I see a lot of drama here in the thread, people decrying data leaks, how Discord is very very bad, and a number of people wanting the “good old days” of forums.
Yes. I like forums too, but, uh…
These researchers scraped publicly posted messages. Keyword here being “public”. How would anything similarly public, like a forum, be better?
I actually remember the times when forums were at their peak. I hung out on BZPower for Bionicle things, and the Relic News Forum for Homeworld modding. You know what they had? Google bots that scraped messages, looked for certain words, and populated websites with advertisements based on what it could scrape from forums.
Pretty sure Lemmy doesn’t do encryption either, unless there’s some very special, private Lemmy server that nobody has access to. So the researchers could’ve just as well scraped the fediverse.
“scraped” via API? I don’t think it means what you think it means.
wtf…… going to get worse after IPO!
If you don’t want strangers knowing what you say, don’t join open servers. It’s pretty easy.
Open or closed, it’s going to get worse!
If they release closed Discord chats they may as well go out of business; people will flee.
They and other companies are already doing so.
So this is:
‘Uh guys, Discord chats leaked…’
For… what, just literally everyone who used Discord between 2015 and 2017, everyone who was an early adopter?
Dear fucking god.
I used to say ‘someday, people will learn’, but fucking no obviously not, no they won’t, almost everyone is an idiot and/or truly doesn’t care.
… I guess this’ll be fodder for a whole bunch of dramatubers / pedohunters for the next year or so…
The migration of public forum discussion into unsearchable, unpreserved, semi-private Discord chambers is probably the largest informational catastrophe of the internet so far.
404? another source please? I don’t trust them on this exact thing.
That’s good news. Internet archiving is an important endeavor because you never know when they’ll pull the plug. Now it’s a little more secure and probably far more useful than in Discord’s hands alone.
Not for messages that are supposed to be private lol. Let me just make a copy of all texts you’ve sent over the last decade, for “archiving”.
This says it was done via the API so they wouldn’t be private messages.
If they aren’t comfortable with their Discord messages being public, perhaps they shouldn’t have posted those messages in a public forum that the public can access.