Self-host Reddit – 2.38B posts, works offline, yours forever

19-84@lemmy.dbzer0.com · 6 months ago

offspec@lemmy.world · 6 months ago

It would be neat for someone to migrate this data set to a Lemmy instance

TeddE@lemmy.world · 6 months ago

It would be inviting a lawsuit for sure. I like the essence of the idea, but it’s probably more trouble than it’s worth for all but the most fanatic.

a1studmuffin@aussie.zone · 6 months ago

This seems especially handy for anyone who wants a snapshot of Reddit from before enshittification and the AI era, when content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.

SteveCC@lemmy.world · 6 months ago

Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.

19-84@lemmy.dbzer0.com · 6 months ago

thank you!!! i built on great ideas from others! i can’t take all the credit 😋

Tanis Nikana@lemmy.world · 6 months ago

Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.

Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.

19-84@lemmy.dbzer0.com · 6 months ago

the great part is that since everything is already built, it is easy to support additional data! there is even an issue template to submit a new data source! https://github.com/19-84/redd-archiver/blob/main/.github/ISSUE_TEMPLATE/submit-data-source.yml

19-84@lemmy.dbzer0.com · 6 months ago

PLEASE SHARE ON REDDIT!!! I have never had a reddit account and they will NOT let me post about this!!

BigDiction@lemmy.world · 6 months ago

You should be very proud of this project!! Thank you for sharing.

lautan@lemmy.ca · 6 months ago

Thanks. This is great for mining data and URLs.

frongt@lemmy.zip · 6 months ago

And only a 3.28 TB database? Oh, because it’s compressed. Includes comments too, though.

19-84@lemmy.dbzer0.com · 6 months ago

Yes! Too many comments to count in a reasonable amount of time!

vane@lemmy.world · 6 months ago

How long does it take to download this 3 TB torrent?

19-84@lemmy.dbzer0.com · 6 months ago

week(s)
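
For a rough sense of where "week(s)" comes from, here is a minimal back-of-the-envelope sketch. It assumes the 3.28 TB figure mentioned earlier in the thread, decimal terabytes, and a sustained download rate; the function name `download_days` and the example speeds are made up for illustration. Real torrent speeds depend on seeders and are often well below your line speed, so treat the results as a lower bound.

```python
def download_days(size_tb: float, mbit_per_s: float) -> float:
    """Days needed to transfer size_tb terabytes at a sustained rate in Mbit/s."""
    size_bits = size_tb * 1e12 * 8            # decimal TB -> bits
    seconds = size_bits / (mbit_per_s * 1e6)  # Mbit/s -> bit/s
    return seconds / 86_400                   # seconds -> days


if __name__ == "__main__":
    # Example sustained speeds (assumptions, not measurements).
    for speed in (10, 50, 100, 1000):
        print(f"{speed:>5} Mbit/s: {download_days(3.28, speed):.1f} days")
```

At a sustained 10 Mbit/s this works out to about a month, and at 50 Mbit/s to about six days, before accounting for swarm health.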

vane@lemmy.world · 6 months ago

Thank you for the answer. I think I’ll use this one instead: https://academictorrents.com/details/30dee5f0406da7a353aff6a8caa2d54fd01f2ca1 It looks like it’s divided by year and month.

19-84@lemmy.dbzer0.com · 6 months ago

those are not split by subreddit so they will not work with the tool

Howlinghowler110th@kbin.earth · 6 months ago

I think this is a good use case for AI, and I’m impressed with it. I wish the setup instructions were clearer, though.

19-84@lemmy.dbzer0.com · 6 months ago

thank you! the instructions are a little overwhelming, check out the quickstart if you haven’t yet! https://github.com/19-84/redd-archiver/blob/main/QUICKSTART.md

Butterphinger@lemmy.zip · 6 months ago

grabs external

Clbull@lemmy.world · 6 months ago

Eww, Voat and Ruqqus.

sj_zero@lotide.fbxl.net · 5 months ago

I’d be worried about having some of the Voat stuff on a hard drive I own.

I’m surprised GitHub hasn’t automatically nixed the archive.

19-84@lemmy.dbzer0.com · 6 months ago

i will always take more data sources, including lemmy!

polarity_inverter@startrek.website · 6 months ago

… for building your personal Grok?

19-84@lemmy.dbzer0.com · 6 months ago

if you didn’t notice, this project was released into the public domain

inspxtr@lemmy.world · 6 months ago

Very cool! Do you know how your project compares with Arctic Shift? For those more interested in research with Reddit data, is there a benefit to one over the other?

Seefoo@lemmy.world · 5 months ago

Does this decompress the files preemptively and leave them decompressed? Or does it only decompress as a post/subreddit is accessed? Basically, I am wondering what kind of storage footprint would be required to search through this.

Mubelotix@jlai.lu · 6 months ago

I do not consent to this

GitHub - 19-84/redd-archiver: A PostgreSQL-backed archive generator that creates browsable HTML archives from link aggregator platforms including Reddit, Voat, and Ruqqus.