Reddit’s API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.
The key point: This doesn’t touch Reddit’s servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.
What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
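To make the "dump in, static HTML out" idea concrete, here is a minimal sketch using only the Python standard library. It is not the tool's actual code (the project uses Jinja2 templates). The field names `title`, `author`, and `selftext` are assumed to match the Pushshift submission records, and the input is assumed to be already-decompressed, newline-delimited JSON. Real `.zst` dumps would need to be decompressed first.

```python
import html
import json


def render_posts(ndjson_lines):
    """Turn newline-delimited JSON post records into one static HTML page.

    Assumption: each record may carry `title`, `author`, and `selftext`.
    """
    items = []
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        post = json.loads(line)
        # Escape every field so archived content cannot inject markup or scripts.
        title = html.escape(post.get("title", "(untitled)"))
        author = html.escape(post.get("author", "[deleted]"))
        body = html.escape(post.get("selftext", ""))
        items.append(
            f"<article><h2>{title}</h2><p>by {author}</p><p>{body}</p></article>"
        )
    return "<!doctype html>\n<html><body>\n" + "\n".join(items) + "\n</body></html>\n"


# Hypothetical sample records, for illustration only.
sample = [
    '{"title": "Hello <b>world</b>", "author": "alice", "selftext": "First post"}',
    '{"title": "Second", "author": "bob", "selftext": ""}',
]
page = render_posts(sample)
```

The result is a plain string you could write to `index.html`. With no scripts and no external requests, it opens straight from a folder or USB drive.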
API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.
Self-hosting options:
- USB drive / local folder (just open the HTML files)
- Home server on your LAN
- Tor hidden service (2 commands, no port forwarding needed)
- VPS with HTTPS
- GitHub Pages for small archives
Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.
Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.
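One common way to keep memory flat while loading a dataset of any size is to stream records in fixed-size batches and hand each batch to the database. The sketch below shows only that pattern. It is an assumption about the general approach, not a description of how the tool does it, and `insert_batch` is a hypothetical stand-in for a bulk insert into PostgreSQL.

```python
from itertools import islice


def batched(records, size):
    """Yield lists of at most `size` records without materializing the input."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load(records, insert_batch, size=1000):
    """Feed records to `insert_batch` one batch at a time; return the total count."""
    total = 0
    for batch in batched(records, size):
        insert_batch(batch)  # hypothetical: e.g. a bulk INSERT into PostgreSQL
        total += len(batch)
    return total


# Demo with a generator, so the full input never sits in memory at once.
sizes = []
count = load(({"id": i} for i in range(2500)), lambda b: sizes.append(len(b)))
```

Only one batch is held at a time, so peak memory depends on the batch size rather than on the number of posts.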
How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is “trust but verify” – it accelerates the boring parts but you still own the architecture.
Live demo: https://online-archives.github.io/redd-archiver-example/
GitHub: https://github.com/19-84/redd-archiver (Public Domain)
Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4
It would be neat for someone to migrate this data set to a Lemmy instance
It would be inviting a lawsuit for sure. I like the essence of the idea, but it’s probably more trouble than it’s worth for all but the most fanatic.
This seems especially handy for anyone who wants a snapshot of Reddit from before the enshittification and AI era, when content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.
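Picking a cutoff could look roughly like this, assuming the records are newline-delimited JSON with a `created_utc` Unix timestamp (the field Pushshift dumps use, sometimes stored as a string). This is a sketch of the filtering idea, not a feature of the tool itself.

```python
import json
from datetime import datetime, timezone


def before_cutoff(ndjson_lines, cutoff):
    """Yield only the records created strictly before `cutoff` (an aware datetime)."""
    limit = cutoff.timestamp()
    for line in ndjson_lines:
        record = json.loads(line)
        # Assumption: created_utc is a Unix timestamp, as a number or a string.
        if float(record["created_utc"]) < limit:
            yield record


cutoff = datetime(2020, 1, 1, tzinfo=timezone.utc)
sample = [
    '{"id": "a", "created_utc": 1546300800}',    # 2019-01-01 UTC
    '{"id": "b", "created_utc": "1609459200"}',  # 2021-01-01 UTC
]
kept = [r["id"] for r in before_cutoff(sample, cutoff)]
```

Here only record `a` survives the 2020 cutoff.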
Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.
thank you!!! i built on great ideas from others! i can't take all the credit 😋
Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.
Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.
the great part is that since everything is already built, it is easy to add support for additional data! there is even an issue template to submit a new data source! https://github.com/19-84/redd-archiver/blob/main/.github/ISSUE_TEMPLATE/submit-data-source.yml
PLEASE SHARE ON REDDIT!!! I have never had a reddit account and they will NOT let me post about this!!
You should be very proud of this project!! Thank you for sharing.
Thanks. This is great for mining data and urls.
And only a 3.28 TB database? Oh, because it’s compressed. Includes comments too, though.
Yes! Too many comments to count in a reasonable amount of time!
How long does it take to download this 3TB torrent?
week(s)
Thank you for the answer. I think I'll use this one instead: https://academictorrents.com/details/30dee5f0406da7a353aff6a8caa2d54fd01f2ca1 It looks like it's divided by year-month.
those are not split by subreddit so they will not work with the tool
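If someone wanted to regroup a monthly dump by subreddit themselves, a rough sketch could look like the one below. It assumes decompressed, newline-delimited JSON where each record has a `subreddit` field. The output file naming is hypothetical, and nothing in this thread confirms the tool would accept files produced this way.

```python
import json
import os
import tempfile


def split_by_subreddit(ndjson_lines, out_dir):
    """Append each record to <out_dir>/<subreddit>.ndjson; return per-subreddit counts."""
    os.makedirs(out_dir, exist_ok=True)
    counts = {}
    for line in ndjson_lines:
        # Assumption: records carry a `subreddit` field; others go to `_unknown`.
        name = json.loads(line).get("subreddit") or "_unknown"
        path = os.path.join(out_dir, f"{name}.ndjson")
        # Append mode keeps memory flat: no record is held after it is written.
        with open(path, "a", encoding="utf-8") as f:
            f.write(line.rstrip("\n") + "\n")
        counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical sample records, for illustration only.
sample = [
    '{"id": "1", "subreddit": "linux"}',
    '{"id": "2", "subreddit": "python"}',
    '{"id": "3", "subreddit": "linux"}',
]
out_dir = tempfile.mkdtemp()
counts = split_by_subreddit(sample, out_dir)
```

Reopening a file for every record is slow on a real dump. A practical version would keep a bounded set of file handles open, but the grouping logic stays the same.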
I think this is a good use case for AI, and I'm impressed with it. I wish the setup instructions were clearer, though.
thank you! the instructions are a little overwhelming, check out the quickstart if you haven’t yet! https://github.com/19-84/redd-archiver/blob/main/QUICKSTART.md
grabs external
Eww, Voat and Ruqqus.
I’d be worried about having some of the voat stuff on a hard drive I own.
I’m surprised GitHub hasn’t automatically nixed the archive.
i will always take more data sources, including lemmy!
… for building your personal Grok?
if you didn’t notice, this project was released into the public domain
Very cool! Do you know how your project compares with Arctic Shift? For those more interested in research with Reddit data, is there a benefit to one over the other?
Does this decompress the files preemptively and leave them? Or does it only decompress as a post/subreddit is accessed? Basically, I am wondering what kind of storage footprint would be required to search through this.
I do not consent to this







