- Cool way to self-host archives.
What I'd really like is a plugin that automatically pulls from archives somewhere and replaces deleted comments and those bot-overwritten comments with the original content.
Reddit is becoming maddening to use because half the old links I click have comments overwritten with garbage out of protest for something. Ironically the original content is available in these archives (which are used for AI training) but now missing for actual users like me just trying to figure out how someone fixed their printer driver 2 years ago.
by NickNaraghi
1 subcomments
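A minimal sketch of the replacement step such a plugin would need, assuming comments are dicts with `id` and `body` fields and that the archive has already been loaded into an `id -> original body` mapping. The function name `restore_comments` and the `restored` flag are hypothetical, and treating any differing body as "overwritten" is a crude heuristic, not how any existing tool is known to work:

```python
# Placeholder bodies that mark a deleted or moderator-removed comment.
DELETED_MARKERS = {"[deleted]", "[removed]"}


def restore_comments(live, archive, overwrite_edited=True):
    """Return a copy of `live` with bodies restored from `archive` where possible.

    live: list of dicts, each with "id" and "body" (assumed shape).
    archive: dict mapping comment id -> archived body (assumed shape).
    overwrite_edited: if True, also restore comments whose live body
        differs from the archived one (heuristic for bot-overwritten text).
    """
    restored = []
    for comment in live:
        original = archive.get(comment["id"])
        body = comment["body"]
        # Only restore when the archive holds real text, not a placeholder.
        if original is not None and original not in DELETED_MARKERS:
            is_deleted = body in DELETED_MARKERS
            is_changed = overwrite_edited and body != original
            if is_deleted or is_changed:
                comment = {**comment, "body": original, "restored": True}
        restored.append(comment)
    return restored
```

With `overwrite_edited=False` only deleted or removed comments are touched, which avoids clobbering legitimate edits at the cost of leaving protest-overwritten text in place.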
- Data is available via torrent in this section: https://github.com/19-84/redd-archiver?tab=readme-ov-file#-g...
- This is a neat project, nice work.
You've probably come across this already but there are alternative archives to PushShift that may have differing sets of posts and comments (perhaps depending on removal request coverage?)
One is Arctic Shift: https://github.com/ArthurHeitmann/arctic_shift/releases
Another is PullPush: https://pullpush.io/
- I wonder if you could use this to "seed" a new distributed social media thing and just take over from there.
Sort of like forking a project.
by feconroses
1 subcomments
- Very cool project! Quick question: is the underlying Pushshift dataset updated with new Reddit data on any regular cadence (daily/weekly/monthly), or is this essentially a fixed historical snapshot up to a certain date? Just want to understand if self-hosters would need to periodically re-download for fresh content or if it's archival-only.
- I tried spinning up the local approach with docker compose, but it fails.
There's no `.env.example` file to copy from. And even if the env vars are set manually, there are issues with the mentioned volumes not existing locally.
Seems like this needs more polish.
by elSidCampeador
1 subcomments
- I wonder if this can be hooked up with the now-dead Apollo app in some way, to get back a slice of time that is forever lost now?
by twobitshifter
1 subcomments
- If reddit was a squeaky clean place, or if I could pick certain subs, maybe I would be interested, but I really wouldn't want ALL of reddit on my machine even temporarily.
- Hey, I’m working on a similar project and have uploaded Pushshift Reddit data to Hugging Face Datasets. If anyone wants to download specific files when torrents aren’t seeding well, you can use:
https://huggingface.co/datasets/nick007x/pushshift-reddit
It’s handy for grabbing individual months or subreddit slices without needing to pull the full torrent. Might be useful for smaller-scale archiving or testing.
- Is there any way to check if a subreddit that was made private (2-3 years ago) is in the data dump?
by vivzkestrel
1 subcomments
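One rough way to check is to scan the dump for records naming that subreddit. The sketch below assumes the dump has already been decompressed into newline-delimited JSON with a `subreddit` field on each record; the function name `count_subreddit_records` is hypothetical, and this is not a feature of the project itself:

```python
import json


def count_subreddit_records(lines, name):
    """Count records in an iterable of NDJSON lines whose subreddit matches `name`.

    Assumes each line is one JSON object with a "subreddit" field.
    Malformed lines and non-object records are skipped.
    """
    target = name.lower()
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or corrupt lines
        if not isinstance(record, dict):
            continue
        # Subreddit names are compared case-insensitively.
        if str(record.get("subreddit", "")).lower() == target:
            count += 1
    return count
```

Because it takes any iterable of lines, it can be fed an open file object or `sys.stdin`, so the whole dump never has to sit in memory. A count of zero only means the subreddit is absent from that particular file.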
- Slightly off-topic, but does anyone have a similar dataset of all YouTube channels out there?
Details would probably include the 400 million YouTube accounts, channel ID, name, creator URL, etc.
by justsomehnguy
1 subcomments
- Appreciated.
EDIT: Is there any cheap way to search? I have an MS TechNet archive which is useless without search, so I really want to know a way to have a cheap local search without grepping everything.
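One cheap option for local search is SQLite's FTS5 full-text index, which Python's standard `sqlite3` module can use when the underlying SQLite build includes FTS5 (common, but not guaranteed). A minimal sketch, with the table name `docs` and the helpers `build_index` and `search` being hypothetical names chosen for illustration:

```python
import sqlite3


def build_index(docs, db_path=":memory:"):
    """Index (doc_id, text) pairs into an FTS5 table and return the connection.

    Requires an SQLite build with the FTS5 extension enabled.
    """
    conn = sqlite3.connect(db_path)
    # doc_id is stored but not tokenized; body is the searchable text.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(doc_id UNINDEXED, body)"
    )
    conn.executemany("INSERT INTO docs (doc_id, body) VALUES (?, ?)", docs)
    conn.commit()
    return conn


def search(conn, query, limit=10):
    """Return doc_ids matching an FTS5 query string, best match first."""
    rows = conn.execute(
        "SELECT doc_id FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in rows]
```

Passing a file path instead of `":memory:"` keeps the index on disk, so the archive only needs to be indexed once. Multiple words in a query are treated as an implicit AND.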
- Does it also contain countless NSFW content?
by leshokunin
0 subcomment
- Is there a docker compose?
- Opened the live demo, went into the programming subreddit, felt like I was showered with liquid shit. I tend to forget what kind of edgelord hellhole Reddit was (and still is sometimes).
- I want to do the same thing for TikTok. I have 5k videos downloaded, starting from the pandemic. I want to find a way to use AI to tag and categorize the videos so I can scroll through them locally.
- This is a great way to participate in arguments you missed three years ago.
by inquirerGeneral
0 subcomment
- [dead]
by Jordan-117
5 subcomments
- [flagged]
by kylehotchkiss
2 subcomments
- _Hacker News collectively grabs the dataset to train their models on how to become effective reddit trolls_
by syngrog66
3 subcomments
- Did you pay all the people who created its content?