by crazygringo
24 subcomments
- This is insane.
I definitely was not aware Spotify DRM had been cracked to enable downloading at scale like this.
The thing is, this doesn't even seem particularly useful for average consumers/listeners, since Spotify itself is so convenient, and trying to locate individual tracks in massive torrent files of presumably 10,000's of tracks each sounds horrible.
But this does seem like it will be a godsend for researchers working on things like music classification and generation. The only thing is, you can't really publicly admit exactly what dataset you trained/tested on...?
Definitely wondering if this was in response to desire from AI researchers/companies who wanted this stuff. Or if the major record labels already license their entire catalogs for training purposes cheaply enough, so this really is just solely intended as a preservation effort?
by Etheryte
11 subcomments
- To put this into perspective, What.CD [0] was widely considered to be the music library of Alexandria, unparalleled in both its high quality standard and its depth. What had in the ballpark of a few million torrents when it got raided and shut down. Anna's rip of Spotify includes roughly 186 million unique records. Granted, the tail end is a mixed bag of bot music and whatnot, but the scale is staggering.
[0] https://en.wikipedia.org/wiki/What.CD
by virtualritz
6 subcomments
- I just found out that https://annas-archive.li/ is masked by my German internet provider (SIM.de/Drillisch).
I usually use a VPN but I had it switched off temp. to watch Fallout (Prime Video won't let you watch through a VPN). Only when I switched Mullvad back on could I open the site.
I didn't know German providers do this.
- This work is so critical.
Read an article that was published just 10 years ago, and witness the bit rot as most external links will 404, gone forever.
I think it's worth questioning the value of preserving -everything-, but it seems like if we can, we should.
- I recall many interesting tracks that were very aggressively deleted from all platforms in sync. I wonder if I could find them in this archive.
There is contemporary lost media being created every day because of how we distribute things now. I think in some cases, the intent of the publisher was to literally destroy every copy of the information. I understand the legal arguments for this, but from a spiritual perspective, this is one of the most offensive things I can imagine. Intentionally destroying all copies of a creative work is simply evil. I don't care how you frame it.
Making media effectively lost is not much different in my mind. Is it available if it's sitting on a tape in an iron mountain bunker that no one will ever look at again?
- Incredible.
> A while ago, we discovered a way to scrape Spotify at scale.
They won’t and shouldn’t divulge the details, but I imagine that would be a fun read!
- Truly amazing work. I couldn't help but feel sad that the less popular songs are not currently being stored, as those are definitely the ones more at risk of being lost forever.
If you like the goal and you have even a few 100gb available on your server, consider "donating" some of that space to seeding the data (music or books). It's absolutely how we can fight the system, even if just a tiny bit. https://annas-archive.org/torrents
by shevy-java
0 subcomment
- Hmm. This is actually not really something I need, I think; but I consider anna's archive etc... as about as important as the internet web archive. We need to preserve data, at the least important data, also historic data - how the original websites looked. Creativity of past generations. Same for games and books. It may be only ~30 years for webpages to have emerged, but there are also many young people who may not have experienced that since they are too young to have experienced it. There is always a generational change; our generation has the opportunity to store more things.
- Hmmm I don’t like this. There are sources for music with better quality out there and all this will do is paint them a bigger target for takedowns/prosecution. I am worried about losing their ebook library. Quoting from the announcement: “Generally speaking, music is already fairly well preserved.” They should have done this as a separate identity.
- This is something really important, especially in the days when music and film vanishes from platforms one by one. I myself have three playlists with greyed out titles (titles are missing so there's no possibility for me to find out what was there).
That's why I divide music into the kind that I want to have forever, which I buy on CDs, and dance music that I can live without one day.
- Not that we should, but it's technically feasible to have a music streaming server with the torrent as the backend, selectively downloading parts of the torrent in response to on-demand streaming requests from the client.
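A minimal sketch of the selective-download idea, for illustration only: a multi-file torrent is one contiguous byte stream cut into fixed-size pieces, so a file's offset and length determine which pieces a client would have to fetch. The file sizes and piece length below are made up, and `pieces_for_file` is a hypothetical helper, not part of any torrent library.

```python
from itertools import accumulate


def pieces_for_file(file_sizes, file_index, piece_length):
    """Return the range of piece indices that cover one file in a multi-file torrent."""
    # Files are laid out back to back, so each file starts where the previous one ends.
    offsets = [0, *accumulate(file_sizes)]
    start = offsets[file_index]
    end = offsets[file_index + 1]  # exclusive end offset
    if end == start:
        return range(0)  # an empty file needs no pieces
    first_piece = start // piece_length
    last_piece = (end - 1) // piece_length
    return range(first_piece, last_piece + 1)


# Hypothetical torrent: three tracks and 1 MiB pieces.
sizes = [3_500_000, 4_200_000, 2_900_000]
print(list(pieces_for_file(sizes, 1, 1024 * 1024)))
```

Pieces at the edges of the range are shared with neighbouring files, so a streaming backend built this way would fetch slightly more than the requested track.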
- The metadata alone is incredibly valuable for researchers. Having 186 million ISRCs catalogued with associated genre, tempo, and popularity data is a goldmine for music analysis that doesn't even require touching the audio files.
I've always found it interesting how streaming services have become the de facto music library of record, yet they can and do remove content at will. When Spotify pulled out of Russia, entire catalogs became inaccessible. Physical media and personal archives suddenly matter again in ways we thought were obsolete.
The copyright discussion is complex, but from a pure preservation standpoint, I'm glad someone is doing this work.
- Anna’s Archive has largely flown under the radar by focusing on books.
Even perceived involvement in music piracy puts a much bigger target on their back from far more aggressive actors (RIAA, major labels)
by gorbachev
2 subcomments
- Quoting from their page:
--------------
This is by far the largest music metadata database that is publicly available. For comparison, we have 256 million tracks, while others have 50-150 million. Our data is well-annotated: MusicBrainz has 5 million unique ISRCs, while our database has 186 million.
--------------
If they truly are on a mission to protect the world's information from disappearing, they should work with MusicBrainz to get this data on it.
Alternatively, it would be amazing if they built a MusicBrainz-like service around it.
In either case, to make the data truly useful, they'd need to solve the problem of how to match the metadata to a fingerprint used to identify the music tracks, assuming that data is not part of the metadata they collected.
by yellow_lead
1 subcomments
- Is the music torrent not up yet? Only see the metadata one here:
https://annas-archive.li/torrents/spotify
- Since the article asks:
> We're curious about the peaks at whole minutes (particularly 2:00, 3:00, 4:00). If you know why this is, please let us know!
As a hobby video/audio editor, people will start with their track taking up a preset amount and fill up the time - even if it means having some dead space at the end.
The other alternative is algorithmically created music.
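One way to check either explanation against the metadata would be to measure how many tracks end within a small window of a whole minute. This is only a sketch: it assumes durations are available in milliseconds (as in the `duration_ms` field of Spotify track objects), the sample values are invented, and both function names are hypothetical.

```python
def near_whole_minute(duration_ms, tolerance_ms=500):
    """True if a duration lies within tolerance_ms of a whole number of minutes."""
    remainder = duration_ms % 60_000
    # Distance to the nearest minute boundary, whether just before or just after it.
    return min(remainder, 60_000 - remainder) <= tolerance_ms


def share_near_minute(durations_ms, tolerance_ms=500):
    """Fraction of durations that fall close to a whole minute."""
    hits = sum(near_whole_minute(d, tolerance_ms) for d in durations_ms)
    return hits / len(durations_ms)


# Invented sample: two of the four tracks end near a minute mark.
sample = [180_000, 179_700, 185_000, 200_000]
print(share_near_minute(sample))
```

If durations were spread evenly, a 500 ms tolerance would catch roughly 1.7% of tracks, so a much higher share at 2:00, 3:00, or 4:00 would confirm the peaks; splitting the result by release year or genre might hint at which explanation fits.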
by syntaxing
1 subcomments
- Moral and legal discussion aside, this is technically very impressive. I also wouldn’t be surprised if this somehow kickstarts open source music generative AI from China.
- Site is down for me. Archive link: https://archive.is/jf3HW
- This is some of the greatest news I've ever heard for the digital preservation community. So many projects over the years could have used resources like this. Thank you for contributing to humankind!
by Motorbytes
0 subcomment
- Does the Spotify backup so far contain any grayed out or unavailable songs on their list?
I'm a music archivist & preservationist. I've archived and found several formerly lost or nearly lost albums, EPs, and singles, and I've been wondering whether the Spotify backup so far, even with just the available info, contains any taken-down, region-limited, or no-longer-available songs.
Any response is appreciated!
by nighthawk454
1 subcomments
- Amazing! I wonder if the Every Noise At Once[1] site could be updated with the metadata from this?
[1] https://everynoise.com/
- I have Spotify premium but the constant shuffle of content availability has meant I’ve started routinely archiving my liked songs to avoid any rug pull. Zspotify and co still work a charm.
by throwaway613745
1 subcomments
- I wonder how deep the hole they're gonna put whoever runs this site into is gonna be?
by peterburkimsher
2 subcomments
- For a fully-legal alternative of metadata archiving, I suggest the iTunes EPF (Enterprise Partner Feed).
https://performance-partners.apple.com/epf
The best metadata I've found, though, is the MySpace Dragon Hoard: https://archive.org/details/myspace_dragon_hoard_2010
That included the artist location, allowing me to tag songs based on their country. I then created playlists such as "NERAS" Non-English Rock Artist Sample, where the one most popular song for a particular artist was chosen, and only when the country of origin was not English-speaking, and the genre was Rock. I like listening to music while working, but English lyrics distract me because I understand what they're saying.
After discovering music via the MySpace archive, I've since purchased 73 songs from 35 artists that I'd never heard of before digging into the data. I rebuilt my playlist on Spotify, but got greyed out tracks, and YouTube Music, but got "unavailable video". So I still prefer purchasing tracks via the iTunes Music Store, Qobuz, Bandcamp, and 7digital.
Other data sources such as the MP3.com rescue barge, PureVolume archive, and Anna's Spotify archive lack the country-of-origin metadata, so are of less interest to me. It may be possible to use an LLM to guess the language of each track title, but someone else will have to do that.
Meanwhile, if you're interested in the genre-by-country MySpace data, or have questions about the iTunes EPF, feel free to reach out and we can discuss your research.
by DoctorOetker
0 subcomment
- I'd rather see them use AI to convert all the scanned scientific articles into proper PDF or other formats.
Also sort and classify the articles by binary size, vs page count, plot count, raster image count etc, in order to compress the outliers and detect when a raster image should have been a plot and convert it to vectorized images etc.
How compact can we get the collective human scientific corpus?
- It seems that the metadata doesn't include the lyrics, probably because they are provided by Musixmatch. It would have been nice to have a database of lyrics linked to ISRCs. AFAIK Lrclib doesn't support downloading lyrics for a given ISRC.
- great. Spotify just removes things all the time (things I actively listen to and work on for my jazz practices, one day just go "poof" because they didn't want to pay the record company anymore), and they are not as a company deserving of the role of "keeper of all the world's music". They don't give a shit and they'd vastly prefer we all listen to their AI generated royalty free crap and Joe Rogan.
- Unrelated, but I just can't stop myself from saying that I absolutely hate Spotify even though I'm a paying customer. Fuck you Spotify. You were supposed to be a convenient way to discover and listen to music. Now you are only convenient for listening to music, and absolutely terrible for any recommendations. This is sad really. Spotify had good recommendations. It's absolutely in a position where it can provide good recommendations — it has both a vast music library and a vast amount of data on user preferences. And it chooses to push procedural/ai-generated slop instead to earn more money. I thought that maybe buying $SPOT stock will make me more at peace with its greed, but it didn't work. Spotify fucking deserves to crash and burn because it sees paying customers as idiots who might not notice they are fed garbage. Fuck you Spotify, fuck you.
- 199GB, only metadata released for now.
Magnet link found here: https://annas-archive.li/torrents/spotify
Are magnet links allowed on HN?
by bguberfain
0 subcomment
- We can finally search for playlists with a given song! A basic feature that Spotify is missing!
by acjohnson55
0 subcomment
- This is incredible. I once assembled a collection of 100,000 tracks for research on exploration of large music libraries. Essentially vector search. I was limited in storage and processing power to a single machine.
If I were to do it today, I could get so much farther with hyperscaler products and this dataset.
- This might be the perfect time to do archiving before the entire internet gets inundated by sub-par AI generated content.
- Can someone explain why C#/Db (major/minor) is the third most popular key? Very unexpected for me, since it's relatively more difficult to play.
by userbinator
1 subcomments
- > Music files (releasing in order of popularity)
Increasing or decreasing? IMHO increasing would make more sense, as the most popular music is already mirrored in countless other places. It's the rare stuff that is most in need of preservation.
I wonder how much of the content there is AI-generated. Honestly, even as someone who was initially skeptical, I've found some of it to be rather good --- not knowing that it was AI-generated at first. Now if they could only reverse-engineer the prompt and only store the model, that would be an extremely efficient form of "compression".
- Attracting the ire of the music industry seems like a huge, unnecessary risk. I wish they had performed this as some kind of other entity to try to keep the ebook archive protected from the fallout. I fear this will not end well.
- Has anyone tried to add up the track file size from the metadata dump?
In spotify_clean_track_files.sqlite3:
SELECT count(*), sum(filesize_bytes) FROM track_files;
255966403|15970064861274
That's only 14.5 TiB, nowhere near 300 TiB. What makes up the other 285 TiB of content?
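The conversion above can be re-checked with a few lines of arithmetic. This sketch only re-derives figures from the two values quoted from the query; the 300 TiB total is the number cited in the comment, and nothing here says what `filesize_bytes` actually measures.

```python
total_bytes = 15_970_064_861_274  # sum(filesize_bytes) quoted from the query output
track_count = 255_966_403         # count(*) quoted from the query output
claimed_tib = 300                 # archive size cited in the comment

tib = total_bytes / 2**40   # binary tebibytes
tb = total_bytes / 10**12   # decimal terabytes
avg_kib = total_bytes / track_count / 1024

print(f"{tib:.2f} TiB ({tb:.2f} TB) across {track_count:,} rows")
print(f"average {avg_kib:.1f} KiB per row")
print(f"unaccounted for: {claimed_tib - tib:.1f} TiB")
```

The average works out to roughly 61 KiB per row, which looks small next to what a few minutes of 160 kbit/s audio would occupy, so the column may not hold what its name suggests. That is a guess, not something the dump confirms.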
by Mr_Minderbinder
0 subcomment
- > Over-focus on the highest possible quality
This is not an issue in my view. I like the fact that I can download 100 MiB ultra-high resolution TIFF files of scans of photographs from the original negative from the Library of Congress and 24-bit/96kHz FLAC files of captures of 78 RPM records from the Internet Archive. In addition to maintaining completeness and quality of information, one of the main goals of preservation is to guard against further degradation and information loss. You should try to preserve the highest quality copies available (because they contain more information) and re-encoding (deliberate degradation) should only be used to create convenient access copies.
Inferior copies, in addition to being less informative, have the potential to misinform. Only the archivist will enjoy space savings. All the readers who might consult your library in the infinite future will bear the cost.
> ...(e.g. lossless FLAC). This inflates the file size...
This is entirely the wrong view. The file size of a raw capture compressed to FLAC should be thought of as the “true” or “correct” size. It is roughly the most efficient (balancing various trade-offs) representation of sampled audio data that we can presently achieve. In preservation we seek to preserve the item or signal itself and not simply what we might perceive thereof. This human-centric perception view is just wrong. There is data in film photographs which cannot be perceived visually yet can be of interest to researchers and be revealed with digital image analysis tools.
As an example of how much information celluloid can contain see: https://vimeo.com/89784677
(context: he is comparing a Blu-ray and a scan of a 35mm print)
- The data analysis here is interesting. One thing that stood out to me is that black metal is the 6th most common musical genre for bands, right after rockabilly. I would never have expected that.
- TIL Anna's Archive is blocked in Germany (by a rather obtrusive MitM, I might add). Get redirected to a "Copyright Clearing House" or something.
by TheAceOfHearts
1 subcomments
- I wonder if they'll explore other music services as well. As I understand it, Deezer, Qobuz, and Tidal can all get ripped easily enough. Although I'm not sure if they rate limit downloads past a certain point.
I'm a bit sad that they chose to focus on music rather than audiobooks. Creating an archive of audiobooks seems like it would be more aligned with their mission.
- I just want to be able to back up my playlists. Maybe that's possible, but last time I looked I could only find sites that wanted your login. Not gonna happen.
by romanovcode
1 subcomments
- `spotdl download "https://open.spotify.com/user/{username}" --user-auth --output '{list-name}/{title} - {artists}.{output-ext}'`
This is literally all you need to back up Spotify.
by Kerollmops
0 subcomment
- So nice! That's an excellent extract and looks useful for benchmarking Meilisearch. I'll probably spend my Christmas holidays importing the tracks, albums, and artists into Meilisearch, while my CEO builds a beautiful front-end for it. I'll probably replace [the current music search demo](https://music.meilisearch.com) we have with this much higher-quality dataset!
That would also be a good fit for [the new delta-encoded posting lists I am working on](https://github.com/meilisearch/meilisearch/pull/5985). Let's see how good it can get. My early benchmarks showed a 50% reduction in disk usage.
- That's huge, although as a musician myself I am kinda scared of AI just taking all this data so they could make music better than me. I dunno, maybe drop in there an anti-AI trap zipbomb or something, that way it will work for normal users but not for AI.
- Merry Christmas!
by soundsgoodman
0 subcomment
- You need to seriously re-think this...
Releasing indie music, like really low-level indie music, for free in the name of "preservation" is so misguided.
Don't do this. You will only end up hurting the artists who rely on paid downloads.
- wow. Blocked in Belgium.
Error HTTP 451 - Unavailable For Legal Reasons
https://lumendatabase.org/notices/71398835
by performative
0 subcomment
- this is a really incredible effort. but, for the developers and analysts currently working with music metadata in a world where so much of music is being consumed thru streaming services that keep a tight hold on how their metadata and album art can be used, i am constantly yearning for a way to link streaming releases to public metadata sources that can be manipulated, embedded, and queried. i've done my best to build my own w/o a background in data science, but it's a hole that desperately needs filling to enable the new generation of scrobbling/music listening habit exploration.
by thenthenthen
0 subcomment
- Full circle! Thank you! (https://torrentfreak.com/how-the-pirate-bay-helped-spotify-b...)
- New multimodal training set just dropped.
- Uh, cool, I guess? I want to applaud that, but, first off, unless you are OpenAI or Facebook, it is not exactly plausibly easy to participate in the festivities. Even if I had spare 300 TB laying around, how the fuck do I download that?
But, more importantly, I cannot even say "good for you", because I don't actually think it is good for Anna's Archive. I wouldn't touch that thing, if I was them. Do we even have any solid alternatives for books, if Anna's Archive gets shot down, by the way? Don't recommend Amazon, please.
- Oh, just noticed my provider "Vodafone Germany" is blocking the domain annas-archive.li on DNS level.
- I hope someone builds an open API around this metadata. I'd love to have alternatives to the big player APIs.
by lelouch9099
7 subcomments
- How legal is this with regards to copyright laws?
- >Over-focus on the most popular artists. There is a long tail of music which only gets preserved when a single person cares enough to share it. And such files are often poorly seeded.
There is a ton of good bands with under 10k or even 1k monthly listeners.
- I am not enthused by this news. Let us entertain the possibility that similar institutions will eschew this catalog.
- Can this last?
I envision an army of lawyers and cyber security companies being prepared to unleash a scorched earth campaign that book publishers might want to be part of as well.
At the end it may take down more than just this publication but most others as well.
by meysamazad
0 subcomment
- I wonder if Spotify will pursue any legal actions to take this archive or the site down!
by walthamstow
2 subcomments
- Very interesting that a white noise track for babies is the 4th most popular track on Spotify.
by artninja1988
1 subcomments
- Wow. Anna is a godsend. Hopefully now we get some really good open source music models
by eastoncrafter
0 subcomment
- Plans to upload all this to musicbrainz soundid program?
by fungonimus
0 subcomment
- I would like a downloader! :D this is such an awesome project
by puffpuff12345
0 subcomment
- Amazing!
Is there any way to search this spotify database without downloading the currently available metadata torrent?
by eastoncrafter
0 subcomment
- Plans to upload all of this to music brainz soundid?
by hmokiguess
0 subcomment
- What an early christmas gift for humanity. Now, asking for a friend, what's the ideal setup for torrenting this? Mullvad / Tailscale?
- Downloading of individual files to Anna’s Archive Please!
by schmuckonwheels
0 subcomment
- I want to time-travel back to 2000 like Old Biff with the sports almanac so I can tell Shawn Fanning to use the "it's for historical preservation" defense.
by wartywhoa23
0 subcomment
- https://annas-archive.li/llm
- I want to peek in that metadata collection to see if it could be used to identify the AI slop that's infecting Spotify.
If you could identify a track supposedly by artist X was actually AI slop not created by artist X, you could use that information to skip tracks on (web) music players, for example.
- I wonder how definitive their collection is and how much ripping Google Music/YouTube would improve on this.
A distributed ripping project to do that would be a fine thing.
- > ≥70% of songs are ones almost no one ever listens to (stream count < 1000).
So much interesting but undiscovered music is out there!
- This is conspiracy theory territory but I wonder if big tech is sponsoring efforts like this as an easy way to get training data.
- I really don't understand how focusing on source quality files is supposed to be a "major issue" with the music preservation community. It's bizarre for them to talk about these being barriers to creating a "full archive of all music that humanity has ever produced" and then have their answer be scraping Spotify to end up with a music library comprised of many AI and bulk-produced songs at 75/160kbps.
by damnitbuilds
0 subcomment
- Well done !
Until we have reasonable copyright terms, Pirate On !
by littlecranky67
0 subcomment
- For some reason, the link does not work for me (Spain). It works perfectly at the same time in Tor Browser.
- Congrats! I’m sure the Spotify lawyers are gonna have some sleepless nights ahead.
- GREAT DAY
- the top 10,000 songs seem to be 99.9% top-40 corporate pop, which surprised me. thought a list that broad would pick up more that was outside the mainstream ...
- Holy crap. This is going to trigger a five-alarm fire at Spotify Engineering. This has got to be among the largest proprietary datasets ever unintentionally publicized by a company.
- i'm thinking about the consolidation around minute marks. it's at every minute mark below 10 minutes, albeit dropping precipitously after 4 minutes. i have 2 guesses. guess one is that people like even numbers, so if a track was already going to be within so many seconds of exactly a minute mark, they are more likely to push it to that number. people probably care less above 4 minutes because at that point you are already making a long song. my second guess is that along with the vast increase of ai slop posted to spotify, both by spotify themselves and by other people, some of the programs they use probably fix on minute increments, like how a lot of ai videos are 10 seconds long or a series of 10 second videos. just a guess, however. i have no information or facts to back this up
- Is this all regions? I'm assuming so but I can't be sure
- That’s why Spotify would lose against Apple. Spotify may need to pay a fortune for this scraper behaviour while Apple Music does not.
- > The quality is the original OGG Vorbis at 160kbit/s.
Yeah, the original quality is either a 320kbps OGG or lossless. Not 160.
While this is _a_ backup, it's a pretty lossy one.
- is there a torrent client already that is good at partial downloads? I didn't realize how popcorn time worked until I read this thread.
by reactordev
0 subcomment
- Oh this is going to go over real well in Nashville, TN.
- If only Spotify paid musicians their fair share
by BaudouinVH
0 subcomment
- error 451 https://postimg.cc/QFddnW41
- Looking at the analysis, I'm totally surprised opera and psytrance are so prolific.
Psy-trance... I thought it was the same as any other electronic genres, but do people get high and just start shoveling psy-trance tracks out or something?
Opera I thought was a very strict discipline, needing rigorous somewhat esoteric training in order to produce the right sounds. How could there be so many opera artists?
I mean, I'm sure there's some misclassification, but chamber music is basically a couple people with any sort of music training on classical instruments so that doesn't surprise me nearly as much... I can easily imagine there being _lots_ of those, and you might come up with a different artist name for each unique set of people you collaborate with.
- We need insane for culture to survive.
- Is there a way to see the shape of the metadata?
by RickyLahey
0 subcomment
- This will be great to train AI on.
- I hope they get the new lossless versions
- spotify undressed
- yo, this is insane!! why would anyone do that? I think it is for AI music generation models, like training them. Maybe ai labs people did it?? yeah that is likely
- Yes, but do they have the one that goes like: to-to-to dotodoo? Hmmm? Do they?
- Now, anyone with some decent info on signal processing and machine learning can build his/her own Shazam.
- free the music
- Just buy music DRM-free in the first place.
- the metadata alone is a staggering couple hundred gb, however it contains quite handy information to play with. consider the following:
> /audio-features/{id} "Get audio feature information for a single track identified by its unique Spotify ID."
this combined with track metadata can finally allow those motivated enough to create their own personalized shuffle. potentially better than the slop we get nowadays. no generative ai required*.
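As a rough illustration of that idea, here is a sketch of a shuffle that favours tracks close to a chosen mood, with no generative model involved. It assumes each track carries `energy` and `valence` values between 0 and 1, as Spotify's audio features do; the track IDs, the target values, and the `weighted_shuffle` helper are all hypothetical.

```python
import random


def weighted_shuffle(tracks, target, rng=None):
    """Return track IDs in random order, favouring tracks close to `target`."""
    if rng is None:
        rng = random.Random()

    def weight(track):
        # Mean absolute distance over the target features (each assumed to be in 0..1).
        dist = sum(abs(track[name] - value) for name, value in target.items()) / len(target)
        return max(1.0 - dist, 1e-6)  # closer tracks get a weight nearer 1

    # Efraimidis-Spirakis weighted sampling: draw u in [0, 1) per track and
    # order by u ** (1 / weight), so heavier tracks tend to come earlier.
    keyed = [(rng.random() ** (1.0 / weight(t)), t["id"]) for t in tracks]
    return [track_id for _, track_id in sorted(keyed, reverse=True)]


# Invented tracks and target mood.
tracks = [
    {"id": "a", "energy": 0.9, "valence": 0.8},
    {"id": "b", "energy": 0.2, "valence": 0.3},
    {"id": "c", "energy": 0.7, "valence": 0.9},
]
print(weighted_shuffle(tracks, {"energy": 0.8, "valence": 0.8}, random.Random(42)))
```

Every track still appears exactly once; the weights only bias the order, so the result stays a shuffle rather than a sorted list.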
- wow
by bekindtoartists
0 subcomment
- I’m hugely disappointed in Anna’s archive. As much as they believed they were doing this for good, they have now allowed bad faith actors to obtain all music for AI gen. This is just horrific for all artists out there who are fighting against so many issues that impact their creativity and sustainability. Why not just digest the data and not allow the music out there. As usual artists get fucked over.
by snoozebutton
1 subcomments
- is this not highly illegal?
by zoklet-enjoyer
1 subcomments
- Wow. Now I just need some hard drives and a way to download that without my ISP doing something about it. That's amazing.
by throw-12-16
0 subcomment
- I love coming to these threads to read the pearl clutching of "technologists" who suddenly care about IP and copyright law.
- Yuck. Just to make it easier to train slop machines. The point of art is not to have completionist archives of EVERYthing that’s ever been made! Let it die. Death is the most natural part of life. Art is about the human experience, not “for researchers”.
The point is human connection. Art is a living reflection and record of human experience.
Art will persevere- the kinds of folks who prioritize what they like based on popularity were never the supporters artists (contrast with craftspeople trying to make a buck) counted on in the first place. Enjoy your derivative slop - we’ll continue on our imperfect, messy, individual, human artistic lives.
- Unlike books, which are massively overpriced, this will hurt artists a lot as they need the fees paid by Spotify to make ends meet.