On a related note, probably a similar percentage of people claim on their car insurance. If only the rest realised they had "crap insurance" and were paying for nothing, they could save so much money!
This is obviously sarcasm, but I think it's important to remember that much of the data is stored because we don't know what we will need later. Photos of kids? Maybe one of them will be The One that we end up framing? Miscellaneous business records? Maybe those will be the ones we have to dig out for a tax audit? Web pages on government sites? Maybe there will suddenly be an interest in obscure pages on public health policy if a global pandemic happens.
Complaining that data is mostly junk is not a particularly interesting conclusion without acknowledging this. Is there wastage? Yeah, sure, but accuracy about what needs storing trades off directly against the time spent figuring that out, and it's often cheaper to just store the data.
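To put rough numbers on that trade-off (every price and hourly figure below is an illustrative assumption, not a quote):

    # Back-of-the-envelope: keep everything vs. pay someone to triage it.
    # All figures are rough assumptions; plug in your own.
    GB_PER_TB = 1000

    warm_price = 0.023       # assumed USD per GB-month, standard object storage
    cold_price = 0.001       # assumed USD per GB-month, archival tier
    hourly_rate = 100.0      # assumed loaded cost of a person, USD/hour
    triage_hours_per_tb = 4  # guess: skimming and classifying 1 TB of mixed files

    def yearly_storage_cost(tb, price_per_gb_month):
        return tb * GB_PER_TB * price_per_gb_month * 12

    def one_off_triage_cost(tb):
        return tb * triage_hours_per_tb * hourly_rate

    for tb in (1, 10, 100):
        print(f"{tb:>4} TB: warm ~${yearly_storage_cost(tb, warm_price):,.0f}/yr, "
              f"cold ~${yearly_storage_cost(tb, cold_price):,.0f}/yr, "
              f"triage once ~${one_off_triage_cost(tb):,.0f}")

With numbers anywhere near these, parking a terabyte in a cold tier for a year costs less than an hour of anyone's time spent deciding what to delete.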
As we were gearing up to declare victory and start turning down the several dozen legacy storage clusters, someone mused that since some users were subject to litigation holds -- not allowed to delete any data -- at least some of the leftover data on the old system might be subject to a hold too, and we'd need to figure that out before we could delete it or incur legal risk. IIRC the leftover 'junk' data amounted to a few dozen petabytes spread across multiple clusters around the world, in different jurisdictions.

We spent several months talking with the lawyers figuring that out. It was an interesting dance: on the one hand we were quite confident there was unlikely to be anything in the leftovers that was both meaningful and not migrated to the new platform, while on the other hand we had to explain that it wasn't practical to just "go and look" through a few dozen PB of data. I recall we ended up somewhere in between, coming up with ways to distinguish categories of data like caches and working data from various pipelines. It added over six months to the project, but it was quite an interesting problem to work through, one that hadn't occurred to any of us earlier on because we were thinking entirely in technical terms about infrastructure migration.
Even if 90% of this data is "crap" and could be cut, the savings would still be just a drop in the bucket compared to worldwide energy use.
What really bloats things out is surveillance data (video and online behavioral) and logging/tracking/tracing data. Some of it ends up cold, but a lot of it stays warm for analytics, which also drives CPU/RAM/network usage and is pretty resource intensive.
The cost is justified because the margins of big tech companies are so wildly large. I'd argue those profits come mostly from network effects and rentier behavior, not from the actual value of the data being stored. If there were more competitive pressure, these systems could be orders of magnitude more efficient without any significant difference in value/quality/outcome, or really even productivity.
No, we're not. I really dislike this "environmental" anti-technologist angle. A single steel plant in China has ten times the "environmental impact" of all the photos stored on platters everywhere.
Would you prefer the photos to be a cocktail of weird chemicals on a negative, printed on glossy photo paper?
Digital data is the most ephemeral we have been able to make it, and only through vast effort.
This, by the way, has implications for storage system design. You want something that's cheap yet dense to encode, potentially at the slight expense of decode speed. People normally lose sleep over decode speed first and foremost, which, while important, doesn't minimize the overall resource bill.
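To make that concrete, here's a minimal sketch using Python's stdlib zlib as a stand-in codec (the corpus and levels are arbitrary assumptions); it measures encode time, stored size and decode time together, so a setting can be chosen on the whole resource bill rather than on decode speed alone:

    # Minimal sketch: weigh encode time, stored size and decode time together.
    # zlib is only a stand-in codec; the test data is synthetic.
    import time
    import zlib

    data = b"ts=1700000000 level=INFO msg=request served bytes=512\n" * 20000

    def bench(level):
        t0 = time.perf_counter()
        blob = zlib.compress(data, level)
        t1 = time.perf_counter()
        zlib.decompress(blob)
        t2 = time.perf_counter()
        return len(blob), (t1 - t0) * 1e3, (t2 - t1) * 1e3

    print("level  size(KB)  encode(ms)  decode(ms)")
    for level in (1, 6, 9):
        size, enc_ms, dec_ms = bench(level)
        print(f"{level:>5}  {size / 1024:8.1f}  {enc_ms:10.2f}  {dec_ms:10.2f}")

For data that's written once and read rarely, the size and encode columns dominate the bill; decode speed hardly matters.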
Storing "useless" data makes financial sense.
So the question isn't simply whether storage is wasted; it's how much waste there is relative to the environmental impact. Granted, books and photographs don't need to be continuously fed energy to keep the information available. However, storage is now so cheap that even with 90% waste it's economically viable to keep it all online. So the problem, if you can call it one, is that energy is too cheap, and externalities are not accounted for in the cost.
We are using 4-6 times as much storage as we need, and the files involved are often not small (on the order of 100 MB to 5 GB, written several dozen times a day), but fixing the overuse is so far down the priority list that I don't think it survived the great Jira purge of mid-2024.
This article mainly focuses on unused data in websites and enterprise databases; only toward the end does it briefly touch on "the elephant in the room": data in the cloud.
Data centers are now being built at breakneck speed all over the world to cater for AI modeling, training and serving. Most of this AI-related data is kept in data lakes as raw data that will probably never see the light of day, i.e. never be processed.
Bill Inmon warned us about these potential data swamps arising from the increasing popularity of the data lake [1].
Hopefully open table formats like Apache Iceberg can help rectify this epidemic of unused raw data, but time will tell [2]; a rough sketch of the kind of table maintenance they enable follows the links below.
[1] Lakehouses Prevent Data Swamps, Bill Inmon Says
https://www.datanami.com/2021/06/01/lakehouses-prevent-data-...
[2] What Are Apache Iceberg Tables and How Are They Useful?
https://www.snowflake.com/guides/what-are-apache-iceberg-tab...
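For what it's worth, here's a sketch of routine Iceberg table maintenance, assuming a SparkSession already configured with the Iceberg runtime and a catalog named "demo"; the table "raw.events" and the retention numbers are made up:

    # Sketch: Iceberg table maintenance to keep a raw-data lake from rotting.
    # Assumes a SparkSession wired up with the Iceberg runtime and a catalog
    # named "demo"; the table name and retention settings are illustrative.
    from datetime import datetime, timedelta, timezone
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

    cutoff = (datetime.now(timezone.utc) - timedelta(days=30)).strftime("%Y-%m-%d %H:%M:%S")

    # Expire snapshots older than ~30 days so their unreferenced data files can be cleaned up.
    spark.sql(f"""
        CALL demo.system.expire_snapshots(
            table => 'raw.events',
            older_than => TIMESTAMP '{cutoff}',
            retain_last => 10
        )
    """)

    # Delete files no snapshot references at all (leftovers from failed or aborted writes).
    spark.sql("CALL demo.system.remove_orphan_files(table => 'raw.events')")

It doesn't decide whether the raw data is worth keeping, but at least the table stops silently accumulating dead files.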
I agree that most of the stuff in data centers is probably crap, but that's because most of everything people do is crap. That's not for me to decide, though; people save things because they find value in them. Most of what has value to another person won't have value to you. Most of what people treasure in their life is thrown away after they die because nobody wants it, even their closest family members. Who gets to tell everyone the bad news, that objectively their memories are trash and they don't have a right to keep them anymore? Gerry Fuckin' McGovern?
Secondly, we aren't destroying the environment for any of this. Data centers account for something like 5% or less of overall electricity use. That's a lot, but we don't have to put data centers in random locations; we can (and do) put them where electricity is cheap. That generally means the 5% of electricity used for data centers isn't, kWh for kWh, as impactful as an average kWh of end use. Large companies like Meta and Google claim net-zero carbon by obtaining offsets. So in general we aren't "destroying the environment" to store copies of photos.
I mean, sure, there is some impact. Storage media has to be produced. But there's a reason storage is cheap: not a whole lot of resources go into it. And hard drives sitting idle in some data center without being accessed don't consume a lot of electricity.
There are very real and concerning problems with the environmental impact of IT. But they are primarily found in other areas. Energy consumption is mostly a function of "how much you compute with data", not "how much data you have".
In other words: be concerned about so-called "AI", be concerned about Bitcoin. Don't worry about unused data too much.
One short video can equal a year's worth of emails for someone. Similarly, the many webpages that don't get viewed often probably require only a negligible amount of resources to keep online, and might help someone who'd otherwise be faced with linkrot.
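Rough numbers, where every figure is an assumption just to show the order of magnitude:

    # Back-of-the-envelope: one short video vs. a year of email.
    # Every figure below is an assumption; adjust to taste.
    video_minutes = 1
    video_mbps = 8                  # assumed ~1080p bitrate
    video_mb = video_mbps / 8 * 60 * video_minutes    # ~60 MB

    emails_per_year = 2000
    avg_email_kb = 30               # headers plus HTML body, big attachments excluded
    email_mb = emails_per_year * avg_email_kb / 1000   # ~60 MB

    print(f"one {video_minutes}-minute video: ~{video_mb:.0f} MB")
    print(f"a year of email: ~{email_mb:.0f} MB")

A single minute of 1080p video lands in the same ballpark as thousands of plain emails.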
Best to focus on the low-hanging fruit.
Or the cost of figuring out that it's not worth saving...
Proof of work. Look at all this data I/we created.
And the article didn't even get to logs and other operational data yet.
At another site I found a mix of old Windows disk images with data still on them. With more crap inside.
In the end: storage may be cheap, but storing piles of disorganized crap is very costly when you want to find something.
Didn't Facebook start to move most of their least-used data onto optical arrays a long time ago?
Regardless of what you think about the article, this rings so true at many Fortune 500 companies.
The number of times I have seen teams grind through pointless bullshit to push some meaningless objective for the company, just so the middle manager (aka “Director of SVP of X product of Y branch”) can get a bullet point or two in the quarterly “all hands”.
Oh, and those 10 developers/offshore people who were just hired? It was all to pump up his or her “head count” number and get promoted to the next grade/level.
Then when that person gets promoted, those people get scattered throughout the firm or just let go.
It’s truly just weaponized incompetence.
citation needed
Whoever sent this dude made a mistake. People who don't share your worldview need to be persuaded, not insulted! Some dude stomps in, thinks all the snaps in the cloud are crap, thinks the big bosses are stupid for not instantly deleting the pictures they saved into the cloud... and then what? Download Lisp? Thought we got over this, pal.
WORSE IS BETTER.
P.S. do not erase our porn. WORSE IS BETTER.
Now it's clear that the new deal could only be implemented in homes/sheds with domestic PV and storage; smart cities keep failing, from the ancient Fordlandia onward (see Neom, Songdo, Masdar, PlanIT Valley, Lavasa, Ordos, Santander, Toronto Quayside (Google Sidewalk Labs), Amazon HQ2, Egypt's still-nameless new Cairo, Modi's 100-smart-city program in India, Arkadag, Innopolis, Nusantara, Proton City, ...), and they can't be powered by a smart grid at that scale.
So: new, well-insulated buildings, with ventilation of course, with PV and storage, with room for a domestic rack or two, with FTTH. Anyone in such a settlement could have their own "datacenter" at home, following the same trend as medical devices getting cheaper and smaller. A LOM? Well, a NanoKVM PCIe or an external JetKVM costs MUCH less than a classic LOM and does much more. We have all the gear to build such "datacenter at home" assemblies, with everyone keeping their own preferred crap and participating in distributed computing networks to pay back at least a bit of the gear and bandwidth.
It's not for everyone, of course; some will be trapped in dense cities while some large owners dream of an obviously impossible conversion of offices into apartments and datacenters, like https://finance.yahoo.com/news/southern-californias-hottest-... or https://www.euronews.com/next/2024/02/29/madrid-to-convert-u... and https://www.theguardian.com/society/2025/jan/05/office-to-ho... or https://czechdaily.cz/half-of-pragues-office-buildings-are-a... and so on across the developed world. That's even as we admit (https://doi.org/10.1073/pnas.2304099120) that we need a full-remote, DISTRIBUTED shift.
Food, meds and general retail distributed by a single integrated logistics platform for maximum efficiency in a spread-out society: the IT evolution makes Distributism possible.
Doing so eliminates the datacenter problems of concentrated energy, dense networking, heat handling and water, and also greatly reduces the crap, because everyone keeps their own personal data, and since keeping it isn't free, they'll learn to be storage-conscious.
Storing the files for Mr. McGovern’s website requires plastics, metals, power and physical space, yet I assume he believes that environmental effect is worthwhile. Who is he to decide for others that their choice to pay for the storage of data is not equally worthwhile to them?
That’s the beauty of a price system: each of us gets to decide what we will buy, and what we will not buy.
Now, perhaps his argument should be that the price of storing digital data does not adequately reflect the true cost. Perhaps there are unaccounted-for externalities. If so, then he should make that argument, perhaps by arguing for a tax to align prices with costs.
Someone else might argue that data is a liability as well as an asset. That’s another argument he could make.
But haranguing folks for spending their money in ways he doesn't like doesn't seem likely to produce the outcome he appears to wish for.