FRESH

Hacker News

Home

The challenges of soft delete

257 points by buchanae

by MaxGabriel

8 subcomments

This might stem from the domain I work in (banking), but I have the opposite take. Soft delete pros to me:
* It's obvious from the schema: If there's a `deleted_at` column, I know how to query the table correctly (vs thinking rows aren't DELETEd, or knowing where to look in another table)
* One way to do things: Analytics queries, admin pages, it all can look at the same set of data, vs having separate handling for historical data.
* DELETEs are likely fairly rare by volume for many use cases
* I haven't found soft-deleted rows to be a big performance issue. Intuitively this should be true, since queries should be O log(N)
* Undoing is really easy, because all the relationships stay in place, vs data already being moved elsewhere (In practice, I haven't found much need for this kind of undo).
In most cases, I've really enjoyed going even further and making rows fully immutable, using a new row to handle updates. This makes it really easy to reference historical data.
If I was doing the logging approach described in the article, I'd use database triggers that keep a copy of every INSERT/UPDATE/DELETEd row in a duplicate table. This way it all stays in the same database—easy to query and replicate elsewhere.

by jackfranklyn

1 subcomments

The query complexity is the bit that catches teams off guard. You tell yourself "just add WHERE deleted_at IS NULL everywhere" but then you're six months in and someone's debugging why a report is showing ghost data because one query in a chain of 12 missed the filter.
Views help, but then you're maintaining parallel access patterns. And the moment you need to actually query deleted records (audit, support tickets, undo) you're back to bypassing your own abstractions.
Event sourcing solves this more cleanly but the operational overhead is real - most teams I've seen try it end up with a hybrid where core entities are event-sourced and everything else is just soft deleted with fingers crossed.

by hnthrow0287345

2 subcomments

Maybe I'm shooting for the moon, but I'd like soft delete to be some kind of built-in database feature. It would be nice to enable it on a table then choose some built-in strategies on how it's handled.
Soft-delete is a common enough ask that it's probably worth putting the best CS/database minds to developing some OOTB feature.

by patates

0 subcomment

Trigger-based approach is the only one that really works in my experience. Partition the archive table in a way that makes sense for your data and you're good to go.
Some more rules to keep it under control:
Partition table has to be append-only. Duh.
Recovering from a delete needs to be done in the application layer. The archive is meant to be a historical record, not an operational data store. Also by the time you need to recover something, the world may have changed. The application can validate that restoring this data still makes sense.
If you need to handle updates, treat them as soft deletes on the source table. The trigger captures both the old state (before update) and continues normally. Your application can then reconstruct the timeline by ordering archive records by timestamp.
Needless to say, make sure your trigger fires BEFORE the operation, not AFTER. You want to capture the row state before it's gone. And keep the trigger logic dead simple as any complexity there will bite you during high-traffic periods.
For the partition strategy, I've found monthly partitions work well for most use cases. Yearly if your volume is low, daily if you're in write-heavy territory. The key is making sure your common queries (usually "show me history for entity X" or "what changed between dates Y and Z") align with your partition boundaries.

by dagss

1 subcomments

I just long for DBs to evolve from "stateful" to "stateless". CQRS at the DB level.
* All inserts into append only tables. ("UserCreatedByEnrollment", "UserDeletedBySupport" instead of INSERT vs UPDATE on a stateful CRUD table)
* Declare views on these tables in the DB that present the data you want to query -- including automatically maintained materialized indices on multiple columns resulting from joins. So your "User" view is an expression involving those event tables (or "UserForApp" and "UserForSupport"), and the DB takes care of maintaining indices on these which are consistent with the insert-only tables.
* Put in archival policies saying to delete / archive events that do not affect the given subset of views. ("Delete everything in UserCreatedByEnrollment that isn't shown through UserForApp or UserForSupport")
I tend to structure my code and DB schemas like this anyway, but lack of smoother DB support means it's currently for people who are especially interested in it.
Some bleeding edge DBs let you do at least some of this efficient and user-friendly. I.e. they will maintain powerful materialized views and you don't have to write triggers etc manually. But I long for the day we get more OLTP focus in this area not just OLAP.

by talesmm14

2 subcomments

I've worked at companies where soft delete was implemented everywhere, even in irrelevant internal systems... I think it's a cultural thing! I still remember a college professor scolding me on an extension project because I hadn't implemented soft delete... in his words, "In the business world, data is never deleted!!"

by rorylaitila

3 subcomments

Databases store facts. Creating a record = new fact. "Deleting" a record = new fact. But destroying rows from tables = disappeared fact. That is not great for most cases. In rare cases the volume of records may be a technical hurdle; in which case, move facts to another database. The times I've wanted to destroy large volume of facts is approximately zero.

by jamilbk

1 subcomments

At Firezone we started with soft-deletes thinking it might be useful for an audit / compliance log and quickly ran into each of the problems described in this article. The real issue for us was migrations - having to maintain structure of deleted data alongside live data just didn't make sense, and undermined the point of an immutable audit trail.
We've switched to CDC using Postgres which emits into another (non-replicated) write-optimized table. The replication connection maintains a 'subject' variable to provide audit context for each INSERT/UPDATE/DELETE. So far, CDC has worked very well for us in this manner (Elixir / Postgrex).
I do think soft-deletes have their place in this world, maybe for user-facing "restore deleted" features. I don't think compliance or audit trails are the right place for them however.

by whalesalad

1 subcomments

A good solution here (can be) to utilize a view. The underlying table has soft-delete field and the view will hide rows that have been soft deleted. Then the application doesn't need to worry about this concern all over the place.

by maxchehab

1 subcomments

How do you handle schema drift?
The data archive serialized the schema of the deleted object representative the schema in that point in time.
But fast-forward some schema changes, now your system has to migrate the archived objects to the current schema?

by andy_ppp

0 subcomment

Could Postgres provide a mechanism where delete works as you'd expect but you can add WITH DELETED keyword to a SELECT and it returns everything even deleted records? I guess migrations are still an issue if you want to change the structure of the DB but maybe you could provide these as part of the database too - so INSERT INTO table(col1, col2, newCol...) FROM DELETED (col1, col2, newDataNotInDeleted) WHERE id = 123 CASCADE; or something like this.
There should be a preferred way to handle this as these are clearly real issues that the database should help you to deal with.

by tracker1

0 subcomment

I like having archive/history tables. I often do similar with job queues when persisting to a database, in this way the pending table can stay small and avoid full scans to skip the need for deleted records...
Aside, another idea that I've kicked forward for event driven databases is to just use a database like sqlite and copy/wipe the whole thing as necessary after an event or the work that's related to that database. For example, all validation/chain of custody info for ballot signatures... there's not much point in having it all online or active, or even mixed in with other ballot initiatives and the schema can change with the app as needed for new events. Just copy that file, and you have that archive. Compress the file even and just have it hard archived and backed up if needed.

by nottorp

1 subcomments

Why deleted_at?
We have soft_deleted as boolean which excludes data from all queries and last_updated which a particular query can use if it needs to.
If over 50% of your data is soft deleted then it's more like historical data for archiving purposes and yes, you need to move it somewhere else. But then maybe you shouldn't use soft delete for it but a separate "archive" procedure?

by 3rodents

4 subcomments

Soft deletes are an example of where engineers unintentionally lead product instead of product leading engineering. Soft delete isn’t language used by users so it should not be used by engineers when making product facing decisions.
“Delete” “archive” “hide” are the type of actions a user typically wants, each with their own semantics specific to the product. A flag on the row, a separate table, deleting a row, these are all implementation options that should be led by the product.

by alkonaut

1 subcomments

Can't most db systems just create a view over the data where archived_at is null, and this view is the table you use for 99% of your business needs (except auditing, undelete, ...)?

by ntonozzi

1 subcomments

I've given up on soft delete -- the nail in the coffin for me was my customers' legal requirements that data is fully deleted, not archived. It never worked that well anyways. I never had a successful restore from a large set of soft-deleted rows.

by theLiminator

2 subcomments

Privacy regulations make soft delete unviable in many of the cases where it's useful.

by LorenPechtel

1 subcomments

The % of records that are deleted is a huge factor.
You keep 99%, soft delete 1%, use some sort of deleted flag. While I have not tried it whalesalad's suggestion of a view sounds excellent. You delete 99%, keep 1%, move it!

by tbrownaw

0 subcomment

There are tables at $dayjob with both (begin, end) and also (incept, expire) fields. It's "on such-and-such date, X was true", but also allows for "as-of Z date, we believed that...".
Also you can have most data being currently unused even without being flagged deleted. Like if I go in to our ticketing system, I can still see my old requests that were closed ages ago.

by t1234s

0 subcomment

I can see a hybrid approach working where you use a deleted_at column for soft delete, then have a process that moves this data after X days to an archive and hard deletes from the main database. This makes undeletes in the short term simple and keeps all data if needed in the future.

by clickety_clack

0 subcomment

We have soft delete, with hard delete running on deletions over 45 days old. Sometimes people delete things by accident and this is the only way to practically recover that.

by cj

1 subcomments

We deal with soft delete in a Mongo app with hundreds of millions of records by simply moving the objects to a separate collection (table) separate from the “not deleted” data.
This works well especially in cases where you don’t want to waste CPU/memory scanning soft deleted records every time you do a lookup.
And avoids situations where app/backend logic forgets to apply the “deleted: false” filter.

by iterateoften

1 subcomments

I used to be pretty adamant about implementing soft delete for core business objects.
However after 15 years I prefer to just back up regularly, have point in time restores and then just delete normally.
The amount of times I have “undeleted” something are few and far between.

by stevefan1999

0 subcomment

That's why adding a DELETE FROM ... RETENTION UNTIL <date> for SQL would be very nice, combining both hard and soft delete with an internal TTL to reduce the impact

by cyberax

0 subcomment

Soft deletes + GC for the win!
We have an offline-first infrastructure that replicates the state to possibly offline clients. Hard deletes were causing a lot of fun issues with conflicts, where a client could "resurrect" a deleted object. Or deletion might succeed locally but fail later because somebody added a dependent object. There are ways around that, of course, but why bother?
Soft deletes can be handled just like any regular update. Then we just periodically run a garbage collector to hard-delete objects after some time.

by nemothekid

0 subcomment

The trigger architecture is actually quite interesting, especially because cleanup is relatively cheap. As far as compliance goes, it's also simply to declare that "after 45 days, deletions are permanent" as a catch all, and then you get to keep restores. For example, I think (IANAL), the CCPA gives you a 45 day buffer for right to erasure requests.
Now instead of chasing down different systems and backups, you can simply set ensure your archival process runs regularly and you should be good.

by moring

0 subcomment

Both the article and many comments here seem to miss that UPDATE deletes data -- the previous value of the field being updated -- which is a serious problem if soft-delete is your tool to keep old data. If you actually want historical data, you'll need logs or go straight to event sourcing.

by pjs_

0 subcomment

Tried implementing this crap once. Never again

by IgorPartola

0 subcomment

I have a love/hate relationship with soft deleted. There are cases where it’s not really a delete but rather a historical fact. For example, let’s say I have a table which stores an employee’s current hourly rate. They are hired at say $15/hour, then go to $17 six months later, then to $20/hour three months later. All of these three things are true and I want to be able to query which rate the employee had on a specific date even after their rate had changed. When I have a starts_on and an ends_on dates and the latter is nullable, with some data consistency logic I can create a linear history of compensation and can query historical and current data the same exact way. I also get
But this is such a huge PITA because you constantly have to mind if any given object has this setup or not and what if related objects have different start/end dates? And something like a scheduled raise for next year to $22/hour can get funny if I then try to insert that just for July it will be $24/hour (this would take my single record for next year and split it into two and then you gotta figure out which gets the original ID and which is the new row.
Another alternative to this is a pattern where you store the current state and separately you store mutations. So you have a compensation table and a compensation_mutations table which says how to evolve a specific row in a compensation table and when. The mutations for anything in the future can be deleted but the past ones cannot which lets you reconstruct who did what, when, and why. But this also has drawbacks. One of them is that you can’t query historical data the same way as current data. You also have to somehow apply these mutations (cron job? DB trigger?)
And of course there are database extensions that allow soft deletes but I have never tried them for vague portability reasons (as if anyone ever moved off Postgres).

by piratebroadcast

0 subcomment

thoughtbot wrote about this a while back https://thoughtbot.com/blog/the-hard-truth-about-soft-deleti...

by cadamsdotcom

0 subcomment

Why not use a trigger to prevent unarchiving?
And perf problems are only speculative until you actually have them. Premature optimization and all that.

by nerdponx

0 subcomment

One thing that often gets forgotten in the discussions about whether to soft delete and how to do it is: what about analysis of your data? Even if you don't have a data science team, or even a dedicated business analyst, there's a good chance that somebody at some point will want to analyze something in the data. And there's a good chance that the analysis will either be explicitly "intertemporal" in that it looks at and compares data from various points in time, or implicitly in that the data spans a long time range and you need to know the states of various entities "as of" a particular time in history. If you didn't keep snapshots and you don't have soft edits/deletes you're kinda SoL. Don't forget the data people down the line... which might include you, trying to make a product decision or diagnose a slippery production bug.

by MarginalGainz

1 subcomments

The hidden cost we battle in e-commerce isn't just DB storage/performance, it's Search Index Pollution. We treat 'availability' as a complex state machine (In Stock, Backorder, Discontinued-but-visible, Soft Deleted). Trying to map this logic directly into a Postgres query with WHERE deleted_at IS NULL works for CRUD, but it creates massive friction for discovery.
We found that strict CQRS/Decoupling is the only way to scale this. Let the operational DB keep the soft-deletes for audit/integrity (as mentioned by others), but the Search Index must be a clean, ephemeral projection of only what is currently purchasable.
Trying to filter soft-deletes at query time inside the search engine is a recipe for latency spikes.

by JohnLeitch

1 subcomments

My brother's now ex-wife learned the hard way about the challenges of soft delete. Too bad about the contents of that SQLite database, but his knowing was for the better.

by hirvi74

0 subcomment

I would never recommend my method for every type of application nor perhaps even most. However, I have had great success with not using soft deletes at all. I just write the records to a duplicate table then hard delete the records from the main table.
Of course, in a system with 1000s of tables, I would not likely do this. But for simpler systems, it's been quite a boon.

by cess11

0 subcomment

I don't know, pruning based on age and restoring by writing a new row based on the soft deleted one seems less complex than the cascade handling in the trigger solution.

by iamleppert

0 subcomment

There is another solution I use all the time: move deleted records to their own table. You probably don't need to do this for all tables. It allows you to not pepper your codebase with where clauses or statuses, everything works as intended, and you can easily restore records deleted by mistake, which is the original intent anyways. You can easily set this up by using a trigger at the database level in almost every database, that just works.

by MORPHOICES

0 subcomment

[dead]

by Barathkanna

0 subcomment

TLDR: Soft deletes look easy, but they spread complexity everywhere. Actually deleting data and archiving it separately often keeps databases simpler, faster, and easier to maintain.