Hacker News

Vacuum Is a Lie: About Your Indexes

89 points by birdculture

by ZeroConcerns

I think this article goes a bit overboard with the negative language ('lies', 'fools'), especially since (auto)VACUUM and indexes really don’t have that much to do with each other: the former is indeed critical on PostgreSQL to ensure availability, but something of a niche feature for most other databases, while index maintenance is important regardless of platform.
For a certain class of applications ('SQLite level'), there’s not even much of that, though, other than ensuring there are no missing or obsolete indexes, which you can take care of with 15 minutes of quality time with the EXPLAIN statement every now and then.
When using a database with persistent index statistics (like SQL Server and Oracle and, yeah, PostgreSQL), it’s important to at least ensure those get updated on a regular basis (but that’s almost always automatic and sufficient unless you're prone to not-usually-done bulk operations) and to optimize or rebuild the underlying tree on a semi-regular basis. This does require some additional non-default setup and monitoring, and can be surprising when you first encounter it.
But it’s not exactly an obscure-slash-secret bit of DBA lore either, unlike what's suggested here...

by chuckadams

Shorter:
* VACUUM does not compact your indexes (much).
* VACUUM FULL does. It's slow though.

by didgetmaster

This article points out some of the pain associated with index maintenance. It should also point out that ALL indexes on a table suffer from the same issue. If your 20 column table has 7 indexes, then the suggestions should be applied 7x.
It is conventional wisdom that indexes are absolutely essential for any relational table that is of at least reasonable size (e.g. thousands of rows) and is accessed more often than daily. Indexes can be a pain to create and maintain, but they can greatly speed up queries and primary key validations. The pain mostly comes from having to figure out what indexes to create and how often to maintain them, rather than from doing the actual work.
Indexes also have a performance penalty for any table updates. Creating new rows, updating existing rows, or deleting rows all require updates to each index.
But are indexes really required? I am creating a new kind of general purpose data management system (a kind of object store) called Didgets. The tagging mechanism that I invented to attach tags to each data object uses key-value stores that essentially form a set of columnar stores.
I found that these columnar stores could also be used to create regular relational database tables. The data is structured such that indexes are not needed. All the tests that I have run (up to a thousand columns with over 100 million rows) show that query speeds are equal to or better than those of other database systems that are well indexed.
The system is still under development, so it is still missing some key features that would make it a drop-in replacement for other databases; but it proves that it is possible to structure relational data such that query speeds can be optimal without needing separate indexing structures that have to be maintained.

by lfittl

The article has a section where it estimates index bloat by comparing the index's reltuples * 40 bytes (?) against the size of the file on disk.
This is problematic, first of all because I don't think the math is right (see [0] for a more comprehensive query that takes into account column sizes), and second because it ignores the effects of B-Tree index deduplication in Postgres 13+: [1]
In my experience, fast bloat estimation queries can work okay for table bloat, but for index bloat I'd recommend instead looking at the change in page density over time (i.e. track relpages divided by reltuples), or just go direct to running pgstatindex outside business hours.
[0]: https://github.com/pgexperts/pgx_scripts/blob/master/bloat/i... [1]: https://www.postgresql.org/docs/current/btree.html#BTREE-DED...
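To make the "page density over time" idea concrete, here is a minimal Python sketch. It assumes you periodically record `relpages` and `reltuples` for an index from `pg_class`; the `Snapshot` type, the function names, and the 30% threshold are all illustrative assumptions, not part of Postgres or any existing tool. Density is expressed here as entries per page (reltuples / relpages), so a falling number means sparser pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One sample of pg_class statistics for a single index (hypothetical type)."""
    relpages: int     # number of pages, as recorded in pg_class.relpages
    reltuples: float  # estimated entry count, as recorded in pg_class.reltuples


def tuples_per_page(s: Snapshot) -> float:
    """Estimated index entries per page; 0.0 when the stats are unusable."""
    # Treat an empty index or a negative (not yet collected) estimate as "no data".
    if s.relpages <= 0 or s.reltuples < 0:
        return 0.0
    return s.reltuples / s.relpages


def density_drop(baseline: Snapshot, current: Snapshot) -> float:
    """Fractional drop in density versus the baseline (0.25 means 25% sparser)."""
    base = tuples_per_page(baseline)
    if base == 0.0:
        return 0.0
    return (base - tuples_per_page(current)) / base


def looks_bloated(baseline: Snapshot, current: Snapshot, threshold: float = 0.30) -> bool:
    """Flag the index for a closer look (e.g. with pgstatindex); threshold is an assumption."""
    return density_drop(baseline, current) >= threshold
```

Comparing a snapshot taken right after a rebuild with a later one gives a cheap signal for when a more expensive check is worth running; it is a trend indicator, not a bloat measurement.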

by dap

> The exclusive lock is only needed during the final swap phase, and its duration can be configured.
FYI: even a very short operation that requires an exclusive lock can induce significant downtime if there’s anything else that holds a shared lock for extended periods. In [1], there was:
- a wraparound autovacuum (which holds a shared lock for potentially a long time — like hours)
- lots of data path operations wanting a shared lock
- one operation that should have been very brief that merely tried to take an exclusive lock
The result is that the presence of an operation wanting an exclusive lock blocked the data path for the duration of the autovacuum. Major outage.
[1] https://web.archive.org/web/20190320162510/https://www.joyen...
Edit: this was a while ago with v9.2, but I don’t know if any of this behavior has changed.
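The queueing effect described above can be sketched with a toy model in Python. This is a deliberately simplified illustration with only two lock modes and a strict FIFO wait queue; Postgres has more lock modes and a more nuanced lock manager, and the `LockQueue` class and its method names are hypothetical.

```python
from collections import deque

SHARED, EXCLUSIVE = "shared", "exclusive"


def conflicts(a: str, b: str) -> bool:
    """Two requests conflict unless both are shared."""
    return a == EXCLUSIVE or b == EXCLUSIVE


class LockQueue:
    """Toy lock on one table: holders plus a FIFO queue of waiters."""

    def __init__(self):
        self.held = []          # (name, mode) pairs currently granted
        self.waiting = deque()  # (name, mode) pairs waiting, in arrival order

    def request(self, name: str, mode: str) -> bool:
        """Grant immediately if nothing held or already waiting conflicts."""
        blocked = any(conflicts(mode, m) for _, m in self.held) or any(
            conflicts(mode, m) for _, m in self.waiting
        )
        if blocked:
            self.waiting.append((name, mode))
            return False
        self.held.append((name, mode))
        return True

    def release(self, name: str) -> None:
        """Drop a holder, then grant waiters in order until one conflicts."""
        self.held = [(n, m) for n, m in self.held if n != name]
        while self.waiting:
            _, mode = self.waiting[0]
            if any(conflicts(mode, m) for _, m in self.held):
                break
            self.held.append(self.waiting.popleft())


q = LockQueue()
q.request("autovacuum", SHARED)   # long-running shared holder
q.request("ddl", EXCLUSIVE)       # brief operation, but it has to wait
q.request("reader", SHARED)       # queues behind the waiting exclusive request
```

In this model the reader is compatible with the autovacuum's lock, yet it still waits because an exclusive request is ahead of it in the queue, which is the shape of the outage described.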

by dgreensp

Upvoted because educational, despite the AI-ness and clickbait.
I’ve worked at orgs that used Postgres in production, but I’ve never been the one responsible for tuning/maintenance. I never knew that Postgres doesn’t merge pages or have a minimum page occupancy. I would have thought it’s not technically a B-tree if it doesn’t.

by JohnCClarke

I'm (genuinely) curious about the overwhelming preference for PostgreSQL on HN. I've always used MySQL for OLTP, and been very happy with it.
If you've seriously considered both and then selected PostgreSQL please comment and tell me what drove that decision.
Note: I'm only talking about OLTP. I do see that PostgreSQL adds a lot for OLAP.

by throwaway613745

Don't forget to ANALYZE your tables sometimes too.
Just recently I was trying to optimize a 12s index scan, and it turned out I didn't need to change anything about the query; I just had to update the table statistics. 12s down to 100ms just from running ANALYZE (no vacuum needed).

by apothegm

Good article, and easy to follow. I learned more than I’d expected from this one.

by horizion2025

"When VACUUM runs, it removes those dead tuples and compacts the remaining rows within each page. If an entire page becomes empty, PostgreSQL can reclaim it entirely."
Is this true? I was of the belief that standard vacuum doesn't move any data, even within a page... It merely enables the space held by dead tuples to be reused in the future. But I could be mistaken.

by gethly

Ah, the beloved PGSQL and its decades-old, obsolete MVCC model with all its goodies :)

by supermatt

> When VACUUM runs, it removes those dead tuples and compacts the remaining rows within each page.
No it doesn’t. It just removes unused line pointers and marks the space as free in the FSM.
