I am usually amused by the way really competent people judge other's context.
This post assumes understanding of:
- emacs (what it is, and terminology like buffers)
- strace
- linux directories and "everything is a file"
- environment variables
- grep and similar
- what git is
- the fact that 'git whatever' works to run a custom script if git-whatever exists in the path (this one was a TIL for me!)
- irc
- CVEs
- dynamic loaders
- file priviledges
but then feels important to explain to the audience that:
>A socket is a facility that enables interprocess communication
Years ago I worked on contract for a large blue 3 letter company doing outsourced server management for the fancy credit card company. The incident in question happened before my time on the team but I heard about it first hand from the server admin (let's call him Ben) who had been at the center of it.
The data center in question was (IIRC) 160K sqft of raised floor spread across multiple floors in a major metropolitan downtown area. It isn't there anymore. Windows, Unix, Linux, mainframe, San, all the associated fun stuff.
Ben was working the day after thanksgiving decommissioning a system. Full software and physical decommission. Approved through all the proper change management procedures.
As part of the decommission Ben removed the network cables from under raised floor. Standard snip the connector off and pull it back. Easy. Little did he know that network cable was ever so slightly entangled with another cable. Not enough to give him pause when pulling it though. It wouldn't have been an issue if the other cable had been properly latched in its ports. It wasn't. That little pull ended up pulling the network connection out of a completely unrelated system. A system managed by a completely different group. A system responsible for credit card processing. On USA Black Friday.
Oops. CC processing went down. It took far too long to resolve. Amazingly Ben didnt loose his job. After all he followed all the processes and procedures. Kudos to the management team who kept him protected.
Change management and change freezes were far more stringent by the time I joined the team. There was also now a raised floor infrastructure group and no one pulled a tile without their involvement.
Be careful what you tug on!
I wonder what could be done to make this type of problem less hidden and easier to diagnose.
The one thing that comes to mind is to have the loader fail fast. For security reasons, the loader needs to ensure TMPDIR isn't set. Right now it accomplishes this by un-setting TMPDIR, which leads to silent failures. Instead, it could check if TMPDIR is set, and if so, give a fatal error.
This would force you to unset TMPDIR yourself before you run a privileged program, which would be tedious, but at least you'd know it was happening because you'd be the one doing it.
(To be clear, I'm not proposing actually doing this. It would break compatibility. It's just interesting to think about alternative designs.)
Also, “direct” link: https://blog.plover.com/tech/tmpdir.html (This doesn't really matter, as the posted link is to https://blog.plover.com/2016/07/01/#tmpdir i.e. the blog post named “tmpdir” posted on 2016-07-01 and there is only post posted on that date, so the content of the page is basically the same.)
I was pleased that it was more interesting than that, and I want people to write more twitchy-detail-post-mortems like this :-)
You have your terminal window and your .bashrc (or equivalent), and that sets a bunch of environment variables. But your GUI runs with, most likely, different environment variables. It sucks.
And here’s my controversial take on things—the “correct” resolution is to reify some higher-level concept of environment. Each process should not have its own separate copy of environment variables. Some things should be handled… ugh, I hate to say it… through RPC to some centralized system like systemd.