This is a nontrivial problem when using properly modularized code and libraries that perform logging. A library can't tell whether its operational error is also a program-level error (that can depend on usage context), but it still wants to log the operational error itself, to provide the details that aren't accessible to higher-level code. That lower-level logging has to choose some severity.
Should only “top-level” code ever log an error? That can make it difficult to identify the low-level root causes of a top-level failure. It also can hamper modularization, because it means you can’t repackage one program’s high-level code as a library for use by other programs, without somehow factoring out the logging code again.
- Critical / Fatal: Unrecoverable without human intervention; someone needs to get out of bed, now.
- Error: Recoverable without human intervention, but not without data/state loss. Must be fixed ASAP. An assumption didn't hold.
- Warning: Recoverable without intervention. Must have an issue created and prioritised. (If business as usual, this could be downgraded to INFO.)
The main difference between error and warning, therefore, is "We didn't think this could happen" vs. "We thought this might happen". So, for example, a failure to parse JSON might be an error if you're responsible for generating that serialisation, but might be a warning if you're not.
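A minimal sketch of that distinction in Python (the flag and the function names are invented for illustration):

    import json
    import logging

    logger = logging.getLogger(__name__)

    def load_payload(raw: str, we_serialised_it: bool):
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            if we_serialised_it:
                # We generate this serialisation ourselves, so a parse failure
                # means an assumption didn't hold: "we didn't think this could happen".
                logger.error("failed to parse our own payload: %s", exc)
            else:
                # Someone else's input being malformed is anticipated:
                # "we thought this might happen".
                logger.warning("failed to parse external payload: %s", exc)
            return None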
Maybe or maybe not. If the connection problem is really due to the remote host, then that's not the sender's problem. But maybe the local network interface is down, maybe there's a local firewall rule blocking it, and so on.
If you know the deployment scenario, then you can make reasonable decisions about logging levels, but quite often code is generic and can be deployed in multiple configurations, so that's hard to do.
https://docs.openstack.org/oslo.log/latest/user/guidelines.h...
FWIW, "ERROR: An error has occurred and an administrator should research the event." (vs WARNING: Indicates that there might be a systemic issue; potential predictive failure notice.)
But it is still an error condition, i.e. something does need to be fixed: either something about the connection string (i.e. in the local system) is wrong, or something in the other system, or somewhere between the two, is wrong and therefore needs to be fixed. Either way, someone on this end reading the logs (true, it might not be the developers of the SMTP mailer) needs to get involved, even if it is just to reach out to the third party and ask them to fix it on their end.
To me it's mad that a condition which fundamentally prevents a piece of software from working is not considered an error.
So the library you are using fires too many debug messages? You know that you can always turn those off by ignoring specific sources, e.g. by ignoring namespaces? So what exactly do you lose? Right. Almost nothing.
As for my code and libraries, I always tend to do both: log the error and then throw an exception. So I am on the safe side both ways. If the consumer doesn't log the exception, then at least my code does. And I give them the chance to do logging their way and ignore mine. I am making a best guess for you, thinking to myself: what would be an error if I were using the library myself?
You don’t trust me? Log it the way you need to log it, my exception is going to transport all relevant data to you.
This has saved me so many times when getting bug reports from developers and customers alike.
There are duplicate error logs? Simply turn my logging off and use your own. Problem solved.
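A minimal sketch of that arrangement, assuming Python's logging module (the library name and exception type are invented):

    import logging

    logger = logging.getLogger("mylib")  # hypothetical library logger

    class MyLibError(Exception):
        """Transports all relevant data so the consumer can log it their way."""

    def do_work(path: str) -> None:
        try:
            open(path).close()
        except OSError as exc:
            # My best-guess logging, in case the consumer ignores the exception...
            logger.error("could not open %s: %s", path, exc)
            # ...and their chance to handle and log it however they need.
            raise MyLibError(f"could not open {path}") from exc

    # A consumer who sees duplicate error logs can simply turn mine off:
    logging.getLogger("mylib").setLevel(logging.CRITICAL + 1)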
If it is a program-level error, maybe logging a warning and returning the error is the correct thing to do. Maybe it's not? It depends on the context.
And this basically is the answer to any software design question: It depends.
A connection timed out, retrying in 30 secs? That's a warning. Gave up connecting after 5 failed attempts? Now that's an error.
I don't care so much if the origin of the error is within the program, or the system, or the network. If I can't get what I'm asking for, it can't be a mere warning.
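A sketch of that split, with an invented connect() callable and Python logging:

    import logging
    import time

    logger = logging.getLogger(__name__)

    def connect_with_retries(connect, attempts: int = 5, delay: float = 30.0):
        for attempt in range(1, attempts + 1):
            try:
                return connect()
            except TimeoutError:
                # Still recoverable on our own: a warning.
                logger.warning("connection timed out (attempt %d/%d), retrying in %.0fs",
                               attempt, attempts, delay)
                if attempt < attempts:
                    time.sleep(delay)
        # We can't get what we were asked for: now it's an error.
        logger.error("gave up connecting after %d failed attempts", attempts)
        raise ConnectionError(f"gave up after {attempts} attempts")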
A warning can be ignored safely. Warnings may be 'debugging enabled, results cannot be certified' or something similar.
An error should not be ignored, an operation is failing, data loss may be occurring, etc.
Some users may be okay with that data loss or failing operation. Maybe it isn't important to them. If the program continues and does not error in the parts that matter to the user, then they can ignore it, but it is still, objectively, an error occurring.
A fatal message cannot be ignored: the system has crashed. It's the last thing you see before shutdown is attempted.
Warning, in contrast, is what I use for a condition that the developer predicted and handled but probably indicates the larger context is bad, like "this query arrived from a trusted source but had a configuration so invalid we had to drop it on the floor, or we assumed a default that allowed us to resolve the query but that was a massive assumption and you really should change the source data to be explicit." Warning is also where I put things like "a trusted source is calling a deprecated API, and the deprecation notification has been up long enough that they really should know better by now."
Where all of this matters is process. Errors trigger pages. Warnings get bundled up into a daily report that on-call is responsible for following up on, sometimes by filing tickets to correct trusted sources and sometimes by reaching out to owners of trusted sources and saying "Hey, let's synchronize on your team's plan to stop using that API we declared is going away 9 months ago."
Do I need to act on it right away?

- If yes: ERROR
- If I want to check it tomorrow: WARNING
- If it's useful for debugging: INFO
- Everything else: DEBUG
The problem with the article's approach is that libraries don't have enough context. A timeout calling an external API might be totally fine if you're retrying, but it's an ERROR if you've exhausted retries and failed the user's request.
We solve this by having libraries emit structured events with severity hints, then the application layer decides the final log level based on business impact. A 500 from a recommendation service? Warning. A 500 from the payment processor? Error.
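A sketch of that division of labour; the event shape, the severity_hint field, and the payments policy are all assumptions:

    import logging

    logger = logging.getLogger("app")

    # Application layer: turn a library's severity *hint* into a final level
    # based on business impact.
    BUSINESS_CRITICAL = {"payments"}  # assumed policy, not part of any library

    def emit(event: dict) -> None:
        if event.get("severity_hint") == "error" or event["service"] in BUSINESS_CRITICAL:
            logger.error("%(service)s failed: %(detail)s", event)
        else:
            logger.warning("%(service)s degraded: %(detail)s", event)

    # Libraries just report what they saw, with a hint:
    emit({"service": "recommendations", "detail": "HTTP 500", "severity_hint": "warning"})
    emit({"service": "payments", "detail": "HTTP 500", "severity_hint": "warning"})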
This post frames the problem almost entirely from a sysadmin-as-log-consumer perspective, and concludes that a correctly functioning system shouldn’t emit error logs at all. That only holds if sysadmins are the only "someone" who can act.
In practice, if there is a human who needs to take action - whether that’s a developer fixing a bug, an infra issue, or coordinating with an external dependency - then it’s an error. The solution isn’t to downgrade severity, but to route and notify the right owner.
Severity should encode actionability, not just system correctness.
So it's natural for error messages to be expected, as you progressively add and then clear up edge cases.
Did it do what it was supposed to do, but in a different way, or defer it to retry later? Then WARN.
Did it fail to do what it needed to do? ERROR.
Did it do what it needed to do in the normal way, because it was totally recoverable? INFO.
Did data get destroyed in the process? FATAL.
It should be about what the result was, not who will fix it or how. Because that might change over time.
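As a sketch, that's a mapping from outcome to level, with no mention of who fixes it (the Outcome enum is invented for illustration):

    import logging
    from enum import Enum, auto

    logger = logging.getLogger(__name__)

    class Outcome(Enum):
        DONE_NORMALLY = auto()     # recovered fully, business as usual
        DONE_DIFFERENTLY = auto()  # fell back, or deferred for a later retry
        FAILED = auto()            # didn't do what it needed to do
        DATA_DESTROYED = auto()    # data was lost in the process

    LEVELS = {
        Outcome.DONE_NORMALLY: logging.INFO,
        Outcome.DONE_DIFFERENTLY: logging.WARNING,
        Outcome.FAILED: logging.ERROR,
        Outcome.DATA_DESTROYED: logging.FATAL,  # stdlib alias for CRITICAL
    }

    def report(outcome: Outcome, msg: str) -> None:
        logger.log(LEVELS[outcome], msg)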
Even if your libraries use nothing but exceptions or return codes you still end up with levels. You still end up with logs that have information in them that gets ignored when it shouldn't be because there's so much noise that people get tired of all the "cries of wolf."
Occasionally one is at a high enough level to know for sure that something needs fixing and for this I use "CRITICAL" which is my code for "absolutely sure that you can't ignore this."
IMO it's about time AI was looking at the logs to find out if there was something we really need to be alerted to.
    if error == NULL and operationFailed then
        log error
    otherwise
        let the client side do the error handling (in terms of logging)
My company now has a log aggregator that scans the logs for errors; when it finds one, it creates a Trello card, uses Opus to fix the issue, and then proposes a PR against the card. These then get reviewed, finished off if tweaks are necessary, and merged if appropriate.
Obviously this depends on teams, application context, and code bases. But "knowing if action needs to be taken" can't be boiled down to a simple log level in most cases.
There is a reason most alerting software like PagerDuty is just a trigger interface, and the logic for what constitutes the "error" is typically some data-level query in something like Datadog, Sumo Logic, Elasticsearch, or Grafana that either looks for specific string messages, error types, or a collection of metric conditions.
Cool if you want to consider that any error-level log needs to be an actionable error, but what quickly happens is that some error cases are auto-retryable due to infrastructure conditions that the application has no knowledge of at all. And running some sort of infrastructure query at error-write time in code, e.g.:
1. An error is thrown.
2. Prior to logging, guess/determine whether the case can be retried, via a few HTTP calls.
3. Log either a warning or an error.
That seems to be a complete waste when we could just write some sort of query in our log/metrics management platform of choice, one which takes the infrastructure conditions into account for us.
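For illustration, an Elasticsearch-style query along those lines; the level field and the "retryable" tag are assumptions about how the application labels its logs:

    # Alert on ERROR logs, minus the cases infrastructure made retryable,
    # computed in the log platform instead of at error-write time.
    query = {
        "bool": {
            "must": [{"match": {"level": "ERROR"}}],
            "must_not": [{"match": {"tags": "retryable"}}],
        }
    }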
I think discussions that argue over a specific approach are a form of playing checkers.
In an ideal world things like logs and alarms (alerting product support staff) should certainly cleanly separate things that are just informative, useful for the developer, and things that require some human intervention.
If you don't do this then it's like "the boy who cried wolf", and people will learn to ignore errors and alarms, since you've trained them to understand that usually no action is needed. It's also useful to be able to grep through log files and distinguish failures of different categories, not just grep for specific failures.
You’re kind of telling a story to future potential trouble-shooters.
When you don’t think about it at all (it doesn’t take much), you tend to log too much and too little and at the wrong level.
But this article isn't right either. Lower-level components typically don't have the context to know whether a particular fault requires action or not. And since systems are complex, with many levels of abstraction and many boxes that things live in, not much is actually in a position to know this, even to a standard of "probably". Consider:
* Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)
* ISE in downstream service (return HTTP 5xx and increment a metric but don’t emit an error log)
* Network error
* Downstream service overloaded
* Invalid request
Basically, when you make a request to another service and get back a status code, your handler should look like:
logfunc = logger.error if 400 <= status <= 499 and status != 429 else logger.warning
(Unless you have an SLO with the service about how often you're allowed to hit it, and they only send 429 when you're over, which is how it's supposed to work but is sadly rare.)
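Spelled out slightly, under the same policy (assuming this only runs for non-2xx responses; the names are illustrative):

    import logging

    logger = logging.getLogger(__name__)

    def log_failed_response(status: int, url: str) -> None:
        if 400 <= status <= 499 and status != 429:
            # A 4xx other than 429 means we sent a bad request: our bug, our page.
            logger.error("bad request to %s: HTTP %d", url, status)
        else:
            # 429 and 5xx are the other side's capacity or failure: warn,
            # bump a metric, and let alerting judge the overall rate.
            logger.warning("upstream trouble from %s: HTTP %d", url, status)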
Does it? Don't most stacks have an additional level of triaging logs to detect anomalies, etc.? It can be your New Relic/Datadog/Sentry or a self-made filtering system, but nowadays I'd assume the base log levels are only a rough estimate of whether a single event has any chance of being problematic.
I'd bet the author also has strong opinions about http error codes, and while I empathize, those ships have long sailed.
A mail program not being able to *checks notes* send emails sounds like an error to me. (Unless you implement retries.)
Bold of you to assume that there are system administrators. All too often these days it's "devops" aka some devs you taught how to write k8s yamls.
e.g. log level WARN, message "This error is...", but it then trips an error in monitoring and pages out.
Probably breaching multiple rules here around not parsing logs like that, etc. But it's cropped up so many times I get quite annoyed by it.
error_msg = "xyz went wrong"
log.warn(error_msg)
My comment on the CR was about this being an inherent contradiction, and incredibly confusing when trying to know whether it's actually an error or a warning. That said, the thing I've come to find useful as a subcategory of error is errors due to data problems vs. errors due to other issues.
Not everything that a library considers an error is an application error. If you log an error, something is absolutely wrong and requires attention. If you consider such a log as "possibly wrong", it should be a warning instead.
How do you know?
I could live with 4 (a rough wiring sketch follows the list):
Error - alert me now.
Warning - examine these later.
Info - important context for investigations.
Debug - usually off in prod.
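A rough wiring of those four levels with Python's logging handlers (the sinks are stand-ins for real alerting and storage):

    import logging

    logger = logging.getLogger("app")
    logger.setLevel(logging.DEBUG)  # debug exists, but is usually off in prod sinks

    # Error and above: goes to whatever alerts me now (stderr as a stand-in).
    alert = logging.StreamHandler()
    alert.setLevel(logging.ERROR)

    # Info and above: the file I examine later and mine for context
    # during investigations.
    review = logging.FileHandler("app.log")
    review.setLevel(logging.INFO)

    logger.addHandler(alert)
    logger.addHandler(review)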
“External service down, not my problem, nothing I can do” is hardly ever the case - e.g. you may need to switch to a backup provider, initiate a support call, or at least try to figure out why it’s down and for how long.
That says it all:
- Backseat driving
- Not a developer by trade