On log levels

Logs are an important part of observability. When something goes wrong, logs help with investigating and understanding what happened.

One of the most common problems is inconsistent use of log levels. If that’s the case, filtering by log level is almost useless and logs become hard to read.

Defining and documenting log levels helps maintain consistency. Everyone working on the project should be able to find the definitions one way or another. In Lighthouse, for example, there’s a logUtils.ts file which includes the documentation at the top and defines the log level within each respective log function.
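
A minimal sketch of what such a file can look like — this is illustrative, not Lighthouse’s actual logUtils.ts; the function names and output format are assumptions:

```typescript
// logUtils.ts — illustrative sketch, not Lighthouse's actual file.
//
// Log levels:
//   error — unexpected things that are not recoverable
//   warn  — unexpected things that are recoverable
//   info  — high-level overview of what happens in the system

type Level = "error" | "warn" | "info";

// Pure formatter, kept separate so the output is easy to test.
export function formatLine(level: Level, message: string): string {
  return `[${level}] ${message}`;
}

export function logError(message: string): void {
  // error: unexpected things that are not recoverable
  console.error(formatLine("error", message));
}

export function logWarn(message: string): void {
  // warn: unexpected things that are recoverable
  console.warn(formatLine("warn", message));
}

export function logInfo(message: string): void {
  // info: high-level overview of what happens in the system
  console.info(formatLine("info", message));
}
```

The point is that the definitions live next to the functions, so whoever calls logWarn sees what “warn” is supposed to mean.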

Log levels

There are different widespread definitions for log levels. For example, RFC 5424 defines eight severities, numbered 0–7 in order of decreasing urgency.

    Numerical Code   Severity

    0                Emergency: system is unusable
    1                Alert: action must be taken immediately
    2                Critical: critical conditions
    3                Error: error conditions
    4                Warning: warning conditions
    5                Notice: normal but significant condition
    6                Informational: informational messages
    7                Debug: debug-level messages

    Table 2. Syslog Message Severities

npm defines log levels as "silent", "error", "warn", "notice", "http", "timing", "info", "verbose", "silly".

In JavaScript’s console API, available levels include error, warn, info, debug, and trace.
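
These map directly onto methods on the console object:

```typescript
// The console API's level methods, from most to least severe.
console.error("request failed");      // error
console.warn("retrying request");     // warn
console.info("server started");       // info
console.debug("cache hit for key");   // debug
console.trace("entering handler");    // trace (also prints a stack trace)
```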

There’s not one right system. The most obvious difference is the number of log levels. RFC 5424 defines 8 levels, npm has 9, other projects use more, and some use fewer.

As a general rule of thumb: the more complex an organization’s DevOps setup, and the more integrations with other systems exist, the more log levels are needed.

For example, if logs integrate with a paging system, and engineers are paged for errors, they should only be paged for errors that need immediate attention. Logging non-critical errors should still be possible, allowing engineers to review them later. In that case, multiple error levels make sense.
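
A sketch of that setup — the page function and severity names here are hypothetical, assuming a paging integration exists:

```typescript
type Severity = "critical" | "error" | "warn" | "info";

// Hypothetical paging hook; in practice this would call an
// incident-management service's API.
function page(message: string): void {
  console.log(`PAGE: ${message}`);
}

// Only "critical" pages someone immediately; plain "error" is
// still logged and can be reviewed later.
function shouldPage(severity: Severity): boolean {
  return severity === "critical";
}

function log(severity: Severity, message: string): void {
  console.log(`[${severity}] ${message}`);
  if (shouldPage(severity)) {
    page(message);
  }
}

log("error", "payment retry failed");        // logged, no page
log("critical", "payment provider is down"); // logged and paged
```

Splitting “critical” from “error” keeps the pager quiet without forcing engineers to downgrade real errors to warnings.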

Defining log levels

Regardless of the number of levels used, documenting the purpose and appropriate usage of each level is essential for consistency. The following are the definitions I use for Lighthouse:

  • Error

    • unexpected things that are not recoverable
  • Warning

    • unexpected things that are recoverable
  • Info

    • a high-level overview of what happens in the system
    • it should be possible to read info logs without becoming overwhelmed
    • engineers, even if they’re unfamiliar with the code, should understand what’s going on
  • Debug

    • significant changes made in the system
    • e.g. database updates, an important new value computed and set on an object
  • Trace

    • general information about code execution
    • e.g. function start, value returned, specific code branch visited

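One way to sketch these five levels in code is as an ordered list, so logs can be filtered by a minimum level (the names match the list above; the filtering logic is illustrative):

```typescript
// The five levels above, from most to least severe.
const LEVELS = ["error", "warning", "info", "debug", "trace"] as const;
type Level = (typeof LEVELS)[number];

// A line is emitted when its level is at least as severe as minLevel.
function isEnabled(level: Level, minLevel: Level): boolean {
  return LEVELS.indexOf(level) <= LEVELS.indexOf(minLevel);
}

function emit(level: Level, minLevel: Level, message: string): void {
  if (isEnabled(level, minLevel)) {
    console.log(`[${level}] ${message}`);
  }
}

emit("warning", "info", "retrying request"); // emitted
emit("trace", "info", "entering handler");   // filtered out
```
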
In my experience, error and warning are quite intuitive. The difference between info, debug, and trace is less so, and must be more clearly defined.

Other considerations

Defining log levels is a start, but does not automatically lead to good logs.

It’s important to keep log levels consistent. One unexpected event sometimes seems less important than another, and there’s the temptation to use info instead of warning. Don’t. It muddies the waters and makes filtering more difficult later on.

Instead, it’s good to ask if it should be logged at all. Every log line should provide value. If it doesn’t, it should be removed. Just because a log line fits any of the definitions above doesn’t mean it should be added.

I find that imagining reading through the final log helps determine whether a log line adds value: if I saw this line, would it help me understand what happened?

The same principle applies to data. I use structured logging (log lines written as JSON objects), so data is part of every log line. The log aggregator I use shows the message of each log line in the overview, and its data in the details. Still, adding data just so it’s there is unnecessary. It increases load on the log aggregator and might obscure useful data.
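
A minimal sketch of such a structured log line — the field names here are illustrative, not a particular aggregator’s schema:

```typescript
// Each log line is a single JSON object: a level, a human-readable
// message, and only the structured data relevant to the event.
function structuredLine(
  level: string,
  message: string,
  data: Record<string, unknown> = {}
): string {
  return JSON.stringify({ level, message, ...data });
}

console.log(structuredLine("info", "user created", { userId: 42 }));
```

Because the line is a JSON object, the aggregator can show the message in the overview and the data fields in the details.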

Final words

As a junior engineer, I didn’t yet understand the value of logs. As I became more experienced, worked with larger systems, and bugs became more complicated, my appreciation grew.

Back then, bugs were obvious and I could always attach a debugger and reproduce them easily. With more complicated bugs, just finding out what happened and how it was caused is 90% of the work.

Good logs help, a lot.