The Goal — Correlation Ids in Logs
We collect logs throughout our applications, they are sent to an HTTP endpoint at Sumologic, meaning we can browse and search through all these logs in the same place.
This is great for exception messages, and anything where a single log message gives us all the information we need, however for other cases it would be great to identify all the logs related to a single operation, such as serving a page to a user.
In order to do this, a common practice is to associate a unique identifier with each operation and then include that in every log message. This is a “correlation id”, and so when I see an interesting log message I can then re-query the logs with the correlation id from that message, in order to see any other messages that were logged as part of that operation.
Initial Introduction of Correlation Ids
We’ve recently added correlation ids to our back-end microservices’ loggers. For those services, which are running as Azure Functions, we initially wrote our own logging class to POST messages to Sumo, so it was relatively easy to inject a per-request scoped service into our IoC container which would generate the correlation id and inject that into the (transient) loggers injected into each resolved service instance, so that every log written by the operation would use the same correlation id.
We’ve also added correlation ids to a few of the existing website pages. Since IoC isn’t widely used in that legacy codebase, it’s been a piecemeal process but we do send the correlation id in requests to the back-end microservices, which then log the id, meaning that with a “website” correlation id I can see all the website logs plus one result from the microservice, which in turn gives me the “microservice” correlation id which I can use to see all its logs for that request.
Finally, the website correlation id is written into the page response header x-eso-request-id and also appears on the page that we redirect users to in case of an error — this page includes a link to send an email to support, and that email includes the correlation id in the subject line.
How Does This Help?
For all our microservice logs, we can see each request in isolation now in the logs. This is way more helpful that just a sea of log messages, and it means we can add additional informational or debugging logs to the system and know that we’ll be able to read those logs back in context rather than in isolation — this makes our logs way more useful in themselves.
Additionally, by writing the id into the response header for the page it’s possible for a developer or support engineer to relate a specific page served by the system, which may be behaving in unexpected ways, to the logs for that request. This provides valuable data and saves a lot of time investigating issues because we can really be sure about what the system is doing rather than guessing based on what logs were written around the same time as a problem request.
Finally, by writing the id into the error page and error email, we help the support team to deliver better service to our users, by being able to precisely identify and diagnose the source of a page error from the logs, rather than having to have support troubleshoot interactively with users.
Limitations
Because the correlation id is injected manually into the messages, including correlation ids with all our page requests in the website would be a large undertaking. It would be better if there were a way to configure our logging framework (we’re using log4net right now) to include the correlation id automatically. If we could do that, all our logs would have correlation ids in and we wouldn’t need to make changes to individual pages in order to achieve that.
On the back-end microservices, we’ve only got the benefit of correlation ids across the board by “rolling our own” logger. This isn’t great, as although the logger works fine it’s all boilerplate functionality that I’d prefer we don’t own and support, and we obviously lack (or would have to build for ourselves) all the good stuff like fine-grained configuration of log levels that log4net gives you out of the box.
Research
Log4net is pretty extensible, and I was already familiar with the concept of custom appenders, from previous work in this area. A good example that we’re using right now is the appender that Sumologic provide for log4net — https://github.com/SumoLogic/sumologic-net-appenders/blob/master/SumoLogic.Logging.Log4Net/BufferedSumoLogicAppender.cs
From talking to colleagues, I’d also learned about the concept of “enrichment” and the suggestion that this is a new, killer feature of Serilog — see https://blog.elmah.io/serilog-vs-log4net/ . I’m not averse to changing our logging framework over but it would be a bit of work, and log4net has its own advantages, so I wanted to understand if enrichment was also possible in log4net.
I’d also noticed in the standard log4net configs I’ve been using, the Layout element (which is nested within the appender) includes a type attribute, suggesting that it’s also extensible:
I played around a bit with custom layouts, and was able to create a custom layout using an AsyncLocal<string> CorrelationId to generate the correlation id uniquely per request:
This actually works really well, but by replacing the original PatternLayout with my custom layout, I’ve lost control of the layout pattern — it’s hard-coded into the class so it’s an adequate solution but not completely ideal.
In the pattern layout, the conversion pattern is defined by placeholders like this:
So, I wondered if log4net provided an ability to add a custom pattern, so that I could just add %correlationId to the conversion pattern. It turns out this is indeed possible (see https://stackoverflow.com/questions/3853644/log4net-how-to-add-simple-indexer-row-number-to-each-log-line/3854272#3854272 ), by adding a converter element to the layout:
This means that rather than a custom layout, I can use a custom converter and continue to use the PatternLayout in my logging configs, and in fact that same converter can be used flexibly with other appenders as needed. There are some good examples of how to do this here: https://hendrikbulens.com/2016/01/14/custom-properties-in-log4net-logging/
All of this means that I all I need to do to add correlation ids to my log messages is to include the converter element and then add %correlationId to the existing conversion pattern. This is a perfect result because it’s minimally invasive to any of my existing code and requires me to make no changes to how I set up and configure my loggers beyond updating the XML. I can now add correlation ids to all my logs, even the legacy website pages, without needing make any code changes beyond adding the converter class to my codebase:
AsyncLocal<T> is a good solution for my needs because the context is retained across async resumptions onto different thread, but other methods of context tracking are available and could easily be replaced into the above code.
Conclusion
Correlation Ids are hugely useful tools for develops and support engineers to get value out of logging and monitoring tools, and also to add confidence and certainty to interactions with customers.
Passing a correlation id around an application and including it in every log message is likely to be impractical and unreliable, so a means of injecting it directly into every log message by configuration is a much better solution.
Log4net custom converters offer a simple solution for adding a new placeholder to the log message pattern, meaning that a correlation id can be configured into every message the system logs.