While I agree with some of it, I feel like there's a big gotcha here that isn't addressed.
Having a single wide event at the end of a request means that if something unexpected happens in the middle (a stack overflow, some bug that throws an error that bypasses your logging system, a lambda timing out, etc.) you don't get any visibility into what happened.
You also most likely lose out on a lot of logging frameworks your language has that your dependencies might use.
I would say this is a good layer to put on top of your regular logs. Make sure you have a request/session wide id and aggregate all those in your clickhouse or whatever into a single "log".
The way I have solved for this in my own framework in PHP is by having a Logging class with the following interface:
  interface LoggerInterface {
      // calls $this->system(LEVEL_ERROR, ...);
      public function exception(Throwable $e): void;
      // Typical system logs
      public function system(string $level, string $message, ?string $category = null, mixed ...$extra): void;
      // User specific logs that can be seen in the user's "my history"
      public function log(string $event, int|string|null $user_id = null, ?string $category = null, ?string $message = null, mixed ...$extra): void;
  }
I also have a global exception handler that is registered at application bootstrap time that takes any exception that happens and runs $logger->exception($e);
There is obviously a tiny bit more of boilerplating to this thing, but it works so well that I can't live without it anymore.
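To make the "regular logs plus a request-wide id, aggregated into a single log" idea above concrete, here is a minimal Python sketch. The record shape (`request_id`, `level`, `message`) and the function name are assumptions made for illustration, not the commenter's framework; in practice the grouping would more likely be a query in the log store.

```python
from collections import defaultdict

def aggregate_by_request(lines):
    """Fold ordinary log records into one combined record per request id."""
    wide = defaultdict(lambda: {"messages": [], "levels": set()})
    for rec in lines:
        event = wide[rec["request_id"]]
        event["messages"].append(rec["message"])
        event["levels"].add(rec["level"])
        # Copy any extra context (user id, status, ...) onto the combined record.
        for key, value in rec.items():
            if key not in ("request_id", "message", "level"):
                event[key] = value
    return dict(wide)

# Hypothetical records; a crash mid-request would still leave the earlier lines behind.
records = [
    {"request_id": "r1", "level": "info", "message": "request started", "user_id": 42},
    {"request_id": "r2", "level": "info", "message": "request started"},
    {"request_id": "r1", "level": "error", "message": "payment failed", "status": 502},
]
combined = aggregate_by_request(records)
```

Because every line is written as it happens, a request that dies halfway still leaves a partial record, which is the gotcha the comment raises against emitting only one event at the end.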
That was difficult to read and smelt very AI-assisted. The message was worthwhile, but it could've been shorter and more to the point.
A few things I've been thinking about recently:
- we have authentication everywhere in our stack, so I've started including the user id on every log line. This makes getting a holistic view of what a user experienced much easier.
- logging an error as a separate log line to the request log is a pain. You can filter for the trace, but it makes it hard to surface "show me all the logs for 5xx requests and the error associated" - it's doable, but it's more difficult than filtering on the status code of the request log
- it's not enough to just start including that context, you have to educate your coworkers that it's now present. I've seen people making life hard for themselves because they didn't realize we'd added this context
On the other hand, investing in better tracing tools unlocks a whole nother level of logging and debugging capabilities that aren't feasible with just request logs. It's kind of like you mentioned with using the user id as a "trace" in your first message but on steroids.
These tools tend to be very expensive in my experience unless you are running your own monitoring cloud. Either you end up sampling traces at low rates to save on costs, or your observability bill is more than your infrastructure bill.
Doing stuff like turning on tracing for clients that saw errors in the last 2 minutes, or for requests that were retried should only gather a small portion of your data. Maybe you can include other sessions/requests at random if you want to have a baseline to compare against.
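A rough sketch of that kind of trigger-based tracing decision, in Python. The 2-minute window comes from the comment; the 1% baseline rate, the in-memory dictionary, and all names are assumptions for illustration only.

```python
import random
import time

ERROR_WINDOW_SECONDS = 120  # "saw errors in the last 2 minutes"
BASELINE_RATE = 0.01        # assumed share of healthy traffic kept as a baseline

last_error_at = {}  # client id -> timestamp of that client's most recent error

def record_error(client_id, now=None):
    """Remember when a client last saw an error."""
    last_error_at[client_id] = time.time() if now is None else now

def should_trace(client_id, is_retry, now=None, rng=random.random):
    """Trace recently failing clients, retried requests, and a small random baseline."""
    now = time.time() if now is None else now
    seen = last_error_at.get(client_id)
    if seen is not None and now - seen <= ERROR_WINDOW_SECONDS:
        return True
    if is_retry:
        return True
    return rng() < BASELINE_RATE
```

The `now` and `rng` parameters exist only so the decision can be exercised deterministically; a real deployment would also need to expire old entries from `last_error_at`.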
We do have both a span id and trace id, but I personally find this more cumbersome than filtering on a user id. YMMV: if you're interested in a single trace then you'd filter for that, but I find you often also care what happened "around" a trace.
I have seen pushback on this kind of behavior because "users don't like error codes" or other such nonsense. UX and Product like to pretend nothing will ever break, and when it does they want some funny little image, not useful output.
A good compromise is to log whenever a user would see the error code, and treat those events with very high priority.
I hope registering an entire domain name for a blog post doesn't become a trend. I like linking to things that are likely to last a long time - a personal blog is one thing, but expecting people to keep paying the renewal fee every year for a single article feels less likely to me.
A good alternative here is subdomains, since those don't have an additional annual fee. https://logging-sucks.boristane.com/ could work well here.
Because of the nature of how software is built and deployed nowadays, it’s generally not possible to write single log entries that tell the “whole story” of “what happened”.
I could write about this for hours, but instead I’ll just discuss two concepts that you need in modern logging: vertical correlation and horizontal correlation.
Within a system, requests tend to go “up” and “down” stacks of software. It is very useful in these scenarios to have “vertical correlation” fields shared between adjacent layers, so that activity in one layer can be unambiguously attributed to activity in the adjacent layers. But sharing such a correlation value requires passing the value between layers, which might be a breaking api change. Occasionally it’s possible to construct a correlation value at each adjacent layer by transforming existing parameters in exactly the same way on the calling side and called side.
Additionally, software on one system converses with software on other systems; in those cases you need to have pairwise correlation values between adjacent peer layers. Again, same limitations apply to carrying such a correlation value via the API or protocol.
Really foresighted devs can anticipate these requirements and generate unique transaction ids that can be shared between machines and up and down the stack.
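Both options described above can be sketched in a few lines of Python: deriving the same correlation value on the calling and called side from parameters both already see, and generating a fresh transaction id to pass along. The choice of parameters, the hash, and the 16-character truncation are assumptions for illustration.

```python
import hashlib
import uuid

def derived_correlation_id(method, path, client_request_time):
    """Build a correlation value from inputs both adjacent layers already have.

    Both sides must transform the parameters in exactly the same way.
    """
    raw = f"{method}|{path}|{client_request_time}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]

def new_transaction_id():
    """Generate a unique id to carry across machines and up and down the stack."""
    return uuid.uuid4().hex

# Caller and callee compute the value independently; no API change is needed.
caller_side = derived_correlation_id("POST", "/rides", "2024-01-01T00:00:00Z")
callee_side = derived_correlation_id("POST", "/rides", "2024-01-01T00:00:00Z")
```

The derived variant only works when the chosen inputs are unique enough per call; two identical requests with the same timestamp would collide, which is why an explicit transaction id is the sturdier option when the API can carry one.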
A post on this topic feels incomplete without a shout-out to Charity Majors - she has been preaching this for a decade, branded the terms "wide events" and "observability", and built honeycomb.io around this concept.
Also worth pointing out that you can implement this method with a lot of tools these days. Both structured logs and traces lend themselves to capturing wide events. Just make sure to use a tool that supports general query patterns and has rich visualizations (time-series, histograms).
> A post on this topic feels incomplete without a shout-out to Charity Majors
I concur. In fact, I strongly recommend anyone who has been working with observability tools or in the industry to read her blog, and the back story that led to Honeycomb. They were the first to recognize the value of this type of observability and have been a huge inspiration for many that came after.
Could you drop a few specific posts here that you think are good for someone (me) who hasn't read her stuff before? Looks like there's a decade of stuff on her blog and I'm not sure I want to start at the very beginning...
I've learned more from Charity about telemetry than from anyone else. Her book is great, as are her talks and blog posts. And Honeycomb, as a tool, is frankly pretty amazing
This post was so in line with her writing that I was really expecting it to turn into an ad for Honeycomb at the end. I was pretty surprised when it turned out the author was unaffiliated!
The presentation is fantastic and I loved the interactive examples!
Too bad that all of this effort is spent arguing something which can be summarised as "add structured tags to your logs"
Generally speaking my biggest gripe with wide logs (and other "innovative" solutions to logging) is that whatever perceived benefit you argue for doesn't justify the increased complexity and loss of readability.
We're throwing away `grep "uid=user-123" application.log` to get what? The shipping method of the user attached to every log? Doesn't feel like an improvement to me...
P.S. The checkboxes in the wide event builder don't work for me (brave - android)
> Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally.
I worked with enterprise message bus loggers in semiconductor manufacturing context wherein we had thousands of participants on the message bus. It generated something like 300-400 megabytes per hour. Despite the insane volume we made this work really well using just grep and other basic CLI tools.
The logs were mere time series of events. Figuring out the detail about specific events (e.g. a list of all the tools a lot visited) required writing queries into the Oracle monster. You could derive history from the event logs if you had enough patience & disk space, but that would have been very silly given the alternative option. We used them predominantly to establish a causal chain between events when the details were still preliminary. Identifying suspects and such. Actually resolving really complicated business usually requires more than a perfectly detailed log file.
At last a sane person. Logs are for identifying the event timeline, not for acquiring the whole req/resp data. Putting every detail into the logs, in my experience, makes understanding issues harder. Logs tell a story: when and what happened, not how or why it happened. The why is in the code; the how is in the combination of data, logs, events, and code.
And loosely related, I also dislike log interfaces like the ELK stack. They make following the track of events really hard. Most of the time you do not know what you are looking for, and have just a vague understanding of why you are looking into the logs. So a line logged 3 microseconds earlier may be your eureka moment, one that no search could identify; only intuition and diligently following the logs can.
> It generated something like 300-400 megabytes per hour. Despite the insane volume we made this work really well using just grep and other basic CLI tools.
400MB of logs an hour is nothing at all, that's why a naive grep can work. You don't even need to rotate your log files frequently in this situation.
Horrid advice at the end about logging every error, exception, slow request, etc if you are sampling healthy requests.
Taking slow requests as an example, a dependency gets slower and now your log volume suddenly goes up 100x. Can your service handle that? Are you causing a cascading outage due to increased log volumes?
Recovery is easier if your service is doing the same or less work in a degraded state. Increasing logging by 20-100x when degraded is not that.
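One way to keep "always log errors and slow requests" from multiplying volume in a degraded state is to put a budget on those always-keep events. Below is a minimal token-bucket sketch in Python; the class name and the rate and burst parameters are assumptions, and events over budget are counted so the loss can still be reported.

```python
class LogBudget:
    """Token bucket: allow about `rate` always-keep events per second, with bursts up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.updated = None
        self.dropped = 0  # how many events were suppressed, for a periodic summary line

    def allow(self, now):
        # Refill in proportion to the time elapsed since the previous call.
        if self.updated is not None:
            elapsed = now - self.updated
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        self.dropped += 1
        return False
```

With a budget like this the service does roughly the same logging work whether 1% or 100% of requests are slow, and the `dropped` counter still tells you that something was shed.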
Good point. It also reminded me of when I was trying to optimize my app for some scenarios, then I realized it's better to optimize it for ALL scenarios, so it works fast and the servers can handle the load no matter what. To be more specific, I decided NOT to cache any common queries, but instead make sure that all queries are as fast as possible.
Yea that was my thought too. I like the idea in principle, but these magic thresholds can really bite you. It claims to be P(99), probably off some historical measurement, but that's only true if it's dynamically changing. Maybe this could periodically query the OTEL provider for the real number to at least limit the time window of something bad happening.
I do not see how logging could bottleneck you in a degraded state unless your logging is terribly inefficient. A properly designed logging system can record on the order of 100 million logs per second per core.
Are you actually contemplating handling 10 million requests per second per core that are failing?
Generation and publication is just the beginning (never mind the fact that resources consumed by an application to log something are no longer available to do real work). You have to consider the scalability of each component in the logging architecture from end to end. There's ingestion, parsing, transformation, aggregation, derivation, indexing, and storage. Each one of those needs to scale to meet demand.
I already accounted for consumed resources when I said 10 million instead of 100 million. I allocated 10% to logging overhead. If your service is within 10% of overload you are already in for a bad time. And frankly, what systems are you using that are handling 10 million requests per second per core (100 nanoseconds per request)? Hell, what services are you deploying that you even have 10 million requests per second per core to handle?
All of those other costs are, again, trivial with proper design. You can easily handle billions of events per second on the backend with even a modest server. This is done regularly by time traveling debuggers which actually need to handle these data rates. So again, what are we even deploying that has billions of events per second?
In my experience working at AWS and with customers, you don't need billions of TPS to make an end-to-end logging infrastructure keel over. It takes much less than that. As a working example, you can host your own end-to-end infra (the LGTM stack is pretty easy to deploy in a Kubernetes cluster) and see what it takes to bring yours to a grind with a given set of resources and TPS/volume.
I prefaced all my statements with the assumption that the chosen logging system is not poorly designed and terribly inefficient. Sounds like their logging solutions are poorly designed and terribly inefficient then.
It is, in fact, a self-fulfilling prophecy to complain that logging can be a bottleneck if you then choose logging that is 100-1000x slower than it should be. What a concept.
At the end of the day, it comes down to what sort of functionality you want out of your observability. Modest needs usually require modest resources: sure, you could just append to log files on your application hosts and ship them to a central aggregator where they're stored as-is. That's cheap and fast, but you won't get a lot of functionality out of it. If you want more, like real-time indexing, transformation, analytics, alerting, etc., it requires more resources. Ain't no such thing as a free lunch.
My impression was that you would apply this filter after the logs have reached your log destination, so there should be no difference for your services unless you host your own log infra, in which case there might be issues on that side. At least that's how we do it with Datadog because ingestion is cheap but indexing and storing logs long term is the expensive part.
I've recently come off a team that was racking up a huge Splunk bill with ~70 log events for each request on a high traffic service, and this is all very resonant (except the bit about sampling, I never gave that much thought - reducing our Splunk bill 70x was ambitious enough for me!).
Hadn't heard the "wide event" name, but I had settled on the same idea myself in that time (called them "top-level events" - i.e. we would gather information from the duration of the request and only log it at the "top" of the stack at the end), and evangelised them internally mostly on the basis it gave you fantastic correlation ability.
In theory if you've got a trace id in Splunk you can do correlated queries anyway, but we were working in Spring and forever having issues with losing our MDC after doing cross-thread dispatch and forgetting to copy the MDC thread global across. This wasn't obvious from the top-level, and usually only during an incident would you realise you weren't seeing all the loglines you expected for a given trace. So absent a better solution there, tracking debug info more explicitly was appealing.
Also used these top-level events to store sub-durations (e.g. for calling downstream services, invoking a model etc), and with Splunk if you record not just the length of a sub-process but its absolute start, you can reconstruct a hacky waterfall chart of where time was spent in your query.
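For anyone curious what the "hacky waterfall" amounts to, here is a small Python sketch that renders sub-durations recorded with their absolute start times as text bars. The event shape and span names are made up for illustration; the original was done with Splunk queries, not code like this.

```python
def waterfall(spans, width=40):
    """Render (name, start, duration) spans as text bars on a shared time axis."""
    t0 = min(start for _, start, _ in spans)
    t1 = max(start + dur for _, start, dur in spans)
    total = (t1 - t0) or 1.0  # avoid dividing by zero for a zero-length request
    scale = width / total
    rows = []
    for name, start, dur in sorted(spans, key=lambda s: s[1]):
        offset = int((start - t0) * scale)
        length = max(1, int(dur * scale))
        rows.append(f"{name:<10}|{' ' * offset}{'#' * length}")
    return rows

# Hypothetical top-level event: one record per request, sub-durations stored with start times.
event = {
    "request_id": "r1",
    "spans": [("request", 0.0, 1.0), ("auth", 0.0, 0.25), ("model", 0.5, 0.5)],
}
chart = waterfall(event["spans"])
```

Storing only the length of each sub-process would give you totals but not this picture; the absolute start is what lets you see gaps and overlap.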
> Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally. Today, a single user request might touch 15 services, 3 databases, 2 caches, and a message queue. Your logs are still acting like it's 2005.
Logs are fine. The job of local logs is to record the talk of a local process. They are doing this fine. Local logs were never meant to give you a picture of what's going on some other server. For such context, you need a transaction tracing that can stitch the story together across all processes involved.
Usually, looking at the logs in the right place should lead you to the root cause.
One of the points the author is trying to make (although he doesn't make it well, and his attitude makes it hard to read) is that logs aren't just for root-causing incidents.
When properly seasoned with context, logs give you useful information like who is impacted (not every incident impacts every customer the same way), correlations between component performance and inputs, and so forth. When connected to analytical engines, logs with rich context can help you figure out things like behaviors that lead to abandonment, the impact of security vulnerability exploits, and much more. And in their never-ending quest to improve their offerings and make more money, product managers love being able to test their theories against real data.
It’s a wild violation of SRP to suggest that. Separating concerns is way more efficient. Database can handle audit trail and some key metrics much better, no special tools needed, you can join transaction log with domain tables as a bonus.
Are you assuming they're all stored identically? If so, that's not necessarily the case.
Once the logs have entered the ingestion endpoint, they can take the most optimal path for their use case. Metrics can be extracted and sent off to a time-series metric database, while logs can be multiplexed to different destinations, including stored raw in cheap archival storage, or matched to schemas, indexed, stored in purpose-built search engines like OpenSearch, and stored "cooked" in Apache Iceberg+Parquet tables for rapid querying with Spark, Trino, or other analytical engines.
Have you ever taken, say, VPC flow logs, saved them in Parquet format, and queried them with DuckDB? I just experimented with this the other day and it was mind-blowingly awesome--and fast. I, for one, am glad the days of writing parsers and report generators myself are over.
I agree with this statement: "Instead of logging what your code is doing, log what happened to this request." but the impression I can't shake is that this person lacks experience, or more likely has a lot of experience doing the same thing over and over.
"Bug parts" (as in "acceptable number of bug parts per candy bar") logging should include the precursors of processing metrics. I think what he calls "wide events" I call bug parts logging in order to emphasize that it also may include signals pertaining to which code paths were taken, how many times, and how long it took.
Logging is not metrics is not auditing. In particular processing can continue if logging (temporarily) fails but not if auditing has failed. I prefer the terminology "observables" to "logging" and "evaluatives" to "metrics".
In mature SCADA systems there is the well-worn notion of a "historian". Read up on it.
A fluid level sensor on CANbus sending events 10x a second isn't telling me whether or not I have enough fuel to get to my destination (a significant question); however, that granularity might be helpful for diagnosing a stuck sensor (or bad connection). It would be impossibly fatiguing and hopelessly distracting to try to answer the significant question from this firehose of low-information events. Even a de-noised fuel gauge doesn't directly diagnose my desired evaluative (will I get there or not?).
Does my fuel gauge need to also serve as the debugging interface for the sensor? No, it does not. Likewise, send metrics / evaluatives to the cloud not logging / observables; when something goes sideways the real work is getting off your ass and taking a look. Take the time to think about what that looks like: maybe that's the best takeaway.
I espouse a "grand theory of observability" that, like matter and energy, treats logs, metrics, and audits alike. At the end of the day, they're streams of bits, and so long as no fidelity is lost, they can be converted between each other. Audit trails are certainly carried over logs. Metrics are streams of time-series numeric data; they can be carried over log channels or embedded inside logs (as they often are).
How these signals are stored, transformed, queried, and presented may differ, but at the end of the day, the consumption endpoint and mechanism can be the same regardless of origin. Doing so simplifies both the conceptual framework and design of the processing system, and makes it flexible enough to suit any conceivable set of use cases. Plus, storing the ingested logs as-is in inexpensive long-term archival storage allows you to reprocess them later however you like.
Auditing is fundamentally different because it has different durability and consistency requirements. I can buffer my logs, but I might need to transact my audit.
For most cases, buffering audit logs on local storage is fine. What matters is that the data is available and durable somewhere in the path, not that it be transactionally durable at the final endpoint.
Saying they are all the same when no fidelity is lost is missing the point. The only distinction between logs, traces, and metrics is literally what to do when fidelity is lost.
If you have insufficient ingestion rate:
Logs are for events that can be independently sampled and be coherent. You can drop arbitrary logs to stay within ingestion rate.
Traces are for correlated sequences of events where the entire sequence needs to be retained to be useful/coherent. You can drop arbitrary whole sequences to stay within ingestion rate.
Metrics are pre-aggregated collections of events. You pre-limited your emission rate to fit your ingestion rate at the cost of upfront loss of fidelity.
If you have adequate ingestion rate, then you just emit your events bare and post-process/visualize your events however you want.
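The three shedding strategies described above can be sketched side by side in Python. The event shape (`trace_id`, `name`) and function names are assumptions made for illustration.

```python
import random
from collections import Counter

def shed_logs(events, keep_ratio, rng):
    """Logs: each event stands alone, so any of them can be dropped independently."""
    return [e for e in events if rng.random() < keep_ratio]

def shed_traces(events, keep_ratio, rng):
    """Traces: decide once per trace id so every surviving sequence stays whole."""
    decisions = {}
    kept = []
    for e in events:
        tid = e["trace_id"]
        if tid not in decisions:
            decisions[tid] = rng.random() < keep_ratio
        if decisions[tid]:
            kept.append(e)
    return kept

def to_metrics(events):
    """Metrics: pre-aggregate, trading per-event detail for a bounded output size."""
    return Counter(e["name"] for e in events)

# 50 hypothetical events spread over 5 traces, alternating between two event names.
events = [{"trace_id": i % 5, "name": "db" if i % 2 else "http"} for i in range(50)]
```

All three functions take the same input; the only thing that differs is which kind of loss each one accepts when the ingestion rate is too low.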
I would rather fix this problem than every other problem. If I'm seeing backpressure, I'd prefer to buffer locally on disk until the ingestion system can get caught up. If I need to prioritize signal delivery once the backpressure has resolved itself, I can do that locally as well by separating streams (i.e. priority queueing). It doesn't change the fundamental nature of the system, though.
> Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally. Today, a single user request might touch 15 services, 3 databases, 2 caches, and a message queue.
If a user request is hitting that many things, in my view, that is a deeply broken architecture.
I'm building an analytics SaaS and we made the conscious decision to keep it simple: Next.js API routes + Supabase + minimal external services. A single page view hits maybe 3 components max (CDN -> App -> Database).
That said, I agree completely on structured logging with rich context. We include user_id, session_id, and event_type on every log line. Makes debugging infinitely easier.
The "wide events" concept is solid, but the real win is just having consistent, searchable structure. You don't need a revolutionary new paradigm - just stop logging random strings and use JSON with a schema.
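As a sketch of "JSON with a schema" using nothing but Python's standard `logging` module: the three field names are the ones listed in the comment, while the formatter class, the helper, and the choice to emit `null` for a missing field are assumptions for illustration.

```python
import io
import json
import logging

REQUIRED_FIELDS = ("user_id", "session_id", "event_type")

class JsonFormatter(logging.Formatter):
    """Emit every record as one JSON object with the same keys every time."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in REQUIRED_FIELDS:
            # Missing context shows up as null instead of silently changing the shape.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload, sort_keys=True)

def make_logger(name, stream):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

buf = io.StringIO()  # stands in for stdout or a log file
log = make_logger("app", buf)
log.info("page view", extra={"user_id": 7, "session_id": "s-1", "event_type": "page_view"})
```

The context is passed through the standard `extra` argument, so existing call sites only need the extra keyword rather than a new logging API.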
Persisting a data schema that represents business events is a great idea. That’s more about Event Sourcing though and doing that can answer a ton of questions about the system without doing it in log messages.
Wide events as a strategy is expensive, even with sampling, and doesn’t address the fundamental problem - why do we log messages?
I was hoping the article would enumerate why we log messages. Nailing down those scenarios first will lead to a happy life.
Why do we log?
- proof of life - is the system running?
- what is the state (in memory) when an error occurred?
- when did an error occur?
- do I need to get up at 2 am and fix something?
- what do I need to fix?
I feel like every team operating a system has their own reasons for logging.
But does it? Or is it bad logging, or excessive logging, or unsearchable logs?
A client of mine uses SnapLogic, which is a middleware / ETL that's supposed to run pipelines in batch mode to pass data around between systems. It generates an enormous amount of logs that are so difficult to access, search and read that they may as well not exist.
We're replacing all of that with simple Python scripts that do the same thing and generate normal simple logs with simple errors when something's truly wrong or the data is in the wrong format.
Terse logging is what you want, not an exhaustive (and exhausting) torrent of irrelevant information.
Just out of curiosity, how have you seen risk/compliance, regulatory, and audit departments at organizations deal with the disconnect between security and privacy for something like mainframe logging (e.g., JES2, JES3), which is typically inherently governed, and modern distributed logging, which is typically inherently permissive? Both are vastly different approaches, but each is somehow considered 'compliant.' Btw, employees at a company I was at were once investigated for insider trading simply because it was discovered the company used pooled logs that were accessible by production support programmers (the company decided to override the default mainframe security), which was deemed a possible source of insider trading information that could be tapped into by those who had log access (programmers were eventually cleared if it was discovered their small personal trades were immaterial and just coincidental with the company's trading, but the investigation led to uncomfortable confrontations for some!).
> Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally. Today, a single user request might touch 15 services, 3 databases, 2 caches, and a message queue. Your logs are still acting like it's 2005.
If a user request is hitting that many things, in my view, that is a deeply broken architecture.
> If a user request is hitting that many things, in my view, that is a deeply broken architecture.
Whether we want it or not, a lot of modern software looks like that. I am also not a particular fan of building software this way, but it's a reality we're facing. In part it's because quite a few services that people used to build in-house are now outsourced to PaaS solutions. Even basic things such as authentication are more and more moving to third parties.
I don't think the reason we end up with very complex systems is incentives between "managers and technicians". If I were to put my finger on it, I would assume it's the very technicians who argued themselves into a world where increased complexity and more dependencies are seen as a good thing.
At least in my place of work, my non-technical manager is actually on board with my crusade against complex nonsense. Mostly because he agrees it would increase feature velocity to not have to touch 5 services per minor feature. The other engineers love the horrific mess they've built. It's almost like they're roleplaying working at Google and I'm ruining the fun.
> If a user request is hitting that many things, in my view, that is a deeply broken architecture.
Things can add up quickly. I wouldn't be surprised if some requests touch a lot of bases.
Here's an example: a user wants to start renting a bike from your public bike sharing service, using the app on their phone.
This could be an app developed by the bike sharing company itself, or a 3rd party app that bundles mobility options like ride sharing and public transport tickets in one place.
You need to authenticate the request and figure out which customer account is making the request. Is the account allowed to start a ride? They might be blocked. They might need to confirm the rules first. Is this ride part of a group ride, and is the customer allowed to start multiple rides at once? Let's also get a small deposit by putting a hold of a small sum on their credit card. Or are they a reliable customer? Then let's not bother them. Or is there a fraud risk? And do we need to trigger special code paths to work around known problems for payment authorization for cards issued by this bank?
Everything good so far? Then let's start the ride.
First, let's lock in the necessary data. Which rental pricing did the customer agree to? Is that actually available to this customer, this geographical zone, for this bike, at this time, or do we need to abort with an error? Otherwise, let's remember this, so we can calculate the correct rental fee at the end.
We normally charge an unlock fee in addition to the per-minute price. Are we doing that in this case? If yes, does the customer have any free unlock credit that we need to consume or reserve now, so that the app can correctly show unlock costs if the user wants to start another group ride before this one ends?
Ok, let's unlock the bike and turn on the electric motor. We need to make sure it's ready to be used and talk to the IoT box on the bike, taking into account the kind of bike, kind of box and software version. Maybe this is a multistep process, because the particular lock needs manual action by the customer. The IoT box might have to know that we're in a zone where we throttle the max speed more than usual.
Now let's inform some downstream data aggregators that a ride started successfully. BI (business intelligence) will want to know, and the city might also require us to report this to them. The customer was referred by a friend, and this is their first ride, so now the friend gets his referral bonus in the form of app credit.
Did we charge a non-refundable unlock fee? We might want to invoice that already (for whatever reason; otherwise this will happen after the ride). Let's record the revenue, create the invoice data and the PDF, email it, and report this to the country's tax agency, because that's required in the country this ride is starting in.
Or did things go wrong? Is the vehicle broken? Gotta mark it for service to swing by, and let's undo any payment holds. Or did the deposit fail, because the credit card is marked as stolen? Maybe block the customer and see if we have other recent payments using the same card fingerprint that we might want to proactively refund.
That's just off the top of my head, there may be more for a real life case. Some of these may happen synchronously, others may hit a queue or event bus. The point is, they are all tied to a single request.
So, depending on how you cut things, you might need several services that you can deploy and develop independently.
- auth
- core customer management, permissions, ToS agreement,
One thing this is missing: Standardization and probably the ECS' idea of "related" fields.
A common problem in log aggregation is the question of whether you query for user.id, user_id, userID, buyer.user.id, buyer.id, buyer_user_id, buyer_id, ... Every log aggregation ends up being plagued by this. You need standard field names there, or it becomes a horrible mess.
And for a centralized aggregation, I like ECS' idea of "related". If you have a buyer and a seller, both with user IDs, you'd have a `related.user.id` with both IDs in there. This makes it very simple to say "hey, give me everything related to request X" or "give me everything involving user Y in this time frame" (as long as this is kept up to date, naturally)
I actually wrote my bachelor's thesis on this topic, but instead of going the ECS route (which still has redundant fields in different components) I went in the RDF direction. That system has shifted towards more of a middleware/database hybrid over time (https://github.com/triblespace/triblespace-rs). I always wonder if we'd actually need logging if we had more data-oriented stacks where the logs fall out as a natural byproduct of communication and storage.
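A minimal Python sketch of both ideas, renaming ad-hoc field names to one standard and then collecting every user id into a "related" field. The alias table and the `seller_id` key are invented for illustration, and the `related.user.id` name is taken from the comment above rather than checked against the ECS field reference.

```python
# Hypothetical mapping from names seen in the wild to one agreed-upon name.
USER_ID_ALIASES = {
    "user_id": "user.id",
    "userID": "user.id",
    "buyer_id": "buyer.user.id",
    "buyer_user_id": "buyer.user.id",
    "seller_id": "seller.user.id",
}

def normalize(event):
    """Rename known aliases, then gather all user ids into one searchable field."""
    out = {}
    for key, value in event.items():
        out[USER_ID_ALIASES.get(key, key)] = value
    related = sorted({str(v) for k, v in out.items() if k.endswith("user.id")})
    if related:
        out["related.user.id"] = related
    return out
```

With the related field populated at ingestion, "everything involving user Y" becomes a single-field filter instead of an OR over every role a user can appear in.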
I always wondered why we didn't have some kind of fuzzy English-word search regex/tool that is robust to typing mistakes, spelling mistakes, synonyms, plurals, conjugation etc.
I've recently added error tracking to my self-hosted analytics app (UXWizz), and the way I did it is simply add extra events to each user/session. Once you have the concept of a session or user, you can simply attach errors or logs as Events stored for that user. This solves the main problem mentioned in the article, where you don't know what happened, plus being an Event stored in a MySQL database, you can still query it.
Why not simply use Events for logging, instead of plain strings?
The article, AI or not, is extremely naive. It doesn't mention any premise or any problem to solve. Proposes a solution and just goes with it. What if your monster of a event is lost when your service crashes or is lost by the logging library/service/etc? What if you're interested in measuring, post factum, how long each step takes? What if you want to trace a log through several (micro-)services and maybe between a mobile app and some batch job executor that runs once a day?
"Logging sucks" when you don't understand the problem you're trying to solve.
How is grep a bad thing? I find myself using it all the time.
I’m not into graphical user interfaces. They overwhelm me. By the time I’ve clicked myself through the GUI or written some horrible proprietary $COMPANY Query Language string, I might have already figured out the bug using tried and tested CLI tools.
This seems like a classic time vs space trade off.
Instead of reconstructing a "wide event" from multiple log lines with the same request id, the suggestion seems to be logging wide events repeatedly to simplify reconstruction from request ids.
I personally don't see the advantage, and in either scenario, if you're not logging what's needed you're screwed.
Structured Logging is not just JSON. It's the use of templates with context. It solves 90% of what this article complains about if you just log the template along with the variables and the message separately. Along with logging the right stuff. IE `"User {username} created order {orderid}"`
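For anyone who hasn't seen this style, here is a rough Python sketch using only the standard `logging` module. The `TemplateFormatter` class, the `log_event` helper, and the `fields` key are names invented for this example, not part of any library; the point is only that the template, the variables, and the rendered message are stored as three separate things.

```python
import json
import logging


class TemplateFormatter(logging.Formatter):
    """Emit the template, the named values, and the rendered message as JSON."""

    def format(self, record: logging.LogRecord) -> str:
        template = str(record.msg)
        # "fields" is this sketch's own convention, attached via `extra`.
        fields = getattr(record, "fields", {})
        return json.dumps(
            {
                "level": record.levelname,
                "template": template,  # stable key to group and search by
                "fields": fields,  # queryable values
                "message": template.format(**fields),  # human-readable form
            }
        )


def log_event(logger: logging.Logger, template: str, **fields) -> None:
    """Log a template plus its variables without pre-rendering the string."""
    logger.info(template, extra={"fields": fields})
```

Calling `log_event(logger, "User {username} created order {orderid}", username="ada", orderid=42)` keeps the template constant across all users, so you can count or filter on it, while `username` and `orderid` remain individually queryable.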
AI slop blogvert. The first example is disingenuous btw. Everyone these days uses requestIDs to be able to query all log lines emanated by a single request, usually set by the first backend service to receive the request and then propagated using headers (and also set in the server response).
There isn't anything radical about his proposed solutions either. Most log storage can be set with a rule where all warning logs or above can be retained, but only a sample of info and debug logs.
The "key insight" is also flawed. The reason why we log at every step is because sometimes your request never completes and it could be for 1000 reasons but you really need to know how far it got in your system. Logging only a summary at the end is happy path thinking.
Kinda get what he’s saying: provide more metadata with structured logging as opposed to lots of string only logs. Ok, modern logging frameworks steer you towards that anyway. But as a counterpoint: often it can be hard to safely enrich logging like that. In the example they include subscription age, user info, etc. More than once I’ve seen logging code lookup metadata or assume it existed, only to cause perf issues or outright errors as expected data didn’t exist. Similar with sampling, it can be frustrating when the thing you need gets sampled out. In the end “it depends” on scenario, but I still find myself not logging enough or else logging too much
The problem statement in this article sounds weird. I thought in 2025 everyone logs at least thread id and context id (user id, request id etc), and in microservice architecture at least transaction or saga id. You don’t need structured logging, because grep by this id is sufficient for incident investigation. And for analytics and metrics databases of events and requests make more sense.
Our logging guidance is: "Don't write comments, write logs" and that serves us pretty well. The point being, don't write "clever code", write obvious code, and try to make it similar to everything else that's been done, regardless of whether you agree with it.
Gonna go on a tangent here. Why the single purpose domain? Especially since the author has a blog. My blog is full of links to single post domains that are no longer.
the best implementation of structured logging I've seen is dotnet build's binlogs (https://msbuildlog.com), I would love to see it evolve into a general purpose logging solution
You might also need different systems for low-cardinality, low-latency production monitoring (where you want to throw alerts quickly and high cardinality fields would just get in the way), and medium to long term logging with wide events.
Also if you're going to log wide events, for the sake of the person querying them after you, please don't let your schema be an ad hoc JSON dict of dicts, put some thought into the schema structure (and better have a logging system that enforces the schema).
From what I gather: This is referring to Web sites or other HTTP applications which are internally implemented as a collection of separate applications/ micro-services?
I've generally found that structured logs that include a correlation ID make it quite easy to narrow down the general area or exact cause of problems. Usually (in enterprise orgs) via Splunk or Datadog.
Where I've had problems it's usually been one of:
There wasn't anything logged in the error block. A comment saying "never happens" is often discovered later :)
Too much was logged and someone mandated dialing the logging down to save costs. Sigh.
A new thread was started and the thread-local details including the correlation ID got lost, then the error occurred downstream of that. I'd like better solutions for that one.
Edit: Incidentally a correlation ID is not (necessarily) the same thing as a request ID. An API often needs to allow for the caller making multiple calls to achieve an objective; 5 request IDs might be tied to a single correlation ID.
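On the lost-across-threads problem: one possible approach, sketched here in Python rather than whatever stack the comment refers to, is to snapshot the caller's context and run the new thread inside it. `correlation_id`, `run_in_thread`, and `worker` are hypothetical names; `contextvars.copy_context()` and `Context.run()` are standard library calls. On most current CPython builds a plain `threading.Thread` does not inherit the caller's context variables, which is the failure mode described above.

```python
import contextvars
import threading

# Request-scoped value; unset (None) unless the request handler sets it.
correlation_id = contextvars.ContextVar("correlation_id", default=None)


def run_in_thread(fn, *args):
    """Start fn in a new thread that sees the caller's context variables."""
    ctx = contextvars.copy_context()  # snapshot taken in the calling thread
    thread = threading.Thread(target=ctx.run, args=(fn, *args))
    thread.start()
    return thread


def worker(results, label):
    # Stand-in for downstream code that wants to log the correlation ID.
    results[label] = correlation_id.get()
```

The design choice is to make propagation the default at the one place threads get created, instead of relying on every call site remembering to copy the ID across.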
I see more and more blog posts that contain interactive elements. Despite the general enshittification of the average blog and the internet, this feels like a 'modern' touch that actually adds something valuable to the sufficient ad-free no-popups old blog style.
Slapping on OpenTelemetry actually will solve your problem.
Point #1 isn't true, auto instrumentation exists and is really good. When I integrate OTel I add my own auto instrumentors wherever possible to automatically add lots of context. Which gets into point #2.
Point #2 also isn't true. It can add business context in a hierarchal manner and ship wide events. You shouldn't have to tell every span all the information again. Just where it appears naturally the first time.
Point #3 also also isn't true because OTel libs make it really annoying to just write a log message and very strongly pushes you into a hierarchy of nested context managers.
Like the author's ideal setup is basically using OTel with Honeycomb. You get the querying and everything. And unlike rawdogging wide events all your traces are connected, can span multiple services and do timing for you.
This article is attacking a strawman. It makes up terrible logs and then says they are bad. Even if this was a single monolith the logs still don't include even something like a thread id, to avoid mixing different requests together.
In some languages the tracing frameworks are a godsend. In Rust the instrument macro will automatically record all function arguments as span tags. Plonk anything in e.g. Jaeger and any full trace can be looked up from pretty much any value.
That doesn't sound like a good plan. You're coupling logging with business logic. I don't want to have to think about whether changing a debug string is going to break something.
You're also assuming your log infrastructure is a lot more durable than most are. Generally, logging is not a guaranteed action. Writing a log message is not normally something where you wait for a disk sync before proceeding. Dropping a log message here or there is not a fatal error. Logs get rotated and deleted automatically. They are designed for retroactive use and best effort event recording, not assumed to be a flawless record of everything the system did.
Tangential, but I wonder if the given example might be straying a step too far? Normally we want to keep sensitive data out of logs, but the example includes a user.lifetime_value_cents field. I'd want to have a chat with the rest of the business before sticking something like that in logs.
In some companies, this type of information is often very important and very easily available to everyone at all levels of the business to help prioritize and understand customer value. I would not consider it "sensitive" in the same way that e.g. PII would be.
Good to know! At previous jobs, that information wasn't available to me (and it didn't matter because the customer bases were small enough that every customer was top priority), so I assumed it was considered more sensitive than it perhaps is.
The framing is not, though. Why does it have to sound so dramatic and provocative? It’s insulting to its audience. Grumpiness, in the long term, is a career-limiting attitude.
Career-limiting perhaps (if expressing normal human emotion is a minus inside of an organization, it may be time to bail) but some of the best minds I've met/observed were absolute curmudgeons (with purpose—they were properly bothered by a problem and refused to go along with the "sweep it under the rug" behavior).
Sure, I've dealt with plenty of assholes, too, but the grumps are usually just tired of their valid insight being ignored by more foolish, orthogonally incentivized types (read: "playing the game" not "making it work well").
We know these people exist, but I also believe most of us would prefer to work with a person who's both smart and kind over someone who's smart and curmudgeonly. It is possible to be both smart and kind, and I've had the pleasure of working with such people.
Assholes can sap an organization's strength faster than any productive value their intelligence can provide. I'm not suggesting the author is an asshole, though; there's not enough evidence from this post.
While I agree with some of it, I feel like there's a big gotcha here that isn't addressed. Having 1 single wide event, at the end of a request, means that if something unexpected happens in the middle (stack overflow, some bug that throws an error that bypasses your logging system, lambda times out etc...) you don't get any visibility into what happens.
You also most likely lose out on a lot of logging frameworks your language has that your dependencies might use.
I would say this is a good layer to put on top of your regular logs. Make sure you have a request/session wide id and aggregate all those in your clickhouse or whatever into a single "log".
The way I have solved for this in my own framework in PHP is by having a Logging class with the following interface
I also have a global exception handler that is registered at application bootstrap time that takes any exception that happens and runs $logger->exception($e);
There is obviously a tiny bit more of boilerplating to this thing, but it works so well that I can't live without it anymore.
That was difficult to read, smelt very AI assisted though the message was worthwhile, it could've been shorter and more to the point.
A few things I've been thinking about recently:
- we have authentication everywhere in our stack, so I've started including the user id on every log line. This makes getting a holistic view of what a user experienced much easier.
- logging an error as a separate log line to the request log is a pain. You can filter for the trace, but it makes it hard to surface "show me all the logs for 5xx requests and the error associated" - it's doable, but it's more difficult than filtering on the status code of the request log
- it's not enough to just start including that context, you have to educate your coworkers that it's now present. I've seen people making life hard for themselves because they didn't realize we'd added this context
On the other hand, investing in better tracing tools unlocks a whole nother level of logging and debugging capabilities that aren't feasible with just request logs. It's kind of like you mentioned with using the user id as a "trace" in your first message but on steroids.
These tools tend to be very expensive in my experience unless you are running your own monitoring cloud. Either you end up sampling traces at low rates to save on costs, or your observability bill is more than your infrastructure bill.
Doing stuff like turning on tracing for clients that saw errors in the last 2 minutes, or for requests that were retried should only gather a small portion of your data. Maybe you can include other sessions/requests at random if you want to have a baseline to compare against.
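A rough sketch of that decision logic in Python, assuming you can call something on every error and on every incoming request. `ErrorTriggeredSampler` and its parameters are made up for this example; the 120 second window mirrors the "last 2 minutes" above and the 1% baseline rate is an arbitrary placeholder.

```python
import random
import time


class ErrorTriggeredSampler:
    """Trace retries and clients with recent errors, plus a small random baseline."""

    def __init__(self, window_seconds=120.0, baseline_rate=0.01,
                 clock=time.monotonic, rng=random.random):
        self.window_seconds = window_seconds
        self.baseline_rate = baseline_rate
        self._clock = clock
        self._rng = rng
        self._last_error = {}  # client_id -> time of most recent error

    def record_error(self, client_id):
        self._last_error[client_id] = self._clock()

    def should_trace(self, client_id, is_retry=False):
        if is_retry:
            return True
        last = self._last_error.get(client_id)
        if last is not None:
            if self._clock() - last <= self.window_seconds:
                return True
            del self._last_error[client_id]  # expired; keep the map small
        # Random baseline so there is healthy traffic to compare against.
        return self._rng() < self.baseline_rate
```

In a multi-instance deployment the error map would have to live somewhere shared, which this sketch deliberately ignores.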
If your codebase has the concept of a request ID, you could also feasibly use that to trace what a user has been doing with more specificity.
We do have both a span id and trace id - but I personally find this more cumbersome over filtering on a user id. YMMV if you're interested in a single trace then you'd filter for that, but I find you often also care what happened "around" a trace
…and the same ID can be displayed to user on HTTP 500 with the support contact, making life of everyone much easier.
I have seen pushback on this kind of behavior because "users don't like error codes" or other such nonsense. UX and Product like to pretend nothing will ever break, and when it does they want some funny little image, not useful output.
A good compromise is to log whenever a user would see the error code, and treat those events with very high priority.
> That was difficult to read, smelt very AI assisted though the message was worthwhile...
It won’t be long before ad computem comments like this become unacceptable.
I hope registering an entire domain name for a blog post doesn't become a trend. I like linking to things that are likely to last a long time - a personal blog is one thing, but expecting people to keep paying the renewal fee every year for a single article feels less likely to me.
A good alternative here is subdomains, since those don't have an additional annual fee. https://logging-sucks.boristane.com/ could work well here.
Because of the nature of how software is built and deployed nowadays, it’s generally not possible to write single log entries that tell the “whole story” of “what happened”.
I could write about this for hours, but instead I’ll just discuss two concepts that you need in modern logging: vertical correlation and horizontal correlation.
Within a system, requests tend to go “up” and “down” stacks of software. It is very useful in these scenarios to have “vertical correlation” fields shared between adjacent layers, so that activity in one layer can be unambiguously attributed to activity in the adjacent layers. But sharing such a correlation value requires passing the value between layers, which might be a breaking api change. Occasionally it’s possible to construct a correlation value at each adjacent layer by transforming existing parameters in exactly the same way on the calling side and called side.
Additionally, software on one system converses with software on other systems; in those cases you need to have pairwise correlation values between adjacent peer layers. Again, same limitations apply to carrying such a correlation value via the API or protocol.
Really foresighted devs can anticipate these requirements and generate unique transaction ids that can be shared between machines and up and down the stack.
A post on this topic feels incomplete without a shout-out to Charity Majors - she has been preaching this for a decade, branded the term "wide events" and "observability", and built honeycomb.io around this concept.
Also worth pointing out that you can implement this method with a lot of tools these days. Both structured logs and traces lend themselves to capturing wide events. Just make sure to use a tool that supports general query patterns and has rich visualizations (time-series, histograms).
> A post on this topic feels incomplete without a shout-out to Charity Majors
I concur. In fact, I strongly recommend anyone who has been working with observability tools or in the industry to read her blog, and the back story that led to Honeycomb. They were the first to recognize the value of this type of observability and have been a huge inspiration for many that came after.
Could you drop a few specific posts here that you think are good for someone (me) who hasn't read her stuff before? Looks like there's a decade of stuff on her blog and I'm not sure I want to start at the very beginning...
I've learned more from Charity about telemetry than from anyone else. Her book is great, as are her talks and blog posts. And Honeycomb, as a tool, is frankly pretty amazing
Yep, I'm a fan.
This post was so in line with her writing that I was really expecting it to turn into an ad for Honeycomb at the end. I was pretty surprised when it turned out the author was unaffiliated!
Nick Blumhardt for a while longer than that as "structured logging". Seq and Serilog as enabling software and library in the .net ecosystem.
She has good content but no single person branded the term "observability", what the heck. You can respect someone without making wild claims.
The presentation is fantastic and I loved the interactive examples!
Too bad that all of this effort is spent arguing something which can be summarised as "add structured tags to your logs"
Generally speaking my biggest gripe with wide logs (and other "innovative" solutions to logging) is that whatever perceived benefit you argue for doesn't justify the increased complexity and loss of readability.
We're throwing away `grep "uid=user-123" application.log` to get what? The shipping method of the user attached to every log? Doesn't feel like an improvement to me...
P.S. The checkboxes in the wide event builder don't work for me (brave - android)
> Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally.
I worked with enterprise message bus loggers in semiconductor manufacturing context wherein we had thousands of participants on the message bus. It generated something like 300-400 megabytes per hour. Despite the insane volume we made this work really well using just grep and other basic CLI tools.
The logs were mere time series of events. Figuring out the detail about specific events (e.g. a list of all the tools a lot visited) required writing queries into the Oracle monster. You could derive history from the event logs if you had enough patience & disk space, but that would have been very silly given the alternative option. We used them predominantly to establish a causal chain between events when the details are still preliminary. Identifying suspects and such. Actually resolving really complicated business usually requires more than a perfectly detailed log file.
At last a sane person. Logs are for identifying the event timeline, not for acquiring the whole request/response data. Putting every detail into the logs, in my experience, makes understanding issues harder. Logs tell a story: when and what happened, not how or why it happened. The why is in the code; the how is in the combination of data, logs, events, and code.
And loosely related, I also dislike log interfaces like the ELK stack. They make following the sequence of events really hard. Most of the time you do not know what you are looking for, you just have a vague understanding of why you are looking into the logs. So a line logged 3 microseconds earlier may be your eureka moment, one that no search could identify, only intuition and diligently following the logs.
> It generated something like 300-400 megabytes per hour. Despite the insane volume we made this work really well using just grep and other basic CLI tools.
400MB of logs an hour is nothing at all, that's why a naive grep can work. You don't even need to rotate your log files frequently in this situation.
Horrid advice at the end about logging every error, exception, slow request, etc if you are sampling healthy requests.
Taking slow requests as an example, a dependency gets slower and now your log volume suddenly goes up 100x. Can your service handle that? Are you causing a cascading outage due to increased log volumes?
Recovery is easier if your service is doing the same or less work in a degraded state. Increasing logging by 20-100x when degraded is not that.
Good point. It also reminded me of when I was trying to optimize my app for some scenarios, then I realized it's better to optimize it for ALL scenarios, so it works fast and the servers can handle no matter what. To be more specific, I decided NOT to cache any common queries, but instead make sure that all queries are fast as possible.
Yea that was my thought too. I like the idea in principle, but these magic thresholds can really bite you. It claims to be P(99), probably off some historical measurement, but that's only true if it's dynamically changing. Maybe this could periodically query the OTEL provider for the real number to at least limit the time window of something bad happening.
It’s an important architectural requirement for a production service to be able to scale out their log ingestion capabilities to meet demand.
Besides, a little local on-disk buffering goes a long way, and is cheap to boot. It’s an antipattern to flush logs directly over the network.
I do not see how logging could bottleneck you in a degraded state unless your logging is terribly inefficient. A properly designed logging system can record on the order of 100 million logs per second per core.
Are you actually contemplating handling 10 million requests per second per core that are failing?
Generation and publication is just the beginning (never mind the fact that resources consumed by an application to log something are no longer available to do real work). You have to consider the scalability of each component in the logging architecture from end to end. There's ingestion, parsing, transformation, aggregation, derivation, indexing, and storage. Each one of those needs to scale to meet demand.
I already accounted for consumed resources when I said 10 million instead of 100 million. I allocated 10% to logging overhead. If your service is within 10% of overload you are already in for a bad time. And frankly, what systems are you using that are handling 10 million requests per second per core (100 nanoseconds per request)? Hell, what services are you deploying that you even have 10 million requests per second per core to handle?
All of those other costs are, again, trivial with proper design. You can easily handle billions of events per second on the backend with even a modest server. This is done regularly by time traveling debuggers which actually need to handle these data rates. So again, what are we even deploying that has billions of events per second?
In my experience working at AWS and with customers, you don't need billions of TPS to make an end-to-end logging infrastructure keel over. It takes much less than that. As a working example, you can host your own end-to-end infra (the LGTM stack is pretty easy to deploy in a Kubernetes cluster) and see what it takes to bring yours to a grind with a given set of resources and TPS/volume.
I prefaced all my statements with the assumption that the chosen logging system is not poorly designed and terribly inefficient. Sounds like their logging solutions are poorly designed and terribly inefficient then.
It is, in fact, a self-fulfilling prophecy to complain that logging can be a bottleneck if you then choose logging that is 100-1000x slower than it should be. What a concept.
At the end of the day, it comes down to what sort of functionality you want out of your observability. Modest needs usually require modest resources: sure, you could just append to log files on your application hosts and ship them to a central aggregator where they're stored as-is. That's cheap and fast, but you won't get a lot of functionality out of it. If you want more, like real-time indexing, transformation, analytics, alerting, etc., it requires more resources. Ain't no such thing as a free lunch.
Just implement exponential backoff for slow requests logging, or some other heuristic, to control it. I definitely agree it is a concern though.
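One way to read "exponential backoff" here, sketched in Python: within a time window, only emit the 1st, 2nd, 4th, 8th, ... occurrence of a given kind of event, so a 100x spike in slow requests adds a handful of log lines instead of 100x the volume. `BackoffLogGate`, the key name, and the 60 second reset are assumptions for illustration, and this count-based thinning is only one of several heuristics that would fit the description.

```python
import time


class BackoffLogGate:
    """Allow the 1st, 2nd, 4th, 8th, ... occurrence per key within a window."""

    def __init__(self, reset_after=60.0, clock=time.monotonic):
        self.reset_after = reset_after
        self._clock = clock
        self._state = {}  # key -> (count, next_allowed_count, window_start)

    def allow(self, key):
        now = self._clock()
        count, next_allowed, started = self._state.get(key, (0, 1, now))
        if now - started > self.reset_after:
            # Window expired: start counting from scratch.
            count, next_allowed, started = 0, 1, now
        count += 1
        allowed = count == next_allowed
        if allowed:
            next_allowed *= 2
        self._state[key] = (count, next_allowed, started)
        return allowed
```

With a single key, 1000 slow requests inside one window produce 10 log lines (occurrences 1, 2, 4, ..., 512). It helps to put the running count into the emitted line, otherwise the dropped events are invisible.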
My impression was that you would apply this filter after the logs have reached your log destination, so there should be no difference for your services unless you host your own log infra, in which case there might be issues on that side. At least that's how we do it with Datadog because ingestion is cheap but indexing and storing logs long term is the expensive part.
I've recently come off a team that was racking up a huge Splunk bill with ~70 log events for each request on a high traffic service, and this is all very resonant (except the bit about sampling, I never gave that much thought - reducing our Splunk bill 70x was ambitious enough for me!).
Hadn't heard the "wide event" name, but I had settled on the same idea myself in that time (called them "top-level events" - i.e. we would gather information from the duration of the request and only log it at the "top" of the stack at the end), and evangelised them internally mostly on the basis it gave you fantastic correlation ability.
In theory if you've got a trace id in Splunk you can do correlated queries anyway, but we were working in Spring and forever having issues with losing our MDC after doing cross-thread dispatch and forgetting to copy the MDC thread global across. This wasn't obvious from the top-level, and usually only during an incident would you realise you weren't seeing all the loglines you expected for a given trace. So absent a better solution there, tracking debug info more explicitly was appealing.
Also used these top-level events to store sub-durations (e.g. for calling downstream services, invoking a model etc), and with Splunk if you record not just the length of a sub-process but its absolute start, you can reconstruct a hacky waterfall chart of where time was spent in your query.
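A small Python sketch of that idea, assuming a hypothetical `WideEvent` accumulator that lives for the duration of one request. Field names such as `spans`, `start_ms`, and `duration_ms` are invented here; the only point carried over from the comment is that each sub-step records its absolute start as well as its length.

```python
import time
from contextlib import contextmanager


class WideEvent:
    """Accumulates fields during a request; emitted once at the end."""

    def __init__(self, request_id, clock=time.time):
        self._clock = clock
        self.fields = {"request_id": request_id, "spans": []}

    @contextmanager
    def timed(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Recorded even if the block raises, so failed steps still show up.
            self.fields["spans"].append({
                "name": name,
                "start_ms": round(start * 1000),
                "duration_ms": round((self._clock() - start) * 1000),
            })


def waterfall(fields):
    """Return (name, offset_ms, duration_ms) rows relative to the first span."""
    spans = sorted(fields["spans"], key=lambda s: s["start_ms"])
    if not spans:
        return []
    t0 = spans[0]["start_ms"]
    return [(s["name"], s["start_ms"] - t0, s["duration_ms"]) for s in spans]
```

Usage is `with event.timed("downstream_call"): ...` around each sub-step. With only durations you could not tell sequential work from parallel work; the absolute start is what makes the waterfall reconstructable.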
> Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally. Today, a single user request might touch 15 services, 3 databases, 2 caches, and a message queue. Your logs are still acting like it's 2005.
Logs are fine. The job of local logs is to record the talk of a local process. They are doing this fine. Local logs were never meant to give you a picture of what's going on some other server. For such context, you need a transaction tracing that can stitch the story together across all processes involved.
Usually, looking at the logs in the right place should lead you to the root cause.
One of the points the author is trying to make (although he doesn't make it well, and his attitude makes it hard to read) is that logs aren't just for root-causing incidents.
When properly seasoned with context, logs give you useful information like who is impacted (not every incident impacts every customer the same way), correlations between component performance and inputs, and so forth. When connected to analytical engines, logs with rich context can help you figure out things like behaviors that lead to abandonment, the impact of security vulnerability exploits, and much more. And in their never-ending quest to improve their offerings and make more money, product managers love being able to test their theories against real data.
It’s a wild violation of SRP to suggest that. Separating concerns is way more efficient. Database can handle audit trail and some key metrics much better, no special tools needed, you can join transaction log with domain tables as a bonus.
Are you assuming they're all stored identically? If so, that's not necessarily the case.
Once the logs have entered the ingestion endpoint, they can take the most optimal path for their use case. Metrics can be extracted and sent off to a time-series metric database, while logs can be multiplexed to different destinations, including stored raw in cheap archival storage, or matched to schemas, indexed, stored in purpose-built search engines like OpenSearch, and stored "cooked" in Apache Iceberg+Parquet tables for rapid querying with Spark, Trino, or other analytical engines.
Have you ever taken, say, VPC flow logs, saved them in Parquet format, and queried them with DuckDB? I just experimented with this the other day and it was mind-blowingly awesome--and fast. I, for one, am glad the days of writing parsers and report generators myself are over.
Good joke.
>Today, a single user request might touch 15 services, 3 databases, 2 caches, and a message queue.
Not if I have anything to say about it.
>Your logs are still acting like it's 2005.
Yeah, because that's just before software development went absolutely insane.
APN/Kibana. All I need for inspecting logs.
Shoutout to Kibana. Absolutely my favorite UI tool for trying to figure out what went wrong (and sometimes, IF anything went wrong in the first place)
I agree with this statement: "Instead of logging what your code is doing, log what happened to this request." but the impression I can't shake is that this person lacks experience, or more likely has a lot of experience doing the same thing over and over.
"Bug parts" (as in "acceptable number of bug parts per candy bar") logging should include the precursors of processing metrics. I think what he calls "wide events" I call bug parts logging in order to emphasize that it also may include signals pertaining to which code paths were taken, how many times, and how long it took.
Logging is not metrics is not auditing. In particular processing can continue if logging (temporarily) fails but not if auditing has failed. I prefer the terminology "observables" to "logging" and "evaluatives" to "metrics".
In mature SCADA systems there is the well-worn notion of a "historian". Read up on it.
A fluid level sensor on CANbus sending events 10x a second isn't telling me whether or not I have enough fuel to get to my destination (a significant question); however, that granularity might be helpful for diagnosing a stuck sensor (or bad connection). It would be impossibly fatiguing and hopelessly distracting to try to answer the significant question from this firehose of low-information events. Even a de-noised fuel gauge doesn't directly diagnose my desired evaluative (will I get there or not?).
Does my fuel gauge need to also serve as the debugging interface for the sensor? No, it does not. Likewise, send metrics / evaluatives to the cloud not logging / observables; when something goes sideways the real work is getting off your ass and taking a look. Take the time to think about what that looks like: maybe that's the best takeaway.
> Logging is not metrics is not auditing.
I espouse a "grand theory of observability" that, like matter and energy, treats logs, metrics, and audits alike. At the end of the day, they're streams of bits, and so long as no fidelity is lost, they can be converted between each other. Audit trails are certainly carried over logs. Metrics are streams of time-series numeric data; they can be carried over log channels or embedded inside logs (as they often are).
How these signals are stored, transformed, queried, and presented may differ, but at the end of the day, the consumption endpoint and mechanism can be the same regardless of origin. Doing so simplifies both the conceptual framework and design of the processing system, and makes it flexible enough to suit any conceivable set of use cases. Plus, storing the ingested logs as-is in inexpensive long-term archival storage allows you to reprocess them later however you like.
Auditing is fundamentally different because it has different durability and consistency requirements. I can buffer my logs, but I might need to transact my audit.
For most cases, buffering audit logs on local storage is fine. What matters is that the data is available and durable somewhere in the path, not that it be transactionally durable at the final endpoint.
You could have the log shipper filter events and create a separate audit stream with different behavior and destination.
Really, have sane log message types and include “audit” as one of them.
Log levels could be considered an anti-pattern.
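A minimal sketch of the shipper idea above, assuming each event carries a `type` field and that `"audit"` is one of its values. The sink names and the `route`/`ship` helpers are made up for illustration; a real shipper would write to different destinations with different durability settings.

```python
def route(event: dict) -> str:
    """Pick a sink by message type (hypothetical field name: "type")."""
    return "audit" if event.get("type") == "audit" else "default"


def ship(events: list[dict], sinks: dict[str, list]) -> None:
    """Append each event to the sink chosen by route()."""
    for event in events:
        sinks[route(event)].append(event)


# Lists stand in for real destinations (e.g. a durable store vs. a log index).
sinks = {"audit": [], "default": []}
ship(
    [
        {"type": "audit", "action": "role_changed", "user_id": "u_1"},
        {"type": "request", "path": "/health"},
        {"action": "untyped event"},
    ],
    sinks,
)
```

Events without a `type` fall through to the default stream, so nothing is silently dropped.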
Saying they are all the same when no fidelity is lost is missing the point. The only distinction between logs, traces, and metrics is literally what to do when fidelity is lost.
If you have insufficient ingestion rate:
Logs are for events that can be independently sampled and be coherent. You can drop arbitrary logs to stay within ingestion rate.
Traces are for correlated sequences of events where the entire sequence needs to be retained to be useful/coherent. You can drop arbitrary whole sequences to stay within ingestion rate.
Metrics are pre-aggregated collections of events. You pre-limited your emission rate to fit your ingestion rate at the cost of upfront loss of fidelity.
If you have adequate ingestion rate, then you just emit your events bare and post-process/visualize your events however you want.
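To make the distinction concrete, here is a rough sketch of the three drop strategies. The event shape (`id`, `trace_id`, `status`) and the helper names are assumptions for illustration, not anyone's actual pipeline.

```python
import hashlib
from collections import Counter


def _bucket(key: str) -> float:
    # Deterministic hash mapped to [0, 1), so every process makes the same call.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def keep_log(event_id: str, rate: float) -> bool:
    """Logs: each event is sampled independently of every other event."""
    return _bucket(event_id) < rate


def keep_trace(trace_id: str, rate: float) -> bool:
    """Traces: decide once per trace_id, so a sequence is kept or dropped whole."""
    return _bucket(trace_id) < rate


def aggregate(events: list[dict]) -> Counter:
    """Metrics: pre-aggregate at the source; the individual events are gone."""
    return Counter(e["status"] for e in events)


# 30 events, 3 per trace, every fifth one a 500.
events = [
    {"id": f"e{i}", "trace_id": f"t{i // 3}", "status": 500 if i % 5 == 0 else 200}
    for i in range(30)
]
sampled_logs = [e for e in events if keep_log(e["id"], 0.5)]
sampled_traces = [e for e in events if keep_trace(e["trace_id"], 0.5)]
status_counts = aggregate(events)
```

Whatever survives in `sampled_traces` is always a complete 3-event sequence, while `sampled_logs` can contain any subset of a sequence.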
> If you have insufficient ingestion rate
I would rather fix this problem than every other problem. If I'm seeing backpressure, I'd prefer to buffer locally on disk until the ingestion system can get caught up. If I need to prioritize signal delivery once the backpressure has resolved itself, I can do that locally as well by separating streams (i.e. priority queueing). It doesn't change the fundamental nature of the system, though.
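As a sketch of the "separate streams / priority queueing" part: the buffer below is in-memory only to keep the example short (the comment above talks about buffering on disk), and `PriorityBuffer` and the priority numbers are hypothetical.

```python
import heapq
import itertools


class PriorityBuffer:
    """Holds events during backpressure; lower priority number drains first."""

    def __init__(self):
        self._heap = []
        # Sequence number breaks ties, keeping FIFO order within a priority.
        self._seq = itertools.count()

    def put(self, priority: int, event: dict) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), event))

    def drain(self, send, budget: int) -> int:
        """Send up to `budget` events once the ingester accepts writes again."""
        sent = 0
        while self._heap and sent < budget:
            _, _, event = heapq.heappop(self._heap)
            send(event)
            sent += 1
        return sent


buf = PriorityBuffer()
buf.put(2, {"msg": "debug detail"})
buf.put(0, {"msg": "audit record"})
buf.put(2, {"msg": "more debug detail"})

delivered = []
sent_count = buf.drain(delivered.append, budget=2)
```

The emitting code never changes; only the order in which the backlog is flushed does.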
> Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally. Today, a single user request might touch 15 services, 3 databases, 2 caches, and a message queue.
If a user request is hitting that many things, in my view, that is a deeply broken architecture.
I'm building an analytics SaaS and we made the conscious decision to keep it simple: Next.js API routes + Supabase + minimal external services. A single page view hits maybe 3 components max (CDN -> App -> Database).
That said, I agree completely on structured logging with rich context. We include user_id, session_id, and event_type on every log line. Makes debugging infinitely easier.
The "wide events" concept is solid, but the real win is just having consistent, searchable structure. You don't need a revolutionary new paradigm - just stop logging random strings and use JSON with a schema.
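Something like the following is what "JSON with consistent fields" can look like with Python's standard `logging` module. The formatter class and logger name are made up; the three field names are the ones mentioned above.

```python
import io
import json
import logging

REQUIRED_FIELDS = ("user_id", "session_id", "event_type")


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with the same keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in REQUIRED_FIELDS:
            # Fields arrive via `extra=`; missing ones become null so every
            # line has the same shape and stays searchable.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


# StringIO stands in for stdout or a file so the output can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("structured_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(
    "page viewed",
    extra={"user_id": "u_1", "session_id": "s_9", "event_type": "page_view"},
)
line = json.loads(stream.getvalue())
```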
Persisting a data schema that represents business events is a great idea. That’s more about Event Sourcing though and doing that can answer a ton of questions about the system without doing it in log messages.
Wide events as a strategy is expensive, even with sampling, and doesn’t address the fundamental problem - why do we log messages?
I was hoping the article would enumerate why we log messages. Nailing down those scenarios first will lead to a happy life.
Why do we log?
- proof of life
- is the system running?
- what is the state (in memory) when an error occurred?
- when did an error occur?
- do I need to get up at 2 am and fix something?
- what do I need to fix?
I feel like every team operating a system has their own reasons for logging.
> Logging Sucks
But does it? Or is it bad logging, or excessive logging, or unsearchable logs?
A client of mine uses SnapLogic, which is a middleware / ETL that's supposed to run pipelines in batch mode to pass data around between systems. It generates an enormous amount of logs that are so difficult to access, search and read that they may as well not exist.
We're replacing all of that with simple Python scripts that do the same thing and generate normal simple logs with simple errors when something's truly wrong or the data is in the wrong format.
Terse logging is what you want, not an exhaustive (and exhausting) torrent of irrelevant information.
Just out of curiosity, how have you seen risk/compliance, regulatory, and audit departments at organizations deal with the disconnect between security and privacy for something like mainframe logging (e.g., JES2, JES3), which is typically inherently governed, and modern distributed logging, which is typically inherently permissive? Both are vastly different approaches, but each is somehow considered 'compliant.' Btw, employees at a company I was at were once investigated for insider trading simply because it was discovered the company used pooled logs that were accessible by production support programmers (the company decided to override the default mainframe security), which was deemed a possible source of insider trading information that could be tapped into by those who had log access (programmers were eventually cleared if it was discovered their small personal trades were immaterial and just coincidental with the company's trading, but the investigation led to uncomfortable confrontations for some!).
> Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally. Today, a single user request might touch 15 services, 3 databases, 2 caches, and a message queue. Your logs are still acting like it's 2005.
If a user request is hitting that many things, in my view, that is a deeply broken architecture.
> If a user request is hitting that many things, in my view, that is a deeply broken architecture.
Whether we want it or not, a lot of modern software looks like that. I am also not a particular fan of building software this way, but it's a reality we're facing. In part it's because quite a few services that people used to build in-house are now outsourced to PaaS solutions. Even basic things such as authentication are more and more moving to third parties.
> but it's a reality we're facing.
Yes. Most software is bad
The incentives between managers and technicians are all wrong
Bad software is more profitable, over the time frames managers care about, than good software
I don't think the reason we end up with very complex systems is the incentives between "managers and technicians". If I were to put my finger on it, I would assume it's the very technicians who argued themselves into a world where increased complexity and more dependencies are seen as a good thing.
Fighting complexity is deeply unpopular.
At least in my place of work, my non-technical manager is actually on board with my crusade against complex nonsense. Mostly because he agrees it would increase feature velocity to not have to touch 5 services per minor feature. The other engineers love the horrific mess they've built. It's almost like they're roleplaying working at Google and I'm ruining the fun.
> If a user request is hitting that many things, in my view, that is a deeply broken architecture.
Things can add up quickly. I wouldn't be surprised if some requests touch a lot of bases.
Here's an example: a user wants to start renting a bike from your public bike sharing service, using the app on their phone.
This could be an app developed by the bike sharing company itself, or a 3rd party app that bundles mobility options like ride sharing and public transport tickets in one place.
You need to authenticate the request and figure out which customer account is making the request. Is the account allowed to start a ride? They might be blocked. They might need to confirm the rules first. Is this ride part of a group ride, and is the customer allowed to start multiple rides at once? Let's also get a small deposit by putting a hold of a small sum on their credit card. Or are they a reliable customer? Then let's not bother them. Or is there a fraud risk? And do we need to trigger special code paths to work around known problems for payment authorization for cards issued by this bank?
Everything good so far? Then let's start the ride.
First, let's lock in the necessary data. Which rental pricing did the customer agree to? Is that actually available to this customer, this geographical zone, for this bike, at this time, or do we need to abort with an error? Otherwise, let's remember this, so we can calculate the correct rental fee at the end.
We normally charge an unlock fee in addition to the per-minute price. Are we doing that in this case? If yes, does the customer have any free unlock credit that we need to consume or reserve now, so that the app can correctly show unlock costs if the user wants to start another group ride before this one ends?
Ok, let's unlock the bike and turn on the electric motor. We need to make sure it's ready to be used and talk to the IoT box on the bike, taking into account the kind of bike, kind of box and software version. Maybe this is a multistep process, because the particular lock needs manual action by the customer. The IoT box might have to know that we're in a zone where we throttle the max speed more than usual.
Now let's inform some downstream data aggregators that a ride started successfully. BI (business intelligence) will want to know, and the city might also require us to report this to them. The customer was referred by a friend, and this is their first ride, so now the friend gets his referral bonus in the form of app credit.
Did we charge a non-refundable unlock fee? We might want to invoice that already (for whatever reason; otherwise this will happen after the ride). Let's record the revenue, create the invoice data and the PDF, email it, and report this to the country's tax agency, because that's required in the country this ride is starting in.
Or did things go wrong? Is the vehicle broken? Gotta mark it for service to swing by, and let's undo any payment holds. Or did the deposit fail, because the credit card is marked as stolen? Maybe block the customer and see if we have other recent payments using the same card fingerprint that we might want to proactively refund.
That's just off the top of my head, there may be more for a real life case. Some of these may happen synchronously, others may hit a queue or event bus. The point is, they are all tied to a single request.
So, depending on how you cut things, you might need several services that you can deploy and develop independently.
- auth,
- core customer management, permissions, ToS agreement,
- pricing,
- geo zone definitions,
- zone rules,
- benefit programs,
- payments and payment provider integration,
- app credits,
- fraud handling,
- ride management,
- vehicle management,
- IoT integration,
- invoicing,
- emails,
- BI integration,
- city hall integration,
- tax authority integration,
- and an API gateway that fronts the app request.
These do not have to be separate services, but they are separate enough to warrant it. They wouldn't be exactly micro either.
Not every product will be this complicated, but it's also not that out there, I think.
> These do not have to be separate services, but they are separate enough to warrant it.
All of this arises from your failure to question this basic assumption though, doesn't it?
One thing this is missing: Standardization and probably the ECS' idea of "related" fields.
A common problem in a log aggregation is the question if you query for user.id, user_id, userID, buyer.user.id, buyer.id, buyer_user_id, buyer_id, ... Every log aggregation ends up being plagued by this. You need standard field names there, or it becomes a horrible mess.
And for a centralized aggregation, I like ECS' idea of "related". If you have a buyer and a seller, both with user IDs, you'd have a `related.user.id` with both IDs in there. This makes it very simple to say "hey, give me everything related to request X" or "give me everything involving user Y in this time frame" (as long as this is kept up to date, naturally)
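A small sketch of the "related" idea, using the field name from the comment above rather than claiming anything about the exact ECS field set. The `buyer`/`seller` event shape and both helper functions are assumptions.

```python
def with_related(event: dict) -> dict:
    """Copy every user id on the event into one well-known list field."""
    parties = (event.get("buyer"), event.get("seller"))
    ids = sorted({party["id"] for party in parties if party})
    return {**event, "related": {"user": {"id": ids}}}


def involving(events: list[dict], user_id: str) -> list[dict]:
    """Query one field, regardless of the role the user played."""
    return [e for e in events if user_id in e["related"]["user"]["id"]]


events = [
    with_related({"action": "order", "buyer": {"id": "u1"}, "seller": {"id": "u2"}}),
    with_related({"action": "refund", "buyer": {"id": "u3"}, "seller": {"id": "u1"}}),
    with_related({"action": "signup", "buyer": {"id": "u4"}}),
]
u1_events = involving(events, "u1")
```

The query no longer needs to know whether `u1` was the buyer or the seller.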
I actually wrote my bachelor's thesis on this topic, but instead of going the ECS route (which still has redundant fields in different components) I went in the RDF direction. That system has shifted towards more of a middleware/database hybrid over time (https://github.com/triblespace/triblespace-rs). I always wonder if we'd actually need logging if we had more data-oriented stacks where the logs fall out as a natural byproduct of communication and storage.
I always wondered why we didn't have some kind of fuzzy English word search regexes/tool that is robust to typing mistakes, spelling mistakes, synonyms, plurals, conjugation etc.
I've recently added error tracking to my self-hosted analytics app (UXWizz), and the way I did it is simply add extra events to each user/session. Once you have the concept of a session or user, you can simply attach errors or logs as Events stored for that user. This solves the main problem mentioned in the article, where you don't know what happened, plus being an Event stored in a MySQL database, you can still query it.
Why not simply use Events for logging, instead of plain strings?
This was a brilliant write up, and loved the interactivity.
I do think "logs are broken" is a bit overstated. The real problem is unstructured events + weak conventions + poor correlation.
Brilliant write up regardless
"Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally."
But the next era will be like the previous one. Today a monolith is enough for most apps.
The article, AI or not, is extremely naive. It doesn't mention any premise or any problem to solve. Proposes a solution and just goes with it. What if your monster of an event is lost when your service crashes or is lost by the logging library/service/etc? What if you're interested in measuring, post factum, how long each step takes? What if you want to trace a log through several (micro-)services and maybe between a mobile app and some batch job executor that runs once a day?
"Logging sucks" when you don't understand the problem you're trying to solve.
Use events instead of repetitious logging calls.
https://dave.autonoma.ca/blog/2022/01/08/logging-code-smell/
> No grep-ing.
How is grep a bad thing? I find myself using it all the time.
I’m not into graphical user interfaces. They overwhelm me. By the time I’ve clicked myself through the GUI or written some horrible proprietary $COMPANY Query Language string, I might have already figured out the bug using tried and tested CLI tools.
This seems like a classic time vs space trade off.
Instead of reconstructing a "wide event" from multiple log lines with the same request id, the suggestion seems to be logging wide events repeatedly to simplify reconstruction from request ids.
I personally don't see the advantage, and in either scenario, if you're not logging what's needed you're screwed.
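For reference, the "reconstruct at read time" side of that trade-off is roughly this, assuming every line carries a `request_id` (the `widen` helper and the sample fields are made up):

```python
from collections import defaultdict


def widen(lines: list[dict]) -> dict[str, dict]:
    """Merge all lines sharing a request_id into one wide record per request."""
    wide = defaultdict(dict)
    for line in lines:
        fields = {k: v for k, v in line.items() if k != "request_id"}
        # Later lines overwrite earlier ones when a key repeats.
        wide[line["request_id"]].update(fields)
    return dict(wide)


lines = [
    {"request_id": "r1", "user_id": "u1"},
    {"request_id": "r2", "user_id": "u2"},
    {"request_id": "r1", "cart_total_cents": 4200},
    {"request_id": "r1", "status": 500, "error": "card_declined"},
]
wide_events = widen(lines)
```

If a request dies halfway, `widen` still returns whatever lines made it out, which is the part a single end-of-request event cannot give you.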
Structured Logging is not just JSON. It's the use of templates with context. It solves 90% of what this article complains about if you just log the template along with the variables and the message separately. Along with logging the right stuff. For example, `"User {username} created order {orderid}"`
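In Python terms, a bare-bones version of "log the template, the variables, and the message separately" might look like this. `log_event` is a hypothetical helper, not any particular library's API.

```python
def log_event(template: str, **fields) -> dict:
    """Keep the template and its variables alongside the rendered message."""
    return {
        "template": template,  # stable key to group and count by
        "fields": fields,  # individually queryable values
        "message": template.format(**fields),  # human-readable line
    }


entry = log_event(
    "User {username} created order {orderid}",
    username="ada",
    orderid=1042,
)
```

Grouping by `template` gives you "how often does this happen" without regex-matching the rendered message, and `fields` gives you the filter columns.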
AI slop blogvert. The first example is disingenuous btw. Everyone these days uses requestIDs to be able to query all log lines emanated by a single request, usually set by the first backend service to receive the request and then propagated using headers (and also set in the server response).
There isn't anything radical about his proposed solutions either. Most log storage can be set with a rule where all warning logs or above can be retained, but only a sample of info and debug logs.
The "key insight" is also flawed. The reason why we log at every step is because sometimes your request never completes and it could be for 1000 reasons but you really need to know how far it got in your system. Logging only a summary at the end is happy path thinking.
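The retention rule mentioned above (keep everything at warning and above, sample the rest) is a few lines with the standard `logging` module. This sketch applies it at emit time rather than in the log store, and the class name and `rate` parameter are made up.

```python
import logging
import random


class SampleBelowWarning(logging.Filter):
    """Keep every WARNING-or-higher record; keep lower levels with probability `rate`."""

    def __init__(self, rate: float, rng=None):
        super().__init__()
        self.rate = rate
        # An injectable RNG makes the sampling reproducible if needed.
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return self.rng.random() < self.rate


def make_record(level: int) -> logging.LogRecord:
    # Build a bare record by hand so the filter can be exercised directly.
    return logging.LogRecord("demo", level, __file__, 0, "msg", None, None)


drop_all_low = SampleBelowWarning(rate=0.0)
kept_info = drop_all_low.filter(make_record(logging.INFO))
kept_error = drop_all_low.filter(make_record(logging.ERROR))
```

Attach it with `handler.addFilter(SampleBelowWarning(rate=0.1))` to apply it to everything a handler emits.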
Kinda get what he’s saying: provide more metadata with structured logging as opposed to lots of string only logs. Ok, modern logging frameworks steer you towards that anyway. But as a counterpoint: often it can be hard to safely enrich logging like that. In the example they include subscription age, user info, etc. More than once I’ve seen logging code look up metadata or assume it existed, only to cause perf issues or outright errors as expected data didn’t exist. Similar with sampling, it can be frustrating when the thing you need gets sampled out. In the end “it depends” on scenario, but I still find myself not logging enough or else logging too much.
> Today, a single user request might touch 15 services, 3 databases, 2 caches, and a message queue.
And this is why _the internet_ today sucks.
The problem statement in this article sounds weird. I thought in 2025 everyone logs at least thread id and context id (user id, request id etc), and in microservice architecture at least transaction or saga id. You don’t need structured logging, because grep by this id is sufficient for incident investigation. And for analytics and metrics databases of events and requests make more sense.
Our logging guidance is: "Don't write comments, write logs" and that serves us pretty well. The point being, don't write "clever code", write obvious code, and try to make it similar to everything else that's been done, regardless of whether you agree with it.
Good write up.
Gonna go on a tangent here. Why the single purpose domain? Especially since the author has a blog. My blog is full of links to single post domains that are no longer.
Because it's an ad
it's an ad, for what?
i do not see a product upsell anywhere.
if it's an ad for the author themselves, then it's a very good one.
At the end there's a form where you can get a "personalized report", I have a feeling that'll advertise some kind of service, it's usually the case.
the best implementation of structured logging I've seen is dotnet build's binlogs (https://msbuildlog.com), I would love to see it evolve into a general purpose logging solution
You might also need different systems for low-cardinality, low-latency production monitoring (where you want to throw alerts quickly and high cardinality fields would just get in the way), and medium to long term logging with wide events.
Also if you're going to log wide events, for the sake of the person querying them after you, please don't let your schema be an ad hoc JSON dict of dicts, put some thought into the schema structure (and better have a logging system that enforces the schema).
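One cheap way to get an enforced schema instead of an ad hoc dict of dicts is to make the event a typed object. A sketch with a dataclass follows; the `RequestEvent` fields are invented for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RequestEvent:
    """The only shape the logger accepts; unknown fields fail at construction."""

    request_id: str
    user_id: str
    status: int
    duration_ms: float

    def __post_init__(self):
        if not 100 <= self.status <= 599:
            raise ValueError(f"status out of range: {self.status}")


def emit(event: RequestEvent) -> str:
    """Serialize with sorted keys so every line has a stable layout."""
    return json.dumps(asdict(event), sort_keys=True)


line = emit(RequestEvent(request_id="r1", user_id="u1", status=200, duration_ms=12.5))
```

A typo such as `userId=` raises `TypeError` at the call site instead of creating yet another field name in the log store.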
From what I gather: This is referring to Web sites or other HTTP applications which are internally implemented as a collection of separate applications/ micro-services?
I've generally found that structured logs that include a correlation ID make it quite easy to narrow down the general area or exact cause of problems. Usually (in enterprise orgs) via Splunk or Datadog.
Where I've had problems it's usually been one of:
There wasn't anything logged in the error block. A comment saying "never happens" is often discovered later :)
Too much was logged and someone mandated dialing the logging down to save costs. Sigh.
A new thread was started and the thread-local details including the correlation ID got lost, then the error occurred downstream of that. I'd like better solutions for that one.
Edit: Incidentally a correlation ID is not (necessarily) the same thing as a request ID. An API often needs to allow for the caller making multiple calls to achieve an objective; 5 request IDs might be tied to a single correlation ID.
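For the lost-correlation-ID case, the Python analogue is `contextvars`: depending on the Python version and build, a new thread may start with an empty context, so the sketch below copies the caller's context explicitly. `run_in_thread` and the variable names are made up for illustration.

```python
import contextvars
import threading

correlation_id = contextvars.ContextVar("correlation_id", default=None)


def run_in_thread(fn, *args) -> threading.Thread:
    """Start fn in a new thread inside a copy of the caller's context."""
    ctx = contextvars.copy_context()
    thread = threading.Thread(target=ctx.run, args=(fn, *args))
    thread.start()
    return thread


seen = []


def worker() -> None:
    # A logging call here would pick up the same ID as the parent thread.
    seen.append(correlation_id.get())


correlation_id.set("corr-123")
run_in_thread(worker).join()
```

The copy is a snapshot, so anything the worker sets on `correlation_id` stays local to the worker.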
Java has a solution for the thread problem: Scoped Values [0]. If only the logging+tracing libraries would start using it...
[0] https://openjdk.org/jeps/506
Oh, excellent, these slipped under my radar. Sounds extremely promising and I do mostly work in Java!
I see more and more blog posts that contain interactive elements. Despite the general enshittification of the average blog and the internet, this feels like a 'modern' touch that actually adds something valuable to the sufficient ad-free no-popups old blog style.
this is the best lead generation form i've ever seen
Sounds like he’s just asking for an old school Inman style transaction log.
Slapping on OpenTelemetry actually will solve your problem.
Point #1 isn't true, auto instrumentation exists and is really good. When I integrate OTel I add my own auto instrumentors wherever possible to automatically add lots of context. Which gets into point #2.
Point #2 also isn't true. It can add business context in a hierarchical manner and ship wide events. You shouldn't have to tell every span all the information again. Just where it appears naturally the first time.
Point #3 also also isn't true because OTel libs make it really annoying to just write a log message and very strongly pushes you into a hierarchy of nested context managers.
Like the author's ideal setup is basically using OTel with Honeycomb. You get the querying and everything. And unlike rawdogging wide events all your traces are connected, can span multiple services and do timing for you.
Maybe better written and simplified to: “microservices suck”.
AI writing sucks even more, get rekt
This article is attacking a strawman. It makes up terrible logs and then says they are bad. Even if this was a single monolith the logs still don't include even something like a thread id, to avoid mixing different requests together.
I see logs worse than that on the daily.
Overly dismissive of OTLP without proper substance to the criticism.
In some languages the tracing frameworks are a godsend. In Rust the instrument macro will automatically record all function arguments as span tags. Plonk anything into e.g. Jaeger and any full trace can be looked up from pretty much any value.
distributed event id and you are all set
> Your logs are lying to you. Not maliciously. They're just not equipped to tell the truth.
The best way to equip logs to tell the truth is to have other parts of the system consume them as their source of truth.
Firstly: "what the system does" and "what the logs say" can't be two different things.
Secondly: developers can't put less info into the logs than they should, because their feature simply won't work without it.
That doesn't sound like a good plan. You're coupling logging with business logic. I don't want to have to wonder whether changing a debug string is going to break something.
You're also assuming your log infrastructure is a lot more durable than most are. Generally, logging is not a guaranteed action. Writing a log message is not normally something where you wait for a disk sync before proceeding. Dropping a log message here or there is not a fatal error. Logs get rotated and deleted automatically. They are designed for retroactive use and best effort event recording, not assumed to be a flawless record of everything the system did.
Your logic wouldn't be dependent on a debug string, but some enum in a structured field. Ex, event_type: CREATED_TRANSACTION.
Seeing logging as debugging is flawed imo. A log is technically just a record of what happened in your database.
Tangential, but I wonder if the given example might be straying a step too far? Normally we want to keep sensitive data out of logs, but the example includes a user.lifetime_value_cents field. I'd want to have a chat with the rest of the business before sticking something like that in logs.
In some companies, this type of information is often very important and very easily available to everyone at all levels of the business to help prioritize and understand customer value. I would not consider it "sensitive" in the same way that e.g. PII would be.
Good to know! At previous jobs, that information wasn't available to me (and it didn't matter because the customer bases were small enough that every customer was top priority), so I assumed it was considered more sensitive than it perhaps is.
The substance of this post is outstanding.
The framing is not, though. Why does it have to sound so dramatic and provocative? It’s insulting to its audience. Grumpiness, in the long term, is a career-limiting attitude.
Career-limiting perhaps (if expressing normal human emotion is a minus inside of an organization, it may be time to bail) but some of the best minds I've met/observed were absolute curmudgeons (with purpose—they were properly bothered by a problem and refused to go along with the "sweep it under the rug" behavior).
Sure, I've dealt with plenty of assholes, too, but the grumps are usually just tired of their valid insight being ignored by more foolish, orthogonally incentivized types (read: "playing the game" not "making it work well").
We know these people exist, but I also believe most of us would prefer to work with a person who's both smart and kind over someone who's smart and curmudgeonly. It is possible to be both smart and kind, and I've had the pleasure of working with such people.
Assholes can sap an organization's strength faster than any productive value their intelligence can provide. I'm not suggesting the author is an asshole, though; there's not enough evidence from this post.
I get the AI feeling from it.
It might have been AI-assisted, and it might not have been. It doesn’t really matter. The author is ultimately responsible for the end result.
Some excellent points raised in this article.
Splunk is expensive but it makes searching logs so much faster and more effective. I think of it as SQL for unstructured data.
loki works great too and is FOSS
Logfiles are a user interface.