All software should provide something meaningful for anybody to diagnose, if they’re inclined to. It’s particularly bad in the (Apple) mobile ecosystem, including AppleTV.
I have AdGuard Home but one of my spouse’s streaming services wouldn’t work. “There was a problem.” Gee thanks. Eventually figured out that I had to unblock a few hosts so it would work. Only found which ones by googling and finding some other poor soul who fixed it and documented it.
Apple is all about walled-off, locked-down, black box, just-works (when it does) etc. It's supposed to seem like magic. You're not supposed to tinker with magic, it makes it pedestrian. Apple as a brand is a lifestyle, a feeling. The slick, polished brand. Remember "I'm a Mac, and I'm a PC"? PC is where you tinker, and there are screws and nuts and bolts and jargon and troubleshooting etc. In Apple land, you just take it to a slick genius bar and they do their magic. Or you just buy a new one.
As a European I'm always baffled how Apple got so much market share among the actual techies and power users in the US. You do it to yourself by buying this stuff. It's for people who don't want to spend one second thinking about actual technical issues.
You're baffled because you appear to be uninformed and/or willfully ignorant. macOS is Unix-based and 90% functionally equivalent to Linux for software development and tinkering purposes. iOS, while less customizable than Android, is overall very good software for a phone. Apple hardware is superior across the board, especially for durability.
Meanwhile, I'm baffled why any techie would voluntarily use an OS that force-enables telemetry and advertising. The fight for privacy and ad-free experiences is hard enough without your OS fundamentally working against you.
As someone who came up in the Slashdot M$ era, if nothing else the PR and communication style of Satya is a masterclass in delivering a message to the public. The dude presents like a Zen master. The message is baffling and the strategy is nonexistent, but people think there’s a new gentle Microsoft.
Somehow angry Europeans (at least in this thread) are running into the embrace of Windows as the defender of the tinkerers. Certainly not on my bingo card.
There's a difference between Apple's mobile devices which are an actual walled garden, and Mac OS which (begrudgingly) still lets you install and run pretty much anything. It has a nice terminal, no driver issues, and is not nearly as distracting and annoying as modern Windows (still has more than enough bugs and quirks though). And once update support runs out I can install Linux on it.
iPads are a completely different world and really feel not just restrictive but the whole ecosystem constantly tries to push you towards subscriptions for everything, including the OS which conveniently offers the only sane backup solution that can cover all apps. It incentivises content consumption and giving up control over one's data. Not my cup of tea.
There is a self-regulating loop that Apple users quickly learn not to "draw outside the lines" and just use the thing as designed and intended by Apple. If you use stuff like AdGuard, custom DNS etc, that's tinkerer tier stuff. A good Apple user either watches the ads or pays not to see them.
I haven’t seen a YouTube ad on my machine in years. I download all the videos that I watch and skip through the ads that content creators bake in. I control my dns and network to restrict what can get to my browser and other apps. I have a highly customized Bash environment (I see no reason to switch to zshell when I’ve got Homebrew).
But paint the nerds who like MacOS and the wonderful third-party app ecosystem of developers who care about fit and finish as a bunch of mindless rubes if it makes you feel better.
> As a European I'm always baffled how Apple got so much market share among the actual techies and power users in the US.
I know exactly how this happened, I was there. It filled a gap for a practical desktop UNIX when none existed.
In the old days, there were many flavors of proprietary UNIX (Solaris, IRIX, HP-UX, AIX, et al.) plus a few open source versions like FreeBSD and early Linux. The early Internet was a purely UNIX world (it still mostly is), but UNIX was a fragmented market of dozens of marginally interoperable OSes.
During the dotcom boom, Solaris on Sparc became the gold standard for large servers. These are very expensive machines and not particularly user friendly. If you were a dev in those days, you were either using some type of Sparc workstation or FreeBSD or Linux (which wasn’t very good in those days). You wanted your desktop environment to be UNIX-ish but the good + cheap options were limited. Linux became better on the server and started to displace FreeBSD there but was still very limited as a desktop OS. Linux was much worse than Windows NT on the desktop at the time but Windows NT wasn’t UNIX.
MacOS X came along and offered UNIX on the desktop with a far better experience than Linux (or any other UNIX) on the desktop, and much cheaper than a Solaris workstation. It filled a clear gap in the market, and so Silicon Valley moved from a mix of Solaris and Linux desktops for development to MacOS X desktops, which were better in almost every way for the average dev. It was UNIX and it ran normal business applications like Microsoft Office.
MacOS X was a weaker UNIX than many of the other UNIX OS but it offered a desktop that didn’t suck and it was cheap. For someone that had been using Linux or Solaris at the time, which many devs were, it was a massive upgrade.
MacOS still kind of sucks as a UNIX but that’s okay because we don’t use it as a server. Silicon Valley needed a competent UNIX desktop that didn’t cost a fortune and Apple delivered.
Apple is just a remote UNIX system for manipulating the other UNIX systems your code actually runs on.
People make decisions based on their own value system. I’m glad to have choices. I can get everything done with the tools we call computers.
When I view the logs on my Apple systems they make sense to me. One does have to understand the logs which implies understanding the system under diagnosis.
> As a European I'm always baffled how Apple got so much market share among the actual techies and power users in the US.
Linux, historically, was terrible and then some; lots of us simply want to get on with life and not dork with the OS every day. If you didn't want to use Windows at your day job, that left OS X.
And, for a while, Apple hardware was quite nice. For a remarkably long time, Apple's high-resolution laptop displays were way cheaper than the competition's. The trackpads have always been far superior on Apple hardware to those on Linux machines. And then the M-series came along and was also quite nice.
However, over time Linux has gotten better so it's now functional as a daily driver and reasonably reliable. And macOS has deteriorated until it's now probably below Linux in terms of reliability.
So, here we are. macOS and Windows do seem to be losing share to Linux, but only Linux cares. At this point, desktop/laptop revenue is dwarfed by everything else at both Microsoft and Apple.
> It’s particularly bad in the (Apple) mobile ecosystem
It's been years since I've significantly used Apple software, but when I had to use a Mac at work, or helped friends or family troubleshoot some problem on Mac OS, I had a similar experience. When things didn't "just work", it was very difficult to figure out why.
Rather than indulging the inevitable argument that most users never read log messages, I hope we can remember a more important fact:
Some users do read log messages, just as some users file useful bug reports. Even when they are a tiny minority, I find their discoveries valuable. They give me a view into problems that my software faces out there in the wilds of real-world use. My log messages enable those discoveries, and have led to improvements not only in my own code, but also in other people's projects that affect mine.
This is part of why I include a logging system and (hopefully) understandable messages in my software. Even standalone tools and GUI applications.
(And since I am among the minority who read log messages, I appreciate good ones in software made by other people. Especially when they allow me to solve a problem immediately, on my own, rather than waiting days or weeks to get the developer's attention.)
For years now I’ve been pushing to move all non-actionable error messages, and all aggregate-actionable error messages, into telemetry data instead.
Not least because log-processing SaaS companies seem to be overcharging for their services even versus hosted Grafana services, and really many of us could do away with the rent seeking entirely.
The computational complexity of finding meaning in log files versus telemetry data leans toward this always being the case. It will never change except in brief cases of VC money subsidizing your subscription.
If one error shouldn’t trigger operator action, but 1000 should, that’s a telemetry alert, not a Datadog or Splunk problem.
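That distinction can be sketched as a Prometheus-style alerting rule; the metric name (`app_errors_total`) and the threshold are invented for illustration:

```yaml
# Hypothetical Prometheus alerting rule: one error is noise, a sustained
# flood is operator-actionable. Metric name and numbers are made up.
groups:
  - name: example
    rules:
      - alert: ErrorFlood
        expr: sum(increase(app_errors_total[5m])) > 1000
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error volume crossed the aggregate-actionable threshold"
```

Individual errors still exist as counter increments, but no one gets paged until the aggregate crosses the line.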
That is the $65k question and unfortunately I don't have a pat answer for that yet. I probably need to see more types of projects instead of more time on fewer projects which is where I'm at.
But I can give you a partial picture.
You're going to end up with multiple dashboards with duplicate charts on them, because you're showing correlation between two charts via proximity, especially charts in the same column in rows n±1 or vice versa. You're trying to show whether a correlation is likely to be causation or not. Grafana has a setting that shows the crosshairs on all graphs at the same time, but they need to be in the same viewport for the user to see them.

Generally, for instance, error rates and request rates are proportional to each other, unless a spike in error rates is triggered by, say, web crawlers that are now hitting you with 300 req/s each whereas they normally send 50. The difference in the slope of the lines can tell you why an alert fired, or that it's about to. So I let previous RCAs inform whether two graphs need to be swapped because we missed a pattern that spanned a viewport. And sometimes after you fix tech debt, the correlation between two charts goes up or way down. So what was best in May may not be best come November.
There's a reason my third monitor is in portrait mode, and why that monitor is the first one I see when I come back to my desk after being AFK. I could fit 2 dashboards and group chat all on one monitor. One dashboard showed overall request rate and latency data, the other showed per-node stats like load and memory. That one got a little trickier when we started doing autoscaling. The next most common dashboard which we would check at intervals showed per-service tail latencies versus request rates. You'd check that one every couple of hours, any time there was a weird pattern on the other two, or any time you were fiddling with feature toggles.
From there things balkanized a bit. We had a few dashboards that two or three of us liked and the rest avoided.
Yeah, but that still doesn’t let you see “event A happened before event B, which led to C”. I’ve had many (>>1) bugs where having good logs let me investigate and resolve the issue quickly and easily, whereas telemetry would have left me searching around forever.
I feel like Splunk’s business model favors a healthy system and heavily disadvantages an unhealthy one. What I mean, as an example: when the system is unhealthy, I know it because all my Splunk queries get queued up, because everyone is slamming it with queries. I hate it.
But I’m stuck in knowing how to move some things to Prometheus. Like say we have a CustomerID and we want to track number of times something is done by user. If we have thousands of customers, cardinality breaks that solution.
This gets even worse if you have a language with one process per CPU as you can get clobbering other values on the same instance if you don't add fields to uniquely identify them.
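A rough back-of-the-envelope sketch of the cardinality blowup described above (all counts are invented):

```javascript
// Hypothetical illustration: why per-customer labels explode metric storage.
// In Prometheus, every unique combination of label values is its own time series.
const customers = 5000;      // cardinality of a CustomerID label
const actions = 10;          // distinct action types being counted
const instances = 8;         // app processes, e.g. one per CPU core

// Labeling a counter by {customer_id, action, instance} yields:
const seriesWithCustomer = customers * actions * instances;
console.log(seriesWithCustomer);    // 400000 series for ONE metric

// Dropping the customer label (tracking per-customer counts in logs or a
// database instead) keeps the metric cheap:
const seriesWithoutCustomer = actions * instances;
console.log(seriesWithoutCustomer); // 80 series
```

The usual answer is to keep high-cardinality identifiers out of metric labels entirely and push per-customer questions to logs or an OLAP store.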
We were initially told to just move our telemetry to AWS, but got a lot of pushback once they saw how OTel amplified data points and cardinality versus our old StatsD data.
You probably need less cardinality than you think, and stats are a mixed bag: some work fine with less frequent polling, while others, like heap usage, are terrible at 20- or 30-second intervals. Our Pareto frontier was to reduce the sampling rate of most stats and push per-process things like heap usage into histograms.
An aggregator per box can drop a couple of tags before sending them upstream which can help considerably with the number of unique values. (eg, instanceID=[0..31] isn't that useful outside of the box)
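A minimal sketch of that per-box aggregation, with invented metric and tag names:

```javascript
// Sketch of a per-box aggregator that sums away a high-cardinality tag
// (instanceID) before shipping points upstream.
function dropTag(points, tagToDrop) {
  const out = new Map();
  for (const p of points) {
    const tags = { ...p.tags };
    delete tags[tagToDrop];
    // Key by metric name + remaining tags so matching points merge.
    const key = p.name + JSON.stringify(Object.entries(tags).sort());
    const prev = out.get(key);
    if (prev) prev.value += p.value;
    else out.set(key, { name: p.name, tags, value: p.value });
  }
  return [...out.values()];
}

const raw = [
  { name: 'requests', tags: { host: 'web1', instanceID: '0' }, value: 40 },
  { name: 'requests', tags: { host: 'web1', instanceID: '1' }, value: 60 },
];
console.log(dropTag(raw, 'instanceID'));
// merges to a single point: { name: 'requests', tags: { host: 'web1' }, value: 100 }
```

Upstream now stores one series per host instead of one per process, at the cost of losing the ability to ask per-process questions remotely.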
I recently went all-in on the systemd ecosystem as much as I could on some recent hardware installs, and my biggest pet peeve is the double timestamps and double logs I find in journalctl... it's like they never intended you to read the logs...
One fairly common approach to this for systems, is to configure the system to ship the logs to an external collection mechanism (FluentBit, etc) and do so in JSON format.
While I see the point the author is trying to make, I'm not really sure I agree. Most users don't even read error messages, never mind logs. At best, logs are something they need for compliance, for most, the concept doesn't exist at all. I do agree that the logs should help you understand what went wrong and why, but in that regard the principle is the same for both sysadmins and developers and I don't really see the difference?
In my sysadmin work I curse every developer who makes me fire up strace, tcpdump, procmon, Wireshark, etc., because they couldn't be bothered to actually say what file couldn't be found, what TCP connection failed to be established, etc.
I get the impression that often it isn't laziness but the concept that error details leak information to an attacker and are therefore a vulnerability.
I disagree with this view, but it definitely exists.
Sysadmins need logs that tell them what action they can take to fix it. Developers need logs that tell them what a system is doing.

Generally a sysadmin needs to know "is there an action I can take to correct this", whereas a dev has the power to alter source code and thus needs to know what the system is doing and where.
> but in that regard the principle is the same for both sysadmins and developers and I don't really see the difference?
No, it's very different: developers generally want to know about things they control, so they want detailed debugging info about a program's internal state that could have caused a malfunction. Sysadmins don't care about that (they're fine with coalescing all the errors that developers care about under a general "internal error"), and they care about what in the program environment could have triggered the bug, so that they may entirely avoid it or at least deploy workarounds to sidestep the bug.
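A toy sketch of that split, with the same hypothetical failure rendered once for each audience (paths, phases, and names are all invented):

```javascript
// The same failure logged twice: once for the operator
// (environment-focused, workaround-oriented) and once for the developer
// (internal-state-focused).
function reportConfigError(path, err, state) {
  // Operator-facing: what in the environment triggered it, and what to try.
  const opMsg = `cannot read config ${path}: ${err}. ` +
    `Check that the file exists and is readable by this user.`;
  // Developer-facing: where in the program it happened and with what state.
  const devMsg = `loadConfig() failed in phase=${state.phase} ` +
    `retries=${state.retries} errno=${err}`;
  return { opMsg, devMsg };
}

const { opMsg, devMsg } =
  reportConfigError('/etc/app.conf', 'ENOENT', { phase: 'startup', retries: 2 });
console.log(opMsg);
console.log(devMsg);
```

Collapsing the developer detail into a generic "internal error" for the operator view is fine, as long as the environment-level message survives.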
> Most users don't even read error messages, never mind logs.
They don't need to. The log message is there so the helpdesk has something actionable, or so it can be copy-pasted into Google to find other people with a similar problem who may have found a solution.
> Most users don't even read error messages, never mind logs.
Yes, see all the questions on StackOverflow with people posting their error message without reading it, like “I got an error that says ‘fail! please install package xyz!’, what should I do?!?”.
I think that's being very generous. If you've ever been in tech support, you'll be amazed at how often you'll be asked what to do when the message already tells the user to do X.
If they don't know how to do X, then they should be able to look up how to do X. If it's something like installing a 3rd-party library, then that's not the first party's responsibility, especially OSS across different arches/distros. They are all different. Look up the 3rd party's repo and figure it out.
I've worked in tech support. I get that 25-50% of the cases appear to be "read the docs to me." But the majority of those are because the docs are poorly written, are overwhelming for new users, or because users don't understand them and won't admit that directly.
On Friday I got 2 calls saying "my phone is no longer showing me my emails, please fix" when the error message they received was roughly "please reenter your password to continue using Outlook".

On Wednesday I got a call saying "the CRM won't let me input this note, please fix" when the error message was "you have included an invalid character, '□' found in description. remove invalid characters and resubmit".
Oddly enough though, my journey into computers was greatly assisted by my curiosity at random log files that were being dumped to my desktop constantly.
From recent experience, I'm thinking logs need to be written for AI. Over the last few months, I've had a couple of issues where I took a bunch of logs from a bunch of interacting programs, pointed the AI at the logs and the source code, and it's been really effective at finding the problems, often seeing patterns that would have been really hard for me to spot in all the noise.
We have a magic button in servicenow that lets the L1 agent kick off a job that pulls telemetry from a user device and do an overall health check of the device. That input identifies the issue like 80% of the time if it’s a device issue.
It either gets resolved quicker by the L2 guy or dispatched to the third party hardware fix it guy or sent to some speciality L3 team. Resolution time is down like 60%.
My next goal is to assess disk and battery health in laptops and proactively replace if they hit whatever threshold we can push the vendors to accept. That could eliminate something like 30% of device related issues, which has a super high value.
This is the way I like to do it. I know bloating the logs too much can be a problem, but it's even worse if you're lacking information to reconstruct what happened when there ends up being a problem. And only providing that detail when there's an error isn't enough. What if the issue never triggered an error in the application and it was only caught later on either by a person seeing something was off or by an error a downstream system?
Also it's helpful to log before operations rather than after because if a step gets stuck it's possible to know what it's stuck on.
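The log-before-the-operation pattern might look like this minimal sketch (step names are invented):

```javascript
// Log BEFORE each operation as well as after, so that if a step hangs,
// the last line written identifies what it's stuck on.
const lines = [];
const log = (msg) => lines.push(`${new Date().toISOString()} ${msg}`);

function runStep(name, fn) {
  log(`starting: ${name}`);   // written before the work begins
  const result = fn();
  log(`finished: ${name}`);   // only reached if the step returns
  return result;
}

runStep('load config', () => ({ ok: true }));
// If a later step like 'migrate database' hung forever, the log would
// simply end at "starting: migrate database".
console.log(lines.length);                              // 2
console.log(lines[0].endsWith('starting: load config')); // true
```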
Depends a lot on the context and type of software.
For server side software where there is a sysadmin in charge of keeping it running I generally agree.
But for end user software (desktop, mobile, embedded) no one will read the logs, and there the logs can, and probably should, be aimed at the developers. Of course you can and should still provide usable and informative end-user-oriented error messages, but they're not the same thing as logs.
> But for end user software (desktop, mobile, embedded) no one will read the logs
Lots of end user software is used in an enterprise context where the helpdesk staff will have to read those logs. And for B2C (or retail, or amateur, whatever you want to call them) users, often they will go through online tutorials to try to self-diagnose because the developers are most of the time unreachable.
It doesn't. The detailed log might be nonsense to the user, but so is a generic error, and the difference is that the specific log message makes it far easier to find a solution than a generic one does.
A small subset of technical users do read logs. If a desktop app has a problem, I have a fighting chance of fixing it if I have logs. Error messages may not give the full picture; what was the app trying to do before the error occurred? Logs let me debug slowness and crashes.
> But if your software is successful (especially if it gets distributed to other people), most of the people running it won't be the developers, they'll only be operating it.
The biggest problem is that when you wrote the code for a 'totally obvious message', you yourself were in the context. Years, months, heck even weeks later, you stare at it and wonder 'why tf didn't I write something more verbose?'.
Anecdote: I wrote some supporting scripts to 'integrate' two systems three times, totally oblivious the second and third times that I had already done it. Both times I was somewhere around 60% done when I thought 'wait, I totally recognize this code, but I just wrote it! What in deja-vu-nation?!'.
For a FOSS Android app I co-develop, we go out of our way to make verbose logging efficient to collect and easy to share (one-click copy). I've seen users get good mileage out of asking an LLM just what has gone wrong. We are adding more structure to log messages and adding in as much state (like callstack) as possible with each log line, plus diagnostics from procfs on resources held (like memory, threads, fds).
The interesting edge case with AI agents: the "operator" collapses into whoever owns the agent, and the log's job changes fundamentally.
When a regular app logs an error, it's a passive record — the operator investigates at leisure. When an agent logs "I'm about to delete these 47 files — is that right?", it's an active interrupt. The log becomes a decision request, not an event record. "Waiting for human approval" is a semantically different thing than "ERROR: something failed."
Most agent setups treat this badly — write to stderr, fire a webhook, hope the human checks Slack. There's no canonical "agent pausing for human input" primitive in most stacks. It's logging's open problem for the agentic era.
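One hedged sketch of what such a primitive could look like, treating a decision request as a distinct record kind rather than an error; every field name here is invented, since no standard exists:

```javascript
// Model "waiting for human approval" as a first-class record kind,
// distinct from passive event records. Field names are hypothetical.
function emit(record) {
  return JSON.stringify({ ts: Date.now(), ...record });
}

// An ordinary event record: passive, investigated at leisure.
const event = emit({ kind: 'event', level: 'error', msg: 'upload failed' });

// A decision request: active, blocks the agent until answered, and must
// declare what happens if no human answers in time.
const decision = emit({
  kind: 'decision_request',
  action: 'delete_files',
  count: 47,
  timeout_s: 600,
  on_timeout: 'abort',
});

console.log(JSON.parse(event).kind, JSON.parse(decision).kind);
// event decision_request
```

A consumer can then route the two kinds differently: events to storage, decision requests to whatever interrupt channel reaches a human.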
This is a not-so subtle advantage JavaScript has over 90% of everything else: Chrome DevTools Protocol (CDP), which exists/is-great in-large-part thanks to JavaScript being an alive language. Of the Stop Writing Dead Programs variety (https://jackrusher.com/strange-loop-2022/, https://news.ycombinator.com/item?id=33270235). It's just astoundingly capable, so very richly exposes such a featureful runtime, across so many dimensions of tooling. REPL, logging, performance, heap, profile, storage, tracing and others, just for the core, before you get into the browser based things. https://chromedevtools.github.io/devtools-protocol/
This is such a core advantage to JavaScript: that it is an alive language. The runtime makes it very easy to change and modify systems as they run, and as an operator, that is so so so much better than a statically compiled binary, in terms of what is possible.
One of my favorite techniques is using SIGUSR1 to start the node debugger. Performance impact is not that bad. Pick a random container in prod, and... just debug it. Use logpoints instead of breakpoints, since you don't want to halt the world. Takes some scripting to SSH port forward to docker port forward to the container, but an LLM can crack that script out in no time.
https://nodejs.org/en/learn/getting-started/debugging#enable...
My cherry on top is to make sure the services my apps consume are attached to globalThis, so I can just hit my services directly from the running instance, in the repl. Without having to trap them being used here or there.
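The globalThis trick described above might look roughly like this (the service names are invented):

```javascript
// Attach app services to globalThis once at startup, so a live REPL or
// inspector session attached to the running process can reach them
// directly, without trapping a call site where they happen to be used.
const services = {
  db: { ping: () => 'pong' },
  cache: { stats: () => ({ hits: 42, misses: 7 }) },
};

globalThis.services = services; // done once at boot

// Later, from a REPL attached to the running instance:
console.log(globalThis.services.db.ping()); // pong
```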
I feel like this is an outdated point of view now. Logs are clearly going to be read primarily by agents very soon, if they're not already now.
For example, we're experimenting with having Claude Desktop read log files for remote users. It's often able to troubleshoot and solve issues for our users faster than we can, especially after you give it access to your codebase through GH MCP or something like that. It's wild.
How does this change the point that is being made in the article? Your agent is also only taking one of the existing roles that humans today occupy (e.g. the software operator or developer)
If the logs are being read by agents then they should be more detailed and verbose to help the agent understand the root cause. We reduce the volume of information for humans. That doesn’t need to be the case any longer.
They don't want tinkering or tinkerers.
I totally agree but you can attribute a lot of the Apple worship to Microsoft and their OEM partners making PC laptops an often miserable experience.
Apple sends tens of megabytes of telemetry starting from the first network connection, and regularly thereafter:
https://sneak.berlin/20210202/macos-11.2-network-privacy/
None of this can be turned off normally (the boot volume is read-only); it can only be deactivated by jumping through hoops.
He's uninformed? I assume you have a jailbroken Apple iPhone then?
IDK, they were sending around stacks of Mac Studios to tinkerer youtubers messing with EXO clustering like @geerlingguy.
https://youtu.be/1iT9JeZYXcI?si=UMR0nfHAYbVq2tF1
Okay, but then their stuff needs to be perfect as designed. Because the moment there's a bug, we're back to needing diagnostic tools.
My point is that even inside the lines there are still bugs.
I wouldn't confuse Steve Jobs-era Apple with what it is now.
> As a European I'm always baffled how Apple got so much market share among the actual techies and power users in the US.
Linux, historically, was terrible and then some; lots of us simply want to get on with life and not dork with the OS every day. If you didn't want to use Windows at your day job, that left OS X.
And, for a while, Apple hardware was quite nice. For a remarkably long time, you could get way cheaper high resolution laptop displays than the competition. The trackpads have always been far superior on Apple than Linux. And then the M-series came along and was also quite nice.
However, over time Linux has gotten better so it's now functional as a daily driver and reasonably reliable. And macOS has deteriorated until it's now probably below Linux in terms of reliability.
So, here we are. macOS and Windows do seem to be losing share to Linux, but only Linux cares. At this point, desktop/laptop revenue is dwarfed by everything else at both Microsoft and Apple.
> It’s particularly bad in the (Apple) mobile ecosystem
It's been years since I've significantly used Apple software, but when I had to use a Mac at work, or helped friends or family troubleshoot some problem on Mac OS, I had a similar experience. When things didn't "just work", it was very difficult to figure out why.
Rather than indulging the inevitable argument that most users never read log messages, I hope we can remember a more important fact:
Some users do read log messages, just as some users file useful bug reports. Even when they are a tiny minority, I find their discoveries valuable. They give me a view into problems that my software faces out there in the wilds of real-world use. My log messages enable those discoveries, and have led to improvements not only in my own code, but also in other people's projects that affect mine.
This is part of why I include a logging system and (hopefully) understandable messages in my software. Even standalone tools and GUI applications.
(And since I am among the minority who read log messages, I appreciate good ones in software made by other people. Especially when they allow me to solve a problem immediately, on my own, rather than waiting days or weeks to get the developer's attention.)
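To make that concrete, here's a minimal sketch of how I wire this up in standalone tools, using Python's stdlib logging (function and file names are made up for illustration): the user-facing console gets only the important messages, while a log file keeps the detail that makes a bug report actionable.

```python
import logging
import sys

def build_logger(name: str, logfile: str) -> logging.Logger:
    """Log INFO+ to stderr for interactive users, and everything
    (including DEBUG) to a file for later diagnosis."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    console = logging.StreamHandler(sys.stderr)   # what the user sees
    console.setLevel(logging.INFO)
    console.setFormatter(fmt)

    filelog = logging.FileHandler(logfile)        # what the bug report attaches
    filelog.setLevel(logging.DEBUG)
    filelog.setFormatter(fmt)

    logger.addHandler(console)
    logger.addHandler(filelog)
    return logger
```

The point is that the verbose file costs the user nothing, but it's there when someone (me, or a curious user) wants to dig in.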
For years now I’ve been pushing for moving all non-actionable error messages, and all aggregate-actionable error messages, into telemetry data instead.
Not the least of which because log processing SaaS companies seem to be overcharging for their services even versus hosted Grafana services, and really many of us could do away with the rent seeking entirely.
The computational complexity of finding meaning in log files versus telemetry data leans toward this always being the case. It will never change except in brief cases of VC money subsidizing your subscription.
If one error shouldn’t trigger operator action but 1000 should, that’s a telemetry alert, not a Datadog or Splunk problem.
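A rough sketch of that "alert at 1000, not at 1" idea, as a plain in-process sliding-window counter (names and thresholds are made up; any real system would do this in the telemetry backend):

```python
import time
from collections import deque
from typing import Optional

class ErrorRateAlert:
    """Fire only when errors exceed a threshold within a sliding window.
    A single error is recorded but never pages anyone."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window = window_seconds
        self.events = deque()  # timestamps of recent errors

    def record_error(self, now: Optional[float] = None) -> bool:
        """Record one error; return True if the alert should fire."""
        if now is None:
            now = time.monotonic()
        self.events.append(now)
        # Drop events that have fallen out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold
```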
How do you handle the problem that telemetry is generally incapable of capturing temporal context?
That is the $65k question, and unfortunately I don't have a pat answer yet. I probably need to see more types of projects instead of more time on fewer projects, which is where I'm at.
But I can give you a partial picture.
You're going to end up with multiple dashboards with duplicate charts on them, because you're showing correlation between two charts via proximity, especially charts in the same column in row n±1 or vice versa. You're trying to show whether a correlation is likely to be causation or not. Grafana has a setting that shows the crosshairs on all graphs at the same time, but they need to be in the same viewport for the user to see them.

Generally, for instance, error rates and request rates are proportional to each other, unless a spike in error rates is triggered by, say, web crawlers that are now hitting you with 300 req/s each when they normally send 50. The difference in the slope of the lines can tell you why an alert fired, or that it's about to. So I let previous RCAs inform whether two graphs need to be swapped because we missed a pattern that spanned a viewport. And sometimes after you fix tech debt, the correlation between two charts goes up or way down. So what was best in May may not be best come November.
There's a reason my third monitor is in portrait mode, and why that monitor is the first one I see when I come back to my desk after being AFK. I could fit 2 dashboards and group chat all on one monitor. One dashboard showed overall request rate and latency data, the other showed per-node stats like load and memory. That one got a little trickier when we started doing autoscaling. The next most common dashboard which we would check at intervals showed per-service tail latencies versus request rates. You'd check that one every couple of hours, any time there was a weird pattern on the other two, or any time you were fiddling with feature toggles.
From there things balkanized a bit. We had a few dashboards that two or three of us liked and the rest avoided.
Yeah, but that still doesn’t let you see “event A happened before event B, which led to C”. I’ve had far more than one bug where having good logs let me investigate and resolve the issue quickly and easily, whereas telemetry would have left me searching around forever.
Honest question, how do you handle high cardinality data points?
Reference to where my brain is at: https://www.robustperception.io/cardinality-is-key/
I feel like Splunk’s business model favors a healthy system and gives major disadvantages to an unhealthy one. An example of what I mean: when the system is unhealthy, I know it because all my Splunk queries get queued up, because everyone is slamming it with queries. I hate it.
But I’m stuck in knowing how to move some things to Prometheus. Like say we have a CustomerID and we want to track number of times something is done by user. If we have thousands of customers, cardinality breaks that solution.
Is there a good solution for this?
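To illustrate the cardinality problem in plain Python (not any real client library; the metric names and workaround are just a sketch): a per-customer label creates one time series per customer, while aggregating the label away keeps series count constant and pushes the per-customer breakdown into logs or a separate analytics store.

```python
from collections import Counter

def naive_series(events):
    """Each unique label set becomes its own series: cardinality explosion."""
    return {("actions_total", e["customer_id"], e["status"]) for e in events}

def aggregated_series(events):
    """Keep only low-cardinality labels in the time-series store; the
    per-customer detail belongs in logs or an analytics database."""
    return Counter(("actions_total", e["status"]) for e in events)
```

With 1000 customers the naive version stores 1000 series; the aggregated one stores one per status value.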
This gets even worse with languages that run one process per CPU: processes on the same instance can clobber each other's values if you don't add fields to uniquely identify them.
We got a lot of pushback when migrating our telemetry to AWS. After initially being told to just move it, they saw how OTEL amplified data points and cardinality versus our old StatsD data.
You probably need less cardinality than you think, and there are a mix of stats that work fine with less frequent polling, while others like heap usage are terrible if you use 20 or 30 second intervals. Our Pareto frontier was to reduce the sampling rate of most stats and push per-process things like heap usage into histograms.
An aggregator per box can drop a couple of tags before sending them upstream which can help considerably with the number of unique values. (eg, instanceID=[0..31] isn't that useful outside of the box)
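A sketch of that per-box aggregation in Python (tag names and data shapes are hypothetical): counters are summed after stripping tags that only matter locally, so the upstream store never sees the extra unique values.

```python
from collections import Counter

DROP_TAGS = {"instanceID"}  # per-box tags that are meaningless upstream

def aggregate(points):
    """Sum counter values after stripping locally-scoped tags.
    `points` is an iterable of (tags_dict, value) pairs."""
    rolled = Counter()
    for tags, value in points:
        key = frozenset((k, v) for k, v in tags.items() if k not in DROP_TAGS)
        rolled[key] += value
    return rolled
```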
Asking this question got me to stop being lazy and actually try to answer it myself. Mimir is one option that caught my eye:
https://grafana.com/oss/mimir/
I recently went all-in on the systemd ecosystem as much as I could on some recent hardware installs, and my biggest pet peeve is the double timestamps and double logs I find in journalctl... it's like they never intended you to read the logs...
One fairly common approach for systems is to configure the system to ship the logs to an external collection mechanism (FluentBit, etc.), and to do so in JSON format.
Of possible interest:
* https://dave.autonoma.ca/blog/2022/01/08/logging-code-smell/
* https://dave.autonoma.ca/blog/2026/02/03/lloopy-loops/
Both of these posts discuss using event-based frameworks to eliminate duplicative (cross-cutting) logging statements throughout a code base.
My desktop Markdown editor[1], uses this approach to output log messages to a dialog box, a status bar, and standard error, effectively "for free".
[1]: https://repo.autonoma.ca/repo/keenwrite/tree/HEAD/src/main/j...
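The pattern those posts describe is roughly an event bus with multiple sinks: the code publishes a log event once, and each subscriber (dialog box, status bar, standard error) renders it. A minimal Python sketch of the idea (the real project is Java; all names here are hypothetical):

```python
import sys

class LogBus:
    """Publish each log event once; any number of sinks subscribe."""

    def __init__(self):
        self.sinks = []

    def subscribe(self, sink):
        self.sinks.append(sink)

    def publish(self, message: str):
        for sink in self.sinks:
            sink(message)

# Hypothetical sinks: in a GUI app these could update a dialog or status bar.
bus = LogBus()
captured = []
bus.subscribe(captured.append)                      # e.g. a status bar model
bus.subscribe(lambda m: print(m, file=sys.stderr))  # standard error
bus.publish("export finished: 3 files written")
```

Adding a new output target is one `subscribe` call, with no new logging statements scattered through the code base.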
While I see the point the author is trying to make, I'm not really sure I agree. Most users don't even read error messages, never mind logs. At best, logs are something they need for compliance, for most, the concept doesn't exist at all. I do agree that the logs should help you understand what went wrong and why, but in that regard the principle is the same for both sysadmins and developers and I don't really see the difference?
In my sysadmin work I curse every developer who makes me fire up strace, tcpdump, procmon, Wireshark, etc., because they couldn't be bothered to actually say what file couldn't be found, what TCP connection failed to be established, etc.
I get the impression that often it isn't laziness but the concept that error details leak information to an attacker and are therefore a vulnerability.
I disagree with this view, but it definitely exists.
In a message returned by a server to a client I suppose it's defensible. For writing to syslog, event log, a log file, etc, it's not.
Yeah, along those lines we have requirements on never logging PII, and not logging anything that potentially contains PII, such as folder names.
Maybe tokenise the PII part of the folder name when outputting it?
i.e. `$HOME/.config/foo/stuff.cfg` rather than `/home/joebloggs/foo/stuff.cfg`?
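Something like this (Python sketch, hypothetical helper): replace the home-directory prefix with a token before the path ever reaches a log line.

```python
import os

def redact_home(path: str, home: str) -> str:
    """Replace the user's home directory prefix with a $HOME token
    so log lines don't leak the account name."""
    if path == home or path.startswith(home + os.sep):
        return "$HOME" + path[len(home):]
    return path
```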
Or have an encrypted data portion, so that the sensitive details can be revealed as-needed, and redaction occurs by rotating a key.
Obviously that depends on the messages being infrequent in production logging levels.
Sysadmins need logs that tell them what action they can take to fix the problem. Developers need logs that tell them what the system is doing.
Generally a sysadmin needs to know "is there an action I can take to correct this", whereas a dev has the power to alter source code and thus needs to know what the system is doing and where.
> but in that regard the principle is the same for both sysadmins and developers and I don't really see the difference?
No, it's very different: developers generally want to know about things they control, so they want detailed debugging info about a program's internal state that could have caused a malfunction. Sysadmins don't care about that (they're fine with coalescing all the errors that developers care about under a general "internal error"), and they care about what in the program environment could have triggered the bug, so that they may entirely avoid it or at least deploy workarounds to sidestep the bug.
> Most users don't even read error messages, never mind logs.
They don't need to. The log message exists so the helpdesk has something actionable, or so it can be copy-pasted into Google to find people with a similar problem who may have a solution.
> Most users don't even read error messages, never mind logs.
Yes, see all the questions on StackOverflow with people posting their error message without reading it, like “I got an error that says ‘fail! please install package xyz!’, what should I do?!?”.
That question is more likely "how do I install it?" than "what should I install?".
I think that's being very generous. If you've ever been in tech support, you'll be amazed at how often you're asked what to do when the message told them to do X.
If they don't know how to do X, then they should be able to look up how to do X. If it's something like install 3rd party library, then that's not the first party's responsibility. Especially OSS for different arch/distros. They are all different. Look up the 3rd party's repo and figure it out.
But no, it's contact support straight away.
I've worked in tech support. I get that 25-50% of the cases appear to be "read the docs to me." But the majority of those are because the docs are poorly written, are overwhelming for new users, or aren't understood by users who won't admit that directly.
On Friday I got two calls saying "my phone is no longer showing me my emails, please fix" when the error message they received was roughly "please reenter your password to continue using Outlook".
On Wednesday I got a call saying "the CRM won't let me input this note, please fix" when the error message was "you have included an invalid character, '□' found in description. remove invalid characters and resubmit".
Oddly enough though, my journey into computers was greatly assisted by my curiosity at random log files that were being dumped to my desktop constantly.
Each group of people is the target of a specific log level: INFO for random folks, DEBUG for programmers, etc.
From recent experience, I'm thinking logs need to be written for AI. Over the last few months, I've had a couple of issues where I took a bunch of logs from a bunch of interacting programs, pointed the AI at the logs and the source code, and it's been really effective at finding the problems, often seeing patterns that would have been really hard for me to spot in all the noise.
We have a magic button in servicenow that lets the L1 agent kick off a job that pulls telemetry from a user device and do an overall health check of the device. That input identifies the issue like 80% of the time if it’s a device issue.
It either gets resolved quicker by the L2 guy or dispatched to the third party hardware fix it guy or sent to some speciality L3 team. Resolution time is down like 60%.
My next goal is to assess disk and battery health in laptops and proactively replace if they hit whatever threshold we can push the vendors to accept. That could eliminate something like 30% of device related issues, which has a super high value.
The log needs to document, at least in broad steps and critical details, what the next operation is and what key parameters were provided to it.
A human, or an 'agent' can use those to figure out why said next step might have gone wrong.
This is the way I like to do it. I know bloating the logs too much can be a problem, but it's even worse if you're lacking the information to reconstruct what happened when there ends up being a problem. And only providing that detail when there's an error isn't enough. What if the issue never triggered an error in the application, and was only caught later, either by a person seeing something was off, or by an error in a downstream system?
Also it's helpful to log before operations rather than after because if a step gets stuck it's possible to know what it's stuck on.
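A rough sketch of the log-before-each-step idea (Python, hypothetical names): if the process hangs, the last "starting" line tells you exactly which step it's stuck on and with what parameters.

```python
import logging

log = logging.getLogger("pipeline")

def run_steps(steps):
    """Log each step *before* running it, so a hang points at the culprit.
    `steps` is a list of (name, callable, params_dict) tuples."""
    for name, func, params in steps:
        log.info("starting %s with %r", name, params)  # written before the call
        func(**params)
        log.info("finished %s", name)
```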
Well the useful ones are. The rest are screaming into the void, or rather, the operator’s ear.
Depends a lot on the context and type of software.
For server side software where there is a sysadmin in charge of keeping it running I generally agree.
But for end user software (desktop, mobile, embedded) no one wil read the logs and there the logs can, and probably should, be aimed at the developers. Of course you can and should still provide usable and informative end user oriented error messages but they're not the same thing as logs
> But for end user software (desktop, mobile, embedded) no one wil [sic] read the logs
Lots of end user software is used in an enterprise context where the helpdesk staff will have to read those logs. And for B2C (or retail, or amateur, whatever you want to call them) users, often they will go through online tutorials to try to self-diagnose because the developers are most of the time unreachable.
It doesn't. The detailed log might be nonsense to the user but so is generic error, and the difference is that the specific log message makes it far easier to find solution than generic one.
I.e. SEO-optimized
A small subset of technical users do read logs. If a desktop app has a problem, I have a fighting chance of fixing it if I have logs. Error messages may not give the full picture; what was the app trying to do before the error occurred? Logs let me debug slowness and crashes.
> But if your software is successful (especially if it gets distributed to other people), most of the people running it won't be the developers, they'll only be operating it.
The biggest problem is that when you wrote the code for a 'totally obvious message', you yourself were in the context. Years, months, heck, even weeks later you stare at it and wonder 'why tf didn't I write something more verbose?'.
Anecdote: I wrote some supporting scripts to 'integrate' two systems three times, totally oblivious the second and third times that I had already done it. Both times I was about 60% done when I thought 'wait, I totally recognize this code, but I just wrote it! What in deja-vu-nation?!'.
Not really true for modern cloud architectures. If you have an appropriately tuned Observability stack you're probably pretty familiar with the logs.
For a FOSS Android app I co-develop, we go out of our way to make verbose logging efficient to collect and easy to share (one-click copy). I've seen users get good mileage out of asking an LLM what has gone wrong. We are adding more structure to log messages, adding in as much state (like the call stack) as possible with each log line, plus diagnostics from procfs on resources held (like memory, threads, fds).
[dead]
The interesting edge case with AI agents: the "operator" collapses into whoever owns the agent, and the log's job changes fundamentally.
When a regular app logs an error, it's a passive record — the operator investigates at leisure. When an agent logs "I'm about to delete these 47 files — is that right?", it's an active interrupt. The log becomes a decision request, not an event record. "Waiting for human approval" is a semantically different thing than "ERROR: something failed."
Most agent setups treat this badly — write to stderr, fire a webhook, hope the human checks Slack. There's no canonical "agent pausing for human input" primitive in most stacks. It's logging's open problem for the agentic era.
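A minimal sketch of what such a primitive might look like (Python, entirely hypothetical; no real stack exposes this): the approval request is emitted as a structured event with an id, and the caller blocks until an operator answers or a timeout fails closed.

```python
import json
import queue

class ApprovalGate:
    """Emit an approval request as a structured event and block until answered.
    Unlike a passive log line, the event carries an id the operator must ack."""

    def __init__(self, emit):
        self.emit = emit            # e.g. writes JSON lines to a log stream
        self.answers = queue.Queue()

    def request(self, action: str, timeout: float = 60.0) -> bool:
        event = {"type": "approval_request", "id": 1, "action": action}
        self.emit(json.dumps(event))
        try:
            return self.answers.get(timeout=timeout)
        except queue.Empty:
            return False            # fail closed if no human responds
```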
This is a not-so subtle advantage JavaScript has over 90% of everything else: Chrome DevTools Protocol (CDP), which exists/is-great in-large-part thanks to JavaScript being an alive language. Of the Stop Writing Dead Programs variety (https://jackrusher.com/strange-loop-2022/, https://news.ycombinator.com/item?id=33270235). It's just astoundingly capable, so very richly exposes such a featureful runtime, across so many dimensions of tooling. REPL, logging, performance, heap, profile, storage, tracing and others, just for the core, before you get into the browser based things. https://chromedevtools.github.io/devtools-protocol/
This is such a core advantage of JavaScript: it is an alive language. The runtime makes it very easy to change and modify systems on an ongoing basis, and as an operator, that is so so so much better than having a statically compiled binary, in terms of what is possible.
One of my favorite techniques is using SIGUSR1 to start the node debugger. Performance impact is not that bad. Pick a random container in prod, and... just debug it. Use logpoints instead of breakpoints, since you don't want to halt the world. Takes some scripting to SSH port forward to docker port forward to the container, but an LLM can crack that script out in no time. https://nodejs.org/en/learn/getting-started/debugging#enable...
My cherry on top is to make sure the services my apps consume are attached to globalThis, so I can just hit my services directly from the running instance, in the repl. Without having to trap them being used here or there.
I feel like this is an outdated point of view now. Logs are clearly going to be read primarily by agents very soon, if they're not already now.
For example, we're experimenting with having Claude Desktop read log files for remote users. It's often able to troubleshoot and solve issues for our users faster than we can, especially after you give it access to your codebase through GH MCP or something like that. It's wild.
How does this change the point that is being made in the article? Your agent is also only taking one of the existing roles that humans today occupy (e.g. the software operator or developer)
If the logs are being read by agents then they should be more detailed and verbose to help the agent understand the root cause. We reduce the volume of information for humans. That doesn’t need to be the case any longer.