I'll note that they had a large annotated data set already that they were using to train and evaluate their own models. Once they decided to start testing LLMs it was straightforward for them to say "LLM 1 outperforms LLM 2" or "Prompt 3 outperforms Prompt 4".
I'm afraid that people will draw the wrong conclusion from "We didn’t just replace a model. We replaced a process." and see it as an endorsement of the zero-shot-uber-alles "Prompt and Pray" approach that is dominant in the industry right now and the reason why an overwhelming faction of AI projects fail.
If you can get good enough performance out of zero shot then yeah, zero shot is fine. Thing is that to know it is good enough you still have to collect and annotate more data than most people and organizations want to do.
> Thing is that to know it is good enough you still have to collect and annotate more data than most people and organizations want to do.
This has been the bottleneck in every ML (not just text/LLM) project I’ve been part of.
Not finding the right AI engineers. Not getting the MLops textbook perfect using the latest trends.
It’s the collecting enough high quality data and getting it properly annotated and verified. Then doing proper evals with humans in the loop to get it right.
People who only know these projects through headlines and podcasts really don’t like to accept this idea. Everyone wants synthetic data with LLMs doing the annotations and evals because they’ve been sold this idea that the AI will do everything for you, you just need to use it right. Then layer on top of that the idea that the LLMs can also write the code for you and it’s a mess when you have to deal with people who only gain their AI knowledge through headlines, LinkedIn posts, and podcasts.
Amen brother. Working on a computer vision project right now, it's a wild success.
This isn't my first CV project, but it's the most successful one. And that chiefly because my client pulled out their wallets and let an army of annotators create all the train data I asked for, and more.
This has been the huge problem in AI research since at least 1998 (and that was just when I was first exposed to it). With data, everything is so much easier, and much simpler machine learning methods.
Supervised learning. Took a while to make that work well.
And then every few years someone comes up with a way to distill data out of unsupervised examples. GPT is these days the big example of that, but there was "ImageNet (unlabeled)" and LAION before that too. The issue is that there is just so much unsupervised data.
Now LLMs use that pretty well (even though stuffing everything into an LLM is getting old, and as this article points out, in any specific application they tend to get bested by something like XGBoost with very simple models)
The next frontier is probably "world models", where you first train unsupervised, not to train your model but to predict the world. THEN you train the model in this simulated, predicted world. That's the reason Yann Lecun really really wants to go down this direction.
I’ve got no problem w/ synthetic data, but it is still more work than most people want to do.
There was a post on here recently about how you should build your own agent, and I completely agree. I'd say most competent developers should be building even more complex projects than an agent. Once you do you quickly realize how it's a constant uphill battle, and it quickly becomes apparent that the data you're working with is the primary issue.
I don't know if that is what gp and above is talking about. "Agents" are the kind of thing/word that helps to paper over the very fact that these things only work because of huge amount of humans in-the-loop in the outset (that is, you know, labor). Agents help us believe that LLM's can do everything for us, even bootstrap themselves, but, what the above thread is about is that, really, what you get out correlates only to what you put in in the first place.
I would offer a stronger, more pointed observation: often the problem in building a good classifier is having good negative examples. More generally, how well a classifier handles negatives is a function of:
1. Data collection technique.
2. Data annotation (labelling).
3. Whether the classifier can actually learn from your "good" negatives (quantitatively, this shows up in the residual/margin/contrastive/triplet losses), i.e. whether it learns the difference between negatives and positives at train time, and whether that separation holds up at test time rather than only at the training optimum (a loss sketch follows below the list).
4. Calibration/Reranking and other Post Processing.
My guess is that they hit a sweet spot with the first 3 techniques.
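For what it's worth, here's a minimal sketch of the margin/triplet idea from point 3 (PyTorch; the embeddings, shapes, and margin value are made up for illustration). A negative example only contributes to the loss while it is still inside the margin, which is one concrete way the quality of your negatives shows up at train time:

```python
import torch
import torch.nn.functional as F

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    # distance from the anchor to its positive and negative examples
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # only triplets where the negative is not yet margin-separated contribute
    return F.relu(d_pos - d_neg + margin).mean()

# toy batch: 4 triplets of 8-dimensional embeddings
anchor, positive, negative = (torch.randn(4, 8) for _ in range(3))
print(triplet_margin_loss(anchor, positive, negative))
```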
Crucially, this is:
- text classification, not text generation
- operating on existing unstructured input
- existing solution was extremely limited (string matching)
- comparing LLM to similar but older methods of using neural networks to match
- seemingly no negative consequences to warranty customers themselves of mis-classification (the data is used to improve process, not to make decisions)
Which is good because a lot of such matching and ML use cases for products I’ve worked on at several companies fit into this. The problem I’ve seen is when decision making capabilities are inferred from/conflated with text classification and sentiment analysis.
In my current role this seems like a very interesting approach to keep up with pop culture references and internet speak that can change as quickly as it takes the small ML team I work with to train or re-train a model. The limit is not a tech limitation, it’s a person-hours and data labeling problem like this one.
Given I have some people on my team that like to explore this area I’m going to see if I can run a similar case study to this one to see if it’s actually a fit.
Edit: At the risk of being self deprecating and reductive: I’d say a lot of products I’ve worked on are profitable/meaningful versions of Silicon Valley’s Hot Dog/Not Hot Dog.
I agree with you that the headline really needs to be qualified with these details. So there's an aspect of being unsurprising here, because that particular set of details is exactly where LLMs perform very well.
But I think it's still an interesting result, because related and similar tasks are everywhere in our modern world, and they tend to have high importance in both business and the public sector. The older generation of machine learning techniques for handling these tasks were sophisticated, but demanding enough that very capable and experienced practitioners might need an R&D cycle just to conclude whether the problem was solvable with the available data to the desired standard.
LLMs represent a tremendous advancement in our ability as a society to deal with these kinds of tasks. So yes, it's a limited range of specific tasks, and success is found within a limited set of criteria, but they're very important tasks and enough of those criteria are met in practice that I think this result is interesting and generalizable.
That doesn't mean we should fire all of our data scientists and let junior programmers just have at it with the LLM, because you still need to put together a good data set, make sense of the results, and iterate intelligently, especially given that these models tend to be expensive to run. It does however mean that existing data teams must be open to adopting LLMs instead of traditional model fitting.
Wish there were a bit more technical detail on what the prompt iterations looked like.
> We didn’t just replace a model. We replaced a process.
That line sticks out so much now, and I can't unsee it.
Right? This one is also very clear ChatGPTese
> That’s not a marginal improvement; it’s a different way of building classifiers.
They've replaced an em-dash with a semi-colon.
They are really getting to the heart of the problem!
You're absolutely right! They didn't just replace an em dash with a colon, they invented a whole new way of speaking.
/s if it wasn't obvious
One of the benefits of being immersed in model usage is being able to spot it in the wild from a mile away. People really hate when you catch them doing it and call them out for it.
And people like you will hate it even more when the normies immunize themselves from being obviously caught by such tells:
https://arxiv.org/abs/2510.15061
Ah ha! But now the complete lack of emdash and bullet pointed lists from antislop will be the tell! Riposte!
It didn't stick out to me because "corporate success story" articles already tend to sound like that, which is at least in part where I imagine the popular LLMs get it from. (The other part being pop nonfiction books.)
I dunno, ending with a short, punchy insight is a common way to make an impactful conclusion. It's the equivalent of a "hook" for concluding an article instead of opening. I do it often and see others (e.g. OpEds) use that tactic all the time.
I think we're getting into reverse slop discrimination territory now. LLMs have been trained on so much of what we consider "good writing", that actual good writing is now attributed by default to LLMs.
> That line sticks out so much now, and I can't unsee it.
I thought maybe they did it on purpose at first, like a cheeky but too subtle joke about LLM usage, but when it happened twice near the end of the post I just acknowledged, yeah, they did the thing. At least it was at the end or I might have stopped reading way earlier.
HN readers: claim to hate ai-generated text
Also HN readers: upvote the most obvious chatgpt slop to the frontpage
The two groups can be different but exist in the same community.
And in fact can intersect.
I’ve been on HN long enough to know that the upvotes are primarily driven by reactions to the headline. The actual content only gets viewed after upvoting, or often not at all.
> Also HN readers: upvote the most obvious chatgpt slop to the frontpage
Eh, this one was interesting as documentation of real work that people were doing over years. You don't get that many blog posts about this sort of effort without, usually, a bunch of self hype (because the company blogging also sells data analysis AI or whatever) that clouds any interesting part of the story. The slop in it is annoying, but it's also noise that's relatively easy to filter out in this case.
Those phrases definitely stick out quite badly. But this post wasn’t pure slop.
It had high quality info about a large ML effort inside an old school auto company, which is very interesting. I was just a bit disappointed no one thought to edit those out.
If you have genuinely interesting and valuable results to report, but you ask AI to do the final writeup for you and it comes across in that generic AI slop style, is it slop? Kind of a gray area for me. It certainly feels lazy and disrespectful to me as a reader, but on the other hand if they don't spend an afternoon proofreading and revising, maybe they can spend that afternoon instead building stuff. I don't know, our whole concept of the purpose of the written word is falling apart.
Seems like a very natural fit for fine tuning - would have loved to see more on the LLM side.
It's worth highlighting the conditions under which this can help:
> in domains where the taxonomy drifts, the data is scarce, or the requirements shift faster than you can annotate
It's not actually clear if warranty claims really meet these criteria.
For warranty claims, the difficulty is in detecting false negatives, when companies have a strong incentive and opportunity to hide the negatives.
Companies have been trusted to do this kind of market surveillance (auto warranties, drug post-market reporting) largely based on faith that the people involved would do so in earnest. That faith is misplaced when the process is automated (not because the implementors are less diligent, but because they are too removed to tell).
Then the backlash to a few significant injuries might be a much worse regime of bureaucratic oversight, right when companies have replaced knowledge with automation (and replacement labor costs are high).
Once you start to recognize AI written, rewritten or even edited articles, it’s hard to stop.
It’s not X it’s Y. We didn’t just do A we did B.
There’s definitely a lot of hard work that has gone in here. It’s gotten hard to read because of these sentence patterns popping up everywhere.
It is some kind of new Law (that ought to be named after someone) that people who write about AI are likely using it to do that writing.
(Even ironically sometimes observed in cases when the writing is disparaging of AI and the use of AI).
If the subject matter is AI, you should instantly pay attention and look for the signs it was AI assisted or generated outright.
Dunno, reads fine to me; also it seems we now have a #nothingisreal problem where everything is AI. Given that LLMs were trained on pre-existing writing, it follows that people commonly write like that.
Overall I think things have gotten better. Maybe 3 years before ChatGPT hit the scene, I noticed I would frequently land on pages that definitely didn't seem written by a native English speaker. The writing was just weird. I see less of that former style now.
Probably the biggest new trend I notice is this very prominent "Conclusion" block that seems to show up now.
Honestly I'd love to see some data on it. I suspect a lot of "that's LLM slop" isn't, a lot of actual slop goes unnoticed, and plenty of LLM tropes were rife in online content long before LLMs; we're just hypersensitive now to certain things because LLMs overuse them.
You're absolutely right
True now.
At the same time, as a nonnative speaker of English, this is literally how we were taught to write eye-catching articles and phrases. :P
A lot of formulaic writing is what we were taught to do, especially with more formal things. (This is more of a sidenote to this example)
So in a hunt for LLMs, we also get hit.
I believe this is likely a consequence of how RLHF is done. I’ve not verified it, but I suspect the frontier model labs are outsourcing it to companies employing primarily non-native English speakers.
Back in the day I heard that's why "delve" is so popular; it's common in Nigerian English.
I learned it from MtG and I do believe it's a very cool word and I hate that I can't use it without people raising their eyebrows.
That example sentence read more like a part of a LinkedIn post.
... Wait a minute!
You can remove this easily. We wrote a paper on how to remove this slop from LLMs.
https://arxiv.org/abs/2510.15061
Warranty data is a great example of the bureaucratic data overhead LLMs can help with. What most people do not know is that, because of the US federal TREAD regulation, automotive companies (if they want to collect and look at warranty data) need to review all warranty claims, document and detect any safety-related issues, and issue recalls, all with a strong auditability requirement. This generates huge data and operations overhead: companies need to either hire tens if not hundreds of individuals to inspect claims or come up with automation to make the process easier.
Over the past couple of years people have made attempts with NLP (let's say standard ML workflows), but NLP and word temperature scores are hard to integrate into a reliable data pipeline, much less an operational review workflow.
Enter LLMs: the world is a data guru's oyster for building a detection system on warranty claims. Passing data to prompted LLMs means capturing and classifying records becomes significantly easier, and these data applications can flow into more normal analytic work streams.
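As a rough illustration (not the authors' actual setup), a prompt-based claim classifier can be as small as the sketch below; the client library, model name, label taxonomy, and prompt wording are all placeholders, and the post itself used models available through AWS Bedrock:

```python
from openai import OpenAI  # any chat-completions-style client would do; this one is a stand-in

LABELS = ["fluid_leak", "electrical", "noise_vibration", "no_defect_found"]  # hypothetical taxonomy

PROMPT = (
    "You are classifying automotive warranty claims.\n"
    "Pick exactly one label from: {labels}\n"
    "Claim text: {claim}\n"
    "Answer with the label only."
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify(claim: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(labels=", ".join(LABELS), claim=claim)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(classify("Customer reports oil on garage floor; oil filter housing found loose."))
```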
Hmm, why was their starting point not something like BERT?
People generally sleep when you start talking about fine-tuned BERT and CLIP, although they do a fairly decent job as long as you have good data and know what you're doing.
But no, they want to pay $0.10 per request to recognize whether a photo has a person in it by asking a multimodal LLM deployed across 8x GPUs, for some reason, instead of just spending some hours with CLIP and running it effectively even on a CPU.
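For comparison, a zero-shot "person / no person" check with CLIP really is only a few lines. A sketch using the Hugging Face transformers wrapper and the standard public checkpoint; the file name and prompt wording are arbitrary:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local image
texts = ["a photo containing a person", "a photo with no people in it"]

# score the image against both text prompts and softmax over the similarities
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
print({t: round(float(p), 3) for t, p in zip(texts, probs)})
```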
> they do a fairly decent job as long as you have good data and know what you're doing.
This is the bottleneck in my experience. Going for the expensive per-request LLM gets something shipped now that you can wow the execs with. Setting up a whole process to gather and annotate data, train models, run evals, and iterate takes time. The execs who hired those expensive AI engineers want their results right now, not after a process of hiring more people to collect and annotate the data.
I’m no ML engineer and far from an LLM expert. Just reading the article though it seemed to me that leveraging an SQL database here was a bigger issue than using traditional ML on the data, rather than the LLM being a win specifically. Just finding anything that was better suited than string matching on a RDBMS to the type of inputs seems like the natural conclusion when the complaint in the article itself was literally about SQL.
>... as long as you have good data and know what you're doing.
I think you've just identified, in a set-theoretic complementary manner, the TAM for GenAI.
What's TAM?
https://en.wikipedia.org/wiki/Total_addressable_market
Are you suggesting use the clip embedding for the text as a feature to train a standard Ml model on?
I think they're suggesting doing that with BERT for text and CLIP for images. Which in my experience is indeed quite effective (and easy/fast).
There have been some developments in the image-of-text/other-than-photograph area though recently. From Meta (although they seem unsure of what exactly their AI division is called): https://arxiv.org/abs/2510.05014 and Qihoo360: https://arxiv.org/abs/2510.27350 for instance.
I think he is. I do things like that plenty.
Because Enterprise Development only moves forward based on hyped technologies.
I found this fun fact really fascinating:
> Translating French and Spanish claims into German first improved technical accuracy—an unexpected perk of Germany’s automotive dominance.
It brings up an interesting idea that some languages are better suited for different domains.
So you're still ignoring that problem of putting the oil filter in places that cause excessive spilled oil? The point of artificial intelligence is to quickly have an unbiased check of all signal within the noise.
> Over multiple years, we built a supervised pipeline that worked. In 6 rounds of prompting, we matched it. That’s the headline, but it’s not the point. The real shift is that classification is no longer gated by data availability, annotation cycles, or pipeline engineering.
I get that SQL text searches are miserable to write, but it would have flagged it properly in the example.
The text says, "...no leaks..." The case statement says, "...AND LOWER(claim_text) NOT LIKE '%no leak%...'"
It would've properly been marked as a "0".
I thought the same. Having said that, the parentheses in the example are really wrong for what they were trying to convey. I suspect that they built this SQL sample for the document and made some mistakes in its generation.
Perhaps I could say, it isn't just generated--it is also hallucinated!
And this is where the strengths of LLMs really lie: making performant ML available to a wider audience, without requiring PhDs in Computer Science or Mathematics to build. It’s consistently where I spend my time tinkering with these, albeit in a local-only environment.
If all the bullshit hype and marketing would evaporate already (“LLMs will replace all jobs!”), stuff like this would float to the top more and companies with large data sets would almost certainly be clamoring for drop-in analysis solutions based on prompt construction. They’d likely be far happier with the results, too, instead of fielding complaints from workers about it (AI) being rammed down their throats at every turn.
^ This. I'm waiting for an LLM where I can just point it to a repo, slurp it up, and let me ask questions about it.
$ git clone repo && cd repo
$ claude
Ask away. Best method I’ve found so far for this.
This technique is surprisingly powerful. Yesterday I built an experimental cellular automata classifier system based on some research papers I found and was curious about. Aside from the sheer magic of the entire build process with Cursor + GPT5-Codex, one big breakthrough was simply cloning the original repo's source code and copy/pasting the paper into a .txt file.
Now when I ask questions about design decisions, the LLM refers to the original paper and cites the decisions without googling or hallucinating.
With just these two things in my local repo, the LLM created test scripts to compare our results versus the paper and fixed bugs automatically, helped me make decisions based on the paper's findings, helped me tune parameters based on the empirical outcomes, and even discovered a critical bug in our code that was caused by our training data being randomly generated versus the paper's training data being a permutation over the whole solution space.
All of this work was done in one evening and I'm still blown away by it. We even ported our code to golang, parallelized it, and saw a 10x speedup in the processing. Right before heading to bed, I had the LLM spin up a novel simulator using a quirky set of tests that I invented using hypothetical sensors and data that have not yet been implemented, and it nailed it first try - using smart abstractions and not touching the original engine implementation at all. This tech is getting freaky.
The mainstream coding agents have been doing this for a long time.
It helps to give it a little context and suggest where to look in the repo. The tools also have mechanisms where you can leave directions and notes in the context for the project. Updating that over time as you discover where the LLM stumbles helps a lot.
Basically every AI agent released in the last 6 months can do this pretty well out of the box? What feature exactly are you missing from these?
github copilot somewhat does this.
Copilot is too stingy with context. In my experience Claude Code is much better at seeing the big picture.
This is exactly what Devin (https://devin.ai) is designed to do. Their deepwiki feature is free. I’ve personally had decent success with it, but YMMV.
Apparently it's also shit. There was a discussion about it a few days ago that contains multiple project maintainers pointing out deepwiki didn't get their repos at all https://news.ycombinator.com/item?id=45884169
Three points to note:
* "2 years vs 1 month" is a bit misleading because the work that enabled testing the 1 month of prompting was part of the 2 years of ML work.
* XGBoost is an ensemble method... add the LLM outputs as inputs to XGBoost and probably enjoy better results.
* Vectorize all the text data points using an embedding model and add those as inputs to XGBoost for probably better results (a sketch of both ideas follows after this list).
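A sketch of those last two suggestions (model name, hyperparameters, and the toy claims are placeholders): embed each claim with a small sentence-embedding model, then hand the vectors, plus any LLM-derived columns you have, to XGBoost:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from xgboost import XGBClassifier

texts = [
    "oil leaking from filter housing",
    "infotainment screen freezes on startup",
    "coolant smell after long drives",
    "rattle from rear door panel",
]
labels = [1, 0, 1, 0]  # toy annotations

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X_text = encoder.encode(texts)                      # dense embedding features
llm_score = np.array([[0.9], [0.1], [0.8], [0.2]])  # e.g. an LLM-produced score per claim
X = np.hstack([X_text, llm_score])                  # combine both signals as feature columns

clf = XGBClassifier(n_estimators=200, max_depth=4)
clf.fit(X, labels)
print(clf.predict(X))
```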
It could have been done via topic analysis without an LLM.
In fact there are companies such as Medallia which specialize in CX and have really strong classification solutions for specifically these use cases (plus all the generative AI stuff for closing the loop).
Or you can have a single checkbox “Problem with the vehicle”.
The old model was capable of running on a CPU. The new one requires a GPU. This might be a consideration for some.
Running big LLMs is expensive, but not nearly as expensive as hiring people. Employees are very expensive, well beyond the wages that you see. Everything from employment taxes (employer paid) to hiring additional people to manage the people and their HR needs.
Even if it took $10 to run everything to handle each request, that’s far cheaper than even a minimum wage employee when you consider all of the employment overhead.
Honda probably spends $100-$10,000 on a warranty claim in terms of technician time and parts. [1] Even at the low end they can afford to spend 10 cents on an LLM to analyze a claim.
[1] specifically https://www.warrantyweek.com/archive/ww20230817.html claims the expectation value of warranty claims for a car is around $650.
> cut-chip
This was fun to read
“cut-chip” usually describes a fault where the engine cuts out briefly—as if someone flicked the ignition off for a split second—and the driver hears or feels a little “chip” or sharp interruption in power.
“ Fun fact: Translating French and Spanish claims into German first improved technical accuracy—an unexpected perk of Germany’s automotive dominance.”
It really puzzles me how this helps and how it was done.
Does it make the text clearer? How exactly? Is the German language more descriptive? Does it somehow expand the context?
So many questions in this fun fact.
I wonder how they came up with that. Was it a human idea, or did the AI stumble upon it?
Given that it was inside a 9-step text preprocessing pipeline, it would be surprising if the AI had that much autonomy.
I think it's fairly known among "LLM practitioners" (or what to call it), that some languages are better at solving specific tasks. Generally if you find yourself in a domain dominated by research in language X, shifting your prompts to that language will give you better results.
intuitively it has seemed that these kinds of "fuzzy text search" applications are an area where llms really shine. it's cool to see evidence of it working.
i'm curious about some kind of notion of "prompt overfitting." it's good to see the plots of improvement as the prompts change (although error bars probably would make sense here), but there's not much mention of hold out sets or other approaches to mitigate those concerns.
Would have been nice to have seen this done with the top models rather than something like Nova.
the blog doesn't actually say the word "honda" anywhere on here. would probably advise that
Did the author exactly define "Nova Lite" somewhere in there?
It's Amazon's own model. I'm baffled someone would pick it, even more that someone would test Llama 4 for a task in an age where Sonnet 4.5 is already out, i.e. within the last 45 days.
Looks like they were limited by AWS Bedrock options.
I wonder if text embeddings and semantic similarity would be effective here?
> We tried multiple vectorization and classification approaches. Our data was heavily imbalanced and skewed towards negative cases. We found that TF-IDF with 1-gram features paired with XGBoost consistently emerged as the winner.
Anthropic found a similar result for retrieval: embeddings + BM25 keyword search (variant of TF-IDF) produced significantly better results.
https://www.anthropic.com/engineering/contextual-retrieval
They also found improvements from augmenting the chunks with Haiku by having it add a summary based on extra context.
That seems to benefit both the keyword search and the embeddings by acting as keyword expansion. (Though it's unclear to me if they tried actual keyword expansion and how that would fare.)
---
Anyway what stands out to me most here is what a Rube Goldberg machine it is. Embeddings, keywords, fusion, contextual augmentation, reranking... each adding marginal gains.
But then the whole thing somehow works really well together (~1% fail rate on most benchmarks. Worse for code retrieval.)
I have to wonder how this would look if it wasn't a bunch of existing solutions taped together, but actually a full integrated system.
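For anyone curious what the "fusion" piece typically looks like, reciprocal rank fusion of a keyword (BM25-style) ranking and an embedding ranking is about this much code; the document IDs are toy, and k=60 is just the conventional constant from the RRF paper:

```python
def rrf(rankings, k=60):
    # each ranking is a list of doc IDs, best first; fused score is the sum of 1/(k + rank)
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]       # keyword ranking
embedding_hits = ["doc1", "doc9", "doc3"]  # semantic ranking
print(rrf([bm25_hits, embedding_hits]))    # fused order
```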
Thanks for sharing! I am working on a rag engine and that document provides great guidance.
And, agreed, each individual technique seems marginal but they really add up. What seems to be missing is some automated layer that determines the best way to chunk documents into embeddings. My use case is mostly normalized, mostly technical documents, so I have a pretty clear idea of how to chunk to preserve semantics. But I imagine that for generalized documents it is a lot trickier.
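In the absence of that automated layer, the usual baseline is still something naive like fixed-size windows with overlap (a sketch; the sizes are arbitrary, and a structure-aware splitter would follow headings or sections instead):

```python
def chunk(text: str, size: int = 800, overlap: int = 100):
    # slide a fixed window over the text, with overlap so sentences aren't cut cleanly in two
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "Torque the drain plug to spec. " * 200  # stand-in for a technical document
print(len(chunk(doc)), "chunks")
```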
Well, "vectorization" can be anything. BERT is in same capability class as GPT, very different from LSA people did in 1980s...
Yeah I’m curious if they tried training a Bert or similar classifier… intuitively this seems better than tfidf which is throwing away a ton of information.
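For reference, a fine-tuned BERT-style classifier is not much code either. A Hugging Face transformers sketch; the checkpoint, toy data, and training arguments are placeholders, not what the team used:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

texts = ["oil leaking from filter housing", "rattle from rear door panel"]
labels = [1, 0]  # toy annotations
enc = tok(texts, truncation=True, padding=True)

class ClaimDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        # pack tokenized inputs and the label for one example
        item = {k: torch.tensor(v[i]) for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="claims-clf", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=ClaimDataset(),
)
trainer.train()
```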
I have. For a similar-ish task.
LLMs still beat a classifier, because they're able to extract more signals than a text embedding.
It's very difficult to beat an LLM + prompt in terms of semantic extraction.
Is it hard to beat on performance for the cost, though?
dayumnnn, this is interesting to say the least
And yet, the source problem still remains. The company has a shitty way of reporting quality issues in relation to parts and assemblies.
It being an automaker, I can almost smell the silos where the data resides, the rigidly defended lines between manufacturing, sales and post-sales, and the intra-departmental political fights.
Then you have all the legacy of enterprise software.
And the result is this shitty warranty claims data.
As someone that also worked at a large automaker, I think you’re making large, unfounded assumptions.
Warranty data flows up from the technicians - good luck getting any auto technician to properly tag data. Their job is to fix a specific customer’s problem, not identify systematic issues.
There’s a million things that make the data inherently messy. For example, a technician might replace 5 parts before they finally identify the root cause.
Therefore, you need some sort of department to sit between millions of raw claims and engineering. I would be curious what kind of alternative you have in mind?
Silos are the root of all evil.
I didn't read this very carefully so maybe I missed it, but I'm surprised they didn't try using a classifier on top of a BERT-style encoder model?
“hundreds, if not thousands…thousands”
TLDR: "old" ml techniques like XGBoost can beat LLMs and neural networks for some tasks.
...when given a 2 year head start.