I'm afraid that people will draw the wrong conclusion from "We didn’t just replace a model. We replaced a process." and see it as an endorsement of the zero-shot-uber-alles "Prompt and Pray" approach that is dominant in the industry right now, and the reason why an overwhelming fraction of AI projects fail.
If you can get good enough performance out of zero-shot, then yeah, zero-shot is fine. The thing is, to know it is good enough you still have to collect and annotate more data than most people and organizations want to.
- text classification, not text generation
- operating on existing unstructured input
- existing solution was extremely limited (string matching)
- comparing LLM to similar but older methods of using neural networks to match
- seemingly no negative consequences to warranty customers themselves from mis-classification (the data is used to improve the process, not to make decisions)

> We didn’t just replace a model. We replaced a process.
That line sticks out so much now, and I can't unsee it.
It’s not X, it’s Y. We didn’t just do A, we did B.
There’s definitely a lot of hard work that has gone in here. It’s gotten hard to read because of these sentence patterns popping up everywhere.
Over the past couple of years people have made attempts with NLP (let's say standard ML workflows), but NLP and word temperature scores are hard to integrate into a reliable data pipeline, much less an operational review workflow.
Enter LLMs: the world is a data guru's oyster for building a detection system on warranty claims. Passing data to prompted LLMs means capturing and classifying records becomes significantly easier, and these data applications can flow into more normal analytic work streams.
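Concretely, the flow can be as small as the sketch below; the `call_llm` helper and the JSON label schema (`fluid_leak`, `confidence`) are placeholders I made up, not anything from the article:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM client is actually used;
    it returns a canned response here so the sketch runs end to end."""
    return '{"fluid_leak": 1, "confidence": 0.87}'

def classify_claim(claim_text: str) -> dict:
    # Asking for a fixed JSON schema means the answer drops straight into a table.
    prompt = (
        "You label automotive warranty claims.\n"
        'Return JSON with keys "fluid_leak" (0 or 1) and "confidence" (0 to 1).\n'
        f"Claim: {claim_text}"
    )
    return json.loads(call_llm(prompt))

# Each claim becomes an ordinary structured row that downstream
# analytics and reporting jobs can consume like any other table.
row = {"claim_id": 12345, **classify_claim("coolant dripping under radiator")}
print(row)
```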
* already known as SotA for text classification and similarity back in 2023
* natively multi-lingual

> in domains where the taxonomy drifts, the data is scarce, or the requirements shift faster than you can annotate
It's not actually clear if warranty claims really meet these criteria.
For warranty claims, the difficulty is in detecting false negatives, precisely when companies have a strong incentive and opportunity to hide the negatives.
Companies have been trusted to do this kind of market surveillance (auto warranties, drug post-market reporting) largely based on faith that the people involved would do so in earnest. That faith is misplaced when the process is automated (not because the implementors are less diligent, but because they are too removed to tell).
Then the backlash to a few significant injuries might be a much worse regime of bureaucratic oversight, right when companies have replaced knowledge with automation (and replacement labor costs are high).
The text says, "...no leaks..." The CASE statement says, "...AND LOWER(claim_text) NOT LIKE '%no leak%'...". Since '%no leak%' is a substring pattern, it also matches "no leaks", so that clause would have kept the claim out of the leak flag.
It would've properly been marked as a "0".
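To make the substring point concrete, here is a rough Python equivalent; the example claim text and the overall shape of the CASE statement are assumptions on my part, since only the NOT LIKE clause is quoted:

```python
claim_text = "Customer states no leaks observed after repair"  # illustrative text
text = claim_text.lower()

# Assumed shape of the original CASE logic: flag a leak only when the text
# mentions "leak" but does NOT match the negation pattern '%no leak%'.
label = 1 if ("leak" in text and "no leak" not in text) else 0

# '%no leak%' is a substring pattern, so it also matches "no leaks",
# and the claim is correctly left at 0.
print(label)  # 0
```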
If all the bullshit hype and marketing would evaporate already (“LLMs will replace all jobs!”), stuff like this would float to the top more and companies with large data sets would almost certainly be clamoring for drop-in analysis solutions based on prompt construction. They’d likely be far happier with the results, too, instead of fielding complaints from workers about it (AI) being rammed down their throats at every turn.
* "2 years vs 1 month" is a bit misleading because the work that enabled testing the 1 month of prompting was part of the 2 years of ML work.
* xgboost is an ensemble method... add the LLM outputs as inputs to xgboost and probably enjoy better results.
* vectorize all the text data points using an embedding model and add those as inputs to xgboost too, for probably better results (a rough sketch of both ideas follows below).
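A rough sketch of both suggestions, assuming the LLM's 0/1 calls are already available and using sentence-transformers as a stand-in embedding model; the model name, column meanings, and sample data are placeholders:

```python
import numpy as np
import xgboost as xgb
from sentence_transformers import SentenceTransformer

# Placeholder data: claim texts, the LLM's 0/1 leak call for each claim,
# and the human-annotated ground-truth label.
claims = ["coolant dripping under radiator", "no leaks found, noise from fan"] * 20
llm_flag = np.array([1, 0] * 20)
y = np.array([1, 0] * 20)

# Embed the raw text, then stack the LLM output alongside it as one extra feature.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
X = np.hstack([embedder.encode(claims), llm_flag.reshape(-1, 1)])

# xgboost then learns from both signals instead of one replacing the other.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)
```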
> Translating French and Spanish claims into German first improved technical accuracy—an unexpected perk of Germany’s automotive dominance.
It brings up an interesting idea: that some languages are better suited to particular domains.
“Fun fact: Translating French and Spanish claims into German first improved technical accuracy—an unexpected perk of Germany’s automotive dominance.”
In fact, there are companies such as Medallia that specialize in CX and have really strong classification solutions for specifically these use cases (plus all the generative AI stuff for closing the loop).
“Cut-chip” usually describes a fault where the engine cuts out briefly—as if someone flicked the ignition off for a split second—and the driver hears or feels a little “chip” or sharp interruption in power.

I'm curious about some kind of notion of "prompt overfitting." It's good to see the plots of improvement as the prompts change (although error bars probably would make sense here), but there's not much mention of hold-out sets or other approaches to mitigate those concerns.
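One way to address that concern, sketched with placeholder data and a stand-in `classify_claim` function: freeze a hold-out split before any prompt iteration and score it exactly once at the end.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def classify_claim(text: str) -> int:
    """Hypothetical stand-in for the prompted-LLM call being tuned."""
    return int("leak" in text.lower())

# Placeholder labeled data; in practice these are the annotated claims.
labeled = [("coolant dripping under radiator", 1), ("fan noise, no leaks found", 0)] * 20

# Lock away a hold-out split before touching any prompts.
dev, holdout = train_test_split(labeled, test_size=0.3, random_state=42)

# Iterate on prompt wording against `dev` only...
dev_f1 = f1_score([y for _, y in dev], [classify_claim(t) for t, _ in dev])

# ...then score `holdout` exactly once after the prompt is frozen.
# A large gap between the two numbers is the "prompt overfitting" worry above.
holdout_f1 = f1_score([y for _, y in holdout], [classify_claim(t) for t, _ in holdout])
print(dev_f1, holdout_f1)
```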
This being an automaker, I can almost smell the silos where data resides, the rigidly defended lines between manufacturing, sales and post-sales, the intra-departmental political fights.
Then you have all the legacy of enterprise software.
And the result is this shitty warranty claims data.