I'm afraid that people will draw the wrong conclusion from "We didn’t just replace a model. We replaced a process." and see it as an endorsement of the zero-shot-uber-alles "Prompt and Pray" approach that is dominant in the industry right now, and the reason why an overwhelming fraction of AI projects fail.
If you can get good enough performance out of zero-shot, then yeah, zero-shot is fine. The thing is, to know it is good enough you still have to collect and annotate more data than most people and organizations want to.
- text classification, not text generation
- operating on existing unstructured input
- existing solution was extremely limited (string matching)
- comparing LLM to similar but older methods of using neural networks to match
- seemingly no negative consequences to warranty customers themselves from mis-classification (the data is used to improve the process, not to make decisions)

> We didn’t just replace a model. We replaced a process.
That line sticks out so much now, and I can't unsee it.
It’s not X, it’s Y. We didn’t just do A, we did B.
There’s definitely a lot of hard work that has gone in here. It’s gotten hard to read because of these sentence patterns popping up everywhere.
Over the past couple of years people have made attempts with NLP (let's say standard ML workflows), but NLP and word temperature scores are hard to integrate into a reliable data pipeline, much less an operational review workflow.
Enter LLMs: the world is a data guru's oyster for building a detection system on warranty claims. Passing data to prompted LLMs means capturing and classifying records becomes significantly easier, and these data applications can flow into more normal analytic work streams.
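Concretely, the flow can be as small as the sketch below; the `call_llm` helper and the JSON label schema (`fluid_leak`, `confidence`) are placeholders I made up, not anything from the article:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM client is actually used;
    it returns a canned response here so the sketch runs end to end."""
    return '{"fluid_leak": 1, "confidence": 0.87}'

def classify_claim(claim_text: str) -> dict:
    # Asking for a fixed JSON schema means the answer drops straight into a table.
    prompt = (
        "You label automotive warranty claims.\n"
        'Return JSON with keys "fluid_leak" (0 or 1) and "confidence" (0 to 1).\n'
        f"Claim: {claim_text}"
    )
    return json.loads(call_llm(prompt))

# Each claim becomes an ordinary structured row that downstream
# analytics and reporting jobs can consume like any other table.
row = {"claim_id": 12345, **classify_claim("coolant dripping under radiator")}
print(row)
```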
* already known as SotA for text classification and similarity back in 2023
* natively multi-lingual

> in domains where the taxonomy drifts, the data is scarce, or the requirements shift faster than you can annotate
It's not actually clear if warranty claims really meet these criteria.
For warranty claims, the difficulty is in detecting false negatives, precisely when companies have a strong incentive and opportunity to hide the negatives.
Companies have been trusted to do this kind of market surveillance (auto warranties, drug post-market reporting) largely based on faith that the people involved would do so in earnest. That faith is misplaced when the process is automated (not because the implementors are less diligent, but because they are too removed to tell).
Then the backlash to a few significant injuries might be a much worse regime of bureaucratic oversight, right when companies have replaced knowledge with automation (and replacement labor costs are high).
The text says, "...no leaks..." The CASE statement says, "...AND LOWER(claim_text) NOT LIKE '%no leak%'...". Since '%no leak%' is a substring pattern, it also matches "no leaks", so that clause would have kept the claim out of the leak flag.
It would've properly been marked as a "0".
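To make the substring point concrete, here is a rough Python equivalent; the example claim text and the overall shape of the CASE statement are assumptions on my part, since only the NOT LIKE clause is quoted:

```python
claim_text = "Customer states no leaks observed after repair"  # illustrative text
text = claim_text.lower()

# Assumed shape of the original CASE logic: flag a leak only when the text
# mentions "leak" but does NOT match the negation pattern '%no leak%'.
label = 1 if ("leak" in text and "no leak" not in text) else 0

# '%no leak%' is a substring pattern, so it also matches "no leaks",
# and the claim is correctly left at 0.
print(label)  # 0
```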
If all the bullshit hype and marketing would evaporate already (“LLMs will replace all jobs!”), stuff like this would float to the top more and companies with large data sets would almost certainly be clamoring for drop-in analysis solutions based on prompt construction. They’d likely be far happier with the results, too, instead of fielding complaints from workers about it (AI) being rammed down their throats at every turn.
* "2 years vs 1 month" is a bit misleading because the work that enabled testing the 1 month of prompting was part of the 2 years of ML work.
* xgboost is an ensemble method... add the LLM outputs as inputs to xgboost and probably enjoy better results.
* vectorize all the text data points using an embedding model and add those as inputs to xgboost too, for probably better results (a rough sketch of both ideas follows below).
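A rough sketch of both suggestions, assuming the LLM's 0/1 calls are already available and using sentence-transformers as a stand-in embedding model; the model name, column meanings, and sample data are placeholders:

```python
import numpy as np
import xgboost as xgb
from sentence_transformers import SentenceTransformer

# Placeholder data: claim texts, the LLM's 0/1 leak call for each claim,
# and the human-annotated ground-truth label.
claims = ["coolant dripping under radiator", "no leaks found, noise from fan"] * 20
llm_flag = np.array([1, 0] * 20)
y = np.array([1, 0] * 20)

# Embed the raw text, then stack the LLM output alongside it as one extra feature.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
X = np.hstack([embedder.encode(claims), llm_flag.reshape(-1, 1)])

# xgboost then learns from both signals instead of one replacing the other.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)
```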
> Translating French and Spanish claims into German first improved technical accuracy—an unexpected perk of Germany’s automotive dominance.
It brings up an interesting idea: that some languages are better suited to particular domains.
“Fun fact: Translating French and Spanish claims into German first improved technical accuracy—an unexpected perk of Germany’s automotive dominance.”
In fact, there are companies such as Medallia that specialize in CX and have really strong classification solutions for specifically these use cases (plus all the generative AI stuff for closing the loop).
“Cut-chip” usually describes a fault where the engine cuts out briefly—as if someone flicked the ignition off for a split second—and the driver hears or feels a little “chip” or sharp interruption in power.

I'm curious about some kind of notion of "prompt overfitting." It's good to see the plots of improvement as the prompts change (although error bars probably would make sense here), but there's not much mention of hold-out sets or other approaches to mitigate those concerns.
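One way to address that concern, sketched with placeholder data and a stand-in `classify_claim` function: freeze a hold-out split before any prompt iteration and score it exactly once at the end.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def classify_claim(text: str) -> int:
    """Hypothetical stand-in for the prompted-LLM call being tuned."""
    return int("leak" in text.lower())

# Placeholder labeled data; in practice these are the annotated claims.
labeled = [("coolant dripping under radiator", 1), ("fan noise, no leaks found", 0)] * 20

# Lock away a hold-out split before touching any prompts.
dev, holdout = train_test_split(labeled, test_size=0.3, random_state=42)

# Iterate on prompt wording against `dev` only...
dev_f1 = f1_score([y for _, y in dev], [classify_claim(t) for t, _ in dev])

# ...then score `holdout` exactly once after the prompt is frozen.
# A large gap between the two numbers is the "prompt overfitting" worry above.
holdout_f1 = f1_score([y for _, y in holdout], [classify_claim(t) for t, _ in holdout])
print(dev_f1, holdout_f1)
```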
This being an automaker, I can almost smell the silos where data resides, the rigidly defended lines between manufacturing, sales and post-sales, the intra-departmental political fights.
Then you have all the legacy of enterprise software.
And the result is this shitty warranty claims data.