Can anyone direct me towards how to ... make one? At the most fundamental level, is it about having test questions with known, golden (verified, valid) answers, and asking different LLM models to find the answer, and comparing scores (how many were found to be correct)?
What are the "obvious" things that are important to get right - temperature set to 0? At least ~10 or 20 attempts at the same problem for each LLM? And what are the non-obvious gotchas?
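To make the question concrete, here's roughly the kind of harness I'm imagining (a minimal sketch only - the OpenAI SDK client, the model names, the toy golden set, and the exact-match scoring are all just placeholders):

```python
import json
from collections import defaultdict

from openai import OpenAI  # placeholder client; swap in whatever SDK/gateway you use

client = OpenAI()

# Toy golden set with verified answers; a real one would be much larger and held out.
GOLDEN = [
    {"q": "What is 17 * 23?", "a": "391"},
    {"q": "What is the capital of Australia?", "a": "Canberra"},
]

MODELS = ["gpt-4o-mini", "gpt-4o"]  # placeholder model names
ATTEMPTS = 10  # repeat each question to see variance, even at temperature 0


def is_correct(answer: str, gold: str) -> bool:
    # Naive exact-match grading; real evals usually need normalization or an LLM judge.
    return gold.lower() in answer.lower()


scores = defaultdict(int)
for model in MODELS:
    for item in GOLDEN:
        for _ in range(ATTEMPTS):
            resp = client.chat.completions.create(
                model=model,
                temperature=0,
                messages=[{"role": "user", "content": item["q"]}],
            )
            if is_correct(resp.choices[0].message.content, item["a"]):
                scores[model] += 1

total = len(GOLDEN) * ATTEMPTS
print(json.dumps({m: f"{scores[m]}/{total} correct" for m in MODELS}, indent=2))
```

The exact-match grading is obviously the weakest part, so part of my question is whether existing frameworks handle grading/normalization for you.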
Finally, are there any known/commonly used frameworks for this, or would any tooling that can call different LLMs be enough?
Thanks!
Could've easily been framed as "you need both evals and a/b testing," but instead they chose this route which comes across as defensive, disingenuous, and desperate.
BTW, if a competitor ever writes a whole post to refute something you barely alluded to without even mentioning their name... congratulations, you've won.
I don't doubt that Raindrop's product is worthwhile to model vendors, but the post seems like its audience is C-suite folks who have no clue how anything works. Do their most important customers even have any of those?
In a gold rush, everyone is trying to sell you a different kind of shovel, claiming theirs is the best, when you really should go find a geologist and figure out where the vein is.
(Raindrop.io is a bookmark service that AFAIK has "take money from people and store their bookmarks" as its complete business model.)
This. I am so tired of people saying "evals" without defining what they mean. And now even management is asking me for evals and why we are not fine-tuning.