by bayarearefugee
2 subcomments
- I mostly use Gemini, so I can't speak for Claude, but Gemini definitely has variable quality at different times, though I've never bothered to try to find a specific time-of-day pattern to it.
The most reliable time to see it fall apart is when Google makes a public announcement that is likely to cause a sudden influx of people using it.
And there are multiple levels of failure: first you start seeing iffy responses of obviously lesser quality than usual, and then if things get really bad you start seeing random errors where Gemini will suddenly lose all of its context (even on a new chat) or just start failing at the UI level by not bothering to finish answers, etc.
The obvious likely reason for this is that when the models are under high load, the provider probably engages in a type of dynamic load balancing where it falls back to lighter models or limits the amount of time/resources allowed for any particular prompt.
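(For illustration only, a minimal sketch of what that kind of dynamic load balancing could look like on the serving side. The tier names, thresholds, and failure rate are all hypothetical, not anything Google has documented.)

```python
import random

# Hypothetical tiers, ordered from full quality to cheapest fallback.
MODEL_TIERS = ["pro-full", "pro-reduced-thinking", "flash-fallback"]

def pick_model(current_load: float, capacity: float) -> str:
    """Route a request to a cheaper tier as utilization climbs.

    The thresholds are made up for illustration.
    """
    utilization = current_load / capacity
    if utilization < 0.7:
        return MODEL_TIERS[0]   # normal conditions: full model
    if utilization < 0.9:
        return MODEL_TIERS[1]   # busy: same model, smaller compute budget
    return MODEL_TIERS[2]       # overloaded: lighter fallback model

def handle_request(prompt: str, current_load: float, capacity: float) -> str:
    model = pick_model(current_load, capacity)
    # Under extreme load a request may simply fail rather than degrade,
    # which would show up as the UI-level errors described above.
    if current_load > capacity and random.random() < 0.2:
        raise RuntimeError("503: overloaded")
    return f"[{model}] response to: {prompt!r}"
```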
- The math is obvious on this one. It's super well-documented that model performance on complex tasks scales (to some asymptote) with the amount of inference-time compute allocated.
LLM providers must dynamically scale inference-time compute based on current load because they have limited compute. Thus it's impossible for traffic spikes _not_ to cause some degradation in model performance (at least until/unless they acquire enough compute to saturate that asymptotic curve for every request under all demand conditions -- it does not seem plausible that they are anywhere close to this).
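(A toy illustration of that argument; the saturating curve and all numbers are made up, just to show the shape of the trade-off. If quality saturates asymptotically in inference-time compute, then shrinking the per-request budget during a traffic spike has to cost something unless you were already on the flat part of the curve.)

```python
import math

def quality(compute: float, q_max: float = 100.0, tau: float = 4.0) -> float:
    """Toy saturating curve: quality approaches q_max as compute grows."""
    return q_max * (1.0 - math.exp(-compute / tau))

total_compute = 1000.0  # fixed fleet capacity (arbitrary units)

for requests_per_sec in (50, 100, 200):        # a 4x traffic spike
    budget = total_compute / requests_per_sec  # compute left per request
    print(f"{requests_per_sec:>4} req/s -> budget {budget:5.1f} -> quality {quality(budget):5.1f}")

#   50 req/s -> budget  20.0 -> quality  99.3
#  100 req/s -> budget  10.0 -> quality  91.8
#  200 req/s -> budget   5.0 -> quality  71.3
```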
- My limited understanding here is that usage loads impact model outputs to make them less deterministic (and likely degrading in quality). See: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
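(For a concrete flavour of the mechanism that post discusses, here is a tiny demo of floating-point non-associativity, not anything about any provider's actual kernels: the same numbers reduced in a different order, as effectively happens when batch size or kernel tiling changes, can give slightly different results, and those small differences can flip a sampled token.)

```python
import random

random.seed(0)
values = [random.uniform(-1e6, 1e6) for _ in range(100_000)]

# Same values, summed in two different orders, as different batching /
# tiling strategies on an accelerator effectively do.
forward_sum = sum(values)
reverse_sum = sum(reversed(values))

print(forward_sum == reverse_sum)       # typically False
print(abs(forward_sum - reverse_sum))   # tiny, but nonzero
```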
by janalsncm
1 subcomments
- It’s possible that they could be using fallback models during peak load times (west coast midday). I assume your traffic would be routed to an east coast data center, though. But secretly routing traffic to a worse model is a bit shady, so I’d want some concrete numbers to quantify worse performance.
- I had something similar with GPT: like clockwork, every day after about 1pm it started producing total garbage. Not sure if our account was A/B tested, or they just routed us to some brutal quantization of GPT, or even a completely different model.
by schmookeeg
0 subcomment
- I do think Claude does jiggery-pokery with its model quality, but I have had Clod appear at any time of day.
What I find IS tied to time of day is my own fatigue, my own ability to detect garbage-tier code and footguns, and my patience. So if I am going to start cussing at Clod, it is almost always after 4 when I am trying to close out my day.
- I've had the same suspicion for various providers - if I had time and motivation I would put together a private benchmark that runs hourly and chart performance over time. If anyone wants to do that I'll upvote your Show HN :)
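(If someone does want to build that, a minimal version could look like the sketch below. The `ask_model` call is a placeholder for whichever provider you want to measure, and the two toy tasks are just stand-ins for a harder, mechanically checkable test set.)

```python
import csv
import datetime
import time

# Tiny fixed test set with mechanically checkable answers, so scoring
# doesn't depend on another LLM as a judge.
TASKS = [
    ("What is 17 * 23? Answer with just the number.", "391"),
    ("Reverse the string 'benchmark'. Answer with just the string.", "kramhcneb"),
]

def ask_model(prompt: str) -> str:
    """Placeholder: call whatever provider/model you want to track here."""
    raise NotImplementedError("wire up your provider's SDK")

def run_once(outfile: str = "hourly_scores.csv") -> None:
    correct = 0
    for prompt, expected in TASKS:
        try:
            answer = ask_model(prompt).strip()
        except Exception:
            answer = ""  # count provider errors as failures
        correct += int(answer == expected)
    with open(outfile, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.datetime.utcnow().isoformat(), correct, len(TASKS)]
        )

if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(3600)  # once an hour; cron would be the more robust choice
```

Real tasks would need to be hard enough that degradation actually moves the score; charting the CSV over a few weeks would show whether any dips line up with US daytime.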
- I've certainly noticed some variance from Opus. There are times it gets stuck and loops on dumb stuff that would have been frustrating from Sonnet 3.5, let alone something as good as Opus 4.5 when it's locked in. But it's not obviously correlated with time: I've hit those snags at odd hours and gotten great perf during peak times. It might just be somewhat variable, or a shitty context.
Now GPT-4.1 was another story last year: I remember cooking at 4am Pacific and feeling the whole thing slam to a halt as the US east coast came online.
by oncallthrow
2 subcomments
- For what it’s worth, Anthropic very strongly claim that they don’t degrade model performance by time of day [1]. I have no reason to doubt that; imo Anthropic are about as ethical as LLM companies get.
[1] https://www.anthropic.com/engineering/a-postmortem-of-three-...
- Yes I’ve noticed that at certain times it gets very stuck, with the exact same setups. If I keep trying with new context windows it will still have poor performance, but if I come back in 30m or an hour it returns to normal. I don’t think it’s my context window changing, it seems to truly be degradation.
FWIW, I experienced it with Sonnet as well. My conspiracy brain says they’re experimenting with tuning the model to use up more tokens when they want to increase revenue, especially as agents become more automated. Making things worse == more money! Just like the rest of tech.
by joshribakoff
0 subcomment
- Yep, I have long felt like I randomly get Sonnet results despite Opus billing. I try to work odd hours and notice better results.
by anonzzzies
0 subcomment
- Many people 'notice' it (on Reddit); I notice it too, but it is hard to prove. I tried the same prompt on the same code every 4 hours for 48 hours; the behaviour was slightly different each time, but not worse and not clearly tied to the time of day. But then I'll just be working on my normal code, think "wtf is it doing now???", look at the time, see that it is US daytime, and stop.
People put forward many theories for this (weaker-model routing is the most popular, be it a different model such as Sonnet or Haiku, or a more heavily quantized Opus); Anthropic says none of it is happening.
by killingtime74
0 subcomment
- Are you using the API or a subscription?
by DefundPortland
0 subcomment
- [flagged]
- Simple: the model is tired after a long day of working, so it starts making mistakes. Give it some rest and it is ready to serve again.
- It seems clear that, rather than throttling, Anthropic serves lower-quality versions of their models during peak usage to keep up with demand. They refuse to admit it, and it's hard to prove, but these threads consistently happen ~3 months after every single model release.