On the other hand, GPT feels much more consistent and direct with execution, where Opus might fade or time out because Anthropic's servers are on fire at 2pm on a Monday, or take longer than necessary and burn more tokens for the same result. GPT also dots all the i's.
I was trying Fable for execution and noticed a fair bit of what looked like thrashing or farting around: rewriting tests it had just written that were failing, which didn't give me a lot of confidence. But the final result was clean; it was just a longer path to get there.
I then like to have GPT or Opus review my PR for any issues before I spend time reading the output. This usually surfaces some stuff to tweak, but with Fable it was coming back clean. Again, this was a small window of normal usage over a few days, but there were some interesting takeaways.
If Fable doesn't come back, it's not the end of the world for me, and in some ways I prefer a bit more of an antagonistic relationship. It makes a nice inroad to reasoning about the code and how I might want to restructure things. This is a bit harder when the code is "bug free" apart from subtle issues or architectural decisions that are easy to overlook, but I find that if I sweat the architecture early on, anything beneath that is compartmentalized and stays trivial to fix.