However, I do not see a big advantage over Cypress tests.
The article mentions shortcomings of Cypress (and Playwright):
> They start a dev server with bootstrapping code to load the component and/or setup code you want, which limits their ability to handle complex enterprise applications that might have OAuth or a complex build pipeline.
The simple solution is to containerise the whole application (including whatever OAuth provider is used), which lets you launch the whole stack and run the tests against it. Most apps (especially in enterprise) should already be containerised anyway, so most of the time you can just go ahead and run any tests against them.
How is SafeTest better than that when my goal is to test my application in a real-world scenario?
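As a sketch of what that can look like with testcontainers-node (the image names, ports, and env vars here are assumptions, not a real setup):

    // Sketch: start the app plus its OAuth provider as containers before the
    // tests run. Image names, ports, and env vars are hypothetical.
    import { GenericContainer, type StartedTestContainer } from "testcontainers";

    let oauth: StartedTestContainer;
    let app: StartedTestContainer;

    beforeAll(async () => {
      oauth = await new GenericContainer("quay.io/keycloak/keycloak:24.0")
        .withCommand(["start-dev"])
        .withExposedPorts(8080)
        .start();

      app = await new GenericContainer("my-org/my-app:latest") // hypothetical image
        .withEnvironment({
          OAUTH_ISSUER: `http://${oauth.getHost()}:${oauth.getMappedPort(8080)}`,
        })
        .withExposedPorts(3000)
        .start();
    }, 120_000);

    afterAll(async () => {
      await app.stop();
      await oauth.stop();
    });

    // Tests then point at the mapped port like any real deployment:
    // const baseUrl = `http://${app.getHost()}:${app.getMappedPort(3000)}`;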
I've recently been thinking about testing/QA with VLMs + LLMs. One area I haven't seen explored (but which should 100% be feasible) is to have the first run be LLM + VLM, then have the LLM(s?) write repeatable "cheap" tests with traditional libraries (Playwright, Puppeteer, etc.). On every run you do the "cheap" traditional checks; if any fail, go to the LLM + VLM again and see what broke, and only fail the test if both fail. Does that make sense?
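A minimal sketch of that two-tier fallback, where `visualAgentCheck` is a hypothetical stand-in for the LLM + VLM pass:

    // Sketch of the "cheap check first, agent fallback second" pattern.
    import { test, expect, type Page } from "@playwright/test";

    async function checkedStep(
      page: Page,
      cheapCheck: () => Promise<void>,
      visualAgentCheck: (page: Page) => Promise<boolean>,
    ): Promise<void> {
      try {
        await cheapCheck(); // fast, deterministic, runs on every build
      } catch (err) {
        // Cheap check failed: ask the expensive agent whether the page is
        // actually broken, or the generated test was just brittle.
        const stillWorks = await visualAgentCheck(page);
        if (!stillWorks) throw err; // fail only when both layers agree
        // Otherwise: pass, and flag the cheap test for regeneration.
      }
    }

    test("todo appears after adding", async ({ page }) => {
      await page.goto("http://localhost:3000"); // hypothetical app
      await checkedStep(
        page,
        () => expect(page.getByText("Buy milk")).toBeVisible(),
        async (p) => {
          // Hypothetical: screenshot `p`, send it to a VLM, return its verdict.
          return false;
        },
      );
    });

Over time the agent could also regenerate the cheap check instead of just arbitrating it.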
1. https://netflixtechblog.com/introducing-safetest-a-novel-app...
test('can log in and see correct settings')
.step('log in to the app')
.say('my username is user@example.com')
I'll need a way to extract data as part of the tests, like screenshots and page content. That would allow supplementing the tests with non-Magnitude features, as well as adding things that are a bit more deterministic: asserting that an added todo item exactly matches the input data, screenshot diffs when the planner fallback came into play, execution log data, etc.
This isn't currently possible from what I can see in the docs, but maybe I'm wrong?
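For concreteness, this is the kind of deterministic layer I mean, expressed in plain Playwright (the URL, placeholder text, and test id are made up):

    import { test, expect } from "@playwright/test";

    test("added todo matches the input exactly", async ({ page }) => {
      await page.goto("http://localhost:3000/todos"); // hypothetical app
      await page.getByPlaceholder("What needs to be done?").fill("Buy milk");
      await page.keyboard.press("Enter");

      // Deterministic assert: the rendered item must equal the input,
      // not merely "look right" to a model.
      await expect(page.getByTestId("todo-title")).toHaveText("Buy milk");

      // Extracted artifacts for diffing and execution logs.
      await page.screenshot({ path: "artifacts/after-add.png" });
      const html = await page.content(); // page content for later inspection
    });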
It'd also be ideal if it had an LLM-free executor mode to reduce cost and increase speed (caching outputs, or maybe using the accessibility tree instead of a VLM), and also to fit requirements where the planner should not automatically kick in.
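A sketch of what such a replay mode could look like (none of these names are Magnitude's actual API; the cache format is invented): record the planner's resolved actions on the first run, then replay them from disk with no model calls:

    import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
    import type { Page } from "playwright";

    type Action = { kind: "click" | "fill"; selector: string; value?: string };

    // First run: the planner (LLM/VLM) resolves a step into concrete actions.
    // Later runs: replay from the cache, fast and deterministic.
    async function runStep(page: Page, stepId: string, plan: () => Promise<Action[]>) {
      mkdirSync(".action-cache", { recursive: true });
      const cacheFile = `.action-cache/${stepId}.json`;

      let actions: Action[];
      if (existsSync(cacheFile)) {
        actions = JSON.parse(readFileSync(cacheFile, "utf8"));
      } else {
        actions = await plan(); // the only expensive path
        writeFileSync(cacheFile, JSON.stringify(actions, null, 2));
      }

      for (const a of actions) {
        if (a.kind === "click") await page.click(a.selector);
        else await page.fill(a.selector, a.value ?? "");
      }
    }

On replay failure you'd invalidate the cache entry and fall back to the planner, which is where the hybrid cost model comes from.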
One benefit of not using pure vision is that it's a strong signal to developers to make pages accessible. Pure vision lets them off the hook.
Perhaps testing both paths separately would be more appropriate. I could imagine a different AI agent attempting to navigate the page through accessibility landmarks. Or even different agents that simulate different types of disabilities.
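A sketch of that accessibility-only variant: the agent never sees pixels, only the accessibility tree, so an unlabeled page fails instead of being rescued by vision (`askAgent` is a hypothetical model call):

    import { chromium } from "playwright";

    type A11yAction = { done: boolean; role: "button" | "link" | "textbox"; name: string };

    // Hypothetical planner stub; wire this up to the model of your choice.
    async function askAgent(tree: string): Promise<A11yAction> {
      throw new Error("not implemented");
    }

    async function main() {
      const browser = await chromium.launch();
      const page = await browser.newPage();
      await page.goto("http://localhost:3000"); // hypothetical app

      for (let i = 0; i < 10; i++) {
        // Roles and accessible names only, no screenshot.
        const tree = await page.accessibility.snapshot();
        const action = await askAgent(JSON.stringify(tree));
        if (action.done) break;
        // Act the way assistive technology addresses the page.
        await page.getByRole(action.role, { name: action.name }).click();
      }
      await browser.close();
    }

    main();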