All but the most basic vacuum robots map their work area and devise plans how to clean them systematically. The others just bump into obstacles, rotate a random amount and continue forward.
Don't get me wrong, I love this project and the idea to build it yourself. I just feel like that (huge) part is missing in the article?
Too little train data, and/or data of insufficient quality. Maybe let the robot run autonomously with an (expensive) VLM operating it to bootstrap a larger train dataset without needing to annotate it yourself.
Or maybe the problem itself is poorly specified, or intractable with your chosen network architecture. But if you see that a vision llm can pilot the bot, at least you know you have a fighting chance.
(Lidar can of course also be echolocation).
Very cool project though!