It's an interesting approach. I can see it being really useful for networks that are inherently smaller than an LLM: recommendation systems, fraud detection models, etc. For LLMs, I guess the most important follow-up line of research would be to ask whether a network trained in this special manner can then be distilled or densified in a way that retains the underlying decision-making of the interpretable network in a more efficient runtime representation. Or, alternatively, whether super-sparse networks can be made efficient to run at inference time.
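To be concrete about what I mean by "distilled or densified": a minimal sketch below, using plain knowledge distillation in PyTorch. The `sparse_teacher` / `dense_student` models, the temperature, and the data are all placeholders I made up for illustration, not anything from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins: `sparse_teacher` would be the interpretable, sparsely
# trained network; `dense_student` is a smaller dense model meant to mimic it.
sparse_teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
dense_student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

optimizer = torch.optim.Adam(dense_student.parameters(), lr=1e-3)
temperature = 2.0  # softens both distributions before matching them

def distill_step(x: torch.Tensor) -> float:
    """One distillation step: push the student's output distribution
    toward the frozen teacher's on the same batch."""
    with torch.no_grad():
        teacher_logits = sparse_teacher(x)
    student_logits = dense_student(x)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Toy usage on random inputs; in practice you'd stream real data.
print(distill_step(torch.randn(16, 32)))
```

The open question is exactly the weakness of this sketch: matching output distributions only preserves behavior on the training distribution, and says nothing about whether the student inherits the teacher's interpretable internal structure or its decision-making off-distribution.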
There's also a question of expected outcomes. Mechanistic interpretability seems hard not only because of density and superposition, but also because a lot of the deep concepts being represented are just inherently difficult to express in words. There are going to be a lot of groups of neurons encoding fuzzy intuitions that, at best, would take an entire essay to crudely put into words.
Starting from product goals and working backwards definitely seems like the best way to keep this stuff focused, but the product goal is going to depend heavily on the network being analyzed. Like, the goal of interpretability for a recommender is going to look very different from the interpretability goal for a general-purpose chat LLM.