I'm not sure whether this step is a bottleneck in applications, but even so, it's interesting to ask which parts of it are GPU-friendly. That is, is there a (sparse) matrix representation that works well on GPUs? And does it make sense to carry out the DAG/tree construction as a sort of "prep" step (on CPU or GPU)?
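For context on the sparse-representation question: the format most GPU sparse libraries (e.g. cuSPARSE) support as a baseline is CSR, which stores the matrix as three flat arrays and so maps well onto GPU memory. A minimal sketch using SciPy (the GPU libraries use the same layout):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small dense matrix with mostly zeros.
dense = np.array([
    [0, 0, 3],
    [4, 0, 0],
    [0, 5, 6],
])

# CSR (compressed sparse row) stores three flat arrays:
#   data    - the nonzero values, row by row
#   indices - the column index of each nonzero
#   indptr  - where each row's entries start in data/indices
m = csr_matrix(dense)
print(m.data)     # [3 4 5 6]
print(m.indices)  # [2 0 1 2]
print(m.indptr)   # [0 1 2 4]
```

Whether the tree/DAG part admits a similarly flat, array-based encoding is exactly the harder question.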
The initial plain/dense algorithm looks pretty straightforward, but I'm not sure about the tree construction.