One question, how portable are performance benefits from tweaks to memory alignment? Is this something where going beyond rough heuristics (sequential access = good, order of magnitude cache sizes, etc) requires knowing exactly what platform you're targeting?
It sounds like the “worst case“ here is that the program is already optimized.