The gain is minimal for doing this optimization at one location. But doing it everywhere, that could matter. Pushing back in a loop could maybe be optimized to a single allocation and a memcopy.
On the push impl in the article - for non-x86 (and perhaps even on x86 for performance, though not size/instruction count) it'd be better to allow the size increment to reuse the size read done by the capacity check; with C++'s lack of suitable aliasing information, the interleaved memcpy/store prevents the compiler from deciding this itself.