A one-bit adder (and a subtractor is essentially the same circuit run in reverse) passes signals through two gates: an XOR for the sum bit and an AND for the carry.
See https://en.wikipedia.org/wiki/Adder_(electronics)
You need the two gates for adding/subtracting because you care about the carry. So if you're adding/subtracting 8 bits, 16 bits, or more, you're chaining multiples of these together, and that carry has to ripple through all the rest of the gates one by one. It can't be parallelized without extra circuitry (carry-lookahead logic), which increases your costs in other ways.
Without the AND gate needed for carry, all the XORs can fire off at the same time. If you added the extra circuitry for a parallelizable add/subtract to make it as fast as XOR, your actual parallel XOR would consume less power.
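As a sketch of the point above (gate-level logic in Python, not real hardware; an 8-bit width is assumed for illustration), the adder's carry chain is serial while XOR has no cross-bit dependency at all:

```python
def ripple_carry_add(a, b, width=8):
    """Add two integers bit by bit; the carry must ripple through every position."""
    result, carry = 0, 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i           # sum bit
        carry = (x & y) | (carry & (x ^ y))      # carry-out depends on carry-in: the serial chain
    return result & ((1 << width) - 1)

def bitwise_xor(a, b, width=8):
    """Every bit position is independent: all XOR gates can fire at once."""
    return (a ^ b) & ((1 << width) - 1)

assert ripple_carry_add(200, 100) == (200 + 100) & 0xFF
assert bitwise_xor(200, 100) == (200 ^ 100) & 0xFF
```

The `for` loop models the ripple: bit i's carry-out feeds bit i+1, which is exactly the dependency XOR avoids.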
Will remember for the next time I write asm for Itanium!
Probably, there are ALU pipeline designs where you don't pay an explicit penalty. But not all, and so XOR is faster.
Surely, someone as awesome as Raymond Chen knows that. The answer is so obvious and basic I must be missing something myself?
I looked it up afterwards and XOR was also a valid instruction in that architecture to zero out a register, and it used even fewer cycles than the subtraction method; but it was not listed in the subset of assembly instructions we were allowed to use for that unit. I suspect it was deemed a bit off-topic, since you would need to explain what the mathematical XOR operation is (if you hadn't already learned about it in other units) when the unit was about something else entirely, but everyone knows what subtraction is, and that subtracting a number from itself gives zero.
[0] Not x86, I do not recall the exact architecture.
Absolutely. But I can also imagine that it feels more like something that should be more efficient, because it's "a bit hack" rather than arithmetic. After all, it avoids all the "data dependencies" (carries, never mind that the ALU is clocked to allow time for those regardless)!
I imagine that a similar feeling is behind XOR swap.
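The classic XOR swap, for reference (shown in Python for illustration; on real hardware a plain temp-variable swap is at least as fast thanks to register renaming):

```python
def xor_swap(a, b):
    """Swap two integers without a temporary, via three XORs."""
    a ^= b   # a = a0 ^ b0
    b ^= a   # b = b0 ^ (a0 ^ b0) = a0
    a ^= b   # a = (a0 ^ b0) ^ a0 = b0
    return a, b

assert xor_swap(5, 9) == (9, 5)
```

Note that the trick silently fails when both operands are the same storage location (x ^ x zeroes it), which is one more reason it stays a curiosity.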
> Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.
Network effects are much older than social media, then....
I remember the very first ROM instruction was XOR A and this was already a revelation to me as I'd never considered doing anything other than LD A,0 to clear the accumulator.
xor was the default zeroing idiom. I only did sub reg,reg when I actually wanted its flags result. Otherwise the main rule was: don't touch either form unless flags liveness makes the rewrite obviously safe. We had about 40 such idioms for the passes.
Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.
And this, interestingly, is why life on earth uses left-handed amino acids and right-handed sugars .. and why left-handed sugar is perfect for diet sodas.

There are many kinds of SUB instructions in the x86-64 ISA, which do subtraction modulo 2^64, modulo 2^32, modulo 2^16 or modulo 2^8.
To produce a zero result, any kind of subtraction can be used, and XOR is just a particular case of subtraction, not a different kind of operation.
Unlike with bigger moduli, when operations are done modulo 2, addition and subtraction are the same, so XOR can be used for either addition modulo 2 or subtraction modulo 2.
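Concretely, a quick Python check of the claim above: per bit, (x - y) mod 2, (x + y) mod 2, and x XOR y all coincide:

```python
for x in (0, 1):
    for y in (0, 1):
        assert (x - y) % 2 == (x + y) % 2 == x ^ y
# In particular x ^ x == (x - x) % 2 == 0, which is why both idioms zero a register.
```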
Latency (L) and throughput (T) measurements from the InstLatx64 project (https://github.com/InstLatx64/InstLatx64):
| GenuineIntel | ArrowLake_08_LC | SUB r64, r64 | L: 0.26ns= 1.00c | T: 0.03ns= 0.135c |
| GenuineIntel | ArrowLake_08_LC | XOR r64, r64 | L: 0.03ns= 0.13c | T: 0.03ns= 0.133c |
| GenuineIntel | GoldmontPlus | SUB r64, r64 | L: 0.67ns= 1.0 c | T: 0.22ns= 0.33 c |
| GenuineIntel | GoldmontPlus | XOR r64, r64 | L: 0.22ns= 0.3 c | T: 0.22ns= 0.33 c |
| GenuineIntel | Denverton | SUB r64, r64 | L: 0.50ns= 1.0 c | T: 0.17ns= 0.33 c |
| GenuineIntel | Denverton | XOR r64, r64 | L: 0.17ns= 0.3 c | T: 0.17ns= 0.33 c |
I couldn't find any AMD chips where the same is true.

How much of that advice applies to anything these days is questionable. Back then we used to squeeze as much as possible from every clock cycle.
And cache misses weren’t great but the “front side bus” vs CPU clock difference wasn’t so insane either. RAM is “far away” now.
So the stuff you optimize for has changed a bit.
Always measure!
8 'sub al, al', 14 'sub ah, ah', 3 'sub ax, ax'
26 'xor al, al', 43 'xor ah, ah', 3 'xor ax, ax'
edit: checked a 2010 bios and not a single 'sub x, x'
In principle, sub requires 4 steps:
1. Move both operands to the ALU
2. Invert the second operand and set carry-in to 1 (two's-complement negation)
3. Add (which internally is just XOR plus carry propagate)
4. Move result to proper result register.
This is absolutely not how modern processors do it in practice; there are many shortcuts. But at least with a pure XOR you don't need the two's-complement conversion or the carry propagation.
Source: Wrote microcode at work a million years ago when designing a GPU.
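A gate-level sketch of those four steps in Python (illustrative only; an 8-bit datapath is assumed, and no real microarchitecture works this literally):

```python
MASK = 0xFF  # assumed 8-bit datapath for illustration

def alu_sub(a, b):
    """a - b as: invert b, then add with carry-in 1 (two's-complement negation)."""
    b = ~b & MASK             # step 2: invert the second operand
    result, carry = 0, 1      # carry-in of 1 completes the negation
    for i in range(8):        # step 3: add, rippling the carry bit by bit
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result             # step 4: latch to the destination register

assert alu_sub(5, 5) == 0     # sub r,r zeroes, but only after the carry ripples
assert (5 ^ 5) == 0           # xor r,r zeroes with no carry chain at all
```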