A one-bit adder (and a subtractor is essentially the same circuit run in reverse) passes signals through two gates: an XOR for the sum bit and an AND for the carry.
See https://en.wikipedia.org/wiki/Adder_(electronics)
You need the two gates for adding/subtracting because you care about the carry. So if you're adding/subtracting 8 bits, 16 bits, or more, you're chaining multiples of these together, and that carry has to ripple through all the rest of the gates one by one. It can't be parallelized without extra circuitry (carry-lookahead logic), which increases your costs in other ways.
Without the AND gate needed for carry, all the XORs can fire off at the same time. If you added the extra circuitry for a parallelizable add/subtract to make it as fast as XOR, your actual parallel XOR would consume less power.
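As a sketch of the point above (gate-level logic in Python, not real hardware; an 8-bit width is assumed for illustration), the adder's carry chain is serial while XOR has no cross-bit dependency at all:

```python
def ripple_carry_add(a, b, width=8):
    """Add two integers bit by bit; the carry must ripple through every position."""
    result, carry = 0, 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i           # sum bit
        carry = (x & y) | (carry & (x ^ y))      # carry-out depends on carry-in: the serial chain
    return result & ((1 << width) - 1)

def bitwise_xor(a, b, width=8):
    """Every bit position is independent: all XOR gates can fire at once."""
    return (a ^ b) & ((1 << width) - 1)

assert ripple_carry_add(200, 100) == (200 + 100) & 0xFF
assert bitwise_xor(200, 100) == (200 ^ 100) & 0xFF
```

The `for` loop models the ripple: bit i's carry-out feeds bit i+1, which is exactly the dependency XOR avoids.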
Will remember for the next time I write asm for Itanium!
Probably, there are ALU pipeline designs where you don't pay an explicit penalty. But not all, and so XOR is faster.
Surely, someone as awesome as Raymond Chen knows that. The answer is so obvious and basic I must be missing something myself?
I looked it up afterwards and XOR was also a valid instruction in that architecture to zero out a register, and it used even fewer cycles than the subtraction method; but it was not listed in the subset of assembly instructions we were allowed to use for that unit. I suspect it was deemed a bit off-topic, since you would need to explain what the mathematical XOR operation is (if you hadn't already learned about it in other units) when the unit was about something else entirely, but everyone knows what subtraction is, and that subtracting a number from itself gives zero.
[0] Not x86, I do not recall the exact architecture.
Absolutely. But I can also imagine that it feels more like something that should be more efficient, because it's "a bit hack" rather than arithmetic. After all, it avoids all the "data dependencies" (carries, never mind that the ALU is clocked to allow time for those regardless)!
I imagine that a similar feeling is behind XOR swap.
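The classic XOR swap, for reference (shown in Python for illustration; on real hardware a plain temp-variable swap is at least as fast thanks to register renaming):

```python
def xor_swap(a, b):
    """Swap two integers without a temporary, via three XORs."""
    a ^= b   # a = a0 ^ b0
    b ^= a   # b = b0 ^ (a0 ^ b0) = a0
    a ^= b   # a = (a0 ^ b0) ^ a0 = b0
    return a, b

assert xor_swap(5, 9) == (9, 5)
```

Note that the trick silently fails when both operands are the same storage location (x ^ x zeroes it), which is one more reason it stays a curiosity.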
> Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.
Network effects are much older than social media, then....
I remember the very first ROM instruction was XOR A and this was already a revelation to me as I'd never considered doing anything other than LD A,0 to clear the accumulator.
xor was the default zeroing idiom. I only did sub reg,reg when I actually wanted its flags result. Otherwise the main rule was: don't touch either form unless flags liveness makes the rewrite obviously safe. We had about 40 such idioms for the passes.
Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.
And this, interestingly, is why life on earth uses left-handed amino acids and right-handed sugars .. and why left-handed sugar is perfect for diet sodas.

There are many kinds of SUB instructions in the x86-64 ISA, which do subtraction modulo 2^64, modulo 2^32, modulo 2^16 or modulo 2^8.
To produce a zero result, any kind of subtraction can be used, and XOR is just a particular case of subtraction, not a different kind of operation.
Unlike with bigger moduli, when operations are done modulo 2, addition and subtraction are the same, so XOR can be used for either addition modulo 2 or subtraction modulo 2.
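Concretely, a quick Python check of the claim above: per bit, (x - y) mod 2, (x + y) mod 2, and x XOR y all coincide:

```python
for x in (0, 1):
    for y in (0, 1):
        assert (x - y) % 2 == (x + y) % 2 == x ^ y
# In particular x ^ x == (x - x) % 2 == 0, which is why both idioms zero a register.
```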
Latency (L) and throughput (T) measurements from the InstLatx64 project (https://github.com/InstLatx64/InstLatx64):
| GenuineIntel | ArrowLake_08_LC | SUB r64, r64 | L: 0.26ns= 1.00c | T: 0.03ns= 0.135c |
| GenuineIntel | ArrowLake_08_LC | XOR r64, r64 | L: 0.03ns= 0.13c | T: 0.03ns= 0.133c |
| GenuineIntel | GoldmontPlus | SUB r64, r64 | L: 0.67ns= 1.0 c | T: 0.22ns= 0.33 c |
| GenuineIntel | GoldmontPlus | XOR r64, r64 | L: 0.22ns= 0.3 c | T: 0.22ns= 0.33 c |
| GenuineIntel | Denverton | SUB r64, r64 | L: 0.50ns= 1.0 c | T: 0.17ns= 0.33 c |
| GenuineIntel | Denverton | XOR r64, r64 | L: 0.17ns= 0.3 c | T: 0.17ns= 0.33 c |
I couldn't find any AMD chips where the same is true.

How much of that advice applies to anything these days is questionable. Back then we used to squeeze as much as possible from every clock cycle.
And cache misses weren’t great but the “front side bus” vs CPU clock difference wasn’t so insane either. RAM is “far away” now.
So the stuff you optimize for has changed a bit.
Always measure!
8 'sub al, al', 14 'sub ah, ah', 3 'sub ax, ax'
26 'xor al, al', 43 'xor ah, ah', 3 'xor ax, ax'
edit: checked a 2010 bios and not a single 'sub x, x'
In principle, sub requires 4 steps:
1. Move both operands to the ALU
2. Invert the second operand and set carry-in to 1 (two's-complement negation)
3. Add (which internally is just XOR plus carry propagate)
4. Move result to proper result register.
This is absolutely not how modern processors do it in practice; there are many shortcuts. But at least with a pure XOR you don't need the two's-complement conversion or the carry propagation.
Source: Wrote microcode at work a million years ago when designing a GPU.
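A gate-level sketch of those four steps in Python (illustrative only; an 8-bit datapath is assumed, and no real microarchitecture works this literally):

```python
MASK = 0xFF  # assumed 8-bit datapath for illustration

def alu_sub(a, b):
    """a - b as: invert b, then add with carry-in 1 (two's-complement negation)."""
    b = ~b & MASK             # step 2: invert the second operand
    result, carry = 0, 1      # carry-in of 1 completes the negation
    for i in range(8):        # step 3: add, rippling the carry bit by bit
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result             # step 4: latch to the destination register

assert alu_sub(5, 5) == 0     # sub r,r zeroes, but only after the carry ripples
assert (5 ^ 5) == 0           # xor r,r zeroes with no carry chain at all
```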