FRESH

Hacker News

Disassembling terabytes of random data with Zig and Capstone to prove a point

42 points by birdculture

by Dylan16807

0 subcomment

With the static huffman table, the decompressed output is very similar from what you'd get with this algorithm:
```
  loop:
     chance to exit
     chance to output a random byte
     chance to repeat some previous bytes
```
If decompression succeeds, the chance that 95% of the output can disassemble is not very different from random noise, except sometimes it'll get lucky and repeat a valid opcode 120 times so maybe it's slightly higher.
But the chance that 100% can disassemble gets more complicated. It's two factors multiplied together, the chance decompression succeeds, and then the chance that the output can disassemble.
If decompression is successful, the output will have a lot fewer random bytes than 128. The repetitions have some impact, but less impact. So the chance that the output disassembles should be higher than the chance 128 random bytes disassemble.
With DEFLATE in particular, decompression fails 95+% of the time, so the overall odds of decompress+decode suck.
This is partially mitigated by DEFLATE exiting early a lot of the time, but only partially.
If you made a DEFLATE-like decompressor that crashes less and exits early more often, it would win by a mile.
If you made a DEFLATE-like decompressor that crashes rarely but doesn't exit early, it would likely be able to win the 100% decoding test. Some of the random bytes go into the output, some get used up on less dangerous actions than outputting random bytes, overall success rate goes up.

by kazinator

2 subcomments

In common anecdotal experience with disassembling code, it is very common for data areas interspersed with code (like string literals) to disaassemble to instructions, momentarily causing the human to be puzzled: what is this repetition of five "or" instructions doing here referencing registers that would never be arguments?
The reason is that the opcode encoding is very dense, and has no redundancy against detecting bad encodings, and usually no relationship to neighboring words.
By that I mean that some four byte chunk (say) treated as an opcode word is treated that way regardless of what came before or what comes after. If it looks like an opcode with a four-byte immediate operand, then the disassembly will pull in that operand (which can be any bit combination) and skip another four bytes. Nothing in the operand will indicate "this is a bad instruction overall".

by kazinator

1 subcomments

But look, it almost looks as if the Static Huffman (a simpler encoding of compression with fewer decoding errors) almost bears out a certain aspect of the friend's intuition, in the following way:
* only 4.4% of the random data disassembles.
* only 4.0% of the random data decodes as Static Huffman.
BUT:
* 1.2% of the data decompresses and disassembles.
Relative to the 4.0% decompression, 1.2% is 30%.
In other words, 30% of successfully decompressed material also disassembles.
That's something that could benefit from an explanation.
Why is that, evidently, the conditional probability of a good disassemble, given a successful Static Huffman expansion, much higher than the probability of a disassemble from random data?

by swisniewski

0 subcomment

Another interesting thing… random data has a high likely hood of disassembling into random instructions, but there’s a low probability that such instructions (particularly sequences of such instructions) are valid semantically.
For example, there’s a very high chance a single random instruction would page fault.
If you want to generate random instructions and have them execute, you have to write a tiny debugger, intercept the page faults, fix up the program’s virtual memory map, then re-run the instruction to make it work.
This means that even though high entropy data has a good chance of producing valid instructions, it doesn’t have a high chance of producing valid instruction sequences.
Code that actually does something will have much much lower entropy.
That is interesting…even though random data is syntactically valid as instructions, it’s almost certainly invalid semantically.

by mfcl

2 subcomments

Why the AI disclosure? Is it just for the author to make sure the readers know they are AI-skeptic and use the opportunity to link to another article, or would there be something wrong with the proof had AI been used to help write the code?
(By help I mean just help, not write an entire sloppy article.)

by 0x1ch

1 subcomments

I believe this is the third or fourth posting of this article in the last week.