The central problem of cryptology is to prevent inference about either the key or the plaintext, despite the requirement to be able to reconstruct the plaintext from the ciphertext+key. So ciphers have to almost perfectly mix information.
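A quick way to see "almost perfectly mix" concretely, with SHA-256 standing in for any well-mixed primitive: flip one input bit and about half the output bits change.

```python
import hashlib

# Avalanche effect: one flipped input bit should change ~half the output bits.
m1 = b"hello world"
m2 = bytes([m1[0] ^ 1]) + m1[1:]   # same message, one bit flipped
h1 = hashlib.sha256(m1).digest()
h2 = hashlib.sha256(m2).digest()
flipped = sum(bin(a ^ b).count("1") for a, b in zip(h1, h2))
print(flipped, "of 256 output bits changed")   # expect roughly 128
```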
Machine learning is possible because, in the absence of perfect mixing, inference is possible (given many input/output pairs), even if the information sits many decibels below the noise. So the information about which parameters need changing is present in the output despite many subsequent layers of processing. This means a lot of mixing can be tolerated, and it's needed: you don't know in advance what the data flow should look like in detail, so the NN has to provide as many options as possible.
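A toy illustration of the decibels point: a constant signal at roughly -26 dB relative to Gaussian noise is invisible in any single sample, but given enough samples the estimate converges on it (numbers here are just illustrative).

```python
import random

random.seed(0)
signal = 0.05                  # signal power ~ -26 dB relative to the noise
n = 100_000
samples = [signal + random.gauss(0.0, 1.0) for _ in range(n)]
estimate = sum(samples) / n    # estimation error shrinks like 1/sqrt(n)
print(estimate)                # ~0.05, recovered from far below the noise floor
```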
There's a lot of multiplication of numbers in parallel, so it makes sense to try to fit that to matrices.
Cryptography is built bottom-up, but likewise it makes sense to exploit data structures that already exist in silicon.
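E.g., in numpy: a layer of neurons, each taking a dot product with the same input, is exactly one matrix-vector multiply, which is the shape the silicon is optimized for.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # 4 neurons, 3 inputs each
x = rng.standard_normal(3)

per_neuron = [W[i] @ x for i in range(4)]   # many multiplications in parallel
batched = W @ x                             # the same thing as one matmul
assert np.allclose(per_neuron, batched)
```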
I've already finished Cryptography I on Coursera. Can't recommend it enough.
Carcinization is usually explained as the result of a kind of “shared condition.” After reading this article, that is what I felt here too.
In other words, from the perspective of shared conditions, isn’t it possible that systems come under similar pressures whenever they need to mix information?
1. There is a state space.
2. Each part of the input affects many parts of the output.
3. A simple rule is not enough, so nonlinearity becomes necessary.
4. But the hardware cannot be allowed to stall, so the system evolves toward a structure where simple transformations are repeated many times.
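A minimal sketch of that shared shape, with illustrative numbers (this is neither a real cipher round nor a real layer, just the common skeleton):

```python
# Repeat a cheap rule many times: mix linearly, then apply a simple nonlinearity.
def round_fn(state, mix, nonlinear):
    return [nonlinear(sum(m * s for m, s in zip(row, state))) for row in mix]

mix = [[0.5, -1.0], [1.0, 0.25]]   # the linear "diffusion" step
relu = lambda v: max(v, 0.0)       # the simple nonlinearity

state = [1.0, 2.0]
for _ in range(8):                 # many repetitions of the same simple transformation
    state = round_fn(state, mix, relu)
print(state)
```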
Ultimately, even across different fields, the core question is how to decompose complexity into atomic units. The choice of those units tends to converge under the pressures imposed by the underlying substrate. This seems to be the central thesis of the article.
This feels similar to how humans solve nonlinear differential equations.
If so, perhaps the structure of human cognition itself works in a similar way: when facing nonlinearity, we break it into smaller structures and design around those smaller parts.
Because my academic background is limited, I find it difficult to express this properly. But I think this kind of pressure also applies to programming and software theory.
Software engineering, too, often starts from the smallest element that does not change easily, and then builds larger systems by composing those elements. In OOP, that unit is the conceptual object. In FP, it is the function. In DOP, it is data.
FP is mathematical. DOP is aligned with the data that computers store and transmit. OOP is connected to our abstract model of the world. That may be why different people are good at different paradigms.
OOP compresses the world into objects and responsibilities. FP compresses the world into functions and composition. DOP compresses the world into data and transformable structures. Ultimately, it is a question of how we cut complexity, and what we choose as the minimal unit of decomposition.
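A toy example of what I mean (hypothetical names, just to show the three cuts of the same computation):

```python
from dataclasses import dataclass

# OOP: the unit is an object with a responsibility.
@dataclass
class LineItem:
    price: float
    qty: int
    def total(self) -> float:
        return self.price * self.qty

# FP: the unit is a function; larger behavior is composition.
def total(price: float, qty: int) -> float:
    return price * qty

# DOP: the unit is plain data; logic is a transformation over it.
order = [{"price": 2.0, "qty": 3}, {"price": 5.0, "qty": 1}]
dop_total = sum(item["price"] * item["qty"] for item in order)

assert LineItem(2.0, 3).total() == total(2.0, 3) == 6.0
```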
Then what should this idea be called? And if we apply this to AI coding, what would it imply?
I have thoughts, but because I did not study enough, I feel frustrated that I cannot express them more fully. I wish I had learned more.
I think the contrast is more interesting: exact discrete trajectories in cryptography versus approximate continuous function approximation in neural networks.
In cryptography, you usually want a state space so large that nobody can accidentally find, reconstruct, or predict the same path you took.
In neural networks, you want an immense initial search space because NNs need to model the real world, which is highly complex and contains patterns that appear unpredictably. One aspect I think is often overlooked is that NNs are mostly deletive: they start with a very broad representational space and become progressively more specific by discarding what the NN perceives as irrelevant distinctions.
I think this puts the article's point about complexity and mixing in a clearer light. The same class of procedures achieves almost opposite effects. In neural networks, you want mixing so the model can approximate many possible paths at once. In cryptography, you want mixing so the path taken is unpredictable and hard to trace. The key difference is that, for NNs, an approximate path can be good enough. In cryptography, an approximate path is as useless as a very distant one.
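To make that contrast concrete (a toy sketch: SHA-256 stands in for any well-mixed keyed primitive, and the "model" is just a dot product):

```python
import hashlib

# Exact regime: a one-bit key change decorrelates the output completely.
k1 = b"\x00" * 16
k2 = b"\x01" + b"\x00" * 15        # "nearby" key, one bit away
bits = sum(bin(a ^ b).count("1")
           for a, b in zip(hashlib.sha256(k1).digest(),
                           hashlib.sha256(k2).digest()))
print(bits, "of 256 bits differ")  # ~128: the close key is as useless as a distant one

# Approximate regime: slightly wrong weights still give a nearly right answer.
w_true   = [2.0, -1.0, 0.5]
w_approx = [2.01, -0.99, 0.51]     # an "approximate path"
x = [1.0, 3.0, 2.0]
dot = lambda w, v: sum(a * b for a, b in zip(w, v))
print(dot(w_true, x), dot(w_approx, x))   # 0.0 vs ~0.06: close is good enough
```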
E.g. graphs, matrices, linear-algebra systems, digital photos (and much more) can all be losslessly transformed into one another; put another way, these are all different ways to look at the same data. (This is also not just hypothetical. * 2)
As a side effect, saying things are similar "because they are just matrix algorithms" is meaningless, because most things are "just a matrix algorithm". (It's also meaningful for the same reason: it means you can transform many problems into ones with well-understood solutions.)
And the high-level "encoder -> state -> decoder" abstraction is another of those "too generic/meaningless" things. Since the state can be anything, encoding is just "process input to generate state", and decoding is just "generate output from state", we can model most algorithms with it. The identity function becomes `encode(input): state=input; decode(state): output=state` (and indeed, as long as the state is large enough w.r.t. the input (or unbounded), you can train an encoder-decoder network to do exactly that, as meaningless as it seems).
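Spelled out (the triviality being the point):

```python
# Any function fits the pattern once "state" is unconstrained.
def encode(x):        # "process input to generate state"
    return x
def decode(state):    # "generate output from state"
    return state

assert decode(encode("anything")) == "anything"   # the identity, dressed up
```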
Similarly, you can treat everything as everything within cryptography itself: hashes, PRNGs, and stream ciphers can all be easily (kinda, * 3) built from one another, and like most algorithms in existence they can be formulated as "consume data, then produce output", i.e. as the encoder-decoder pattern ;)
So IMHO it's mostly a combination of observer bias due to how we like to model things (from a very high-level POV), and a construction bias following from that plus what computers can compute well (from a slightly less high-level POV, looking at the somewhat "arbitrary choices").
I know some of the examples here might sound a bit ridiculous, but they capture some of the most important insights in CS:
- a lot of algorithms can be transformed into (or modeled as) a lot of other algorithms; take advantage of that
- just because two things look alike, it neither means they are alike nor that the resemblance has any deeper meaning (e.g. two graphs might look alike, but only for the subset of sample data you happened to use to plot them)
---
(* 1): Anything related to quantum physics seems to be especially badly affected by this.
(* 2): Your navigation system might navigate by mapping the navigation graph to a matrix and then computing on the matrix, where some of the steps involved might treat it as a linear-algebra system solved using some fast approximation.
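A toy sketch of that mapping (not a real routing engine): shortest paths on a tiny road graph via repeated "min-plus" matrix products.

```python
import math

INF = math.inf
dist = [             # dist[i][j] = direct edge weight, INF if no edge
    [0,   4,   INF],
    [4,   0,   1  ],
    [INF, 1,   0  ],
]

def min_plus(a, b):  # one "multiplication" extends shortest paths by one hop
    n = len(a)
    return [[min(a[i][k] + b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

paths = dist
for _ in range(len(dist) - 1):
    paths = min_plus(paths, dist)
print(paths[0][2])   # 5: the 0 -> 1 -> 2 route, found with pure matrix ops
```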
(* 3): At least some of the conversion directions are easy; some are a bit less intuitive. This also assumes you either have perfect properties on the building block you use, or don't have to estimate how its imperfect properties map onto the properties of the resulting construct... Oh, and naturally you can use some simple operations in the transformation (add, xor) as long as they are used in a straightforward way. E.g. PRNG(seed, offset) = HASH(encode(seed, offset)), with encode being a bijective pairing function (e.g. `bytes_128bit_le(seed)+bytes_64bit_le(offset)`). For encryption, you chunk the data, XOR every chunk with a one-time key, and generate that key as HASH(encode(key, nonce, chunk_idx)), or PRNG(encode(key, nonce), chunk_idx). That is how AES-CTR/AES-GCM do encryption: they (roughly) use XOR(AES(key, encode(nonce, offset)), chunk). (Yes, AES is used more like a hash than a block cipher in most modern AES modes...)
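To make that concrete, a sketch of the HASH(encode(key, nonce, chunk_idx)) construction in Python (illustrative only; don't roll your own crypto like this for real use):

```python
import hashlib

def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    # encode(key, nonce, chunk_idx) via fixed-width fields -> a pairing function
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "little")).digest()

def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(data), 32):              # 32-byte chunks (SHA-256 output size)
        chunk = data[i:i + 32]
        ks = keystream_block(key, nonce, i // 32)  # one-time key per chunk
        out.extend(c ^ k for c, k in zip(chunk, ks))
    return bytes(out)

key, nonce = b"k" * 16, b"n" * 8
msg = b"attack at dawn"
ct = ctr_xor(key, nonce, msg)
assert ctr_xor(key, nonce, ct) == msg   # XOR keystream: encrypt == decrypt
```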