# Where does Information Lie?
We start by asking [Does Linearity provide Information](Does%20Linearity%20provide%20Information.md)?
Now let us consider an information-theoretic perspective. Traditionally, Shannon entropy is defined as the expected *surprisal* of observing a random variable $X$ with distribution $P(X)$.
Consider the case that $P(X)$ is a uniform distribution. We generally state that the *information* in $X$ is maximal (we have maximum entropy). More specifically, we receive maximum information upon *making an observation*.
Now, consider *being told* that $P(X)$ follows the distribution $P(X=x_1) = 0.99, P(X=x_2) = 0.01$. In this case, we just received a great deal of information about $X$! We were given details about the *distribution* of $X$, which was incredibly informative.
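To put numbers on both cases, recall the definition and evaluate it for a uniform $X$ over $n$ outcomes and for the skewed two-outcome distribution above:

$$H(X) = -\sum_x P(X=x)\,\log_2 P(X=x), \qquad H_{\text{uniform}} = \log_2 n, \qquad H_{\text{skewed}} = -0.99\log_2 0.99 - 0.01\log_2 0.01 \approx 0.08\ \text{bits}.$$

So once we are told the skewed distribution, an actual observation of $X$ carries almost no additional surprise.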
### Recap
Okay so at this point we have looked at transformations as well as random variables. For transformations:
* We can consider the information contained in the statement "$f$ is linear", *given* that we do not know anything about $f$. In this case, we gain information.
* We can consider the information contained in seeing a particular value of $f(x)$ (an observation), given that we know the function is linear. In this case we gain little information: knowing $f$ is linear already told us nearly everything about it, so subsequent observations aren't very revealing. On the other hand, a nonlinear function could be thought of as storing more information, since we need far more "bits" to describe it. (A small sketch after this recap makes the linear case concrete.)
And we looked at information theory:
* We can consider the information contained in *learning* that a random variable $X$ has the distribution $P(X=x_1) = 0.99, P(X=x_2) = 0.01$. Here we gain information with this knowledge.
* Likewise, we can consider the information gained in observing a particular value of $X$ *given that we already know the distribution*. In this case (based on the distribution above), the observation provides us with very little information.
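Here is that linear-function bullet made concrete: a minimal sketch, assuming $f$ is the linear map given by an arbitrarily chosen $2 \times 2$ matrix. Two observations (the values on the basis vectors) pin $f$ down completely, so any further observation is predictable.

```python
# A minimal sketch: f is the linear map given by an arbitrarily chosen 2x2 matrix A.
A = [[2.0, 3.0],
     [1.0, 5.0]]          # the "true" f, hidden from the observer

def f(x, y):
    return (A[0][0] * x + A[0][1] * y,
            A[1][0] * x + A[1][1] * y)

# Observe f at (1, 0) and (0, 1): these two values are the columns of A.
col0 = f(1.0, 0.0)
col1 = f(0.0, 1.0)

def f_reconstructed(x, y):
    # Linearity alone lets us predict f anywhere from the two observations.
    return (col0[0] * x + col1[0] * y,
            col0[1] * x + col1[1] * y)

print(f(13.0, 15.0))               # (71.0, 88.0)
print(f_reconstructed(13.0, 15.0)) # (71.0, 88.0): no new information in the observation
```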
### Is Information Fundamentally Relative?
So, depending on how we look at things, a linear function can be seen as containing either a great deal of information or very little. Likewise, a distribution can be thought of as containing a lot of information or very little.
How do you tend to think about this?
### TLDR
There's a bit of tension between
- a function as a store of information: if you know its value at a few points, will it tell you the rest? (this is the case with a linear function; the language-level concept itself stores a lot of information)
- the subset a function describes (in the case of the linear function, the information stored here is actually very low, since it is easy to describe!)

So a noisy function, which I'd want to think of as "low information", actually describes a very high-entropy subset.
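A back-of-the-envelope count makes the tension concrete (toy domain: functions from $\mathbb{Z}_n$ to $\mathbb{Z}_n$):

$$\#\{\,x \mapsto ax+b\,\} = n^2 \;\Rightarrow\; 2\log_2 n \text{ bits to specify}, \qquad \#\{\,\text{arbitrary } f\,\} = n^n \;\Rightarrow\; n\log_2 n \text{ bits to specify}.$$

For $n = 16$ that is 8 bits versus 64 bits: the "easy to describe" linear family is a vanishingly small, low-entropy corner of the space of noisy lookup tables.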
### An interesting read: Kolmogorov complexity and PCA
* http://www.theorangeduck.com/page/machine-learning-kolmogorov-complexity-squishy-bunnies
* Commentary: https://news.ycombinator.com/item?id=26026182
### Thoughts from Justin Waugh
okay - re-read the doc in a bit more detail
I think there's (from the literature) at least one topic worth going over, to be careful about measuring information in machines -> specifically, languages and automata (foundational / historic CS stuff).
[https://en.wikipedia.org/wiki/Automata_theory](https://en.wikipedia.org/wiki/Automata_theory)

You had, on page 2 of this, a question about "carrying out computation", which is largely about what machine could actually 'compute' the value of your linear function.
Starting with Turing machines (the generalization), the way you can measure 'information' in a compute sequence is by looking at the Turing machine that computes the answer you want, and measuring the size of that machine's state and tape and its 'time complexity'.

However, that entirely depends on the choice of a bunch of "input" and "output" definitions. E.g. is it that you want a machine that, given the (x, y), will update them to give you the new `(x_new, y_new)`? Or do you want a function that, given a set of 2D space transformation functions, reduces them to approximate linear forms or something (e.g. think Wolfram Mathematica, which takes functions/language as its "input" and outputs other language/math via reductions)?

Given one interpretation of the question (you want to represent "classes" of functions (e.g. linear, squared, non-linear, etc.) that operate on 2D points (x, y), and carry out the computation on those points to get "where points end up"), you then have 2 inputs to your machine, plus some "algorithm" that actually does the math to compute the result using some basic set of operations. On raw Turing machines this might be pretty long; on some pushdown automaton for linear functions you might be able to get away with just a handful of gates, etc.; it really depends on the target architecture what 'optimal' would be. In this space of "functions" you can write that are interpreted by a specific machine, there are some functions that can only be written out or computed _by_ Turing machines, and it's also possible to have functions for which it is unknown whether or not they will stop (use finite compute time, and halt).

I think, probably interestingly, there is an 'enumeration' question for a given 'machine' (which is leading into some of the Universal Turing Machine style questions): given a machine that uses ~100 bits to store an (x, y), and then N (let's say 1000) bits to store an 'algorithm', what is the "space" of accessible functions?
E.g. I'm using a mental language in which there exists some string of length N, `______________`, that is 'considered the input' to the machine. `d213970dac3` might be interpreted (by a specific machine, with a specific instruction set) as "take 13 as x, take 15 as y, apply the matrix `[[2, 3], [1, 5]] * [x, y]`, and output the result", which then is put on the tape somewhere.
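A toy version of such a "specific machine with a specific instruction set" in Python; the bit layout (two 8-bit inputs followed by four 4-bit matrix entries) is entirely made up for illustration:

```python
def run_program(bits: str):
    """Interpret a fixed-length bit string as a tiny 'program':
    two 8-bit inputs x and y, then four 4-bit matrix entries a, b, c, d,
    and 'execute' it by applying the matrix [[a, b], [c, d]] to (x, y)."""
    assert len(bits) == 8 + 8 + 4 * 4
    x = int(bits[0:8], 2)
    y = int(bits[8:16], 2)
    a, b, c, d = (int(bits[16 + 4 * i: 20 + 4 * i], 2) for i in range(4))
    return (a * x + b * y, c * x + d * y)

# "take 13 as x, take 15 as y, apply the matrix [[2, 3], [1, 5]]"
program = (format(13, "08b") + format(15, "08b")
           + "".join(format(v, "04b") for v in (2, 3, 1, 5)))
print(run_program(program))   # (71, 88)
```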
You could then just enumerate all possible strings (e.g. in Python, with ASCII-encoded files, you can enumerate all .py files that are < 128 bits, or 16 characters, long) and check them for properties (like whether they run at all, etc.).
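A minimal sketch of that enumeration, scaled down so it is feasible and guaranteed to halt: the alphabet and the 3-character cap are arbitrary choices, and the alphabet deliberately cannot express loops, so every candidate terminates.

```python
import itertools

# Enumerate every string up to length 3 over a tiny arithmetic alphabet and
# check which ones "run at all" as Python expressions.
ALPHABET = "0123456789+-* "

def runs(src: str) -> bool:
    try:
        eval(compile(src, "<candidate>", "eval"), {"__builtins__": {}}, {})
        return True
    except Exception:
        return False

valid = 0
total = 0
for length in range(1, 4):
    for chars in itertools.product(ALPHABET, repeat=length):
        total += 1
        valid += runs("".join(chars))

print(f"{valid} of {total} candidate programs evaluate without error")
```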
In this 'enumerate from low size towards high size' ordering, I think we can sort of instinctively feel that there are "domains" of complexity that show up.
I feel, surprisingly, that quite low in this space sit iterative / recursive linear functions that lead to very non-trivial answers, e.g. fractals. They are iterated simple math, up to some finite size, followed by a final check.
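As a concrete instance of "iterated simple math, up to some finite size, followed by a final check", here is the standard Mandelbrot membership test (the choice of example and the iteration cap of 50 are mine, not from the discussion above):

```python
def in_mandelbrot(c: complex, max_iter: int = 50) -> bool:
    """Iterate the simple map z -> z*z + c a finite number of times,
    then keep c only if the orbit stayed bounded (the final 'check')."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:      # escaped: fails the check early
            return False
    return True

print(in_mandelbrot(-0.5 + 0.5j))  # True: inside the set
print(in_mandelbrot(1.0 + 1.0j))   # False: escapes within a couple of steps
```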
But even lower, it feels like the "low-information space" contains some really trivial "pass through" and "raw filter" programs (e.g. think of no-ops, or simple 'if you see a 7, replace it with 3' single-pass algorithms). Then there's a 'range' of things that are all mostly 'linear' functions (e.g. think of the simple 2D matrix multiply). Then there's a set of functions that start to look like fractals and more complex "iterated" solutions with good convergence (using the language space to describe a 'branch' operation, and then doing loops and conditionals -> which is necessary for stuff like `sin` or fractals).

All of that said, I want to come back to the 'relative' question at the end, because I think that is important. It's the case that in some languages (e.g. something that can be interpreted by a specific machine with a sufficient "instruction set") some of these operations might be "very light" in that language: not requiring loops, simply storing their algorithm in a few bits that the system itself knows how to interpret. There's a question of comparative information between different programs _in that language_, which seems to give a different answer than comparative information between those same programs converted to another instruction set for another machine. E.g. on an old PIC, it might take 4 kB to represent an AES encryption algorithm, and then different programs that are "close" in some higher-level language that has AES as one step are suddenly very different on the PIC. In the 'high level' language you can say `aes(value)+12` vs. `sum(value)+12`, for instance, which 'feel close' in that language. But in the PIC language those might be 'very far apart', and of seemingly different degrees of information.

So, given that, there is a question (?), a bit in the abstract: what is the information stored in "a universal" language, or is there some "low level" language such that, if you can write an algorithm in it, it represents the actual 'raw' information of the process? And to answer that I feel like falling back to: Kolmogorov complexity + Turing machines. The idea of the simplest universal Turing machine, e.g. what Stephen Wolfram chased: [https://en.wikipedia.org/wiki/Wolfram%27s_2-state_3-symbol_Turing_machine](https://en.wikipedia.org/wiki/Wolfram%27s_2-state_3-symbol_Turing_machine).
For that 2-state, 3-symbol Turing machine, there is a "language compiler" that takes in the 'bits' for your program (e.g. `aes(value)+12`) and then, written in the 2,3 TM itself, converts that into its own 'interpreted' language and executes the algorithm. The question (from Kolmogorov) is: how "big" is that interpreter program? And the answer (in kinda non-satisfying form) is that it can be seen as a constant. There is a "constant sized" interpreter written in B that can take in some fixed language A and run it on machine B.
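This is the invariance theorem for Kolmogorov complexity: for any two universal languages/machines $A$ and $B$ there is a constant $c_{A,B}$, independent of the object $x$ being described, with

$$\big|\,K_A(x) - K_B(x)\,\big| \;\le\; c_{A,B},$$

where $c_{A,B}$ is essentially the length of an interpreter for one language written in the other.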
I think the _next_ question is: given a problem or algorithm to be solved (e.g. 2D mappings), what is the "minimum language", as a fixed language, that you need in order to represent all the operations you want, such that you can then write a minimal interpreter to execute that algorithm? If you can chase not just minimization of an algorithm in a single language, but minimization of the 'class of the problem you are solving' into 'constraints on the complexity of the choice of language', I think then you are chasing the "fundamental" (non-relative) information of an algorithm/function/process.

In terms of how to go about 'choosing a fixed language that solves your problems', I think there's a way to look at this in terms of the information flowing through the program itself, and enumerating all possible paths. E.g. assuming we have a problem that halts, there exists some enumerated space of inputs (say 32 bits for two 16-bit numbers, x and y, written in float16) and of 32-bit outputs. There exists "only" a finite number of operations that could possibly exist (the number of functions between two finite sets), and that "space" of functions could be enumerated -> which gives a max size. But it can also be "clustered" and reduced: choose an encoding for the language. E.g. depending on the pdf of which functions _will_ be called in the language (similar to the Markov model in the original Shannon paper), you can choose an encoding that serves those clusters / specific functions of 'interest' at lower cost than others; that makes the language 'more efficient' for a specific subset of the entire functional space. This encoding / 'specific subset' / 'clustering' of your language is, as far as I interpret it, "defining symmetries on the space of functions", and has some "Noether theorem" style constraints that lead to universal constants of the language which then 'get factored out' of the cost of the information.
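For scale, the "max size" of that enumerated function space in the 32-bit-in / 32-bit-out case: the number of functions between the two sets is

$$\left(2^{32}\right)^{2^{32}}, \qquad \log_2\!\left[\left(2^{32}\right)^{2^{32}}\right] = 32\cdot 2^{32} = 2^{37}\ \text{bits} \approx 16\ \text{GiB},$$

so writing an arbitrary such function down as a raw table costs about 16 GiB, which is why an encoding that favours the functions you actually expect to call matters so much.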
^^ At this point, I think I have come full circle to the problem/question. Given the 'constraint' of a language that you want to specify, or assume, 'linear' functions, what is the information "cost" of that symmetry on the choice of language? I don't know the exact answer, but I think it _is_ solvable based on the setup for a (16 bit, 16 bit) -> (16 bit, 16 bit) function "space" for a language. If you 'factor out' all linear functions (or rather, consider a language that _only supports_ functions that are linear) as a symmetry on the language, how much does it 'reduce' the phase space, and subsequently reduce the 'information capacity' of the language?

E.g. I think you could estimate the answer for all 'linear' (up to a definition of linear) functions on 4-bit, 4-bit 2D (x, y) inputs and 4-bit, 4-bit 2D (x, y) outputs: just enumerate all possible 8-bit -> 8-bit mappings, interpret the constraint "given x, y -> a, b, does (2x, 2y) map to 2(a, b)?", keep the mappings for which it holds, throw out the rest, and see how many "agree" with the symmetry. Then start scaling out the number of bits, and see if the 'ratio' (number of functions that are linear / number of all possible functions) converges to a constant. If so, that gives a sense of the partitioning 'entropy' of linear functions.
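Enumerating all $256^{256}$ possible 8-bit -> 8-bit tables is out of reach, so here is a shrunken sketch of the same experiment, assuming a 1D toy domain $\mathbb{Z}_4$ (2-bit input, 2-bit output) and using the doubling constraint $f(2x) \equiv 2f(x) \pmod 4$ as the stand-in for "linear"; the same loop structure scales to more bits until enumeration blows up.

```python
import itertools
from math import log2

BITS = 2
N = 2 ** BITS          # toy domain: 2-bit inputs and outputs, i.e. Z_4

total = 0
agree = 0
for table in itertools.product(range(N), repeat=N):   # all N^N lookup tables
    total += 1
    # keep the table if it respects the doubling symmetry:
    # f(2*x mod N) == 2*f(x) mod N for every x
    if all(table[(2 * x) % N] == (2 * table[x]) % N for x in range(N)):
        agree += 1

print(f"{agree} of {total} mappings agree with the symmetry")
print(f"the symmetry 'costs' about {log2(total / agree):.1f} bits")
```

Past a few bits the $N^N$ enumeration explodes, so to chase the limit of the ratio you would count the constrained tables combinatorially rather than enumerate them.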