# Content Addressability in Fabric
### My Initial Question
I am reading back over the notes of our conversation and another thing you mentioned was:
> Because fabric keeps this universal addressability, it means that it can even address itself. **_In that universal addressability of referencing itself it can even change its shape, but still continue to reference itself_**. This is where the two worlds of AI start colliding. At this point you can write graph search things, and you can write embedding generation things, you can gradient descent on a program. It is a different take; all runs through the idea that we are addressing a computational landscape on a graph, that is measured.
This bold/italic sentence really has caught my attention as one of the most powerful components of our conversation (and hence fabric). I have two questions related to this:
1. You talk a lot about _content_ vs. _computably_ addressable systems. Could you touch on the difference, and maybe give a super quick/basic example of how you view fabric as being computably addressable (trying to build up some intuition here, and I always find a concrete example is super helpful on that front)? How does fabric (from an actual implementation standpoint) allow for computable addressability?
2. Based on that, when you mention this _universal addressability_ above, I am interpreting it as: "We have a task graph (that has a particular shape/topology), and fabric is aware of/able to reference this task graph. When we (for example) perform embeddings and annotate the graph, these annotations may then inform us of ways in which the graph should be changed. So we _change_ the graph, but fabric is still able to reference it as _the same thing_." I feel pretty confident that my interpretation is not fully correct, so any clarification here as to how you are thinking about this would be great. Specifically: what does it mean (from an implementation standpoint and at a high level) for fabric to _reference itself_? And how do you think about updating vs. creating in this context? For instance, if a bunch of changes are made to a part of the task graph (it changes its shape), have we created a _new_ graph, or has it just been updated to change its shape? Clearly (from our conversations) I know we are treating this as an _update_ to the graph, not a new graph, but I am just trying to understand _why_. I feel like this is a pretty important distinction for me to understand; the idea of continuous updates and smooth transformations of shape makes me think of topology, yet when I try to think about making large changes to a task graph, at a certain point I stop thinking about it as _updating the same thing_ and start thinking about it as _creating a new thing_.
### Computable Addressability (Answer to question 1)
Content addressable systems (even with relationships) are things like `git`, `blockchain`, etc. The way a content addressable system works is: let's say you have 1000 code files on your computer. You can hash each one, and then use the `md5sum` of the file contents as its 'reference' (address/identifier). If you want to say code file 1 came before code file 2, you can put that in a file, `md5_file1 -> md5_file2`, and then save that file, hash it, and now you have a data structure that's also "content-addressed". This is the foundation of `git` (logically), though in practice git actually makes a few kinds of objects (blob, tree, commit, etc.), each of which is content-addressed and references the others. Blockchain is pretty similar, in that each 'file' is a bunch of state (transactions, for instance), and then the entire thing is "signed" by a salted value that makes its content-address "statistically meaningful" (proof of work, e.g. starts with 12 zeros).
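A minimal sketch of that pattern (the file contents here are made up; `hashlib.md5` stands in for `md5sum`):

```python
import hashlib

# Two "files" addressed purely by their content.
file1 = b"def load(path): ..."
file2 = b"def transform(df): ..."
addr1 = hashlib.md5(file1).hexdigest()
addr2 = hashlib.md5(file2).hexdigest()

# A relationship ("file1 came before file2") is itself just content,
# so it gets its own content address too -- the git pattern.
link = f"{addr1} -> {addr2}".encode()
link_addr = hashlib.md5(link).hexdigest()

# The same content always yields the same address.
assert addr1 == hashlib.md5(b"def load(path): ...").hexdigest()
```

The point is that the relationship file is addressed the same way as the data it relates, so structure and content share one addressing scheme.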
In fabric, content addressability would be a `content-hash` of an attribute. Attributes are, definitionally, un-ordered key-value stores; they are hashmaps. A content hash of a hash-map is a bit harder than of a file, but you can reduce with order-invariant ways of combining hash functions, like `xor` or `+`. E.g. hash each key and each value, concat those, hash that. Now you have a unique content hash for each "element"; then run a `+` or `xor` aggregation across the whole space of keys, and that is a "content hash".
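A sketch of such an order-invariant hash, assuming SHA-256 as the per-pair hash and XOR as the aggregation (both are illustrative choices here, not fabric's actual scheme):

```python
import hashlib

def content_hash(attribute: dict) -> str:
    """Order-invariant content hash of a key-value store (a sketch).

    Hash each (key, value) pair to an integer, then combine with XOR,
    which is commutative -- so insertion order doesn't matter.
    """
    acc = 0
    for key, value in attribute.items():
        pair = f"{key}={value}".encode()
        acc ^= int.from_bytes(hashlib.sha256(pair).digest(), "big")
    return f"{acc:064x}"

# Same key-value content, different insertion order -> same hash.
assert content_hash({"a": 1, "b": 2}) == content_hash({"b": 2, "a": 1})
```

Because XOR is commutative and associative, the aggregation can also be done in parallel across shards of the key space and combined at the end.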
That content hash of data (which data versioning systems, etc. are built on) is _different_ than a compute hash of data. The "compute" hash of data would be a hash of the language that produced it. E.g. the SQL query, the python files, etc. (or more generally, the actual bytecode + choice of machine).
What fabric does is build up a DAG of compute-hashes that follow functional operations, and call that the ID of the object. E.g.: given nothing, run "1271", followed by "5d12", followed by "3317". That produces "55ad". This new `55ad` can only be 'arrived at' via construction of those same operations. Just like how, in content addressability, you can only arrive at that value with exactly the same content in the data, in this case you can only arrive at the exact same attribute uid via the same "compute lineage": the exact same operations had to be done in a specific way on the same sets of things in the DAG.
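A toy version of that chaining, reusing the hypothetical op hashes from above ("1271", "5d12", "3317"); the hash function and separator are illustrative, not fabric's actual construction:

```python
import hashlib

def step(parent_uid: str, op_hash: str) -> str:
    """Derive a new UID by hashing the parent UID together with an op hash."""
    return hashlib.sha256(f"{parent_uid}:{op_hash}".encode()).hexdigest()

# "Given nothing", apply the three operations in sequence.
uid = ""
for op in ["1271", "5d12", "3317"]:
    uid = step(uid, op)

# Replaying the exact same lineage arrives at the exact same UID...
replay = ""
for op in ["1271", "5d12", "3317"]:
    replay = step(replay, op)
assert uid == replay

# ...while reordering the operations gives a different UID.
other = ""
for op in ["5d12", "1271", "3317"]:
    other = step(other, op)
assert uid != other
```

Chaining each UID into the next hash is what makes the ID a function of the whole lineage, not just the last operation.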
This pattern of hashing the language _and_ the data for versioning is not completely novel, but we are doing it at a granular level. E.g. it is possible for you to "version your data's lineage" (truthfully) by just content hashing your entire code-base, giving you the "code hash", and then storing a file with "command x on data y with code-base z" and hashing that. That would work. But! You lose relationships. Any change anywhere, and you don't know "what edges mean"; you can't attach language to the differences in hashes that come out. There could have been 5 unrelated webapp code changes that affected your compute+content hash for a specific unrelated algorithm. It's not "minimal in its description".
This "minimal in its description" is where the DAG of fabric tasks comes in. Similar to how content hashes are actually computed via merkle trees / hash trees, which allow you to "diagnose where the difference is", our "compute hash" is a bit of a tree structure (DAG) of computable "functions". Leaning towards the "functional programming" paradigm, each task is something that takes in a typed set of inputs and outputs a new typed object. By minimizing the scope to single attributes and single tasks, we can hash "just the code that runs to get from A to B", and then we can compositionally combine the code for arbitrary graphs up to an arbitrary result. But if, at a later time, someone shows you a different UID, you can look at the fabric "complete graph", find "where does this fit in, how do the graphs overlap, etc.", and get a language-based difference. Also, if someone comes in and wants to change 1 feature in their final output table, from TakeOne to TakeOneSorted, but leave everything else the same: everything else is already "the same", no new lineage has to be taken into account, no special diffs must be made, and only that 1 attribute "sees its difference" in the graph.
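A sketch of that minimal-change property, with made-up task code strings; the point is that swapping `TakeOne` for `TakeOneSorted` changes only that one attribute's UID, while all the shared upstream lineage keeps its addresses:

```python
import hashlib

def task_uid(parent_uids, code: str) -> str:
    """Compute-hash of a task: a hash over its input UIDs plus its own code."""
    payload = ",".join(sorted(parent_uids)) + "|" + code
    return hashlib.sha256(payload.encode()).hexdigest()

# Shared upstream lineage (the code strings are hypothetical).
load = task_uid([], "load_parquet('/app/work/X')")
grouped = task_uid([load], "groupby('Y')")

# Two sibling features built on the same upstream lineage.
feat_a = task_uid([grouped], "TakeOne")
feat_b = task_uid([grouped], "Count")

# Swap TakeOne -> TakeOneSorted: only that attribute's UID changes.
feat_a2 = task_uid([grouped], "TakeOneSorted")
assert feat_a2 != feat_a                             # new address for the change
assert task_uid([grouped], "Count") == feat_b        # sibling is untouched
assert task_uid([load], "groupby('Y')") == grouped   # upstream is untouched
```

Only the code that flows into a given attribute's lineage participates in its hash, so unrelated edits elsewhere in the graph cannot disturb it.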
This "specificity" is not yet at the most granular level I want it to be: all the way down to bytecode. But right now it has a proxy of "functions on attributes", which I think is a pretty close approximation that works in distributed environments. (Attribute being effectively synonymous with distributed hash table / key-value-store, or even pub/sub message queue (which can be represented as a KV store), etc.)
### Answer to question 2
Effectively, the key point is that "by construction, you arrive at the same attribute UIDs regardless of where you are". E.g. you could be on Mars and say "what if I loaded column Y from parquet X from /app/work, applied a groupby, took one, and then looked at the count attribute". That attribute uid might come out to be '1511'. This value is universal. Any reference to `1511` in the entire universe (that follows fabric's routine) is talking about the same concept: "unique count of column Y from parquet X". So it doesn't matter if you are an algorithm on machine X or Y; if you find a reference to something interesting, you can broadcast just the identifier (1511), and others, if they have ever 'walked to that part', will know what you are talking about.
So, when we are adding content to the graph in fabric, as you noted, we are "updating" the graph, and not creating a new one. That's because the "entire fabric graph" already exists. By construction of the algorithms of fabric, we have actually already 'addressed' all space; we just don't know the lookups for each address. (Lookups, because there are also collisions, since we mod down to 128 bits.)
So, now the question is: how does it "reference itself"? Fabric references itself (in a thought experiment) by imagining the data that fabric itself produces / state related to it. There are 3 main sources I often consider:
1. Take the fabric database (postgres db with ~5 tables, attribute, task, semantics, encodings, small-value-cache, etc.)
2. Take logs (e.g. run logs that output `attribute_uid{x}, ...` in the log, etc.)
3. Take the code-base itself (on github, python AST)
Given data-source (1), you can imagine exporting those tables to parquet files, then doing a `sf.View.load('Task').unique_on('attribute_uid')`. This is now an "Attribute Perspective" in fabric itself. The tasks here 'reference' loading the full set of attributes. E.g. you might have `TaskName` as a column, and you could, for any given fabric, compute `attributes['TaskName'].value_counts()` at that spot in time.
But that's not all: you can also merge in (via the attribute uid of the perspective) things like the actual code for a task, or, via logs, the 'run-time' of the task, etc. E.g. imagine `run_log.groupby('TaskName')['time_duration'].median()`, and then you have 'profiled statistics' about how long each task (by task name) takes, in median, to run.
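A sketch of that enrichment with pandas; the column names `attribute_uid`, `TaskName`, and `time_duration` come from the text above, but the table contents here are invented:

```python
import pandas as pd

# Hypothetical exports of fabric's own state: the task table (source 1)
# and run logs (source 2).
tasks = pd.DataFrame({
    "attribute_uid": ["1511", "55ad", "3317"],
    "TaskName": ["Count", "TakeOne", "Count"],
})
run_log = pd.DataFrame({
    "attribute_uid": ["1511", "55ad", "3317", "1511"],
    "time_duration": [0.5, 2.0, 0.7, 0.9],
})

# Enrich the task perspective with self-produced runtime data.
enriched = run_log.merge(tasks, on="attribute_uid")
profile = enriched.groupby("TaskName")["time_duration"].median()
print(profile)

# Source (1) alone already gives counts per task name:
counts = tasks["TaskName"].value_counts()
```

Each new self-produced source merges in on the same universal `attribute_uid` key, which is what makes the enrichment compositional.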
This 'enrichment' of the task/attribute perspective, from data that is self-produced (1, 2 and 3 above), adds to the embeddings that you can compute on the attributes. This leads to an embedding space / representation that is both explainable and trainable by things like reinforcement agents.
---
Date: 20211023
Links to: [Content-Addressable-Systems](Content-Addressable-Systems.md)
Tags:
References:
* []()