# Fabric Theory Overview
### Justin's answer (recap)
#### What problem does fabric solve for JW?
* There is no content-addressable mechanism to universally identify a piece of data and see that data point relative to other data
* Note, it is worth looking at [Content-Addressed-Memory-(CAM)](Content-Addressed-Memory-(CAM).md)
* An example to make this concrete is the world wide web
* We can go to a website and, based on its URI and subdomains, we can split that language and see "this URL is on the same domain name as another, so it is *closer* to this other piece of content". Note that here we have a domain name/URL *mapping* to a particular web page (content). Google/search allowed us to input a given bit of content (ex: "top AI companies in 2021") and have that be used to identify the relevant data
* Wikipedia/functions is closer to wikipedia/france than it is to google/scholar (since the former two both live under the wikipedia domain)
* Part of what fabric is trying to solve is *what is data*?
* Attributes in general, the idea of data itself, don't really have a universal addressable system. There is no standard for that.
* you generally solve this from a CS standpoint by creating languages and grammars, and then parsing and referencing them
* Both the relational model and the document model (JSON, XML) lack a universal reference system, and you can't measure distances between them
* The goal is to move domain knowledge from people's heads into a system where automated code can work on it (and do it in a distributed fashion). You need a language system that is a universal referencing system for things
* We really want fabric to be a global reference system
* What JW really wants is **computable addressability** (a minimal content-addressing sketch follows this list).
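A minimal sketch of what content addressability gives you (illustrative only; the hashing scheme and names are assumptions, not fabric's actual mechanism): the address of a piece of data is derived from its content, so the same data gets the same universal reference no matter what table, file, or name it lives under.

```python
# Illustrative content addressing: identical content -> identical address,
# regardless of where the data lives or what it is called.
import hashlib

def content_address(values):
    h = hashlib.sha256()
    for v in values:
        h.update(repr(v).encode())
    return h.hexdigest()

customer_age = [34, 51, 29]          # column in one system
age_copy_elsewhere = [34, 51, 29]    # same data under a different name/location

assert content_address(customer_age) == content_address(age_copy_elsewhere)
```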
#### What practical problems does fabric solve?
* Practical problems fabric is meant to solve
* being able to trust your data
* where did columns come from? How do you build trust?
* there are many tools that introduce trust in various ways, but they all rely on a data engineering process and on people specifically following a routine when they do their work, so that they track things in a certain system.
* Fabric says: do whatever you want and the system will keep track of itself in its system of record, and it will be interpretable
* in order to build trust you can't just do something that is perfect under its own rules. It has to communicate why it has a certain guarantee. When you go check "why is this all nulls?", you should be able to say "oh, because you did a group by here and there were no unique groups"
* SW failing
* needed to move from table-based thinking to column-based thinking in order to get minimal update patterns for data
* Ex: we had massive semantic engine SQL statements that convert tables into other tables, doing huge groupbys (on BigQuery). But once you ran your whole pipeline, if you simply wanted to convert a datetime to a string at some point, you would need to rerun the whole pipeline.
* it became clear that breaking work apart "per column" was a really useful abstraction (see the sketch after this list)
* Shared communication layer between AI and people (this will be the semantics part of the system)
* All AI is graph search. In order for AI (graph search) to run, it needs some annotated graph that it is working off of somewhere (or some ability to measure annotations off of a graph). Pattern find does this by looking at each node and calling things like get_sub_pats, KS score, etc. Along with that data, we added the idea that everything in these graphs should just have a whole bunch of metadata attached to it (semantics). Right now that language, in the legacy format, was hard coded and bespoke (on the creation side it was just creating dictionaries, and on the reading side it was just looking for the key you know should be there). **That is not enhancing arbitrary AI; that is simply fine tuning an algorithm from the inside**. This is very useful, but it is not great persistence. Fabric wishes to move this state into a database. This way you can go to this record of state and get stuff.
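A minimal sketch of the per-column idea above, under assumed names (`Attribute`, `materialize`) that are not the actual fabric/SW implementation: each attribute is a node in a DAG whose address is derived from its transform and its inputs, so swapping one transform only invalidates that attribute and its descendants rather than forcing a rerun of the whole pipeline.

```python
# Hypothetical per-column (attribute) DAG with minimal recompute.
# Names and structure are illustrative, not the real fabric implementation.
import hashlib

class Attribute:
    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, list(deps)

    def address(self):
        # Compute-addressable id: hash of the transform name plus the inputs' addresses.
        payload = self.fn.__name__ + "|" + "|".join(d.address() for d in self.deps)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

cache = {}  # address -> materialized column

def materialize(attr, source):
    key = attr.address()
    if key in cache:                                  # unchanged lineage: reuse, no recompute
        return cache[key]
    inputs = [materialize(d, source) for d in attr.deps]
    cache[key] = attr.fn(source, *inputs)
    return cache[key]

# Example pipeline: load a column, then derive a formatted column from it.
def load_ts(source):            return source["ts"]
def ts_to_string(source, ts):   return [str(t) for t in ts]

ts = Attribute("ts", load_ts)
ts_str = Attribute("ts_str", ts_to_string, deps=[ts])

source = {"ts": [20211025, 20211026]}
materialize(ts_str, source)     # computes both attributes
materialize(ts_str, source)     # second call is served entirely from cache
```

With this shape of cache, converting a datetime column to a string late in the pipeline changes only that attribute's address (and its downstream addresses); everything else stays cached.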
#### What are we trying to build?
* **An autonomous (AI) system that learns to understand and effectively communicate about all forms of data**.
* By communicate we do not mean "has a clean API". Ideally it will eventually grow into NLP-style, query-action flows. It should be a living system. This is a system that helps people make better decisions.
* We can keep tackling a pattern find here, a primary key find there, a feature find there. And we can build up a process for each one, develop products and apps on it. But fundamentally fabric is solving the shared component that they all have. They all need an AI system or a compute system that understands data and effectively communicates about it.
* The biggest difference between the above and standard DB technologies is that fabric is not meant to be the source of the data. It is not meant to be your master record of the data. It is meant to be your meta-heuristic understanding of the data. It is meant to be your metadata world. It lets you communicate about the data for each of your systems.
* On a company scale, fabric is setting up all of the solutions that we want to build.
* One way to describe this:
* pattern find was our killer app. It worked and we started to deliver results on it. Great. That is one of tens or more apps that we plan to build. These apps need to be built on a platform underneath them. **Fabric is meant to be the platform that lets all of the apps work together, and in all cases, have certain guarantees for our ability to load in the data, to trust and universally reference the same concepts**.
#### Two types of AI
* History time! There are two different historical AI paths: symbolist and connectionist
* Symbolist
* computer languages and functional programming came from here. This eventually turned into rule-based systems and ontologies
* Universal principle that can be felt: They are graph algorithms (discrete graph algorithms). They walk edges via actions to new nodes. They do not move through continuous space.
* Fabric wishes to capture: By representing your data on a complex enough graph, you allow yourself to ask and query that graph for contextual distance
* Given the context of hair color, how close is any person to any other given person (note: seems like we are defining a contextual *distance metric*). This type of question is effectively answered by these types of graphs. For specific contexts they give you efficient algorithms to find nearest neighbors, not nearest neighbors, etc
* For database systems: Data itself has a lot of structure and a lot of rule based things you can do and measure contextual distances between.
* the raw data models of fabric are meant to do that. I.e. encodings, the attribute graph, the word perspective, etc., are meant to capture the rule-based walking of the graph. If I give you any two attributes, you can say: "In *language space* you can measure a distance. You can measure that distance contextually as well." Fabric solves this via its data model, its representation.
* **It solves it by representing the information about computable graphs of data, and allowing you to content address into that graph, for any two nodes, and then walk through (also content addressable) nodes, to any other one, for contextual distance.** You can then answer what is the shortest path, the longest path, the shortest path out of known things, etc. (see the sketch at the end of this section)
* Connectionist
* floating point, embedding world. *Smooth topologies of spaces*. This is the other half of the world, where most of current AI happens. You convert your text into tokenized space and then you run linear/near-linear transforms (purposefully not raw linear, also purposefully close to linear, see [Purposefully-linear-purposefully-not-linear](Purposefully-linear-purposefully-not-linear.md)).
* So, fabric lets you represent, in its compute space, the ability to compute things like embeddings and things like that
* Now things start to get *meta*. There exists a perspective in fabric that is the *attribute perspective*: every row is an attribute in fabric itself. On that perspective there is stuff such as "what was the task that was used to build this attribute?" That would be a column. What are measurements off of that? For instance, for each attribute go get its unique count. This would be a column. You can start treating even the attributes in fabric as data, just as you would normally treat data, and that same system allows you to do both symbolic and connectionist work (such as building up an embedding on your data) *on the attributes themselves*.
* This is the marriage between the *symbolic* and *connectionist* approaches: we can then write *embeddings* on *language space*. Note: here I believe language space refers to the task graph (it is a language that can be used to build up a set of features, similar to how SQL, a language, can also be used to build up a set of features; this is why JW refers to tasks as the "basis" of the language). Note he responded with: Yes, by "language space" I mean tasks. It's the case that you can take SQL and directly translate it to fabric graphs and back, etc. It's a language in the sense that logical plans are languages (e.g. the formal definition of relational stuff is relational algebra, which is used in the "logical plan"). Fabric tasks are our "logical plan", directly mappable to language.
* Note that graph embeddings are a similar phenomenon, *but* graph embeddings take you back out; in a sense, you have the data modeling problem. You don't have a system that is self referential for arbitrary perspectives doing that. Instead you have to put a pin down in your language and say "this is my perfect concept of what a row is".
* Fabric is a system that addresses regardless of your choice! So even if the attribute perspective wasn't the perfect perspective to build those embeddings on for the graph to annotate correctly, you could make a new thing that is some aggregation of it, or filter the perspective down, or make a perspective near it. You get to *work through the data model space, build up embeddings on those new arbitrary data models, apply them back to your graph, which then changes your data model, which then might change how you choose to represent it!*
* THAT is the morphic system that, just by sitting there, churns away on reducing complexity
* Note that there are metrics for the whole system to optimize toward. For instance: represent as much as you can (as many concepts as you can) while using as simple a language as possible (think [Occam's Razor](Occam's%20Razor.md)). These patterns show up in compilers as well. Because fabric keeps this universal addressability, it can even address itself. In that universal addressability of referencing itself, it can even change its shape and still continue to reference itself (note: this is kind of nuts). This is where the two worlds of AI start colliding. At this point you can write graph search things, you can write embedding generation things, you can gradient descent on a program. It is a different take; it all runs through the idea that we are addressing a computational landscape on a graph, and that landscape is measured. For more see [Content-Addressability-in-Fabric](Content-Addressability-in-Fabric.md)
* Note that all programs are enumerable (see the Turing proof). A good example of a system that is content addressable is git. It represents all of its data (all of the code) as a unique enumerable thing. It then builds up the graph of how those things relate to each other, in order to give you a contextual interface, like a language, that lets you measure (once you can enumerate things and also have a graph, you can walk the contextual distance; so you can say what is my diff right before, what is my diff right after). This is an example of where *content addressability on an annotated graph gives you capabilities*.
* The Fabric graph's edges (the contextual distances/differences) are not just a history of record ([Merkle-Tree](Merkle-Tree.md) type stuff). They are functional things; you can take this reference and turn it into this other reference, via language code. And it's not just a fixed language (it is an open language), it is the *task language* (which is just arbitrary data structures with code, UDFs, processors, etc). That means that your content address (for contextual distances that you measure) includes information about the compute you run.
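A toy model in the spirit of the git example above (illustrative; this is not git's real object format, just the "hash content, link nodes, walk edges" idea): content gives you an address, parent links give you an annotated graph, and contextual distance is how many edges you walk between two addresses.

```python
# Toy content-addressed history graph, loosely in the spirit of git.
# Not git's actual object model; just "hash -> node -> annotated, walkable graph".
import hashlib

def put(store, content, parents=()):
    addr = hashlib.sha256((content + "".join(parents)).encode()).hexdigest()[:10]
    store[addr] = {"content": content, "parents": list(parents)}
    return addr

store = {}
c1 = put(store, "initial schema")
c2 = put(store, "add age column", parents=[c1])
c3 = put(store, "add state column", parents=[c2])

def walk_distance(store, start, target):
    """Contextual distance: edges walked back through parents from start to target."""
    depth, frontier = 0, [start]
    while frontier:
        if target in frontier:
            return depth
        frontier = [p for a in frontier for p in store[a]["parents"]]
        depth += 1
    return -1  # not reachable

print(walk_distance(store, c3, c1))  # 2: two diffs back in history
```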
#### Fabric as a language
* Fabric itself is a language (we don't have the grammar and the de-serialization yet; right now it is only the DAG and computable...). This is one of those Turing machine vs. state machine type boundaries.
* SQL is a language where:
* you can write the string (so it is enumerable)
* And you can parse into an abstract syntax tree (AST)
* this means that given two SQL queries you could measure a contextual AST distance between the two
* But that distance is not tightly coupled to the information flow in that system. It is more tightly coupled to the abstractions that the person who wrote it made.
* So, directly off of SQL you don't get good distance metrics between two different queries
* What you really want is to compile the SQL from its AST to its logical plan, and then do distance metrics on the logical plan.
* These are contextual distances between components of a language
* Fabric's language itself sits in the space that, JW proposes, has the most interesting contextual distance: the causal, relational one.
* the contextual relation that tasks have in fabric is causal dependence
* We decompose and keep the language pinned at the causal layer, rather than putting the language at a compiled level above it.
* The point is, when you measure the distance between any two attributes in fabric, using naive walking distance via the task graph (the language one), it captures and will find (in its embedding and nearest neighbors and things) causally related things.
* Like `age of customer` compared to `groupby(state).age.mean()`: the age column is directly in its lineage, so that attribute's data is *causally* or *informationally* related to the other computed value (see the sketch after this list).
* Keeping those relationships close is important, not mental-model relationships. Mental models are things that change and are flexible (anyone can do anything they want with a mental model), but causal history is the closest thing to the truth of reality for a contextual metric that JW can think of.
* So the principle is: *Attach the language to as close to reality as possible*.
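A minimal sketch of the causal-distance claim above (the node names and `lineage` structure are assumptions, not fabric's real task graph): if edges mean "computed from", then naive walking distance already places `age of customer` right next to `groupby(state).age.mean()`.

```python
# Illustrative task-lineage graph: an edge means "this attribute feeds that computation".
from collections import deque

lineage = {
    "customer": ["customer.age", "customer.state", "customer.name"],
    "customer.age": ["groupby(state).age.mean()"],
    "customer.state": ["groupby(state).age.mean()"],
}

def causal_distance(a, b):
    """Naive undirected walking distance over the lineage graph."""
    adj = {}
    for src, dsts in lineage.items():
        for dst in dsts:
            adj.setdefault(src, set()).add(dst)
            adj.setdefault(dst, set()).add(src)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return float("inf")

print(causal_distance("customer.age", "groupby(state).age.mean()"))   # 1: directly in its lineage
print(causal_distance("customer.name", "groupby(state).age.mean()"))  # 3: only related via the shared source
```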
#### Fabric as a metadata layer
* The underlying thing captured in data that represents reality is information stored in the form of shape, symmetries, and topology. The idea is that data has shape, and the shape is the truth, not the values of the data. The actual truth kind of exists in the data, but it exists as a shape phenomenon. That is where truth is.
* Consider, for example, a constant value in fabric of `1`. This is a very fixed concept, very pinpointed. You can load this number in, it is a constant. It has a UID, it is always going to be the same UID, and it is addressable.
	* Now take a compute flow that loads in an arbitrary column, groups it by itself, and then counts the number of rows in each group. That is an attribute, and you can calculate the max of it. If that max is equal, in content addressability space (_not in the attribute UID space_, since the attribute UID space is our compute-addressable space, i.e. take this column, do that math, do this thing), to the content of other attributes in the graph (in this case, the number `1`), that is a statement about the shape of the data.
	* We know in "language space" that that statement is "that data was a primary key", or "it's all unique". We have many ways to phrase it. But in the shape world, or information-graph world, it is that content-hash(attribute 1) == content-hash(attribute 2). That is the symmetry statement about information, in language space. So you can decompose it into language query 1 == language query 2. That either reduces (because the language can reduce the content-hash function via application through your language; it just flows through and you can say "that is tautologically true in the language"), _or_, when it is not the language, it must be a symmetry in your data. It must be a statement about information in your data, and you can write it as a statement in language. We are actually just saying: "In our language, A is the same thing as B".
	* That moving of information from the shape space of data into language space lets us start playing explainability games. How would you interpret this? How would you explain it? You do "language translations". This is a different game from converting between language space and data space. (A minimal sketch of this follows this list.)
* Strong statement: the metadata layer can absorb all of the information in a data set if it is doing its job well. Think compression algorithms, or Shazam: all of these systems move some component of the data into a language space such that it is compressed. If you can express it in a language space you can attach an algorithm to it; if you can attach an algorithm to it you can move that algorithm around (and in general the algorithm is a smaller piece of data than the data itself). If you can then start making statements about symmetry classes in it, you start fully representing the data in language, and then you can play reductions on your language to get to the most informationally dense statements. What are the statements that move the most information out of your data and into language space? Then translate those into English. That is "learn to understand the data" (represent it, then go find a whole bunch of symmetries in it); then, to effectively communicate, translate those symmetries you found back to the user.
* These two things are the backbone of all of these different algorithms (strong statement). Even a KS score is finding: "subgroup A of the KPI is far from subgroup B". That is a strong statement in language about bias in the data. It shows up in the graph as: KS attribute 1 compared to KS attribute 2 is not equal. It goes against the assumption that, if the data were uniformly sampled, they would be equal. So, we move that into language space.
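A minimal sketch of the shape-to-language move described above (hypothetical; the content-hash mechanics here are simplified stand-ins): group a column by itself, take the max group size, and if that value content-hashes to the same address as the constant `1`, the language-space statement "this attribute is all unique / a primary key" falls out of a pure shape comparison.

```python
# Illustrative: turning a shape fact about data into a language-space statement.
import hashlib
from collections import Counter

def content_hash(value):
    return hashlib.sha256(repr(value).encode()).hexdigest()[:12]

CONSTANT_ONE = content_hash(1)  # the addressable constant `1`

def max_group_size(column):
    # group the column by itself and count the rows in each group
    return max(Counter(column).values())

order_id = [101, 102, 103, 104]
state = ["CO", "CO", "CA", "NY"]

# Shape statement: content-hash(max group size of column) == content-hash(1)
# Language statement: "this attribute is all unique / a primary key candidate".
print(content_hash(max_group_size(order_id)) == CONSTANT_ONE)  # True
print(content_hash(max_group_size(state)) == CONSTANT_ONE)     # False
```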
---
Cool stuff
* visualize ERD
* Attribute perspective
* Conversations with Justin about the long-term vision to build on
* As you work with it, being able to quickly query for all sorts of interesting things
---
References
* [Chat 1 with JW](https://unsupervised.zoom.us/rec/play/7XMnYlB7f4mKRUtnNjBSy-_1Re3YaboAemDK1f9f_Y1KnElb3sb5rIJCQGajXc54nEP1yMS1FLd_6tMa.scJLlQjRSMUf2LQy?autoplay=true&startTime=1619028422000&_x_zm_rtaid=kIoLO-a2TXm5qk3ExQRuIg.1619436444353.358bb1a6b5bc9293cd868ed81e6f5010&_x_zm_rhtaid=484)
#### Transforming between problem representation spaces
JW strongly agrees with this. Think:
* u substitution
* circuits: you can't solve a nonlinear equation with sines and cosines, so you drop in and assume an omega term and you do a Laplace transform (Fourier transform). Then you solve via algebra in a constrained system space and then you convert back
* In all of these we are converting from one language to another language, so in language space you can then do reductions
* So it is about allowing us to convert to languages and to move fluidly through arbitrary languages (not just computer languages), which allow you to solve problems differently
* Note that every language is essentially a metric (a distance metric; yes, this is a bit of a stretch). You take a language and build the AST. You take another object in that language and build its AST. You can measure the graph distance between those two things. You can write a graph edit distance metric on it. You can do an edit distance on the string form of a language. Given a language to represent something, you can just measure distances between instances of it. Choice of language means that you can choose the language such that distances are contextually meaningful. If those contextually meaningful distances are attached to a language itself, you can use language to walk through arbitrary language translations (see the sketch after this list).
* Imagine that we can then do _many_ translations, frequently changing our problem representation space. We are solving many problems that are nearly impossible in one space via converting to another space, then converting back.
* One example of an underlying symmetry you can extract from your data:
* these two things look like they should causally be the same
* But they are not
* That is an underlying symmetry that you have extracted from the data
* you get this guarantee of being able to be sure about that because of how precise your language is in how it tracks information flow
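A crude sketch of the "every language is essentially a metric" point above (illustrative; a real system would prefer AST or logical-plan edit distance, as noted earlier): even a string edit distance over two expressions in the same language gives you a usable, if representation-dependent, notion of distance.

```python
# Crude illustration: any language representation immediately gives you a distance.
# Real systems would prefer AST / logical-plan edit distance over raw strings.
from difflib import SequenceMatcher

def string_distance(a, b):
    """1 - similarity ratio of the string forms: 0.0 means identical."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

q1 = "SELECT age FROM customer"
q2 = "SELECT AVG(age) FROM customer GROUP BY state"
q3 = "DELETE FROM products WHERE price < 0"

print(string_distance(q1, q2))  # smaller: shares a table and a column
print(string_distance(q1, q3))  # larger: little shared structure
```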
#### Mental models
* DAG graph lineage view
* tend to imagine it as a hyper dimensional graph, but it is reduced to contextual dimensions
* imagine it is in 100d. Any time we are visualizing it we project it down to "oh I am going to measure depth as my x axis and separation between the two as the y axis"
* Avoid thinking in terms of blocky, table mental models
* want to think in terms of smooth graphs, embeddings, much more richness
* We lose the richness of attributes if we go too close to the table based view
#### Fabric is a data structure that we can build atop
* Fabric is built and adopted as the way to manage all of your data
* We then keep throwing out new apps/verticals
* some will live, some will die
* we will have 10s to 100s of these things
---
References:
* [Convo with JW](https://unsupervised.zoom.us/rec/play/7XMnYlB7f4mKRUtnNjBSy-_1Re3YaboAemDK1f9f_Y1KnElb3sb5rIJCQGajXc54nEP1yMS1FLd_6tMa.scJLlQjRSMUf2LQy?autoplay=true&startTime=1619028422000&_x_zm_rtaid=szFd7sK2Qyi62V-5d1ZA3Q.1619615572338.e3423afb95a770a530e06b0261fafa93&_x_zm_rhtaid=243)
---
Date: 20211025
Links to: [Semantic Fabric MOC](Semantic%20Fabric%20MOC.md) [Content-Addressable-Systems](Content-Addressable-Systems.md) [Content-Addressability-in-Fabric](Content-Addressability-in-Fabric.md) [Computable-Addressability](Computable-Addressability.md)
Tags:
References:
* []()