# Fabric Theory Questions

### Fabric Investigation TODO

* how is the lineage created/tracked?
* given an attribute/task, how do we determine the lineage (under the hood)?

### Fabric Implementation Questions

* Why does the `View` `get` method return a `View` or `Perspective`? Is that coupling them together? (see [here](https://github.com/Unsupervisedcom/code/blob/master/semantic_fabric/semantic_fabric/view.py#L615))
* What is the `TaskClient`?
* Levels of abstraction, from highest to lowest?
  * `View`
  * `CRUD api`
  * `SQL alchemy API`

### Bryce answers

* Basis of the language
  * everything is a graph
  * nodes are tasks
  * edges are compute/transformations or a list of no-ops
  * a task takes you from one node to another
  * nodes are components
* What is a feature? (see the sketch after these notes)
  * a function
  * its domain is some subset of integers
  * its range is some set
  * that set is specified as an encoding
  * an encoding is a set of values
  * a set is a collection of objects; you can specify them
    * a collection of ids, names, addresses, numbers between 1 and 10
    * when we use them they are very specific sets
  * if the encoding is a list of first names
    * then a feature is a way of assigning integers to first names
    * i.e. idx = 5 -> name = bob (an index is mapped to a value)
    * no requirement that it is a 1-to-1 mapping
* So, a perspective is...
  * features are at a perspective if all their domains are determined by it
  * a feature takes some numeric values (the index) and assigns values, e.g. of a transaction
  * if you say it is at the perspective of transaction id, then the index is determined by the specific id
  * so: transaction id -> hashed -> that hash is the index value (an integer)
  * you have transaction id (a feature), but you are making it so that transaction id is in 1-to-1, bijective correspondence with the index
  * you are really saying that a perspective is a feature that is a bijective function
  * all features are functions
  * you could think of a perspective as an injective function: some subset of integers maps to some set of values
  * this transaction id is the primary key for this table
* What is a view?
  * a collection of features with a common perspective
  * a convenient tool to get comfortable with fabric
  * a table
  * when at this perspective, you are saying this is the primary key for the table
  * indices are the way of gluing the table back together
* What happens when we load data into fabric?
* What is a perspective?
  * I see this idea of a "primary-key encoding tuple"
  * the perspective is the grain of the data
  * in fabric, everything is at a perspective
* A table no longer exists?
  * we can build a table if we have a collection of attributes that all share the same perspective
* Still powerful and valuable now
  * will allow us to redefine how data works
  * good organization of data sources
  * good management of metadata
  * a great base to build tons and tons of tools on
  * we built PF; this is valuable
  * shortly after, we were like: "hmmmm, we need a foundation for PF that is distributed"
* Distributed pattern find
  * we then work on the dataset to do stuff
  * can't do anything with PF that doesn't fit into memory
  * still have a weighted graph search
  * can find interesting parts in the data
  * pattern space is expanding

> Hey sorry, I got busy and forgot to respond to this yesterday. If we want to talk about perspective in terms of things that pandas users are familiar with, then it really is just an index, and setting a perspective is setting an index. Having the unknown-unknown perspective is like having a dataframe with no index (or the default).
>
> I think going into the details of exactly what it is might not be necessary, unless you want to explain exactly what a feature is in terms of key-value stores and stuff like that, and how you can set the keys on features by hashing values in columns, etc.
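To make the analogy concrete, here is a minimal pandas sketch of the two ideas above: a perspective as an index, and a feature as a key-value mapping whose keys are hashed perspective values. This is not fabric's API; Python's `hash` stands in for whatever hashing `unique_on` actually uses, and the column names just reuse the example below.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["a1", "b2", "c3"],
    "customer_state": ["CO", "CO", "NY"],
})

# Setting a perspective is like setting an index: customer_id is in
# bijective correspondence with the rows, so it can act as a perspective.
df = df.set_index("customer_id")

# A feature in key-value terms: an integer key (here, a hash of the
# perspective value) mapped to a value drawn from the encoding.
state = {hash(cid): st for cid, st in zip(df.index, df["customer_state"])}

# The unknown-unknown perspective is like a dataframe with no index
# (or the default RangeIndex):
df_unknown = df.reset_index()
```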
* perspectives could be surjective, but in general they are not. e.g. if the encoding type is dollars spent, then the feature will be a finite list of dollars spent, but there is never a case where you have every possible float as a value in a single feature, because then your feature would be infinitely large.
* ~~You should think of the encoding as the set of all possible values the feature could take, and the feature as selecting some, or all, of those values, but not necessarily all.~~
* ~~in the above example the `customer_id` is the perspective / primary key because it is in bijective correspondence with the rows, i.e. each row is uniquely determined by a customer id, whereas the state does not have this property.~~
* state is a feature that is at the perspective of customer id. What that means is that for each row there is a state value, and you can determine that row by selecting a customer id. That does not mean that there needs to be a 1-1 correspondence between customer id and state.
* in this case the 'index' is really some hidden value that is basically a 'row', and the values in the feature `customer_id` are in 1-1 correspondence with that hidden 'index', but the values in `customer_state` are not. So you can say that `customer_id` is / could be a primary key / perspective, but `customer_state` is not. If you want to set `customer_id` as a perspective, then you hash the values in the column and make the hidden 'index' those values (that is what `unique_on` is doing under the hood). If you tried to do that with `customer_state`, you would end up with fewer rows, but rows that were uniquely determined by a value in the state column.

### David answer

* Fabric composes two external services together
  * the DB and the dask cluster
* "What does the API of the constructor look like?"
  * taking information from the DB and using it to do things in the cluster
  * can also be the other way around
    * load: the cluster fetches info about partitions, then creates entries in the DB to represent the load operation
    * measurement -> result cached in the DB (the entity related to it will cache the result)
* Fabric uses API classes
  * `get_implementation()`
  * we want people to be able to customize internal implementations
  * so you can import `ViewApi`, make a subclass, and fabric will use that definition of the API instead of the default (see the sketch after this list)
* `FabricApi` is an object that has a reference to the fabric instance
  * implemented due to the goal of customization
  * wanted to split out low-level internal logic
  * `ViewApi` is the high-level API for people to work with
  * the Crud API is lower level (providing lower-level DB methods)
  * creating a subclass of a fabric api allows us to override parent methods
  * it's called a `FabricApi` because it is extending Fabric in some way (so it needs an instance of a Fabric)
* David's opinions
  * use as little magic as possible
    * using lesser-known features of python makes it harder to predict outcomes
    * someone may be unaware of certain features and possible side effects
  * dynamically loading code
    * we have a large fabric task library
    * certain engines may have certain tasks
    * discovered at run time
    * using core principles of the language, can you tell what it is doing? What effect may result?
  * low-level DB stuff
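A minimal sketch of the customization pattern David describes. `FabricApi` and `ViewApi` are names from the notes; the registry mechanics of `get_implementation()` and the `views` attribute are assumptions for illustration, not fabric's actual implementation.

```python
class FabricApi:
    """Base API class; holds a reference to the fabric instance."""

    def __init__(self, fabric):
        self.fabric = fabric


class ViewApi(FabricApi):
    """High-level default implementation users work with."""

    def get(self, name):
        # assumes a fabric object exposing a `views` dict (hypothetical)
        return self.fabric.views[name]


# Hypothetical registry: fabric looks implementations up here instead
# of hard-coding the defaults.
_registry = {ViewApi: ViewApi}

def register(base, override):
    """Tell fabric to use a custom subclass instead of the default."""
    _registry[base] = override

def get_implementation(base):
    return _registry[base]


# A user customizes the internals by subclassing and registering:
class LoggingViewApi(ViewApi):
    def get(self, name):
        print(f"fetching view {name}")
        return super().get(name)

register(ViewApi, LoggingViewApi)
assert get_implementation(ViewApi) is LoggingViewApi
```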
> As for the fabric vision doc stuff, I wanted to offer a few more comments. I believe that the what and why of fabric are actually very simple...but from a technical viewpoint. For example, I think you can sum up what fabric is doing and why in the following description:
> * **What** - Fabric is a toolkit for representing possible analytics pipelines (features) over data sources as DAGs. It provides a high-level API for building those DAGs and also wants the DAGs to be amenable to algorithmic analysis.
> * **Why** - A lot of things that data scientists do are readily represented as algorithms. For example, using knowledge about the semantics of data (this column is an integer but is also an account identifier) to make decisions about what kinds of operations actually make sense. Also, recognizing that there's often a general pattern of grouping many different columns by a common grouper column and then doing whatever aggregations are appropriate. Those types of things are somewhat mechanical and can give us information that can then be further analyzed by more sophisticated algorithms like pattern find or the UMAP embedding that Tom did recently. Fabric puts pipelines in DAGs because it sets you up nicely to do that kind of analysis. And we move away from data scientists spending a lot of time on mechanistic work that is often repeated without much thought or creativity.
>
> So there. In my mind, that reasonably sums up the important stuff about fabric. Now, the challenge is translating this into something that non-technical people can understand and **_also_** describing everything that's implied by those simple core concepts (which can maybe be thought of as more thoroughly spelling out the "why?" question). It can be hard for non-technical people to wrap their heads around the meta of finding ways to represent things computationally and turning the algorithmic spyglass on real-world things that people do.
>
> Also, assuming we get the core concepts of fabric ironed out and clean, there's a whole world of crazy possibility that opens up. It can be a bit overwhelming. There's more straightforward stuff like building an interface to annotate the DAGs that fabric uses with feedback from customers about how "useful" certain features seem. And then we have algorithms that analyze that feedback and make decisions about how to automatically extend the pipeline (build more features). And it gets very deep and abstract too. [@Justin Waugh](https://unsuper.slack.com/team/U975HFZ2L) did the talk that made interesting connections with the history of AI and the whole dichotomy of connectionists vs. symbolists in the literature. It could be that fabric has a story to tell there about how the connectionist/symbolist thing is sort of a false dichotomy. All the possibility makes you feel like we're really onto something. But it can also be a distraction.
>
> I'm going to see if I can spend more time writing down thoughts like these and really zeroing in on a simple description of fabric that is centered around how it's trying to make life easier for the people that interact with it (data scientists and customers). I think I'll try to take the approach I sketched out above that starts with a technical description and then expands on that with "user story"-like examples.

See more [here](https://unsuper.slack.com/archives/G01E0J7EMSS/p1614382240101000).

### Jade answer

* a database management system
* a giant free-form columnset
* by setting an encoding, it creates new Views, i.e. this view is related by columns that are related by this encoding
* no need to join tables together; it pulls them together for you
* "I am interested in shipping date" -- then it goes out, finds anything that has a shipping date, and pulls it in (see the sketch at the end of this list)
* we go from a flat map to a 3-d topological map
* semantic, by definition
  * relating to meaning in language or logic
  * relates to the meaning of the thing logically
* aim
  * a structural, interconnected, logical representation of the meaning of the column
* some SQL DBs do this, kind of; the goal of fabric is to provide a clear lineage
* we no longer want people to need to think about how we join things together
* benefits of a graph-based system vs. relational
  * lineage (in the early days), DAG
  * knowledge graph (related conceptually)
  * recommendations of what could be brought in
  * graph embeddings
  * self service
    * currently, for a company wanting to do self service, the company will need slick data scientists
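A hypothetical sketch of the "it pulls them together for you" behavior: columns are tagged with encodings, and a view is assembled from every column sharing the encoding you ask about. None of these names come from fabric; they only illustrate the idea.

```python
# Hypothetical: a catalog of columns tagged with their encodings.
columns = {
    "orders.ship_date": "shipping_date",
    "returns.shipped_on": "shipping_date",
    "orders.total": "dollars_spent",
}

def view_for(encoding):
    """Pull in anything that has this encoding -- no manual joins."""
    return [col for col, enc in columns.items() if enc == encoding]

print(view_for("shipping_date"))  # ['orders.ship_date', 'returns.shipped_on']
```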
### Matt answer

* we are not there
  * if a timezone is not a timestamp, it breaks fabric
* a place for using pandas (pandas thinking)
  * come in, treat it like pandas, and be more or less productive
* encodings
  * too abstracted away
  * a postal code can be mapped to a state name
  * an argument for weave
  * used to define a perspective
  * a perspective is an encoding!!!!!
* do DS need to understand the model?
* graph-world-thinking paradigm shift?
  * here is what you do in pandas, here is how to do it in fabric
  * using a different model for perspectives and encodings, etc.
  * think of perspectives as a folder
    * don't directly auto-populate
    * if I load something in, I am sorting the columns into the right folders
    * the perspective is an attribute of the column
    * a perspective is a folder, a tag, i.e. a way to group columns
    * a view is what I am specifically curating from stuff in the folder

### Tom answer

* What is the purpose of the encodings graph and ontology?
  * to support the fabric task library
  * what are the tasks available to me for this attribute?
  * we are at a node in the dag/task graph; without touching data, how can we determine what other operations we can take from here?
  * given a state name, we know we can't apply mean
  * we only need to look at the encoding name!! (the metadata)
  * we don't need to touch the data!
  * the task library has each task annotated with valid encodings (see the sketch after this list)
  * the main way for us to avoid issues with pandas
  * this is a type system
    * it can allow us to be confident about inputs and outputs
  * currently, the encodings graph gives you a very binary answer to whether or not you can do a task
  * fabric could know that we've got a lat/long pair
  * the encoding is telling me that I can move to a state (but should you?)
  * in a world where we are automatically generating features
    * we could have a measurement and resolution
* Without seeing the end game, how do you know the right path?
  * anchor on the strategic goal, the strategic need we are trying to solve
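A hypothetical sketch of the type-system idea: each task is annotated with the encodings it accepts, so applicability is decided from the encoding name alone, without touching data. The task and encoding names here are made up for illustration.

```python
# Hypothetical task library: each task annotated with valid encodings.
TASK_LIBRARY = {
    "mean": {"float64", "int64"},
    "mode": {"us_state", "first_name", "float64", "int64"},
}

def valid_tasks(encoding):
    """Answer 'what tasks are available for this attribute?' from metadata alone."""
    return [task for task, ok in TASK_LIBRARY.items() if encoding in ok]

print(valid_tasks("us_state"))  # ['mode'] -- mean is ruled out by the encoding name alone
```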
* Why is a graph-based world better than relational?
  * the fabric task graph is an abstraction of operations on data
  * operations have intrinsic order and dependencies; you have described a dag whether or not you intended to
  * the task graph is a metadata layer
  * the data layer has the compute graph (think of a dask graph)
  * it allows you to compile and optimize them more readily
    * all programming languages can be expressed as a dag
    * when you write something it is effectively procedural, so you have procedures that go in order
    * a compiler can say that these two points of code are independent of each other
    * optimizing compute is important for that
  * an attribute maps to a column
  * in the engine world you are carrying around many extra columns that may not be worth operating on
  * semantic warehouse vs. fabric: fabric has more to do with the interface than the underlying abstraction
  * currently, loading stuff into fabric, we have `unique_on`
* Does it require rethinking by end users?
  * on some level that is the case
  * graphs are more complicated
  * ideally we would like to write an API that minimizes the amount of thought
  * there will be a big mental shift
    * the biggest shift will be in the realm of perspectives
    * thinking in the realm of a large number of attributes
* Attribute space
  * is meant as a projection from the task dag to some other space
  * could be a metric space (embeddings)
  * more in line with markov models
  * we pick the space to project into
  * the projection is a dimensionality reduction
  * causal inference models, mutual information graphs, Bayes nets
  * write an equivalent to pattern find, but one that does a greedy search over the space of models
* Where do semantics fit in?
  * the key differentiator!
  * semantics are ways that people can annotate the task graph with subjective information
  * encoding information contains objective truth that everyone could agree on (if you can't do this, it's not a good candidate for an encoding)
  * semantics can answer: should you?
  * object model in the DB -> a semantic is a question-and-answer pair (we have it fixed to be the same person)
  * the expectations API is a good example of this
    * we annotate this on the task graph as semantics (a helper to read and write semantics off of the task graph)
    * a rule-based half (if I write a semantic for expectations, it needs rules and validation; it will parse your answer and say yes, that is valid, or no, that is not)
    * totally arbitrary string in, json out
  * the vision here is that this is the interface through which people communicate with the AI via natural language
  * a **constraint** and an additive thing!
* Views vs. perspectives
  * the operational definition of a view is a collection of attributes (tasks) that share the same perspective
  * in some sense we could represent a dataframe; you can't have a df where two columns represent different indices
  * if we think of a view in this way, then a perspective is itself a view!
  * the unknown-unknown perspective is a "staging" area
* API hierarchy (low to high)
  * `View`
  * `CRUD api`
  * `SQL alchemy API`
  * `EncodingsAPI`
  * `TaskAPI`
  * `Engines`
  * `FabricAPI`-based classes
    * `Crud`
    * `Query`
    * `Expectation`
* no knowledge graph/knowledge DB?
  * no knowledge graph!
  * they require maintenance and reevaluation
* Learning
  * `Pyarrow`
  * `Dask`
  * `Fabric`
  * `Probabilistic Data Structures`
  * `Legacy Codebase`
  * `Semantics`
  * `Encodings`
  * `Ontologies`
  * `SQL Alchemy`
* Teaching people about data in fabric
* Learning
  * no single source
  * I consume and learn cutting-edge material in an all-over-the-place fashion
  * research and learning -> self-directed exploration that is done personally
  * on YouTube, Yannic Kilcher is really good (machine learning niche)
  * computer science
  * cognitive neuroscience (working models of the brain)
  * information theory
  * scalable algorithms
* Physics backgrounds
  * the Fourier transform and Laplace transform are successful examples of basis functions
    * very applicable in quantum mechanics
  * linear algebra
  * favorite topics -> group theory (3b1b)
  * statistical physics comes up a lot; computer science and information theory use the same exact concepts -> really cool, they illustrate different sides of the same coin. The Ising model is built on a lattice; if you tweak it, it starts to look like other problems that are not physics-esque at all.
  * Markov models are super common in statistical physics as well -> you can get really wild, complex behavior from markov models to simulate and analyze complex systems

### BRYCE

* MapTo
  * a MapTo is a change of index
  * a feature is a key-value pair, mapping an index to a value
  * a MapTo changes that index
  * the index is a hashed version of the different values in that column
  * this is like composition
* Aggregation (see the sketch after this list)
  * is a function
  * to aggregate you need to have a nested thing
  * you group by
  * it turns a list of values into a single value (the list is the nesting)
  * rather than a feature that is a key-value pair, the feature now has values that are each a list of numbers
  * so, it is a function that reduces the nesting
  * it takes a feature of features and produces a feature of values
  * it eats a feature and gives a single output value
  * groupby -> take two features, turn them into one feature where all the values are subset lists
    * each group is a single row
    * each value is whatever value is in that subset of the data
    * the subset is a feature
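A hypothetical sketch of MapTo and aggregation in plain Python, following Bryce's key-value description: a MapTo re-keys a feature onto a new index, the groupby produces a nested feature whose values are lists, and the aggregation reduces that nesting. `hash` again stands in for the real hashing scheme.

```python
# A feature at the transaction-id perspective: hashed index -> value.
amounts = {hash("txn-1"): 10.0, hash("txn-2"): 5.0, hash("txn-3"): 7.5}
txn_customer = {hash("txn-1"): "c1", hash("txn-2"): "c1", hash("txn-3"): "c2"}

# MapTo (a change of index, like composition): re-key `amounts` from the
# transaction-id perspective to the customer-id perspective. The groupby
# produces a nested feature whose values are lists.
nested = {}
for idx, value in amounts.items():
    nested.setdefault(hash(txn_customer[idx]), []).append(value)

# Aggregation reduces the nesting: a feature of lists becomes a feature
# of single values.
totals = {idx: sum(values) for idx, values in nested.items()}
print(totals)  # {hash('c1'): 15.0, hash('c2'): 7.5}
```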
### Encodings question

* set encodings simply means derived encodings
  * that column is likely a join key (pk or fk)
  * you could set these encodings on load
  * the casting function in find encoding pairs says "you could get here if you casted"
  * so it provides a suggestion
* at the start you would like to declare encodings up front
  * changing an encoding can get you into difficult situations
  * if you derive an encoding and do groupbys, maptos, perspectives, etc., then changing it may invalidate a ton of stuff in the DB
* you are always allowed to cast to a new encoding via astype-ing
  * in theory you can't get into a bad state where the encoding doesn't match up with the data
  * we know what comes in: unknown-string or unknown-float64
  * if you get more specific, you can say this is a US zip code string (it can only take a certain form)
  * in that scenario, moving from an unknown to a known encoding, we need to use astype methods
  * these come in two forms, lossy and lossless (see the sketch below)
    * lossy: you can't map the set of values from the first encoding to the second without losing info
      * e.g. anything that doesn't meet the validation pattern gets filtered out and replaced with null
    * lossless: you can always do int -> string
  * the user shouldn't be able to get into a bad spot
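A minimal sketch of the two astype forms, assuming a regex-validated US-zip-code encoding; the function names are hypothetical.

```python
import re

ZIP_PATTERN = re.compile(r"^\d{5}$")

def astype_us_zip(values):
    """Lossy cast: values that fail validation are replaced with null."""
    return [v if ZIP_PATTERN.match(v) else None for v in values]

def astype_string(values):
    """Lossless cast: int -> string never loses information."""
    return [str(v) for v in values]

print(astype_us_zip(["80302", "8030", "12345"]))  # ['80302', None, '12345']
print(astype_string([1, 2, 3]))                   # ['1', '2', '3']
```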
### Lazy vs. immediate evaluation

* all metadata-layer stuff happens immediately
* data is not resolved until you actually call compute
* it doesn't have to be that way
  * slower things could, in principle, be delayed calls as well
* any operation on the metadata layer should happen near-immediately; ideally it would calculate the measurement and then block until it is done
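Dask, which fabric composes with per David's answer, shows the same split: the graph is built eagerly, while the data resolves only on compute. A tiny illustration:

```python
import dask

# The task graph (metadata layer) is built immediately; no data is
# resolved until compute() is called.
@dask.delayed
def load():
    return [1, 2, 3]

@dask.delayed
def total(values):
    return sum(values)

graph = total(load())   # immediate: just wires up the DAG, touches no data
print(graph.compute())  # the data layer resolves here -> 6
```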