# Feature vs. Feature Value

### My Question

One question that I do still have, and that I think may lead us slightly off topic if I post it in the fabric channel, so I'll ask here:

- How do you think about _features/variables_ vs. their _values_? Or, to be more technical, _dimensions_ vs. _domain_? For example, if I have a feature/column/variable `color`, it may have a domain of values `red, green, blue, …`.
- However, we could easily create three new _features_ from this information: `color_equals_red`, `color_equals_green`, `color_equals_blue`. These features are simply binary indicators (one-hot). They also have constraints imposed on them: if `color_equals_red` is True for a given data point, then `color_equals_green` and `color_equals_blue` must be False.
- I have been thinking a lot about this relationship between features/columns/variables and their values, particularly because via different techniques (one-hot encoding, applying a filter via pattern matching, etc.) we essentially move between them.
- My main question: because a _feature value_ (say, the value `blue` in the `color` feature) could be used to generate a _new feature_ `color_equals_blue`, there seems to be a fluid nature between features and values. I am guessing there is a way to formalize this, and am wondering if this is something you have thought about / have a way of reasoning about?

### Justin Response

i looove these questions. this is **at the core** of some of the opinions of fabric! okay, maybe we should sync, haha, im nerd sniped into it, since it's so dead on things i care about. XD This is getting into the `data` vs. `language` thing we talked about a long time ago: moving data into the language, etc.

hmm, so, i'll briefly also say: i don't have all of this formalized and fully figured out. i have all the edge cases i've come across "organized", but it's not necessarily hardened. it turns out that there's a meta-referential thing that opens up a logical explosion. (once you introduce the `Sweep` task, which creates perspectives on attributes, which are themselves sets of records, you start heading into power-set / ops-as-attributes territory, etc.) -- but that's the "messy" part of all of this. so, i'll also try to answer explicitly now.

1. i think of features/variables as 'language representation' and their values as 'measurable / results of computation'. i treat those as _distinct_ from `value` and `set of values` (the domain). that's what i consider encodings and defining attributes: a defining attribute (DA) is a measurable answer to "what is known", while an encoding is a declaration of an abstract "set class". eg. in a math sense, you have a set of values {0, 1, 2}, and you also have the set Z (the integers); the "measured" {0, 1, 2} is a subset of Z. all defining attributes are subsets of their encoding (and also their perspective's encoding).

2. about the groups / patterns: i think of these as simply compute graphs that yield some DA. these defining attributes exist in a space that can be measured relative to other defining attributes. eg. you can use an HLL sketch, check overlaps, and see that `color_equals_green` is a 'singleton' and mutually exclusive from `color_equals_blue`. also, if you look at content hashes of DAs, you can check for exact equivalences, which becomes important when building up a representation of fabric.

   so, for instance, think of a complex fabric (10^7 attributes), and then measure all of them non-stop (so you have a stream of 10^7 × Δt definitions of attributes). you could compute an attribute content hash for every one of those and find all equivalences. it's likely the case that lots of things "point to the same result", and you can measure the structure of this graph over time. eg. `load(column).groupby(.).takeone()` might keep updating, but filtering that by `color_blue` might always remain the same. that means you found a (for all time as you know it) fixed point: a "consistency" in the graph as far as content identity is concerned.

   the "embedding of attributes" based on content identity makes sense for exact matches (i hope, from that example), but take it one step further and you have "embedding of attributes" based on relationship / semantic similarity / approximate content overlaps: {0, 1, 2} and {0, 1} are closer than {0, 5, 8, 12}. but those are "measurements" of compute, so you are saying "at some time-domain of consistency", attr 1 is measured to be close to attr 2, or exactly the same as attr 2, or mutually exclusive from attr 2. (a toy version of the pivot, the overlap check, and the content-hash check is sketched below.)
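A minimal sketch of those measurements in plain Python. Everything here is hypothetical (names like `support` and `content_hash` are mine, not fabric's): exact record-id sets stand in for the HLL sketches, and sha256 over a canonical serialization stands in for whatever content-hashing scheme fabric actually uses.

```python
import hashlib
import json

# hypothetical toy dataset
records = [
    {"id": 0, "color": "red"},
    {"id": 1, "color": "blue"},
    {"id": 2, "color": "green"},
    {"id": 3, "color": "blue"},
]

# the value -> feature pivot: every value in the domain of `color`
# becomes its own binary indicator feature (one-hot).
domain = sorted({r["color"] for r in records})
for r in records:
    for value in domain:
        r[f"color_equals_{value}"] = r["color"] == value

# each indicator defines a set of record ids -- a measured "defining
# attribute" carved out of its encoding (here, the set of all record ids).
support = {
    value: frozenset(r["id"] for r in records if r[f"color_equals_{value}"])
    for value in domain
}

# overlap measurement: the indicators are mutually exclusive...
assert not (support["blue"] & support["green"])
# ...and together they partition the encoding.
assert frozenset().union(*support.values()) == frozenset(r["id"] for r in records)

# content hash: canonicalize the measured content, then hash it, so two
# different compute graphs that yield the same DA collide on the same id.
def content_hash(ids):
    return hashlib.sha256(json.dumps(sorted(ids)).encode()).hexdigest()

# two different language paths, one measured identity
via_indicator = frozenset(r["id"] for r in records if r["color_equals_blue"])
via_filter = frozenset(r["id"] for r in records if r["color"] == "blue")
assert content_hash(via_indicator) == content_hash(via_filter)
```

the design point is that `content_hash` operates on canonicalized *measured content*, not on the compute graph, so equivalence is discovered by measurement rather than declared by the language.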
3. and yeah, coming back to your point about moving between data and one-hot: that's exactly an example where the data itself is pivoted into the language. the core thing is: the language isn't a definition of the data. it's possible the language causes the filter to result in Null, because `color=Blue` didn't exist at all, and therefore filtering by it returned nothing. or it could have returned blue. or, "the system that determines equivalence and parses the filter by color=Blue" could decide "my `=` means normalized equals", in which case it'd match `blue` and `Blue` and `BLUE`, and you'd get 3 results. the "goal" of fabric as a learner, as an embedder, is to understand that difference between the language and the data, across the full domain, and build a consistent embedding.

4. my general thought about formalization is that there "isn't 1 rule". just because `blue` exists in `color` doesn't mean there is a bi-directional truth relationship between that and `color_equals_blue`. there could be a `color = blue` filter, there could be `regex_match_in_str(color, '^blue')`, etc., that all result in the same thing. and the abstract learners that use fabric (humans and AIs) don't really know that those are going to be 100% equivalent, unless you have some language definition (formal math logic) that says they are, _or_ you just measure it, over and over, and find out that it's true.

in reflection, that was just a stream-of-consciousness brain dump. not sure if the core things got across; keep prompting with questions as you digest / reject if it seems wrong~

core to it is language vs. data representation. language space is fabric attributes as far as the task-graph meta-info and semantics are concerned; data space is stuff like encodings and the actual value of the attribute (eg. measured at a specific integrity hash). since fabric connects the language space and the data space via pins at attr-uids, it creates a graph connection between the two. so, the "way to traverse" from _data_ `{'blue': 'blue'}` to `{'red': 'red'}` would be:

- (1) is there a content hash of that somewhere you have seen?
- (2) find the attr(s) for it (which might be `DA1.filter('color': blue)` and `DA2.filter('color': blue)`; eg. lots of attrs in fabric show it).
- (3) walk the neighborhoods around each / consider places where the data<->language overlap exists (`blue` is the string in the data, and `blue` shows up in the language itself as a string, which acts as a weighting).
- (4) given the neighborhood, find `DA1.filter('color': red)`, check its history-cache or measure it, find out that its content is `{'color': 'red'}`, and validate that, yes indeed, you found a "language path" [`DA1.filter('color': blue)` -> parent -> `apply(.filter('color': red))`] that lets you connect two pieces of data.

it _may_ also be the case, for instance, that there exists a filter `.filter(color_tone_mood='angry')` -> `{'red': 'red'}` as well, which is itself an interesting learning: multiple different language specifications can result in the same data. and as you watch it over time, maybe other colors join or leave that defining attribute. and the "connection" between `color_tone_mood='angry'` and `color=red` can be seen based on neighborhood embedding distance. (a toy version of this walk is sketched below.)
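A tiny end-to-end sketch of that walk, with everything hypothetical: a two-record dataset, a dict of attr-uids mapped to filter specs standing in for fabric's language space, sha256 over canonical JSON standing in for its content identity, and a naive "shares a parent path" rule standing in for real neighborhood / embedding distance.

```python
import hashlib
import json

# hypothetical data space
records = [
    {"id": 0, "color": "blue"},
    {"id": 1, "color": "red"},
]

def content_hash(data):
    # canonical serialization, so the same measured content always
    # yields the same identity regardless of how it was computed.
    return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()

# hypothetical language space: attr-uid -> the filter spec that defines it.
attrs = {
    "DA1.filter(color=blue)": {"color": "blue"},
    "DA1.filter(color=red)": {"color": "red"},
}

def measure(spec):
    # "measuring" an attr = running its compute graph against the data.
    return [r for r in records if all(r[k] == v for k, v in spec.items())]

# steps (1)+(2): content-hash the data you hold, then find every attr
# whose measured content carries that same hash.
held_data = [{"id": 0, "color": "blue"}]
hits = [uid for uid, spec in attrs.items()
        if content_hash(measure(spec)) == content_hash(held_data)]

# step (3): walk the language neighborhood of each hit; sharing the
# parent path "DA1" is the naive stand-in for embedding distance.
neighbors = {uid for hit in hits for uid in attrs
             if uid != hit and uid.split(".")[0] == hit.split(".")[0]}

# step (4): measure each neighbor and validate its content -- a language
# path now connects the blue data to {'id': 1, 'color': 'red'}.
for uid in sorted(neighbors):
    print(uid, "->", measure(attrs[uid]))
```

the only load-bearing idea is in steps (1)+(2): identity of *measured content* is what lets a raw piece of data re-enter the graph; after that, the walk happens entirely in language space.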
---

Date: 20210519

Links to: [004-Unsupervised-MOC](004-Unsupervised-MOC.md) [Content-Addressable-Systems](Content-Addressable-Systems.md) [Computable-Addressability](Computable-Addressability.md)

References:
* [Feature vs. feature value, Whiteboard](https://photos.google.com/photo/AF1QipOxnUmAuyA5ZhGLV2aF3bVtxjoY_0ia3dnwhLyH)
* Concepts, Features vs. Values