# Feature vs. Feature Value
### My Question
One question I still have might lead us slightly off topic if I posted it in the fabric channel, so I'll ask it here:
- How do you think about _features/variables_ vs. their _values_? Or, to be more technical, _dimensions_ vs. their _domain_? For example, if I have a feature/column/variable `color`, it may have a domain of values `red, green, blue, …`.
- However, we could easily create three new _features_ from this information: `color_equals_red`, `color_equals_green`, `color_equals_blue`. These features are simply binary/one-hot indicators. They also come with constraints: if `color_equals_red` is True for a given data point, then `color_equals_green` and `color_equals_blue` must be False (see the sketch after this list).
- I have been thinking a lot about this relationship between features/columns/variables and their values, particularly because via different techniques (one-hot encoding, applying a filter via pattern matching, etc.) we essentially move between them.
- My main question: because a _feature value_ (say, the value `blue` in the `color` feature) could be used to generate a _new feature_ `color_equals_blue`, there seems to be a fluid nature between features and values. I am guessing there is a way to formalize this, and I'm wondering if this is something you have thought about / have a way of reasoning about?
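A minimal sketch of the value-to-feature move described above, using pandas (an assumption; any one-hot encoder would do) on a toy `color` column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: each *value* of `color` becomes a new binary *feature*.
one_hot = pd.get_dummies(df["color"], prefix="color_equals")
print(one_hot.columns.tolist())
# ['color_equals_blue', 'color_equals_green', 'color_equals_red']

# The constraint mentioned above: exactly one indicator is True per row.
assert (one_hot.sum(axis=1) == 1).all()
```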
### Justin Response
i looove these questions. this is **at the core** of some of the opinions of fabric!
okay, maybe we should sync, haha. i'm nerd-sniped into it, since it's so dead on things i care about. XD
This is getting into the `data` vs. `language` thing we talked about a while ago: moving data into the language, etc.
hmm, so, i'll briefly also say: i don't have all of this formalized and fully figured out. i have all the edge cases i've come across "organized", but it's not necessarily hardened.
it turns out that there's a meta-referential thing that opens up a logical explosion. (Once you introduce the `Sweep` task, which then creates perspectives on attributes, which are themselves sets of records, you start heading into power-set / bops, as attributes, etc.) But that's the "messy" part of all of this.
So, i'll also try to answer explicitly now:
1. I think of features/variables as 'language representation' and their values as 'measurable / results of computation'. I treat those as _distinct_ from a `value` and a `set of values` (or domain). That's what i call encodings and defining attributes: a defining attribute is a measurable answer to "what is known"; an encoding is a declaration of an abstract "set class". E.g. in a math sense, you have a set of values {0, 1, 2}, and you also have the set Z (integers); the "measured" {0, 1, 2} is a subset of Z. All defining attributes are subsets of their encoding (and also their perspective's encoding).
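   A toy sketch of that distinction (names are mine for illustration, not fabric's API): the encoding declares the abstract set class, the defining attribute is what was actually measured, and the subset relationship is checkable.

   ```python
   # Encoding: a declaration of an abstract "set class", here "all integers" (Z).
   def in_encoding(x) -> bool:
       return isinstance(x, int)

   # Defining attribute: the measured answer to "what is known" at this moment.
   measured = {0, 1, 2}

   # Every defining attribute is a subset of its encoding.
   assert all(in_encoding(v) for v in measured)

   # A later measurement may differ, but must stay within the same encoding.
   measured_later = {0, 1, 2, 7}
   assert all(in_encoding(v) for v in measured_later)
   ```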
2. About the groups / patterns: i think of these as simply compute graphs that yield some defining attribute (DA). These defining attributes exist in a space where they can be measured relative to other defining attributes. E.g. you can use an HLL sketch, check overlaps, and see that `color_equals_green` is a 'singleton' and mutually exclusive from `color_equals_blue`. Also, if you look at content hashes of DAs, you can check for exact equivalences, which becomes important when building up a representation of fabric. So, for instance, think of a complex fabric (10^7 attributes), and then measure all of them non-stop (so you have a stream of 10^7 × Δt definitions of attributes). You could compute an attribute content hash for every one of those, and find all equivalences. It's likely the case that lots of things "point to the same result", and you can measure the structure of this graph over time. E.g. `load(column).groupby(.).takeone()` might keep updating, but filtering that by `color_blue` might always remain the same. That means you found a (for all time as you know it) fixed point: a "consistency" in the graph as far as content-identity is concerned. The "embedding of attributes" based on content identity makes sense for exact matches (i hope, from that example), but take it one step further and you have an "embedding of attributes" based on relationship / semantic similarity / approximate content overlap: {0, 1, 2} and {0, 1} are closer to each other than to {0, 5, 8, 12}. But those are "measurements" of compute, so you are really saying: "at some time-domain of consistency", attr 1 is measured to be close to attr 2, or exactly the same as attr 2, or mutually exclusive from attr 2.
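   A sketch of this (illustrative only): content-hashing DAs to find exact equivalences, with plain set overlap standing in for the HLL-sketch overlap estimates a real system would use at 10^7 scale.

   ```python
   import hashlib

   def content_hash(values) -> str:
       # Order-independent hash of a defining attribute's contents.
       canonical = "\x1f".join(sorted(map(str, values)))
       return hashlib.sha256(canonical.encode()).hexdigest()

   color_equals_blue  = {3, 9, 14}   # record ids where color == blue
   color_equals_green = {1, 2, 5}    # record ids where color == green
   other_da           = {3, 9, 14}   # some other compute graph's result

   # Exact equivalence: two compute graphs "point to the same result".
   assert content_hash(color_equals_blue) == content_hash(other_da)

   # Mutual exclusion: the two indicator attributes never overlap.
   assert not (color_equals_blue & color_equals_green)

   # Approximate closeness, e.g. Jaccard overlap on contents:
   a, b, c = {0, 1, 2}, {0, 1}, {0, 5, 8, 12}
   jaccard = lambda x, y: len(x & y) / len(x | y)
   assert jaccard(a, b) > jaccard(a, c)   # {0, 1, 2} is closer to {0, 1}
   ```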
3. And yeah, coming back to your point about moving between data and one-hot: that's exactly an example where the data itself is pivoted into the language. The core thing is: the language isn't a definition of the data. It's possible the language causes the filter to result in Null, because `color=Blue` didn't exist at all, so filtering by it returned nothing. Or it could have returned blue. Or, "the system that determines equivalence and parses the `color=Blue` filter" could decide "my = means normalized equals", in which case it would match `blue` and `Blue` and `BLUE`, and you'd get 3 results. The "goal" of fabric as a learner, as an embedder, is to understand that difference between the language and the data, across the full domain, and build a consistent embedding.
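   A tiny sketch of that point (function names are hypothetical): the same language-level filter `color = Blue` yields different results depending on the data and on what the system decides `=` means.

   ```python
   data = ["blue", "Blue", "BLUE", "red"]

   def filter_exact(values, target):
       # "=" means exact equality.
       return [v for v in values if v == target]

   def filter_normalized(values, target):
       # "my = means normalized equals": case-insensitive match.
       return [v for v in values if v.casefold() == target.casefold()]

   print(filter_exact(data, "Blue"))       # ['Blue'] -> 1 result
   print(filter_normalized(data, "Blue"))  # ['blue', 'Blue', 'BLUE'] -> 3 results
   print(filter_exact(data, "Purple"))     # [] -> the value never existed
   ```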
4. My general thought about formalization is that there "isn't 1 rule". Just because `blue` exists in `color` doesn't mean that there is a bi-directional truth relationship between that and `color_equals_blue`. There could be a `color = blue` filter, there could be `regex_match_in_str(color, '^blue