# Abstractions at Unsupervised

### The Document I wish I had when I started at Unsupervised

Unsupervised takes a different approach when it comes to AI and analytics. In your first few months here you will likely hear terms such as "Pattern Find", "Dressing Room", "Pattern Flow", "Fabric", "Semantics", and "Encodings", among many others. The role of the Data Scientist here is slightly more nuanced than you may encounter at other companies. More than anywhere else I have worked, there seems to be a specific set of *abstractions* that are worth having. There is an incredible amount of brain power here, but the reality is that we are still a startup and many of the best ways to think about our product aren't written down anywhere. *Yet*.

## 1. Standing in the Same Place

Let's take a moment to make sure that we are mentally standing in the same place. We are going to take a step back, even before AI, and start with BI.

### 1.1 Business Intelligence and the Quest to be "Data Driven"

BI is generally *proposed* for one main reason:

* Being data driven (i.e. *empirical*) is inherently useful if you are trying to make good decisions.

BI has a rather interesting etiology. To fully appreciate it, we must acknowledge that it is really ~450 years in the making. From the time of Francis Bacon, Isaac Newton, John Locke, and a host of other philosophers, the idea of *empiricism* has been viewed as the fundamental way of attaining knowledge. We must *observe* something in our surrounding world in order to gain confidence in its truth.

Suppose you have a hypothesis: "I think that expanding our offerings into the Southeast United States is a good idea and will lead to increased profits." Okay, now we must ask *why* you believe that to be true. Most likely, you formed this hypothesis based on some knowledge you had about the world; in particular, some knowledge about how your product would perform in the Southeast United States. How did you acquire that knowledge? There is a very good chance it was via some sort of observation. Maybe a recent report showed that a competitor of yours has had tremendous success with a similar offering in the Southeast United States. Or perhaps you have spoken to a few contacts in that region and they all report that demand is high and cannot be met. Regardless, we can say the following:

> By gathering relevant *information* you were able to *form* a *hypothesis*.

Note that gathering information can allow us to *test* a hypothesis as well. Imagine that you decided to act on your initial hypothesis and expand your offering into the Southeast United States. Now you want to know if that was in fact a good decision. By gathering data (revenue in the region over the next 3 months, retention rates, etc.), you can *test* whether your hypothesis was indeed correct.

So, knowledge is beneficial. No surprise there. You are probably wondering why we are even going over this. It is important because it helps us really understand the problem that BI was trying to solve (and in turn, what Unsupervised is trying to solve). Now this type of thinking, which kicked off the Scientific Revolution in the mid-1500s, was slowly integrated into not only the scientific community, but others as well. By the time the 1950s came around, it began to truly permeate business. It is no surprise that it did. Any business that gathers more knowledge and applies it thoughtfully to decision making is going to give itself a competitive advantage. Those that don't will die.
The computer (and the arrival of the first databases) accelerated this trend. "Wait, if making decisions based on knowledge derived from data is good, then shouldn't we gather as much data as possible?" Fast forward another 50 years and we arrive at the present: data is everywhere, and those who aren't using it are expected to go extinct.

We can summarize *why* BI came to be (at the risk of oversimplifying) as follows:

> A combination of *science says "being data driven is a good thing"* and *adaptive pressures* has led BI to become commonplace.

And we can use a blanket statement to capture the current sentiment:

> Being data driven is inherently useful if you are trying to make good decisions.

What we can't forget is that somewhere along the way the message was distorted. Note the difference between the two statements below:

1. By gathering *relevant information* you can come up with an effective *hypothesis*.
2. Gathering more data will allow you to make better decisions.

These statements express *drastically* different viewpoints. Statement 1 represents the philosophy of the Scientific Revolution. Statement 2 represents the distorted message that is prevalent today. Any fool can tell you that simply gathering *more* information won't help you make better decisions. Gathering *relevant information* that provides insight into your problem can help you make better decisions. In fact, there is strong evidence that after a certain point gathering more data results in no improvement in decision making, only an increase in *confidence*[^1]. But this raises an entirely new question, one that BI was not prepared to answer: how do we know what constitutes *relevant information*?

**Key Takeaways**

* We need the right pieces of information to make optimal decisions.
* Realistically, there is an unwieldy number of dimensions we could consider.
* BI practices have left us with copious amounts of data. We need a way to deal with this higher dimensionality.

### 1.3 Say Hello to Human Biases

So, BI led us (albeit in a meandering way) to the idea that more information was better. Specifically, more *relevant* information was better. If that was true, why didn't we suddenly have humans analyzing thousands of variables, finding those that are relevant to one final decision, and making decisions accordingly?

There are actually two diametrically opposed problems that we encounter at this stage. We can place people into two groups, each group presenting its own problem:

1. **Group 1**: People who want to consider as many "relevant" variables as possible in order to make the best decision.
   **Problem 1**: Considering more variables often does not lead to better decisions, but it feeds *confirmation bias*.
2. **Group 2**: People who believe that clearly some information is "better" than other information when it comes to relevance in making a decision.
   **Problem 2**: By prematurely reducing the variables under consideration, a *bias* is introduced that likely removes certain variables that would have led to a better decision.

Let's take a look at these problems in a bit more detail.

#### 1.3.1 Decision Quality vs. Number of Pieces of Information (Problem 1)

There is a fantastic study[^1] conducted by Paul Slovic that makes a great case for the following point:

> More information does not necessarily improve the _quality_ of a decision, but it will improve one's _confidence_ in the decision.
Slovic took a group of 8 professional handicappers (professional gamblers who made a living betting on which horse would win a race) and tested the accuracy of their predictions as a function of how much information (how many variables) they were provided with. The findings were incredibly eye opening. We can summarize the study as follows:

* A group of 8 handicappers were presented with a list of 88 variables taken from a chart of horses' past performances.
* The handicappers were asked to select 5 variables from the 88 that they would like to have access to in predicting the winner of the horse race. They were also asked to rate how confident they were in their prediction.
* With 5 variables, the handicappers were 17% accurate in their predictions. They rated themselves (on average) as 19% confident. So, given 5 variables, confidence and decision-making accuracy were nearly equal.
* Then, the handicappers were given 10, 20, and 40 variables of their choosing and again asked to predict the winner of the horse race.
* Accuracy flat-lined at ~17%. Confidence, on the other hand, nearly doubled, increasing to 34%.

![](confidence_vs_items_of_information.png)

What we learn from this is that beyond a certain minimum amount of information, additional information only feeds *confirmation bias*. As you start considering more and more variables, there is an increasing likelihood that they contain mutual information (they are correlated). For instance, a horse's top speed is often correlated with its average speed. If you are given the top speed, that provides you with new information that can be helpful in predicting whether that horse will win the race. If you are then also provided with the average speed of the horse, it carries a good deal of the same information that the top speed provided. This means it isn't going to significantly improve your decision. However, it improves the handicapper's *confidence* in the decision, because now two variables support the claim.

What happens is that given, say, 40 pieces of information, the handicapper has likely already decided that horse A is going to win the race. The handicapper then (unknowingly) looks through the other 35 pieces of information and finds variables that *confirm* this belief. This is why their confidence grows.

So, taking a step back, our original question was "why don't we suddenly have humans analyzing thousands of variables that relate to one final decision?". This is part one of the answer: deciding what information is *relevant* once an initial decision has been made is likely to only increase confirmation bias.
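To see the correlated-variables point in miniature, here is a small sketch on synthetic data (the variable names and numbers are hypothetical, not from the Slovic study): a second predictor that is highly correlated with the first "agrees" with it, yet barely improves the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical horse attributes: average speed is strongly correlated with top speed.
top_speed = rng.normal(60, 5, n)
avg_speed = 0.8 * top_speed + rng.normal(0, 1.5, n)

# A made-up race-outcome score driven mostly by top speed.
outcome = 0.5 * top_speed + rng.normal(0, 3, n)

print("corr(top_speed, avg_speed):", np.corrcoef(top_speed, avg_speed)[0, 1].round(2))

# Model 1: top speed only.
X_one = top_speed.reshape(-1, 1)
r2_one = LinearRegression().fit(X_one, outcome).score(X_one, outcome)

# Model 2: top speed plus the (mostly redundant) average speed.
X_two = np.column_stack([top_speed, avg_speed])
r2_two = LinearRegression().fit(X_two, outcome).score(X_two, outcome)

print("R^2 with 1 variable:", round(r2_one, 3))
print("R^2 with 2 variables:", round(r2_two, 3))  # barely moves
```

The second variable confirms the first but adds little predictive power, which is exactly the gap between confidence and accuracy described above.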
#### 1.3.2 Humans look at data they believe is related to their decision (Problem 2)

Now let us look at Group 2. These people have an idea of what variables they want to look at ahead of time, specifically those that they believe are directly related to their decision. To illustrate this point, let's pose the following scenario: You are a VP at a telecom company. You have a broadband offering that is performing poorly. You have to decide whether it should be discontinued, or whether its original execution was ineffective and it needs a reinvestment of resources. Now, you are offered two pieces of information:

* The revenue generated from the broadband offering over the past 3 years.
* The number of sunny days in Fresno, CA over the past 3 years.

Which will be more useful in making your decision? Clearly the first piece of information is directly related to your decision, while the second likely has no relationship whatsoever. Now, this is not meant to discredit bringing in additional data. However, we must recognize that some pushback could (and *should*) be expected. Humans have finite time, energy, and cognitive capabilities. Combing through pages upon pages of *noise* just to get to a bit of *signal* is not going to cut it. So, as BI was established, it was immediately accompanied by the process of deciding *what data is relevant*: What should we be monitoring? What are our KPIs? How do we want to view this information?

Now, what is the problem here? At first glance this seems all well and good. In fact, at this stage it doesn't even differ from the scientific approach (where we also are forced to determine what information is *relevant* and form a hypothesis). So this should in fact be fine, right? Not so fast. The problem with this approach is that we implicitly bring in *human biases*. For instance, a business analyst likely has some preexisting ideas about what they should think about and what they need to focus on. They will look at their data accordingly. What if, in their narrow field of view, they *missed* a key variable of interest, a crucial bit of information that would have allowed for a much better final decision? It is highly likely that this will happen, but there is no clear-cut way to prevent it given the current methodology of BI.

To be clear, this isn't an indictment of anyone's approach. Rather, it is trying to highlight the trajectory that leads us to the current state of affairs. It is worth noting that this problem is not native to the world of BI; it is in fact present in science (science is indeed done by humans, and humans are, well... biased). Thankfully the degree to which it is present in the sciences isn't quite as extreme.

**Key Takeaways**

* Using **heuristics** and **background knowledge** to select a subset of variables is a commonly used form of "dimensionality reduction".
* This approach introduces **biases**.

### 1.4 AI and the Fight Against Human Biases

In an effort to stave off the challenges posed to BI by the need for human judgment, AI was proposed as a potential solution. After all, the desire to be "data driven" has left us with more data than ever, so why not take *full advantage* of it? Computers have become more and more powerful, making way for more impressive AI techniques than ever before. Surely if we feed all of this data, unaltered and unbiased, into an AI algorithm we will achieve the results we have truly been after.

Not so fast. At their core, standard AI (often statistical learning) techniques have several shortcomings that limit their use in practice.

1. **Lack of Interpretability**

Imagine you are a pricing analyst for a trucking company. You are trying to find shipments that are losing you money in an unexpected way. You have 1 million shipments, and 500 attributes corresponding to each shipment. Your company decides to hire a data science consultant to come in and work some magic in order to find a group of these shipments that they can either make or save money on. A classic needle-in-the-haystack problem, only now our haystack is a flat table and our needle is a subset of rows in that table.

The data scientist comes back, *raving* about the dimensionality reduction techniques that they applied: "I utilized a bleeding edge manifold learning algorithm, and just look at this!

![](Screen%20Shot%202021-01-07%20at%204.11.43%20PM.png)

See, now the points are no longer 500-dimensional; there they are, embedded in 2 dimensions, X and Y! See that little cluster off to the left in red? That represents an interesting group of shipments that you should look at. Any questions?"

You sit back wondering what excuse you can make to leave the room at once. How are you supposed to reason about why this little grouping of red shipments is interesting? Interesting how? What in the world does this have to do with the actual, real-world pricing decisions you need to make?

This lack of interpretability is a fatal flaw of traditional AI and statistical learning techniques. Even if you *understand* how they work, it is very hard to interpret resulting dimensions that don't have an intuitive and interpretable meaning. Now, what if the complexity is so high that it is challenging to even intuit what is going on under the hood of one of these algorithms?
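To make the scenario concrete, here is a minimal sketch of the kind of pipeline such a consultant might run, with synthetic data and scikit-learn's t-SNE standing in for whatever "bleeding edge" manifold learner was actually used. The point is not the algorithm; it is that the two output axes carry no business meaning.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for 1M shipments x 500 attributes (kept small so this runs quickly).
shipments = rng.normal(size=(1_000, 500))

# Embed the 500 attributes into 2 dimensions.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(shipments)

print(embedding.shape)  # (1000, 2)
# The columns of `embedding` are the "X" and "Y" in the consultant's plot.
# Neither axis corresponds to rate, lane, weight, customer, or any other
# attribute a pricing analyst could act on; that is the interpretability gap.
```
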
2. **People Don't Trust Black Boxes**

Say you are in a position to decide the next city in which to open a new branch of your up-and-coming retail company. You have *thousands* of variables describing your company, the potential cities under consideration, competitors, and so on. You have heard all about the hot new AI company, Snoogle, and their ultra quantum super neural net. Wow! Let's throw this data in there and let the AI make an unbiased decision!

It spits out the recommendation: "Des Moines". You pause; a brief moment of excitement gives way to thinking "WTF? *Why* exactly does Des Moines make sense?"

AI is great at making *predictions*, but many modern techniques are black boxes (for instance, deep neural nets); a human can only understand so many layered linear transformations before losing intuition on how the final recommendation was arrived at. At the end of the day, humans don't simply want *predictions*; they want *explanations*. What if it turns out that behind this super fancy AI there was really some lazy employee who, whenever the function was called, would throw a dart at a map of the US and return the nearest city? You would have no way of *knowing* this was what was happening, because there was no *explanation*. Put another way, you had no *context* on why the decision was made.

We can reasonably bet that even if you trusted the AI and considered opening a branch in Des Moines, you would undoubtedly conduct your own analysis of Des Moines as a candidate. You would look at things such as its workforce, growth rate, tax implications, and so on, until you had constructed your own *explanation* as to whether or not Des Moines actually made sense.

**Key Takeaways**

* Traditional AI techniques are not interpretable.
* People do not trust things that they cannot interpret or understand.

### 2. Enter: Unsupervised

* Unsupervised comes in and takes a different approach:
* Find slices of data that are interesting *relative to a KPI*. We essentially use the KPI as our distance metric, instead of trying to use some sort of high-dimensional Euclidean distance.
* Describe these slices of data with metrics that help a business user understand why the slice is interesting.
* The Dressing Room provides the missing context.
* At a certain point, it shouldn't even feel like "here are some data points that are interesting"; instead it should be "here is a (human readable) filter that, when applied to your data, defines a certain set of rows."

### 3. Data Science at Unsupervised

* A pattern is a data structure. It consists of a filter set; when this filter set is applied to a data set, it returns a set of rows (see the sketch after this list).
* Patterns are great because they are human interpretable. We are now saying: "here is an interesting set of rows, *described* by these filters, and interesting because their KPI is anomalous."
* We can think about the *distance* between two patterns in several key ways:
  * Rowset distance: how many rows they differ by (Jaccard distance).
  * Filterset distance: how many filters they differ by.
* An interesting note on mutual information:
  * Patterns that are close in terms of Jaccard distance will have a high degree of mutual information (they share a large proportion of the same rows).
  * Patterns that are far apart in filterset space but close in rowset space may represent similar concepts.
* To avoid needing to navigate a high-dimensional space (as in traditional unsupervised learning), we frequently look at things as being interesting with respect to (WRT) the KPI. We do this based on difference in distribution (via the Kolmogorov-Smirnov, or KS, score).
* The relationship we optimize over is the parent-child relationship.
* Note that a pattern has a neighborhood consisting of more than simply its parents. It has its children, siblings, etc.
* Pattern Find is a way to explore a given space for anomalous slices of data (sets of data points) WRT a KPI. Where we as DS have a lot of control is in creating/defining the *space* that Pattern Find is exploring.
  * This is done via feature engineering.
  * A traditional example here is TODO.
  * We want columns that are correlated with the KPI, but not *too* correlated! Unless they are not familiar to the business users! There is an interesting balance that we need to achieve.
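None of the following is the production implementation; it is a toy sketch (hypothetical class and column names, pandas/scipy) meant to make the abstractions above concrete: a pattern as a filter set, the row set it selects, the Jaccard (rowset) distance between two patterns, and a KS-style interestingness score against the KPI.

```python
from dataclasses import dataclass

import pandas as pd
from scipy.stats import ks_2samp


@dataclass(frozen=True)
class Pattern:
    """A pattern is just a filter set: (column, required value) pairs."""
    filters: tuple  # e.g. (("region", "SE"), ("product", "Broadband"))

    def rows(self, df: pd.DataFrame) -> pd.Index:
        """Apply the filter set to a data set and return the matching row index."""
        mask = pd.Series(True, index=df.index)
        for column, value in self.filters:
            mask &= df[column] == value
        return df.index[mask]


def jaccard_distance(a: Pattern, b: Pattern, df: pd.DataFrame) -> float:
    """Rowset distance: 1 - |intersection| / |union| of the rows each pattern selects."""
    rows_a, rows_b = set(a.rows(df)), set(b.rows(df))
    union = rows_a | rows_b
    return 1.0 if not union else 1.0 - len(rows_a & rows_b) / len(union)


def ks_interestingness(p: Pattern, df: pd.DataFrame, kpi: str) -> float:
    """How anomalous is the pattern's KPI distribution vs. the rest of the data (KS statistic)?"""
    inside = p.rows(df)
    outside = df.index.difference(inside)
    return ks_2samp(df.loc[inside, kpi], df.loc[outside, kpi]).statistic


# Toy usage with made-up columns and a "margin" KPI:
df = pd.DataFrame({
    "region": ["SE", "SE", "NW", "NW", "SE"],
    "product": ["Broadband", "Broadband", "Broadband", "TV", "TV"],
    "margin": [1.0, 0.8, 3.2, 2.9, 3.1],
})
parent = Pattern((("region", "SE"),))
child = Pattern((("region", "SE"), ("product", "Broadband")))

print(jaccard_distance(parent, child, df))      # rowset distance between parent and child
print(ks_interestingness(child, df, kpi="margin"))  # how anomalous the child's KPI is
```

The last two lines illustrate the parent-child relationship from the list above: each added filter produces a child whose row set is a subset of its parent's, and the KS score is one way to ask whether that child is interesting WRT the KPI.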
### 4. Fabric and the Future

* Enter Unsupervised and Fabric.
* By using encodings and semantics we can consider *all* human-understandable features that could be created in a data set (i.e. that can be created via transformations of the original features), and we can use semantics to help pare that massive list down to a more manageable one (see the sketch below).
* Then, based on this resulting feature set, we can use PF and DR to optimize the resulting patterns, finding those that have nuance WRT the KPI.
* So, Unsupervised is a human-understandable form of dimensionality reduction.

[^1]: Paul Slovic, *Towards Understanding and Improving Decisions*, 1982, p. 19.
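The "all human-understandable features via transformations" bullet is easiest to picture with a sketch. This is not how Fabric actually works internally; it is a hedged illustration (hypothetical semantic tags and transformation rules) of enumerating candidate derived columns and using semantics to pare the list down.

```python
import itertools

import pandas as pd

# Hypothetical semantic tags for the original columns.
semantics = {
    "revenue": "currency",
    "cost": "currency",
    "ship_date": "date",
    "customer_id": "identifier",
}

df = pd.DataFrame({
    "revenue": [120.0, 80.0, 200.0],
    "cost": [100.0, 90.0, 150.0],
    "ship_date": pd.to_datetime(["2021-01-04", "2021-01-09", "2021-02-15"]),
    "customer_id": ["a1", "b2", "c3"],
})

candidates = {}

# Rule 1: differences of any two currency columns (e.g. a margin-like feature).
currency_cols = [c for c, tag in semantics.items() if tag == "currency"]
for a, b in itertools.permutations(currency_cols, 2):
    candidates[f"{a}_minus_{b}"] = df[a] - df[b]

# Rule 2: calendar parts of any date column.
for c, tag in semantics.items():
    if tag == "date":
        candidates[f"{c}_month"] = df[c].dt.month
        candidates[f"{c}_day_of_week"] = df[c].dt.dayofweek

# Rule 3 (the "pare down" step): semantics tell us identifiers are not
# meaningful features, so nothing is derived from customer_id.

print(list(candidates))
# ['revenue_minus_cost', 'cost_minus_revenue', 'ship_date_month', 'ship_date_day_of_week']
```

Pattern Find would then explore patterns over this expanded, still human-readable feature space, looking for slices whose KPI distribution stands out.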