# T-Digest ### Raison d'être An [Online-Algorithm](Online-Algorithm.md) that allows us to compute approximate quantiles with adjustable accuracy limits. ### Overview We may start by asking why it is even so important to estimate quantiles? Recall their definition: > **Quantile (or percentile)**: A value that is used to split a group (vector/list) of numbers in a particular way. The **median** (50th percentile) is the value that is used to split a list of numbers into two equal groups. Percentiles can be used to reconstruct the [Cumulative Distribution Function](Cumulative%20Distribution%20Function.md). It is important to note that T-Digest has variable accuracy and *constant relative accuracy*. This is a very important property to have, ensuring that it is more accurate towards the 99.9th percentile and less accurate towards the median. ### Original Approach The way that T-Digest originally achieved its goals was as follows: * It kept clusters (similar to k means in 1 dimension), where the size of the cluster was allowed to be large in the middle (where q ~ 0.5) and small at the 0 and 1 end * Choosing how that size was selected could be done via $size = q(1 - q)$, ensuring that the size will decrease towards the end. This allows for us to have the *accuracy properties* that we want! * ### * See * Input: Any sequence of numeric observations * numeric observations paired with a mass * The sketch is an estimate of the CDF ![](Screen%20Shot%202021-03-03%20at%209.23.40%20AM.png) ![](Screen%20Shot%202021-03-03%20at%209.24.18%20AM.png) Under the hood T-Digest stores a bunch of clusters. A cluster is a location and a mass. It is a location somewhere in your data space. ![](Screen%20Shot%202021-03-03%20at%209.24.43%20AM.png) When you update your T-Digest you have an incoming observation and you locate the cluster in the T-Digest nearest it (green below) and then you add the mass and update the location based on the weighted average. ![](Screen%20Shot%202021-03-03%20at%209.25.20%20AM.png) Each cluster can be assigned a quantile. That is the sum of the mass to the left, divided by the total mass. ![](Screen%20Shot%202021-03-03%20at%209.26.46%20AM.png) ![](Screen%20Shot%202021-03-03%20at%209.27.29%20AM.png) ![](Screen%20Shot%202021-03-03%20at%209.28.37%20AM.png) ![](Screen%20Shot%202021-03-03%20at%209.29.06%20AM.png) --- Date: 20210831 Links to: Tags: #todo References: * [Tdunning overview](https://www.youtube.com/watch?v=CR4-aVvjE6A) * [# Sketching Data with T Digest - youtube talk](https://www.youtube.com/watch?v=ETUYhEZRtWE) * [PDF paper by dunning (marked up by me)](https://drive.google.com/file/d/11qk76KQHBwo6ZKuj8cKT-HJpD41Rmida/view?usp=sharing)