# Mean Variance Optimization

## Situation

Mean Variance Optimization (MVO) constructs an optimal portfolio that maximizes a variance-adjusted mean return (i.e., the Sharpe Ratio). It does this by taking in the expected return of each asset, $\mu$, and the covariance of asset returns, $\Sigma$. We do not know the true $\mu$ and $\Sigma$ at trade time, so we must pass in our best estimates: $\hat{\mu}$ and $\hat{\Sigma}$. The quality of the resulting portfolio is only as good as the quality of the estimated inputs. We must be able to improve the quality of the estimated inputs.

## Problem

For the purpose of this write-up, we will focus on $\mu$: how can we improve our estimate $\hat{\mu}$? In order to improve our estimate, we must first be able to measure whether an improvement was made. So we can narrow our question a bit further:

> Given two estimates of $\mu$, $\hat{\mu}_1$ and $\hat{\mu}_2$, how can we measure if $\hat{\mu}_2$ is an improvement over $\hat{\mu}_1$?

How might we go about this? One approach is to just run the optimization with both inputs and then compare the financial performance (FP). If $\hat{\mu}_2$ leads to better FP than $\hat{\mu}_1$, we could declare $\hat{\mu}_2$ an improvement. But there are clear problems with this approach. First, it is very susceptible to backtest hacking. Second, say $\hat{\mu}_2$ doesn't lead to an FP improvement over $\hat{\mu}_1$—what should our next step be? If our only evaluation tool is a backtest, our level of analysis is so coarse that we have no clear next step other than to toss the approach that created $\hat{\mu}_2$ aside and try something else. We need evaluation procedures that can tell us *where* to make improvements, or *why* genuine improvements aren't being acted upon by the optimizer. So evaluating $\hat{\mu}$ by backtesting isn't an option.

We can see that this is inherently a cross validation (CV) problem. We need CV functions that can measure the quality of $\hat{\mu}_2$ compared to $\hat{\mu}_1$. These CV functions should satisfy two properties:

1. **Correlate with financial performance**: if $\hat{\mu}_2$ has better CV than $\hat{\mu}_1$, it should also have better FP.
2. **Deepen our understanding** by answering questions like: Where should we focus our efforts when trying to improve $\hat{\mu}$? Why didn't a genuine improvement get utilized?

To achieve these two properties, no single CV function will suffice. We will need a suite of functions that answer different questions about our estimates. This suite will need to be complementary—different functions will fill in each other's "gaps". To build this suite, there is no way around building up a deeper understanding of what the optimizer is doing. How is it using the inputs we pass it?

## Goal

The goal of this write-up is to help build a crisp mental model of what the optimizer is *seeing*. We will do this mainly by thinking *geometrically*. This will allow us to design CV functions that evaluate our inputs based on the context they will be used in.

# Spaces

We start with **asset space**, $\mathbb{A} = \mathbb{R}^n$, with basis vectors $a_1, \dots, a_n$ being *assets*. Suppose we have a vector $v$ in this space. The $i$-th coordinate of $v$ tells us how much of $v$ lies along asset $i$'s axis. There are two objects that live in $\mathbb{A}$:

* **Returns**: Random vectors in asset space, $R$
* **Weights**: Portfolios, $w$ (linear combinations of assets)

Given a return vector, $r$, and some fixed $w$, we can compute the return of the portfolio $w$ as $r_p = w^\top r$.
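To make these objects concrete, here is a minimal numpy sketch (the asset count and all numbers are made up for illustration) computing a portfolio return as an inner product:

```python
import numpy as np

# Hypothetical 3-asset example; the numbers are made up for illustration.
r = np.array([0.02, -0.01, 0.005])  # one observed return vector, r ~ R
w = np.array([0.5, 0.3, 0.2])       # a fixed portfolio in asset space

# Portfolio return: r_p = w^T r
r_p = w @ r
print(r_p)  # ~0.008
```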
If $w$ is of unit length, then $r_p$ is just the projection of $r$ onto $w$.

![center](Pasted%20image%2020250817141622.png)

We always have a set of observed $r \sim R$. These can be visualized as a point cloud in asset space. We can compute the mean, $\mu = \mathbb{E}[R]$, and covariance, $\Sigma = \text{Cov}(R)$, of these observations.

![center | 350](Pasted%20image%2020250817142121.png)

# Risk: Portfolio Variance

We can write the portfolio return as a random variable as well: $R_p = w^\top R$. As a random variable, $R_p$ has a distribution, and we can ask questions about that distribution. In portfolio optimization, one of the main quantities of interest is the *variance* of $R_p$ (the variance of a portfolio). We call this **risk**. It is defined as:

$\text{Risk:} \quad \mathrm{Var}(R_p) = \mathrm{Var}(w^\top R) = w^\top \Sigma w$

We can see this visually below. We have our point cloud of $r \sim R$, and $w_1$, a direction in $\mathbb{A}$. We can project our point cloud onto $w_1$. This projection is $r_p \sim R_p$. The distribution of this projection—the distribution of portfolio returns—is on the right. This distribution has a variance, and that variance is the risk.

![center](Pasted%20image%2020250817143739.png)

Great. We have a *risk* quantity that is of utmost interest to us, and we know how to compute it. Now, any $r$ or $w$ we evaluate will always be in the context of risk. Was $r$ a rare, risky outcome? Is $w$ a risky portfolio?

# Returns and Weights are Anisotropic in Risk

But we have a problem: our space, and the geometry we place on top of it, does not reflect this out of the box. Simply put:

> Returns and weights are both [Anisotropic](Anisotropic.md) with respect to risk.

In other words, depending on which direction we move, risk changes by a different amount. The visualization below should make this clear. We have three different portfolio directions, each of which has a different distribution of returns and a different $\text{Var}(R_p)$—a different risk.

![center](Pasted%20image%2020250817144334.png)

This is a problem. We'd like returns and weights to both be expressed in a way that accounts for risk. We will address this in three ways:

1. Give $R$ a geometry (remains anisotropic)
2. Whiten the points $R$ (isotropic)
3. Give $w$ a geometry (remains anisotropic)

The common connection between all three will be the scalar portfolio variance, $\mathrm{Var}(R_p)$.
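Before developing each of these, a small numpy sketch (with a made-up two-asset covariance) confirming the two claims above: the variance of the projected point cloud equals the quadratic form $w^\top \Sigma w$, and equally long weight vectors can carry very different risk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-asset covariance: asset 1 is riskier, and the assets correlate.
Sigma = np.array([[0.04, 0.012],
                  [0.012, 0.01]])
returns = rng.multivariate_normal([0.05, 0.03], Sigma, size=100_000)

# Risk two ways: variance of the projected cloud vs. the quadratic form.
w1 = np.array([0.7, 0.3])
print(np.var(returns @ w1))  # empirical Var(w^T R)
print(w1 @ Sigma @ w1)       # w^T Sigma w: nearly identical

# Anisotropy: w2 has the same Euclidean length as w1 but different risk.
w2 = np.array([0.3, 0.7])
print(w2 @ Sigma @ w2)       # noticeably smaller variance, same length
```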
# The Geometry of $R$ and the $\Sigma^{-1}$ Transform

Here we are starting with our point cloud of returns, $R$. We've already seen that moving in different directions in this space yields different amounts of risk. Returns are anisotropic with respect to risk. One way we can address this is to keep the raw point cloud of returns unchanged, but use a different [Metric](Metric.md). Implicitly we have been using Euclidean distance as our metric: $\|r\|^2 = r^\top r$. But instead, we can use [Mahalanobis Distance](Mahalanobis%20Distance.md) (MD):

$\|r\|_R^2 = r^\top \Sigma^{-1} r$

By using a different metric we haven't transformed our points at all—they still fall in the same elongated, elliptical point cloud. We have just changed the *ruler* we use to measure distances. With this new ruler, points at an equal risk distance from the origin satisfy:

$r^\top \Sigma^{-1} r = c$

Geometrically, this is an *ellipse*. This means that the returns are still anisotropic: directions differ in their risk contribution. However, we now have a tool that allows us to measure constant risk contours. This is what MD and $\Sigma^{-1}$ provide us with. We can think of the ellipse as answering the question: "What is the set of return points that are all $c$ units of risk from the origin?" Put another way: this new metric tells us how risky a given return vector is—how far from the origin it is in terms of *risk*.

![center | 400](Pasted%20image%2020250817145253.png)

To be clear, $R$ is still anisotropic. But we have a metric that allows us to account for this anisotropic nature.

###### Summary

| **Aspect** | **What happens** |
| --- | --- |
| **Points** | Stay the same (cloud of observed returns $r$) |
| **Metric** | Changed to $r^\top \Sigma^{-1} r$ (Mahalanobis metric = risk-based distance) |
| **Contours (equidistant risk)** | Ellipses (still anisotropic) |

# Whitening and the $\Sigma^{-1/2}$ Transform

With the $R$-geometry we kept our returns the same, but used a different metric (MD). However, there is another option: transform the points themselves so that they are isotropic—then we can just use plain old Euclidean distance as our metric. We can transform our points $r$ to be isotropic via:

$z = \Sigma^{-1/2} r$

After this transform, the ellipses become circles. We are in an isotropic geometry (with respect to *risk*—we must never forget that): moving $1$ unit in any direction corresponds to the same amount of risk.

![center ](Pasted%20image%2020250817150839.png)

We can see that after whitening, any direction we pick will have the same portfolio variance! This is literally what is meant when we say we have made the space isotropic with respect to risk.

![center](Pasted%20image%2020250817151908.png)

Note that after whitening we are no longer in the canonical asset basis. We are still in $\mathbb{A}$, but the basis vectors of the whitened coordinates are no longer individual assets—they are *linear combinations of assets*. We could say that whitening is actually a *change of basis*, and that once in the whitened basis, *risk* is isotropic.

###### Summary

| **Aspect** | **What Happens** |
| --- | --- |
| **Points** | Changed: returns are transformed, $z = \Sigma^{-1/2} r$ |
| **Metric** | Back to Euclidean ($z_1^\top z_2$, $\|z\|^2 = z^\top z$) |
| **Contours (equidistant risk)** | Circles (isotropy achieved; 1 Euclidean unit = 1 risk unit) |
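A minimal sketch of the whitening step, computing $\Sigma^{-1/2}$ via an eigendecomposition (one common construction; Cholesky-based whitening is another valid choice), and checking the isotropy claim from the table above:

```python
import numpy as np

rng = np.random.default_rng(0)

Sigma = np.array([[0.04, 0.012],
                  [0.012, 0.01]])
returns = rng.multivariate_normal([0.0, 0.0], Sigma, size=100_000)

# Sigma^{-1/2} from the eigendecomposition of the (symmetric PD) covariance.
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = vecs @ np.diag(vals**-0.5) @ vecs.T

# Whiten the point cloud: z = Sigma^{-1/2} r for every observation.
z = returns @ Sigma_inv_sqrt.T

print(np.cov(z.T).round(2))  # ~identity: the elliptical cloud is now a circle

# Isotropy: every unit direction now carries the same risk.
for theta in (0.0, np.pi / 4, np.pi / 2):
    d = np.array([np.cos(theta), np.sin(theta)])
    print(round(np.var(z @ d), 2))  # ~1.0 in every direction
```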
# The Geometry of $w$ and the $\Sigma$ Transform

We have been focusing solely on $R$. But let's turn our focus to $w$. This is what the optimizer actually has control over—$w$ is the decision variable. What does the geometry of $w$ look like? We quickly run into the same problem that we did with $R$: $w$s pointing in different directions correspond to different amounts of *risk*. If we measure $\|w\|$ with Euclidean distance, then two portfolios of equal Euclidean length would be equally far from the origin. But in risk terms, that is nonsense! Two equally long $w$ vectors could have very different portfolio variances, depending on the direction they point. This is exactly what we saw in the earlier projection plot with $w_1, w_2, w_3$. So again, Euclidean distance doesn't line up with risk. And again, we can pick a different metric. But which metric should we pick?

It is actually enlightening to think this through from first principles (it is not as daunting as it may seem at first). Metrics are distance functions:

$d(x,y) = \sqrt{(x-y)^\top M (x-y)}$

where the matrix $M$ is referred to as the Gram matrix. In our case we are dealing with $w$s, so for simplicity we can write:

$w^\top M w$

This means we really just need to pick $M$. How do we want to pick it? Remember, our whole goal is to construct a geometry where we have a way to talk about equal-risk $w$s. And risk is just portfolio variance, which we derived earlier to be $w^\top \Sigma w$. And there is our Gram matrix $M$—it is $\Sigma$. By using $\Sigma$, our metric is literally portfolio variance. Using this metric on $w$ ensures that we are measuring the distance between two $w$s in terms of portfolio variance. We can formally write our metric as:

$\|w\|_W^2 = w^\top \Sigma w$

Now, if we look at contours of a constant value $c$, they are *ellipses* of equal risk:

$E_c = \left\{\, w \in \mathbb{A} \;|\; w^\top \Sigma w = c \,\right\}$

In other words: all weight vectors that fall on a specific contour have the same risk.

![center | 350](Pasted%20image%2020250817154835.png)

This new metric answers the question: "What is the set of portfolios, $w$, that are all $c$ units of risk from the origin?" It tells us how risky a specific $w$ is.

###### Summary

| **Aspect** | **What happens** |
| --- | --- |
| **Points** | Stay the same (portfolio weights $w$ in asset space) |
| **Metric** | Changed to $w^\top \Sigma w$ (portfolio variance = risk) |
| **Contours (equidistant risk)** | Ellipses (still anisotropic) |

# Connection between $R$-geometry and $w$-geometry

Both the $R$-geometry ($\Sigma^{-1}$) and the $w$-geometry ($\Sigma$) describe the same underlying variance structure, but seen from dual perspectives (returns vs. portfolios). Put another way, they are different coordinate systems for representing the same risk measure.

In $R$-geometry, $r^\top \Sigma^{-1} r = 1$ defines the set of returns $r$ that are $1$ unit of risk away from the origin. In $w$-geometry, $w^\top \Sigma w = 1$ defines the set of weights $w$ that have $1$ unit of risk. These two conditions describe the same scalar variance measure, once we recognize that $r$ and $w$ are linked linearly:

$r = \Sigma w \quad\Longleftrightarrow\quad w = \Sigma^{-1} r$

This is the mapping that takes the risk ellipse in return space into the risk ellipse in weight space (and vice versa). Visually we can see this below. The intuition is straightforward. We start with a specific set of returns—those that fall at $1$ unit of risk. These returns define the dark purple ellipse, $E^R_1$. We then ask: what portfolios $w$ lead to that set of returns? To answer this we just map $E^R_1$ via $\Sigma^{-1}$, getting the dark pink ellipse on the left, $E^w_1$.

![](Pasted%20image%2020250818071527.png)

The key idea is that both geometries are just different representations of the same scalar quantity: **portfolio variance.**

# Risk Adjusted Space

All three geometries we've developed ($R$-geometry, whitening, and $w$-geometry) are different **risk adjusted coordinate systems**. Each is anisotropic or isotropic in its own way, but each is simply a different lens on the same underlying variance (risk) condition.

![](Pasted%20image%2020250818065609.png)
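A quick numerical check that all three lenses report the same scalar (same made-up $\Sigma$ as before): the dual mapping $r = \Sigma w$ and the whitening map $z = \Sigma^{-1/2} r$ leave portfolio variance unchanged:

```python
import numpy as np

Sigma = np.array([[0.04, 0.012],
                  [0.012, 0.01]])
Sigma_inv = np.linalg.inv(Sigma)

vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = vecs @ np.diag(vals**-0.5) @ vecs.T

w = np.array([0.7, 0.3])
r = Sigma @ w            # the return vector dual to w
z = Sigma_inv_sqrt @ r   # the same vector, whitened

print(w @ Sigma @ w)      # w-geometry:  w^T Sigma w
print(r @ Sigma_inv @ r)  # R-geometry:  r^T Sigma^{-1} r
print(z @ z)              # whitened:    plain Euclidean ||z||^2
# All three print the same scalar: portfolio variance.
```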
# Mean Variance Optimization (MVO)

We now have the conceptual tools to tackle MVO.

###### Mean Optimization (MO)

To start, we can ask what the optimizer would do if we ignored risk. In this case the optimizer just maximizes expected return:

$\max_w \; w^\top \mu$

The solution is trivial (once we constrain the length of $w$, since the objective is otherwise unbounded): point $w$ in the direction of $\mu$:

$w \parallel \mu$

###### MVO: $\Sigma^{-1}$ View

Now we bring risk back in. With MVO, the optimizer maximizes *risk adjusted* return: how much mean return you get per unit of risk. To account for risk adjustment, we must use the risk metrics we defined earlier. We want to maximize expected return per unit risk. Thinking in terms of $w$-geometry, risk for a weight vector is measured by the quadratic form:

$\|w\|_{\Sigma}^2 = w^\top \Sigma w$

To maximize return per unit risk, solve:

$\max_{w\neq 0}\ \frac{w^\top \mu}{\sqrt{w^\top \Sigma w}}$

And with a bit more (unenlightening) algebra, we can derive:

$w \;\propto\; \Sigma^{-1} \mu$

Here is the link between the $R$-geometry and the $w$-geometry: the optimizer measures $w$ with the $\Sigma$ metric, and the solution it produces is exactly the weight-space dual of the return-space vector $\mu$ under the mapping $w = \Sigma^{-1} r$ from earlier.

We can see this visually below. The big idea is that in MVO $w$ is *not* in the direction of $\mu$—rather, it is in the direction of $\Sigma^{-1} \mu$, which is almost certainly rotated and scaled relative to the original $\mu$. This transformation is literally what we mean when we say "risk adjusted".

![center](Pasted%20image%2020250818073744.png)

Some more intuition: in the raw return space, $\mu$ points in the direction of the highest returns, but that direction might be dominated by a high variance axis. Applying $\Sigma^{-1}$ rescales the space according to risk. This has the effect of:

* Squishing risky directions (so a $\mu$ along a risky direction gets pulled towards the origin—it is down-weighted)
* Stretching safe directions

> MVO picks $w$ in the *risk adjusted direction*.

###### MVO: Whitening View

%%TODO: image of raw return space being mapped to whitened space, watching mu go along with it%%

To account for risk, we replace the naïve mean maximization $\max_w w^\top \mu$ with a **risk-adjusted mean**. In our geometries, this means measuring $w$ not with the Euclidean metric, but with the risk metric $w^\top \Sigma w$. In whitened space the variance constraint is isotropic, so the problem reduces to:

$\max_{w_z} \; w_z^\top (\Sigma^{-1/2}\mu), \quad \text{s.t. } \|w_z\|^2 = 1$

By Cauchy–Schwarz, the optimizer simply aligns $w_z$ with the whitened mean $\Sigma^{-1/2}\mu$. Mapping back to original coordinates ($w = \Sigma^{-1/2} w_z$) gives:

$w \;\propto\; \Sigma^{-1}\mu$

So the effect of risk adjustment is exactly to "rotate and rescale" the mean direction by $\Sigma^{-1}$.

# Mahalanobis Distance

This has a direct connection to [Mahalanobis Distance](Mahalanobis%20Distance.md) (MD). But first, a quick recap of what MD captures. For a point $x$, a mean $\mu$, and a covariance $\Sigma$:

$\text{MD}(x, \mu) \;=\; \sqrt{ (x - \mu)^\top \Sigma^{-1} (x - \mu) }$

This measures how many *risk standard deviations* $x$ is away from $\mu$, when the risk geometry is given by $\Sigma$. We can think about this as the covariance being centered at $\mu$, or as subtracting $\mu$ first so that $\mu$ is effectively the origin:

![center](Pasted%20image%2020250818080810.png)

MD can be thought of as *whitening* (transforming) the space, so that the constant-risk contour is a circle instead of an ellipse.
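A short sketch (same made-up numbers as before, with a hypothetical observation `x`) verifying both results numerically: the Sharpe-optimal direction is $\Sigma^{-1}\mu$, and MD computed as a quadratic form in $\Sigma^{-1}$ equals plain Euclidean distance after $\Sigma^{-1/2}$ whitening. The image below then shows the whitened picture.

```python
import numpy as np

rng = np.random.default_rng(0)

Sigma = np.array([[0.04, 0.012],
                  [0.012, 0.01]])
mu = np.array([0.05, 0.03])

# --- MVO picks the risk-adjusted direction w* = Sigma^{-1} mu ---
def sharpe(w):
    """Return per unit risk: w^T mu / sqrt(w^T Sigma w)."""
    return (w @ mu) / np.sqrt(w @ Sigma @ w)

w_star = np.linalg.solve(Sigma, mu)  # Sigma^{-1} mu
best_random = max(sharpe(rng.standard_normal(2)) for _ in range(10_000))
print(sharpe(w_star) >= best_random)  # True: no direction beats w*
print(sharpe(w_star) > sharpe(mu))    # True: mu itself is suboptimal

# --- MD: quadratic form in Sigma^{-1} == Euclidean norm after whitening ---
x = np.array([0.10, 0.01])  # a hypothetical return observation
d = x - mu
md_quadratic = np.sqrt(d @ np.linalg.solve(Sigma, d))

vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = vecs @ np.diag(vals**-0.5) @ vecs.T
md_whitened = np.linalg.norm(Sigma_inv_sqrt @ d)
print(np.isclose(md_quadratic, md_whitened))  # True: one distance, two views
```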
![center | 300](Pasted%20image%2020250818080944.png)

# Connection between MVO and MD: $\Sigma^{-1}$ Geometry

###### Mahalanobis Distance

Given two points $\mu$ and $x$ in return space, we want their distance measured in **risk geometry**. This geometry is encoded via $\Sigma$. Whitening with $\Sigma^{-1/2}$ turns the $1\sigma$ ellipse into a unit circle. Euclidean distance in this whitened space is risk-adjusted distance. Directions of high variance are *compressed*—therefore they count less towards the distance in this space. Below we can see that after applying MD, $x_2$ is now closer to $\mu$ and $x_1$ is further away.

![center](Pasted%20image%2020250818081854.png)

###### The Connection

> Both MD and MVO start with a vector in return space and *re-express it in the risk geometry* defined by $\Sigma$. Whitening removes the stretching and skewing of that geometry:
> - In MD, the vector is a _difference_ $(x - \mu)$ — we're measuring separation in risk-adjusted units.
> - In MVO, the vector is $\mu$ itself — we're finding the optimal direction for $w$ after accounting for risk.
>
> In both cases, distances in high variance directions are given less "weight".

| Category | Mahalanobis Distance (MD) | Mean-Variance Optimization (MVO) |
| --- | --- | --- |
| **Goal** | Measure how far $x$ is from $\mu$ in the data's geometry (the geometry of $\Sigma$) | Choose portfolio weights $w$ that balance expected return $\mu$ and risk $\Sigma$ |
| **Whitening Step** | Apply $\Sigma^{-1/2}$ to $(x - \mu)$, which rescales each direction so 1 unit means one std of risk | Apply $\Sigma^{-1/2}$ to $\mu$, producing a *risk-adjusted mean vector* in a whitened space |
| **Effect** | In high-variance directions, raw distances are *compressed*—differences there contribute less to the final distance | Components of $\mu$ in high-variance directions are *shrunk* and rotated towards lower variance directions |
| **Interpretation** | *Discounts* differences along directions where the data naturally varies a lot, and *emphasizes/amplifies* differences along less varying directions | Whitening *discounts* return potential from risky directions, and *emphasizes* return potential from stable directions |

# Using MD to Cross Validate $\mu$ and Improve MVO

We know that MVO is only as good as the inputs we provide it: $\mu$ and $\Sigma$. But say we are making updates and improvements to $\mu$—how can we measure whether we are actually improving the quality of $\mu$? In other words, how can we *cross validate* $\mu$? Sure, we could just run it through the optimization process, but there are two drawbacks to that:

1. This is very susceptible to backtest hacking.
2. If we find that $\mu_2$ is *not* an improvement over $\mu_1$, we'll have no idea why. And therefore we'll have no clear way to iterate forward and improve $\mu_2$.

Using the backtest as a means of improving $\mu$ is a dead end. So the cross validation of $\mu$ is important. But how can we achieve it? What would good cross validation functions look like? They will serve two purposes:
1. **Correlate with financial performance (FP)**: If we show that $\mu_2$ is an improvement over $\mu_1$, this should mean that $\text{FP}_2$ is an improvement over $\text{FP}_1$.
2. **Help identify areas of $\mu$ to improve**: Some points of $\mu$ matter more than others. Some elements of $\mu$ aren't that useful to improve: those that the optimizer will never trade due to the associated risk, or their middle-of-the-pack value, or our balance constraint, and so on. But some elements are useful to improve: those that have low risk, those that really help us in terms of our balance constraint, and so on.

Moving forward, these are our *goals* for cross validation functions.

# Mahalanobis Distance as a CV Function

###### Intuition

One candidate cross validation function is MD. Let's start trying to build some intuition around why it would be useful. Consider $\hat{\mu}$ (prediction) and $\mu$ (true). Suppose they differ mainly along a low or a high variance direction. When the optimizer performs the whitening step via $\Sigma^{-1/2}$, what will happen in each case?

- **Low-variance direction difference**: Whitening barely shrinks it and MD is large. MD flags this as _bad_, because the optimizer is more likely to put weight on that direction (low risk, so high risk-adjusted return). If your prediction is wrong there, it will meaningfully mislead the allocation.
- **High-variance direction difference**: Whitening shrinks it a lot and MD is smaller. MD treats it as _less harmful_, because the optimizer tends to allocate less along high-variance axes anyway. Even if you're wrong here, the portfolio impact is muted.

So MD isn't just a generic "distance" — in this MVO setting, it is _aligned_ with the economic consequence of our forecast errors.

I have yet to address which $\Sigma$ we will be using with MD. Considering we don't ever have access to the true $\Sigma$, we'll be using $\hat{\Sigma}$, our prediction. But that is actually the *correct choice* in this case. We are trying to measure errors that will *matter*—in the sense that the optimizer *will act on them*. And the optimizer is going to act based on what we pass to it, which in this case is $\hat{\Sigma}$. Remember, the optimizer will (implicitly) whiten return space via $\hat{\Sigma}$ and set $w \propto \hat{\Sigma}^{-1}\hat{\mu}$. This means the portfolio will be in the directions that $\hat{\Sigma}$ deems low-variance (low risk). MD measures forecast error in the *same geometry* that the optimizer uses.

| Direction (variance) | Error size | MD magnitude | Impact on optimizer |
| --- | --- | --- | --- |
| *Low-variance directions* | Big | Large MD | Harmful — optimizer tends to allocate weight there |
| *High-variance directions* | Big | Smaller MD | Less harmful — optimizer downweights these directions |

> **Takeaway:** Mahalanobis distance is not just a statistical gadget — in the MVO context, it's a _financially meaningful diagnostic_, because it evaluates forecast errors in exactly the same geometry that drives the optimizer's portfolio choice.

One point worth making explicit, since both $\Sigma^{-1}$ and $\Sigma^{-1/2}$ have appeared throughout. The equivalence is:

- In **whitened coordinates**: Mahalanobis distance _is just Euclidean_.
- In **original coordinates**: the same thing _appears as_ the quadratic form with $\Sigma^{-1}$.
That's why people flip between saying "it's $\Sigma^{-1}$" and "it's $\Sigma^{-1/2}$": it depends on whether they're thinking in the original space or the whitened one.

* MD gives us a risk adjusted error—error in the exact geometry that our optimizer uses.
* That's why it's a natural cross-validation metric: it lines up the evaluation of your forecasts with the _economic consequences_ the optimizer actually cares about.
* **The optimizer doesn't use $\mu$ directly** — it uses the _direction_ of $\Sigma^{-1}\mu$.
* In unconstrained MVO, the optimizer chooses portfolio weights in the direction of the risk-adjusted mean, $w \propto \Sigma^{-1}\mu$. The proportionality means that only the **direction** of this vector matters; the eventual scaling of $w$ will be set by constraints (like leverage or budget).
* This is why forecasting $\mu$ correctly is really about getting the **direction of $\hat{\mu}$ in whitened space** right. If the direction matches the true $\mu$, the optimizer will pick essentially the right portfolio composition, even if magnitudes are off. But if the angle is wrong, no amount of rescaling can fix it—you'll be allocating in the wrong directions entirely. MVO is, at its core, a direction-finding procedure.

---
Date: 20250816
Links to:
Tags:
References: