# Jacobian vs. Gradient vs. Hessian

### [Gradient](Gradient.md)

### Jacobian

The Jacobian matrix of a vector-valued function is the matrix of its first-order partial derivatives. Let $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$. This function takes a vector $x \in \mathbb{R}^n$ as input and produces the vector $\bf{f}(x) \in \mathbb{R}^m$ as output. The Jacobian matrix is defined to be an $m \times n$ matrix, denoted by $J$, whose $(i,j)$th entry is $J_{ij} = \frac{\partial f_i}{\partial x_j}$. Explicitly, this looks like:

$$
J = \begin{bmatrix} \frac{\partial \bf{f}}{\partial x_1} & \frac{\partial \bf{f}}{\partial x_2} & \dots & \frac{\partial \bf{f}}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \dots & \frac{\partial f_m}{\partial x_n} \\ \end{bmatrix}
$$

#### Key Intuitions

* We look at how every input dimension affects every output dimension. That is why, with $n$ input dimensions and $m$ output dimensions, the Jacobian matrix has shape $m \times n$ (one row per output, one column per input).
* The Jacobian is the *generalization* of the gradient of a scalar function of several variables: when $m = 1$, the Jacobian is a single row whose entries are exactly the components of the gradient.

### Hessian

Suppose that $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is a function taking as input a vector $x \in \mathbb{R}^n$ and outputting a scalar $f(x) \in \mathbb{R}$. If all second partial derivatives of $f$ exist and are continuous, then the Hessian matrix, $H$, of $f$ is a square $n \times n$ matrix, usually defined and arranged as:

$$
H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \dots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_n^2} \\ \end{bmatrix}
$$

The Hessian matrix of a function $f$ is the Jacobian matrix of the gradient of the function $f$:

$$
H(f(x)) = J(\nabla f(x))
$$

#### Key Intuitions

* The Hessian is all about **curvature**.
* The Hessian matrix is a way to package all of the information about the second partial derivatives of a function.
* The Hessian ensures that we take our partial derivatives in all possible orders! For instance, some entries take the partial derivative with respect to $x_1$ twice in a row, but others first take the partial derivative wrt $x_1$ and then $x_2$, and others the other way around: first wrt $x_2$ and then $x_1$.
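As a quick numerical check of the definitions above, here is a minimal sketch (assuming NumPy; the forward-difference helper `jacobian_fd`, the example functions, and the evaluation point are illustrative choices, not part of the original note). It builds the $m \times n$ Jacobian column by column and then recovers the Hessian of a scalar function as the Jacobian of its gradient, i.e. $H(f(x)) = J(\nabla f(x))$:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Forward-difference Jacobian of f: R^n -> R^m at x, as an (m, n) array."""
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x), dtype=float)
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        # Column j holds how every output responds to a small nudge in input x_j.
        J[:, j] = (np.asarray(f(x + step), dtype=float) - f0) / eps
    return J

# A vector-valued map f: R^2 -> R^3, so the Jacobian is 3 x 2.
f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
x0 = np.array([1.0, 2.0])
print(jacobian_fd(f, x0))        # rows = outputs f_1..f_3, columns = inputs x_1, x_2

# Hessian of a scalar g as the Jacobian of its gradient: H(g(x)) = J(grad g(x)).
# Here g(x) = x_1^2 * x_2 + x_2^3, so grad g(x) = (2*x_1*x_2, x_1^2 + 3*x_2^2).
g_grad = lambda x: np.array([2 * x[0] * x[1], x[0] ** 2 + 3 * x[1] ** 2])
print(jacobian_fd(g_grad, x0))   # ~ [[2*x_2, 2*x_1], [2*x_1, 6*x_2]] = [[4, 2], [2, 12]]
```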
### Hessian Evaluation

1. **Point-Specific Evaluation**: The Hessian matrix of a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ at a point $\mathbf{x} = (x_1, x_2, \dots, x_n)$ is constructed from the second-order partial derivatives of $f$ evaluated at that point. Specifically, each element $H_{ij}$ of the Hessian matrix $H$ is given by $H_{ij}(\mathbf{x}) = \frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x})$. This means the Hessian matrix provides a local "snapshot" of the curvature characteristics of $f$ at the point $\mathbf{x}$.
2. **Interpretation of the Hessian**: The significance of the Hessian being point-specific is that it describes the curvature of the function locally around the point $\mathbf{x}$. If, at a given point $\mathbf{x}$, all the eigenvalues of the Hessian are positive, the function exhibits positive curvature in all directions at that point, suggesting a local minimum. If all eigenvalues are negative, it suggests a local maximum, and if there are both positive and negative eigenvalues, the point is a saddle point (see the numerical sketch at the end of this note).
3. **Role in Optimization and Analysis**: In the context of optimization, particularly in machine learning or data science, evaluating the Hessian at specific points (such as at critical points where the gradient is zero) helps in understanding the nature of these points, i.e., whether they are local minima, maxima, or saddle points. This is crucial for algorithms that involve second-order derivatives, like Newton's method, where the goal is often to find the minima of a function (e.g., a loss function).

Therefore, when considering the Hessian and its eigenvalues, it is important to always think in terms of the specific point in the domain of the function where they are being evaluated.

#### References

* [Hessian 3b1b](https://www.youtube.com/watch?v=LbBcuZukCAw)
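To make the eigenvalue test in item 2 concrete, here is a minimal sketch (assuming NumPy; the function $f(x, y) = x^2 - y^2$ and its critical point at the origin are an illustrative choice, not part of the original note):

```python
import numpy as np

# Classify the critical point of f(x, y) = x^2 - y^2 at the origin,
# where the gradient vanishes and the Hessian happens to be constant.
H = np.array([[2.0, 0.0],      # [[d2f/dx2,   d2f/dxdy],
              [0.0, -2.0]])    #  [d2f/dydx,  d2f/dy2 ]]

eigvals = np.linalg.eigvalsh(H)   # symmetric Hessian -> real eigenvalues
if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("saddle point")          # fires here: eigenvalues are -2 and 2
else:
    print("test inconclusive (some eigenvalue is zero)")
```

Because the second partials are continuous, the Hessian is symmetric, so `np.linalg.eigvalsh` is the natural routine: it is specialized to symmetric matrices and returns real eigenvalues.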