Lesson 1 - A very fast paced introduction to the foundations

How to view probability from a measure theoretic perspective.

Optional reading for this lesson

Slides

Video (soon)

In this lecture we give a very fast-paced introduction to a perspective on probability that some of you may have encountered and some of you may not have: measure theory. We will also dive into stochastic calculus and stochastic differential equations. This is a very dense lecture and we will not be able to cover everything in detail. The goal is to give you a feeling for the concepts and a starting point for further reading.

1. Measure Theory and Probability

This first section is mostly taken from Francisco’s MPhil thesis available here.

For many of the technical derivations in the methodology we will study, a basic notion of measure and integration is required; we will therefore refresh concepts such as $\sigma$-algebra, probability measures, measurable functions and the Lebesgue–Stieltjes integral, without going into unnecessary technical detail.

1.1 Probability Spaces

Definition: A probability space is a triple $(\Omega, \mathcal{F}, \mathbb{P})$, where:

  • $\Omega$ is the sample space, i.e. the set of possible outcomes. For example, for a coin toss $\Omega=\{\text{Heads}, \text{Tails}\}$.
  • The $\sigma$-algebra $\mathcal{F} \subseteq 2^{\Omega}$ represents the set of events we may want to consider. Continuing the coin toss example, we may have $\mathcal{F}=\{\emptyset, \{\text{Heads}\}, \{\text{Tails}\}, \{\text{Heads}, \text{Tails}\}\}$.
  • A probability measure $\mathbb{P}:\mathcal{F} \rightarrow [0,1]$ is a function which assigns a number in $[0,1]$ to any event (set) in the $\sigma$-algebra $\mathcal{F}$. The function $\mathbb{P}$ has the following requirements:
    1. $\sigma$-additivity: if $\bigcup_{i=0}^\infty A_i \in \mathcal{F}$ and $A_i \cap A_j = \emptyset$ for $i \neq j$, then $\mathbb{P}\left(\bigcup_{i=0}^\infty A_i\right) = \sum_{i=0}^\infty \mathbb{P}(A_i)$.
    2. $\mathbb{P}(\Omega)=1$, which we can read as the sample space integrating to 1. Without this condition $\mathbb{P}$ would be a regular measure.

Note that we want to be able to measure (assign a probability to) each set in $\mathcal{F}$ rather than $2^\Omega$, since the latter is not always possible. In other words, one can construct sets that cannot be measured (or can only be measured by the trivial measure $\lambda(A) = 0\ \forall A \subseteq \Omega$, which is not a probability measure). Thus, we require $\mathcal{F}$ to only contain measurable sets, and this is what a $\sigma$-algebra guarantees.

Definition: A $\sigma$-algebra $\mathcal{F} \subseteq 2^{\Omega}$ is a collection of sets satisfying the following properties:

  • $\mathcal{F}$ contains $\Omega$: $\Omega \in \mathcal{F}$.
  • $\mathcal{F}$ is closed under complements: if $A\in\mathcal{F}$, then $\Omega \setminus A \in \mathcal{F}$.
  • $\mathcal{F}$ is closed under countable unions: if $A_i \in \mathcal{F}\ \forall i$, then $\bigcup_i A_i \in \mathcal{F}$.
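
As a quick sanity check of these axioms, here is a minimal sketch in Python (the file name and set representations are just illustrative choices) that encodes the coin-toss example and verifies the $\sigma$-algebra properties and $\sigma$-additivity:

# file: "sigma_algebra_check.py" (illustrative sketch)
omega = frozenset({"Heads", "Tails"})
F = {frozenset(), frozenset({"Heads"}), frozenset({"Tails"}), omega}

# F contains the sample space
assert omega in F
# F is closed under complements
assert all(omega - A in F for A in F)
# F is closed under unions (countable unions reduce to finite ones here)
assert all(A | B in F for A in F for B in F)

# a probability measure on F: P(omega) = 1 and additive on disjoint events
P = {frozenset(): 0.0, frozenset({"Heads"}): 0.5, frozenset({"Tails"}): 0.5, omega: 1.0}
assert P[omega] == 1.0
heads, tails = frozenset({"Heads"}), frozenset({"Tails"})
assert P[heads | tails] == P[heads] + P[tails]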

Note we use the notation $\mathcal{B}(\mathbb{R}^d)$ for the Borel $\sigma$-algebra of $\mathbb{R}^d$, which we can think of as the canonical $\sigma$-algebra for $\mathbb{R}^d$: it is the smallest $\sigma$-algebra containing all open sets of $\mathbb{R}^d$. This notation can also be extended to subsets of $\mathbb{R}^d$ and more generally to any topological space.

1.2 Random Variables

Definition: For a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a real-valued random variable (vector) $\mathbf{x}(\omega)$ is a function $\mathbf{x}:\Omega \rightarrow \mathbb{R}^d$ which is measurable, meaning that the pre-image of every Borel set under $\mathbf{x}$ lies within the $\sigma$-algebra $\mathcal{F}$:

$$\mathbf{x}^{-1}(B)= \{\omega : \mathbf{x}(\omega) \in B \} \in \mathcal{F}, \quad \forall B \in \mathcal{B}(\mathbb{R}^d)$$

The above formalism of the random variable allows us to assign a numerical representation to outcomes in $\Omega$. The clear advantage is that we can now ask questions such as: what is the probability that $\mathbf{x}$ is contained within a set $B \subseteq \mathbb{R}^d$?
$$\text{P}(\mathbf{x}(\omega) \in B) = \mathbb{P}(\{\omega : \mathbf{x}(\omega) \in B \})$$

and if we consider the more familiar 1D example, we recover the cumulative distribution function (CDF):
$$\text{P}(x(\omega) \leq r) = \mathbb{P}(\{\omega : x(\omega) \leq r \})$$

The random-variable formalism provides a clearer connection between the probability measure $\mathbb{P}$ and the familiar CDF. For simplicity, we may sometimes drop the argument $\omega$ from the random variable notation (e.g. $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbb{I})$).

1.3 Lebesgue–Stieltjes Integral

Here we offer a pragmatic introduction to the Lebesgue–Stieltjes integral, in the context of a probability space.

For a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a measurable function $f: \Omega \rightarrow \mathbb{R}$ and $A \in \mathcal{F}$, the Lebesgue–Stieltjes integral $\int_{A} f(\mathbf{x})\, d\mathbb{P}(\mathbf{x})$ is a Lebesgue integral with respect to the probability measure $\mathbb{P}$.

Whilst we have not yet introduced a precise definition of the Lebesgue integral, we will now illustrate some of its properties to get a grasp of this seemingly new notation. Expectations in our probability space can be written as
$$\mathbb{E}_{\mathbb{P}}[f(\mathbf{x})] = \int_{\Omega} f(\mathbf{x})\, d\mathbb{P}(\mathbf{x})$$

Let $\bm{1}(\mathbf{x} \in A)$ be the indicator function for set $A$; then
$$\mathbb{E}_{\mathbb{P}}[\bm{1}(\mathbf{x} \in A)] = \int_{\Omega} \bm{1}(\mathbf{x} \in A)\, d\mathbb{P}(\mathbf{x}) = \int_{A} d\mathbb{P}(\mathbf{x}) = \mathbb{P}(A)$$

The above result is a useful example, since it shows how a distribution (probability measure) is recovered from the integral; with $A = (-\infty, r]$ this is exactly the cumulative distribution function. When our distribution $\mathbb{P}$ admits a probability density function (PDF) $p(\mathbf{x})$, we have the following:
$$\int_{\Omega} f(\mathbf{x})\, d\mathbb{P}(\mathbf{x}) = \int_{\Omega} f(\mathbf{x})p(\mathbf{x})\, d\lambda(\mathbf{x}),$$
where $\lambda$ is the Lebesgue measure, which we can think of as the characteristic measure for $\mathbb{R}^d$. For many purposes, we can interpret $\int_{\Omega} f(\mathbf{x})p(\mathbf{x})\, d\lambda(\mathbf{x})$ as the regular Riemann integral, and many authors \citep{williams2006gaussian} write $\int_{\Omega} f(\mathbf{x})p(\mathbf{x})\, d\mathbf{x}$ when the integral is with respect to the Lebesgue measure.
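
To make this concrete, here is a small numerical sketch (assuming $\mathbb{P} = \mathcal{N}(0,1)$; all parameters are illustrative): we estimate $\mathbb{E}_{\mathbb{P}}[\bm{1}(x \in A)]$ by Monte Carlo and compare it with $\mathbb{P}(A)$ obtained from the Gaussian CDF.

# file: "indicator_expectation.py" (illustrative sketch)
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)          # samples from P = N(0, 1)

a, b = -1.0, 0.5                            # the event A = [a, b]
mc_estimate = np.mean((a <= x) & (x <= b))  # E_P[1(x in A)]
exact = norm.cdf(b) - norm.cdf(a)           # P(A) from the CDF
print(mc_estimate, exact)                   # the two numbers should be close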

An important takeaway is that, whilst connecting the Lebesgue integral to the standard Riemann integral gives us a useful conceptual connection, it is not always something that can be done. As we will soon see, many random processes do not admit a PDF. In order to compute expectations with respect to these processes, we must adopt the Lebesgue–Stieltjes integral, which is well-defined in these settings, while the standard Riemann integral is not.

2. Stochastic Processes and Martingales

Now let's talk about stochastic processes, where we think about dynamic random variables that evolve over time. Let $T$ be a set of times.

Definition: A stochastic process $\{X_t\}_{t \in T}$ is a collection of random variables $X_t$ defined on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$.

We can think of a stochastic process as a function $X: T \times \Omega \rightarrow \mathbb{R}$ that maps a time $t$ and a sample point $\omega$ to a real number $X_t(\omega)$.

Definition: A filtration $\{\mathcal{F}_t\}$ is a family of sub-$\sigma$-algebras of $\mathcal{F}$ such that $\mathcal{F}_s \subseteq \mathcal{F}_t$ for $0 \leq s \leq t$.

We see that the sub-$\sigma$-algebras $\mathcal{F}_t$ are nested, i.e. $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \ldots$. They grow over time, meaning that the information we have about the process increases over time. This is a natural property of stochastic processes, since we observe the process as time passes and therefore only gain more information about it.

Definition: A stochastic process $\{X_t\}_{t \in T}$ is adapted to a filtration $\{\mathcal{F}_t\}$ if $X_t$ is $\mathcal{F}_t$-measurable for all $t \in T$.

This means that the random variable $X_t$ is measurable with respect to the $\sigma$-algebra $\mathcal{F}_t$, i.e. we can determine the value of $X_t$ at time $t$ from the information in $\mathcal{F}_t$. Adapted processes are also called non-anticipating processes, since we cannot anticipate the value of $X_t$ by observing the $\sigma$-algebra $\mathcal{F}_s$ for $s < t$. In other words, we cannot see into the future; $X_t$ is only known at time $t$.

A further key idea here is that the $\sigma$-algebra $\mathcal{F}_t$ contains all the information we have about the process up to time $t$, i.e. it describes not only the value of $X_t$ at time $t$, but also everything before it.

From these definitions, we can now define the concept of a martingale. Intuitively, a martingale is a stochastic process whose expected future value, given all information up to now, equals its current value. Let's see how we can define this formally.

Definition: A stochastic process $\{X_t\}_{t \in T}$ is a martingale with respect to a filtration $\{\mathcal{F}_t\}$ if:

  1. $X_t$ is integrable for all $t \in T$
  2. $X_t$ is adapted to $\{\mathcal{F}_t\}$ for all $t \in T$
  3. $\mathbb{E}[X_t \mid \mathcal{F}_s] = X_s$ for all $0 \leq s \leq t$.

The last statement is equivalent to $\mathbb{E}[X_t - X_s \mid \mathcal{F}_s] = 0$, since $\mathbb{E}[X_s \mid \mathcal{F}_s] = X_s$ ($X_s$ is $\mathcal{F}_s$-measurable) and we can therefore pull $-X_s$ into the conditional expectation. In other words, the expected change in the process from time $s$ to time $t$ is zero, and in particular the expected value of $X_t$ is constant over time. This is the key property of a martingale.
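
We can illustrate this numerically with a symmetric random walk, which is a discrete-time martingale (a minimal sketch; path counts and times are arbitrary choices): the mean increment between two times should be approximately zero and uncorrelated with the earlier value.

# file: "random_walk_martingale.py" (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
M, s, t = 200_000, 50, 100                # number of paths, earlier time, later time
steps = rng.choice([-1, 1], size=(M, t))  # +/-1 increments
X = steps.cumsum(axis=1)                  # random walk paths

# E[X_t - X_s | F_s] = 0: the mean increment is ~0 and
# uncorrelated with the value of the path at time s
increments = X[:, t - 1] - X[:, s - 1]
print(increments.mean())                           # ~0
print(np.corrcoef(X[:, s - 1], increments)[0, 1])  # ~0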

3. Brownian Motion

Before we finally come to stochastic differential equations, let's talk about Brownian motion. Brownian motion is a stochastic process that is a martingale and is central to the Ito integration theory we will discuss at length later, so it is worth dwelling on it a bit.

Definition: A stochastic process $B_t$ is a Brownian motion if:

  1. $B_0 = 0$ (the process starts at $0$)
  2. $B_t$ is almost surely continuous
  3. $B_t$ has independent increments ($B_t - B_s$ is independent of $B_u$ for all $u \leq s$)
  4. $B_t - B_s \sim \mathcal{N}(0, t-s)$ (for $0 \leq s \leq t$)

From this definition, we can derive some properties of Brownian motion:

  1. $\mathbb{E}[B_t] = 0$ (since $B_t$ follows a centered normal distribution)
  2. $\mathbb{E}[B_t^2] = t = \text{var}(B_t)$. This is a key property of Brownian motion and is closely related to its quadratic variation, which also equals $t$. It is the reason why we cannot ignore the second term in the Taylor expansion of Ito's lemma, as we will later see. We can show this by noting that $\text{var}(B_t) = \mathbb{E}[B_t^2] - \mathbb{E}[B_t]^2 = \mathbb{E}[B_t^2]$ since $\mathbb{E}[B_t] = 0$.
  3. $B_t$ is a martingale. The first two martingale properties are easy to show: $B_t$ is integrable for all $t$ since it follows a normal distribution, and it is $\mathcal{F}_t$-adapted. For the third martingale property, we make use of a common trick when working with Brownian motion: we make an increment appear inside an expectation and then use that its mean is $0$ by the definition of Brownian motion. We can write:
$$\mathbb{E}[B_t \mid \mathcal{F}_s] = \mathbb{E}[B_t - B_s + B_s \mid \mathcal{F}_s] = \mathbb{E}[B_t - B_s] + B_s = B_s$$
where the middle step uses that the increment $B_t - B_s$ is independent of $\mathcal{F}_s$.

In the following, whenever we talk about Brownian motion we either use BtB_t or WtW_t as notation.
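
We can also check the quadratic variation property numerically (a sketch; the horizon $t = 2$ and the partition sizes are arbitrary): as the partition of $[0, t]$ gets finer, the sum of squared increments of a Brownian path converges to $t$.

# file: "quadratic_variation_check.py" (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
t = 2.0
for n in [100, 10_000, 1_000_000]:
    dt = t / n
    dB = rng.normal(0, np.sqrt(dt), size=n)  # Brownian increments
    print(n, np.sum(dB**2))                  # approaches t = 2.0 as n grows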

4. ODEs

Now that we have a measure theoretic perspective on probability and some feeling for what stochastic processes are, we can start to think about how to define stochastic differential equations.

To do this, let's first remember what an ordinary differential equation is. An ordinary differential equation is a differential equation that contains one or more functions of one independent variable and the derivatives of those functions. The term ordinary is used in contrast with the term partial differential equation, which may involve more than one independent variable. An example ODE is the following:
$$\frac{d x}{d t} = -\alpha x$$

This is a first-order ODE, because it contains only the first derivative of $x$. It is also a linear ODE, because it is linear in $x$. And it is a time-invariant ODE, because its coefficients do not depend on $t$.
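
This ODE has the closed-form solution $x(t) = x_0 e^{-\alpha t}$, which we can use to sanity-check a forward-Euler integration (a minimal sketch; $\alpha$, $x_0$ and the step count are arbitrary choices):

# file: "linear_ode.py" (illustrative sketch)
import numpy as np

alpha, x0, t_end, n = 0.5, 1.0, 10.0, 1000
dt = t_end / n

x = x0
for _ in range(n):           # forward Euler: x <- x + dt * (-alpha * x)
    x += dt * (-alpha * x)

print(x, x0 * np.exp(-alpha * t_end))  # numerical vs exact solution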

In general, we often write linear differential equations like this:
$$\frac{d x}{d t} = F(t) x(t) + L(t) w(t)$$

where $F(t)$ is a function of time, $x(t)$ is the state of the system at time $t$, $L(t)$ is a function of time and $w(t)$ is some forcing or driving function; in applications, this could for example be an input to a system.

A useful special case of this are linear time-invariant differential equations:
$$\frac{d x}{d t} = F x(t) + L w(t)$$

where $F$ and $L$ are constant. We can solve such equations via methods like separation of variables or, for the case of inhomogeneous equations, via the integrating factor method.

For general linear differential equations, we can solve them via methods like Fourier or Laplace transforms or via numerical methods. Linear differential equations are useful to consider since they arise in many applications and can be solved analytically. They will also be useful for us when we think about stochastic differential equations.

5. SDEs

A pragmatic way to think about stochastic differential equations is to think of them as ODEs with some noise added. For example, let's think about a car in 2 dimensions that is governed by the following ODE resulting from Newton's second law:
$$\mathbf{f}(t) = m \mathbf{a}(t) = m \frac{d^2 \mathbf{x}(t)}{d t^2}$$

where $\mathbf{f}(t)$ is the force acting on the car, $m$ is the mass of the car, $\mathbf{a}(t)$ is the acceleration of the car and $\mathbf{x}(t)$ is the position of the car.

Let's say that our acceleration is now not precisely known, or that we have a measurement error in our acceleration. We can model this by adding some noise to our acceleration and describe it by a random white-noise process $\mathbf{w}(t) = (w_1(t), w_2(t))$:
$$\frac{d^2 x_1(t)}{d t^2} = w_1(t), \qquad \frac{d^2 x_2(t)}{d t^2} = w_2(t)$$

Now if we define $x_3 = \frac{d x_1(t)}{d t}$ and $x_4 = \frac{d x_2(t)}{d t}$ we can write this as a system of first-order ODEs in our old familiar form:
$$\frac{d}{d t} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$$

where the first matrix is $\mathbf{F}$ and the second matrix is $\mathbf{L}$. This is a linear time-invariant differential equation where the forcing function is not deterministic but a stochastic process: a stochastic differential equation.

Let's try to solve such an equation heuristically with initial condition $\mathbf{x}(t_0) \sim \mathcal{N}(\mathbf{m}_0, \mathbf{P}_0)$. If we pretend that $\mathbf{w}(t)$ is deterministic and continuous (which it is not), we can solve this equation like we would solve any other ODE. We can write the solution via the method of integrating factors as:
$$\begin{aligned} \frac{d \mathbf{x}}{d t} &= \mathbf{F} \mathbf{x}(t) + \mathbf{L} \mathbf{w}(t) \\ \frac{d \mathbf{x}}{d t} - \mathbf{F} \mathbf{x}(t) &= \mathbf{L} \mathbf{w}(t) \\ \exp(-\mathbf{F} t) \frac{d \mathbf{x}}{d t} - \exp(-\mathbf{F} t) \mathbf{F} \mathbf{x}(t) &= \exp(-\mathbf{F} t) \mathbf{L} \mathbf{w}(t) \\ \frac{d}{d t} \left[\exp(-\mathbf{F} t) \mathbf{x}(t)\right] &= \exp(-\mathbf{F} t) \mathbf{L} \mathbf{w}(t) \\ \exp(-\mathbf{F} t) \mathbf{x}(t) - \exp(-\mathbf{F} t_0) \mathbf{x}(t_0) &= \int_{t_0}^t \exp(-\mathbf{F} s) \mathbf{L} \mathbf{w}(s)\, ds \\ \mathbf{x}(t) &= \exp(\mathbf{F}(t-t_0)) \mathbf{x}(t_0) + \int_{t_0}^t \exp(\mathbf{F}(t-s)) \mathbf{L} \mathbf{w}(s)\, ds \end{aligned}$$

where $t_0$ is the initial time and $\exp(\cdot)$ is the matrix exponential. Let us continue to pretend that $\mathbf{w}(t)$ is deterministic and continuous. Then taking expectations on both sides yields:
$$\begin{aligned} \mathbb{E}[\mathbf{x}(t)] &= \mathbb{E}[\exp(\mathbf{F}(t-t_0)) \mathbf{x}(t_0)] + \mathbb{E}\left[\int_{t_0}^t \exp(\mathbf{F}(t-s)) \mathbf{L}\mathbf{w}(s)\, ds\right] \\ &= \mathbb{E}[\exp(\mathbf{F}(t-t_0)) \mathbf{x}(t_0)] \\ &= \exp(\mathbf{F}(t-t_0)) \mathbf{m}_0 \end{aligned}$$

where in the first step we used that $\mathbb{E}[\mathbf{w}(s)] = \mathbf{0}$, since this is one of the properties of the white noise process, and in the second step that $\mathbb{E}[\mathbf{x}(t_0)] = \mathbf{m}_0$, since this was our initial condition.
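
We can sanity-check the mean equation on the car model (a sketch; the initial mean is an arbitrary choice, and $\mathbf{F}$ is the matrix from the example above):

# file: "mean_propagation.py" (illustrative sketch)
import numpy as np
from scipy.linalg import expm

F = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
m0 = np.array([0.0, 0.0, 1.0, 0.5])  # mean position (0, 0), mean velocity (1, 0.5)

t = 2.0
print(expm(F * t) @ m0)              # E[x(t)] = exp(F t) m0

# cross-check against Euler integration of dm/dt = F m
m, dt = m0.copy(), 1e-4
for _ in range(int(t / dt)):
    m += dt * (F @ m)
print(m)                             # should closely match the expm result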

We can also get a covariance equation by recalling the Dirac delta correlation property of the white noise process, $\mathbb{E}[\mathbf{w}(t) \mathbf{w}^T(s)] = \delta(t-s) \mathbf{Q}$, with $\mathbf{Q}$ being the spectral density of the white noise process. Then we can write (for $t_0 = 0$ and $\mathbf{m}(t) = \mathbb{E}[\mathbf{x}(t)]$):
$$\begin{aligned} \mathbf{P}(t) &= \mathbb{E}[(\mathbf{x}(t) - \mathbb{E}[\mathbf{x}(t)])(\mathbf{x}(t) - \mathbb{E}[\mathbf{x}(t)])^T] \\ &= \mathbb{E}[(\mathbf{x}(t) - \mathbf{m}(t))(\mathbf{x}(t) - \mathbf{m}(t))^T] \\ &= \mathbb{E}[\mathbf{x}(t)\mathbf{x}(t)^T] - \mathbb{E}[\mathbf{x}(t)]\mathbf{m}(t)^T - \mathbf{m}(t)\mathbb{E}[\mathbf{x}(t)^T] + \mathbf{m}(t)\mathbf{m}(t)^T \\ &= \exp(\mathbf{F}t) \mathbf{P}_0 \exp(\mathbf{F}t)^T + \int_0^t \exp(\mathbf{F}(t-s)) \mathbf{L} \mathbf{Q} \mathbf{L}^T \exp(\mathbf{F}(t-s))^T\, ds \end{aligned}$$

where in the last step we used $\mathbb{E}[\mathbf{x}(t)] = \exp(\mathbf{F}t)\mathbf{m}_0$ and $\mathbb{E}[\mathbf{x}(t_0)\mathbf{x}(t_0)^T] - \mathbf{m}_0\mathbf{m}_0^T = \mathbf{P}_0$.

We can now get the differential equations for the mean and covariance by taking the derivative of the mean and covariance equations with respect to $t$ (where we denote $\mathbf{m}(t) = \mathbb{E}[\mathbf{x}(t)]$ and $\mathbf{P}(t) = \mathbb{E}[(\mathbf{x}(t) - \mathbb{E}[\mathbf{x}(t)])(\mathbf{x}(t) - \mathbb{E}[\mathbf{x}(t)])^T]$):
$$\begin{aligned} \frac{d \mathbf{m}(t)}{d t} &= \mathbf{F} \mathbf{m}(t) \\ \frac{d \mathbf{P}(t)}{d t} &= \mathbf{F} \mathbf{P}(t) + \mathbf{P}(t) \mathbf{F}^T + \mathbf{L} \mathbf{Q} \mathbf{L}^T \end{aligned}$$

So far, so good: these equations are actually the correct differential equations for the mean and covariance. So can we treat SDEs just like ODEs? No, because we were just lucky in this case that (1) the SDE here was linear and (2) we did not use any of the calculus rules that are different in the stochastic setting. We cannot stretch this heuristic treatment to nonlinear SDEs and we cannot use it for more complex expressions.
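
To build some trust in these equations, here is a numerical sketch that integrates the covariance ODE with Euler steps and compares it against the closed-form expression from above (the toy 2D system, $\mathbf{Q}$ and the step sizes are arbitrary choices for illustration):

# file: "covariance_ode_check.py" (illustrative sketch)
import numpy as np
from scipy.linalg import expm

F = np.array([[0.0, 1.0], [0.0, 0.0]])  # toy system: position and velocity
L = np.array([[0.0], [1.0]])
Q = np.array([[0.5]])
P0 = np.eye(2)

t, n = 1.0, 1000
dt = t / n

# Euler integration of dP/dt = F P + P F^T + L Q L^T
P = P0.copy()
for _ in range(n):
    P += dt * (F @ P + P @ F.T + L @ Q @ L.T)

# closed form: exp(F t) P0 exp(F t)^T + integral term, via midpoint quadrature
P_exact = expm(F * t) @ P0 @ expm(F * t).T
for s in np.linspace(0, t, n, endpoint=False) + dt / 2:
    E = expm(F * (t - s))
    P_exact += dt * (E @ L @ Q @ L.T @ E.T)

print(np.max(np.abs(P - P_exact)))  # small, shrinks as n grows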

To see how quickly this treatment can go wrong, let's try to get the differential equations for the mean and covariance by directly differentiating the original expressions for the mean and covariance and only afterwards taking the expectation, which should not make a difference. For the mean, we indeed get the same result:
$$\begin{aligned} \mathbb{E}\left[\frac{d\mathbf{x}(t)}{dt}\right] &= \mathbb{E}[\mathbf{F} \mathbf{x}(t) + \mathbf{L} \mathbf{w}(t)] \\ &= \mathbf{F} \mathbb{E}[\mathbf{x}(t)] + \mathbf{L} \mathbb{E}[\mathbf{w}(t)] \\ &= \mathbf{F} \mathbb{E}[\mathbf{x}(t)] \end{aligned}$$

When we try the same thing for the covariance, we get the following:
$$\begin{aligned} \frac{d}{dt}\left[(\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T\right] &= \Big( \frac{d \mathbf{x}}{dt} - \frac{d \mathbf{m}}{dt} \Big) (\mathbf{x}-\mathbf{m})^T + (\mathbf{x}-\mathbf{m}) \Big( \frac{d \mathbf{x}}{dt} - \frac{d \mathbf{m}}{dt} \Big)^T \\ \frac{d}{dt}\mathbb{E}[(\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T] &= \mathbf{F}\mathbb{E}[(\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T] + \mathbb{E}[(\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T] \mathbf{F}^T \\ \frac{d\mathbf{P}(t)}{dt} &= \mathbf{F}\mathbf{P}(t) + \mathbf{P}(t)\mathbf{F}^T \\ &\neq \mathbf{F} \mathbf{P}(t) + \mathbf{P}(t) \mathbf{F}^T + \mathbf{L} \mathbf{Q} \mathbf{L}^T \end{aligned}$$
where in the second line we substituted the derivatives and took the expectation.

We see that the equation is wrong: it is missing the last term. This is because we cannot treat the stochastic process $\mathbf{w}(t)$ as a regular function and therefore cannot blindly apply the ordinary product rule.

6. Ito Integral

This seems easy enough: we can just solve this like we would solve any other ODE, right? Well, not quite. The problem is that the noise process $\mathbf{w}(t)$ is not a regular function, but a random process. This means that we cannot just integrate it like we would integrate a regular function. We need a new way to integrate random processes. This is where the Ito integral comes in.

To see the problem, let’s consider

7. Ito’s Lemma

So far we defined the stochastic integral in the Ito formulation. Now we want to see how we can solve these integrals in order to make progress with a proper treatment of our SDEs, beyond the heuristic treatment we did before. To do this, we will use Ito's lemma, which is a generalization of the chain rule to stochastic processes. It is basically a Taylor expansion adapted to the stochastic world and a key tool in stochastic calculus.

Let's sketch how we get to Ito's lemma by considering the Taylor expansion of a function $f(x)$ for a small increment $dx$ in ordinary calculus:
$$d f(x) = f'(x)\, dx + \frac{1}{2} f''(x)(dx)^2 + \ldots$$

Example: $f(x) = x^2$
$$d f(x) = 2 x\, dx + \frac{1}{2} f''(x) (dx)^2$$

However, for ordinary functions, $(dx)^2$ will be vanishingly small, since our partitions $dx$ are very small to begin with. The corresponding change in function value that this term induces is therefore negligible for a continuous and differentiable function.

Therefore, we can ignore the second term and write:
$$d f(x) = f'(x)\, dx = 2 x\, dx$$

What does this look like in stochastic calculus? We again consider the change in $f$, but now we consider $x$ to be a stochastic process and therefore denote it as $X_t$. We can write the change in $f(X_t)$ as:

$$df(X_t) = \frac{\partial f(X_t)}{\partial X_t} dX_t + \frac{1}{2} \frac{\partial^2 f(X_t)}{\partial X_t^2} (dX_t)^2$$

where $dX_t$ is the change in $X_t$ and $(dX_t)^2$ relates to the quadratic variation of $X_t$. Now for a stochastic process like Brownian motion, the squared increment in the limit of infinitely small partitions does not converge to zero but to the time step $dt$ (we described this before in the properties of Brownian motion). Therefore, we cannot ignore the second term in the Taylor expansion and have to keep it. This is the key insight of Ito's lemma.

We can use Ito’s lemma for all Ito processes.

Definition: An Ito process is an adapted stochastic process $X_t$ that can be expressed as the sum of an integral with respect to time and an integral with respect to a Brownian motion $W_t$:
$$dX_t = \mu(t, X_t)\, dt + \sigma(t, X_t)\, dW_t$$

Here $\mu(t, X_t)$ is the drift term and $\sigma(t, X_t)$ is the diffusion term. We can think of the drift term as the deterministic part of the process and the diffusion term as the stochastic part of the process.

Now we can formulate what Ito’s lemma is for these Ito processes:

Ito's lemma: Let $X_t$ be an Ito process and $f(t, X_t)$ be a function that is twice continuously differentiable in $X_t$ and once in $t$. Then $f(t, X_t)$ is also an Ito process, which we can denote $Y_t$, and we can write:
$$dY_t = df(t, X_t) = \frac{\partial f(t, X_t)}{\partial t} dt + \frac{\partial f(t, X_t)}{\partial X_t} dX_t + \frac{1}{2} \frac{\partial^2 f(t, X_t)}{\partial X_t^2} (dX_t)^2$$

where we can use $df(t, X_t) = f_t\, dt + f_X\, dX_t + \frac{1}{2}f_{XX}(dX_t)^2$ as a simpler notation. This is the stochastic version of the Taylor expansion we saw before. We can also write this in integral form:
$$Y_t = f(t, X_t) = f(t_0, X_{t_0}) + \int_{t_0}^t f_t(s, X_s)\, ds + \int_{t_0}^t f_X(s, X_s)\, dX_s + \frac{1}{2} \int_{t_0}^t f_{XX}(s, X_s)\, (dX_s)^2$$

where the first term is the initial value of f(t,Xt)f(t, X_t), the second term is the deterministic part of the process, the third term is the stochastic part of the process and the fourth term is the correction term that arises from the quadratic variation of the stochastic process.

Another version of Ito's lemma, in its differential form, that you will often see is the following:
$$dY_t = df(t, X_t) = \left(\frac{\partial f(t, X_t)}{\partial t} + \frac{\partial f(t, X_t)}{\partial x} \mu(t, X_t) + \frac{1}{2} \frac{\partial^2 f(t, X_t)}{\partial x^2} \sigma(t, X_t)^2\right) dt + \frac{\partial f(t, X_t)}{\partial x} \sigma(t, X_t)\, dW_t$$

Here we took the differential form above, substituted $dX_t$ with the Ito process, replaced $(dX_t)^2$ with the quadratic variation of the Ito process $\sigma(t, X_t)^2 dt$, and then simplified the resulting expression by grouping $dt$ and $dW_t$ terms.
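
We can verify Ito's lemma numerically in a simple case (a sketch with $f(x) = x^2$ applied to Brownian motion itself, i.e. $\mu = 0$, $\sigma = 1$; path counts and times are arbitrary): the lemma gives $d(B_t^2) = dt + 2 B_t\, dB_t$, so $B_t^2 - t$ is an Ito integral and should have mean zero.

# file: "ito_lemma_check.py" (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
M, n, t = 10_000, 500, 3.0
dt = t / n

dB = rng.normal(0, np.sqrt(dt), size=(M, n))
B = dB.cumsum(axis=1)

# Ito's lemma with f(x) = x^2: B_t^2 = t + int_0^t 2 B_s dB_s,
# so E[B_t^2 - t] should be ~0
print(np.mean(B[:, -1] ** 2 - t))

# the Ito integral (left-point rule) matches B_t^2 - t path by path,
# up to discretisation error
B_left = np.concatenate([np.zeros((M, 1)), B[:, :-1]], axis=1)
ito_integral = np.sum(2 * B_left * dB, axis=1)
print(np.mean(np.abs(ito_integral - (B[:, -1] ** 2 - t))))  # small, shrinks with n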

Ito's lemma can also be generalized to higher dimensions. Written in matrix form, it looks like this:
$$dY_t = df(t, X_t) = \left(\partial_t f + \nabla f^T \mu + \frac{1}{2} \text{Tr}\left(\nabla \nabla^T f\, \sigma \sigma^T\right)\right) dt + \nabla f^T \sigma\, dW_t$$

where $X_t$ is an $n$-dimensional Ito process and $f(t, X_t)$ is a scalar function that is twice continuously differentiable in $X_t$ and once in $t$.

8. Solving linear SDEs with Ito’s Lemma

Now that we have a practical formula in hand, let's use it to solve some linear SDEs.

Definition: A geometric Brownian motion (GBM, also known as exponential Brownian motion) is a continuous-time stochastic process $X_t$ defined by the following stochastic differential equation:

$$dX_t = \mu X_t\, dt + \sigma X_t\, dW_t$$

where $\mu$ is the drift of the process, $\sigma$ is the volatility of the process and $W_t$ is a Brownian motion. Both $\mu$ and $\sigma$ are constants.

Compared to the general Ito process, we added $X_t$ on the right-hand side as a multiplicative factor in both the drift and the diffusion term.

Here we see the first problem when solving SDEs: we have $X_t$ on the right-hand side of the equation, so simply integrating won't work. We need to use Ito's lemma, and for this we need to choose a good ansatz for our transformation function $Y_t = f(t, X_t)$ so that the resulting equation becomes easy to solve. If we divide both sides by $X_t$, we can write the SDE as:
$$\frac{1}{X_t} dX_t = \mu\, dt + \sigma\, dW_t$$

In this case, we choose $Y_t = \ln(X_t)$, since the $\frac{1}{X_t} dX_t$ on the left-hand side hints at the derivative of a logarithm. This transformation is a good choice since it will make the $X_t$ on the right-hand side of the original equation disappear, the derivative of $\ln(x)$ being $\frac{1}{x}$. Let us compute the three derivatives that we need:
$$\frac{\partial f(t, X_t)}{\partial t} = 0, \qquad \frac{\partial f(t, X_t)}{\partial X_t} = \frac{1}{X_t}, \qquad \frac{\partial^2 f(t, X_t)}{\partial X_t^2} = -\frac{1}{X_t^2}$$

With this, we can write Ito's lemma for our transformation function $Y_t = \ln(X_t)$ as:
$$dY_t = \frac{1}{X_t} dX_t - \frac{1}{2} \frac{1}{X_t^2} (dX_t)^2$$

Now we can substitute $dX_t$ with the original equation. To substitute $(dX_t)^2$, we need to use the rules for differentials in Ito calculus. We will talk about these more later, but for now we can use the following rules:
$$dt\, dt = 0, \qquad dW_t\, dt = 0, \qquad dW_t\, dW_t = dt$$

With this, we can write:
$$\begin{aligned} (dX_t)^2 &= (\mu X_t\, dt + \sigma X_t\, dW_t)^2 \\ &= \mu^2 X_t^2\, dt^2 + 2 \mu \sigma X_t^2\, dt\, dW_t + \sigma^2 X_t^2 (dW_t)^2 \\ &= \sigma^2 X_t^2\, dt \end{aligned}$$

where we used the rules above to simplify the expression. Now we can substitute this into our equation for $dY_t$:
$$\begin{aligned} dY_t &= \frac{1}{X_t} dX_t - \frac{1}{2} \frac{1}{X_t^2} (dX_t)^2 \\ &= \frac{1}{X_t} (\mu X_t\, dt + \sigma X_t\, dW_t) - \frac{1}{2} \frac{1}{X_t^2} \sigma^2 X_t^2\, dt \\ &= \mu\, dt + \sigma\, dW_t - \frac{1}{2} \sigma^2\, dt \\ &= \left(\mu - \frac{1}{2} \sigma^2\right) dt + \sigma\, dW_t \end{aligned}$$

We can see that the $X_t$ on the right-hand side disappeared; $X_t$ now only appears implicitly on the left-hand side via $dY_t = d(\ln(X_t))$. This is the key insight of Ito's lemma: we can transform a stochastic differential equation into an easier-to-solve differential equation by choosing a good transformation function $Y_t = f(t, X_t)$.

Now we can solve this equation by integrating both sides:
$$\begin{aligned} \int_{0}^t dY_s &= \int_{0}^t \left(\mu - \frac{1}{2} \sigma^2\right) ds + \int_{0}^t \sigma\, dW_s \\ Y_t - Y_0 &= \left(\mu - \frac{1}{2} \sigma^2\right) t + \sigma W_t \\ \ln(X_t) - \ln(X_0) &= \left(\mu - \frac{1}{2} \sigma^2\right) t + \sigma W_t \\ \ln\left(\frac{X_t}{X_0}\right) &= \left(\mu - \frac{1}{2} \sigma^2\right) t + \sigma W_t \\ \frac{X_t}{X_0} &= \exp\left(\left(\mu - \frac{1}{2} \sigma^2\right) t + \sigma W_t\right) \\ X_t &= X_0 \exp\left(\left(\mu - \frac{1}{2} \sigma^2\right) t + \sigma W_t\right) \end{aligned}$$

where we used that $Y_0 = \ln(X_0)$ and $W_0 = 0$.

This is the solution to the geometric Brownian motion. We can see that $X_t$ is log-normally distributed. This is a very important distribution that will appear again in the course, for example when we talk about the Girsanov theorem.

We can look at the interesting special case where our process has no drift, i.e. $\mu = 0$:
$$\begin{aligned} dX_t &= \sigma X_t\, dW_t \\ X_T - X_0 &= \int_{0}^T \sigma X_t\, dW_t \\ X_T &= X_0 \exp\left(-\frac{\sigma^2}{2}T + \sigma W_T\right) \end{aligned}$$

We can see that the right-hand side is an Ito integral, which is a martingale, and we can therefore conclude that $X_t$ is a martingale with constant expected value $\mathbb{E}[X_t] = X_0$. This is a very important property of driftless geometric Brownian motion that we will see again later.

We can look at this in code by simulating a geometric Brownian motion numerically:

# file: "geometric_brownian_motion.py"
import numpy as np
import matplotlib.pyplot as plt
M = 10 # number of simulations
t = 10 # Time
n = 100 # steps we want to see
dt = t/n # time step
#simulating the brownian motion
steps = np.random.normal(0, np.sqrt(dt), size=(M, n)).T
origin = np.zeros((1,M))
bm_paths = np.concatenate([origin, steps]).cumsum(axis=0)
time = np.linspace(0,t,n+1)
tt = np.full(shape=(M, n+1), fill_value=time)
tt = tt.T
plt.plot(tt,bm_paths)
plt.xlabel("Years (t)")
plt.ylabel("Move")
plt.show()
# change time steps to 1,000,000 to observe same quadratic variation along paths
[quadratic_variation(path) for path in bm_paths.T[:4]]
# output: 
# change simulations to 100,000 to observe convergence of variance to Time at a particular time step
[variance(path) for path in bm_paths[1:11]]
# output:

9. Ornstein-Uhlenbeck Process

Now let's look at another example of a linear SDE: the Ornstein-Uhlenbeck process. This is a process that is often used to model the velocity of a particle under the influence of friction. It is defined by the following SDE:
$$dX_t = \theta (\mu - X_t)\, dt + \sigma\, dW_t$$

where $\theta$ is the friction (mean-reversion) coefficient, $\mu$ is the long-term mean and $\sigma$ is the volatility.

Let's solve this equation in general, again using Ito's lemma. We first note that $\mu$ is intended as the long-term mean of the process $X_t$. We can therefore choose $Y_t = X_t - \mu$ as our transformation function and get the following differential:
$$dY_t = dX_t = -\theta Y_t\, dt + \sigma\, dW_t$$

where we used that $Y_t = X_t - \mu$, so $dY_t = dX_t$ and $\theta(\mu - X_t) = -\theta Y_t$. As we will see, the solution is a linear combination of a deterministic part and a stochastic part: the deterministic part is the long-term mean $\mu$ plus a term that decays exponentially towards it with rate $\theta$, and the stochastic part is an integral against Brownian motion weighted by a factor that decays exponentially to zero with rate $\theta$.

If we look at this equation for a bit, we can see that we are equating the change in $Y_t$ to a multiple of itself plus some noise. If we ignore the noise, this pattern of (roughly speaking) $\frac{dY_t}{dt} = -\theta Y_t$ is strikingly similar to the classic first-order ODE with the exponential function as its solution.

We can therefore do another change of variables and define $Z_t = \exp(\theta t) Y_t$ as our new transformation function. We can then write:
$$Z_t = f(t, \theta, Y_t) = \exp(\theta t) Y_t = \exp(\theta t) (X_t - \mu)$$

To solve this, we can use Ito's lemma again. Note that the second derivative with respect to $Y_t$ vanishes, so the correction term drops out:
$$\begin{aligned} d Z_t &= d f(t, \theta, Y_t) \\ &= \frac{\partial f(t, \theta, Y_t)}{\partial t} dt + \frac{\partial f(t, \theta, Y_t)}{\partial Y_t} dY_t + \frac{1}{2} \frac{\partial^2 f(t, \theta, Y_t)}{\partial Y_t^2} (dY_t)^2 \\ &= \partial_t [e^{\theta t} Y_t]\, dt + \partial_{Y_t} [e^{\theta t} Y_t]\, dY_t + 0 \\ &= \theta e^{\theta t} Y_t\, dt + e^{\theta t}\, dY_t \\ &= \theta e^{\theta t} Y_t\, dt + e^{\theta t} (-\theta Y_t\, dt + \sigma\, dW_t) \\ &= \sigma e^{\theta t}\, dW_t \\ Z_t &= Z_0 + \int_{0}^t \sigma e^{\theta s}\, dW_s = Z_0 + \sigma \int_{0}^t e^{\theta s}\, dW_s \end{aligned}$$

where we used that $dY_t = -\theta Y_t\, dt + \sigma\, dW_t$.

Now we just have to undo the change of variables; we first get the solution for $Y_t$ and then for $X_t$:
$$\begin{aligned} Y_t &= e^{-\theta t} Z_t \\ &= e^{-\theta t} \left(Z_0 + \sigma \int_{0}^t e^{\theta s}\, dW_s\right) \\ &= e^{-\theta t} \left(Y_0 + \sigma \int_{0}^t e^{\theta s}\, dW_s\right) \\ X_t &= Y_t + \mu \\ &= \mu + e^{-\theta t} \left(X_0 - \mu + \sigma \int_{0}^t e^{\theta s}\, dW_s\right) \\ &= \mu + e^{-\theta t} (X_0 - \mu) + \sigma \int_0^t e^{-\theta(t-s)}\, dW_s \end{aligned}$$

We now focus on the case where $\mu = 0$. In addition, we choose a parameterization where $\theta = \alpha$ and $\sigma = \sqrt{2\alpha}$, for reasons that will become clear later (mostly that our solutions come out nicely). With this, our solution for $X_t$ with initial condition $X_0$ becomes:
$$\begin{aligned} X_t &= e^{-\alpha t} X_0 + \sqrt{2 \alpha} \int_{0}^t e^{-\alpha (t-s)}\, dW_s \\ &\overset{d}{=} X_0 e^{-\alpha t} + \sqrt{1 - e^{-2 \alpha t}}\, W_1 \\ &\overset{d}{=} X_0 e^{-\alpha t} + W_{1-e^{-2 \alpha t}} \end{aligned}$$

where $W_{1-e^{-2 \alpha t}}$ is a Brownian motion evaluated at time $1-e^{-2 \alpha t}$, i.e. a Gaussian with variance $1-e^{-2 \alpha t}$, and $\overset{d}{=}$ denotes equality in distribution: the Ito integral is Gaussian with variance $2\alpha \int_0^t e^{-2\alpha(t-s)}\, ds = 1 - e^{-2\alpha t}$.

From this solution we can again get some intuition about the process by poking at it a bit. First, we can see that the process is mean-reverting: it is pulled back to the mean $\mu$ with rate $\alpha$. It is also the only nontrivial process that is at the same time a Gaussian process, a Markov process and temporally homogeneous. To see that the stationary distribution is indeed a Gaussian distribution, we can compute the mean and the variance of the process and then evaluate them in the limit $t \rightarrow \infty$:
$$\begin{aligned} \mathbb{E}[X_t] &= \mathbb{E}[X_0 e^{-\alpha t} + W_{1-e^{-2 \alpha t}}] \\ &= X_0 e^{-\alpha t} + \mathbb{E}[W_{1-e^{-2 \alpha t}}] \\ &= X_0 e^{-\alpha t} \end{aligned}$$

where we used that $\mathbb{E}[W_{1-e^{-2\alpha t}}] = 0$. Similarly, $\text{var}(X_t) = 1 - e^{-2\alpha t}$. As $t \rightarrow \infty$, the mean decays to $0$ and the variance approaches $1$, so the process converges to the stationary distribution $\mathcal{N}(0, 1)$.
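
The solution also gives us an exact simulation scheme for this OU process (a sketch; $\alpha$, the step size and path count are arbitrary choices): the transition over a step $\Delta$ is $X_{t+\Delta} = e^{-\alpha \Delta} X_t + \sqrt{1 - e^{-2\alpha\Delta}}\, \xi$ with $\xi \sim \mathcal{N}(0,1)$, and the samples indeed approach the stationary $\mathcal{N}(0, 1)$.

# file: "ou_process.py" (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
alpha, dt, n, M = 1.0, 0.01, 1000, 100_000
decay = np.exp(-alpha * dt)
noise_scale = np.sqrt(1 - np.exp(-2 * alpha * dt))

X = np.full(M, 3.0)    # start all paths far from the mean
for _ in range(n):     # exact OU transition, no discretisation error
    X = decay * X + noise_scale * rng.standard_normal(M)

print(X.mean(), X.var())  # ~0 and ~1: the stationary N(0, 1)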

10. Solving non-linear SDEs numerically

So far we only looked at linear SDEs and how to solve them. Can we do something similar for non-linear SDEs? The answer is sadly no, in general. But since we care about computers and have to discretise our equation anyway, we can use the Euler-Maruyama method to solve non-linear SDEs numerically. This is a numerical method that is similar to the Euler method for ODEs, but adapted to the stochastic setting.

(Figure: Euler-Maruyama discretisation)

You will see in the next lecture that many practical implementations of diffusion models correspond to Euler-Maruyama discretisations of SDEs.
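
As a preview, here is a minimal Euler-Maruyama sketch (shown on the geometric Brownian motion from earlier so we can compare against its exact solution; all parameters are illustrative). The step rule is $X_{t+\Delta} = X_t + \mu(t, X_t)\Delta + \sigma(t, X_t)\,\Delta W$ with $\Delta W \sim \mathcal{N}(0, \Delta)$, and it works just as well when the drift and diffusion are non-linear.

# file: "euler_maruyama.py" (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, X0, T, n = 0.1, 0.3, 1.0, 1.0, 1000
dt = T / n

X, W = X0, 0.0
for _ in range(n):
    dW = rng.normal(0, np.sqrt(dt))
    X += mu * X * dt + sigma * X * dW  # Euler-Maruyama step for dX = mu X dt + sigma X dW
    W += dW

# compare against the exact GBM solution driven by the same Brownian path
exact = X0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * W)
print(X, exact)  # close for small dt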

Credits

Code examples adapted from QuantPy. Much of the logic in the lecture is based on the Øksendal book as well as the Särkkä & Solin book. Title image from Wikipedia.