Lesson 1 - A very fast paced introduction to the foundations

How to view probability from a measure theoretic perspective.

Optional reading for this lesson

Slides

Video (soon)

In this lecture we give a very fast-paced introduction to a perspective on probability that some of you may have encountered and some of you may not have: measure theory. We will also dive into stochastic calculus and stochastic differential equations. This is a very dense lecture and we will not be able to cover everything in detail. The goal is to give you a feeling for the concepts and a starting point for further reading.

1. Measure Theory and Probability

This first section is mostly taken from Francisco’s MPhil thesis available here.

For many of the technical derivations in the methodology we will study, a basic notion of measure and integration is required; we will therefore refresh concepts such as $\sigma$-algebra, probability measures, measurable functions and the Lebesgue–Stieltjes integral, without going into unnecessary technical detail.

1.1 Probability Spaces

Definition: A probability space is a triple $(\Omega, \mathcal{F}, \mathbb{P})$, where:

  • $\Omega$ is the sample space, i.e. the set of possible outcomes. For example, for a coin toss $\Omega=\{\text{Heads}, \text{Tails}\}$.
  • The $\sigma$-algebra $\mathcal{F} \subseteq 2^{\Omega}$ represents the set of events we may want to consider. Continuing the coin toss example, we may have $\mathcal{F}=\{\emptyset, \{\text{Heads}\}, \{\text{Tails}\}, \{\text{Heads}, \text{Tails}\}\}$.
  • A probability measure $\mathbb{P}:\mathcal{F} \rightarrow [0,1]$ is a function which assigns a number in $[0,1]$ to any event (set) in the $\sigma$-algebra $\mathcal{F}$. The function $\mathbb{P}$ has the following requirements:
    1. $\sigma$-additivity: if $\bigcup_{i=0}^\infty A_i \in \mathcal{F}$ and $A_i \cap A_j = \emptyset$ for $i \neq j$, then $\mathbb{P}\left(\bigcup_{i=0}^\infty A_i\right) = \sum_{i=0}^\infty \mathbb{P}(A_i)$.
    2. $\mathbb{P}(\Omega)=1$, which we can read as the sample space integrating to 1. Without this condition $\mathbb{P}$ would be a regular measure.

Note that we want to be able to measure (assign a probability to) each set in $\mathcal{F}$ rather than $2^\Omega$, since the latter is not always possible. In other words, one can construct sets that cannot be measured (or can only be measured by the trivial measure $\lambda(A) = 0\ \forall A \subseteq \Omega$, which is not a probability measure). Thus, we require $\mathcal{F}$ to only contain measurable sets, and this is what a $\sigma$-algebra guarantees.

Definition: A $\sigma$-algebra $\mathcal{F} \subseteq 2^{\Omega}$ is a collection of sets satisfying the following properties:

  • $\mathcal{F}$ contains $\Omega$: $\Omega \in \mathcal{F}$.
  • $\mathcal{F}$ is closed under complements: if $A\in\mathcal{F}$, then $\Omega \setminus A \in \mathcal{F}$.
  • $\mathcal{F}$ is closed under countable unions: if $A_i \in \mathcal{F}\ \forall i$, then $\bigcup_i A_i \in \mathcal{F}$.
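
As a quick sanity check of these axioms, here is a minimal sketch in Python (the file name and set representations are just illustrative choices) that encodes the coin-toss example and verifies the $\sigma$-algebra properties and $\sigma$-additivity:

# file: "sigma_algebra_check.py" (illustrative sketch)
omega = frozenset({"Heads", "Tails"})
F = {frozenset(), frozenset({"Heads"}), frozenset({"Tails"}), omega}

# F contains the sample space
assert omega in F
# F is closed under complements
assert all(omega - A in F for A in F)
# F is closed under unions (countable unions reduce to finite ones here)
assert all(A | B in F for A in F for B in F)

# a probability measure on F: P(omega) = 1 and additive on disjoint events
P = {frozenset(): 0.0, frozenset({"Heads"}): 0.5, frozenset({"Tails"}): 0.5, omega: 1.0}
assert P[omega] == 1.0
heads, tails = frozenset({"Heads"}), frozenset({"Tails"})
assert P[heads | tails] == P[heads] + P[tails]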

Note we use the notation $\mathcal{B}(\mathbb{R}^d)$ for the Borel $\sigma$-algebra of $\mathbb{R}^d$, which we can think of as the canonical $\sigma$-algebra for $\mathbb{R}^d$: it is the smallest $\sigma$-algebra containing all open sets of $\mathbb{R}^d$. This notation can also be extended to subsets of $\mathbb{R}^d$ and more generally to any topological space.

1.2 Random Variables

Definition: For a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a real-valued random variable (vector) $\mathbf{x}(\omega)$ is a function $\mathbf{x}:\Omega \rightarrow \mathbb{R}^d$ which is measurable, meaning that the pre-image of every Borel set under $\mathbf{x}$ lies within the $\sigma$-algebra $\mathcal{F}$:

$$\mathbf{x}^{-1}(B)= \{\omega : \mathbf{x}(\omega) \in B \} \in \mathcal{F}, \quad \forall B \in \mathcal{B}(\mathbb{R}^d)$$

The above formalism of the random variable allows us to assign a numerical representation to outcomes in $\Omega$. The clear advantage is that we can now ask questions such as: what is the probability that $\mathbf{x}$ is contained within a set $B \subseteq \mathbb{R}^d$?
$$\text{P}(\mathbf{x}(\omega) \in B) = \mathbb{P}(\{\omega : \mathbf{x}(\omega) \in B \})$$

and if we consider the more familiar 1D example, we recover the cumulative distribution function (CDF):
$$\text{P}(x(\omega) \leq r) = \mathbb{P}(\{\omega : x(\omega) \leq r \})$$

The random-variable formalism provides a clearer connection between the probability measure $\mathbb{P}$ and the familiar CDF. For simplicity, we may sometimes drop the argument $\omega$ from the random variable notation (e.g. $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbb{I})$).

1.3 Lebesgue–Stieltjes Integral

Here we offer a pragmatic introduction to the Lebesgue–Stieltjes integral, in the context of a probability space.

For a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a measurable function $f: \Omega \rightarrow \mathbb{R}$ and $A \in \mathcal{F}$, the Lebesgue–Stieltjes integral $\int_{A} f(\mathbf{x})\, d\mathbb{P}(\mathbf{x})$ is a Lebesgue integral with respect to the probability measure $\mathbb{P}$.

Whilst we have not yet introduced a precise definition of the Lebesgue integral, we will now illustrate some of its properties to get a grasp of this seemingly new notation. Expectations in our probability space can be written as
$$\mathbb{E}_{\mathbb{P}}[f(\mathbf{x})] = \int_{\Omega} f(\mathbf{x})\, d\mathbb{P}(\mathbf{x})$$

Let $\bm{1}(\mathbf{x} \in A)$ be the indicator function for set $A$; then
$$\mathbb{E}_{\mathbb{P}}[\bm{1}(\mathbf{x} \in A)] = \int_{\Omega} \bm{1}(\mathbf{x} \in A)\, d\mathbb{P}(\mathbf{x}) = \int_{A} d\mathbb{P}(\mathbf{x}) = \mathbb{P}(A)$$

The above result is a useful example, since it shows how a distribution (probability measure) is recovered from the integral; with $A = (-\infty, r]$ this is exactly the cumulative distribution function. When our distribution $\mathbb{P}$ admits a probability density function (PDF) $p(\mathbf{x})$, we have the following:
$$\int_{\Omega} f(\mathbf{x})\, d\mathbb{P}(\mathbf{x}) = \int_{\Omega} f(\mathbf{x})p(\mathbf{x})\, d\lambda(\mathbf{x}),$$
where $\lambda$ is the Lebesgue measure, which we can think of as the characteristic measure for $\mathbb{R}^d$. For many purposes, we can interpret $\int_{\Omega} f(\mathbf{x})p(\mathbf{x})\, d\lambda(\mathbf{x})$ as the regular Riemann integral, and many authors \citep{williams2006gaussian} write $\int_{\Omega} f(\mathbf{x})p(\mathbf{x})\, d\mathbf{x}$ when the integral is with respect to the Lebesgue measure.
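
To make this concrete, here is a small numerical sketch (assuming $\mathbb{P} = \mathcal{N}(0,1)$; all parameters are illustrative): we estimate $\mathbb{E}_{\mathbb{P}}[\bm{1}(x \in A)]$ by Monte Carlo and compare it with $\mathbb{P}(A)$ obtained from the Gaussian CDF.

# file: "indicator_expectation.py" (illustrative sketch)
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)          # samples from P = N(0, 1)

a, b = -1.0, 0.5                            # the event A = [a, b]
mc_estimate = np.mean((a <= x) & (x <= b))  # E_P[1(x in A)]
exact = norm.cdf(b) - norm.cdf(a)           # P(A) from the CDF
print(mc_estimate, exact)                   # the two numbers should be close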

An important takeaway is that, whilst connecting the Lebesgue integral to the standard Riemann integral gives us a useful conceptual connection, it is not always something that can be done. As we will soon see, many random processes do not admit a PDF. In order to compute expectations with respect to these processes, we must adopt the Lebesgue–Stieltjes integral, which is well-defined in these settings, while the standard Riemann integral is not.

2. Stochastic Processes and Martingales

Now let's talk about stochastic processes, where we think about dynamic random variables that evolve over time. Let $T$ be a set of times.

Definition: A stochastic process $\{X_t\}_{t \in T}$ is a collection of random variables $X_t$ defined on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$.

We can think of a stochastic process as a function $X: T \times \Omega \rightarrow \mathbb{R}$ that maps a time $t$ and a sample point $\omega$ to a real number $X_t(\omega)$.

Definition: A filtration $\{\mathcal{F}_t\}$ is a family of sub-$\sigma$-algebras of $\mathcal{F}$ such that $\mathcal{F}_s \subseteq \mathcal{F}_t$ for $0 \leq s \leq t$.

We see that the sub-$\sigma$-algebras $\mathcal{F}_t$ are nested, i.e. $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \ldots$. They grow over time, meaning that the information we have about the process increases over time. This is a natural property of stochastic processes, since we observe the process as time passes and therefore only gain more information about it.

Definition: A stochastic process $\{X_t\}_{t \in T}$ is adapted to a filtration $\{\mathcal{F}_t\}$ if $X_t$ is $\mathcal{F}_t$-measurable for all $t \in T$.

This means that the random variable $X_t$ is measurable with respect to the $\sigma$-algebra $\mathcal{F}_t$, i.e. we can determine the value of $X_t$ at time $t$ from the information in $\mathcal{F}_t$. Adapted processes are also called non-anticipating processes, since we cannot anticipate the value of $X_t$ by observing the $\sigma$-algebra $\mathcal{F}_s$ for $s < t$. In other words, we cannot see into the future; $X_t$ is only known at time $t$.

A further key idea here is that the $\sigma$-algebra $\mathcal{F}_t$ contains all the information we have about the process up to time $t$, i.e. it describes not only the value of $X_t$ at time $t$, but also everything before it.

From these definitions, we can now define the concept of a martingale. Intuitively, a martingale is a stochastic process whose expected future value, given all information up to now, equals its current value. Let's see how we can define this formally.

Definition: A stochastic process $\{X_t\}_{t \in T}$ is a martingale with respect to a filtration $\{\mathcal{F}_t\}$ if:

  1. $X_t$ is integrable for all $t \in T$
  2. $X_t$ is adapted to $\{\mathcal{F}_t\}$ for all $t \in T$
  3. $\mathbb{E}[X_t \mid \mathcal{F}_s] = X_s$ for all $0 \leq s \leq t$.

The last statement is equivalent to $\mathbb{E}[X_t - X_s \mid \mathcal{F}_s] = 0$, since $\mathbb{E}[X_s \mid \mathcal{F}_s] = X_s$ ($X_s$ is $\mathcal{F}_s$-measurable) and we can therefore pull $-X_s$ into the conditional expectation. In other words, the expected change in the process from time $s$ to time $t$ is zero, and in particular the expected value of $X_t$ is constant over time. This is the key property of a martingale.
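
We can illustrate this numerically with a symmetric random walk, which is a discrete-time martingale (a minimal sketch; path counts and times are arbitrary choices): the mean increment between two times should be approximately zero and uncorrelated with the earlier value.

# file: "random_walk_martingale.py" (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
M, s, t = 200_000, 50, 100                # number of paths, earlier time, later time
steps = rng.choice([-1, 1], size=(M, t))  # +/-1 increments
X = steps.cumsum(axis=1)                  # random walk paths

# E[X_t - X_s | F_s] = 0: the mean increment is ~0 and
# uncorrelated with the value of the path at time s
increments = X[:, t - 1] - X[:, s - 1]
print(increments.mean())                           # ~0
print(np.corrcoef(X[:, s - 1], increments)[0, 1])  # ~0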

3. Brownian Motion

Before we finally come to stochastic differential equations, let's talk about Brownian motion. Brownian motion is a stochastic process that is a martingale and is central to the Ito integration theory we will discuss at length later, so it is worth dwelling on it a bit.

Definition: A stochastic process $B_t$ is a Brownian motion if:

  1. $B_0 = 0$ (the process starts at $0$)
  2. $B_t$ is almost surely continuous
  3. $B_t$ has independent increments ($B_t - B_s$ is independent of $B_u$ for all $u \leq s$)
  4. $B_t - B_s \sim \mathcal{N}(0, t-s)$ (for $0 \leq s \leq t$)

From this definition, we can derive some properties of Brownian motion:

  1. $\mathbb{E}[B_t] = 0$ (since $B_t$ follows a centered normal distribution)
  2. $\mathbb{E}[B_t^2] = t = \text{var}(B_t)$. This is a key property of Brownian motion and is closely related to its quadratic variation, which also equals $t$. It is the reason why we cannot ignore the second term in the Taylor expansion of Ito's lemma, as we will later see. We can show this by noting that $\text{var}(B_t) = \mathbb{E}[B_t^2] - \mathbb{E}[B_t]^2 = \mathbb{E}[B_t^2]$ since $\mathbb{E}[B_t] = 0$.
  3. $B_t$ is a martingale. The first two martingale properties are easy to show: $B_t$ is integrable for all $t$ since it follows a normal distribution, and it is $\mathcal{F}_t$-adapted. For the third martingale property, we make use of a common trick when working with Brownian motion: we make an increment appear inside an expectation and then use that its mean is $0$ by the definition of Brownian motion. We can write:
$$\mathbb{E}[B_t \mid \mathcal{F}_s] = \mathbb{E}[B_t - B_s + B_s \mid \mathcal{F}_s] = \mathbb{E}[B_t - B_s] + B_s = B_s$$
where the middle step uses that the increment $B_t - B_s$ is independent of $\mathcal{F}_s$.

In the following, whenever we talk about Brownian motion we either use BtB_t or WtW_t as notation.
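
We can also check the quadratic variation property numerically (a sketch; the horizon $t = 2$ and the partition sizes are arbitrary): as the partition of $[0, t]$ gets finer, the sum of squared increments of a Brownian path converges to $t$.

# file: "quadratic_variation_check.py" (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
t = 2.0
for n in [100, 10_000, 1_000_000]:
    dt = t / n
    dB = rng.normal(0, np.sqrt(dt), size=n)  # Brownian increments
    print(n, np.sum(dB**2))                  # approaches t = 2.0 as n grows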

4. ODEs

Now that we have a measure theoretic perspective on probability and some feeling for what stochastic processes are, we can start to think about how to define stochastic differential equations.

To do this, let's first remember what an ordinary differential equation is. An ordinary differential equation is a differential equation that contains one or more functions of one independent variable and the derivatives of those functions. The term ordinary is used in contrast with the term partial differential equation, which may involve more than one independent variable. An example ODE is the following:
$$\frac{d x}{d t} = -\alpha x$$

This is a first-order ODE, because it contains only the first derivative of $x$. It is also a linear ODE, because it is linear in $x$. And it is a time-invariant ODE, because its coefficients do not depend on $t$.
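
This ODE has the closed-form solution $x(t) = x_0 e^{-\alpha t}$, which we can use to sanity-check a forward-Euler integration (a minimal sketch; $\alpha$, $x_0$ and the step count are arbitrary choices):

# file: "linear_ode.py" (illustrative sketch)
import numpy as np

alpha, x0, t_end, n = 0.5, 1.0, 10.0, 1000
dt = t_end / n

x = x0
for _ in range(n):           # forward Euler: x <- x + dt * (-alpha * x)
    x += dt * (-alpha * x)

print(x, x0 * np.exp(-alpha * t_end))  # numerical vs exact solution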

In general, we often write linear differential equations like this:
$$\frac{d x}{d t} = F(t) x(t) + L(t) w(t)$$

where $F(t)$ is a function of time, $x(t)$ is the state of the system at time $t$, $L(t)$ is a function of time and $w(t)$ is some forcing or driving function; in applications, this could for example be an input to a system.

A useful special case of this are linear time-invariant differential equations:
$$\frac{d x}{d t} = F x(t) + L w(t)$$

where $F$ and $L$ are constant. We can solve such equations via methods like separation of variables or, for the case of inhomogeneous equations, via the integrating factor method.

For general linear differential equations, we can solve them via methods like Fourier or Laplace transforms or via numerical methods. Linear differential equations are useful to consider since they arise in many applications and can be solved analytically. They will also be useful for us when we think about stochastic differential equations.

5. SDEs

A pragmatic way to think about stochastic differential equations is to think of them as ODEs with some noise added. For example, let's think about a car in 2 dimensions that is governed by the following ODE resulting from Newton's second law:
$$\mathbf{f}(t) = m \mathbf{a}(t) = m \frac{d^2 \mathbf{x}(t)}{d t^2}$$

where $\mathbf{f}(t)$ is the force acting on the car, $m$ is the mass of the car, $\mathbf{a}(t)$ is the acceleration of the car and $\mathbf{x}(t)$ is the position of the car.

Let's say that our acceleration is now not precisely known, or that we have a measurement error in our acceleration. We can model this by adding some noise to our acceleration and describe it by a random white-noise process $\mathbf{w}(t) = (w_1(t), w_2(t))$:
$$\frac{d^2 x_1(t)}{d t^2} = w_1(t), \qquad \frac{d^2 x_2(t)}{d t^2} = w_2(t)$$

Now if we define $x_3 = \frac{d x_1(t)}{d t}$ and $x_4 = \frac{d x_2(t)}{d t}$ we can write this as a system of first-order ODEs in our old familiar form:
$$\frac{d}{d t} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$$

where the first matrix is $\mathbf{F}$ and the second matrix is $\mathbf{L}$. This is a linear time-invariant differential equation where the forcing function is not deterministic but a stochastic process: a stochastic differential equation.

Let's try to solve such an equation heuristically with initial condition $\mathbf{x}(t_0) \sim \mathcal{N}(\mathbf{m}_0, \mathbf{P}_0)$. If we pretend that $\mathbf{w}(t)$ is deterministic and continuous (which it is not), we can solve this equation like we would solve any other ODE. We can write the solution via the method of integrating factors as:
$$\begin{aligned} \frac{d \mathbf{x}}{d t} &= \mathbf{F} \mathbf{x}(t) + \mathbf{L} \mathbf{w}(t) \\ \frac{d \mathbf{x}}{d t} - \mathbf{F} \mathbf{x}(t) &= \mathbf{L} \mathbf{w}(t) \\ \exp(-\mathbf{F} t) \frac{d \mathbf{x}}{d t} - \exp(-\mathbf{F} t) \mathbf{F} \mathbf{x}(t) &= \exp(-\mathbf{F} t) \mathbf{L} \mathbf{w}(t) \\ \frac{d}{d t} \left[\exp(-\mathbf{F} t) \mathbf{x}(t)\right] &= \exp(-\mathbf{F} t) \mathbf{L} \mathbf{w}(t) \\ \exp(-\mathbf{F} t) \mathbf{x}(t) - \exp(-\mathbf{F} t_0) \mathbf{x}(t_0) &= \int_{t_0}^t \exp(-\mathbf{F} s) \mathbf{L} \mathbf{w}(s)\, ds \\ \mathbf{x}(t) &= \exp(\mathbf{F}(t-t_0)) \mathbf{x}(t_0) + \int_{t_0}^t \exp(\mathbf{F}(t-s)) \mathbf{L} \mathbf{w}(s)\, ds \end{aligned}$$

where $t_0$ is the initial time and $\exp(\cdot)$ is the matrix exponential. Let us continue to pretend that $\mathbf{w}(t)$ is deterministic and continuous. Then taking expectations on both sides yields:
$$\begin{aligned} \mathbb{E}[\mathbf{x}(t)] &= \mathbb{E}[\exp(\mathbf{F}(t-t_0)) \mathbf{x}(t_0)] + \mathbb{E}\left[\int_{t_0}^t \exp(\mathbf{F}(t-s)) \mathbf{L}\mathbf{w}(s)\, ds\right] \\ &= \mathbb{E}[\exp(\mathbf{F}(t-t_0)) \mathbf{x}(t_0)] \\ &= \exp(\mathbf{F}(t-t_0)) \mathbf{m}_0 \end{aligned}$$

where in the first step we used that $\mathbb{E}[\mathbf{w}(s)] = \mathbf{0}$, since this is one of the properties of the white noise process, and in the second step that $\mathbb{E}[\mathbf{x}(t_0)] = \mathbf{m}_0$, since this was our initial condition.
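
We can sanity-check the mean equation on the car model (a sketch; the initial mean is an arbitrary choice, and $\mathbf{F}$ is the matrix from the example above):

# file: "mean_propagation.py" (illustrative sketch)
import numpy as np
from scipy.linalg import expm

F = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
m0 = np.array([0.0, 0.0, 1.0, 0.5])  # mean position (0, 0), mean velocity (1, 0.5)

t = 2.0
print(expm(F * t) @ m0)              # E[x(t)] = exp(F t) m0

# cross-check against Euler integration of dm/dt = F m
m, dt = m0.copy(), 1e-4
for _ in range(int(t / dt)):
    m += dt * (F @ m)
print(m)                             # should closely match the expm result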

We can also get a covariance equation by recalling the Dirac delta correlation property of the white noise process, $\mathbb{E}[\mathbf{w}(t) \mathbf{w}^T(s)] = \delta(t-s) \mathbf{Q}$, with $\mathbf{Q}$ being the spectral density of the white noise process. Then we can write (for $t_0 = 0$ and $\mathbf{m}(t) = \mathbb{E}[\mathbf{x}(t)]$):
$$\begin{aligned} \mathbf{P}(t) &= \mathbb{E}[(\mathbf{x}(t) - \mathbb{E}[\mathbf{x}(t)])(\mathbf{x}(t) - \mathbb{E}[\mathbf{x}(t)])^T] \\ &= \mathbb{E}[(\mathbf{x}(t) - \mathbf{m}(t))(\mathbf{x}(t) - \mathbf{m}(t))^T] \\ &= \mathbb{E}[\mathbf{x}(t)\mathbf{x}(t)^T] - \mathbb{E}[\mathbf{x}(t)]\mathbf{m}(t)^T - \mathbf{m}(t)\mathbb{E}[\mathbf{x}(t)^T] + \mathbf{m}(t)\mathbf{m}(t)^T \\ &= \exp(\mathbf{F}t) \mathbf{P}_0 \exp(\mathbf{F}t)^T + \int_0^t \exp(\mathbf{F}(t-s)) \mathbf{L} \mathbf{Q} \mathbf{L}^T \exp(\mathbf{F}(t-s))^T\, ds \end{aligned}$$

where in the last step we used $\mathbb{E}[\mathbf{x}(t)] = \exp(\mathbf{F}t)\mathbf{m}_0$ and $\mathbb{E}[\mathbf{x}(t_0)\mathbf{x}(t_0)^T] - \mathbf{m}_0\mathbf{m}_0^T = \mathbf{P}_0$.

We can now get the differential equations for the mean and covariance by taking the derivative of the mean and covariance equations with respect to $t$ (where we denote $\mathbf{m}(t) = \mathbb{E}[\mathbf{x}(t)]$ and $\mathbf{P}(t) = \mathbb{E}[(\mathbf{x}(t) - \mathbb{E}[\mathbf{x}(t)])(\mathbf{x}(t) - \mathbb{E}[\mathbf{x}(t)])^T]$):
$$\begin{aligned} \frac{d \mathbf{m}(t)}{d t} &= \mathbf{F} \mathbf{m}(t) \\ \frac{d \mathbf{P}(t)}{d t} &= \mathbf{F} \mathbf{P}(t) + \mathbf{P}(t) \mathbf{F}^T + \mathbf{L} \mathbf{Q} \mathbf{L}^T \end{aligned}$$

So far, so good: these equations are actually the correct differential equations for the mean and covariance. So can we treat SDEs just like ODEs? No, because we were just lucky in this case that (1) the SDE here was linear and (2) we did not use any of the calculus rules that are different in the stochastic setting. We cannot stretch this heuristic treatment to nonlinear SDEs and we cannot use it for more complex expressions.
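
To build some trust in these equations, here is a numerical sketch that integrates the covariance ODE with Euler steps and compares it against the closed-form expression from above (the toy 2D system, $\mathbf{Q}$ and the step sizes are arbitrary choices for illustration):

# file: "covariance_ode_check.py" (illustrative sketch)
import numpy as np
from scipy.linalg import expm

F = np.array([[0.0, 1.0], [0.0, 0.0]])  # toy system: position and velocity
L = np.array([[0.0], [1.0]])
Q = np.array([[0.5]])
P0 = np.eye(2)

t, n = 1.0, 1000
dt = t / n

# Euler integration of dP/dt = F P + P F^T + L Q L^T
P = P0.copy()
for _ in range(n):
    P += dt * (F @ P + P @ F.T + L @ Q @ L.T)

# closed form: exp(F t) P0 exp(F t)^T + integral term, via midpoint quadrature
P_exact = expm(F * t) @ P0 @ expm(F * t).T
for s in np.linspace(0, t, n, endpoint=False) + dt / 2:
    E = expm(F * (t - s))
    P_exact += dt * (E @ L @ Q @ L.T @ E.T)

print(np.max(np.abs(P - P_exact)))  # small, shrinks as n grows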

To see how quickly this treatment can go wrong, let's try to get the differential equations for the mean and covariance by directly differentiating the original expressions for the mean and covariance and only afterwards taking the expectation, which should not make a difference. For the mean, we indeed get the same result:
$$\begin{aligned} \mathbb{E}\left[\frac{d\mathbf{x}(t)}{dt}\right] &= \mathbb{E}[\mathbf{F} \mathbf{x}(t) + \mathbf{L} \mathbf{w}(t)] \\ &= \mathbf{F} \mathbb{E}[\mathbf{x}(t)] + \mathbf{L} \mathbb{E}[\mathbf{w}(t)] \\ &= \mathbf{F} \mathbb{E}[\mathbf{x}(t)] \end{aligned}$$

When we try the same thing for the covariance, we get the following:
$$\begin{aligned} \frac{d}{dt}\left[(\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T\right] &= \Big( \frac{d \mathbf{x}}{dt} - \frac{d \mathbf{m}}{dt} \Big) (\mathbf{x}-\mathbf{m})^T + (\mathbf{x}-\mathbf{m}) \Big( \frac{d \mathbf{x}}{dt} - \frac{d \mathbf{m}}{dt} \Big)^T \\ \frac{d}{dt}\mathbb{E}[(\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T] &= \mathbf{F}\mathbb{E}[(\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T] + \mathbb{E}[(\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T] \mathbf{F}^T \\ \frac{d\mathbf{P}(t)}{dt} &= \mathbf{F}\mathbf{P}(t) + \mathbf{P}(t)\mathbf{F}^T \\ &\neq \mathbf{F} \mathbf{P}(t) + \mathbf{P}(t) \mathbf{F}^T + \mathbf{L} \mathbf{Q} \mathbf{L}^T \end{aligned}$$
where in the second line we substituted the derivatives and took the expectation.

We see that the equation is wrong: it is missing the last term. This is because we cannot treat the stochastic process $\mathbf{w}(t)$ as a regular function and therefore cannot blindly apply the ordinary product rule.

6. Ito Integral

This seems easy enough: we can just solve this like we would solve any other ODE, right? Well, not quite. The problem is that the noise process $\mathbf{w}(t)$ is not a regular function, but a random process. This means that we cannot just integrate it like we would integrate a regular function. We need a new way to integrate random processes. This is where the Ito integral comes in.

To see the problem, let’s consider

7. Ito’s Lemma

So far we defined the stochastic integral in the Ito formulation. Now we want to see how we can solve these integrals in order to make progress with a proper treatment of our SDEs, beyond the heuristic treatment we did before. To do this, we will use Ito's lemma, which is a generalization of the chain rule to stochastic processes. It is basically a Taylor expansion adapted to the stochastic world and a key tool in stochastic calculus.

Let's sketch how we get to Ito's lemma by considering the Taylor expansion of a function $f(x)$ for a small increment $dx$ in ordinary calculus:
$$d f(x) = f'(x)\, dx + \frac{1}{2} f''(x)(dx)^2 + \ldots$$

Example: $f(x) = x^2$
$$d f(x) = 2 x\, dx + \frac{1}{2} f''(x) (dx)^2$$

However, for ordinary functions, $(dx)^2$ will be vanishingly small, since our partitions $dx$ are very small to begin with. The corresponding change in function value that this term induces is therefore negligible for a continuous and differentiable function.

Therefore, we can ignore the second term and write:
$$d f(x) = f'(x)\, dx = 2 x\, dx$$

What does this look like in stochastic calculus? We again consider the change in $f$, but now we consider $x$ to be a stochastic process and therefore denote it as $X_t$. We can write the change in $f(X_t)$ as:

$$df(X_t) = \frac{\partial f(X_t)}{\partial X_t} dX_t + \frac{1}{2} \frac{\partial^2 f(X_t)}{\partial X_t^2} (dX_t)^2$$

where $dX_t$ is the change in $X_t$ and $(dX_t)^2$ relates to the quadratic variation of $X_t$. Now for a stochastic process like Brownian motion, the squared increment in the limit of infinitely small partitions does not converge to zero but to the time step $dt$ (we described this before in the properties of Brownian motion). Therefore, we cannot ignore the second term in the Taylor expansion and have to keep it. This is the key insight of Ito's lemma.

We can use Ito’s lemma for all Ito processes.

Definition: An Ito process is an adapted stochastic process $X_t$ that can be expressed as the sum of an integral with respect to time and an integral with respect to a Brownian motion $W_t$:
$$dX_t = \mu(t, X_t)\, dt + \sigma(t, X_t)\, dW_t$$

Here $\mu(t, X_t)$ is the drift term and $\sigma(t, X_t)$ is the diffusion term. We can think of the drift term as the deterministic part of the process and the diffusion term as the stochastic part of the process.

Now we can formulate what Ito’s lemma is for these Ito processes:

Ito's lemma: Let $X_t$ be an Ito process and $f(t, X_t)$ be a function that is twice continuously differentiable in $X_t$ and once in $t$. Then $f(t, X_t)$ is also an Ito process, which we can denote $Y_t$, and we can write:
$$dY_t = df(t, X_t) = \frac{\partial f(t, X_t)}{\partial t} dt + \frac{\partial f(t, X_t)}{\partial X_t} dX_t + \frac{1}{2} \frac{\partial^2 f(t, X_t)}{\partial X_t^2} (dX_t)^2$$

where we can use $df(t, X_t) = f_t\, dt + f_X\, dX_t + \frac{1}{2}f_{XX}(dX_t)^2$ as a simpler notation. This is the stochastic version of the Taylor expansion we saw before. We can also write this in integral form:
$$Y_t = f(t, X_t) = f(t_0, X_{t_0}) + \int_{t_0}^t f_t(s, X_s)\, ds + \int_{t_0}^t f_X(s, X_s)\, dX_s + \frac{1}{2} \int_{t_0}^t f_{XX}(s, X_s)\, (dX_s)^2$$

where the first term is the initial value of f(t,Xt)f(t, X_t), the second term is the deterministic part of the process, the third term is the stochastic part of the process and the fourth term is the correction term that arises from the quadratic variation of the stochastic process.

Another version of Ito's lemma, in its differential form, that you will often see is the following:
$$dY_t = df(t, X_t) = \left(\frac{\partial f(t, X_t)}{\partial t} + \frac{\partial f(t, X_t)}{\partial x} \mu(t, X_t) + \frac{1}{2} \frac{\partial^2 f(t, X_t)}{\partial x^2} \sigma(t, X_t)^2\right) dt + \frac{\partial f(t, X_t)}{\partial x} \sigma(t, X_t)\, dW_t$$

Here we took the differential form above, substituted $dX_t$ with the Ito process, replaced $(dX_t)^2$ with the quadratic variation of the Ito process $\sigma(t, X_t)^2 dt$, and then simplified the resulting expression by grouping $dt$ and $dW_t$ terms.
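
We can verify Ito's lemma numerically in a simple case (a sketch with $f(x) = x^2$ applied to Brownian motion itself, i.e. $\mu = 0$, $\sigma = 1$; path counts and times are arbitrary): the lemma gives $d(B_t^2) = dt + 2 B_t\, dB_t$, so $B_t^2 - t$ is an Ito integral and should have mean zero.

# file: "ito_lemma_check.py" (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
M, n, t = 10_000, 500, 3.0
dt = t / n

dB = rng.normal(0, np.sqrt(dt), size=(M, n))
B = dB.cumsum(axis=1)

# Ito's lemma with f(x) = x^2: B_t^2 = t + int_0^t 2 B_s dB_s,
# so E[B_t^2 - t] should be ~0
print(np.mean(B[:, -1] ** 2 - t))

# the Ito integral (left-point rule) matches B_t^2 - t path by path,
# up to discretisation error
B_left = np.concatenate([np.zeros((M, 1)), B[:, :-1]], axis=1)
ito_integral = np.sum(2 * B_left * dB, axis=1)
print(np.mean(np.abs(ito_integral - (B[:, -1] ** 2 - t))))  # small, shrinks with n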

Ito's lemma can also be generalized to higher dimensions. Written in matrix form, it looks like this:
$$dY_t = df(t, X_t) = \left(\partial_t f + \nabla f^T \mu + \frac{1}{2} \text{Tr}\left(\nabla \nabla^T f\, \sigma \sigma^T\right)\right) dt + \nabla f^T \sigma\, dW_t$$

where $X_t$ is an $n$-dimensional Ito process and $f(t, X_t)$ is a scalar function that is twice continuously differentiable in $X_t$ and once in $t$.

8. Solving linear SDEs with Ito’s Lemma

Now that we have a practical formula in hand, let's use it to solve some linear SDEs.

Definition: A geometric Brownian motion (GBM, also known as exponential Brownian motion) is a continuous-time stochastic process $X_t$ defined by the following stochastic differential equation:

$$dX_t = \mu X_t\, dt + \sigma X_t\, dW_t$$

where $\mu$ is the drift of the process, $\sigma$ is the volatility of the process and $W_t$ is a Brownian motion. Both $\mu$ and $\sigma$ are constants.

Compared to the general Ito process, we added $X_t$ on the right-hand side as a multiplicative factor in both the drift and the diffusion term.

Here we see the first problem when solving SDEs: we have $X_t$ on the right-hand side of the equation, so simply integrating won't work. We need to use Ito's lemma, and for this we need to choose a good ansatz for our transformation function $Y_t = f(t, X_t)$ so that the resulting equation becomes easy to solve. If we divide both sides by $X_t$, we can write the SDE as:
$$\frac{1}{X_t} dX_t = \mu\, dt + \sigma\, dW_t$$

In this case, we choose $Y_t = \ln(X_t)$, since the $\frac{1}{X_t} dX_t$ on the left-hand side hints at the derivative of a logarithm. This transformation is a good choice since it will make the $X_t$ on the right-hand side of the original equation disappear, the derivative of $\ln(x)$ being $\frac{1}{x}$. Let us compute the three derivatives that we need:
$$\frac{\partial f(t, X_t)}{\partial t} = 0, \qquad \frac{\partial f(t, X_t)}{\partial X_t} = \frac{1}{X_t}, \qquad \frac{\partial^2 f(t, X_t)}{\partial X_t^2} = -\frac{1}{X_t^2}$$

With this, we can write Ito's lemma for our transformation function $Y_t = \ln(X_t)$ as:
$$dY_t = \frac{1}{X_t} dX_t - \frac{1}{2} \frac{1}{X_t^2} (dX_t)^2$$

Now we can substitute $dX_t$ with the original equation. To substitute $(dX_t)^2$, we need to use the rules for differentials in Ito calculus. We will talk about these more later, but for now we can use the following rules:
$$dt\, dt = 0, \qquad dW_t\, dt = 0, \qquad dW_t\, dW_t = dt$$

With this, we can write:
$$\begin{aligned} (dX_t)^2 &= (\mu X_t\, dt + \sigma X_t\, dW_t)^2 \\ &= \mu^2 X_t^2\, dt^2 + 2 \mu \sigma X_t^2\, dt\, dW_t + \sigma^2 X_t^2 (dW_t)^2 \\ &= \sigma^2 X_t^2\, dt \end{aligned}$$

where we used the rules above to simplify the expression. Now we can substitute this into our equation for $dY_t$:
$$\begin{aligned} dY_t &= \frac{1}{X_t} dX_t - \frac{1}{2} \frac{1}{X_t^2} (dX_t)^2 \\ &= \frac{1}{X_t} (\mu X_t\, dt + \sigma X_t\, dW_t) - \frac{1}{2} \frac{1}{X_t^2} \sigma^2 X_t^2\, dt \\ &= \mu\, dt + \sigma\, dW_t - \frac{1}{2} \sigma^2\, dt \\ &= \left(\mu - \frac{1}{2} \sigma^2\right) dt + \sigma\, dW_t \end{aligned}$$

We can see that the $X_t$ on the right-hand side disappeared; $X_t$ now only appears implicitly on the left-hand side via $dY_t = d(\ln(X_t))$. This is the key insight of Ito's lemma: we can transform a stochastic differential equation into an easier-to-solve differential equation by choosing a good transformation function $Y_t = f(t, X_t)$.

Now we can solve this equation by integrating both sides:
$$\begin{aligned} \int_{0}^t dY_s &= \int_{0}^t \left(\mu - \frac{1}{2} \sigma^2\right) ds + \int_{0}^t \sigma\, dW_s \\ Y_t - Y_0 &= \left(\mu - \frac{1}{2} \sigma^2\right) t + \sigma W_t \\ \ln(X_t) - \ln(X_0) &= \left(\mu - \frac{1}{2} \sigma^2\right) t + \sigma W_t \\ \ln\left(\frac{X_t}{X_0}\right) &= \left(\mu - \frac{1}{2} \sigma^2\right) t + \sigma W_t \\ \frac{X_t}{X_0} &= \exp\left(\left(\mu - \frac{1}{2} \sigma^2\right) t + \sigma W_t\right) \\ X_t &= X_0 \exp\left(\left(\mu - \frac{1}{2} \sigma^2\right) t + \sigma W_t\right) \end{aligned}$$

where we used that $Y_0 = \ln(X_0)$ and $W_0 = 0$.

This is the solution to the geometric Brownian motion. We can see that $X_t$ is log-normally distributed. This is a very important distribution that will appear again in the course, for example when we talk about the Girsanov theorem.

We can look at the interesting special case where our process has no drift, i.e. $\mu = 0$:
$$\begin{aligned} dX_t &= \sigma X_t\, dW_t \\ X_T - X_0 &= \int_{0}^T \sigma X_t\, dW_t \\ X_T &= X_0 \exp\left(-\frac{\sigma^2}{2}T + \sigma W_T\right) \end{aligned}$$

We can see that the right-hand side is an Ito integral, which is a martingale, and we can therefore conclude that $X_t$ is a martingale with constant expected value $\mathbb{E}[X_t] = X_0$. This is a very important property of driftless geometric Brownian motion that we will see again later.

We can look at this in code by simulating a geometric Brownian motion numerically:

# file: "geometric_brownian_motion.py"
import numpy as np
import matplotlib.pyplot as plt
M = 10 # number of simulations
t = 10 # Time
n = 100 # steps we want to see
dt = t/n # time step
#simulating the brownian motion
steps = np.random.normal(0, np.sqrt(dt), size=(M, n)).T
origin = np.zeros((1,M))
bm_paths = np.concatenate([origin, steps]).cumsum(axis=0)
time = np.linspace(0,t,n+1)
tt = np.full(shape=(M, n+1), fill_value=time)
tt = tt.T
plt.plot(tt,bm_paths)
plt.xlabel("Years (t)")
plt.ylabel("Move")
plt.show()
# change time steps to 1,000,000 to observe same quadratic variation along paths
[quadratic_variation(path) for path in bm_paths.T[:4]]
# output: 
# change simulations to 100,000 to observe convergence of variance to Time at a particular time step
[variance(path) for path in bm_paths[1:11]]
# output:

9. Ornstein-Uhlenbeck Process

Now let's look at another example of a linear SDE: the Ornstein-Uhlenbeck process. This is a process that is often used to model the velocity of a particle under the influence of friction. It is defined by the following SDE:
$$dX_t = \theta (\mu - X_t)\, dt + \sigma\, dW_t$$

where $\theta$ is the friction (mean-reversion) coefficient, $\mu$ is the long-term mean and $\sigma$ is the volatility.

Let's solve this equation in general, again using Ito's lemma. We first note that $\mu$ is intended as the long-term mean of the process $X_t$. We can therefore choose $Y_t = X_t - \mu$ as our transformation function and get the following differential:
$$dY_t = dX_t = -\theta Y_t\, dt + \sigma\, dW_t$$

where we used that $Y_t = X_t - \mu$, so $dY_t = dX_t$ and $\theta(\mu - X_t) = -\theta Y_t$. As we will see, the solution is a linear combination of a deterministic part and a stochastic part: the deterministic part is the long-term mean $\mu$ plus a term that decays exponentially towards it with rate $\theta$, and the stochastic part is an integral against Brownian motion weighted by a factor that decays exponentially to zero with rate $\theta$.

If we look at this equation for a bit, we can see that we are equating the change in $Y_t$ to a multiple of itself plus some noise. If we ignore the noise, this pattern of (roughly speaking) $\frac{dY_t}{dt} = -\theta Y_t$ is strikingly similar to the classic first-order ODE with the exponential function as its solution.

We can therefore do another change of variables and define $Z_t = \exp(\theta t) Y_t$ as our new transformation function. We can then write:
$$Z_t = f(t, \theta, Y_t) = \exp(\theta t) Y_t = \exp(\theta t) (X_t - \mu)$$

To solve this, we can use Ito's lemma again. Note that the second derivative with respect to $Y_t$ vanishes, so the correction term drops out:
$$\begin{aligned} d Z_t &= d f(t, \theta, Y_t) \\ &= \frac{\partial f(t, \theta, Y_t)}{\partial t} dt + \frac{\partial f(t, \theta, Y_t)}{\partial Y_t} dY_t + \frac{1}{2} \frac{\partial^2 f(t, \theta, Y_t)}{\partial Y_t^2} (dY_t)^2 \\ &= \partial_t [e^{\theta t} Y_t]\, dt + \partial_{Y_t} [e^{\theta t} Y_t]\, dY_t + 0 \\ &= \theta e^{\theta t} Y_t\, dt + e^{\theta t}\, dY_t \\ &= \theta e^{\theta t} Y_t\, dt + e^{\theta t} (-\theta Y_t\, dt + \sigma\, dW_t) \\ &= \sigma e^{\theta t}\, dW_t \\ Z_t &= Z_0 + \int_{0}^t \sigma e^{\theta s}\, dW_s = Z_0 + \sigma \int_{0}^t e^{\theta s}\, dW_s \end{aligned}$$

where we used that $dY_t = -\theta Y_t\, dt + \sigma\, dW_t$.

Now we just have to undo the change of variables; we first get the solution for $Y_t$ and then for $X_t$:
$$\begin{aligned} Y_t &= e^{-\theta t} Z_t \\ &= e^{-\theta t} \left(Z_0 + \sigma \int_{0}^t e^{\theta s}\, dW_s\right) \\ &= e^{-\theta t} \left(Y_0 + \sigma \int_{0}^t e^{\theta s}\, dW_s\right) \\ X_t &= Y_t + \mu \\ &= \mu + e^{-\theta t} \left(X_0 - \mu + \sigma \int_{0}^t e^{\theta s}\, dW_s\right) \\ &= \mu + e^{-\theta t} (X_0 - \mu) + \sigma \int_0^t e^{-\theta(t-s)}\, dW_s \end{aligned}$$

We now focus on the case where $\mu = 0$. In addition, we choose a parameterization where $\theta = \alpha$ and $\sigma = \sqrt{2\alpha}$, for reasons that will become clear later (mostly that our solutions come out nicely). With this, our solution for $X_t$ with initial condition $X_0$ becomes:
$$\begin{aligned} X_t &= e^{-\alpha t} X_0 + \sqrt{2 \alpha} \int_{0}^t e^{-\alpha (t-s)}\, dW_s \\ &\overset{d}{=} X_0 e^{-\alpha t} + \sqrt{1 - e^{-2 \alpha t}}\, W_1 \\ &\overset{d}{=} X_0 e^{-\alpha t} + W_{1-e^{-2 \alpha t}} \end{aligned}$$

where $W_{1-e^{-2 \alpha t}}$ is a Brownian motion evaluated at time $1-e^{-2 \alpha t}$, i.e. a Gaussian with variance $1-e^{-2 \alpha t}$, and $\overset{d}{=}$ denotes equality in distribution: the Ito integral is Gaussian with variance $2\alpha \int_0^t e^{-2\alpha(t-s)}\, ds = 1 - e^{-2\alpha t}$.

From this solution we can again get some intuition about the process by poking at it a bit. First, we can see that the process is mean-reverting: it is pulled back to the mean $\mu$ with rate $\alpha$. It is also the only nontrivial process that is at the same time a Gaussian process, a Markov process and temporally homogeneous. To see that the stationary distribution is indeed a Gaussian distribution, we can compute the mean and the variance of the process and then evaluate them in the limit $t \rightarrow \infty$:
$$\begin{aligned} \mathbb{E}[X_t] &= \mathbb{E}[X_0 e^{-\alpha t} + W_{1-e^{-2 \alpha t}}] \\ &= X_0 e^{-\alpha t} + \mathbb{E}[W_{1-e^{-2 \alpha t}}] \\ &= X_0 e^{-\alpha t} \end{aligned}$$

where we used that $\mathbb{E}[W_{1-e^{-2\alpha t}}] = 0$. Similarly, $\text{var}(X_t) = 1 - e^{-2\alpha t}$. As $t \rightarrow \infty$, the mean decays to $0$ and the variance approaches $1$, so the process converges to the stationary distribution $\mathcal{N}(0, 1)$.
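
The solution also gives us an exact simulation scheme for this OU process (a sketch; $\alpha$, the step size and path count are arbitrary choices): the transition over a step $\Delta$ is $X_{t+\Delta} = e^{-\alpha \Delta} X_t + \sqrt{1 - e^{-2\alpha\Delta}}\, \xi$ with $\xi \sim \mathcal{N}(0,1)$, and the samples indeed approach the stationary $\mathcal{N}(0, 1)$.

# file: "ou_process.py" (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
alpha, dt, n, M = 1.0, 0.01, 1000, 100_000
decay = np.exp(-alpha * dt)
noise_scale = np.sqrt(1 - np.exp(-2 * alpha * dt))

X = np.full(M, 3.0)    # start all paths far from the mean
for _ in range(n):     # exact OU transition, no discretisation error
    X = decay * X + noise_scale * rng.standard_normal(M)

print(X.mean(), X.var())  # ~0 and ~1: the stationary N(0, 1)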

10. Solving non-linear SDEs numerically

So far we only looked at linear SDEs and how to solve them. Can we do something similar for non-linear SDEs? The answer is sadly no, in general. But since we care about computers and have to discretise our equation anyway, we can use the Euler-Maruyama method to solve non-linear SDEs numerically. This is a numerical method that is similar to the Euler method for ODEs, but adapted to the stochastic setting.

(Figure: Euler-Maruyama discretisation)

You will see in the next lecture that many practical implementations of diffusion models correspond to Euler-Maruyama discretisations of SDEs.
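
As a preview, here is a minimal Euler-Maruyama sketch (shown on the geometric Brownian motion from earlier so we can compare against its exact solution; all parameters are illustrative). The step rule is $X_{t+\Delta} = X_t + \mu(t, X_t)\Delta + \sigma(t, X_t)\,\Delta W$ with $\Delta W \sim \mathcal{N}(0, \Delta)$, and it works just as well when the drift and diffusion are non-linear.

# file: "euler_maruyama.py" (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, X0, T, n = 0.1, 0.3, 1.0, 1.0, 1000
dt = T / n

X, W = X0, 0.0
for _ in range(n):
    dW = rng.normal(0, np.sqrt(dt))
    X += mu * X * dt + sigma * X * dW  # Euler-Maruyama step for dX = mu X dt + sigma X dW
    W += dW

# compare against the exact GBM solution driven by the same Brownian path
exact = X0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * W)
print(X, exact)  # close for small dt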

Credits

Code examples adapted from QuantPy. Much of the logic in the lecture is based on the Øksendal book as well as the Särkkä & Solin book. Title image from Wikipedia.