In this article we continue the exploration of functional principal component analysis (FPCA) started in the previous article. We begin with a deeper look into Mercer’s theorem. The theorem states that every continuous, symmetric, positive-definite function $K(x, y) : \Omega \times \Omega \rightarrow \mathbb{R}$, with $\Omega \subset \mathbb{R}^d$, can be expressed as an infinite sum of product functions,
\[K(x, y) = \sum_{\ell=1}^\infty \lambda_\ell \varphi_\ell(x) \varphi_\ell(y),\]where the $\lambda_\ell$ are non-negative eigenvalues and the $\varphi_\ell$ the corresponding eigenfunctions. The convergence is absolute and uniform. This extends a classical result on symmetric positive-definite matrices to real functions.
Our interest in the above theorem is for functions $K(x, y)$ that are kernel functions. A kernel function is a function that is symmetric, $K(x, y) = K(y, x), \forall x, y \in \Omega$, and positive definite, meaning that for any finite set of points $\{x_1, x_2, \ldots, x_n \} \subset \Omega$ and any real numbers $\{ c_1, c_2, \ldots, c_n \}$, the inequality \(\sum_{i, j} c_i c_j K(x_i, x_j) \ge 0\) holds. This means that the Gram matrix $G \in \mathbb{R}^{n \times n}$ with elements $G_{i, j} = K(x_i, x_j)$ is symmetric positive semi-definite, a property we check numerically for the Gaussian kernel right after the list of examples below.
Common examples of kernel functions are, $\forall x, y \in \Omega = \mathbb{R}^d$,
- the linear kernel $K(x, y) = x^T y$,
- the polynomial kernel $K(x, y) = (x^T y + c)^n$, with $c \ge 0, n\ge 1$,
- the Gaussian kernel $K(x, y) = e^{ -\frac{\lVert x - y\rVert^2}{2 \sigma^2} }$, with $\sigma > 0$,
- the Laplacian kernel $K(x, y) = e^{-\alpha \lVert x - y \rVert}, \alpha > 0$, and
- the Abel kernel $K(x, y) = e^{-\alpha \lVert x - y \rVert}, \alpha > 0, d = 1$.
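As a quick numerical illustration of the property above, the sketch below (not part of the original derivation) builds the Gram matrix of the Gaussian kernel on a handful of arbitrary points in $\mathbb{R}^2$ and checks that its smallest eigenvalue is non-negative up to round-off; the choice of points and of $\sigma$ is arbitrary.
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# a few arbitrary points in R^2
rng = np.random.default_rng(42)
points = rng.normal(size=(10, 2))

# Gram matrix G_{ij} = K(x_i, x_j)
G = np.array([[gaussian_kernel(p, q) for q in points] for p in points])

assert np.allclose(G, G.T)          # symmetry
print(np.linalg.eigvalsh(G).min())  # non-negative, up to round-off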
As a simple example, we look at $\Omega = [a, b] \subset \mathbb{R}$ and $K(x, y) = f(x) f(y)$ for a generic function $f$. An eigenfunction $\varphi(x)$ must satisfy
\[\begin{aligned} \int_a^b K(x, y) \varphi(y) dy & = \lambda \varphi(x) \\ \int_a^b f(x) f(y) \varphi(y) dy & = \lambda \varphi(x) \\ f(x) \int_a^b f(y)\varphi(y) dy & = \lambda \varphi(x). \end{aligned}\]Assuming $\lambda \neq 0$, we must have $\varphi(x) = \alpha f(x)$ for some $\alpha \in \mathbb{R}$, and consequently
\[f(x) \int_a^b f(y) \alpha f(y) dy = \lambda \alpha f(x)\]which yields
\[\lambda = \int_a^b f(y)^2 dy.\]To get the normalized eigenfunctions, we need
\[\int_a^b \alpha^2 f(y)^2 dy = 1,\]that is \(\alpha = \left( \int_a^b f(y)^2 dy \right)^{-1/2}.\)
It also follows that
\[\begin{aligned} \lambda \varphi(x) \varphi(y) & = \lambda \lambda^{-1/2} f(x) \lambda^{-1/2} f(y) \\ & = f(x) f(y) \\ & = K(x, y), \end{aligned}\]as expected from Mercer’s theorem, since there is only one eigenvalue (and eigenfunction).
In the case $f(x) = x$, it is easy to see that
\[\lambda = \int_a^b y^2 dy = \frac{1}{3}(b^3-a^3)\]and
\[\alpha = \left( \int_a^b y^2 dy \right)^{-1/2} = \lambda^{-1/2}.\]Therefore, \(\varphi(x) = \lambda^{-1/2} x.\)
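As a sanity check (not in the original article), we can discretize the integral operator with kernel $K(x, y) = x y$ on a uniform grid and verify that it has a single non-negligible eigenvalue close to $\frac{1}{3}(b^3 - a^3)$; the interval below is chosen arbitrarily.
import numpy as np

a, b = -1.0, 2.0                       # arbitrary interval
n = 2000
x = np.linspace(a, b, n)
dx = (b - a) / (n - 1)

# discretized integral operator for K(x, y) = x * y
K = np.outer(x, x) * dx

eigvals = np.linalg.eigvalsh(K)[::-1]  # sorted from largest to smallest
print(eigvals[0], (b**3 - a**3) / 3)   # both close to 3 for this interval
print(abs(eigvals[1]))                 # the remaining eigenvalues are essentially zero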
After this small detour on Mercer’s theorem, we go back to the Karhunen-Loève decomposition introduced in the previous article, focusing on two-dimensional problems, that is $d=2$. We consider a sequence \(\{ X_t, t=1, \ldots, T\},\) where each $X_t$ is a sample of an unknown function $X(t, x, y)$, that is, given a set of sampling points \(\{(x_i, y_i), i=1, \ldots, I \},\) we have \(X_t = \{X_{t, i} = X(t, x_i, y_i), i=1, \ldots, I\}.\) For simplicity we take $(x, y) \in \Omega = [-1, 1] \times [-1, 1]$, noting that different domains can easily be mapped to $\Omega$.
Given the finite sequence \(\{X_t\}\), we can produce an estimate of the covariance $C(x, y, u, v)$ of the process between two sets of points $(x, y) \in \Omega$ and $(u, v) \in \Omega$ and then apply Mercer’s theorem to it to identify the most important modes of such covariance operator. This means that we will identify a set of functions \(\{\varphi_\ell(x, y) : \Omega \rightarrow \mathbb{R}, \ell=1,\ldots, L\}\) that explain most of the variance in the process. Both $C$ and the $\varphi_\ell$, being continuous functions, are infinite-dimensional objects, so the first step is to approximate them in a finite-dimensional space. We will indicate with \(\{ \psi_b(x, y) : \Omega \rightarrow \mathbb{R}, b=1, \ldots, B \}\) the set of basis functions of such space and operate exclusively in this space.
The basis functions we will use are built from Legendre polynomials, a choice that has certain advantages for our procedure, as we will see later on. The normalized Legendre polynomial of order $m$ has the form
\[\mathcal{L}_m(x) = \sqrt{\frac{2m + 1}{2}} \frac{1}{2^m m!} \frac{d^m}{dx^m} (x^2 - 1)^m,\]and these polynomials are orthonormal in the sense that
\[\int_{-1}^1 \mathcal{L}_i(x) \mathcal{L}_j(x) dx = \delta_{ij},\]with $\delta_{ij}$ the Kronecker delta. Let’s visualize those polynomials.
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import numpy as np
from scipy.special import eval_legendre

# plot the first 16 normalized Legendre polynomials on [-1, 1]
X = np.linspace(-1, 1)
fig, axes = plt.subplots(figsize=(10, 10), nrows=4, ncols=4, sharex=True, sharey=True)
axes = axes.flatten()
for n, ax in zip(range(0, 16), axes):
    # the factor sqrt((2n + 1) / 2) makes the L^2 norm on [-1, 1] equal to one
    y = eval_legendre(n, X) * np.sqrt((2 * n + 1) / 2)
    ax.plot(X, y, label=r'$P_{}(x)$'.format(n))
    ax.set_title(f'n={n}', fontdict=dict(size=10))
fig.tight_layout()
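As a quick check (not in the original code), we can also verify the orthonormality of the normalized polynomials numerically, for instance with scipy.integrate.quad; the first four orders suffice for illustration.
from scipy.integrate import quad

def norm_legendre(m, x):
    # normalized Legendre polynomial of order m
    return eval_legendre(m, x) * np.sqrt((2 * m + 1) / 2)

for i in range(4):
    for j in range(4):
        val, _ = quad(lambda t: norm_legendre(i, t) * norm_legendre(j, t), -1, 1)
        print(f'i={i}, j={j}, integral={val:+.6f}')  # ~1 if i == j, ~0 otherwise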
To create basis functions for our two-dimensional case, we take products of Legendre polynomials, with the first factor applied to $x$ and the second to $y$, keeping only the products $\mathcal{L}_i(x) \mathcal{L}_j(y)$ whose degrees sum to at most $m$, that is $i + j \le m$. Specifically, the basis functions $\psi_b(x, y), b=1, \ldots, B$ are defined as
\[\begin{aligned} \psi_1(x, y) & = \mathcal{L}_0(x) \mathcal{L}_0(y) \\ \psi_2(x, y) & = \mathcal{L}_1(x) \mathcal{L}_0(y) \\ \psi_3(x, y) & = \mathcal{L}_0(x) \mathcal{L}_1(y) \end{aligned}\]and so on. This gives a total of $B=\frac{1}{2}(m + 1)(m + 2)$ basis functions for a given value of $m$. Because of the orthonormality discussed above, we have
\[\int_{-1}^1 \int_{-1}^1 \psi_i(x, y) \psi_j(x, y) dx dy = \delta_{ij}.\]It is worthwhile to visualize those basis functions and appreciate how oscillatory they become as the degree increases.
# normalized product of Legendre polynomials of degree i (in x) and j (in y)
func2d = lambda x, y, i, j: eval_legendre(i, x) / np.sqrt(2 / (2 * i + 1)) * eval_legendre(j, y) / np.sqrt(2 / (2 * j + 1))
n = 101
x = np.linspace(-1, 1, n)
y = np.linspace(-1, 1, n)
xx, yy = np.meshgrid(x, y)
fig = plt.figure(figsize=(20, 20))
counter = 0
m = 8
for i in range(m):
    for j in range(m):
        ax = fig.add_subplot(m, m, counter + 1, projection='3d')
        counter += 1
        # evaluate the product basis function on the grid and plot it as a surface
        zz = func2d(xx.flatten(), yy.flatten(), i, j).reshape(n, n)
        ax.plot_surface(xx, yy, zz, linewidth=0, antialiased=True)
        ax.axis('off')
fig.tight_layout()
The figure above suggests that large values of $m$ may lead to numerical instabilities, since the high-degree polynomials oscillate strongly, while moderate values of $m$ produce smooth basis functions.
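Before moving on to the projection, it helps to enumerate the two-dimensional basis explicitly. The helpers below are a sketch, assuming the triangular truncation $i + j \le m$ described above; `basis_degrees` and `eval_basis` are hypothetical names introduced here for illustration, and they are reused in the sketches that follow.
def basis_degrees(m):
    # all degree pairs (i, j) with i + j <= m, ordered by total degree
    return [(total - j, j) for total in range(m + 1) for j in range(total + 1)]

def eval_basis(x, y, degrees):
    # design matrix whose column b contains psi_b evaluated at the points (x_k, y_k)
    norm_leg = lambda k, t: eval_legendre(k, t) * np.sqrt((2 * k + 1) / 2)
    return np.column_stack([norm_leg(i, x) * norm_leg(j, y) for (i, j) in degrees])

degrees = basis_degrees(4)
print(len(degrees), (4 + 1) * (4 + 2) // 2)  # both equal 15, i.e. B = (m + 1)(m + 2) / 2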
We can now project any given $X_t$ into the space spanned by our basis functions. The best approximation $\hat{X}_t$ is
\[\hat X_t = \sum_{b=1}^B a_{t, b} \psi_b (x, y), \quad \quad \forall (x, y) \in \Omega,\]where the coefficients $a_{t, b}$ are estimated by minimizing the least square error,
\[a_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^B} \sum_{i=1}^I \left( X_{t, i} - \sum_{b=1}^B \alpha_b \psi_b(x_i, y_i) \right)^2,\]so each $a_t \in \mathbb{R}^B$ is a vector with $B$ elements. We assume $I \gg B$ such that the problem above is well-posed. After this step, the original sequence of $X_t$ is replaced by the sequence of the projections $\hat X_t$.
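A minimal sketch of this projection step, reusing the hypothetical helpers `basis_degrees` and `eval_basis` from above and synthetic observations generated only for illustration, could look as follows; `np.linalg.lstsq` solves the least-squares problems for all $t$ at once.
# synthetic observations: T samples at I random sampling points (illustration only)
rng = np.random.default_rng(0)
T, I, m = 50, 500, 4
xs, ys = rng.uniform(-1, 1, size=(2, I))
signal = np.sin(np.pi * xs) * ys
X = np.outer(rng.normal(size=T), signal) + 0.1 * rng.normal(size=(T, I))

degrees = basis_degrees(m)
Psi = eval_basis(xs, ys, degrees)               # shape (I, B)

# each row a_t of A minimizes the squared error over the sampling points
A, *_ = np.linalg.lstsq(Psi, X.T, rcond=None)   # shape (B, T)
A = A.T                                         # shape (T, B)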
To apply the Karhunen-Loève decomposition, we express the eigenfunctions $\varphi_\ell(x, y), \ell=1, \ldots, L$ as linear combinations of the $\psi_b$ defined above,
\[\varphi_\ell(x, y) = \sum_{b=1}^B c_{\ell, b} \psi_b(x, y).\]Denoting by $A = \{a_{t, b}\} \in \mathbb{R}^{T \times B}$ the matrix collecting the coefficients of all the observations, by $\Psi(x, y) = (\psi_1(x, y), \ldots, \psi_B(x, y))^T$ the column vector of the $B$ basis functions, and by $c_\ell = (c_{\ell, 1}, \ldots, c_{\ell, B})^T$ the column vector of coefficients of the $\ell$-th eigenfunction, we obtain
\[\hat X(x, y) = A \Psi(x, y)\]and
\[\varphi_\ell(x, y) = \Psi(x, y)^T c_\ell.\]From the $T$ observations we can empirically estimate the covariance function with a simple averaging,
\[\hat C(x, y, u, v) = \frac{1}{T}\Psi(x, y)^T A^T A \Psi(u, v),\]where we have assumed for simplicity of exposition that the observations have zero mean.
Since each $\varphi_\ell$ is an eigenfunction, we must have
\[\begin{aligned} \int_{-1}^1 \int_{-1}^1 \left[ \frac{1}{T}\Psi(x, y)^T A^T A \Psi(u, v) \right] \Psi(u, v)^T c_\ell \, du \, dv & = \lambda_\ell \Psi(x, y)^T c_\ell \\ \frac{1}{T}\Psi(x, y)^T A^T A \int_{-1}^1 \int_{-1}^1 \Psi(u, v) \Psi(u, v)^T \, du \, dv \, c_\ell & = \lambda_\ell \Psi(x, y)^T c_\ell. \end{aligned}\]The integral can be easily computed because of the orthonormality of the basis functions $\psi_b$: it equals the $B \times B$ identity matrix, giving
\[\frac{1}{T}\Psi(x, y)^T A^T A c_\ell = \lambda_\ell \Psi(x, y)^T c_\ell, \quad \quad \forall x,y \in \Omega\]that is,
\[\frac{1}{T} A^T A c_\ell = \lambda_\ell c_\ell,\]which identifies $\lambda_\ell$ and $c_\ell$ as the $\ell$-th eigenvalue and eigenvector of $\frac{1}{T} A^T A$, with the eigenvalues sorted in decreasing order. This is equivalent to performing principal component analysis on the matrix $A$. Once the coefficients $c_\ell$ are computed, we can easily assemble $\varphi_\ell(x, y)$ as a linear combination of the $\psi_b(x, y)$; the eigenvalues $\lambda_\ell$ give their relative importance in explaining the variance, meaning that we can select the smallest $L$ that explains a predefined level of variance, say 99% or 99.5%.
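Continuing the sketch above (with the hypothetical matrix `A` from the projection step, whose rows are assumed to be approximately zero mean), the eigen-decomposition and the choice of $L$ could be implemented along these lines.
# eigen-decomposition of (1/T) A^T A, with eigenvalues sorted in decreasing order
S = A.T @ A / A.shape[0]
lam, vecs = np.linalg.eigh(S)
lam, vecs = lam[::-1], vecs[:, ::-1]

# smallest L explaining at least 99% of the variance
explained = np.cumsum(lam) / np.sum(lam)
L = int(np.searchsorted(explained, 0.99)) + 1

# coefficients c_ell of the leading eigenfunctions in the psi basis
c = vecs[:, :L]                                          # shape (B, L)

# evaluate the eigenfunctions phi_ell on a grid, reusing eval_basis
grid = np.linspace(-1, 1, 51)
gx, gy = np.meshgrid(grid, grid)
Phi = eval_basis(gx.ravel(), gy.ravel(), degrees) @ c    # shape (51 * 51, L)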