Mathematical Background¶

This page describes the eigensubstitution accumulation algorithm implemented by the library, following Holmes & Rubin (2002).

Problem statement¶

Given:

A multiple sequence alignment $x$ of $R$ sequences over $C$ columns
A binary phylogenetic tree $T$ with branch lengths $t_n$
A continuous-time Markov substitution model with rate matrix $Q$ and equilibrium distribution $\pi$

Compute, for each column $c$:

Log-likelihood $\log P(x_c \mid T, Q)$
Expected substitution counts $E[s_{ij,c}]$ for each state pair $(i,j)$, $i \ne j$
Expected dwell times $E[w_{i,c}]$ for each state $i$
Posterior root state $P(\text{root} = a \mid x_c)$

Substitution model¶

General CTMC¶

A continuous-time Markov chain on $A$ states is parameterized by: - Rate matrix $Q$ with $Q_{ij} \ge 0$ for $i \ne j$ and $\sum_j Q_{ij} = 0$ - Equilibrium distribution $\pi$ satisfying $\pi^T Q = 0$

Eigendecomposition¶

The rate matrix is diagonalized as $Q = V \Lambda V^{-1}$ where $\Lambda = \text{diag}(\mu_0, \ldots, \mu_{A-1})$ are eigenvalues and $V$ contains the corresponding right eigenvectors as columns. In general, $\mu_k$ and $V$ may be complex-valued. We always have $\mu_0 = 0$ (corresponding to the stationary distribution), and $\text{Re}(\mu_k) \le 0$ for all $k$.

The transition probability matrix is: $$M_{ab}(t) = \left[ e^{Qt} \right]_{ab} = \sum_k V_{ak} \, e^{\mu_k t} \, V^{-1}_{kb}$$

Note: $V^{-1}$ is the matrix of left eigenvectors (rows). For a general rate matrix, $V^{-1} \ne V^T$; the left and right eigenvectors are biorthogonal ($V^{-1} V = I$) but not individually orthonormal.

Reversible models (special case)¶

A model is reversible if it satisfies detailed balance: $\pi_i Q_{ij} = \pi_j Q_{ji}$. In this case, the symmetrized rate matrix $$S_{ij} = \sqrt{\pi_i} \, Q_{ij} / \sqrt{\pi_j}$$ is real symmetric. Its eigendecomposition $S = V \Lambda V^T$ yields real eigenvalues $\mu_k$ and an orthogonal eigenvector matrix $V$ (i.e. $V^{-1} = V^T$).

The transition probability matrix simplifies to: $$M_{ab}(t) = \sqrt{\frac{\pi_b}{\pi_a}} \sum_k V_{ak} \, e^{\mu_k t} \, V_{bk}$$

This is the basis for the symmetrized formulation used throughout the reversible path.

Supported models¶

Model	States	Reversible	Eigendecomposition
HKY85	4 (A,C,G,T)	Yes	Closed-form (4 analytical eigenvectors)
Jukes-Cantor	$A$ (any)	Yes	$\mu_0 = 0$, $\mu_k = -A/(A-1)$ for $k \ge 1$; QR complement
F81	$A$ (any)	Yes	$\mu_0 = 0$, $\mu_k = -\mu$ for $k \ge 1$; QR complement
General	$A$ (any)	No	Full complex eigendecomposition via `eig`

Auto-detection of reversibility is supported via a detailed balance check with configurable tolerance. A rate matrix satisfying detailed balance is automatically routed to the real symmetric (eigh) path; otherwise the complex (eig) path is used.

Rate heterogeneity¶

A discretized gamma distribution (Yang 1994) with shape $\alpha$ and $K$ categories provides rate multipliers $r_k$ at quantile medians, with equal weights $1/K$. Each rate category scales the eigenvalues: $\mu'_k = r_k \cdot \mu_k$.

Algorithm¶

Steps 1 and 2 operate on the transition matrices $M(t)$ directly and are identical for reversible and irreversible models. Steps 3 and 4 work in the eigenbasis and differ in how vectors are projected and back-transformed.

Step 1: Felsenstein pruning (upward/inside pass)¶

For each node $n$ in postorder (leaves to root), the inside vector $U^{(n)}_a(c)$ gives the likelihood of the data below node $n$ given that $n$ is in state $a$ at column $c$.

Leaf initialization:

$U^{(n)}_a(c) = \delta_{a, x_{n,c}}$ if observed
$U^{(n)}_a(c) = 1$ for all $a$ if gapped or unobserved

Internal node recursion: $$U^{(p)}_b(c) = \prod_{\text{child } n \text{ of } p} \left[ \sum_j M_{bj}(t_n) \cdot U^{(n)}_j(c) \right]$$

Rescaling: After each multiplication, divide by $\max_b U^{(p)}_b(c)$ and accumulate the log-normalizer $\lambda^{(p)}(c)$ to prevent underflow.

Log-likelihood: $$\log P(x_c) = \lambda^{(0)}(c) + \log \left[ \sum_a \pi_a \cdot U^{(0)}_a(c) \right]$$

Step 2: Outside algorithm (downward pass)¶

For each non-root node $n$ in preorder, the outside vector $D^{(n)}_a(c)$ gives the likelihood of all data except the subtree below $n$, given that $n$ is in state $a$.

For node $n$ with parent $p$ and sibling $s$: $$D^{(n)}_a(c) = \left[ \sum_j M_{aj}(t_s) \cdot U^{(s)}_j(c) \right] \cdot L(x_p \mid a) \cdot \begin{cases} \pi_a & \text{if } p = \text{root} \\ \sum_i D^{(p)}_i(c) \cdot M_{ia}(t_p) & \text{otherwise} \end{cases}$$

where $L(x_p \mid a)$ is the observation likelihood at the parent node.

Step 3: Eigensubstitution accumulation¶

The key insight of Holmes & Rubin (2002) is that the branch integral can be performed in the eigenbasis of $Q$, deferring the expensive back-transformation until after summing over branches.

J interaction matrix¶

For each branch $n$ with length $t_n$, define the pairwise interaction between eigenmodes $k$ and $l$: $$J_{kl}(t_n) = \begin{cases} t_n \cdot e^{\mu_k t_n} & \text{if } \mu_k \approx \mu_l \\ \frac{e^{\mu_k t_n} - e^{\mu_l t_n}}{\mu_k - \mu_l} & \text{otherwise} \end{cases}$$

This is the integral $\int_0^{t_n} e^{\mu_k s} \, e^{\mu_l(t_n - s)} \, ds$. For irreversible models, $\mu_k$ and $J_{kl}$ are complex-valued.

Eigenbasis projection¶

General (irreversible) case. Project inside/outside vectors using the right eigenvectors $V$ and their inverse $V^{-1}$: $$\tilde{U}^{(n)}_l(c) = \sum_b U^{(n)}_b(c) \cdot V_{bl}$$ $$\tilde{D}^{(n)}_k(c) = \sum_a D^{(n)}_a(c) \cdot (V^{-1})^T_{ak} = \sum_a D^{(n)}_a(c) \cdot V^{-1}_{ka}$$

These are complex-valued when $V$ is complex.

Reversible simplification. When $V^{-1} = V^T$ (orthogonal eigenvectors from the symmetrized decomposition), the projections absorb a $\sqrt{\pi}$ weighting to account for the symmetrization: $$\tilde{U}^{(n)}_l(c) = \sum_b U^{(n)}_b(c) \cdot V_{bl} \cdot \sqrt{\pi_b}$$ $$\tilde{D}^{(n)}_k(c) = \sum_a D^{(n)}_a(c) \cdot V_{ak} / \sqrt{\pi_a}$$

These are always real-valued.

Accumulation¶

Sum over all non-root branches: $$C_{kl}(c) = \sum_{n > 0} \tilde{D}^{(n)}_k(c) \cdot J_{kl}(t_n) \cdot \tilde{U}^{(n)}_l(c) \cdot \text{scale}(n,c)$$

where $\text{scale}(n,c) = \exp(\lambda_D^{(n)}(c) + \lambda_U^{(n)}(c) - \log P(x_c))$.

For irreversible models, $C_{kl}$ is complex; for reversible models, it is real. The formula is identical in both cases.

Step 4: Back-transformation¶

Transform eigenbasis counts to natural basis and recover the rate matrix.

General (irreversible) case. Use $V$ and $V^{-1}$: $$\text{VCV}_{ij}(c) = \sum_{kl} V_{ik} \cdot C_{kl}(c) \cdot V^{-1}_{lj}$$

Reconstruct the rate matrix: $Q_{ij} = \sum_k V_{ik} \, \mu_k \, V^{-1}_{kj}$.

Dwell times (diagonal): $E[w_i(c)] = \text{Re}(\text{VCV}_{ii}(c))$
Substitution counts (off-diagonal): $E[s_{ij}(c)] = \text{Re}(Q_{ij} \cdot \text{VCV}_{ij}(c))$

The final result is guaranteed real because complex eigenvalues and eigenvectors come in conjugate pairs.

Reversible simplification. Since $V^{-1} = V^T$, the back-transformation uses $V$ on both sides: $$\text{VCV}_{ij}(c) = \sum_{kl} V_{ik} \cdot C_{kl}(c) \cdot V_{jl}$$

The rate matrix is replaced by its symmetrized form $S_{ij} = \sum_k V_{ik} \, \mu_k \, V_{jk}$:

Dwell times: $E[w_i(c)] = \text{VCV}_{ii}(c)$
Substitution counts: $E[s_{ij}(c)] = S_{ij} \cdot \text{VCV}_{ij}(c)$

All quantities are real throughout.

Standalone branch integrals (ExpectedCounts)¶

Independent of any alignment or tree, we can compute the expected substitution counts and dwell times for a single CTMC branch of length $t$, conditioned on the endpoint states:

$$E[N_{i \to j}(t) \mid X(0)=a,\, X(t)=b] \quad (i \ne j)$$ $$E[T_i(t) \mid X(0)=a,\, X(t)=b]$$

Branch integral tensor¶

Define the branch integral $W_{abij}(t) = \int_0^t M_{ai}(s) \, M_{jb}(t-s) \, ds$.

General (irreversible) case. In the eigenbasis of $Q = V \Lambda V^{-1}$: $$W_{abij}(t) = \sum_{kl} V_{ak} \, V^{-1}_{ki} \cdot J_{kl}(t) \cdot V_{jl} \, V^{-1}_{lb}$$

The expected counts are: - Dwell: $\text{result}[a,b,i,i] = \text{Re}(W_{abii}) / M_{ab}$ - Subs: $\text{result}[a,b,i,j] = \text{Re}(Q_{ij} \cdot W_{abij}) / M_{ab}$ for $i \ne j$

where $M_{ab} = M_{ab}(t) = \sum_k V_{ak} e^{\mu_k t} V^{-1}_{kb}$ is the transition probability.

Reversible simplification. In the symmetrized eigenbasis ($V^{-1} = V^T$): $$W_{abij}(t) = \sum_{kl} V_{ak} \, V_{ik} \cdot J_{kl}(t) \cdot V_{bl} \, V_{jl}$$

The expected counts are: - Dwell: $\text{result}[a,b,i,i] = W_{abii} / S_t[a,b]$ - Subs: $\text{result}[a,b,i,j] = S_{ij} \cdot W_{abij} / S_t[a,b]$ for $i \ne j$

where $S_t[a,b] = \sum_k V_{ak} e^{\mu_k t} V_{bk}$ is the symmetrized transition matrix (not the actual transition matrix $M_{ab} = \sqrt{\pi_b/\pi_a} \cdot S_t[a,b]$).

Key properties¶

For all reachable $(a,b)$ with $M_{ab}(t) > 0$:

Dwell times sum to $t$: $\sum_i \text{result}[a,b,i,i] = t$
Non-negative: all entries $\ge 0$
$t=0$ limit: all entries are zero

Relationship to tree-based counts¶

BranchCounts (the tree-based per-branch expected counts from the inside-outside algorithm) returns the EM sufficient statistics, which differ from ExpectedCounts by a factor of the transition probability:

$$\text{BranchCounts}[r, i, j, c] = M_{ab}(t_r) \cdot \text{ExpectedCounts}[a,b,i,j]$$

where $(a,b)$ are the posterior endpoint states. This normalization convention is correct for the EM M-step, where the transition probability cancels in the ratio $E[N_{ij}] / E[T_i]$.

F81/Jukes-Cantor fast path¶

For F81 (and its special case Jukes-Cantor), all non-zero eigenvalues are equal ($\mu_k = -\mu$ for $k \ge 1$), enabling a direct $O(CRA^2)$ computation without the eigenbasis. The transition probability is: $$M_{ij}(t) = \delta_{ij} \cdot e^{-\mu t} + \pi_j \cdot (1 - e^{-\mu t})$$

The integral $I^{ab}_{ij}(t)$ decomposes into three closed-form terms parameterized by:

$\alpha(t) = t \cdot e^{-\mu t}$
$\beta(t) = (1 - e^{-\mu t})/\mu - t \cdot e^{-\mu t}$
$\gamma(t) = t(1 + e^{-\mu t}) - 2(1 - e^{-\mu t})/\mu$

Gap handling: Steiner tree branch masking¶

Not all branches are informative for every column. The minimum Steiner tree of ungapped leaves identifies which branches carry signal:

Classify leaves as ungapped (tokens $0 \ldots A-1$ or $A$) or gapped ($-1$, $A+1$)
Upward pass: propagate "has ungapped descendant" flags
Identify Steiner nodes: ungapped leaves, branching points ($\ge 2$ ungapped children), and pass-through nodes
A branch $p \to n$ is active iff both $p$ and $n$ are in the Steiner tree

Mixture model posterior¶

For a mixture of $K$ rate-scaled models with prior weights $w_k$: $$P(k \mid x_c) = \frac{w_k \cdot P(x_c \mid k)}{\sum_{k'} w_{k'} \cdot P(x_c \mid k')} = \text{softmax}_k(\log P(x_c \mid k) + \log w_k)$$

References¶

Holmes, I. & Rubin, G.M. (2002). An Expectation Maximization Algorithm for Training Hidden Substitution Models. Journal of Molecular Biology, 317(5), 753-764.
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17, 368-376.
Yang, Z. (1994). Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites. Journal of Molecular Evolution, 39, 306-314.
Hasegawa, M., Kishino, H., & Yano, T. (1985). Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 22, 160-174.