Oracle API Reference¶
The oracle is a pure NumPy reference implementation using explicit for-loops. It serves as the numerically trustworthy test oracle for all other backends (atol=1e-8 vs JAX).
Dependencies: numpy, scipy (for gammainc only).
Data representations¶
Tree¶
From a Newick string:
from subby.formats import Tree
tree_result = parse_newick("((A:0.1,B:0.2):0.05,C:0.3);")
tree = Tree(parentIndex=tree_result['parentIndex'],
distanceToParent=tree_result['distanceToParent'])
Or construct directly with numpy arrays:
tree = Tree(
parentIndex=np.array([-1, 0, 0, 1, 1], dtype=np.int32),
distanceToParent=np.array([0.0, 0.1, 0.2, 0.15, 0.25]),
)
Model¶
A plain dict (constructed via model functions below):
model = jukes_cantor_model(4)
# {'eigenvalues': array([0., -1.333, -1.333, -1.333]),
# 'eigenvectors': V, # (4, 4) orthogonal matrix
# 'pi': array([0.25, 0.25, 0.25, 0.25])}
Alignment¶
Use parsers to create alignments from standard formats (see Format parsers below):
aln = parse_fasta(">A\nACGT\n>B\nTGCA\n")
aln = parse_dict({"A": "ACGT", "B": "TGCA"})
combined = combine_tree_alignment(tree_result, aln)
The resulting (R, C) int32 array uses this token encoding:
| Token | Meaning | Likelihood vector |
|---|---|---|
0 to A-1 |
Observed state (alphabet position) | One-hot |
A |
Ungapped, unobserved (internal nodes) | All ones |
A+1 or -1 |
Gapped (- or .) |
All ones |
Alphabet detection is automatic: DNA (ACGT), RNA (ACGU), protein (20 AAs), or sorted unique characters. Override with the alphabet parameter on any parser.
High-level API¶
from subby.oracle import LogLike, Counts, RootProb, MixturePosterior
LogLike(alignment, tree, model)¶
Compute per-column log-likelihoods.
| Parameter | Type | Description |
|---|---|---|
alignment |
int32 (R, C) |
Token-encoded alignment |
tree |
Tree |
Tree(parentIndex=..., distanceToParent=...) |
model |
dict |
{'eigenvalues': ..., 'eigenvectors': ..., 'pi': ...} |
Returns: (C,) float64 array.
Counts(alignment, tree, model, f81_fast=False)¶
Compute expected substitution counts and dwell times.
| Parameter | Type | Description |
|---|---|---|
alignment |
int32 (R, C) |
Token-encoded alignment |
tree |
Tree |
Tree NamedTuple |
model |
dict |
Model dict |
f81_fast |
bool |
Use $O(CRA^2)$ fast path (F81/JC only) |
Returns: (A, A, C) float64 tensor. Diagonal = dwell times, off-diagonal = substitution counts.
RootProb(alignment, tree, model)¶
Compute posterior root state distribution per column.
Returns: (A, C) float64 array. Sums to 1 over $A$ for each column.
MixturePosterior(alignment, tree, models, log_weights)¶
Compute posterior over mixture components per column.
| Parameter | Type | Description |
|---|---|---|
alignment |
int32 (R, C) |
Token-encoded alignment |
tree |
Tree |
Tree NamedTuple |
models |
list[dict] |
$K$ model dicts |
log_weights |
(K,) |
Log prior weights |
Returns: (K, C) float64 posterior probabilities.
Model constructors¶
hky85_diag(kappa, pi)¶
HKY85 substitution model.
model = hky85_diag(kappa=2.0, pi=np.array([0.3, 0.2, 0.25, 0.25]))
jukes_cantor_model(A)¶
Jukes-Cantor model for $A$ states.
model = jukes_cantor_model(4) # nucleotide
model = jukes_cantor_model(64) # triplet (codon-like)
f81_model(pi)¶
F81 model with non-uniform equilibrium frequencies.
gamma_rate_categories(alpha, K)¶
Discretized gamma rate categories (Yang 1994).
rates, weights = gamma_rate_categories(alpha=0.5, K=4)
# rates: [0.03, 0.26, 0.82, 2.89] (approximately)
# weights: [0.25, 0.25, 0.25, 0.25]
gy94_model(omega, kappa, pi=None)¶
Goldman-Yang (1994) codon substitution model. Operates on 61 sense codons.
Rate matrix: - $Q_{ij} = 0$ if codons differ at more than 1 nucleotide position - $Q_{ij} = \pi_j \cdot \kappa^{\mathbb{1}[\text{transition}]} \cdot \omega^{\mathbb{1}[\text{nonsynonymous}]}$ - Diagonal: $Q_{ii} = -\sum_{j \neq i} Q_{ij}$ - Normalized so $-\sum_i \pi_i Q_{ii} = 1$
This is reversible ($\pi_i Q_{ij} = \pi_j Q_{ji}$), so uses symmetric eigendecomposition.
model = gy94_model(omega=0.5, kappa=2.0)
# {'eigenvalues': ..., 'eigenvectors': ..., 'pi': ..., 'reversible': True}
# A=61 sense codons
| Parameter | Type | Description |
|---|---|---|
omega |
scalar | dN/dS ratio (Ka/Ks) |
kappa |
scalar | Transition/transversion ratio |
pi |
(61,) or None |
Codon equilibrium frequencies (default: uniform $1/61$) |
Returns: dict with eigenvalues, eigenvectors, pi, reversible: True. $A = 61$ states.
scale_model(model, rate_multiplier)¶
Scale eigenvalues by a rate multiplier (scalar).
Tree utilities¶
children_of(parentIndex)¶
Compute child and sibling arrays.
left_child, right_child, sibling = children_of(parentIndex)
Returns: Three (R,) int arrays. -1 for absent children (leaves) or absent sibling (root).
validate_binary_tree(parentIndex)¶
Assert every internal node has exactly 2 children. Raises ValueError if not.
compute_branch_mask(alignment, parentIndex, A)¶
Compute Steiner tree branch mask per column.
mask = compute_branch_mask(alignment, parentIndex, A=4)
# Shape: (R, C) — boolean, True where branch is informative
Format parsers¶
Standard phylogenetic file format parsers are re-exported from subby.oracle for convenience:
from subby.oracle import (
parse_newick, parse_fasta, parse_stockholm, parse_maf,
parse_strings, parse_dict, combine_tree_alignment, detect_alphabet,
)
parse_newick(newick_str) -> dict¶
Parse a Newick tree string into subby's internal tree representation.
Returns: dict with parentIndex (int32), distanceToParent (float64), leaf_names, node_names.
tree = parse_newick("((A:0.1,B:0.2):0.05,C:0.3);")
parse_fasta(text, alphabet=None) -> dict¶
Parse FASTA-formatted alignment. Auto-detects DNA/RNA/protein alphabet.
Returns: dict with alignment (int32 (N, C)), leaf_names, alphabet.
parse_stockholm(text, alphabet=None) -> dict¶
Parse Stockholm format. Extracts #=GF NH tree if present. When a tree is found, automatically calls combine_tree_alignment.
parse_maf(text, alphabet=None) -> dict¶
Parse MAF (Multiple Alignment Format). Concatenates alignment blocks; fills gaps for missing species.
parse_strings(sequences, alphabet=None) -> dict¶
Parse a list of equal-length strings into an alignment tensor.
parse_dict(sequences, alphabet=None) -> dict¶
Parse a {name: sequence} dictionary into an alignment tensor.
Returns: dict with alignment (int32 (N, C)), leaf_names, alphabet.
aln = parse_dict({"human": "ACGT", "mouse": "TGCA", "dog": "GGGG"})
combine_tree_alignment(tree_result, alignment_result) -> CombinedResult¶
Map leaf sequences to tree positions by name. Creates full (R, C) alignment with internal nodes filled with the ungapped-unobserved token.
Returns: CombinedResult NamedTuple with alignment (int32 (R, C)), tree (Tree), alphabet, leaf_names.
tree = parse_newick("((A:0.1,B:0.2):0.05,C:0.3);")
aln = parse_fasta(">A\nACGT\n>B\nTGCA\n>C\nGGGG\n")
combined = combine_tree_alignment(tree, aln)
ll = LogLike(combined.alignment, combined.tree, jukes_cantor_model(4))
genetic_code() -> dict¶
Return the standard genetic code. Codons in ACGT lexicographic order; stop codons (TAA=48, TAG=50, TGA=56) marked with '*'.
Returns: dict with codons (64 strings), amino_acids (64 letters), sense_mask (64 bools), sense_indices (61 ints), codon_to_sense (64 ints, stop -> -1), sense_codons (61 strings), sense_amino_acids (61 letters).
from subby.formats import genetic_code
gc = genetic_code()
print(len(gc['sense_codons'])) # 61
codon_to_sense(alignment, A=64) -> dict¶
Remap a 64-codon tokenized alignment to 61-sense-codon tokens. Stop codons become the gap token.
| Parameter | Type | Description |
|---|---|---|
alignment |
int32 (N, C) |
Tokens 0..63 for codons, 64 unobserved, 65 gap |
A |
int |
Input codon alphabet size (default 64) |
Returns: dict with alignment (int32 (N, C) with $A_\text{sense} = 61$), A_sense, alphabet.
K-mer tokenization¶
K-mer tokenization converts single-character token alignments into multi-column tokens (e.g., codons = 3-mers). The API cleanly separates column indexing from tokenization:
- Column indexing:
sliding_windowsandall_column_ktuplesproduce(T, k)arrays of column indices. - Tokenization:
kmer_tokenizeconverts alignment columns into k-mer tokens given column tuples. - Index mapping:
KmerIndexprovides O(1) lookup between column tuples and output positions.
KmerIndex¶
Maps between column tuples and output alignment indices. Provides O(1) lookup in both directions.
from subby.oracle import KmerIndex
index = KmerIndex([(0, 1), (2, 3), (4, 5)])
index.tuple_to_idx((2, 3)) # → 1
index.idx_to_tuple(0) # → (0, 1)
len(index) # → 3
| Method | Description |
|---|---|
tuple_to_idx(t) |
Column tuple → output index. Returns -1 if absent. |
idx_to_tuple(idx) |
Output index → column tuple. |
__len__() |
Number of tuples. |
sliding_windows(C, k, stride=None, offset=0, edge='truncate')¶
Generate column index tuples for sliding-window k-mer tokenization.
| Parameter | Type | Description |
|---|---|---|
C |
int |
Number of columns in the alignment |
k |
int |
Window size (k-mer length) |
stride |
int or None |
Step between window starts. Default None → k (non-overlapping). |
offset |
int |
Starting column index. Default 0. |
edge |
str |
'truncate' (default): drop incomplete trailing window; 'pad': include partial window, using -1 for out-of-bounds columns |
Returns: (M, k) int64 array. Entries of -1 indicate out-of-bounds positions (only with edge='pad').
from subby.oracle import sliding_windows
# Non-overlapping codons
sliding_windows(9, 3) # → [[0,1,2],[3,4,5],[6,7,8]]
# Overlapping stride-1 windows
sliding_windows(5, 3, stride=1) # → [[0,1,2],[1,2,3],[2,3,4]]
# Three reading frames
for frame in range(3):
windows = sliding_windows(9, 3, stride=3, offset=frame)
all_column_ktuples(C, k, ordered=True)¶
Generate all k-tuples of column indices. WARNING: produces $O(C^k)$ tuples.
| Parameter | Type | Description |
|---|---|---|
C |
int |
Number of columns |
k |
int |
Tuple size |
ordered |
bool |
True: permutations ($C \cdot (C-1)$ for $k=2$); False: combinations ($\binom{C}{k}$) |
Returns: (T, k) int64 array.
kmer_tokenize(alignment, A, k_or_tuples, gap_mode='any', alphabet=None)¶
Core tokenizer. Accepts either an integer k (backward compatible, non-overlapping windows) or a (T, k) array of column index tuples.
| Parameter | Type | Description |
|---|---|---|
alignment |
int32 (R, C) |
Token-encoded alignment |
A |
int |
Single-character alphabet size |
k_or_tuples |
int or (T, k) array |
Integer k for non-overlapping windows (C must be divisible by k), or column tuples from sliding_windows() / all_column_ktuples() / custom |
gap_mode |
str |
'any': gap in any position gaps the k-mer; 'all': only all-gap k-mers become gaps |
alphabet |
list[str] or None |
Single-character labels for building k-mer labels |
Returns: dict with:
| Key | Type | Description |
|---|---|---|
alignment |
int32 (R, T) |
K-mer tokens |
A_kmer |
int |
$A^k$ |
index |
KmerIndex |
Bidirectional tuple ↔ index mapping |
alphabet |
list[str] |
K-mer labels (only if alphabet given) |
Token encoding: 0..A^k-1 observed, A^k ungapped-unobserved, A^k+1 gap, A^k+2 illegal (partial gap with gap_mode='all'). Entries of -1 in column tuples are treated as unobserved positions.
from subby.oracle import kmer_tokenize, sliding_windows
# Non-overlapping codons (backward compatible)
result = kmer_tokenize(dna_alignment, A=4, k_or_tuples=3)
# Overlapping codon windows
windows = sliding_windows(C=100, k=3, stride=1)
result = kmer_tokenize(dna_alignment, A=4, k_or_tuples=windows)
result['index'].tuple_to_idx((5, 6, 7)) # → 5
Low-level pipeline functions¶
token_to_likelihood(alignment, A)¶
Convert integer tokens to likelihood vectors.
Returns: (R, C, A) float64.
compute_sub_matrices(model, distances)¶
Compute transition probability matrices.
Returns: (R, A, A) float64. Rows sum to 1.
upward_pass(alignment, tree, subMatrices, rootProb)¶
Felsenstein pruning (postorder).
Returns: (U, logNormU, logLike):
- U: (R, C, A) rescaled inside vectors
- logNormU: (R, C) log-normalizers
- logLike: (C,) log-likelihoods
downward_pass(U, logNormU, tree, subMatrices, rootProb, alignment)¶
Outside algorithm (preorder).
Returns: (D, logNormD):
- D: (R, C, A) rescaled outside vectors
- logNormD: (R, C) log-normalizers
compute_J(eigenvalues, distances)¶
$J$ interaction matrix.
Returns: (R, A, A) float64.
eigenbasis_project(U, D, model)¶
Project inside/outside vectors into eigenbasis.
Returns: (U_tilde, D_tilde) — each (R, C, A).
accumulate_C(D_tilde, U_tilde, J, logNormU, logNormD, logLike, parentIndex)¶
Sum eigenbasis contributions over non-root branches.
Returns: (A, A, C).
back_transform(C, model)¶
Transform to natural-basis counts.
Returns: (A, A, C).
f81_counts(U, D, logNormU, logNormD, logLike, distances, pi, parentIndex)¶
F81/JC fast path for counts.
Returns: (A, A, C).
mixture_posterior(log_likes, log_weights)¶
Numerically stable softmax.
Returns: (K, C).