Testing Strategy

Overview

The phylogenetic sufficient statistics library uses a layered testing architecture. A Python oracle is the single source of truth: it is verified against the JAX implementation, and golden files generated from it serve as the cross-language bridge to the WebGPU and Rust/WASM backends.

Test hierarchy

Layer 1: JAX unit tests (57 tests)

Located in tests/test_phylo/test_*.py (excluding test_oracle.py).

These validate the JAX implementation against:

- Known analytical results (HKY85 eigendecomposition roundtrip, substitution matrix row sums)
- Brute-force enumeration (3-node tree eigensub vs integral)
- Internal consistency (sum-product identity: $\sum_{a,b} D_a M_{ab} U_b = P(x)$)
- Shape and non-negativity constraints

Tolerance: near machine precision (atol ≈ 1e-12) for most tests, relaxed to atol = 1e-4 for accumulated operations.
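The internal-consistency checks in this layer can be illustrated on the smallest possible tree. What follows is a minimal pure-Python sketch, not the library's JAX code: it assumes a root with two leaves under Jukes-Cantor (A = 4), and the helper names `jc_matrix`, `site_likelihood`, and `branch_identity` are hypothetical. It checks that substitution-matrix rows sum to 1 and that the sum-product identity reproduces the brute-force likelihood.

```python
import math

A = 4  # alphabet size (nucleotides)
PI = [1.0 / A] * A  # uniform root distribution under Jukes-Cantor


def jc_matrix(t):
    """Closed-form Jukes-Cantor transition matrix for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if a == b else diff for b in range(A)] for a in range(A)]


def one_hot(token):
    """Leaf likelihood vector for an observed token."""
    return [1.0 if a == token else 0.0 for a in range(A)]


def site_likelihood(x1, x2, t1, t2):
    """Brute force: enumerate the root state of a root + two-leaf tree."""
    m1, m2 = jc_matrix(t1), jc_matrix(t2)
    return sum(PI[r] * m1[r][x1] * m2[r][x2] for r in range(A))


def branch_identity(x1, x2, t1, t2):
    """Evaluate sum_{a,b} D_a M_ab U_b on the branch leading to leaf 1."""
    m1, m2 = jc_matrix(t1), jc_matrix(t2)
    u1, u2 = one_hot(x1), one_hot(x2)
    # D_a: root prior times the evidence from outside leaf 1's subtree.
    d = [PI[a] * sum(m2[a][c] * u2[c] for c in range(A)) for a in range(A)]
    return sum(d[a] * m1[a][b] * u1[b] for a in range(A) for b in range(A))


if __name__ == "__main__":
    t1, t2 = 0.1, 0.3
    for row in jc_matrix(t1):
        assert abs(sum(row) - 1.0) < 1e-12  # rows are probability vectors
    for x1 in range(A):
        for x2 in range(A):
            lhs = branch_identity(x1, x2, t1, t2)
            rhs = site_likelihood(x1, x2, t1, t2)
            assert abs(lhs - rhs) < 1e-12
    print("sum-product identity holds on all 16 leaf patterns")
```

On a tree this small the identity is almost a restatement of the likelihood, which is why the real tests apply it to every branch of larger trees.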

Layer 2: Oracle vs JAX (99 tests)

Located in tests/test_phylo/test_oracle.py.

Every function in the oracle is compared against the corresponding JAX function on identical inputs:

| Test class | Functions tested | Parametrization |
| --- | --- | --- |
| `TestTokenToLikelihood` | `token_to_likelihood` | A ∈ {4, 64} |
| `TestChildrenOf` | `children_of` | R ∈ {3, 5, 7, 19} |
| `TestSubMatrices` | `compute_sub_matrices` | JC4, F81, HKY85, JC64 |
| `TestUpwardPass` | `upward_pass` (U, logNormU, logLike) | R × models |
| `TestDownwardPass` | `downward_pass` (D, logNormD) | R × models |
| `TestComputeJ` | `compute_J` | All models |
| `TestEigenbasisProject` | `eigenbasis_project` (U_tilde, D_tilde) | R ∈ {5, 7} × models |
| `TestLogLike` | `LogLike` | R × all models including JC64 |
| `TestCounts` | `Counts` (eigensub + f81_fast) | R × models + JC64 |
| `TestRootProb` | `RootProb` | R × models |
| `TestMixturePosterior` | `MixturePosterior` | R ∈ {5, 7} |
| `TestBranchMask` | `compute_branch_mask` | R ∈ {3, 5, 7, 19} |
| `TestModels` | eigenvalues, eigenvectors, gamma | Model-specific |

Tolerance: atol=1e-8 for most, atol=1e-6 for accumulated counts, atol=1e-4 for gamma quantiles.
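The comparison pattern for this layer can be sketched as follows. This is an illustration in pure Python rather than the project's test code: `oracle_jc` plays the role of the oracle, `candidate_jc` is a hypothetical stand-in for a second implementation (here a truncated Taylor series of exp(Qt)), and the keys of `ATOL` are made-up labels for the tolerances quoted above. In a pytest suite the loop over branch lengths would typically become a parametrized test.

```python
import math

# Tolerances quoted in the text; the dictionary keys are hypothetical labels.
ATOL = {"default": 1e-8, "counts": 1e-6, "gamma": 1e-4}


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between nested lists."""
    if isinstance(x, (int, float)):
        return abs(x - y)
    assert len(x) == len(y), "shape mismatch"
    return max((max_abs_diff(a, b) for a, b in zip(x, y)), default=0.0)


def oracle_jc(t, A=4):
    """Reference implementation: closed-form Jukes-Cantor matrix."""
    e = math.exp(-A * t / (A - 1))
    same, diff = (1 + (A - 1) * e) / A, (1 - e) / A
    return [[same if i == j else diff for j in range(A)] for i in range(A)]


def candidate_jc(t, A=4, terms=40):
    """Stand-in for a second backend: truncated Taylor series of exp(Q t)."""
    qt = [[(-1.0 if i == j else 1.0 / (A - 1)) * t for j in range(A)]
          for i in range(A)]
    result = [[float(i == j) for j in range(A)] for i in range(A)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # After this step, term holds (Q t)^k / k!
        term = [[sum(term[i][m] * qt[m][j] for m in range(A)) / k
                 for j in range(A)] for i in range(A)]
        result = [[r + s for r, s in zip(rr, ss)]
                  for rr, ss in zip(result, term)]
    return result


def worst_disagreement(branch_lengths=(0.01, 0.1, 0.5, 1.0)):
    """Run both implementations on identical inputs; return the worst error."""
    return max(max_abs_diff(oracle_jc(t), candidate_jc(t))
               for t in branch_lengths)


if __name__ == "__main__":
    assert worst_disagreement() <= ATOL["default"]
```

Feeding both implementations the same inputs and reporting only the worst element-wise error keeps a failure easy to read: one number against one tolerance.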

Layer 3: Golden files (6 test cases)

Generated by scripts/generate_golden_tests.py using the oracle.

Each golden file is a JSON document containing:

- inputs: alignment, parentIndex, distanceToParent, model parameters
- intermediates: sub_matrices, U, logNormU, D, logNormD, J, U_tilde, D_tilde, C_eigen
- outputs: logLike, counts, root_prob, branch_mask

Golden files serve as the cross-language test oracle: any implementation that reproduces the golden values within its precision tolerance is considered correct.

| File | Description | Key features tested |
| --- | --- | --- |
| `5node_jc4.json` | 5-node JC, A=4 | Basic correctness |
| `7node_hky85_4.json` | 7-node HKY85, A=4 | Non-uniform frequencies, κ bias |
| `7node_jc64.json` | 7-node JC, A=64 | Large alphabet (triplet tokenization) |
| `7node_jc4_mixed_gaps.json` | 7-node JC, A=4, gaps | Branch masking, partial observation |
| `7node_jc4_all_gaps.json` | 7-node JC, A=4, all-gap column | Edge case: logLike ≈ 0 |
| `7node_mixture_jc4.json` | 7-node mixture of 3 JC4 | Rate heterogeneity, mixture posterior |
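A backend-side check against a golden file might look like the sketch below. It is an assumption-laden illustration, not the project's test harness: only the three top-level section names and a few field names come from the description above, the encoding of each field is guessed, the numbers in `GOLDEN_JSON` are placeholders rather than values from a real golden file, and `mismatches` and `check_backend` are hypothetical helpers.

```python
import json

# Toy document with the three top-level sections named in the text.
# The numbers are illustrative placeholders, not real golden values.
GOLDEN_JSON = """
{
  "inputs": {"parentIndex": [-1, 0, 0], "distanceToParent": [0.0, 0.1, 0.3]},
  "intermediates": {"logNormU": [0.0, 0.0, -1.5]},
  "outputs": {"logLike": [-2.25, -3.5]}
}
"""


def mismatches(expected, actual, atol, path=""):
    """Return the paths whose values differ by more than atol."""
    if isinstance(expected, dict):
        bad = []
        for key, value in expected.items():
            if key not in actual:
                bad.append(f"{path}/{key} (missing)")
            else:
                bad += mismatches(value, actual[key], atol, f"{path}/{key}")
        return bad
    if isinstance(expected, list):
        if not isinstance(actual, list) or len(actual) != len(expected):
            return [f"{path} (shape)"]
        bad = []
        for i, (e, a) in enumerate(zip(expected, actual)):
            bad += mismatches(e, a, atol, f"{path}[{i}]")
        return bad
    return [] if abs(expected - actual) <= atol else [path]


def check_backend(golden_text, backend_outputs, atol):
    """Compare a backend's outputs against the golden 'outputs' section."""
    golden = json.loads(golden_text)
    return mismatches(golden["outputs"], backend_outputs, atol, "outputs")


if __name__ == "__main__":
    close = {"logLike": [-2.25 + 1e-10, -3.5]}
    wrong = {"logLike": [-2.25, -3.4]}
    print(check_backend(GOLDEN_JSON, close, atol=1e-8))
    print(check_backend(GOLDEN_JSON, wrong, atol=1e-8))
```

Returning the offending paths instead of a bare pass/fail makes it easier to see whether a backend diverges in a single output or across the whole document. The same function could be pointed at the intermediates section to localize where a divergence first appears.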

Layer 4: WebGPU vs golden (planned)

Playwright tests running in headless Chromium will load the golden JSON files and compare the WebGPU outputs against them at atol=1e-3 (f32 precision).

Layer 5: Rust/WASM vs golden

- `cargo test` with golden-file comparisons at atol=1e-8 (f64 precision)
- Node.js tests that load the WASM module and the golden files, also at atol=1e-8
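One plausible reading of the two tolerances (1e-3 for f32 backends, 1e-8 for f64 backends) is that merely storing a value in single precision already perturbs it by more than 1e-8. The sketch below illustrates this with Python's standard `struct` module. The sample values are arbitrary and not taken from the golden files, and `to_f32` and `max_error_after_f32_roundtrip` are hypothetical helper names.

```python
import struct


def to_f32(x):
    """Round a Python float (f64) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def max_error_after_f32_roundtrip(values):
    """Worst absolute error introduced by storing each value as f32."""
    return max(abs(v - to_f32(v)) for v in values)


if __name__ == "__main__":
    samples = [-12.3, -104.7]  # arbitrary log-likelihood-like magnitudes
    err = max_error_after_f32_roundtrip(samples)
    print(f"worst f32 storage error: {err:.3e}")
    print("within atol=1e-3:", err <= 1e-3)
    print("within atol=1e-8:", err <= 1e-8)
```

A single rounding step is only part of the picture: an f32 backend also accumulates error across many operations, which is a reason to leave headroom well above the storage error.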

Running tests

```bash
# Oracle vs JAX (requires jax-env)
~/jax-env/bin/python -m pytest tests/test_phylo/test_oracle.py -v

# All Python tests
~/jax-env/bin/python -m pytest tests/test_phylo/ -v --ignore=tests/test_phylo/test_pruning.py

# Rust native tests
cd src/phylo/wasm && cargo test

# Regenerate golden files (after oracle changes)
python scripts/generate_golden_tests.py
```

Adding new test cases

1. Add a generator function to scripts/generate_golden_tests.py
2. Run the script to produce the JSON file
3. Add the corresponding test in each backend's test suite
4. Verify all backends agree within their precision tolerance
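The first two steps can be sketched as below. This is a hypothetical example and does not reflect the actual contents of scripts/generate_golden_tests.py: the generator name `generate_3node_jc4`, the file name, the `model` field, and the field encodings (for instance -1 as the root's parent index) are all assumptions. Only the three top-level sections follow the golden-file layout described earlier.

```python
import json
import math
import os
import tempfile


def jc_prob(a, b, t):
    """Jukes-Cantor transition probability from state a to b over length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def generate_3node_jc4():
    """Hypothetical generator: a root with two leaves, Jukes-Cantor, A=4."""
    distance = [0.0, 0.1, 0.3]   # node 0 is the root
    columns = [[0, 2], [1, 1]]   # tokens observed at leaves 1 and 2
    log_like = [
        math.log(sum(0.25 * jc_prob(r, x1, distance[1])
                     * jc_prob(r, x2, distance[2]) for r in range(4)))
        for x1, x2 in columns
    ]
    return {
        "inputs": {"alignment": columns, "parentIndex": [-1, 0, 0],
                   "distanceToParent": distance, "model": "jc4"},
        "intermediates": {},  # left empty in this sketch
        "outputs": {"logLike": log_like},
    }


def write_golden(case, directory, name):
    """Serialize one test case to <directory>/<name> and return the path."""
    path = os.path.join(directory, name)
    with open(path, "w") as fh:
        json.dump(case, fh, indent=2)
    return path


def roundtrip(case):
    """Write the case to a temporary directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = write_golden(case, tmp, "3node_jc4.json")
        with open(path) as fh:
            return json.load(fh)


if __name__ == "__main__":
    case = generate_3node_jc4()
    assert roundtrip(case) == case  # JSON preserves the f64 values exactly
```

Python's `json` module writes floats using their shortest round-trip representation, so an f64 backend reading the file sees the same values the generator computed.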