Testing Strategy
Overview
The phylogenetic sufficient statistics library uses a layered testing architecture with a Python oracle as the single source of truth, verified against JAX, and golden files as the cross-language bridge to WebGPU and Rust/WASM.
Test hierarchy
Layer 1: JAX unit tests (57 tests)
Located in tests/test_phylo/test_*.py (excluding test_oracle.py).
These validate the JAX implementation against:

- Known analytical results (HKY85 eigendecomposition roundtrip, substitution matrix row sums)
- Brute-force enumeration (3-node tree eigensub vs integral)
- Internal consistency (sum-product identity: $\sum_{a,b} D_a M_{ab} U_b = P(x)$)
- Shape and non-negativity constraints
Tolerance: near machine precision (atol ≈ 1e-12 for most tests, 1e-4 for accumulated operations).
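The sum-product identity and the brute-force check can be illustrated on a 3-node tree (a root with two leaf children) in plain Python. This is a minimal sketch, not the library's code: the helper names, the Jukes-Cantor closed form, and the convention that D for a branch is the root prior times the sibling's upward message are assumptions made for illustration, and the library's own normalization (logNormU, logNormD) is omitted.

```python
import math
from itertools import product

A = 4  # alphabet size (nucleotides)

def jc_matrix(t):
    """Jukes-Cantor transition matrix for branch length t (closed form)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if a == b else diff for b in range(A)] for a in range(A)]

def one_hot(token):
    """Leaf likelihood vector: 1 at the observed token, 0 elsewhere."""
    return [1.0 if a == token else 0.0 for a in range(A)]

def matvec(M, v):
    """Matrix-vector product for A x A matrices stored as nested lists."""
    return [sum(M[a][b] * v[b] for b in range(A)) for a in range(A)]

pi = [0.25] * A                      # uniform root prior under JC
M1, M2 = jc_matrix(0.1), jc_matrix(0.4)
U1, U2 = one_hot(0), one_hot(2)      # observed tokens at the two leaves

# Outside message D for the branch root -> leaf 1 (assumed convention):
# root prior times the sibling's upward message.
sib = matvec(M2, U2)
D = [pi[a] * sib[a] for a in range(A)]

# Sum-product identity: sum_{a,b} D_a M_ab U_b equals the site likelihood.
p_identity = sum(D[a] * M1[a][b] * U1[b]
                 for a, b in product(range(A), repeat=2))

# Brute force: enumerate the hidden root state directly.
p_brute = sum(pi[r] * M1[r][0] * M2[r][2] for r in range(A))

# Analytical check: every row of a substitution matrix sums to 1.
row_sums = [sum(row) for row in M1]
```

Both routes compute the same likelihood with the floating-point operations in a different order, so the comparison is made with a small atol instead of exact equality.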
Layer 2: Oracle vs JAX (99 tests)
Located in tests/test_phylo/test_oracle.py.
Every function in the oracle is compared against the corresponding JAX function on identical inputs:
| Test class | Functions tested | Parametrization |
|---|---|---|
| TestTokenToLikelihood | token_to_likelihood | A ∈ {4, 64} |
| TestChildrenOf | children_of | R ∈ {3, 5, 7, 19} |
| TestSubMatrices | compute_sub_matrices | JC4, F81, HKY85, JC64 |
| TestUpwardPass | upward_pass (U, logNormU, logLike) | R × models |
| TestDownwardPass | downward_pass (D, logNormD) | R × models |
| TestComputeJ | compute_J | All models |
| TestEigenbasisProject | eigenbasis_project (U_tilde, D_tilde) | R ∈ {5, 7} × models |
| TestLogLike | LogLike | R × all models including JC64 |
| TestCounts | Counts (eigensub + f81_fast) | R × models + JC64 |
| TestRootProb | RootProb | R × models |
| TestMixturePosterior | MixturePosterior | R ∈ {5, 7} |
| TestBranchMask | compute_branch_mask | R ∈ {3, 5, 7, 19} |
| TestModels | eigenvalues, eigenvectors, gamma | Model-specific |
Tolerance: atol=1e-8 for most, atol=1e-6 for accumulated counts, atol=1e-4 for gamma quantiles.
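The oracle-versus-implementation pattern can be sketched without JAX. In the example below, a deliberately slow reference (a truncated Taylor series for exp(Qt)) plays the role of the oracle, and the Jukes-Cantor closed form plays the role of the optimized implementation. Both function names are hypothetical and neither is the library's compute_sub_matrices; only the atol=1e-8 threshold is taken from the text above.

```python
import math

A = 4  # alphabet size

def oracle_sub_matrix(t, terms=40):
    """Reference: exp(Q t) by Taylor series, with the JC rate matrix Q."""
    Q = [[-1.0 if a == b else 1.0 / 3.0 for b in range(A)] for a in range(A)]
    result = [[1.0 if a == b else 0.0 for b in range(A)] for a in range(A)]
    term = [row[:] for row in result]  # current series term, starts at identity
    for k in range(1, terms):
        # term <- term @ Q * t / k, i.e. (Q t)^k / k!
        term = [[sum(term[a][c] * Q[c][b] for c in range(A)) * t / k
                 for b in range(A)] for a in range(A)]
        result = [[result[a][b] + term[a][b] for b in range(A)]
                  for a in range(A)]
    return result

def fast_sub_matrix(t):
    """Implementation under test: JC closed form."""
    e = math.exp(-4.0 * t / 3.0)
    return [[0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e
             for b in range(A)] for a in range(A)]

def max_abs_diff(X, Y):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

# Identical inputs for both implementations, one tolerance for the comparison.
ATOL = 1e-8
diffs = {t: max_abs_diff(oracle_sub_matrix(t), fast_sub_matrix(t))
         for t in (0.01, 0.1, 0.5, 2.0)}
within_tol = all(d <= ATOL for d in diffs.values())
```

Looping over several branch lengths mirrors the parametrization column in the table: the same comparison is repeated across inputs instead of being written once per case.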
Layer 3: Golden files (6 test cases)
Generated by scripts/generate_golden_tests.py using the oracle.
Each golden file is a JSON document containing:
- inputs: alignment, parentIndex, distanceToParent, model parameters
- intermediates: sub_matrices, U, logNormU, D, logNormD, J, U_tilde, D_tilde, C_eigen
- outputs: logLike, counts, root_prob, branch_mask
Golden files serve as the cross-language test oracle: any implementation that matches them within its precision tolerance is considered correct.
| File | Description | Key features tested |
|---|---|---|
| 5node_jc4.json | 5-node JC, A=4 | Basic correctness |
| 7node_hky85_4.json | 7-node HKY85, A=4 | Non-uniform frequencies, κ bias |
| 7node_jc64.json | 7-node JC, A=64 | Large alphabet (triplet tokenization) |
| 7node_jc4_mixed_gaps.json | 7-node JC, A=4, gaps | Branch masking, partial observation |
| 7node_jc4_all_gaps.json | 7-node JC, A=4, all-gap col | Edge case: logLike ≈ 0 |
| 7node_mixture_jc4.json | 7-node mixture of 3 JC4 | Rate heterogeneity, mixture posterior |
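A backend test that consumes a golden file might look like the sketch below. The top-level keys (inputs, intermediates, outputs) and the output names come from the description above. The array shapes, the inline toy document, the use of -1 as the root's parent index, and the helper names are assumptions, since the actual schema is defined by scripts/generate_golden_tests.py.

```python
import json
import math

def max_abs_diff(expected, actual):
    """Largest absolute difference between two equally shaped nested lists."""
    if isinstance(expected, list):
        if not isinstance(actual, list) or len(expected) != len(actual):
            raise ValueError("shape mismatch")
        return max((max_abs_diff(e, a) for e, a in zip(expected, actual)),
                   default=0.0)
    # Booleans (e.g. mask entries) must match exactly.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return 0.0 if expected == actual else math.inf
    return abs(float(expected) - float(actual))

def check_against_golden(golden, actual_outputs, atol):
    """Return the names of outputs that differ from the golden file by more than atol."""
    failures = []
    for name, expected in golden["outputs"].items():
        if max_abs_diff(expected, actual_outputs[name]) > atol:
            failures.append(name)
    return failures

# Toy stand-in for a golden file; a real test would json.load() the file instead.
golden = json.loads("""
{
  "inputs": {"parentIndex": [-1, 0, 0], "distanceToParent": [0.0, 0.1, 0.4]},
  "intermediates": {},
  "outputs": {"logLike": [-2.5], "branch_mask": [[false, true, true]]}
}
""")

# Pretend backend result with small single-precision-style round-off in logLike.
backend = {"logLike": [-2.5000004], "branch_mask": [[False, True, True]]}

f32_failures = check_against_golden(golden, backend, atol=1e-3)  # WebGPU tolerance
f64_failures = check_against_golden(golden, backend, atol=1e-8)  # Rust/WASM tolerance
```

With these toy numbers the result passes at the f32 tolerance and fails at the f64 tolerance, which is why each backend is checked against the same golden values at its own atol.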
Layer 4: WebGPU vs golden (planned)
Playwright tests in headless Chromium will load the golden JSON files and compare WebGPU outputs against them at atol=1e-3 (f32 precision).
Layer 5: Rust/WASM vs golden

- cargo test with golden file comparisons at atol=1e-8 (f64 precision)
- Node.js tests loading the WASM module and golden files at atol=1e-8
Running tests

```bash
# Oracle vs JAX (requires jax-env)
~/jax-env/bin/python -m pytest tests/test_phylo/test_oracle.py -v

# All Python tests
~/jax-env/bin/python -m pytest tests/test_phylo/ -v --ignore=tests/test_phylo/test_pruning.py

# Rust native tests
cd src/phylo/wasm && cargo test

# Regenerate golden files (after oracle changes)
python scripts/generate_golden_tests.py
```
Adding new test cases

- Add a generator function to scripts/generate_golden_tests.py
- Run the script to produce the JSON file
- Add the corresponding test in each backend's test suite
- Verify all backends agree within their precision tolerance
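As a rough sketch of the first two steps, a generator could assemble the three sections of a golden file and write them as JSON. Everything here is hypothetical: the function names, the file name, the alignment layout, the placeholder values, and the stubbed oracle call stand in for whatever scripts/generate_golden_tests.py actually does.

```python
import json
import os
import tempfile

def fake_oracle_loglike(alignment):
    """Placeholder for the real oracle; returns one value per alignment column."""
    n_cols = len(alignment[0])
    return [0.0] * n_cols

def generate_example_case(out_dir):
    """Hypothetical generator: build one golden test case and write it to disk."""
    inputs = {
        # Assumed layout: rows are nodes, columns are sites.
        "alignment": [[0, 1], [0, 1], [2, 3]],
        "parentIndex": [-1, 0, 0],
        "distanceToParent": [0.0, 0.1, 0.4],
    }
    case = {
        "inputs": inputs,
        # A real generator would store U, D, J, ... from the oracle here.
        "intermediates": {},
        "outputs": {"logLike": fake_oracle_loglike(inputs["alignment"])},
    }
    path = os.path.join(out_dir, "3node_example.json")
    with open(path, "w") as fh:
        json.dump(case, fh, indent=2)
    return path

# Write the case to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    written = generate_example_case(tmp)
    with open(written) as fh:
        reloaded = json.load(fh)
```

Reading the file back right after writing it is a cheap check that the case survives JSON serialization before any backend test depends on it.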