Testing Strategy

Overview

The phylogenetic sufficient statistics library uses a layered testing architecture with a Python oracle as the single source of truth, verified against JAX, and golden files as the cross-language bridge to WebGPU and Rust/WASM.

Test hierarchy

Layer 1: JAX unit tests (57 tests)

Located in tests/test_phylo/test_*.py (excluding test_oracle.py).

These validate the JAX implementation against:

- Known analytical results (HKY85 eigendecomposition roundtrip, substitution matrix row sums)
- Brute-force enumeration (3-node tree eigensub vs integral)
- Internal consistency (sum-product identity: $\sum_{a,b} D_a M_{ab} U_b = P(x)$)
- Shape and non-negativity constraints
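As an illustration of the "known analytical results" category, the sketch below checks a Jukes-Cantor substitution matrix in plain Python rather than JAX. The function names are hypothetical and not taken from the library. The row-sum and non-negativity checks mirror the items listed above; the Chapman-Kolmogorov check, $P(t_1)\,P(t_2) = P(t_1 + t_2)$, is an extra analytical identity added here for illustration.

```python
import math


def jc_transition_matrix(t, n_states=4):
    """Jukes-Cantor transition probabilities for branch length t,
    measured in expected substitutions per site (hypothetical helper)."""
    decay = math.exp(-n_states * t / (n_states - 1))
    same = 1.0 / n_states + (1.0 - 1.0 / n_states) * decay
    diff = 1.0 / n_states - decay / n_states
    return [[same if a == b else diff for b in range(n_states)]
            for a in range(n_states)]


def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


P1 = jc_transition_matrix(0.3)
P2 = jc_transition_matrix(0.5)
P12 = jc_transition_matrix(0.8)

# Each row is a probability distribution, so it must sum to 1.
row_sum_error = max(abs(sum(row) - 1.0) for row in P1)

# Probabilities can never be negative.
min_entry = min(min(row) for row in P1)

# Chapman-Kolmogorov: evolving for 0.3 and then 0.5 equals evolving for 0.8.
product = matmul(P1, P2)
ck_error = max(abs(product[a][b] - P12[a][b])
               for a in range(4) for b in range(4))
```

All three quantities can be asserted at the tight tolerance used for this layer, since no long accumulation is involved.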

Tolerance: near machine precision (atol ≈ 1e-12) for most tests; atol = 1e-4 for accumulated operations.

Layer 2: Oracle vs JAX (99 tests)

Located in tests/test_phylo/test_oracle.py.

Every function in the oracle is compared against the corresponding JAX function on identical inputs:

| Test class | Functions tested | Parametrization |
|---|---|---|
| TestTokenToLikelihood | token_to_likelihood | A ∈ {4, 64} |
| TestChildrenOf | children_of | R ∈ {3, 5, 7, 19} |
| TestSubMatrices | compute_sub_matrices | JC4, F81, HKY85, JC64 |
| TestUpwardPass | upward_pass (U, logNormU, logLike) | R × models |
| TestDownwardPass | downward_pass (D, logNormD) | R × models |
| TestComputeJ | compute_J | All models |
| TestEigenbasisProject | eigenbasis_project (U_tilde, D_tilde) | R ∈ {5, 7} × models |
| TestLogLike | LogLike | R × all models including JC64 |
| TestCounts | Counts (eigensub + f81_fast) | R × models + JC64 |
| TestRootProb | RootProb | R × models |
| TestMixturePosterior | MixturePosterior | R ∈ {5, 7} |
| TestBranchMask | compute_branch_mask | R ∈ {3, 5, 7, 19} |
| TestModels | eigenvalues, eigenvectors, gamma | Model-specific |

Tolerance: atol=1e-8 for most, atol=1e-6 for accumulated counts, atol=1e-4 for gamma quantiles.
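One way to keep these per-quantity tolerances in a single place is a small lookup table consulted by a comparison helper. The following is a minimal sketch in plain Python; `ATOL`, `max_abs_diff`, and `agrees` are hypothetical names, and the numeric arrays are made-up illustrative values, not outputs of the oracle or of JAX.

```python
# Absolute tolerances from the text above; the keys are assumed names.
ATOL = {"default": 1e-8, "counts": 1e-6, "gamma_quantiles": 1e-4}


def max_abs_diff(expected, actual):
    """Largest elementwise absolute difference between two scalars
    or two equally shaped nested lists."""
    expected_is_seq = isinstance(expected, (list, tuple))
    if expected_is_seq != isinstance(actual, (list, tuple)):
        raise ValueError("shape mismatch")
    if expected_is_seq:
        if len(expected) != len(actual):
            raise ValueError("shape mismatch")
        return max((max_abs_diff(e, a) for e, a in zip(expected, actual)),
                   default=0.0)
    return abs(expected - actual)


def agrees(quantity, expected, actual):
    """True if the two values match within the tolerance for `quantity`."""
    return max_abs_diff(expected, actual) <= ATOL.get(quantity, ATOL["default"])


# Illustrative values only: a discrepancy of about 3e-7.
oracle_values = [[0.25, 1.5], [2.0, 0.125]]
jax_values = [[0.25 + 3e-7, 1.5], [2.0, 0.125 - 2e-7]]

ok_as_counts = agrees("counts", oracle_values, jax_values)    # 3e-7 <= 1e-6
ok_as_default = agrees("logLike", oracle_values, jax_values)  # 3e-7 > 1e-8
```

The same discrepancy passes under the looser counts tolerance and fails under the default one, which is why the quantity name has to travel with the comparison.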

Layer 3: Golden files (6 test cases)

Generated by scripts/generate_golden_tests.py using the oracle.

Each golden file is a JSON document containing:

- inputs: alignment, parentIndex, distanceToParent, model parameters
- intermediates: sub_matrices, U, logNormU, D, logNormD, J, U_tilde, D_tilde, C_eigen
- outputs: logLike, counts, root_prob, branch_mask
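A consumer in any backend may want to confirm that a golden file has the expected shape before comparing numbers. The sketch below assumes the three sections are top-level JSON objects with the field names listed above as their keys; that nesting is an assumption, and `missing_keys` is a hypothetical helper. The key for model parameters is not named in this document, so it is not checked.

```python
import json

# Field names taken from the list above; the nesting is assumed.
REQUIRED_KEYS = {
    "inputs": ["alignment", "parentIndex", "distanceToParent"],
    "intermediates": ["sub_matrices", "U", "logNormU", "D", "logNormD",
                      "J", "U_tilde", "D_tilde", "C_eigen"],
    "outputs": ["logLike", "counts", "root_prob", "branch_mask"],
}


def missing_keys(golden):
    """Return dotted names of required sections or fields that are absent."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        if section not in golden:
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys
                       if key not in golden[section])
    return missing


# A deliberately incomplete document: the intermediates section is empty.
example = json.loads("""
{
  "inputs": {"alignment": [], "parentIndex": [], "distanceToParent": []},
  "intermediates": {},
  "outputs": {"logLike": [], "counts": [], "root_prob": [], "branch_mask": []}
}
""")

missing = missing_keys(example)
```

Failing early on a structural mismatch gives a clearer error than a numeric comparison against a field that does not exist.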

Golden files serve as the cross-language test oracle: an implementation that reproduces them within its precision tolerance is considered correct.

| File | Description | Key features tested |
|---|---|---|
| 5node_jc4.json | 5-node JC, A=4 | Basic correctness |
| 7node_hky85_4.json | 7-node HKY85, A=4 | Non-uniform frequencies, κ bias |
| 7node_jc64.json | 7-node JC, A=64 | Large alphabet (triplet tokenization) |
| 7node_jc4_mixed_gaps.json | 7-node JC, A=4, gaps | Branch masking, partial observation |
| 7node_jc4_all_gaps.json | 7-node JC, A=4, all-gap col | Edge case: logLike ≈ 0 |
| 7node_mixture_jc4.json | 7-node mixture of 3 JC4 | Rate heterogeneity, mixture posterior |
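The all-gap edge case has a closed-form answer, which the following plain-Python pruning sketch reproduces. It is not the library's implementation. It assumes that a gap is encoded as an all-ones likelihood vector, that nodes are numbered so every parent precedes its children with node 0 as the root, and that the branch lengths shown are arbitrary illustrative values; all function names are hypothetical.

```python
import math


def jc4(t):
    """Jukes-Cantor transition matrix for 4 states and branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    return [[same if a == b else diff for b in range(4)] for a in range(4)]


def column_log_likelihood(parent_index, distance_to_parent, leaf_vectors):
    """Felsenstein pruning for a single alignment column.

    leaf_vectors maps a leaf node index to its length-4 likelihood vector;
    leaves that are absent from the map are treated as gaps (all ones).
    """
    n = len(parent_index)
    partial = [list(leaf_vectors.get(node, [1.0] * 4)) for node in range(n)]
    # Children have larger indices than parents, so walk from the tips up.
    for node in range(n - 1, 0, -1):
        P = jc4(distance_to_parent[node])
        message = [sum(P[a][b] * partial[node][b] for b in range(4))
                   for a in range(4)]
        parent = parent_index[node]
        partial[parent] = [x * m for x, m in zip(partial[parent], message)]
    # Uniform root frequencies under Jukes-Cantor.
    return math.log(sum(0.25 * x for x in partial[0]))


# 7-node tree: root 0, internal nodes 1 and 2, leaves 3 to 6.
parent_index = [-1, 0, 0, 1, 1, 2, 2]
distance_to_parent = [0.0, 0.1, 0.2, 0.05, 0.3, 0.15, 0.25]

# Every leaf is a gap: each message is a row sum (1), so the likelihood is 1.
all_gap_loglike = column_log_likelihood(parent_index, distance_to_parent, {})

# One observed leaf (state 0), the rest gaps: the likelihood is the
# stationary frequency of that state, 0.25.
one_obs_loglike = column_log_likelihood(
    parent_index, distance_to_parent, {3: [1.0, 0.0, 0.0, 0.0]})
```

With no observations the likelihood is exactly 1, so logLike is 0 up to rounding, independent of the branch lengths. The single-observation variant is a small analogue of the partial-observation case and gives log(0.25).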

Layer 4: WebGPU vs golden (planned)

Planned: Playwright tests running in headless Chromium will load the golden JSON files and compare WebGPU outputs at atol=1e-3 (f32 precision).
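The looser tolerance follows from the storage format alone. As a rough illustration, the sketch below rounds a made-up log-likelihood value to f32 using Python's `struct` module and compares the rounding error against both tolerances; `to_f32` is a hypothetical helper and the value is illustrative only.

```python
import struct


def to_f32(x):
    """Round a Python float (f64) to the nearest f32 and convert back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


log_like_f64 = -1234.5678  # illustrative value, not from a golden file
log_like_f32 = to_f32(log_like_f64)

# Error introduced purely by storing the value in 32 bits.
rounding_error = abs(log_like_f32 - log_like_f64)

passes_f32_tolerance = rounding_error <= 1e-3  # WebGPU layer tolerance
passes_f64_tolerance = rounding_error <= 1e-8  # oracle-vs-JAX tolerance
```

This captures only the error from storing a single value; arithmetic carried out in f32 accumulates further error, which is a further reason to leave headroom in the tolerance.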

Layer 5: Rust/WASM vs golden

Running tests

# Oracle vs JAX (requires jax-env)
~/jax-env/bin/python -m pytest tests/test_phylo/test_oracle.py -v

# All Python tests except test_pruning.py
~/jax-env/bin/python -m pytest tests/test_phylo/ -v --ignore=tests/test_phylo/test_pruning.py

# Rust native tests
cd src/phylo/wasm && cargo test

# Regenerate golden files (after oracle changes)
python scripts/generate_golden_tests.py

Adding new test cases

  1. Add a generator function to scripts/generate_golden_tests.py
  2. Run the script to produce the JSON file
  3. Add the corresponding test in each backend's test suite
  4. Verify all backends agree within their precision tolerance