Benchmark and regression-gate tooling for Refua model workflows. Enforce safe model updates with statistical gating, baseline registries, and adapter-based evaluation.
Refua Bench is a standalone benchmark and regression-gating project for Refua model workflows. It defines benchmark suites, runs evaluations via pluggable adapters, and enforces safe model updates through statistical regression gates with minimum practical effect sizes and bootstrap confidence intervals.
The baseline registry system enables named baselines with safe promotion flows, ensuring that model upgrades never silently degrade performance.
1. Write a benchmark suite in YAML with tasks, metrics, expected values, and regression tolerances.
2. Execute the suite with a model adapter (file, command, golden, or custom) to produce a run artifact.
3. Compare the candidate run against a baseline with statistical gating and bootstrap confidence intervals.
4. If the candidate passes all gates, promote it as the new baseline in the registry.
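A suite along the lines of step 1 might look like the fragment below. The exact schema is an assumption made for illustration (field names such as `tasks`, `expected`, and `tolerance` are hypothetical); consult the project's documentation for the real format:

```yaml
# Hypothetical suite layout -- field names are illustrative, not the real schema.
suite: affinity-smoke
tasks:
  - id: binding-energy-small
    metric: mae
    expected: 0.42      # value the baseline is expected to achieve
    tolerance: 0.05     # how much regression is tolerated before gating fails
  - id: pose-classification
    metric: accuracy
    expected: 0.91
    tolerance: 0.02
```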
- MAE: Mean Absolute Error for continuous prediction tasks like affinity and binding energy estimation.
- RMSE: Root Mean Square Error for regression tasks where large deviations are particularly penalized.
- Accuracy: Classification accuracy for binary or multi-class prediction tasks.
- Exact match: Exact string or value match for deterministic outputs where precision matters.
- F1: Binary F1 score balancing precision and recall for imbalanced classification problems.
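The metrics above all have standard definitions; the sketch below shows them in plain Python (textbook formulas, not Refua Bench's internal code):

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large deviations quadratically."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def exact_match(expected, actual):
    """Strict equality for deterministic outputs."""
    return expected == actual

def binary_f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Note why RMSE and MAE coexist: RMSE weights a single large error more heavily than many small ones, while MAE treats all error magnitudes linearly, so a suite may gate on both.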
pip install refua-bench
refua-bench gate \
--suite benchmarks/sample_suite.yaml \
--baseline benchmarks/sample_baseline_run.json \
--adapter file \
--adapter-config benchmarks/sample_file_adapter_config.yaml \
--min-effect-size 0.02 \
--bootstrap-resamples 1000
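The `--min-effect-size` and `--bootstrap-resamples` flags combine two ideas: a regression must be both statistically supported (bootstrap confidence interval) and practically meaningful (minimum effect size) before the gate fails. A minimal sketch of that logic, assuming per-example error lists and treating the function name and signature as hypothetical (this is not Refua Bench's internal implementation):

```python
import random

def bootstrap_gate(baseline_errors, candidate_errors,
                   min_effect_size=0.02, resamples=1000, alpha=0.05, seed=0):
    """Illustrative regression gate over lower-is-better error values.

    Resamples each run with replacement, forms the bootstrap distribution
    of (candidate mean - baseline mean), and fails the gate only when the
    lower confidence bound of that worsening exceeds the minimum practical
    effect size. Returns True when the candidate passes.
    """
    rng = random.Random(seed)
    n, m = len(baseline_errors), len(candidate_errors)
    diffs = []
    for _ in range(resamples):
        base = sum(rng.choice(baseline_errors) for _ in range(n)) / n
        cand = sum(rng.choice(candidate_errors) for _ in range(m)) / m
        diffs.append(cand - base)
    diffs.sort()
    lower = diffs[int((alpha / 2) * resamples)]
    # Noise alone (CI straddling zero) or a tiny, practically irrelevant
    # regression does not block promotion; a confident, material one does.
    return lower <= min_effect_size
```

The minimum effect size is what keeps the gate from becoming flaky on large suites, where even trivial differences become statistically significant.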
Refua Bench ensures that every model update is backed by statistical evidence, preventing regressions from reaching production.