Benchmark and regression-gate tooling for Refua model workflows. Enforce safe model updates with statistical gating, baseline registries, and adapter-based evaluation.
Refua Bench is a standalone benchmark and regression-gating project for Refua model workflows. It defines benchmark suites, runs evaluations via pluggable adapters, and enforces safe model updates through statistical regression gates with minimum practical effect sizes and bootstrap confidence intervals.
The baseline registry system enables named baselines with safe promotion flows, ensuring that model upgrades never silently degrade performance.
1. Write a benchmark suite in YAML with tasks, metrics, expected values, and regression tolerances.
2. Execute the suite with a model adapter (file, command, golden, or custom) to produce a run artifact.
3. Compare the candidate run against a baseline with statistical gating and bootstrap confidence intervals.
4. If the candidate passes all gates, promote it as the new baseline in the registry.
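A suite along the lines of step 1 might look like the fragment below. The exact schema is an assumption made for illustration (field names such as `tasks`, `expected`, and `tolerance` are hypothetical); consult the project's documentation for the real format:

```yaml
# Hypothetical suite layout -- field names are illustrative, not the real schema.
suite: affinity-smoke
tasks:
  - id: binding-energy-small
    metric: mae
    expected: 0.42      # value the baseline is expected to achieve
    tolerance: 0.05     # how much regression is tolerated before gating fails
  - id: pose-classification
    metric: accuracy
    expected: 0.91
    tolerance: 0.02
```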
- MAE: Mean Absolute Error for continuous prediction tasks like affinity and binding energy estimation.
- RMSE: Root Mean Square Error for regression tasks where large deviations are particularly penalized.
- Accuracy: Classification accuracy for binary or multi-class prediction tasks.
- Exact match: Exact string or value match for deterministic outputs where precision matters.
- F1: Binary F1 score balancing precision and recall for imbalanced classification problems.
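The metrics above all have standard definitions; the sketch below shows them in plain Python (textbook formulas, not Refua Bench's internal code):

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large deviations quadratically."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def exact_match(expected, actual):
    """Strict equality for deterministic outputs."""
    return expected == actual

def binary_f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Note why RMSE and MAE coexist: RMSE weights a single large error more heavily than many small ones, while MAE treats all error magnitudes linearly, so a suite may gate on both.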
pip install refua-bench
refua-bench gate \
--suite benchmarks/sample_suite.yaml \
--baseline benchmarks/sample_baseline_run.json \
--adapter file \
--adapter-config benchmarks/sample_file_adapter_config.yaml \
--min-effect-size 0.02 \
--bootstrap-resamples 1000
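The `--min-effect-size` and `--bootstrap-resamples` flags combine two ideas: a regression must be both statistically supported (bootstrap confidence interval) and practically meaningful (minimum effect size) before the gate fails. A minimal sketch of that logic, assuming per-example error lists and treating the function name and signature as hypothetical (this is not Refua Bench's internal implementation):

```python
import random

def bootstrap_gate(baseline_errors, candidate_errors,
                   min_effect_size=0.02, resamples=1000, alpha=0.05, seed=0):
    """Illustrative regression gate over lower-is-better error values.

    Resamples each run with replacement, forms the bootstrap distribution
    of (candidate mean - baseline mean), and fails the gate only when the
    lower confidence bound of that worsening exceeds the minimum practical
    effect size. Returns True when the candidate passes.
    """
    rng = random.Random(seed)
    n, m = len(baseline_errors), len(candidate_errors)
    diffs = []
    for _ in range(resamples):
        base = sum(rng.choice(baseline_errors) for _ in range(n)) / n
        cand = sum(rng.choice(candidate_errors) for _ in range(m)) / m
        diffs.append(cand - base)
    diffs.sort()
    lower = diffs[int((alpha / 2) * resamples)]
    # Noise alone (CI straddling zero) or a tiny, practically irrelevant
    # regression does not block promotion; a confident, material one does.
    return lower <= min_effect_size
```

The minimum effect size is what keeps the gate from becoming flaky on large suites, where even trivial differences become statistically significant.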
Refua Bench ensures that every model update is backed by statistical evidence, preventing regressions from reaching production.