AgentCures - Refua Data

Refua Data

A curated dataset catalog with intelligent local caching and parquet materialization, optimized for downstream drug discovery modeling and campaign workflows.

What Is Refua Data?

Refua Data is the data layer for the Refua ecosystem. It provides a built-in catalog of curated drug discovery datasets and a download pipeline that handles caching, conditional refresh, and parquet materialization, so you do not have to manage data files by hand.

The package supports both static file datasets (ZINC, MoleculeNet benchmarks) and API-based presets (ChEMBL, UniProt). API presets are ingested as paginated JSON and materialized to parquet incrementally.
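
As a rough illustration of how paginated ingestion can feed chunked output, the sketch below pages through an offset/limit source and groups the records into fixed-size chunks, each of which could then be written as one parquet part file. This is not Refua Data's implementation: `iter_records`, `iter_chunks`, and `fake_fetch` are hypothetical names, and offset/limit paging is an assumption (real APIs may use cursors or next-page links instead).

```python
from typing import Callable, Iterable, Iterator, List


def iter_records(
    fetch_page: Callable[[int, int], List[dict]], page_size: int = 100
) -> Iterator[dict]:
    """Yield records page by page until the source returns a short or empty page."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:  # last page reached
            break
        offset += page_size


def iter_chunks(records: Iterable[dict], chunk_size: int) -> Iterator[List[dict]]:
    """Group a record stream into lists of at most chunk_size items."""
    chunk: List[dict] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final partial chunk
        yield chunk


# Hypothetical stand-in for a real API call: 250 fake records served by offset/limit.
DATA = [{"id": i} for i in range(250)]


def fake_fetch(offset: int, limit: int) -> List[dict]:
    return DATA[offset:offset + limit]


sizes = [len(chunk) for chunk in iter_chunks(iter_records(fake_fetch), 120)]
print(sizes)  # [120, 120, 10]
```

Because only one chunk is held in memory at a time, the same pattern works for result sets far larger than RAM, regardless of how the pages were sized by the upstream API.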

Key Features

Built-in catalog of 25+ drug discovery datasets
Dataset-aware download with cache reuse and metadata tracking
HTTP conditional refresh (ETag / Last-Modified)
Incremental parquet materialization with chunked processing
API ingestion for ChEMBL and UniProt
Source health checks for CI and environment diagnostics
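
To make the conditional refresh feature concrete, here is a minimal sketch of an ETag / Last-Modified revalidation using only the Python standard library. It is an illustration under assumptions, not the package's code: `conditional_headers`, `refresh`, and the `etag` / `last_modified` metadata keys are hypothetical names chosen for this example.

```python
import urllib.error
import urllib.request


def conditional_headers(cached_meta: dict) -> dict:
    """Build revalidation headers from metadata saved alongside a cached file."""
    headers = {}
    if cached_meta.get("etag"):
        headers["If-None-Match"] = cached_meta["etag"]
    if cached_meta.get("last_modified"):
        headers["If-Modified-Since"] = cached_meta["last_modified"]
    return headers


def refresh(url: str, cached_meta: dict, timeout: float = 30.0):
    """Return (body, meta). body is None when the cached copy is still valid."""
    request = urllib.request.Request(url, headers=conditional_headers(cached_meta))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            new_meta = {
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
            return response.read(), new_meta
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep using the cached file
            return None, cached_meta
        raise
```

When the server answers 304 Not Modified, nothing is downloaded and the existing cache entry is reused; any other HTTP error is re-raised for the caller to handle.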

Included Datasets

Refua Data ships with a comprehensive catalog covering molecular properties, bioactivity, toxicity, and protein targets.

ZINC Datasets

ZINC15 250K sample plus five tranche presets covering drug-like compounds across molecular weight and logP bins. Includes in-stock, agent, and boutique tranches.

MoleculeNet Benchmarks

Tox21, BBBP, BACE, ClinTox, SIDER, HIV, MUV, ESOL, FreeSolv, Lipophilicity, and PCBA benchmark datasets for molecular property prediction.

ChEMBL Bioactivity

Human Ki and IC50 activity data, binding assays, single-protein targets, and Phase 3+ molecules fetched directly from the ChEMBL REST API.

UniProt Protein Targets

Reviewed human proteome, kinases, GPCRs, ion channels, and transporters fetched from the UniProt REST API and cached locally.

Get Started

Installation

pip install refua-data

CLI Usage

# List available datasets
refua-data list

# Fetch raw data with caching
refua-data fetch zinc15_250k
refua-data fetch chembl_activity_ki_human

# Materialize to parquet
refua-data materialize zinc15_250k

# Validate data sources
refua-data validate-sources --fail-on-error

Python API

from refua_data import DatasetManager

manager = DatasetManager()
manager.fetch("zinc15_250k")
result = manager.materialize("zinc15_250k")
print(result.parquet_dir)

Data, Ready to Model

Refua Data gives you curated, cached, and materialized datasets for drug discovery so you can spend your time on science instead of data wrangling.