A curated dataset catalog with intelligent local caching and parquet materialization, optimized for downstream drug discovery modeling and campaign workflows.
Refua Data is the data layer for the Refua ecosystem. It provides a built-in catalog of curated drug discovery datasets and a download pipeline that handles caching, conditional refresh, and parquet materialization, so you do not have to manage data files by hand.
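To make "caching with conditional refresh" concrete, here is a minimal sketch of the general technique using HTTP validators (`ETag` / `Last-Modified`). It is not Refua Data's actual implementation: `cached_fetch`, the `Fetcher` callable, and the on-disk layout are hypothetical names chosen for illustration, and the network call is injected so the logic can be exercised without a real server.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, Tuple

# Hypothetical fetcher signature (an assumption for this sketch):
# (url, request_headers) -> (status_code, response_headers, body)
Fetcher = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], bytes]]


def cached_fetch(url: str, cache_dir: Path, fetcher: Fetcher) -> Tuple[bytes, bool]:
    """Return (payload, refreshed); re-download only when the server reports a change."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    data_path = cache_dir / f"{key}.bin"
    meta_path = cache_dir / f"{key}.json"

    # Send validators from the previous download, if a cached copy exists.
    headers: Dict[str, str] = {}
    if data_path.exists() and meta_path.exists():
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        if meta.get("etag"):
            headers["If-None-Match"] = meta["etag"]
        if meta.get("last_modified"):
            headers["If-Modified-Since"] = meta["last_modified"]

    status, response_headers, body = fetcher(url, headers)

    # 304 Not Modified: the cached copy is still current, so reuse it.
    if status == 304 and data_path.exists():
        return data_path.read_bytes(), False
    if status != 200:
        raise RuntimeError(f"unexpected HTTP status {status} for {url}")

    # 200 OK: store the new payload together with its validators.
    data_path.write_bytes(body)
    meta_path.write_text(
        json.dumps(
            {
                "etag": response_headers.get("ETag"),
                "last_modified": response_headers.get("Last-Modified"),
            }
        ),
        encoding="utf-8",
    )
    return body, True
```

Calling `cached_fetch` twice with the same URL and cache directory downloads the payload once. The second call sends the stored validators and, if the server answers `304`, returns the cached bytes with `refreshed` set to `False`.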
The package supports both static file datasets (ZINC, MoleculeNet benchmarks) and API-based presets (ChEMBL, UniProt) with paginated JSON ingestion and incremental parquet materialization.
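The API-based flow can be pictured as: fetch one JSON page, write it out as its own part file, move to the next page. The sketch below illustrates that pattern under stated assumptions and is not the package's code. The page shape (`records`, `next_offset`), `make_fake_api`, and `materialize_parts` are hypothetical, and JSON Lines part files stand in for parquet parts so the example runs with the standard library alone.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterator, List, Optional

# Hypothetical page shape (an assumption): {"records": [...], "next_offset": int | None}
PageFetcher = Callable[[int], Dict[str, Any]]


def make_fake_api(rows: List[dict], page_size: int) -> PageFetcher:
    """In-memory stand-in for a paginated REST endpoint."""
    def fetch_page(offset: int) -> Dict[str, Any]:
        end = offset + page_size
        return {
            "records": rows[offset:end],
            "next_offset": end if end < len(rows) else None,
        }
    return fetch_page


def iter_pages(fetch_page: PageFetcher, start_offset: int = 0) -> Iterator[List[dict]]:
    """Yield one list of records per page until the API reports no next page."""
    offset: Optional[int] = start_offset
    while offset is not None:
        page = fetch_page(offset)
        if page["records"]:
            yield page["records"]
        offset = page["next_offset"]


def materialize_parts(fetch_page: PageFetcher, out_dir: Path) -> List[Path]:
    """Write each page to its own part file so memory use stays bounded by page size."""
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: List[Path] = []
    for index, records in enumerate(iter_pages(fetch_page)):
        part_path = out_dir / f"part-{index:05d}.jsonl"
        with part_path.open("w", encoding="utf-8") as handle:
            for record in records:
                handle.write(json.dumps(record) + "\n")
        parts.append(part_path)
    return parts
```

Writing one part per page means a failed run leaves the completed parts on disk, and the full result set never has to be held in memory at once. A parquet-writing version would swap the JSON Lines writer for a parquet writer while keeping the same loop.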
Refua Data ships with a catalog covering molecular properties, bioactivity, toxicity, and protein targets.
ZINC15 250K sample plus five tranche presets covering drug-like compounds across molecular weight and logP bins. Includes in-stock, agent, and boutique tranches.
MoleculeNet benchmark datasets for molecular property prediction: Tox21, BBBP, BACE, ClinTox, SIDER, HIV, MUV, ESOL, FreeSolv, Lipophilicity, and PCBA.
Human Ki and IC50 activity data, binding assays, single-protein targets, and Phase 3+ molecules fetched directly from the ChEMBL REST API.
Reviewed human proteome, kinases, GPCRs, ion channels, and transporters fetched from the UniProt REST API, with local caching and conditional refresh.
pip install refua-data
# List available datasets
refua-data list
# Fetch raw data with caching
refua-data fetch zinc15_250k
refua-data fetch chembl_activity_ki_human
# Materialize to parquet
refua-data materialize zinc15_250k
# Validate data sources
refua-data validate-sources --fail-on-error
from refua_data import DatasetManager
manager = DatasetManager()
manager.fetch("zinc15_250k")
result = manager.materialize("zinc15_250k")
print(result.parquet_dir)
Refua Data gives you curated, cached, and materialized datasets for drug discovery so you can spend your time on science instead of data wrangling.