Data Engineering

Refua Data

A curated dataset catalog with intelligent local caching and parquet materialization, optimized for downstream drug discovery modeling and campaign workflows.

25+ Datasets
Parquet Materialization
Smart Caching
3 Major Sources

What Is Refua Data?

Refua Data is the data layer for the Refua ecosystem. It provides a built-in catalog of curated drug discovery datasets and a download pipeline that handles caching, conditional refresh, and parquet materialization, so you do not have to manage data files by hand.
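To make the conditional refresh step concrete, here is a minimal sketch of how a downloader can revalidate a cached file over HTTP. It is not Refua Data's implementation: the function names and the metadata keys (etag, last_modified) are hypothetical. Only the header names and the 304 status code come from the HTTP standard.

```python
from typing import Mapping, Optional


def conditional_headers(cached: Optional[Mapping[str, str]]) -> dict:
    """Build validator headers from metadata saved on a previous download.

    The metadata keys used here are assumptions made for this sketch.
    """
    headers = {}
    if not cached:
        return headers  # nothing cached yet, so send a plain GET
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def should_reuse_cache(status_code: int) -> bool:
    """HTTP 304 Not Modified means the cached copy is still current."""
    return status_code == 304


meta = {"etag": '"abc123"', "last_modified": "Wed, 01 Jan 2025 00:00:00 GMT"}
print(conditional_headers(meta))
print(should_reuse_cache(304))
```

When the server answers 304, the body is empty and the cached file can be reused as-is. Any 200 response carries a fresh payload and new validators to store.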

The package supports both static file datasets (ZINC, MoleculeNet benchmarks) and API-based presets (ChEMBL, UniProt) with paginated JSON ingestion and incremental parquet materialization.
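Paginated JSON ingestion generally means following a "next page" pointer until it runs out. The sketch below shows that loop with an injected fetch function, so it runs without network access. The response shape (a list of records under a named key, plus page_meta with a next field) is loosely modeled on ChEMBL-style responses and is an assumption here. The names iter_records and FAKE_PAGES are hypothetical and not part of refua-data.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def iter_records(
    fetch_page: Callable[[str], Page],
    first_url: str,
    records_key: str,
) -> Iterator[dict]:
    """Yield records page by page until no 'next' pointer is returned."""
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        yield from page.get(records_key, [])
        # Assumed layout: {"page_meta": {"next": <url or None>}}
        url = (page.get("page_meta") or {}).get("next")


# Stand-in for an HTTP client: maps a URL to an already-decoded JSON page.
FAKE_PAGES = {
    "page-1": {"activities": [{"id": 1}, {"id": 2}], "page_meta": {"next": "page-2"}},
    "page-2": {"activities": [{"id": 3}], "page_meta": {"next": None}},
}

records = list(iter_records(FAKE_PAGES.__getitem__, "page-1", "activities"))
print(len(records))
```

Because the function is a generator, records can be handed to a writer as they arrive instead of being held in memory all at once.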

Key Features

  • Built-in catalog of 25+ drug discovery datasets
  • Dataset-aware download with cache reuse and metadata tracking
  • HTTP conditional refresh (ETag / Last-Modified)
  • Incremental parquet materialization with chunked processing
  • API ingestion for ChEMBL and UniProt
  • Source health checks for CI and environment diagnostics
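The chunked processing mentioned in the feature list can be pictured as writing one part file per fixed-size batch of rows, so the whole dataset never sits in memory at once. The sketch below is a generic illustration and not the package's code. The function names, the part-file naming scheme, and the injected write_part callback are all assumptions. With pyarrow installed, the callback could be something like pyarrow.parquet.write_table(pyarrow.Table.from_pylist(chunk), path).

```python
from itertools import islice
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def iter_chunks(rows: Iterable, chunk_size: int) -> Iterator[list]:
    """Split any iterable into lists of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def materialize_in_chunks(
    rows: Iterable,
    out_dir: Path,
    chunk_size: int,
    write_part: Callable[[list, Path], None],
) -> List[Path]:
    """Hand each chunk to write_part under a numbered part-file path."""
    written = []
    for index, chunk in enumerate(iter_chunks(rows, chunk_size)):
        part = out_dir / f"part-{index:05d}.parquet"  # hypothetical naming
        write_part(chunk, part)
        written.append(part)
    return written


def fake_writer(chunk: list, path: Path) -> None:
    # Stand-in writer for the demo; it only reports what it would write.
    print(f"{path.name}: {len(chunk)} rows")


rows = ({"smiles": "C" * n} for n in range(1, 6))
parts = materialize_in_chunks(rows, Path("out"), 2, fake_writer)
```

Injecting the writer keeps the chunking logic testable without a parquet library.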

Included Datasets

Refua Data ships with a comprehensive catalog covering molecular properties, bioactivity, toxicity, and protein targets.

⚗️

ZINC Datasets

ZINC15 250K sample plus five tranche presets covering drug-like compounds across molecular weight and logP bins. Includes in-stock, agent, and boutique tranches.

🧪

MoleculeNet Benchmarks

Tox21, BBBP, BACE, ClinTox, SIDER, HIV, MUV, ESOL, FreeSolv, Lipophilicity, and PCBA benchmark datasets for molecular property prediction.

📊

ChEMBL Bioactivity

Human Ki and IC50 activity data, binding assays, single-protein targets, and Phase 3+ molecules fetched directly from the ChEMBL REST API.

🧬

UniProt Protein Targets

Reviewed human proteome, kinases, GPCRs, ion channels, and transporters fetched from the UniProt REST API and cached locally.

Get Started

Installation

pip install refua-data

CLI Usage

# List available datasets
refua-data list

# Fetch raw data with caching
refua-data fetch zinc15_250k
refua-data fetch chembl_activity_ki_human

# Materialize to parquet
refua-data materialize zinc15_250k

# Validate data sources
refua-data validate-sources --fail-on-error

Python API

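from refua_data import DatasetManager

manager = DatasetManager()

# Fetch the raw dataset (reuses the local cache when possible)
manager.fetch("zinc15_250k")

# Materialize to parquet and print where the output was written
result = manager.materialize("zinc15_250k")
print(result.parquet_dir)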
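Once a dataset is materialized, a common next step is to find the parquet files under the output directory. The sketch below assumes that parquet_dir points to a directory containing one or more .parquet files. The exact layout is not documented here, so treat that as an assumption. The helper list_parquet_files is hypothetical and not part of refua-data. The demo uses a temporary directory in place of a real materialization result.

```python
import tempfile
from pathlib import Path
from typing import List


def list_parquet_files(parquet_dir) -> List[Path]:
    """Return all .parquet files under parquet_dir in a stable, sorted order."""
    return sorted(Path(parquet_dir).rglob("*.parquet"))


# Demo with a temporary directory standing in for result.parquet_dir.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "part-00001.parquet").touch()
    (Path(tmp) / "part-00000.parquet").touch()
    (Path(tmp) / "metadata.json").touch()  # non-parquet files are ignored
    names = [p.name for p in list_parquet_files(tmp)]

print(names)
```

If pandas and pyarrow are installed, the same directory can usually be loaded in one call with pandas.read_parquet(parquet_dir).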

Data, Ready to Model

Refua Data gives you curated, cached, and materialized datasets for drug discovery so you can spend your time on science instead of data wrangling.