MoleculIB: Molecular Data Library for Machine Learning

MoleculIB is a Python package for processing and managing biomolecular data in machine learning workflows. It provides tools for working with proteins, nucleic acids, and small molecules, and is intended to simplify the data pipelines that take raw biomolecular data to results.

Key Features

Data Processing: Preprocessing and cleaning capabilities for data normalization and feature engineering
Machine Learning-First: Designed to integrate with JAX and pytrees in machine learning pipelines
Reproducibility: Modular classes and workflows for better organization and reproducible research
Multi-scale Representation: Work with molecules at different scales, from sequences to 3D structures
Interoperability: Seamless conversion between common biomolecular file formats and data structures
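
To make the JAX and pytree point concrete, here is a minimal sketch of a container of arrays registered as a pytree so that it can pass through a jitted function. ToyDatum and center are hypothetical names invented for this example and are not MoleculIB classes or functions; only the jax.tree_util and jax.jit calls are real JAX APIs.

```python
from dataclasses import dataclass

import jax
import jax.numpy as jnp


@jax.tree_util.register_pytree_node_class
@dataclass
class ToyDatum:
    """Hypothetical container for illustration; not part of MoleculIB."""

    coords: jnp.ndarray  # shape (n_atoms, 3)
    mask: jnp.ndarray    # shape (n_atoms,), 1.0 where the atom is present

    def tree_flatten(self):
        # Children are the array leaves; there is no static auxiliary data.
        return (self.coords, self.mask), None

    @classmethod
    def tree_unflatten(cls, aux_data, children):
        return cls(*children)


def center(datum):
    """Subtract the mask-weighted centroid from every coordinate."""
    weights = datum.mask[:, None]
    centroid = (datum.coords * weights).sum(axis=0) / weights.sum()
    return ToyDatum(datum.coords - centroid, datum.mask)


datum = ToyDatum(
    coords=jnp.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]),
    mask=jnp.array([1.0, 1.0]),
)

# Because ToyDatum is a registered pytree, jax.jit accepts and returns it directly.
centered = jax.jit(center)(datum)
print(centered.coords)
```

Registering the container once lets the same object be used with jax.jit, jax.vmap, and the jax.tree_util helpers without manual packing and unpacking of its arrays.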

Core Components

Protein Module

The protein module provides tools for working with protein structures and sequences:

ProteinDatum: Core class for protein structure representation with residue- and atom-level information
Manipulation of protein coordinates, sequences, and structural features
Support for common operations like alignment, chain separation, and structural analysis
Conversion between different protein file formats (PDB, mmCIF, MMTF)
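
As an illustration of what chain separation involves at the file level, the following standard-library sketch groups PDB ATOM records by chain identifier using the fixed-column layout of the PDB format. It is not MoleculIB's implementation; atom_line and split_chains are hypothetical helpers, and the coordinates are made up.

```python
from collections import defaultdict


def atom_line(serial, name, residue, chain, res_index, x, y, z):
    """Build a simplified PDB ATOM record (columns 1-54 only).

    Real files pad atom names by element width; here they are left-justified.
    """
    return (
        f"ATOM  {serial:>5d} {name:<4s} {residue:>3s} {chain}{res_index:>4d}"
        f"    {x:8.3f}{y:8.3f}{z:8.3f}"
    )


def split_chains(pdb_text):
    """Group atoms by the chain identifier stored in column 22."""
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM  ", "HETATM")):
            continue  # skip headers, remarks, TER, END, ...
        chains[line[21]].append(
            {
                "name": line[12:16].strip(),
                "residue": line[17:20].strip(),
                "residue_index": int(line[22:26]),
                "xyz": (
                    float(line[30:38]),
                    float(line[38:46]),
                    float(line[46:54]),
                ),
            }
        )
    return dict(chains)


# Toy two-chain input with invented coordinates.
pdb_text = "\n".join(
    [
        "REMARK toy example",
        atom_line(1, "N", "ALA", "A", 1, 11.104, 6.134, -6.504),
        atom_line(2, "CA", "ALA", "A", 1, 11.639, 6.071, -5.147),
        atom_line(3, "N", "GLY", "B", 1, 2.100, 1.500, 0.300),
    ]
)

chains = split_chains(pdb_text)
for chain_id, atoms in chains.items():
    print(chain_id, [atom["name"] for atom in atoms])
```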

Nucleic Module

The nucleic module handles RNA and DNA structures and sequences:

Support for standard nucleic acid operations and representations
Tools for working with RNA/DNA secondary structure
Conversion utilities for different nucleic acid file formats
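
Secondary structure is often written in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired ones. The source does not say which representation the nucleic module uses, so the sketch below is only a generic illustration; dot_bracket_to_pairs is a hypothetical helper, not a MoleculIB function, and it does not handle pseudoknot symbols.

```python
def dot_bracket_to_pairs(structure):
    """Return sorted (i, j) index pairs for a dot-bracket string."""
    stack = []
    pairs = []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {position}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


sequence = "GGGAAACCC"
structure = "(((...)))"

pairs = dot_bracket_to_pairs(structure)
for i, j in pairs:
    print(f"{sequence[i]}{i} pairs with {sequence[j]}{j}")
```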

Molecule Module

The molecule module focuses on small molecule representation and processing:

Tools for representing and manipulating small molecules
Atom typing and feature extraction
Conversion between different molecular file formats
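
A common first step in atom-level feature extraction is to one-hot encode element symbols. The sketch below is a generic example under assumed conventions: the ELEMENTS vocabulary and the one_hot_elements helper are hypothetical and do not reflect MoleculIB's atom typing scheme.

```python
# Illustrative vocabulary; a real atom typing scheme would be larger.
ELEMENTS = ["C", "N", "O", "S", "P", "H"]


def one_hot_elements(symbols, vocabulary=ELEMENTS):
    """Encode element symbols as one-hot rows, with a final 'unknown' slot."""
    index = {element: i for i, element in enumerate(vocabulary)}
    unknown = len(vocabulary)
    rows = []
    for symbol in symbols:
        row = [0] * (len(vocabulary) + 1)
        row[index.get(symbol, unknown)] = 1
        rows.append(row)
    return rows


features = one_hot_elements(["C", "O", "Cl"])
for row in features:
    print(row)
```

Reserving the last column for unknown elements keeps the feature width fixed even when a molecule contains atoms outside the vocabulary.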

Assembly Module

The assembly module handles multi-component biological assemblies:

Tools for working with protein-protein complexes
Support for protein-ligand and protein-nucleic acid interactions
Assembly generation and analysis capabilities
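
One simple way to describe an interaction between two components of an assembly is a distance-based contact list. The sketch below assumes a 4.0 Å cutoff and plain coordinate tuples; find_contacts is a hypothetical helper, not a MoleculIB function, and the coordinates are invented.

```python
import math


def find_contacts(coords_a, coords_b, cutoff=4.0):
    """Return (i, j) index pairs whose atoms lie within `cutoff` of each other."""
    return [
        (i, j)
        for i, atom_a in enumerate(coords_a)
        for j, atom_b in enumerate(coords_b)
        if math.dist(atom_a, atom_b) <= cutoff
    ]


# Toy coordinates in angstroms.
protein_atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
ligand_atoms = [(3.0, 0.0, 0.0), (12.0, 0.0, 0.0)]

contacts = find_contacts(protein_atoms, ligand_atoms)
print(contacts)
```

The all-pairs loop is quadratic in the number of atoms; for large assemblies a spatial index such as a cell list or k-d tree would be the usual replacement.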

Sequence Module

Tools for biological sequence analysis:

Sequence alignment and comparison
Feature extraction from sequences
Integration with structure-based features
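
For readers unfamiliar with sequence alignment, here is a minimal sketch of global alignment scoring using the Needleman-Wunsch recurrence with a linear gap penalty. The scoring values are arbitrary, and global_alignment_score is a hypothetical helper; the source does not specify which alignment method the sequence module provides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score the best global alignment of `a` and `b` (Needleman-Wunsch)."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(
                max(
                    previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                    previous[j] + gap,               # gap in b
                    current[j - 1] + gap,            # gap in a
                )
            )
        previous = current
    return previous[-1]


print(global_alignment_score("ACDE", "ACDE"))
print(global_alignment_score("ACDE", "ADE"))
```

Keeping only two rows of the dynamic programming table is enough when just the score is needed; recovering the alignment itself requires the full table or a traceback structure.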

Metrics and Loss

The library includes specialized modules for molecular metrics and loss functions:

Domain-specific metrics for evaluating molecular models
Custom loss functions designed for biomolecular machine learning tasks
Support for both training and evaluation workflows
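
Root-mean-square deviation (RMSD) between predicted and reference coordinates is one widely used structural metric. The source does not list the metrics MoleculIB implements, so the sketch below is a generic standard-library example; rmsd is a hypothetical helper, and it assumes the two structures are already superimposed.

```python
import math


def rmsd(predicted, target):
    """Root-mean-square deviation between two equally sized coordinate lists."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("coordinate lists must be non-empty and the same length")
    squared_error = sum(
        (p - t) ** 2
        for atom_p, atom_t in zip(predicted, target)
        for p, t in zip(atom_p, atom_t)
    )
    return math.sqrt(squared_error / len(predicted))


predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]

print(rmsd(predicted, target))
```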

Installation

Install MoleculIB via pip:

pip install moleculib

Requirements:

Python ≥ 3.7
NumPy ≥ 1.20.0
JAX ≥ 0.3.0
Biotite ≥ 0.30.0
py3Dmol ≥ 1.8.0 (for visualization)
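
To check an existing environment against these minimums, a small script along the following lines can help. It is a sketch, not part of MoleculIB: check_requirements and parse_version are hypothetical helpers, the distribution names are assumed to match the PyPI names, and importlib.metadata requires Python 3.8 or newer.

```python
from importlib import metadata

# Minimum versions copied from the requirements list above.
MINIMUM_VERSIONS = {
    "numpy": "1.20.0",
    "jax": "0.3.0",
    "biotite": "0.30.0",
    "py3Dmol": "1.8.0",
}


def parse_version(text):
    """Turn '1.20.0' or '1.26.0rc1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for character in piece:
            if not character.isdigit():
                break
            digits += character
        if not digits:
            break  # stop at suffixes such as 'dev1'
        parts.append(int(digits))
    return tuple(parts)


def check_requirements(minimums):
    """Report 'ok', 'missing', or a 'too old' message per distribution."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) >= parse_version(minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


for name, status in check_requirements(MINIMUM_VERSIONS).items():
    print(f"{name}: {status}")
```

The version parser is deliberately simple and ignores pre-release ordering; the packaging library offers a complete implementation if exact comparison matters.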

Getting Started

Basic usage example:

import moleculib as mol

# Load a protein from a PDB file
protein = mol.protein.ProteinDatum.from_filepath("example.pdb")

# Access protein attributes
print(f"Number of residues: {len(protein.residue_token)}")
print(f"Protein sequence: {protein.get_sequence()}")

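# Perform operations
chains = mol.protein.ProteinDatum.separate_chains(protein)
# Load a second structure to align against ("reference.pdb" is a placeholder path)
reference_protein = mol.protein.ProteinDatum.from_filepath("reference.pdb")
aligned_protein = protein.align_to(reference_protein)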

# Save in different format
protein.save_mmcif("output.cif")

For more examples and detailed documentation, see the specific module sections below.

API Reference

Development
