MoleculIB: Molecular Data Library for Machine Learning
======================================================

**MoleculIB** is a Python package for processing and managing biomolecular data in machine learning workflows. It provides tools for working with proteins, nucleic acids, and small molecules, simplifying data pipelines and shortening the path from raw biomolecular data to results.

Key Features
------------

* **Data Processing**: Preprocessing and cleaning capabilities for data normalization and feature engineering
* **Machine Learning-First**: Designed to integrate with JAX and pytrees in machine learning pipelines
* **Reproducibility**: Modular classes and workflows for better organization and reproducible research
* **Multi-scale Representation**: Work with molecules at different scales, from sequences to 3D structures
* **Interoperability**: Conversion between common biomolecular file formats and data structures

Core Components
---------------

Protein Module
~~~~~~~~~~~~~~

The ``protein`` module provides tools for working with protein structures and sequences:

* ``ProteinDatum``: Core class for protein structure representation with residue- and atom-level information
* Manipulation of protein coordinates, sequences, and structural features
* Support for common operations like alignment, chain separation, and structural analysis
* Conversion between different protein file formats (PDB, mmCIF, MMTF)

Nucleic Module
~~~~~~~~~~~~~~

The ``nucleic`` module handles RNA and DNA structures and sequences:

* Support for standard nucleic acid operations and representations
* Tools for working with RNA/DNA secondary structure
* Conversion utilities for different nucleic acid file formats

Molecule Module
~~~~~~~~~~~~~~~

The ``molecule`` module focuses on small molecule representation and processing:

* Tools for representing and manipulating small molecules
* Atom typing and feature extraction
* Conversion between different molecular file formats

Assembly Module
~~~~~~~~~~~~~~~

The ``assembly`` module handles multi-component biological assemblies:

* Tools for working with protein-protein complexes
* Support for protein-ligand and protein-nucleic acid interactions
* Assembly generation and analysis capabilities

Sequence Module
~~~~~~~~~~~~~~~

Tools for biological sequence analysis:

* Sequence alignment and comparison
* Feature extraction from sequences
* Integration with structure-based features

Metrics and Loss
~~~~~~~~~~~~~~~~

The library includes specialized modules for molecular metrics and loss functions:

* Domain-specific metrics for evaluating molecular models
* Custom loss functions designed for biomolecular machine learning tasks
* Support for both training and evaluation workflows

Installation
------------

Install MoleculIB via pip::

    pip install moleculib

Requirements:

* Python ≥ 3.7
* NumPy ≥ 1.20.0
* JAX ≥ 0.3.0
* Biotite ≥ 0.30.0
* py3Dmol ≥ 1.8.0 (for visualization)

Getting Started
---------------

Basic usage example::

    import moleculib as mol

    # Load a protein from a PDB file
    protein = mol.protein.ProteinDatum.from_filepath("example.pdb")

    # Access protein attributes
    print(f"Number of residues: {len(protein.residue_token)}")
    print(f"Protein sequence: {protein.get_sequence()}")

    # Split the structure into its individual chains
    chains = mol.protein.ProteinDatum.separate_chains(protein)

    # Align to a reference structure (the file name here is a placeholder)
    reference_protein = mol.protein.ProteinDatum.from_filepath("reference.pdb")
    aligned_protein = protein.align_to(reference_protein)

    # Save in a different format
    protein.save_mmcif("output.cif")

For more examples and detailed documentation, see the specific module sections below.

API Reference
-------------

.. toctree::
   :maxdepth: 2
   :caption: Modules

   modules/protein
   modules/nucleic
   modules/molecule
   modules/assembly
   modules/sequence
   modules/metrics
   modules/loss

.. toctree::
   :maxdepth: 1
   :caption: Development

   contributing
   changelog

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
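
Illustrative Sketch: Chain Separation
-------------------------------------

The chain-separation step mentioned under Getting Started can be pictured with a small, dependency-free sketch. This is not MoleculIB's implementation: ``ToyDatum`` and the ``separate_chains`` function below are hypothetical stand-ins, and the assumption that a datum stores one residue token and one chain identifier per position is made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ToyDatum:
    """Hypothetical per-residue record; NOT MoleculIB's ProteinDatum."""

    residue_token: List[str]  # one residue code per position
    chain_token: List[str]    # chain identifier for each position


def separate_chains(datum: ToyDatum) -> Dict[str, ToyDatum]:
    """Group residues by chain identifier, keeping residue order within each chain."""
    groups: Dict[str, List[str]] = {}
    for residue, chain in zip(datum.residue_token, datum.chain_token):
        groups.setdefault(chain, []).append(residue)
    # Build one ToyDatum per chain, repeating the chain id for every residue
    return {
        chain: ToyDatum(residue_token=residues, chain_token=[chain] * len(residues))
        for chain, residues in groups.items()
    }


# A five-residue toy structure with two chains, "A" and "B"
toy = ToyDatum(
    residue_token=["M", "K", "V", "G", "A"],
    chain_token=["A", "A", "B", "B", "B"],
)
chains = separate_chains(toy)
```

Here ``chains`` maps each chain identifier to its own record, so downstream steps can process chains independently. The real ``ProteinDatum`` carries coordinates and atom-level data as well; consult the protein module reference for its actual behavior.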