# RDTools
High-performance molecular operations using RDKit C++ with numpy arrays.
RDTools provides fast molecular descriptor calculations, SMILES validation, and fingerprint generation by exposing RDKit's C++ core through nanobind bindings. This allows efficient processing of large numpy arrays of SMILES strings.
## Features
- **Fast Molecular Descriptors**: Calculate molecular weight, LogP, TPSA in parallel
- **SMILES Validation**: Efficiently validate large arrays of SMILES strings
- **Fingerprint Generation**: Morgan fingerprints as numpy bit vectors
- **TensorFlow Integration**: Custom TensorFlow operations for data pipelines (tf.data compatible)
- **Batch Processing**: Optimized for large datasets with batch processing utilities
- **NumPy Integration**: Native support for numpy arrays with proper error handling
## Installation
### System Requirements
- **Operating System**: Linux (Ubuntu/Debian recommended), macOS, or Windows
- **Python**: 3.12 or later (specified in pyproject.toml)
- **Memory**: At least 4GB RAM recommended for building
- **Disk Space**: ~500MB for dependencies and build artifacts
### Dependencies
#### 1. System Dependencies
**Ubuntu/Debian (Recommended)**:
```bash
# Update package manager
sudo apt-get update
# Install RDKit C++ libraries and development headers
sudo apt-get install -y python3-rdkit librdkit-dev librdkit1 rdkit-data
# Install build essentials (if not already installed)
sudo apt-get install -y build-essential cmake pkg-config
```
**macOS with Homebrew**:
```bash
# Install RDKit
brew install rdkit
# Install build tools (usually pre-installed with Xcode)
xcode-select --install
```
**Windows**:
```bash
# Using conda (recommended for Windows)
conda install -c conda-forge rdkit-dev rdkit
# You'll also need Visual Studio Build Tools
```
#### 2. Python Dependencies
The build system automatically handles Python dependencies, but here's what gets installed:
**Build-time dependencies** (handled by pyproject.toml):
- `scikit-build-core>=0.10` - Modern CMake-based Python build backend
- `nanobind>=2.0.0` - Fast C++/Python binding library
- `tensorflow==2.19.0` - For TensorFlow custom operations (built automatically when available)
**Runtime dependencies**:
- `numpy>=1.20.0` - Core numerical operations
- `tensorflow==2.19.0` - Required for TensorFlow custom operations
**Development dependencies** (optional):
- `pytest>=6.0` - Testing framework
- `black` - Code formatting
- `isort` - Import sorting
### Build and Installation
#### Method 1: Quick Install (Recommended)
```bash
# 1. Clone the repository
git clone <repository-url>
cd rdtools
# 2. Install using uv (handles everything automatically)
uv sync
# 3. Build the C++ extension (with verbose output for debugging)
SKBUILD_EDITABLE_VERBOSE=1 uv sync --reinstall
```
#### Method 2: Step-by-Step Installation
```bash
# 1. Clone and navigate
git clone <repository-url>
cd rdtools
# 2. Create and activate virtual environment (if not using uv)
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# or
.venv\Scripts\activate # Windows
# 3. Install build dependencies
pip install scikit-build-core nanobind numpy
# 4. Build and install in editable mode
pip install -e .
```
#### Method 3: Manual CMake Build (Advanced)
For development or debugging the build process:
```bash
# 1. Ensure RDKit is installed (see System Dependencies above)
# 2. Create build directory
mkdir build && cd build
# 3. Configure with CMake
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_STANDARD=17 \
-DPYTHON_EXECUTABLE=$(which python)
# 4. Build with verbose output
make VERBOSE=1 -j$(nproc) # Linux (on macOS use -j$(sysctl -n hw.ncpu), since nproc is not available there)
# or
cmake --build . --config Release --verbose # Windows
# 5. The extension will be in: build/_rdtools_core.so (or .pyd on Windows)
```
### Verification
After installation, verify everything works:
```bash
# Test basic import
uv run python -c "import rdtools; print('RDTools version:', rdtools.__version__)"
# Test C++ extension loading
uv run python -c "import rdtools; print('Extension available:', rdtools._EXTENSION_AVAILABLE)"
# Run basic functionality test
uv run python -c "
import rdtools
import numpy as np
smiles = np.array(['CCO', 'c1ccccc1'])
result = rdtools.molecular_weights(smiles)
print('Molecular weights:', result)
print('✓ Installation successful!')
"
# Run the comprehensive example
uv run python examples/basic_usage.py
```
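The same checks can be run from a single Python helper, which is convenient in CI. The following is a minimal sketch: `check_install` is a hypothetical helper name, not part of RDTools, and it only reads the `__version__` and `_EXTENSION_AVAILABLE` attributes used in the shell commands above.

```python
import importlib
import importlib.util


def check_install(module_name: str = "rdtools") -> dict:
    """Report whether a package is importable and whether its compiled core loaded."""
    report = {"found": False, "version": None, "extension": None}
    if importlib.util.find_spec(module_name) is None:
        return report  # package is not installed in this environment
    module = importlib.import_module(module_name)
    report["found"] = True
    # Both attributes appear in the shell checks above; getattr keeps this
    # helper safe if a given build does not define them.
    report["version"] = getattr(module, "__version__", None)
    report["extension"] = getattr(module, "_EXTENSION_AVAILABLE", None)
    return report
```

Calling `check_install()` in the project environment should report `found` as true and, if the C++ extension loaded, a truthy `extension` value.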
### Troubleshooting
#### RDKit Not Found
```
Error: "RDKit not found. Please install RDKit or set RDBASE environment variable."
```
**Solution**: Install RDKit system packages as shown in System Dependencies above.
#### CMake Version Too Old
```
Error: "CMake 3.15 or higher is required"
```
**Solution**:
```bash
# Ubuntu/Debian
sudo apt-get install cmake
# Verify version
cmake --version
```
#### Compilation Errors
```
Error: Cannot find RDKit headers
```
**Solutions**:
1. Install development packages: `sudo apt-get install librdkit-dev`
2. Set environment variable: `export RDBASE=/path/to/rdkit`
3. For conda installations: `export RDBASE=$CONDA_PREFIX`
#### Python Version Issues
```
Error: "requires-python = '>=3.12'"
```
**Solution**: Use Python 3.12 or later. Check with: `python --version`
#### Import Errors After Build
```
ImportError: cannot import name '_rdtools_core'
```
**Solutions**:
1. Rebuild: `SKBUILD_EDITABLE_VERBOSE=1 uv sync --reinstall`
2. Check installation: `find .venv -name "*rdtools*" -type f`
3. Verify RDKit: `python -c "import rdkit; print(rdkit.__version__)"`
#### TensorFlow Custom Ops Errors
```
ImportError: RDK-Tools TensorFlow ops not available: undefined symbol
```
**Solutions**:
1. Ensure TensorFlow versions match between build and runtime environments
2. Rebuild with correct TensorFlow version: `rm -rf build/ dist/ && uv build`
3. For distribution, use: `uv run auditwheel repair dist/*.whl --exclude "libtensorflow*"`
### Build Configuration
The build process is configured via `pyproject.toml`:
```toml
[tool.scikit-build]
build-dir = "build/{wheel_tag}" # Separate builds for different Python versions
cmake.version = ">=3.15" # Minimum CMake version
cmake.define.CMAKE_BUILD_TYPE = "Release" # Optimized builds
cmake.define.CMAKE_CXX_STANDARD = "17" # C++17 for RDKit compatibility
```
Key files:
- `pyproject.toml` - Python packaging and build configuration
- `CMakeLists.txt` - CMake build instructions for C++ extension
- `src/cpp/` - C++ source code using RDKit and nanobind
- `src/rdtools/__init__.py` - Python API wrapper
## Quick Start
```python
import numpy as np
import rdtools
# Sample SMILES strings
smiles = np.array(['CCO', 'c1ccccc1', 'CC(=O)O', 'invalid_smiles'])
# Validate SMILES
valid = rdtools.is_valid(smiles)
print(f"Valid: {valid}") # [True, True, True, False]
# Calculate molecular descriptors
descriptors = rdtools.descriptors(smiles)
print(f"Molecular weights: {descriptors['molecular_weight']}")
print(f"LogP values: {descriptors['logp']}")
print(f"TPSA values: {descriptors['tpsa']}")
# Generate fingerprints
fingerprints = rdtools.morgan_fingerprints(smiles, radius=2, nbits=2048)
print(f"Fingerprint shape: {fingerprints.shape}") # (4, 2048)
```
## API Reference
### Core Functions
#### `rdtools.molecular_weights(smiles_array)`
Calculate molecular weights for SMILES strings.
**Parameters:**
- `smiles_array`: numpy array or list of SMILES strings
**Returns:**
- numpy array of molecular weights (float64), NaN for invalid SMILES
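
Because invalid SMILES come back as NaN rather than raising, downstream code usually needs to separate the two cases. The sketch below works on any float array; `summarize_weights` is a hypothetical helper, and the sample array only stands in for the output of `rdtools.molecular_weights`.

```python
import numpy as np


def summarize_weights(weights):
    """Count NaN entries, locate them, and average the remaining values."""
    weights = np.asarray(weights, dtype=np.float64)
    invalid = np.isnan(weights)  # True where the SMILES failed to parse
    has_valid = bool((~invalid).any())
    return {
        "n_invalid": int(invalid.sum()),
        "invalid_index": np.flatnonzero(invalid),
        # nanmean warns on an all-NaN input, so guard that case explicitly
        "mean_valid": float(np.nanmean(weights)) if has_valid else float("nan"),
    }


# Stand-in for rdtools.molecular_weights(...) output (approximate values).
example = np.array([46.07, 78.11, np.nan])
print(summarize_weights(example))
```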
#### `rdtools.logp(smiles_array)`
Calculate LogP (octanol-water partition coefficient) values.
#### `rdtools.tpsa(smiles_array)`
Calculate TPSA (Topological Polar Surface Area) values.
#### `rdtools.is_valid(smiles_array)`
Validate SMILES strings.
**Returns:**
- boolean numpy array indicating validity
#### `rdtools.descriptors(smiles_array)`
Calculate multiple descriptors efficiently in a single pass.
**Returns:**
- Dictionary with keys: 'molecular_weight', 'logp', 'tpsa'
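
For model input it is often handier to have one matrix than three separate arrays. A minimal sketch, assuming each dictionary value is a 1-D float array with one entry per molecule (the README does not state the value type explicitly); `descriptors_to_matrix` is a hypothetical helper, not an RDTools function.

```python
import numpy as np

COLUMNS = ("molecular_weight", "logp", "tpsa")  # keys listed above


def descriptors_to_matrix(desc, columns=COLUMNS):
    """Stack per-descriptor 1-D arrays into an (n_molecules, n_descriptors) matrix."""
    arrays = [np.asarray(desc[name], dtype=np.float64) for name in columns]
    lengths = {a.shape[0] for a in arrays}
    if len(lengths) != 1:
        raise ValueError(f"descriptor arrays differ in length: {sorted(lengths)}")
    return np.column_stack(arrays)
```

The column order follows `columns`, so the same tuple can be reused as feature names.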
#### `rdtools.canonical_smiles(smiles_array)`
Convert SMILES to canonical form.
#### `rdtools.morgan_fingerprints(smiles_array, radius=2, nbits=2048)`
Calculate Morgan fingerprints as bit vectors.
**Parameters:**
- `radius`: fingerprint radius (default: 2)
- `nbits`: number of bits (default: 2048)
**Returns:**
- 2D numpy array of shape (n_molecules, nbits) with dtype uint8
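
A common next step with bit-vector fingerprints is pairwise Tanimoto similarity. The sketch below assumes the array holds one 0/1 value per bit, as the `uint8` shape above suggests; `tanimoto_matrix` is a hypothetical helper and not part of RDTools. Pairs in which both fingerprints are all zeros are given similarity 0 by convention here.

```python
import numpy as np


def tanimoto_matrix(fps):
    """Pairwise Tanimoto similarity for an (n, nbits) array of 0/1 values."""
    bits = np.asarray(fps).astype(np.int64)  # widen so counts cannot overflow uint8
    common = bits @ bits.T                   # |A & B| for every pair
    counts = bits.sum(axis=1)                # |A| per row
    union = counts[:, None] + counts[None, :] - common
    sim = np.zeros(union.shape, dtype=np.float64)
    # Divide only where the union is non-empty; other entries stay 0.
    np.divide(common, union, out=sim, where=union > 0)
    return sim
```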
### Utility Functions
#### `rdtools.filter_valid(smiles_array)`
Filter array to keep only valid SMILES.
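
When the original positions of the surviving entries matter (for example, to align results with labels), a boolean mask can be applied by hand. This is a minimal sketch of that idea, not the implementation of `filter_valid`; `filter_with_index` is a hypothetical name, and the mask is assumed to come from something like `rdtools.is_valid`.

```python
import numpy as np


def filter_with_index(smiles, valid_mask):
    """Keep entries flagged valid and return their original positions as well."""
    smiles = np.asarray(smiles)
    valid_mask = np.asarray(valid_mask, dtype=bool)
    if smiles.shape != valid_mask.shape:
        raise ValueError("mask and SMILES array must have the same shape")
    kept_index = np.flatnonzero(valid_mask)
    return smiles[valid_mask], kept_index
```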
#### `rdtools.batch_process(smiles_array, batch_size=1000, **kwargs)`
Process large arrays in batches with comprehensive results.
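
The README does not document what `batch_process` returns or which keyword arguments it accepts, so the following only illustrates the general batching pattern. `iter_batches` and `process_in_batches` are hypothetical helpers that work on lists and numpy arrays alike.

```python
def iter_batches(items, batch_size=1000):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(items, func, batch_size=1000):
    """Apply func to each batch and collect per-batch results in order."""
    return [func(batch) for batch in iter_batches(items, batch_size)]
```

Passing a function such as `rdtools.descriptors` as `func` would bound peak memory by the batch size rather than by the full array.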
### TensorFlow Operations
#### `rdtools.tf_ops.string_process(smiles_tensor)`
TensorFlow custom operation for string processing in data pipelines.
**Parameters:**
- `smiles_tensor`: TensorFlow string tensor
**Returns:**
- TensorFlow string tensor with processed SMILES
**Example:**
```python
import tensorflow as tf
import rdtools.tf_ops
# Use in tf.data pipeline
dataset = tf.data.Dataset.from_tensor_slices(["CCO", "c1ccccc1"])
dataset = dataset.map(rdtools.tf_ops.string_process)
```
## Performance
RDTools is optimized for high-throughput molecular processing:
- **Batch Processing**: Calculate multiple descriptors in a single pass
- **C++ Core**: Uses RDKit's optimized C++ implementation
- **Memory Efficient**: Minimal Python overhead with direct numpy array access
- **Parallel Ready**: Functions are designed to work well with joblib/multiprocessing
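
One way to use the functions in parallel is to split the input into contiguous chunks and map a function over them. The sketch below uses the standard library's `concurrent.futures` as a stand-in for joblib or multiprocessing; `split_chunks` and `map_chunks` are hypothetical helpers. Threads only help if the underlying C++ call releases the GIL, which this README does not state, so a `ProcessPoolExecutor` can be passed as `executor_cls` instead (the function must then be picklable).

```python
from concurrent.futures import ThreadPoolExecutor


def split_chunks(items, n_chunks):
    """Split a sequence into n_chunks contiguous pieces of near-equal size."""
    n_chunks = max(1, min(n_chunks, len(items) or 1))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(items[start:stop])
        start = stop
    return chunks


def map_chunks(func, items, n_chunks=4, executor_cls=ThreadPoolExecutor):
    """Run func on each chunk with the given executor; results keep input order."""
    chunks = split_chunks(items, n_chunks)
    with executor_cls(max_workers=len(chunks)) as pool:
        return list(pool.map(func, chunks))
```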
### Benchmarks
On a modern CPU, RDTools can process:
- ~10,000-50,000 molecules/second for descriptor calculations
- ~5,000-20,000 molecules/second for fingerprint generation
Performance varies by molecule complexity and system specifications.
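
To see where a particular machine and dataset fall within these ranges, throughput can be measured directly. A minimal timing sketch; `molecules_per_second` is a hypothetical helper and `func` stands for any array-level function such as `rdtools.descriptors`.

```python
import time


def molecules_per_second(func, smiles, repeats=3):
    """Return the best observed throughput of func(smiles) over several runs."""
    if len(smiles) == 0:
        raise ValueError("need at least one molecule to time")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(smiles)
        best = min(best, time.perf_counter() - start)  # keep the fastest run
    return len(smiles) / best if best > 0 else float("inf")
```

Taking the fastest of several runs reduces the influence of one-off costs such as cache warm-up.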
## Examples
See the `examples/` directory for detailed usage examples:
- `basic_usage.py`: Comprehensive demonstration of all features
- `performance_comparison.py`: Benchmarking against pure Python approaches
## Development
### Development Setup
1. **Clone and install in development mode**:
```bash
git clone <repository-url>
cd rdtools
# Install with development dependencies
uv sync
# Build the C++ extension
SKBUILD_EDITABLE_VERBOSE=1 uv sync --reinstall
```
2. **Development workflow**:
```bash
# Run tests
uv run pytest tests/
# Code formatting
uv run black src/
uv run isort src/
# Build wheel for distribution
uv build
uv run auditwheel repair dist/*.whl --exclude "libtensorflow*"
# Run examples
uv run python examples/basic_usage.py
```
3. **Rebuilding after C++ changes**:
```bash
# Clean rebuild (if needed)
rm -rf build/
SKBUILD_EDITABLE_VERBOSE=1 uv sync --reinstall
# Or force rebuild with pip
uv run python -m pip install -e . --force-reinstall --no-deps
```
### Testing
Run the test suite to ensure everything works:
```bash
# Run all tests
uv run pytest tests/ -v
# Run specific test
uv run pytest tests/test_rdtools.py::TestBasicFunctions::test_molecular_weights -v
# Test with coverage
uv run pytest tests/ --cov=rdtools --cov-report=html
```
### Build Debugging
If you encounter build issues:
```bash
# Build with maximum verbosity
SKBUILD_EDITABLE_VERBOSE=1 VERBOSE=1 uv sync --reinstall
# Check CMake configuration
cd build/cp312-cp312-linux_x86_64 # (or your platform directory)
cmake ../.. -DCMAKE_BUILD_TYPE=Debug # the source tree is two levels above this directory
cd ../.. # return to the repository root
# Manual build for debugging
mkdir debug_build && cd debug_build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DCMAKE_VERBOSE_MAKEFILE=ON
make VERBOSE=1
```
### Project Structure
```
rdtools/
├── src/
│ ├── rdtools/ # Python package
│ │ └── __init__.py # Main API and function wrappers
│ └── cpp/ # C++ source code
│ ├── molecular_ops.hpp # C++ function declarations
│ ├── molecular_ops.cpp # C++ implementations using RDKit
│ └── pybind_module.cpp # Python binding definitions
├── examples/ # Usage examples and demos
│ └── basic_usage.py # Comprehensive feature demonstration
├── tests/ # Test suite
│ └── test_rdtools.py # Unit tests
├── build/ # Build artifacts (generated)
├── CMakeLists.txt # CMake build configuration
├── pyproject.toml # Project metadata and build config
├── uv.lock # Dependency lock file
└── CLAUDE.md # Development guidelines
```
## Dependencies
### Runtime Dependencies
- **Python**: >= 3.12 (as specified in pyproject.toml)
- **NumPy**: >= 1.20.0 (for array operations)
- **tensorflow**: == 2.19.0 (required for the TensorFlow custom operations)
### System Dependencies
- **RDKit**: >= 2022.9.1 (C++ libraries and headers required)
- `librdkit-dev` - Development headers
- `librdkit1` - Runtime libraries
- `rdkit-data` - Data files
- `python3-rdkit` - Python bindings (optional, for compatibility testing)
### Build Dependencies
- **scikit-build-core**: >= 0.10 (modern CMake-based build backend)
- **nanobind**: >= 2.0.0 (fast C++/Python binding library)
- **tensorflow**: == 2.19.0 (for TensorFlow custom operations)
- **CMake**: >= 3.15 (build system)
- **C++ Compiler**: Supporting C++17 standard (GCC 7+, Clang 5+, MSVC 2017+)
### Development Dependencies (Optional)
- **pytest**: >= 6.0 (testing framework)
- **black** (code formatting)
- **isort** (import sorting)
- **uv** (recommended package manager)
## License
MIT (as declared in the package metadata)
## Contributing
[Add contributing guidelines here]
## Citation
If you use RDTools in your research, please cite:
[Add citation information here]
Raw data
{
"_id": null,
"home_page": null,
"name": "rdkit-data-pipeline-tools",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.12",
"maintainer_email": null,
"keywords": "chemistry, cheminformatics, rdkit, molecular, descriptors",
"author": null,
"author_email": "wuhao <wuhao@novogaia.bio>",
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "High-performance molecular operations using RDKit's C++ core through nanobind bindings",
"version": "0.1.5",
"project_urls": null,
"split_keywords": [
"chemistry",
" cheminformatics",
" rdkit",
" molecular",
" descriptors"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "1e442b9aa0149c858912d09d61a0235177b1e00356bef74c754b782477d20134",
"md5": "4ec8359a95116ccddf75561809bdb7b3",
"sha256": "c8bf72a4846ca3a85a3e22f662fef5bab9732827aea0393219e707ac6b507432"
},
"downloads": -1,
"filename": "rdkit_data_pipeline_tools-0.1.5-cp312-cp312-manylinux_2_39_x86_64.whl",
"has_sig": false,
"md5_digest": "4ec8359a95116ccddf75561809bdb7b3",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.12",
"size": 6278440,
"upload_time": "2025-08-30T14:02:33",
"upload_time_iso_8601": "2025-08-30T14:02:33.759737Z",
"url": "https://files.pythonhosted.org/packages/1e/44/2b9aa0149c858912d09d61a0235177b1e00356bef74c754b782477d20134/rdkit_data_pipeline_tools-0.1.5-cp312-cp312-manylinux_2_39_x86_64.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-30 14:02:33",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "rdkit-data-pipeline-tools"
}