rdkit-data-pipeline-tools


Name: rdkit-data-pipeline-tools
Version: 0.1.5
Summary: High-performance molecular operations using RDKit's C++ core through nanobind bindings
Upload time: 2025-08-30 14:02:33
Requires Python: >=3.12
License: MIT
Keywords: chemistry, cheminformatics, rdkit, molecular, descriptors
Requirements: No requirements were recorded.

# RDTools

High-performance molecular operations using RDKit C++ with numpy arrays.

RDTools provides fast molecular descriptor calculations, SMILES validation, and fingerprint generation by exposing RDKit's C++ core through nanobind bindings. This allows efficient processing of large numpy arrays of SMILES strings.

## Features

- **Fast Molecular Descriptors**: Calculate molecular weight, LogP, and TPSA together in a single pass
- **SMILES Validation**: Efficiently validate large arrays of SMILES strings
- **Fingerprint Generation**: Morgan fingerprints as numpy bit vectors
- **TensorFlow Integration**: Custom TensorFlow operations for data pipelines (tf.data compatible)
- **Batch Processing**: Optimized for large datasets with batch processing utilities
- **NumPy Integration**: Native support for numpy arrays with proper error handling

## Installation

### System Requirements

- **Operating System**: Linux (Ubuntu/Debian recommended), macOS, or Windows
- **Python**: 3.12 or later (specified in pyproject.toml)
- **Memory**: At least 4GB RAM recommended for building
- **Disk Space**: ~500MB for dependencies and build artifacts

### Dependencies

#### 1. System Dependencies

**Ubuntu/Debian (Recommended)**:
```bash
# Update package manager
sudo apt-get update

# Install RDKit C++ libraries and development headers
sudo apt-get install -y python3-rdkit librdkit-dev librdkit1 rdkit-data

# Install build essentials (if not already installed)
sudo apt-get install -y build-essential cmake pkg-config
```

**macOS with Homebrew**:
```bash
# Install RDKit
brew install rdkit

# Install build tools (usually pre-installed with Xcode)
xcode-select --install
```

**Windows**:
```bash
# Using conda (recommended for Windows)
conda install -c conda-forge rdkit-dev rdkit
# You'll also need Visual Studio Build Tools
```

#### 2. Python Dependencies

The build system automatically handles Python dependencies, but here's what gets installed:

**Build-time dependencies** (handled by pyproject.toml):
- `scikit-build-core>=0.10` - Modern CMake-based Python build backend
- `nanobind>=2.0.0` - Fast C++/Python binding library
- `tensorflow==2.19.0` - For TensorFlow custom operations (built automatically when available)

**Runtime dependencies**:
- `numpy>=1.20.0` - Core numerical operations
- `tensorflow==2.19.0` - Required for TensorFlow custom operations

**Development dependencies** (optional):
- `pytest>=6.0` - Testing framework
- `black` - Code formatting
- `isort` - Import sorting

### Build and Installation

#### Method 1: Quick Install (Recommended)

```bash
# 1. Clone the repository
git clone <repository-url>
cd rdtools

# 2. Install using uv (handles everything automatically)
uv sync

# 3. Build the C++ extension (with verbose output for debugging)
SKBUILD_EDITABLE_VERBOSE=1 uv sync --reinstall
```

#### Method 2: Step-by-Step Installation

```bash
# 1. Clone and navigate
git clone <repository-url>
cd rdtools

# 2. Create and activate virtual environment (if not using uv)
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or
.venv\Scripts\activate     # Windows

# 3. Install build dependencies
pip install scikit-build-core nanobind numpy

# 4. Build and install in editable mode
pip install -e .
```

#### Method 3: Manual CMake Build (Advanced)

For development or debugging the build process:

```bash
# 1. Ensure RDKit is installed (see System Dependencies above)

# 2. Create build directory
mkdir build && cd build

# 3. Configure with CMake
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CXX_STANDARD=17 \
    -DPYTHON_EXECUTABLE=$(which python)

# 4. Build with verbose output
make VERBOSE=1 -j"$(nproc)"  # Linux; on macOS use -j"$(sysctl -n hw.ncpu)"
# or
cmake --build . --config Release --verbose  # Windows

# 5. The extension will be in: build/_rdtools_core.so (or .pyd on Windows)
```

### Verification

After installation, verify everything works:

```bash
# Test basic import
uv run python -c "import rdtools; print('RDTools version:', rdtools.__version__)"

# Test C++ extension loading
uv run python -c "import rdtools; print('Extension available:', rdtools._EXTENSION_AVAILABLE)"

# Run basic functionality test
uv run python -c "
import rdtools
import numpy as np
smiles = np.array(['CCO', 'c1ccccc1'])
result = rdtools.molecular_weights(smiles)
print('Molecular weights:', result)
print('✓ Installation successful!')
"

# Run the comprehensive example
uv run python examples/basic_usage.py
```
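
The same checks can be collected into one script. The following is a minimal sketch and not part of RDTools: `check_install` is a hypothetical helper that relies only on the standard library and on the `rdtools.__version__` and `rdtools._EXTENSION_AVAILABLE` attributes used in the commands above. It reports a missing or broken package instead of raising.

```python
import importlib
import importlib.util
import sys


def check_install(package: str = "rdtools") -> dict:
    """Collect basic installation facts without raising on import problems."""
    report = {
        "python_ok": sys.version_info >= (3, 12),  # README requires Python 3.12+
        "importable": importlib.util.find_spec(package) is not None,
        "version": None,
        "extension_available": None,
        "import_error": None,
    }
    if report["importable"]:
        try:
            module = importlib.import_module(package)
        except ImportError as exc:
            # The package is present but failed to load (e.g. a broken extension).
            report["import_error"] = str(exc)
            return report
        # getattr keeps the script usable if either attribute is absent.
        report["version"] = getattr(module, "__version__", None)
        report["extension_available"] = getattr(module, "_EXTENSION_AVAILABLE", None)
    return report


if __name__ == "__main__":
    for key, value in check_install().items():
        print(f"{key}: {value}")
```

Saved under a name of your choice, the script can be run with `uv run python <script>.py`.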

### Troubleshooting

#### RDKit Not Found
```
Error: "RDKit not found. Please install RDKit or set RDBASE environment variable."
```
**Solution**: Install RDKit system packages as shown in System Dependencies above.

#### CMake Version Too Old
```
Error: "CMake 3.15 or higher is required"
```
**Solution**: 
```bash
# Ubuntu/Debian
sudo apt-get install cmake
# Verify version
cmake --version
```

#### Compilation Errors
```
Error: Cannot find RDKit headers
```
**Solutions**:
1. Install development packages: `sudo apt-get install librdkit-dev`
2. Set environment variable: `export RDBASE=/path/to/rdkit`
3. For conda installations: `export RDBASE=$CONDA_PREFIX`

#### Python Version Issues
```
Error: "requires-python = '>=3.12'"
```
**Solution**: Use Python 3.12 or later. Check with: `python --version`

#### Import Errors After Build
```
ImportError: cannot import name '_rdtools_core'
```
**Solutions**:
1. Rebuild: `SKBUILD_EDITABLE_VERBOSE=1 uv sync --reinstall`
2. Check installation: `find .venv -name "*rdtools*" -type f`
3. Verify RDKit: `python -c "import rdkit; print(rdkit.__version__)"`

#### TensorFlow Custom Ops Errors
```
ImportError: RDK-Tools TensorFlow ops not available: undefined symbol
```
**Solutions**:
1. Ensure TensorFlow versions match between build and runtime environments
2. Rebuild with correct TensorFlow version: `rm -rf build/ dist/ && uv build`
3. For distribution, use: `uv run auditwheel repair dist/*.whl --exclude "libtensorflow*"`
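
To check the first point quickly, the installed TensorFlow version can be compared with the pinned one. This is a hedged sketch using only `importlib.metadata` from the standard library; `installed_version` and `tensorflow_matches_pin` are illustrative names, and the pin `2.19.0` is taken from the dependency lists in this README.

```python
from importlib import metadata
from typing import Optional

PINNED_TENSORFLOW = "2.19.0"  # version pinned in this README's dependency lists


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def tensorflow_matches_pin(pinned: str = PINNED_TENSORFLOW) -> bool:
    """True only when TensorFlow is installed at exactly the pinned version."""
    return installed_version("tensorflow") == pinned


if __name__ == "__main__":
    print("tensorflow installed:", installed_version("tensorflow"))
    print("matches pin:", tensorflow_matches_pin())
```

Run it in both the build and the runtime environment; differing output points to the version mismatch described above.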

### Build Configuration

The build process is configured via `pyproject.toml`:

```toml
[tool.scikit-build]
build-dir = "build/{wheel_tag}"        # Separate builds for different Python versions
cmake.version = ">=3.15"               # Minimum CMake version
cmake.define.CMAKE_BUILD_TYPE = "Release"  # Optimized builds
cmake.define.CMAKE_CXX_STANDARD = "17"     # C++17 for RDKit compatibility
```

Key files:
- `pyproject.toml` - Python packaging and build configuration
- `CMakeLists.txt` - CMake build instructions for C++ extension
- `src/cpp/` - C++ source code using RDKit and nanobind
- `src/rdtools/__init__.py` - Python API wrapper

## Quick Start

```python
import numpy as np
import rdtools

# Sample SMILES strings
smiles = np.array(['CCO', 'c1ccccc1', 'CC(=O)O', 'invalid_smiles'])

# Validate SMILES
valid = rdtools.is_valid(smiles)
print(f"Valid: {valid}")  # expected: [ True  True  True False]

# Calculate molecular descriptors
descriptors = rdtools.descriptors(smiles)
print(f"Molecular weights: {descriptors['molecular_weight']}")
print(f"LogP values: {descriptors['logp']}")
print(f"TPSA values: {descriptors['tpsa']}")

# Generate fingerprints
fingerprints = rdtools.morgan_fingerprints(smiles, radius=2, nbits=2048)
print(f"Fingerprint shape: {fingerprints.shape}")  # (4, 2048)
```

## API Reference

### Core Functions

#### `rdtools.molecular_weights(smiles_array)`
Calculate molecular weights for SMILES strings.

**Parameters:**
- `smiles_array`: numpy array or list of SMILES strings

**Returns:**
- numpy array of molecular weights (float64), NaN for invalid SMILES
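
Because invalid SMILES are reported as NaN rather than raising, callers usually need to separate them from the usable values. A minimal sketch of that step follows; `split_by_nan` is a hypothetical helper, and the weights are hand-written stand-ins so the example runs without RDTools installed. It accepts any sequence of floats, including the returned numpy array.

```python
import math


def split_by_nan(smiles, weights):
    """Pair each SMILES with its weight and separate NaN (invalid) entries."""
    valid, invalid = [], []
    for smi, weight in zip(smiles, weights):
        if math.isnan(weight):
            invalid.append(smi)
        else:
            valid.append((smi, float(weight)))
    return valid, invalid


# Stand-in values shaped like the documented return; in real use:
#   weights = rdtools.molecular_weights(smiles)
smiles = ["CCO", "not_a_smiles", "c1ccccc1"]
weights = [46.07, float("nan"), 78.11]  # illustrative numbers, not library output

valid, invalid = split_by_nan(smiles, weights)
print(valid)    # [('CCO', 46.07), ('c1ccccc1', 78.11)]
print(invalid)  # ['not_a_smiles']
```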

#### `rdtools.logp(smiles_array)`
Calculate LogP (octanol-water partition coefficient) values.

#### `rdtools.tpsa(smiles_array)`
Calculate TPSA (Topological Polar Surface Area) values.

#### `rdtools.is_valid(smiles_array)`
Validate SMILES strings.

**Returns:**
- boolean numpy array indicating validity

#### `rdtools.descriptors(smiles_array)`
Calculate multiple descriptors efficiently in a single pass.

**Returns:**
- Dictionary with keys: 'molecular_weight', 'logp', 'tpsa'
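
The returned dictionary holds one column per descriptor, so turning it into per-molecule rows is a common next step. The sketch below writes such rows as CSV with the standard library; `descriptors_to_csv` is a hypothetical helper, and the dictionary is a placeholder with the documented keys rather than real library output.

```python
import csv
import io

DESCRIPTOR_KEYS = ("molecular_weight", "logp", "tpsa")  # keys listed above


def descriptors_to_csv(smiles, descriptors, out):
    """Write one CSV row per molecule from a dict of equal-length columns."""
    writer = csv.writer(out)
    writer.writerow(("smiles", *DESCRIPTOR_KEYS))
    columns = [descriptors[key] for key in DESCRIPTOR_KEYS]
    for smi, *values in zip(smiles, *columns):
        writer.writerow((smi, *values))


# Stand-in for `rdtools.descriptors(smiles)`; the numbers are placeholders.
smiles = ["CCO", "CC(=O)O"]
fake_descriptors = {
    "molecular_weight": [1.0, 2.0],
    "logp": [0.1, 0.2],
    "tpsa": [10.0, 20.0],
}

buffer = io.StringIO()
descriptors_to_csv(smiles, fake_descriptors, buffer)
print(buffer.getvalue())
```

Passing an open file object instead of `io.StringIO()` writes the table to disk.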

#### `rdtools.canonical_smiles(smiles_array)`
Convert SMILES to canonical form.
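
One typical use of canonical forms is removing duplicate molecules that were written differently. The following sketch shows the idea under stated assumptions: `deduplicate_by_canonical` is a hypothetical helper, the canonical strings are hand-written for illustration, and how `canonical_smiles` represents invalid input is not documented here, so invalid entries should be filtered out first.

```python
def deduplicate_by_canonical(smiles, canonical):
    """Keep the first input SMILES for each distinct canonical form."""
    seen = set()
    unique = []
    for original, canon in zip(smiles, canonical):
        if canon in seen:
            continue
        seen.add(canon)
        unique.append(original)
    return unique


# Stand-in for `rdtools.canonical_smiles(smiles)`: the strings below are
# hand-written for illustration, not produced by the library.
smiles = ["OCC", "CCO", "c1ccccc1", "C1=CC=CC=C1"]
canonical = ["CCO", "CCO", "c1ccccc1", "c1ccccc1"]

print(deduplicate_by_canonical(smiles, canonical))  # ['OCC', 'c1ccccc1']
```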

#### `rdtools.morgan_fingerprints(smiles_array, radius=2, nbits=2048)`
Calculate Morgan fingerprints as bit vectors.

**Parameters:**
- `radius`: fingerprint radius (default: 2)
- `nbits`: number of bits (default: 2048)

**Returns:**
- 2D numpy array of shape (n_molecules, nbits) with dtype uint8
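
Bit-vector fingerprints are most often compared with the Tanimoto coefficient (shared set bits divided by bits set in either vector). The sketch below is a plain-Python illustration and not an RDTools function; `tanimoto` is a hypothetical helper that accepts any two equal-length 0/1 sequences, including rows of the returned array. For large arrays a vectorised numpy version would be preferable.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two equal-length 0/1 sequences."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(bits_a, bits_b) if a and b)
    either = sum(1 for a, b in zip(bits_a, bits_b) if a or b)
    # Two all-zero fingerprints share no set bits; report 0.0 by convention.
    return both / either if either else 0.0


# Tiny hand-made bit vectors standing in for rows of the returned array.
fp_a = [1, 1, 0, 0, 1, 0, 0, 0]
fp_b = [1, 0, 0, 0, 1, 1, 0, 0]

print(tanimoto(fp_a, fp_b))  # 0.5
```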

### Utility Functions

#### `rdtools.filter_valid(smiles_array)`
Filter array to keep only valid SMILES.

#### `rdtools.batch_process(smiles_array, batch_size=1000, **kwargs)`
Process large arrays in batches with comprehensive results.
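
The structure of the value returned by `batch_process` is not documented in this README, so the sketch below illustrates only the general batching pattern, not the library's implementation. `iter_batches` and `process_in_batches` are hypothetical helpers, and a string-length function stands in for a real per-molecule function.

```python
def iter_batches(items, batch_size=1000):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(items, func, batch_size=1000):
    """Apply `func` to each batch and concatenate the per-batch results."""
    results = []
    for batch in iter_batches(items, batch_size):
        results.extend(func(batch))
    return results


# String length is a stand-in for a real array function such as
# `rdtools.molecular_weights`; it keeps the example runnable anywhere.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "C", "CCN"]
lengths = process_in_batches(smiles, lambda batch: [len(s) for s in batch], batch_size=2)
print(lengths)  # [3, 8, 7, 1, 3]
```

Batching bounds peak memory because only one slice of results is materialised at a time before being appended.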

### TensorFlow Operations

#### `rdtools.tf_ops.string_process(smiles_tensor)`
TensorFlow custom operation for string processing in data pipelines.

**Parameters:**
- `smiles_tensor`: TensorFlow string tensor

**Returns:**
- TensorFlow string tensor with processed SMILES

**Example:**
```python
import tensorflow as tf
import rdtools.tf_ops

# Use in tf.data pipeline
dataset = tf.data.Dataset.from_tensor_slices(["CCO", "c1ccccc1"])
dataset = dataset.map(rdtools.tf_ops.string_process)
```

## Performance

RDTools is optimized for high-throughput molecular processing:

- **Batch Processing**: Calculate multiple descriptors in a single pass
- **C++ Core**: Uses RDKit's optimized C++ implementation
- **Memory Efficient**: Minimal Python overhead with direct numpy array access
- **Parallel Ready**: Functions are designed to work well with joblib/multiprocessing
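
As a rough illustration of the last point, work can be split into chunks and handed to a pool. The README names joblib and multiprocessing; this sketch uses the standard library's `concurrent.futures` instead. `chunked`, `map_chunks`, and `string_lengths_stub` are hypothetical names, and the stub stands in for an RDTools array function. Whether threads give a speed-up depends on whether the extension releases the GIL, which is not stated here; a `ProcessPoolExecutor` can be passed instead, provided the function is defined at module level and the script uses an `if __name__ == "__main__":` guard.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def chunked(items, chunk_size):
    """Split a sequence into consecutive chunks of at most `chunk_size`."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_chunks(func, items, executor: Executor, chunk_size=10_000):
    """Run `func` on each chunk via `executor` and flatten results in input order."""
    results = []
    # Executor.map yields results in submission order, so output order matches input.
    for part in executor.map(func, chunked(items, chunk_size)):
        results.extend(part)
    return results


def string_lengths_stub(batch):
    """Stand-in for an array function such as `rdtools.molecular_weights`."""
    return [len(s) for s in batch]


smiles = ["CCO", "c1ccccc1", "CC(=O)O", "C"]
with ThreadPoolExecutor(max_workers=2) as pool:
    print(map_chunks(string_lengths_stub, smiles, pool, chunk_size=2))  # [3, 8, 7, 1]
```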

### Benchmarks

On a modern CPU, RDTools can process:
- ~10,000-50,000 molecules/second for descriptor calculations
- ~5,000-20,000 molecules/second for fingerprint generation

Performance varies by molecule complexity and system specifications.
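
To obtain figures for your own hardware and data, a small timing harness is enough. The following is a sketch, not the benchmark behind the numbers above: `molecules_per_second` is a hypothetical helper, and `sorted` is only a placeholder workload to be replaced with an RDTools function.

```python
import time


def molecules_per_second(func, items, repeats=3):
    """Return the best observed throughput of `func(items)` over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(items)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a very fast call on a coarse clock.
    return len(items) / best if best > 0 else float("inf")


# `sorted` is only a placeholder workload; substitute e.g. `rdtools.descriptors`.
smiles = ["CCO", "c1ccccc1", "CC(=O)O"] * 1000
rate = molecules_per_second(sorted, smiles)
print(f"{rate:,.0f} items/second")
```

Taking the best of several runs reduces the influence of one-off effects such as cache warm-up.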

## Examples

See the `examples/` directory for detailed usage examples:

- `basic_usage.py`: Comprehensive demonstration of all features
- `performance_comparison.py`: Benchmarking against pure Python approaches

## Development

### Development Setup

1. **Clone and install in development mode**:
   ```bash
   git clone <repository-url>
   cd rdtools
   
   # Install with development dependencies
   uv sync
   
   # Build the C++ extension
   SKBUILD_EDITABLE_VERBOSE=1 uv sync --reinstall
   ```

2. **Development workflow**:
   ```bash
   # Run tests
   uv run pytest tests/
   
   # Code formatting
   uv run black src/
   uv run isort src/
   
   # Build wheel for distribution
   uv build
   uv run auditwheel repair dist/*.whl --exclude "libtensorflow*"
   
   # Run examples
   uv run python examples/basic_usage.py
   ```

3. **Rebuilding after C++ changes**:
   ```bash
   # Clean rebuild (if needed)
   rm -rf build/
   SKBUILD_EDITABLE_VERBOSE=1 uv sync --reinstall
   
   # Or force a rebuild through uv's pip interface
   # (environments created by uv do not include pip by default)
   uv pip install -e . --reinstall --no-deps
   ```

### Testing

Run the test suite to ensure everything works:

```bash
# Run all tests
uv run pytest tests/ -v

# Run specific test
uv run pytest tests/test_rdtools.py::TestBasicFunctions::test_molecular_weights -v

# Test with coverage
uv run pytest tests/ --cov=rdtools --cov-report=html
```

### Build Debugging

If you encounter build issues:

```bash
# Build with maximum verbosity (VERBOSE=1 is read by CMake's build step)
SKBUILD_EDITABLE_VERBOSE=1 VERBOSE=1 uv sync --reinstall

# Check CMake configuration (start from the project root)
cd build/cp312-cp312-linux_x86_64  # (or your platform directory)
cmake ../.. -DCMAKE_BUILD_TYPE=Debug  # the source tree is two levels up
cd ../..

# Manual build for debugging (start from the project root)
mkdir debug_build && cd debug_build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DCMAKE_VERBOSE_MAKEFILE=ON
make VERBOSE=1
```

### Project Structure

```
rdtools/
├── src/
│   ├── rdtools/           # Python package
│   │   └── __init__.py    # Main API and function wrappers
│   └── cpp/               # C++ source code
│       ├── molecular_ops.hpp    # C++ function declarations
│       ├── molecular_ops.cpp    # C++ implementations using RDKit
│       └── pybind_module.cpp    # C++/Python bindings (nanobind)
├── examples/              # Usage examples and demos
│   └── basic_usage.py     # Comprehensive feature demonstration
├── tests/                 # Test suite
│   └── test_rdtools.py    # Unit tests
├── build/                 # Build artifacts (generated)
├── CMakeLists.txt         # CMake build configuration
├── pyproject.toml         # Project metadata and build config
├── uv.lock               # Dependency lock file
└── CLAUDE.md             # Development guidelines
```

## Dependencies

### Runtime Dependencies
- **Python**: >= 3.12 (as specified in pyproject.toml)
- **NumPy**: >= 1.20.0 (for array operations)
- **TensorFlow**: == 2.19.0 (required for the TensorFlow custom operations)

### System Dependencies
- **RDKit**: >= 2022.9.1 (C++ libraries and headers required)
  - `librdkit-dev` - Development headers
  - `librdkit1` - Runtime libraries  
  - `rdkit-data` - Data files
  - `python3-rdkit` - Python bindings (optional, for compatibility testing)

### Build Dependencies
- **scikit-build-core**: >= 0.10 (modern CMake-based build backend)
- **nanobind**: >= 2.0.0 (fast C++/Python binding library)
- **tensorflow**: == 2.19.0 (for TensorFlow custom operations)
- **CMake**: >= 3.15 (build system)
- **C++ Compiler**: Supporting C++17 standard (GCC 7+, Clang 5+, MSVC 2017+)

### Development Dependencies (Optional)
- **pytest**: >= 6.0 (testing framework)
- **black** (code formatting)
- **isort** (import sorting)
- **uv** (recommended package manager)

## License

MIT (as declared in the package metadata).

## Contributing

[Add contributing guidelines here]

## Citation

If you use RDTools in your research, please cite:
[Add citation information here]
            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "rdkit-data-pipeline-tools",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.12",
    "maintainer_email": null,
    "keywords": "chemistry, cheminformatics, rdkit, molecular, descriptors",
    "author": null,
    "author_email": "wuhao <wuhao@novogaia.bio>",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "High-performance molecular operations using RDKit's C++ core through nanobind bindings",
    "version": "0.1.5",
    "project_urls": null,
    "split_keywords": [
        "chemistry",
        " cheminformatics",
        " rdkit",
        " molecular",
        " descriptors"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1e442b9aa0149c858912d09d61a0235177b1e00356bef74c754b782477d20134",
                "md5": "4ec8359a95116ccddf75561809bdb7b3",
                "sha256": "c8bf72a4846ca3a85a3e22f662fef5bab9732827aea0393219e707ac6b507432"
            },
            "downloads": -1,
            "filename": "rdkit_data_pipeline_tools-0.1.5-cp312-cp312-manylinux_2_39_x86_64.whl",
            "has_sig": false,
            "md5_digest": "4ec8359a95116ccddf75561809bdb7b3",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.12",
            "size": 6278440,
            "upload_time": "2025-08-30T14:02:33",
            "upload_time_iso_8601": "2025-08-30T14:02:33.759737Z",
            "url": "https://files.pythonhosted.org/packages/1e/44/2b9aa0149c858912d09d61a0235177b1e00356bef74c754b782477d20134/rdkit_data_pipeline_tools-0.1.5-cp312-cp312-manylinux_2_39_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-30 14:02:33",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "rdkit-data-pipeline-tools"
}
        