molscraper-tool


Name: molscraper-tool
Version: 1.0.0
Summary: Chemical data extraction tool for researchers and chemists
Upload time: 2025-07-23 04:09:52
Author: Xhuliano Brace, Timothy Chia
Requires Python: >=3.8
Keywords: chemistry, pubchem, chemical-data, molecules, research, cas-numbers, chemical-properties, data-extraction, scientific-computing, cheminformatics

# Molscraper

[![PyPI version](https://badge.fury.io/py/molscraper.svg)](https://badge.fury.io/py/molscraper)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

> Chemical data extraction tool for researchers, students, and industry professionals.

## 🧪 What is Molscraper?

Molscraper is a Python tool that looks up chemical compounds by name through PubChem's REST API and turns the results into structured datasets containing molecular properties, safety information, applications, and more.
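
To give a feel for the kind of request involved, here is a minimal sketch that builds a PubChem PUG REST URL for a name-based property lookup. The helper name `build_property_url` and the chosen property list are illustrative assumptions; this README does not document which endpoints Molscraper actually calls.

```python
from urllib.parse import quote

# Public base URL of PubChem's PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(name, properties=("MolecularFormula", "MolecularWeight", "IUPACName")):
    """Return a PUG REST URL that requests `properties` for a compound name.

    Hypothetical helper for illustration; not part of the molscraper API.
    """
    # Percent-encode the name so spaces and slashes are safe inside the URL path.
    encoded_name = quote(name, safe="")
    property_list = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{property_list}/JSON"


print(build_property_url("Caffeine"))
```

Fetching that URL with any HTTP client returns a JSON document with one entry per matching compound.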

## ✨ Key Features

- **🎯 Advanced CAS Number Extraction** - Multiple extraction strategies with automatic fallbacks
- **📋 Comprehensive Data Extraction** - 11 standardized data fields per compound
- **🔒 Advanced Safety Data** - MSDS information and hazard classifications
- **⚡ High Performance** - Smart rate limiting and efficient API usage
- **🐍 Python Integration** - Use as CLI tool or import as Python library
- **📊 Research Ready** - Outputs to CSV for immediate analysis
- **🔍 Smart Recognition** - Functional group identification and applications
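
The extraction strategies themselves are not described in this README. As one plausible building block, the sketch below scans a list of synonyms for strings shaped like CAS Registry Numbers and accepts only those whose check digit is valid. The function names are hypothetical and are not taken from the package.

```python
import re

# CAS Registry Numbers look like 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^\d{2,7}-\d{2}-\d$")


def is_valid_cas(candidate):
    """Return True if `candidate` has CAS format and a correct check digit."""
    if not CAS_PATTERN.match(candidate):
        return False
    digits = candidate.replace("-", "")
    body, check_digit = digits[:-1], int(digits[-1])
    # Weight the body digits 1, 2, 3, ... from right to left and sum them;
    # the sum modulo 10 must equal the final (check) digit.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


def first_cas_number(synonyms):
    """Return the first synonym that is a valid CAS number, or None."""
    for synonym in synonyms:
        candidate = synonym.strip()
        if is_valid_cas(candidate):
            return candidate
    return None


print(first_cas_number(["Caffeine", "1,3,7-Trimethylxanthine", "58-08-2"]))
```

Validating the check digit filters out strings that merely resemble a CAS number, which is useful when a fallback strategy searches free text.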

## 🚀 Quick Installation

Install from PyPI (recommended):

```bash
pip install molscraper
```

## 🚀 Quick Start

### Command Line Interface

```bash
# Extract data for specific compounds
molscraper -c "Benzaldehyde" "Caffeine" "Aspirin"

# From text file (one compound per line)
molscraper -f examples/compounds.txt

# From CSV (auto-detects compound column)
molscraper -f examples/research_data.csv -o results.csv

# From Excel file  
molscraper -f examples/compound_list.xlsx --verbose

# Custom settings
molscraper -f my_data.csv --delay 0.5 --verbose
```

### File Input Support

```bash
# Text files - one compound per line
molscraper -f compounds.txt

# CSV files - auto-detects compound columns
molscraper -f research_data.csv

# Excel files (.xlsx) 
molscraper -f lab_compounds.xlsx

# JSON files
molscraper -f compound_list.json

# CSV with different separators
molscraper -f messy_data.csv
```

**Note:** For best results, name your CSV column "compound" or "chemical", though other column names are automatically detected.

### Python API

```python
from molecule_scraper import process_compounds_to_csv, PubChemScraper

# Quick extraction to CSV
compounds = ["Benzaldehyde", "Caffeine", "Aspirin"]
df = process_compounds_to_csv(compounds, 'results.csv')

# Advanced usage with custom scraper
scraper = PubChemScraper(delay=0.3)
data = scraper.get_compound_data("Morphine")
print(f"CAS Number: {data.cas_number}")
print(f"Formula: {data.chemical_formula}")
```

## 📊 Data Fields Extracted

| Field | Description |
|-------|-------------|
| Chemical Species | IUPAC name |
| Functional Group | Identified functional groups |
| Chemical Name | Common name |
| Chemical Formula | Molecular formula |
| Structural Formula | SMILES notation |
| Extended SMILES | Canonical SMILES |
| CAS# | Chemical registry number |
| Properties | Physical properties |
| Applications | Commercial uses |
| MSDS | Safety data |
| Hazard Information | Safety classifications |
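
For readers who want to mirror this layout in their own code, the following sketch models the 11 fields as a dataclass and writes them under the column headers listed above. Only `cas_number` and `chemical_formula` appear as attribute names in this README; the other attribute names, the `CompoundRecord` class, and the assumption that the CSV headers match the table's field names are illustrative guesses.

```python
import csv
import io
from dataclasses import asdict, dataclass


@dataclass
class CompoundRecord:
    """Hypothetical record with one attribute per documented field."""
    chemical_species: str = ""
    functional_group: str = ""
    chemical_name: str = ""
    chemical_formula: str = ""
    structural_formula: str = ""
    extended_smiles: str = ""
    cas_number: str = ""
    properties: str = ""
    applications: str = ""
    msds: str = ""
    hazard_information: str = ""


# Assumed mapping from attribute names to the field names in the table.
CSV_HEADERS = {
    "chemical_species": "Chemical Species",
    "functional_group": "Functional Group",
    "chemical_name": "Chemical Name",
    "chemical_formula": "Chemical Formula",
    "structural_formula": "Structural Formula",
    "extended_smiles": "Extended SMILES",
    "cas_number": "CAS#",
    "properties": "Properties",
    "applications": "Applications",
    "msds": "MSDS",
    "hazard_information": "Hazard Information",
}


def records_to_csv(records):
    """Serialize records to CSV text with one column per documented field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(CSV_HEADERS.values()), lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({CSV_HEADERS[key]: value for key, value in asdict(record).items()})
    return buffer.getvalue()


print(records_to_csv([CompoundRecord(chemical_name="Caffeine", cas_number="58-08-2")]))
```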

## 🎯 Perfect for Chemists

**Computational Chemists:**
- Integrates seamlessly with existing Python workflows
- Works in Jupyter notebooks
- Compatible with pandas, numpy, matplotlib
- Easy to include in `requirements.txt`

**Academic Researchers:**
- Reproducible research workflows
- Batch processing capabilities
- Publication-ready data formats

**Industry Professionals:**
- High-throughput compound analysis
- Automated safety data collection
- Integration with existing systems

## 📁 Input Formats

### 📄 Text Files (`.txt`)
```
Benzaldehyde
Caffeine
Aspirin
```

### 📊 CSV Files (`.csv`) - Auto-detects compound columns!
```csv
compound,category,priority
Aspirin,NSAID,High
Caffeine,Stimulant,Medium
Ibuprofen,NSAID,High
```

**Supported column names:** `compound`, `chemical`, `molecule`, `name`, `substance`, and others.
**Supported separators:** Commas, semicolons, tabs.
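
How the auto-detection works internally is not documented here. The sketch below shows one simple way it could be done: pick the separator that occurs most often in the header line, then choose the first column whose name matches a known candidate. The function names and the fall-back to the first column are assumptions for illustration.

```python
import csv
import io

# Column names and separators taken from the lists above.
CANDIDATE_COLUMNS = ("compound", "chemical", "molecule", "name", "substance")
CANDIDATE_DELIMITERS = (",", ";", "\t")


def detect_delimiter(header_line):
    """Return the candidate separator that occurs most often in the header."""
    return max(CANDIDATE_DELIMITERS, key=header_line.count)


def read_compound_column(text):
    """Return the compound names found in delimited text."""
    lines = text.splitlines()
    if not lines:
        return []
    reader = csv.DictReader(io.StringIO(text), delimiter=detect_delimiter(lines[0]))
    fieldnames = reader.fieldnames or []
    if not fieldnames:
        return []
    # Match column names case-insensitively; fall back to the first column.
    lookup = {name.strip().lower(): name for name in fieldnames}
    column = next((lookup[c] for c in CANDIDATE_COLUMNS if c in lookup), fieldnames[0])
    # Skip rows where the chosen cell is missing or blank.
    return [row[column].strip() for row in reader if row.get(column) and row[column].strip()]


sample = "compound;category;priority\nAspirin;NSAID;High\nCaffeine;Stimulant;Medium\n"
print(read_compound_column(sample))
```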

### 📊 Excel Files (`.xlsx`)
Just save your spreadsheet as Excel format - same auto-detection as CSV!

### 📋 JSON Files (`.json`)
```json
{
  "compounds": ["Benzaldehyde", "Caffeine", "Aspirin"],
  "description": "Test compounds for analysis"
}
```
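
If you want to read a file in this shape from your own scripts, a minimal sketch might look like the following. The helper `compounds_from_json` is hypothetical, and accepting a bare JSON list in addition to the documented `"compounds"` key is an assumption, not documented behavior.

```python
import json


def compounds_from_json(text):
    """Extract compound names from JSON text in the format shown above."""
    payload = json.loads(text)
    # Documented shape: an object with a "compounds" list.
    # Assumption: a bare list of names is accepted as well.
    names = payload.get("compounds", []) if isinstance(payload, dict) else payload
    # Drop blank entries and surrounding whitespace.
    return [str(name).strip() for name in names if str(name).strip()]


sample = '{"compounds": ["Benzaldehyde", "Caffeine", "Aspirin"], "description": "Test compounds for analysis"}'
print(compounds_from_json(sample))
```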

### 🔧 Data Format Support
The parser handles:
- Different separators (`,` `;` `\t`)
- Various encodings (UTF-8, Latin1, etc.)
- Headers and metadata
- Quoted fields
- Empty rows

**Note:** Most standard data files work without reformatting.
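
Encoding handling of this kind is commonly implemented as a chain of fallbacks. The sketch below tries UTF-8 first (tolerating a byte-order mark) and then Latin-1; the function name and the exact order of encodings are assumptions rather than Molscraper's documented behavior.

```python
def decode_with_fallback(raw, encodings=("utf-8-sig", "latin-1")):
    """Decode bytes using the first encoding in `encodings` that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This encoding does not fit the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes saved as Latin-1 are not valid UTF-8, so the second encoding is used.
print(decode_with_fallback("café".encode("latin-1")))
```

Because Latin-1 maps every byte to a character, it never fails, so it belongs at the end of the chain.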

## 🎯 Try Our Examples!

We've included ready-to-use example files:

```bash
# 🧪 Basic pharmaceutical compounds
molscraper -f examples/sample_compounds.txt

# 🔬 Research lignans and flavonoids
molscraper -f examples/research_data.csv --verbose

# 💊 Drug discovery compounds
molscraper -f examples/drug_discovery.csv

# 🌿 Natural products from traditional medicine
molscraper -f examples/natural_products.csv

# 📊 Excel file with therapeutic classifications
molscraper -f examples/compound_list.xlsx

# 📋 JSON format research compounds
molscraper -f examples/research_compounds.json
```

📚 **New to molscraper?** Check out our [complete tutorial](examples/tutorials/getting_started.md)!

## 🛠️ Advanced Usage

### CLI Options

```bash
molscraper --help

Options:
  -c, --compounds TEXT         Compound names (space-separated)
  -f, --file PATH             Input file (.txt, .csv, .xlsx, or .json)
  -o, --output PATH           Output CSV file
  --delay FLOAT               Delay between requests (default: 0.2)
  --verbose                   Enable detailed logging
  --help                      Show this message and exit
```

### Error Handling

The tool includes comprehensive error handling:
- Network timeouts and retries
- Invalid compound name detection
- API rate limit management
- Graceful degradation for partial data
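
The retry behavior is not specified in detail here. A common pattern, sketched below under that caveat, is to retry failed requests with an exponentially growing pause. The function `call_with_retries`, its parameters, and the backoff schedule are illustrative assumptions; only the 0.2 second default mirrors the documented `--delay` default.

```python
import time


def call_with_retries(fetch, retries=3, delay=0.2, sleep=time.sleep):
    """Call `fetch()` and retry on timeouts or connection errors.

    Waits delay, 2*delay, 4*delay, ... between attempts and re-raises
    the last error once all attempts have failed.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except (TimeoutError, ConnectionError) as error:
            last_error = error
            # Pause before the next attempt, but not after the final one.
            if attempt < retries - 1:
                sleep(delay * (2 ** attempt))
    raise last_error


# Usage with a stand-in for a network call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return {"name": "Caffeine"}


print(call_with_retries(flaky_lookup, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.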

## 🔬 Why Choose Molscraper?

### vs. Manual PubChem Searches
- **Automated batch processing** instead of one-at-a-time manual lookups
- **Consistent formatting** across all compounds
- **Automated CAS number extraction**

### vs. Other Chemical APIs
- **No API keys required** - uses free PubChem REST API
- **Advanced data extraction** - gets safety and application data
- **Robust CAS extraction** - multiple strategies with automatic fallbacks
- **Ready-to-use** - no complex setup

### vs. Basic Scripts
- **Robust error handling** - network timeouts and retries
- **Rate limiting** - respects API limits
- **Comprehensive data** - extracts 11 standardized data fields
- **Structured output** - standardized CSV format

## 📚 Examples

See the `examples/` directory for:
- `compounds.txt` - Basic compound list
- `research_compounds.json` - Research-focused compounds
- `sample_output.csv` - Example results

## 🤝 Contributing

For product feedback, feature requests, or questions, please reach out to: **x@rhizome-research.com**

## 📄 License

MIT License - see LICENSE file for details.

## 🆘 Support

- **Documentation**: This README and inline code docs
- **Contact**: Send questions or feedback to x@rhizome-research.com
- **Examples**: Check the `examples/` directory

---

**Quick Start:**

```bash
pip install molscraper
molscraper -c "Caffeine" -o test.csv
``` 

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "molscraper-tool",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "chemistry, pubchem, chemical-data, molecules, research, cas-numbers, chemical-properties, data-extraction, scientific-computing, cheminformatics",
    "author": "Xhuliano Brace, Timothy Chia",
    "author_email": "x@rhizome-research.com, tim@rhizome-research.com",
    "download_url": "https://files.pythonhosted.org/packages/1b/e6/d9bc073da17fa17c6660ab9641f7bb7d9163ccfc1c5d123e16e6dcf86d29/molscraper_tool-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Chemical data extraction tool for researchers and chemists",
    "version": "1.0.0",
    "project_urls": {
        "Bug Reports": "https://github.com/xhuliano/molscraper/issues",
        "Documentation": "https://github.com/xhuliano/molscraper#readme",
        "Source": "https://github.com/xhuliano/molscraper"
    },
    "split_keywords": [
        "chemistry",
        " pubchem",
        " chemical-data",
        " molecules",
        " research",
        " cas-numbers",
        " chemical-properties",
        " data-extraction",
        " scientific-computing",
        " cheminformatics"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "35c0cb9ebcd067ea7b11a28a5390f8cc1ce6a2403b626c3988d702c4549c88bd",
                "md5": "485f7bbdc918bb6e8544a83534f11425",
                "sha256": "4ffa57415ad67727ae401fe4f6dea7f8ba12fd1ced6226a2e3fb8e6d72835d30"
            },
            "downloads": -1,
            "filename": "molscraper_tool-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "485f7bbdc918bb6e8544a83534f11425",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 281283,
            "upload_time": "2025-07-23T04:09:50",
            "upload_time_iso_8601": "2025-07-23T04:09:50.940605Z",
            "url": "https://files.pythonhosted.org/packages/35/c0/cb9ebcd067ea7b11a28a5390f8cc1ce6a2403b626c3988d702c4549c88bd/molscraper_tool-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1be6d9bc073da17fa17c6660ab9641f7bb7d9163ccfc1c5d123e16e6dcf86d29",
                "md5": "01e2344e8d5f0190e36ab597497b6fd3",
                "sha256": "e0aefea79d3c60413440efc2b0eca516445f80d16cef6850556640d032d23d95"
            },
            "downloads": -1,
            "filename": "molscraper_tool-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "01e2344e8d5f0190e36ab597497b6fd3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 296261,
            "upload_time": "2025-07-23T04:09:52",
            "upload_time_iso_8601": "2025-07-23T04:09:52.574702Z",
            "url": "https://files.pythonhosted.org/packages/1b/e6/d9bc073da17fa17c6660ab9641f7bb7d9163ccfc1c5d123e16e6dcf86d29/molscraper_tool-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-23 04:09:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "xhuliano",
    "github_project": "molscraper",
    "github_not_found": true,
    "lcname": "molscraper-tool"
}
        