# Molscraper
> Chemical data extraction tool for researchers, students, and industry professionals.
## 🧪 What is Molscraper?
Molscraper is a Python tool that extracts chemical compound information from PubChem's REST API. It turns compound names into structured datasets containing molecular properties, safety information, applications, and more.
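For orientation, here is a minimal sketch of what a name-based lookup against PubChem's PUG REST service looks like at the URL level. The helper name `build_property_url` and the chosen property list are illustrative assumptions; the README does not say which endpoints Molscraper actually calls.

```python
from urllib.parse import quote

# Base of PubChem's PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(name, properties=("MolecularFormula", "IUPACName")):
    """Return a PUG REST URL that looks up `properties` for a compound name.

    Hypothetical helper for illustration; not part of Molscraper's API.
    """
    # Percent-encode the name so spaces and slashes are safe inside the path.
    encoded = quote(name.strip(), safe="")
    return f"{PUG_REST_BASE}/compound/name/{encoded}/property/{','.join(properties)}/JSON"


url = build_property_url("acetic acid")
print(url)
```

Fetching that URL returns a JSON document with the requested properties; the sketch stops at URL construction so it runs without network access.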
## ✨ Key Features
- **🎯 Advanced CAS Number Extraction** - Multiple extraction strategies with automatic fallbacks
- **📋 Comprehensive Data Extraction** - 11 standardized data fields per compound
- **🔒 Advanced Safety Data** - MSDS information and hazard classifications
- **⚡ High Performance** - Smart rate limiting and efficient API usage
- **🐍 Python Integration** - Use as CLI tool or import as Python library
- **📊 Research Ready** - Outputs to CSV for immediate analysis
- **🔍 Smart Recognition** - Functional group identification and applications
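One plausible fallback strategy for CAS extraction is to scan a compound's synonym list for strings that look like CAS Registry Numbers and keep only those whose check digit is valid. The sketch below shows that idea; the function names are hypothetical, and this is an assumption about how such a strategy could work rather than a description of Molscraper's internals.

```python
import re

# A CAS Registry Number is 2-7 digits, then 2 digits, then 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(candidate):
    """Return True if `candidate` is well-formed and its check digit matches."""
    match = CAS_PATTERN.match(candidate.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def first_cas(synonyms):
    """Return the first synonym that is a valid CAS number, or None."""
    return next((s.strip() for s in synonyms if is_valid_cas(s)), None)


synonyms = ["caffeine", "1,3,7-Trimethylxanthine", "58-08-2", "Guaranine"]
print(first_cas(synonyms))
```

The checksum step matters because synonym lists often contain other hyphenated numeric identifiers that match the pattern but are not CAS numbers.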
## 🚀 Quick Installation
Install from PyPI (recommended):
```bash
pip install molscraper
```
## 🚀 Quick Start
### Command Line Interface
```bash
# Extract data for specific compounds
molscraper -c "Benzaldehyde" "Caffeine" "Aspirin"
# From text file (one compound per line)
molscraper -f examples/compounds.txt
# From CSV (auto-detects compound column)
molscraper -f examples/research_data.csv -o results.csv
# From Excel file
molscraper -f examples/compound_list.xlsx --verbose
# Custom settings
molscraper -f my_data.csv --delay 0.5 --verbose
```
### File Input Support
```bash
# Text files - one compound per line
molscraper -f compounds.txt
# CSV files - auto-detects compound columns
molscraper -f research_data.csv
# Excel files (.xlsx)
molscraper -f lab_compounds.xlsx
# JSON files
molscraper -f compound_list.json
# CSV with different separators
molscraper -f messy_data.csv
```
**Note:** For best results, name your CSV column "compound" or "chemical", though other column names are automatically detected.
### Python API
```python
from molecule_scraper import process_compounds_to_csv, PubChemScraper
# Quick extraction to CSV
compounds = ["Benzaldehyde", "Caffeine", "Aspirin"]
df = process_compounds_to_csv(compounds, 'results.csv')
# Advanced usage with custom scraper
scraper = PubChemScraper(delay=0.3)
data = scraper.get_compound_data("Morphine")
print(f"CAS Number: {data.cas_number}")
print(f"Formula: {data.chemical_formula}")
```
## 📊 Data Fields Extracted
| Field | Description |
|-------|-------------|
| Chemical Species | IUPAC name |
| Functional Group | Identified functional group |
| Chemical Name | Common name |
| Chemical Formula | Molecular formula |
| Structural Formula | SMILES notation |
| Extended SMILES | Canonical SMILES |
| CAS# | Chemical registry number |
| Properties | Physical properties |
| Applications | Commercial uses |
| MSDS | Safety data |
| Hazard Information | Safety classifications |
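Once results are written to CSV, they can be read back with the standard library. The sketch below assumes the CSV column headers match the field names in the table above, which the README does not confirm; the sample rows are hand-written stand-ins, not actual Molscraper output.

```python
import csv
import io

# Stand-in for a results file; assumes headers match the field names above.
sample = io.StringIO(
    "Chemical Name,Chemical Formula,CAS#\n"
    "Caffeine,C8H10N4O2,58-08-2\n"
    "Aspirin,C9H8O4,50-78-2\n"
)

# Read every row as a dict keyed by column header.
rows = list(csv.DictReader(sample))

# Build a lookup from common name to CAS number.
cas_by_name = {row["Chemical Name"]: row["CAS#"] for row in rows}
print(cas_by_name)
```

With a real file, replace the `io.StringIO` object with `open("results.csv", newline="", encoding="utf-8")`.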
## 🎯 Perfect for Chemists
**Computational Chemists:**
- Integrates seamlessly with existing Python workflows
- Works in Jupyter notebooks
- Compatible with pandas, numpy, matplotlib
- Easy to include in `requirements.txt`
**Academic Researchers:**
- Reproducible research workflows
- Batch processing capabilities
- Publication-ready data formats
**Industry Professionals:**
- High-throughput compound analysis
- Automated safety data collection
- Integration with existing systems
## 📁 Input Formats
### 📄 Text Files (`.txt`)
```
Benzaldehyde
Caffeine
Aspirin
```
### 📊 CSV Files (`.csv`) - Auto-detects compound columns!
```csv
compound,category,priority
Aspirin,NSAID,High
Caffeine,Stimulant,Medium
Ibuprofen,NSAID,High
```
**Supported column names:** `compound`, `chemical`, `molecule`, `name`, `substance`, and others.
**Supported separators:** Commas, semicolons, tabs.
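To make the auto-detection idea concrete, here is a minimal sketch of how a compound column could be picked from a header row. The candidate names come from the list above; the priority order, the substring fallback, and the function name `detect_compound_column` are assumptions for illustration, not Molscraper's documented behavior.

```python
# Candidate header names from the list above; the priority order is an assumption.
CANDIDATE_COLUMNS = ("compound", "chemical", "molecule", "name", "substance")


def detect_compound_column(header):
    """Return the header entry most likely to hold compound names."""
    normalized = {col.strip().lower(): col for col in header}
    # First pass: exact, case-insensitive match.
    for candidate in CANDIDATE_COLUMNS:
        if candidate in normalized:
            return normalized[candidate]
    # Second pass: substring match, e.g. "Compound Name".
    for candidate in CANDIDATE_COLUMNS:
        for key, original in normalized.items():
            if candidate in key:
                return original
    # Nothing matched: fall back to the first column.
    return header[0]


print(detect_compound_column(["id", "Compound Name", "priority"]))
```

Naming the column `compound` or `chemical` sidesteps the guesswork entirely, which is why the note above recommends it.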
### 📊 Excel Files (`.xlsx`)
Just save your spreadsheet as Excel format - same auto-detection as CSV!
### 📋 JSON Files (`.json`)
```json
{
"compounds": ["Benzaldehyde", "Caffeine", "Aspirin"],
"description": "Test compounds for analysis"
}
```
### 🔧 Data Format Support
The parser handles:
- Different separators (`,` `;` `\t`)
- Various encodings (UTF-8, Latin1, etc.)
- Headers and metadata
- Quoted fields
- Empty rows
**Note:** Most standard data files work without reformatting.
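As an illustration of the separator, quoted-field, and empty-row handling listed above, the following sketch guesses the delimiter from the first non-empty line and lets the standard `csv` module handle quoting. The counting heuristic and the name `read_rows` are assumptions, not Molscraper's actual parser, and encoding detection is left out.

```python
import csv
import io

# Separators named in the list above.
DELIMITERS = (",", ";", "\t")


def read_rows(text):
    """Parse delimited text, guessing the separator and skipping empty rows."""
    lines = text.splitlines()
    first_line = next((line for line in lines if line.strip()), "")
    # Pick the separator that occurs most often in the first non-empty line;
    # ties (including "no separator found") fall back to the comma.
    delimiter = max(DELIMITERS, key=first_line.count)
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Keep only rows that contain at least one non-blank cell.
    return [row for row in reader if any(cell.strip() for cell in row)]


rows = read_rows('compound;note\n\n"Acetic acid";"weak; organic"\nCaffeine;stimulant\n')
print(rows)
```

Counting characters in the header is naive when the header itself contains quoted separators; a more careful parser would sample several lines.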
## 🎯 Try Our Examples!
We've included ready-to-use example files:
```bash
# 🧪 Basic pharmaceutical compounds
molscraper -f examples/sample_compounds.txt
# 🔬 Research lignans and flavonoids
molscraper -f examples/research_data.csv --verbose
# 💊 Drug discovery compounds
molscraper -f examples/drug_discovery.csv
# 🌿 Natural products from traditional medicine
molscraper -f examples/natural_products.csv
# 📊 Excel file with therapeutic classifications
molscraper -f examples/compound_list.xlsx
# 📋 JSON format research compounds
molscraper -f examples/research_compounds.json
```
📚 **New to molscraper?** Check out our [complete tutorial](examples/tutorials/getting_started.md)!
## 🛠️ Advanced Usage
### CLI Options
```bash
molscraper --help

Options:
  -c, --compounds TEXT  Compound names (space-separated)
  -f, --file PATH       Input file (.txt, .csv, .xlsx, or .json)
  -o, --output PATH     Output CSV file
  --delay FLOAT         Delay between requests (default: 0.2)
  --verbose             Enable detailed logging
  --help                Show this message and exit
```
### Error Handling
The tool includes comprehensive error handling:
- Network timeouts and retries
- Invalid compound name detection
- API rate limit management
- Graceful degradation for partial data
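The retry and graceful-degradation behavior listed above can be sketched as a small wrapper that retries transient failures with exponential backoff and returns `None` when a lookup ultimately fails. All names here (`fetch_with_retries`, `flaky_lookup`) and the backoff schedule are hypothetical; the README does not document how Molscraper implements retries.

```python
import time


def fetch_with_retries(fetch, name, retries=3, delay=0.2, sleep=time.sleep):
    """Call fetch(name), retrying transient failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch(name)
        except (TimeoutError, ConnectionError):
            if attempt == retries - 1:
                # Out of attempts: degrade gracefully instead of raising.
                return None
            # Wait delay, 2*delay, 4*delay, ... between attempts.
            sleep(delay * 2 ** attempt)
    return None


# Hypothetical lookup that times out twice before succeeding.
calls = []


def flaky_lookup(name):
    calls.append(name)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return {"name": name}


# Pass a no-op sleep so the demonstration runs instantly.
result = fetch_with_retries(flaky_lookup, "Caffeine", sleep=lambda seconds: None)
print(result, len(calls))
```

Returning `None` rather than raising lets a batch run continue and record an empty row for the failed compound.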
## 🔬 Why Choose Molscraper?
### vs. Manual PubChem Searches
- **Automated batch processing** instead of one-at-a-time manual lookup
- **Consistent formatting** across all compounds
- **Automated CAS number extraction**
### vs. Other Chemical APIs
- **No API keys required** - uses free PubChem REST API
- **Advanced data extraction** - gets safety and application data
- **Robust CAS extraction** - multiple strategies with automatic fallbacks
- **Ready-to-use** - no complex setup
### vs. Basic Scripts
- **Robust error handling** - network timeouts and retries
- **Rate limiting** - respects API limits
- **Comprehensive data** - extracts 11 data fields
- **Structured output** - standardized CSV format
## 📚 Examples
See the `examples/` directory for:
- `compounds.txt` - Basic compound list
- `research_compounds.json` - Research-focused compounds
- `sample_output.csv` - Example results
## 🤝 Contributing
For product feedback, feature requests, or questions, please reach out to: **x@rhizome-research.com**
## 📄 License
MIT License - see LICENSE file for details.
## 🆘 Support
- **Documentation**: This README and inline code docs
- **Contact**: Send questions or feedback to x@rhizome-research.com
- **Examples**: Check the `examples/` directory
---
**Quick Start:**
```bash
pip install molscraper
molscraper -c "Caffeine" -o test.csv
```