oscapify 0.1.1

- Summary: A robust tool for converting scientific literature CSV files to OSCAP-compatible format
- Author: Jordan R. Willis
- License: MIT
- Requires Python: >=3.8, <4.0
- Keywords: oscap, scientific-literature, csv, converter
- Uploaded: 2025-07-17 20:54:06

# Oscapify

[![PyPI version](https://badge.fury.io/py/oscapify.svg)](https://badge.fury.io/py/oscapify)
[![Python versions](https://img.shields.io/pypi/pyversions/oscapify.svg)](https://pypi.org/project/oscapify/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A robust tool for converting scientific literature CSV files to OSCAP-compatible format. Oscapify processes neuroscience connectivity data from PubMed/PMC sources, validates headers, retrieves DOIs, and handles errors gracefully.

## Features

- **Intelligent Header Validation**: Automatically detects and corrects common header issues
- **Flexible Header Mapping**: Support for custom column names and formats
- **DOI Retrieval**: Fetches DOIs from NCBI API with built-in caching
- **Error Recovery**: Continues processing even when individual records fail
- **Detailed Debugging**: Comprehensive logging and header analysis tools
- **Batch Processing**: Process multiple files or entire directories
- **Performance**: Persistent caching and efficient batch operations

## Installation

```bash
pip install oscapify
```

### Development Installation

```bash
git clone https://github.com/yourusername/oscapify.git
cd oscapify
pip install -e ".[dev]"
```

## Quick Start

### Basic Usage

```bash
# Process a single file
oscapify process input.csv

# Process multiple files
oscapify process file1.csv file2.csv

# Process all CSV files in a directory
oscapify process /path/to/csv/directory/

# Specify output directory
oscapify process input.csv --output ./results
```

### Header Validation and Debugging

```bash
# Validate CSV headers and see debugging info
oscapify validate input.csv

# Get header mapping suggestions
oscapify validate input.csv --suggest-mappings
```

### Custom Header Mapping

If your CSV files use different column names, map them to the expected fields and list any extra columns to keep:

```bash
oscapify process input.csv \
  --header-pmid "PubMedID" \
  --header-sentence "text" \
  --preserve-fields "custom_field1" "custom_field2"
```

## Expected Input Format

Oscapify expects CSV files with the following columns (case-insensitive):

- `pmid` - PubMed ID
- `sentence` - Text content
- `pmcid` (optional) - PubMed Central ID
- `pubmed_url` (optional) - URL to PubMed/PMC article

Additional columns are preserved in the output.

### Example Input CSV

```csv
ID,pmid,pmcid,sentence,structure_1,structure_2,relation,score,pubmed_url
1,12345678,PMC1234567,"The brain connects to the spinal cord.",brain,spinal cord,connects,0.95,https://pubmed.ncbi.nlm.nih.gov/12345678/
```
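
The column rules above can be checked before processing with a few lines of standard-library Python. This is a minimal sketch, not Oscapify's own validator: the `check_headers` helper and its report format are hypothetical, and it only assumes what the list above states (required `pmid` and `sentence`, optional `pmcid` and `pubmed_url`, case-insensitive matching).

```python
import csv
import io

# Column names taken from the list above; matching ignores case.
REQUIRED = {"pmid", "sentence"}
OPTIONAL = {"pmcid", "pubmed_url"}


def check_headers(csv_text: str) -> dict:
    """Report missing required columns and additional columns (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty input yields an empty header
    normalized = {name.strip().lower() for name in header}
    return {
        "missing": sorted(REQUIRED - normalized),
        "extra": sorted(normalized - REQUIRED - OPTIONAL),
    }


sample = 'ID,PMID,pmcid,Sentence,score\n1,12345678,PMC1234567,"Some text.",0.95\n'
report = check_headers(sample)
print(report)
```

For real files, `oscapify validate input.csv` remains the authoritative check; the sketch only illustrates the case-insensitive comparison.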

## Output Format

Oscapify outputs CSV files with OSCAP-compatible formatting:

- `id` - Unique identifier (format: `nlp-{index}-{date}`)
- `pmid` - PubMed ID
- `pmcid` - PubMed Central ID
- `doi` - Digital Object Identifier (retrieved from NCBI)
- `sentence` - Original text
- `batch_name` - Processing batch identifier
- `sentence_id` - Sentence identifier
- `out_of_scope` - "yes" if the DOI could not be retrieved, "no" otherwise
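
To make the field list concrete, here is a rough sketch of how one output row could be assembled. It is an illustration under stated assumptions, not Oscapify's implementation: `build_output_row` is a hypothetical helper, the `YYYYMMDD` date format inside `id` is a guess (the README only gives `nlp-{index}-{date}`), and using the row index as `sentence_id` is likewise assumed.

```python
from datetime import date
from typing import Optional


def build_output_row(index: int, pmid: str, pmcid: str, sentence: str,
                     doi: Optional[str], batch_name: str, run_date: date) -> dict:
    """Assemble one row with the output columns listed above (hypothetical helper)."""
    return {
        # Assumption: the date part is rendered as YYYYMMDD.
        "id": f"nlp-{index}-{run_date:%Y%m%d}",
        "pmid": pmid,
        "pmcid": pmcid,
        "doi": doi or "",
        "sentence": sentence,
        "batch_name": batch_name,
        # Assumption: the sentence identifier is simply the row index.
        "sentence_id": str(index),
        # "yes" when no DOI could be retrieved, "no" otherwise.
        "out_of_scope": "no" if doi else "yes",
    }


row = build_output_row(1, "12345678", "PMC1234567",
                       "The brain connects to the spinal cord.",
                       None, "my_batch", date(2025, 7, 17))
print(row["id"], row["out_of_scope"])
```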

## Advanced Features

### Cache Management

```bash
# View cache statistics
oscapify cache-stats

# Clear the DOI cache
oscapify clear-cache
```
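
As a mental model for what a persistent DOI cache does, the sketch below stores PMID-to-DOI pairs in a JSON file and counts hits and misses. Everything here is hypothetical: the `DoiCache` class, the file name, and the placeholder DOI are illustrative, and Oscapify's actual cache location and storage format are not documented in this README.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class DoiCache:
    """Tiny JSON-file cache mapping PMID -> DOI (hypothetical, not Oscapify's class)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.hits = 0
        self.misses = 0
        # Load earlier entries so lookups survive between runs.
        self._data = json.loads(path.read_text()) if path.exists() else {}

    def get(self, pmid: str) -> Optional[str]:
        if pmid in self._data:
            self.hits += 1
            return self._data[pmid]
        self.misses += 1
        return None

    def set(self, pmid: str, doi: str) -> None:
        self._data[pmid] = doi
        self.path.write_text(json.dumps(self._data))

    def clear(self) -> None:
        self._data = {}
        self.path.unlink(missing_ok=True)  # Python 3.8+


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "doi_cache.json"  # illustrative file name
    cache = DoiCache(cache_file)
    cache.get("12345678")                         # miss: nothing stored yet
    cache.set("12345678", "10.1000/placeholder")  # placeholder DOI, not a real record
    reloaded = DoiCache(cache_file)               # simulates a second run
    print(reloaded.get("12345678"), cache.misses, reloaded.hits)
```

Running with `--no-cache` would correspond to skipping such a lookup entirely and always querying NCBI.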

### Error Handling Options

```bash
# Stop on first error (strict mode)
oscapify process input.csv --strict

# Disable caching for testing
oscapify process input.csv --no-cache

# Skip header validation
oscapify process input.csv --no-validation
```
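
The difference between the default behaviour (continue when an individual record fails) and `--strict` (stop on the first error) can be pictured as a loop like the one below. This is only a sketch of the general pattern; `process_records` and `require_pmid` are hypothetical names and do not reflect Oscapify's internals.

```python
from typing import Callable, Iterable, List, Tuple


def process_records(records: Iterable[dict], handler: Callable[[dict], dict],
                    strict: bool = False) -> Tuple[List[dict], List[str]]:
    """Apply handler to each record, collecting errors unless strict is set."""
    results: List[dict] = []
    errors: List[str] = []
    for i, record in enumerate(records):
        try:
            results.append(handler(record))
        except Exception as exc:
            if strict:
                raise  # strict mode: the first failure aborts the run
            errors.append(f"record {i}: {exc}")  # recovery mode: note it and move on
    return results, errors


def require_pmid(record: dict) -> dict:
    """Example handler that rejects records without a PMID."""
    if not record.get("pmid"):
        raise ValueError("missing pmid")
    return record


rows = [{"pmid": "1"}, {"pmid": ""}, {"pmid": "3"}]
ok, failed = process_records(rows, require_pmid)
print(len(ok), failed)
```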

### Debug Mode

```bash
# Enable detailed debug logging
oscapify process input.csv --debug
```

## Python API

```python
from oscapify import OscapifyProcessor
from oscapify.models import ProcessingConfig

# Create configuration
config = ProcessingConfig(
    output_dir="./output",
    batch_name="my_batch"
)

# Process files
processor = OscapifyProcessor(config)
stats = processor.process_files(["input1.csv", "input2.csv"])

# Check results
print(f"Processed {stats.processed_files} files")
print(f"Total records: {stats.total_records}")
print(f"DOI lookups: {stats.successful_doi_lookups} successful, {stats.failed_doi_lookups} failed")
```

## Configuration

### Custom Header Mapping via API

```python
from oscapify.models import HeaderMapping, ProcessingConfig

# Define custom mapping
header_mapping = HeaderMapping(
    pmid="PubMedID",
    sentence="abstract_text",
    pmcid="PMC_ID",
    preserve_fields=["experiment_type", "confidence_score"]
)

config = ProcessingConfig(
    header_mapping=header_mapping
)
```
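
Conceptually, a header mapping renames source columns to the canonical names and carries selected extra columns through unchanged. The sketch below shows that idea with plain dictionaries; `apply_mapping` is a hypothetical helper written for illustration and is not part of the `oscapify` API.

```python
from typing import Dict, Sequence


def apply_mapping(row: Dict[str, str], mapping: Dict[str, str],
                  preserve_fields: Sequence[str] = ()) -> Dict[str, str]:
    """Rename source columns to canonical names and keep the preserved fields.

    mapping maps a canonical name (e.g. "pmid") to the source column name.
    """
    out = {canonical: row[source]
           for canonical, source in mapping.items() if source in row}
    out.update({field: row[field] for field in preserve_fields if field in row})
    return out


source_row = {
    "PubMedID": "12345678",
    "abstract_text": "The brain connects to the spinal cord.",
    "PMC_ID": "PMC1234567",
    "experiment_type": "tracing",
    "unused": "x",
}
mapped = apply_mapping(
    source_row,
    {"pmid": "PubMedID", "sentence": "abstract_text", "pmcid": "PMC_ID"},
    preserve_fields=["experiment_type", "confidence_score"],
)
print(mapped)
```

Columns that are neither mapped nor preserved (`unused` here) are dropped in this sketch, and a preserved field that is absent from the row is skipped rather than raising.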

## Troubleshooting

### Common Issues

1. **Missing Headers Error**
   ```bash
   # Check what headers are in your file
   oscapify validate problematic.csv

   # Use suggested mappings
   oscapify validate problematic.csv --suggest-mappings
   ```

2. **DOI Retrieval Failures**
   - Check your internet connection
   - The tool implements rate limiting (3 requests/second) for NCBI API compliance

3. **Encoding Errors**
   - Oscapify automatically tries multiple encodings
   - If issues persist, convert your CSV to UTF-8
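
Two of the behaviours mentioned above, rate limiting at 3 requests per second and trying several encodings, follow common patterns that can be sketched as below. Both helpers are hypothetical illustrations: the list of candidate encodings is an assumption, and Oscapify's real limiter and decoder may work differently.

```python
import time
from typing import Callable, Optional, Sequence


class RateLimiter:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, max_per_second: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def decode_with_fallback(raw: bytes,
                         encodings: Sequence[str] = ("utf-8", "cp1252", "latin-1")) -> str:
    """Return the text from the first candidate encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode input with any candidate encoding")


limiter = RateLimiter(max_per_second=3)  # matches the 3 requests/second noted above
text = decode_with_fallback("café".encode("cp1252"))
print(text)
```

Calling `limiter.wait()` before each request keeps successive calls at least a third of a second apart. Because `latin-1` accepts any byte sequence, it works as a last resort but can silently mis-render characters, which is why converting the file to UTF-8 remains the safer fix.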

### Getting Help

```bash
# View all commands and options
oscapify --help

# View help for specific command
oscapify process --help
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Citation

If you use Oscapify in your research, please cite:

```bibtex
@software{oscapify,
  author = {Troy Sincomb},
  title = {Oscapify: A tool for converting scientific literature CSV files to OSCAP format},
  year = {2025},
  url = {https://github.com/yourusername/oscapify}
}
```

## Acknowledgments

- Uses the NCBI E-utilities API for DOI retrieval
- Built with Click for the command-line interface
- Uses Pandas for data processing
- Uses Pydantic for data validation

            
