scixtract

Name	scixtract
Version	1.0.5
home_page	None
Summary	AI-assisted scientific PDF text extraction using local Ollama models
upload_time	2025-11-02 16:47:30
maintainer	Reto Stamm
docs_url	None
author	Reto Stamm
requires_python	>=3.10
license	GPL-3.0-or-later
keywords	pdf extraction ai ollama academic research knowledge indexing

# scixtract

[![Python](https://img.shields.io/pypi/pyversions/scixtract.svg)](https://pypi.org/project/scixtract/)
[![PyPI version](https://img.shields.io/pypi/v/scixtract.svg)](https://pypi.org/project/scixtract/)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://github.com/retospect/scixtract/blob/main/LICENSE.txt)
[![Tests](https://github.com/retospect/scixtract/actions/workflows/test.yml/badge.svg)](https://github.com/retospect/scixtract/actions/workflows/test.yml)

**AI-assisted scientific PDF text extraction**

Text extracted from PDFs is often messy and full of artifacts. Scixtract uses a local AI model to clean up text extracted from scientific PDFs, preserving important formatting such as chemical formulas and citations while removing common extraction artifacts.

Designed specifically for academic and scientific literature, scixtract provides clean, structured text output that maintains the integrity of your research content.

## What scixtract does

- **Cleans messy PDF text**: Removes spacing artifacts, broken words, and formatting issues
- **Preserves scientific content**: Maintains chemical formulas (H₂O, CO₂), equations, and citations
- **Local AI processing**: Uses local AI models to fix text while preserving meaning
- **Privacy-focused**: All processing happens on your machine; no data is sent to external services
- **Batch processing**: Process multiple PDFs in a single command
- **Knowledge indexing**: Build searchable databases of extracted content

## Prerequisites

Before using scixtract, you need to install and set up Ollama:

### 1. Install Ollama

**macOS:**
```bash
brew install ollama
```

**Linux:**
```bash
curl -fsSL https://ollama.ai/install.sh | sh
```

**Windows:**
Download from [ollama.ai](https://ollama.ai/download)

### 2. Start Ollama service

```bash
ollama serve
```

### 3. Install a model

Pull at least one model. For scientific PDFs, these two are recommended:

```bash
# Default: Good balance for most users (4.4GB)
ollama pull qwen2.5:7b

# Alternative: Larger, more accurate model (19GB)
ollama pull qwen2.5:32b-instruct-q4_K_M
```
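
Before running an extraction, it can help to confirm that the server is up and the model has been pulled. The sketch below is not part of scixtract: it assumes Ollama's default address `http://localhost:11434` and its `/api/tags` endpoint, which lists locally installed models, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address


def installed_models(payload: dict) -> list[str]:
    """Return the model names from a parsed /api/tags response."""
    return [model["name"] for model in payload.get("models", [])]


def has_model(payload: dict, wanted: str) -> bool:
    """True if `wanted` (e.g. "qwen2.5:7b") is among the installed models."""
    return wanted in installed_models(payload)


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        tags = fetch_tags()
    except (OSError, ValueError) as exc:  # connection refused, timeout, bad JSON
        print(f"Ollama is not reachable: {exc}")
    else:
        print("qwen2.5:7b installed:", has_model(tags, "qwen2.5:7b"))
```

If the server is not reachable, start it with `ollama serve` and try again.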

## Installation

Install scixtract from PyPI:

```bash
pip install scixtract
```

## Quick Start

### Basic PDF extraction

```bash
# Extract a single PDF
scixtract extract paper.pdf

# Use specific model
scixtract extract paper.pdf --model qwen2.5:7b

# Process multiple PDFs
scixtract extract papers/*.pdf
```

### Python API

```python
from scixtract import AdvancedPDFProcessor
from pathlib import Path

# Initialize processor
processor = AdvancedPDFProcessor(
    model="qwen2.5:7b"
)

# Process PDF
result = processor.process_pdf(Path("paper.pdf"))

# Access cleaned text
print(f"Title: {result.metadata.title}")
print(f"Authors: {', '.join(result.metadata.authors)}")

# Get page content
for page in result.pages:
    print(f"Page {page.page_number}: {page.content[:200]}...")
```
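
To process a whole folder from Python, the same `process_pdf` call can be wrapped in a loop. This is a minimal sketch rather than a scixtract feature: `process_folder` is a hypothetical helper, and it assumes only that the processor exposes `process_pdf(path)` as in the example above.

```python
from pathlib import Path
from typing import Any


def process_folder(processor: Any, folder: Path) -> tuple[dict[str, Any], dict[str, str]]:
    """Run processor.process_pdf on every PDF directly inside `folder`.

    Returns (results, errors), both keyed by file name, so that one
    unreadable PDF does not stop the rest of the batch.
    """
    results: dict[str, Any] = {}
    errors: dict[str, str] = {}
    for pdf in sorted(folder.glob("*.pdf")):
        try:
            results[pdf.name] = processor.process_pdf(pdf)
        except Exception as exc:  # keep going; report the failure afterwards
            errors[pdf.name] = str(exc)
    return results, errors


# Usage (requires scixtract and a running Ollama server):
#   from scixtract import AdvancedPDFProcessor
#   processor = AdvancedPDFProcessor(model="qwen2.5:7b")
#   results, errors = process_folder(processor, Path("papers"))
```

Collecting errors by file name instead of raising keeps long batches running and gives you a list of files to retry.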

### Knowledge management

Build a searchable database of your extracted content:

```bash
# Extract and add to knowledge base (with bibliography for author name recognition)
scixtract extract paper.pdf --bib-file references.bib --update-knowledge

# Search your knowledge base
scixtract knowledge --search "catalysis"

# View statistics
scixtract knowledge --stats
```
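
The knowledge base is a SQLite database, so it can also be inspected with Python's standard `sqlite3` module. This README does not document the database's location or schema, so the sketch below makes no assumption about either: it lists whatever tables a given SQLite file contains, with row counts. The file name in the usage comment is hypothetical.

```python
import sqlite3
from pathlib import Path


def table_row_counts(db_path: Path) -> dict[str, int]:
    """Map each user table in a SQLite file to its row count."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        counts: dict[str, int] = {}
        for name in names:
            # Table names come from the database itself; quote them as identifiers.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        conn.close()


# Usage (hypothetical file name; use the path of your own knowledge base):
#   print(table_row_counts(Path("knowledge.db")))
```

Note that `sqlite3.connect` creates an empty file if the path does not exist, so check the path first when pointing this at a real database.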

## Output formats

Scixtract provides multiple output formats:

- **JSON**: Structured data with metadata, page content, and extracted keywords
- **Markdown**: Clean, readable text with AI-generated summaries
- **Knowledge database**: SQLite database for searching across multiple documents
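
The JSON output can be post-processed with a few lines of Python. The JSON schema is not documented in this README, so the sketch below assumes that it mirrors the fields used in the Python API example (`metadata.title`, `metadata.authors`, and `pages` entries with `content`). Treat those key names as assumptions and adjust them to the files you actually get.

```python
import json
from pathlib import Path


def summarize(doc: dict) -> str:
    """One-line summary of an extraction result (assumed key names)."""
    meta = doc.get("metadata") or {}
    title = meta.get("title") or "(untitled)"
    authors = ", ".join(meta.get("authors") or []) or "(unknown authors)"
    pages = doc.get("pages") or []
    chars = sum(len(page.get("content", "")) for page in pages)
    return f"{title} | {authors} | {len(pages)} pages, {chars} characters"


def summarize_file(path: Path) -> str:
    """Load one JSON output file and summarize it."""
    return summarize(json.loads(path.read_text(encoding="utf-8")))
```

Missing keys fall back to placeholders instead of raising, which is useful while the real layout is still being confirmed.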

## Model recommendations

Based on testing with scientific papers:

**Default: qwen2.5:7b**
- Good balance of performance and size
- Reliable JSON output
- Size: 4.4GB

**High-performance option: qwen2.5:32b-instruct-q4_K_M**
- Better accuracy for complex scientific content
- Larger model with more capabilities
- Size: 19GB

## System requirements

- **Python**: 3.10 or higher
- **Memory**: 8GB RAM minimum (16GB+ recommended for large models)
- **Storage**: 20GB+ free space for AI models
- **Ollama**: Required for AI processing
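
Two of these requirements can be checked with the standard library alone. The sketch below is illustrative and not part of scixtract. It checks the Python version and free disk space against the figures above, and assumes that models are stored on the same volume as your home directory, which may not hold for every Ollama installation. Memory is not checked because the standard library has no portable way to read it.

```python
import shutil
import sys
from pathlib import Path

MIN_PYTHON = (3, 10)  # from the requirements above
MIN_FREE_GB = 20      # from the requirements above


def check_requirements(python_version: tuple, free_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} found, "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]} or higher required"
        )
    if free_bytes < MIN_FREE_GB * 10**9:
        problems.append(
            f"{free_bytes / 10**9:.1f} GB free, {MIN_FREE_GB} GB or more recommended"
        )
    return problems


if __name__ == "__main__":
    # Assumption: models live on the volume that holds the home directory.
    free = shutil.disk_usage(Path.home()).free
    for line in check_requirements(sys.version_info, free) or ["Python and disk space look fine."]:
        print(line)
```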

## Help and setup

Use the built-in setup helper:

```bash
# Check if Ollama is properly configured
scixtract-setup-ollama --check-only

# List available models
scixtract-setup-ollama --list-models

# Complete setup with default model
scixtract-setup-ollama --model qwen2.5:7b
```

## License

This project is licensed under the GNU General Public License v3.0 or later (GPL-3.0-or-later). See the [LICENSE.txt](LICENSE.txt) file for details.

## Support

For technical documentation, API reference, and development information, see [MAINTAINER_README.md](MAINTAINER_README.md).

For issues and questions, please visit the [GitHub repository](https://github.com/retospect/scixtract).

---

Built with [Windsurf](https://codeium.com/windsurf).

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "scixtract",
    "maintainer": "Reto Stamm",
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "reto.stamm@example.com",
    "keywords": "pdf, extraction, ai, ollama, academic, research, knowledge, indexing",
    "author": "Reto Stamm",
    "author_email": "reto.stamm@example.com",
    "download_url": "https://files.pythonhosted.org/packages/22/e9/5624470c524734985d814a6facad1ebce3d819e7a97b947e24a047f3e028/scixtract-1.0.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPL-3.0-or-later",
    "summary": "AI-assisted scientific PDF text extraction using local Ollama models",
    "version": "1.0.5",
    "project_urls": {
        "Bug Tracker": "https://github.com/retostamm/scixtract/issues",
        "Documentation": "https://github.com/retostamm/scixtract#readme",
        "Homepage": "https://github.com/retostamm/scixtract",
        "Repository": "https://github.com/retostamm/scixtract"
    },
    "split_keywords": [
        "pdf",
        " extraction",
        " ai",
        " ollama",
        " academic",
        " research",
        " knowledge",
        " indexing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "5a293c059a8523bdba5808aeb51aee57d9c7687b5a7f2842f6c3382c7c78415d",
                "md5": "235f814426b9a7656f542fb668f46f32",
                "sha256": "3cf61397fb3b8aade6ba8ddfb4616aa07ced8d9bff44031b7fc910e7a8565869"
            },
            "downloads": -1,
            "filename": "scixtract-1.0.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "235f814426b9a7656f542fb668f46f32",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 28034,
            "upload_time": "2025-11-02T16:47:28",
            "upload_time_iso_8601": "2025-11-02T16:47:28.482620Z",
            "url": "https://files.pythonhosted.org/packages/5a/29/3c059a8523bdba5808aeb51aee57d9c7687b5a7f2842f6c3382c7c78415d/scixtract-1.0.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "22e95624470c524734985d814a6facad1ebce3d819e7a97b947e24a047f3e028",
                "md5": "1842a0c707fc84072069c4d52559927b",
                "sha256": "4620767594d845acce32805d02e444c65aa1035b5dc45f40756955de70371a35"
            },
            "downloads": -1,
            "filename": "scixtract-1.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "1842a0c707fc84072069c4d52559927b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 26871,
            "upload_time": "2025-11-02T16:47:30",
            "upload_time_iso_8601": "2025-11-02T16:47:30.131968Z",
            "url": "https://files.pythonhosted.org/packages/22/e9/5624470c524734985d814a6facad1ebce3d819e7a97b947e24a047f3e028/scixtract-1.0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-11-02 16:47:30",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "retostamm",
    "github_project": "scixtract",
    "github_not_found": true,
    "lcname": "scixtract"
}
