literature-mapper

* **Name**: literature-mapper
* **Version**: 0.1.1
* **Summary**: AI-powered Python library for systematic, scalable analysis of academic literature
* **Upload time**: 2025-08-04 04:04:41
* **Requires Python**: >=3.8
* **License**: MIT
* **Keywords**: academic, literature, analysis, ai, research, pdf

# Literature Mapper

An AI-powered Python library for systematic, scalable analysis of academic literature.

Literature Mapper turns a folder of PDF articles into a structured, queryable SQLite database, enabling new forms of computational literature review. While primarily designed as a Python library for Jupyter and other interactive environments, it also offers a full-featured command-line interface (CLI) for quick tasks.

---

## Features

* **Gemini Models** – Works with any available Gemini model (default: `gemini-2.5-flash`)
* **Model-Aware Optimisation** – Automatically adjusts analysis depth based on model capabilities  
* **Automated Metadata Extraction** – Titles, authors, methodologies, key concepts, contributions  
* **Incremental Processing** – Only analyses new PDFs added since the last run  
* **Resilient Error Handling** – Gracefully skips corrupted PDFs, API hiccups, and edge cases with user-friendly messages
* **Flexible Database** – SQLite schema with relational tables for authors and concepts (allows duplicate paper titles)
* **Data Export** – One-line CSV export for R, Excel, or downstream ML pipelines  
* **Manual Entry** – Add papers that are not available as PDFs  
* **Simple CLI** – Process, query, and export directly from the terminal  
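
To make the "relational tables for authors and concepts" idea concrete, here is a minimal sketch of what such a layout could look like, using only the standard-library `sqlite3` module. The table and column names (`papers`, `authors`, `paper_authors`) are assumptions chosen for illustration and are not taken from the library's actual schema; a concepts table would presumably follow the same pattern.

```python
import sqlite3

# Hypothetical schema: names are illustrative assumptions,
# not the library's real database layout.
SCHEMA = """
CREATE TABLE papers (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,      -- no UNIQUE constraint, so duplicate titles are allowed
    year INTEGER,
    methodology TEXT
);
CREATE TABLE authors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE paper_authors (
    paper_id INTEGER NOT NULL REFERENCES papers(id),
    author_id INTEGER NOT NULL REFERENCES authors(id),
    PRIMARY KEY (paper_id, author_id)
);
"""


def build_demo_db():
    """Create an in-memory database with two same-titled papers by one author."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO papers (title, year, methodology) VALUES (?, ?, ?)",
        [("AI Ethics", 2024, "Survey"), ("AI Ethics", 2025, "Survey")],
    )
    conn.execute("INSERT INTO authors (name) VALUES (?)", ("Smith, J.",))
    conn.executemany(
        "INSERT INTO paper_authors (paper_id, author_id) VALUES (?, ?)",
        [(1, 1), (2, 1)],
    )
    conn.commit()
    return conn


def papers_by_author(conn, name):
    """Return (title, year) tuples for one author, oldest first."""
    rows = conn.execute(
        """
        SELECT p.title, p.year
        FROM papers AS p
        JOIN paper_authors AS pa ON pa.paper_id = p.id
        JOIN authors AS a ON a.id = pa.author_id
        WHERE a.name = ?
        ORDER BY p.year
        """,
        (name,),
    )
    return rows.fetchall()
```

Keeping authors in their own table with a link table is what makes "all papers by this author" a simple join rather than a string search.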

---

## Installation

```bash
# Install from PyPI
pip install literature-mapper

# Or install the latest commit from GitHub
pip install git+https://github.com/jeremiahbohr/literature-mapper.git

# Configure your Google AI API key
export GEMINI_API_KEY="your_api_key_here"
```

> **Tip:** Use a Python virtual environment  
> `python -m venv .venv && source .venv/bin/activate`  
> to keep dependencies isolated.

---

## Quick Start (Jupyter / Python)

```python
from literature_mapper import LiteratureMapper

# 1 – Initialise the mapper for your research folder
#     (creates ./my_ai_research/corpus.db on first run)
mapper = LiteratureMapper("./my_ai_research")

# 2 – Drop some PDF files into ./my_ai_research/

# 3 – Process any new papers
results = mapper.process_new_papers()
print(f"Processed: {results.processed}, Failed: {results.failed}, Skipped: {results.skipped}")
# Example output: "Processed: 12, Failed: 1, Skipped: 2"

# 4 – Load the analyses into a pandas DataFrame
df = mapper.get_all_analyses()
df.head()

# 5 – Optional: export the corpus to CSV
mapper.export_to_csv("ai_research_corpus.csv")
```
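
Step 3 only touches papers that have not been analysed yet. The README does not say how the library decides what counts as "new", so the following is just a sketch of one plausible approach: compare the PDF file names in the folder against a set of names already recorded. The function name `find_new_pdfs` and the filename-based comparison are assumptions.

```python
from pathlib import Path
import tempfile


def find_new_pdfs(folder, already_processed):
    """Return PDF paths in `folder` whose file names are not in `already_processed`."""
    folder = Path(folder)
    return sorted(
        p for p in folder.glob("*.pdf")
        if p.name not in already_processed
    )


# Small demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.pdf", "b.pdf", "notes.txt"):
        (Path(demo_dir) / name).touch()
    new_files = find_new_pdfs(demo_dir, already_processed={"a.pdf"})
    print([p.name for p in new_files])  # ['b.pdf']
```

A real implementation might track content hashes instead of names, so that a renamed file is not analysed twice.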

Need a different Gemini model? Just pass it in:

```python
mapper = LiteratureMapper("./my_ai_research", model_name="gemini-2.5-pro")
```

---

## Model Flexibility

List the available Gemini models and their recommended use cases:

```bash
literature-mapper models            # simple list
literature-mapper models --details  # table with guidance
```

**Model Recommendations:**
- **Flash**: Fast analysis, ideal for large batches
- **Pro**: Balanced analysis, best for most use cases  
- **Ultra**: Highest quality analysis, slower but most comprehensive

Then process with any model:

```bash
literature-mapper process ./my_ai_research --model gemini-2.5-pro
```

---

## Data Curation & Standardisation

```python
# Search for all papers that mention 'survey' as their methodology
survey_df = mapper.search_papers(column="methodology", query="survey")
print(survey_df[["id", "title", "methodology"]])

# Standardise the methodology field
ids = survey_df["id"].tolist()
mapper.update_papers(ids, {"methodology": "Survey"})
```
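
In database terms, the search-then-update pattern above amounts to selecting matching row ids and rewriting one column. The sketch below shows that idea with plain `sqlite3` against a hypothetical `papers` table; the table name, column names, and the helper `standardise_methodology` are assumptions for illustration, not the library's internals.

```python
import sqlite3


def standardise_methodology(conn, query, new_label):
    """Relabel rows whose methodology contains `query`; return the affected ids.

    SQLite's LIKE is case-insensitive for ASCII text by default.
    """
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM papers WHERE methodology LIKE ? ORDER BY id",
            (f"%{query}%",),
        )
    ]
    conn.executemany(
        "UPDATE papers SET methodology = ? WHERE id = ?",
        [(new_label, paper_id) for paper_id in ids],
    )
    conn.commit()
    return ids


# Demonstration on an in-memory table with inconsistent labels
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT, methodology TEXT)")
conn.executemany(
    "INSERT INTO papers (title, methodology) VALUES (?, ?)",
    [("A", "survey research"), ("B", "Online Survey"), ("C", "Experiment")],
)
changed = standardise_methodology(conn, "survey", "Survey")
print(changed)  # [1, 2]
```

Returning the affected ids makes it easy to review what was changed before moving on to the next label.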

---

## Command-Line Interface Highlights

```bash
# Process a folder of PDFs
literature-mapper process ./my_research

# Show corpus status and basic stats
literature-mapper status ./my_research

# Export to CSV
literature-mapper export ./my_research output.csv

# List the first 10 papers from 2024
literature-mapper papers ./my_research --year 2024 --limit 10
```

Run `literature-mapper --help` for the full command tree.
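
If you want to drive the CLI from a Python script (for example in a scheduled job), a thin `subprocess` wrapper is usually enough. The sketch below uses only the `process` subcommand and `--model` option shown in this README; the helper names are hypothetical, and treating a non-zero exit status as failure is an assumption about how the CLI reports errors.

```python
import subprocess


def build_process_command(folder, model=None):
    """Build the argument list for `literature-mapper process`."""
    cmd = ["literature-mapper", "process", str(folder)]
    if model is not None:
        cmd += ["--model", model]
    return cmd


def run_process(folder, model=None):
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_process_command(folder, model), check=True)


print(build_process_command("./my_research", model="gemini-2.5-pro"))
```

Passing the command as a list rather than a single string avoids shell quoting problems with folder names that contain spaces.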

---

## Configuration via Environment Variables

| Variable | Purpose | Default |
|----------|---------|---------|
| `GEMINI_API_KEY` | **Required.** Google AI key | – |
| `LITERATURE_MAPPER_MODEL` | Default model for CLI | `gemini-2.5-flash` |
| `LITERATURE_MAPPER_MAX_FILE_SIZE` | Max PDF size (bytes) | `52428800` (50 MB) |
| `LITERATURE_MAPPER_BATCH_SIZE` | PDFs processed per batch | `10` |
| `LITERATURE_MAPPER_LOG_LEVEL` | Log level (`DEBUG`, `INFO`, …) | `INFO` |
| `LITERATURE_MAPPER_VERBOSE` | Set to `true` for debug logs | `false` |
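
As an illustration of how these variables and defaults fit together, the sketch below reads them from a mapping (by default `os.environ`) and converts them to typed values. The function `load_settings` and the keys of the returned dictionary are assumptions; the library's own parsing and validation may differ.

```python
import os

# Defaults copied from the table above
DEFAULTS = {
    "LITERATURE_MAPPER_MODEL": "gemini-2.5-flash",
    "LITERATURE_MAPPER_MAX_FILE_SIZE": "52428800",
    "LITERATURE_MAPPER_BATCH_SIZE": "10",
    "LITERATURE_MAPPER_LOG_LEVEL": "INFO",
    "LITERATURE_MAPPER_VERBOSE": "false",
}


def load_settings(env=None):
    """Return a dict of typed settings; raise if the API key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required but not set")
    raw = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        "api_key": api_key,
        "model": raw["LITERATURE_MAPPER_MODEL"],
        "max_file_size": int(raw["LITERATURE_MAPPER_MAX_FILE_SIZE"]),
        "batch_size": int(raw["LITERATURE_MAPPER_BATCH_SIZE"]),
        "log_level": raw["LITERATURE_MAPPER_LOG_LEVEL"].upper(),
        "verbose": raw["LITERATURE_MAPPER_VERBOSE"].strip().lower() == "true",
    }


settings = load_settings({"GEMINI_API_KEY": "demo-key", "LITERATURE_MAPPER_BATCH_SIZE": "25"})
print(settings["model"], settings["batch_size"])  # gemini-2.5-flash 25
```

Accepting the environment as a parameter keeps the function easy to test without touching the real process environment.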

---

## Advanced Usage

### Robust Error Handling

Literature Mapper provides user-friendly error messages for common issues:

```python
from literature_mapper.exceptions import PDFProcessingError, APIError, ValidationError

# `mapper` is the LiteratureMapper instance created in the Quick Start
try:
    results = mapper.process_new_papers()
except PDFProcessingError as e:
    print(f"PDF issue: {e.user_message}")  # e.g., "File 'paper.pdf' is password-protected"
except APIError as e:
    print(f"API issue: {e.user_message}")  # e.g., "Gemini API rate limit exceeded"
except ValidationError as e:
    print(f"Input error: {e.user_message}")  # e.g., "Invalid API key format"
```
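
For transient failures such as rate limits, a common pattern is to retry with exponential backoff. The sketch below is a generic, standard-library version of that pattern; `TransientError` is a local stand-in exception and `call_with_retry` is a hypothetical helper. Whether the library already retries internally, and which of its exceptions are safe to retry, is not specified here, so treat this as an illustration rather than a description of its behaviour.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as a rate-limit response."""


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying on TransientError with delays of 1x, 2x, 4x, ... base_delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration: fail twice, then succeed; record delays instead of sleeping
calls = {"count": 0}
delays = []


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("rate limited")
    return "ok"


print(call_with_retry(flaky, sleep=delays.append), delays)  # ok [1.0, 2.0]
```

Injecting the `sleep` function keeps the example fast to run and makes the backoff schedule easy to inspect.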

### Corpus Statistics

```python
stats = mapper.get_statistics()
print(f"Total papers: {stats.total_papers}")
print(f"Unique authors: {stats.total_authors}")
print(f"Key concepts: {stats.total_concepts}")
```

### Manual Entry

```python
mapper.add_manual_entry(
    title="Seminal Survey of AI Ethics",
    authors=["Smith, J.", "Doe, A."],
    year=2025,
    methodology="Systematic Literature Review",
    theoretical_framework="Ethics Framework",
    contribution_to_field="Comprehensive review of AI ethics landscape",
    key_concepts=["AI ethics", "survey", "responsible AI"]
)
```

---

## Testing

```bash
# Install development dependencies
pip install pytest

# Run the test suite
pytest tests/

# Run with coverage
pip install pytest-cov
pytest tests/ --cov=literature_mapper
```

---

## Requirements

* Python 3.8 or newer  
* Google AI API key ([create one here](https://makersuite.google.com/app/apikey))  
* A few MB of disk space for binaries plus additional space for your corpus database  

---

## Known Limitations

* **Duplicate papers**: Multiple papers with identical titles and years are allowed (common in academic literature with conference/journal versions)
* **PDF processing**: Requires readable text content (scanned documents without OCR may fail)
* **Processing speed**: Depends on chosen Gemini model and API rate limits
* **File size**: PDFs larger than 50 MB are rejected by default (configurable via `LITERATURE_MAPPER_MAX_FILE_SIZE`)
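
Some of these limits can be checked cheaply before spending API calls. The sketch below is a hypothetical pre-flight check (the name `preflight` is an assumption, not a library function) that flags files over the default 50 MB limit or files that do not start with the `%PDF-` header. It cannot tell whether a PDF is a scan without a text layer.

```python
from pathlib import Path
import tempfile

MAX_BYTES = 50 * 1024 * 1024  # 52428800, the documented default


def preflight(path, max_bytes=MAX_BYTES):
    """Return None if the file looks processable, else a short reason string."""
    path = Path(path)
    size = path.stat().st_size
    if size > max_bytes:
        return f"too large ({size} bytes > {max_bytes})"
    with path.open("rb") as fh:
        # Strict check: the file must begin with the PDF magic bytes
        if fh.read(5) != b"%PDF-":
            return "missing %PDF- header"
    return None


# Demonstration with two tiny files in a throwaway directory
with tempfile.TemporaryDirectory() as demo_dir:
    good = Path(demo_dir) / "good.pdf"
    good.write_bytes(b"%PDF-1.7\n")
    bad = Path(demo_dir) / "bad.pdf"
    bad.write_bytes(b"not a pdf")
    print(preflight(good))  # None
    print(preflight(bad))   # missing %PDF- header
```

The header check is deliberately strict; some PDF readers tolerate a few bytes of leading junk, which this sketch would reject.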

---

## Design Philosophy

* **Simple** – Minimal setup, sensible defaults  
* **User-Centric** – Clear CLI and notebook ergonomics with helpful error messages
* **Secure** – Strict input validation and API-key handling  
* **Robust** – Comprehensive error handling and retry logic  
* **Future-Proof** – Model-agnostic architecture for the Gemini family  

---

## Contributing

Pull requests, feature ideas, and bug reports are welcome. Please open an issue first if you plan to work on a significant change.

For development:

```bash
git clone https://github.com/jeremiahbohr/literature-mapper.git
cd literature-mapper
pip install -e ".[dev]"
pytest tests/
```

---

## License

Released under the MIT License. See the `LICENSE` file for full text.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "literature-mapper",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "academic, literature, analysis, ai, research, pdf",
    "author": null,
    "author_email": "Jeremiah Bohr <bohrj@uwosh.edu>",
    "download_url": "https://files.pythonhosted.org/packages/78/59/d052662c437e54d0e53ab3adf52900604fad5e1f3d511c499d6664a4f61e/literature_mapper-0.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "AI-powered Python library for systematic, scalable analysis of academic literature",
    "version": "0.1.1",
    "project_urls": {
        "Homepage": "https://github.com/jeremiahbohr/literature-mapper",
        "Issues": "https://github.com/jeremiahbohr/literature-mapper/issues",
        "Repository": "https://github.com/jeremiahbohr/literature-mapper.git"
    },
    "split_keywords": [
        "academic",
        " literature",
        " analysis",
        " ai",
        " research",
        " pdf"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1a0fdc61dbdee1e36049497efdcda8cd2f636ba01b2572ce671beb75919d1f80",
                "md5": "5a75025f5e5026b88ec65a72639fd7b8",
                "sha256": "7e483a9cc1fb427df76e2a112263f9e918c8772c19c571986de145ce0047ad99"
            },
            "downloads": -1,
            "filename": "literature_mapper-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5a75025f5e5026b88ec65a72639fd7b8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 27829,
            "upload_time": "2025-08-04T04:04:39",
            "upload_time_iso_8601": "2025-08-04T04:04:39.961201Z",
            "url": "https://files.pythonhosted.org/packages/1a/0f/dc61dbdee1e36049497efdcda8cd2f636ba01b2572ce671beb75919d1f80/literature_mapper-0.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7859d052662c437e54d0e53ab3adf52900604fad5e1f3d511c499d6664a4f61e",
                "md5": "54fd368d71d2a1ca5ce45127081f193c",
                "sha256": "a9f0b8a66fa8a587920bdc8021cd8f023a783642c76916e8a870e2c7a28977d7"
            },
            "downloads": -1,
            "filename": "literature_mapper-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "54fd368d71d2a1ca5ce45127081f193c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 30442,
            "upload_time": "2025-08-04T04:04:41",
            "upload_time_iso_8601": "2025-08-04T04:04:41.347209Z",
            "url": "https://files.pythonhosted.org/packages/78/59/d052662c437e54d0e53ab3adf52900604fad5e1f3d511c499d6664a4f61e/literature_mapper-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-04 04:04:41",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "jeremiahbohr",
    "github_project": "literature-mapper",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "literature-mapper"
}