# Literature Mapper
An AI-powered Python library for systematic, scalable analysis of academic literature.
Literature Mapper turns a folder of PDF articles into a structured, queryable SQLite database, enabling new forms of computational literature review. While primarily designed as a Python library for Jupyter and other interactive environments, it also offers a full-featured command-line interface (CLI) for quick tasks.
---
## Features
* **Gemini Models** – Works with any available Gemini model (default: `gemini-2.5-flash`)
* **Model-Aware Optimisation** – Automatically adjusts analysis depth based on model capabilities
* **Automated Metadata Extraction** – Titles, authors, methodologies, key concepts, contributions
* **Incremental Processing** – Only analyses new PDFs added since the last run
* **Resilient Error Handling** – Handles corrupted PDFs, transient API errors, and edge cases gracefully, with user-friendly messages
* **Flexible Database** – SQLite schema with relational tables for authors and concepts (allows duplicate paper titles)
* **Data Export** – One-line CSV export for R, Excel, or downstream ML pipelines
* **Manual Entry** – Add papers that are not available as PDFs
* **Simple CLI** – Process, query, and export directly from the terminal
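
Because the corpus is an ordinary SQLite file, it can also be inspected with Python's built-in `sqlite3` module. The sketch below makes no assumption about Literature Mapper's table or column names; it simply lists whatever the file contains. The helper name `list_tables` is hypothetical and not part of the library.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: [column names]} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for name in names:
            # PRAGMA table_info yields one row per column; index 1 is the column name.
            columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
            schema[name] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()


# Example (path taken from the Quick Start; run it after processing some papers):
# for table, columns in list_tables("./my_ai_research/corpus.db").items():
#     print(table, columns)
```

Looking at the real schema first is a safe way to write ad-hoc SQL queries without guessing column names.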
---
## Installation
```bash
# Install from PyPI
pip install literature-mapper
# Or install the latest commit from GitHub
pip install git+https://github.com/jeremiahbohr/literature-mapper.git
# Configure your Google AI API key
export GEMINI_API_KEY="your_api_key_here"
```
> **Tip:** Use a Python virtual environment to keep dependencies isolated:
> `python -m venv .venv && source .venv/bin/activate`
> (on Windows, activate with `.venv\Scripts\activate` instead).
---
## Quick Start (Jupyter / Python)
```python
from literature_mapper import LiteratureMapper
# 1 – Initialise the mapper for your research folder
# (creates ./my_ai_research/corpus.db on first run)
mapper = LiteratureMapper("./my_ai_research")
# 2 – Drop some PDF files into ./my_ai_research/
# 3 – Process any new papers
results = mapper.process_new_papers()
print(f"Processed: {results.processed}, Failed: {results.failed}, Skipped: {results.skipped}")
# Example output: "Processed: 12, Failed: 1, Skipped: 2"
# 4 – Load the analyses into a pandas DataFrame
df = mapper.get_all_analyses()
df.head()
# 5 – Optional: export the corpus to CSV
mapper.export_to_csv("ai_research_corpus.csv")
```
Need a different Gemini model? Just pass it in:
```python
mapper = LiteratureMapper("./my_ai_research", model_name="gemini-2.5-pro")
```
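
The CSV written in step 5 can be summarised with the standard library alone. This is a minimal sketch, assuming the export contains a `methodology` column (that column name appears in the search example in this README, but the exact CSV layout is an assumption). The helper `count_values` is hypothetical.

```python
import csv
from collections import Counter


def count_values(csv_file, column):
    """Count how often each non-empty value appears in one column of a CSV."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Missing columns and blank cells are ignored rather than counted.
        value = (row.get(column) or "").strip()
        if value:
            counts[value] += 1
    return counts


# Example, using the file name from the Quick Start:
# with open("ai_research_corpus.csv", newline="", encoding="utf-8") as f:
#     print(count_values(f, "methodology").most_common(5))
```

The function takes an open file object rather than a path, so it works equally well on in-memory text during testing.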
---
## Model Flexibility
List available Gemini models and their recommended use-cases:
```bash
literature-mapper models # simple list
literature-mapper models --details # table with guidance
```
**Model Recommendations:**
- **Flash**: Fast analysis, ideal for large batches
- **Pro**: Balanced analysis, best for most use cases
- **Ultra**: Highest quality analysis, slower but most comprehensive
Then process with any model:
```bash
literature-mapper process ./my_ai_research --model gemini-2.5-pro
```
---
## Data Curation & Standardisation
```python
# Find all papers whose methodology field mentions 'survey'
survey_df = mapper.search_papers(column="methodology", query="survey")
print(survey_df[["id", "title", "methodology"]])
# Standardise the methodology field
ids = survey_df["id"].tolist()
mapper.update_papers(ids, {"methodology": "Survey"})
```
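
Before standardising a field, it can help to see which spellings of the same label are present. The sketch below groups labels that differ only in case or whitespace; `group_label_variants` is a hypothetical helper, and feeding it values such as `df["methodology"].dropna().tolist()` assumes the DataFrame has that column.

```python
from collections import defaultdict


def group_label_variants(labels):
    """Group raw labels that differ only by case or whitespace.

    Returns {normalised_key: [sorted distinct spellings]} for keys that
    have more than one spelling.
    """
    groups = defaultdict(set)
    for label in labels:
        # Collapse runs of whitespace, trim the ends, and ignore case.
        key = " ".join(label.split()).casefold()
        groups[key].add(label)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Example:
# group_label_variants(["Survey", "survey ", "Experiment"])
```

Each group can then be mapped to one canonical label and passed to `mapper.update_papers` as shown above.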
---
## Command-Line Interface Highlights
```bash
# Process a folder of PDFs
literature-mapper process ./my_research
# Show corpus status and basic stats
literature-mapper status ./my_research
# Export to CSV
literature-mapper export ./my_research output.csv
# List the first 10 papers from 2024
literature-mapper papers ./my_research --year 2024 --limit 10
```
Run `literature-mapper --help` for the full command tree.
---
## Configuration via Environment Variables
| Variable | Purpose | Default |
|----------|---------|---------|
| `GEMINI_API_KEY` | **Required.** Google AI key | – |
| `LITERATURE_MAPPER_MODEL` | Default model for CLI | `gemini-2.5-flash` |
| `LITERATURE_MAPPER_MAX_FILE_SIZE` | Max PDF size (bytes) | `52428800` (50 MB) |
| `LITERATURE_MAPPER_BATCH_SIZE` | PDFs processed per batch | `10` |
| `LITERATURE_MAPPER_LOG_LEVEL` | Log level (`DEBUG`, `INFO`, …) | `INFO` |
| `LITERATURE_MAPPER_VERBOSE` | Set to `true` for debug logs | `false` |
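
To check which values are in effect before a run, the variables above can be read with `os.environ`. This is only an illustrative sketch: the defaults are copied from the table, the helper `effective_settings` is hypothetical, and how the library itself parses these variables is not shown in this README.

```python
import os

# Defaults as documented in the table above.
DEFAULTS = {
    "LITERATURE_MAPPER_MODEL": "gemini-2.5-flash",
    "LITERATURE_MAPPER_MAX_FILE_SIZE": "52428800",
    "LITERATURE_MAPPER_BATCH_SIZE": "10",
    "LITERATURE_MAPPER_LOG_LEVEL": "INFO",
    "LITERATURE_MAPPER_VERBOSE": "false",
}


def effective_settings(environ=None):
    """Return the documented variables with defaults filled in."""
    env = os.environ if environ is None else environ
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    # Report only whether the key is present; never print the key itself.
    settings["GEMINI_API_KEY set"] = bool(env.get("GEMINI_API_KEY"))
    return settings


# Example:
# for name, value in effective_settings().items():
#     print(f"{name} = {value}")
```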
---
## Advanced Usage
### Robust Error Handling
Literature Mapper provides user-friendly error messages for common issues:
```python
from literature_mapper.exceptions import PDFProcessingError, APIError, ValidationError
try:
    results = mapper.process_new_papers()
except PDFProcessingError as e:
    print(f"PDF issue: {e.user_message}")  # e.g., "File 'paper.pdf' is password-protected"
except APIError as e:
    print(f"API issue: {e.user_message}")  # e.g., "Gemini API rate limit exceeded"
except ValidationError as e:
    print(f"Input error: {e.user_message}")  # e.g., "Invalid API key format"
```
```
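
If an API error such as a rate limit still reaches your code, one option is to retry the call with exponential backoff. The helper below is a generic sketch, not the library's own retry logic (which is not documented here); `retry_with_backoff` and its parameters are hypothetical names.

```python
import time


def retry_with_backoff(func, retry_on=(Exception,), attempts=3, base_delay=1.0,
                       sleep=time.sleep):
    """Call func(); on a listed exception, wait and try again.

    The wait doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
    The last failure is re-raised unchanged.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Example (assumes `mapper` and `APIError` from the snippet above):
# results = retry_with_backoff(mapper.process_new_papers, retry_on=(APIError,))
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.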
### Corpus Statistics
```python
stats = mapper.get_statistics()
print(f"Total papers: {stats.total_papers}")
print(f"Unique authors: {stats.total_authors}")
print(f"Key concepts: {stats.total_concepts}")
```
### Manual Entry
```python
mapper.add_manual_entry(
    title="Seminal Survey of AI Ethics",
    authors=["Smith, J.", "Doe, A."],
    year=2025,
    methodology="Systematic Literature Review",
    theoretical_framework="Ethics Framework",
    contribution_to_field="Comprehensive review of AI ethics landscape",
    key_concepts=["AI ethics", "survey", "responsible AI"],
)
```
---
## Testing
```bash
# Install development dependencies
pip install pytest
# Run the test suite
pytest tests/
# Run with coverage
pip install pytest-cov
pytest tests/ --cov=literature_mapper
```
---
## Requirements
* Python 3.8 or newer
* Google AI API key ([create one here](https://makersuite.google.com/app/apikey))
* A few MB of disk space for binaries plus additional space for your corpus database
---
## Known Limitations
* **Duplicate papers**: Multiple papers with identical titles and years are allowed (common in academic literature with conference/journal versions)
* **PDF processing**: Requires readable text content (scanned documents without OCR may fail)
* **Processing speed**: Depends on chosen Gemini model and API rate limits
* **File size**: PDFs larger than 50 MB are rejected by default (configurable)
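
Some of these limits can be checked before any API calls are made. The sketch below flags files that exceed the documented 50 MB default or that do not start with the `%PDF-` signature. It is a hypothetical pre-check (`precheck_pdfs` is not part of the library), it only looks at `*.pdf` files directly inside the folder, and it cannot detect scanned documents that lack a text layer.

```python
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024  # 52428800 bytes, the documented default


def precheck_pdfs(folder, max_bytes=MAX_BYTES):
    """Sort the *.pdf files in a folder into (ok, too_large, not_pdf) name lists."""
    ok, too_large, not_pdf = [], [], []
    for path in sorted(Path(folder).glob("*.pdf")):
        if path.stat().st_size > max_bytes:
            too_large.append(path.name)
            continue
        # A strict signature check: the file must begin with "%PDF-".
        with path.open("rb") as fh:
            header = fh.read(5)
        if header == b"%PDF-":
            ok.append(path.name)
        else:
            not_pdf.append(path.name)
    return ok, too_large, not_pdf


# Example:
# ok, too_large, not_pdf = precheck_pdfs("./my_ai_research")
```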
---
## Design Philosophy
* **Simple** – Minimal setup, sensible defaults
* **User-Centric** – Clear CLI and notebook ergonomics with helpful error messages
* **Secure** – Strict input validation and API-key handling
* **Robust** – Comprehensive error handling and retry logic
* **Future-Proof** – Model-agnostic architecture for the Gemini family
---
## Contributing
Pull requests, feature ideas, and bug reports are welcome. Please open an issue first if you plan to work on a significant change.
For development:
```bash
git clone https://github.com/jeremiahbohr/literature-mapper.git
cd literature-mapper
pip install -e ".[dev]"
pytest tests/
```
---
## License
Released under the MIT License. See the `LICENSE` file for full text.
Raw data
{
"_id": null,
"home_page": null,
"name": "literature-mapper",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "academic, literature, analysis, ai, research, pdf",
"author": null,
"author_email": "Jeremiah Bohr <bohrj@uwosh.edu>",
"download_url": "https://files.pythonhosted.org/packages/78/59/d052662c437e54d0e53ab3adf52900604fad5e1f3d511c499d6664a4f61e/literature_mapper-0.1.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "AI-powered Python library for systematic, scalable analysis of academic literature",
"version": "0.1.1",
"project_urls": {
"Homepage": "https://github.com/jeremiahbohr/literature-mapper",
"Issues": "https://github.com/jeremiahbohr/literature-mapper/issues",
"Repository": "https://github.com/jeremiahbohr/literature-mapper.git"
},
"split_keywords": [
"academic",
" literature",
" analysis",
" ai",
" research",
" pdf"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "1a0fdc61dbdee1e36049497efdcda8cd2f636ba01b2572ce671beb75919d1f80",
"md5": "5a75025f5e5026b88ec65a72639fd7b8",
"sha256": "7e483a9cc1fb427df76e2a112263f9e918c8772c19c571986de145ce0047ad99"
},
"downloads": -1,
"filename": "literature_mapper-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5a75025f5e5026b88ec65a72639fd7b8",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 27829,
"upload_time": "2025-08-04T04:04:39",
"upload_time_iso_8601": "2025-08-04T04:04:39.961201Z",
"url": "https://files.pythonhosted.org/packages/1a/0f/dc61dbdee1e36049497efdcda8cd2f636ba01b2572ce671beb75919d1f80/literature_mapper-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "7859d052662c437e54d0e53ab3adf52900604fad5e1f3d511c499d6664a4f61e",
"md5": "54fd368d71d2a1ca5ce45127081f193c",
"sha256": "a9f0b8a66fa8a587920bdc8021cd8f023a783642c76916e8a870e2c7a28977d7"
},
"downloads": -1,
"filename": "literature_mapper-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "54fd368d71d2a1ca5ce45127081f193c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 30442,
"upload_time": "2025-08-04T04:04:41",
"upload_time_iso_8601": "2025-08-04T04:04:41.347209Z",
"url": "https://files.pythonhosted.org/packages/78/59/d052662c437e54d0e53ab3adf52900604fad5e1f3d511c499d6664a4f61e/literature_mapper-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-04 04:04:41",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "jeremiahbohr",
"github_project": "literature-mapper",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "literature-mapper"
}