noteparser

Name	noteparser
Version	2.1.2
home_page	None
Summary	A comprehensive document parser with AI-powered intelligence for converting and analyzing academic materials
upload_time	2025-08-22 22:35:08
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	None
keywords	markdown latex document parser converter pdf docx pptx ai semantic-search knowledge-graph rag
VCS
bugtrack_url
requirements	click pyyaml rich markitdown flask SpeechRecognition moviepy pydub pytesseract pillow opencv-python numpy python-docx python-pptx openpyxl pypdf pdfplumber pytest pytest-cov black ruff mypy pre-commit beautifulsoup4 lxml pandas matplotlib requests gunicorn python-dotenv aiohttp asyncio sentence-transformers faiss-cpu tiktoken langchain openai sqlalchemy alembic redis psycopg2-binary elasticsearch prometheus-client opentelemetry-api opentelemetry-sdk opentelemetry-instrumentation-flask structlog pydantic marshmallow fastapi

# NoteParser 📚

**A comprehensive AI-powered document parser for converting and analyzing academic materials**

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/noteparser.svg)](https://badge.fury.io/py/noteparser)
[![CI](https://github.com/CollegeNotesOrg/noteparser/workflows/CI/badge.svg)](https://github.com/CollegeNotesOrg/noteparser/actions)
[![codecov](https://codecov.io/gh/CollegeNotesOrg/noteparser/branch/master/graph/badge.svg)](https://codecov.io/gh/CollegeNotesOrg/noteparser)

NoteParser is an AI-enhanced academic document processing system built on Microsoft's [MarkItDown](https://github.com/microsoft/markitdown) library. It combines conventional document parsing with AI services to provide document analysis, semantic search, and knowledge extraction for university students and educators.

## ✨ Key Features

### 🔄 **Multi-Format Support**
- **Documents**: PDF, DOCX, PPTX, XLSX, HTML, EPUB
- **Media**: Images with OCR, Audio/Video with transcription
- **Output**: Markdown, LaTeX, HTML

### 🎓 **Academic-Focused Processing**
- **Mathematical equations** preservation and enhancement
- **Code blocks** with syntax highlighting and language detection
- **Bibliography** and citation extraction
- **Chemical formulas** with proper subscript formatting
- **Academic keyword highlighting** (theorem, proof, definition, etc.)
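
As a rough illustration of the keyword highlighting item above, the sketch below bolds whole-word matches with a regular expression. It is not NoteParser's implementation: the keyword list contains only the three examples named in the list, and `highlight_keywords` is a hypothetical helper name.

```python
import re

# Assumed keyword list: only the examples named in the feature list.
ACADEMIC_KEYWORDS = ("theorem", "proof", "definition")

# \b anchors make sure "proof" does not match inside "proofread".
_KEYWORD_RE = re.compile(
    r"\b(" + "|".join(ACADEMIC_KEYWORDS) + r")\b", re.IGNORECASE
)


def highlight_keywords(text: str) -> str:
    """Wrap each whole-word keyword match in Markdown bold markers."""
    return _KEYWORD_RE.sub(lambda m: f"**{m.group(0)}**", text)


print(highlight_keywords("Theorem 1. The proof follows from the definition."))
# **Theorem** 1. The **proof** follows from the **definition**.
```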

### 🔌 **Extensible Plugin System**
- **Course-specific processors** (Math, Computer Science, Chemistry)
- **Custom parser plugins** for specialized content
- **Easy plugin development** with base classes

### 🌐 **Organization Integration**
- **Multi-repository synchronization** for course organization
- **Cross-reference detection** between related documents
- **Automated GitHub Actions** for continuous processing
- **Searchable indexing** across all notes

### 🤖 **AI-Powered Intelligence**
- **Semantic document analysis** with keyword and topic extraction
- **Natural language Q&A** over your document library
- **Intelligent summarization** and insight generation
- **Knowledge graph** construction and navigation
- **AI-enhanced search** with relevance ranking

### 🖥️ **Multiple Interfaces**
- **AI-enhanced CLI** with natural language commands
- **Interactive web dashboard** with AI features
- **Python API** with async AI integration
- **REST API** endpoints with AI processing

## 🚀 Quick Start

### Installation

#### Option 1: Using pip (Recommended)

```bash
# Install from PyPI with all features (recommended)
# Quotes keep shells such as zsh from expanding the brackets
pip install "noteparser[all]"

# Install with AI features only
pip install "noteparser[ai]"

# Install basic version
pip install noteparser

# Install from source with all features
git clone https://github.com/CollegeNotesOrg/noteparser.git
cd noteparser
pip install -e ".[dev,all]"
```

#### Option 2: Development Installation

```bash
# Clone the repository
git clone https://github.com/CollegeNotesOrg/noteparser.git
cd noteparser

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install with all dependencies (includes dev tools)
pip install -e ".[dev,all]"

# Or install with specific feature sets
pip install -e ".[dev]"   # Development tools only
pip install -e ".[ai]"    # AI features only
```

> **Note**: As of v2.1.0, all dependencies are managed through `pyproject.toml`. The `requirements.txt` files are maintained for compatibility but using pip extras is the recommended approach.

#### System Dependencies

Some features require system packages:

```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-eng \
    ffmpeg \
    poppler-utils

# macOS
brew install tesseract ffmpeg poppler

# Windows (using Chocolatey)
choco install tesseract ffmpeg poppler
```

#### Python Version Compatibility

- **Python 3.10+** is required (updated from 3.9+ due to markitdown dependency)
- Tested on Python 3.10, 3.11, and 3.12
- Support for **Python 3.9 and earlier** was removed because current dependencies require newer Python versions
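
If you want a script to fail early on an unsupported interpreter, a small guard like the one below may help. It is only a sketch based on the 3.10 minimum stated above; `python_is_supported` is a hypothetical helper, not part of NoteParser.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in this section


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the given version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if python_is_supported():
    print("Python version is supported")
else:
    found = ".".join(str(part) for part in sys.version_info[:3])
    print(f"noteparser needs Python 3.10+, found {found}")
```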

### Basic Usage

```bash
# Initialize in your project directory
noteparser init

# Parse a single document
noteparser parse lecture.pdf --format markdown

# Parse with AI enhancement
noteparser ai analyze lecture.pdf --output enhanced-lecture.md

# Query your knowledge base
noteparser ai query "What is machine learning?" --filters '{"course": "CS101"}'

# Batch process a directory
noteparser batch input/ --recursive --format latex

# Start the AI-enhanced web dashboard
noteparser web --host 0.0.0.0 --port 5000

# Check AI services health
noteparser ai health --detailed

# Sync to organization repository
noteparser sync output/*.md --target-repo study-notes --course CS101
```
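
When calling the CLI from a script, the JSON passed to `--filters` is easy to get wrong by hand. The sketch below builds the `noteparser ai query` argument list shown above with `json.dumps`, so quoting is handled for you. `build_query_command` is a hypothetical helper, and only the subcommand and flag that appear in this section are used.

```python
import json
import shlex


def build_query_command(question: str, filters: dict) -> list[str]:
    """Build the argument list for `noteparser ai query` with JSON filters."""
    return [
        "noteparser", "ai", "query", question,
        "--filters", json.dumps(filters),
    ]


cmd = build_query_command("What is machine learning?", {"course": "CS101"})

# shlex.join renders a copy-pasteable shell line for logging or debugging.
print(shlex.join(cmd))

# To execute it (requires noteparser on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the list directly to `subprocess.run` avoids a shell altogether, so the question text and the JSON need no extra escaping.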

### Python API

```python
import asyncio
from noteparser import NoteParser
from noteparser.integration import OrganizationSync

# Initialize parser with AI capabilities
# (your_llm_client is a placeholder for an LLM client object you create)
parser = NoteParser(enable_ai=True, llm_client=your_llm_client)

# Parse single document
result = parser.parse_to_markdown("lecture.pdf")
print(result['content'])

# Parse with AI enhancement
async def ai_parse():
    result = await parser.parse_to_markdown_with_ai("lecture.pdf")
    print(f"Content: {result['content']}")
    print(f"AI Insights: {result['ai_processing']}")

asyncio.run(ai_parse())

# Query knowledge base
async def query_knowledge():
    result = await parser.query_knowledge(
        "What are the key concepts in machine learning?",
        filters={"course": "CS101"}
    )
    print(f"Answer: {result['answer']}")
    for doc in result['documents']:
        print(f"- {doc['title']} (relevance: {doc['score']:.2f})")

asyncio.run(query_knowledge())

# Batch processing
results = parser.parse_batch("notes/", output_format="markdown")

# Organization sync
org_sync = OrganizationSync()
sync_result = org_sync.sync_parsed_notes(
    source_files=["note1.md", "note2.md"],
    target_repo="study-notes",
    course="CS101"
)
```

## 📁 Project Structure

```
your-study-organization/
├── noteparser/                  # This repository - AI-powered parsing engine
├── noteparser-ai-services/     # AI microservices (RagFlow, DeepWiki)
├── study-notes/                # Main notes repository
│   ├── courses/
│   │   ├── CS101/
│   │   ├── MATH201/
│   │   └── PHYS301/
│   └── .noteparser.yml         # Organization configuration
├── note-templates/             # Shared LaTeX/Markdown templates
├── note-extensions/            # Custom plugins
└── note-dashboard/             # Optional: separate web interface
```

## 🤖 AI Services Setup

NoteParser can operate in two modes:

### Standalone Mode (Basic)
Works without external AI services - provides core document parsing functionality.

### AI-Enhanced Mode (Recommended)
Requires the `noteparser-ai-services` repository for full AI capabilities.

```bash
# Clone and start AI services
git clone https://github.com/CollegeNotesOrg/noteparser-ai-services.git
cd noteparser-ai-services
docker-compose up -d

# Verify services are running
curl http://localhost:8010/health  # RagFlow
curl http://localhost:8011/health  # DeepWiki

# Test AI integration
noteparser ai health
```
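
The same check as the `curl` commands above can be scripted. The following is a minimal sketch using only the standard library; it assumes the `/health` path and ports shown above and treats any HTTP 200 response as healthy, which may be looser than what `noteparser ai health` verifies. The function names are illustrative.

```python
import urllib.request

# Ports are taken from the curl commands above.
SERVICES = {"ragflow": 8010, "deepwiki": 8011}


def health_url(port: int, host: str = "localhost") -> str:
    """Build the health endpoint URL for a service port."""
    return f"http://{host}:{port}/health"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses: refused connections,
        # timeouts, and non-2xx responses all land here.
        return False


def check_all(probe=is_healthy) -> dict[str, bool]:
    """Probe every service; `probe` is injectable so it can be stubbed."""
    return {name: probe(health_url(port)) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, ok in check_all().items():
        print(f"{name}: {'up' if ok else 'down'}")
```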

**AI Services Documentation**: [https://collegenotesorg.github.io/noteparser-ai-services/](https://collegenotesorg.github.io/noteparser-ai-services/)

## ⚙️ Configuration

### AI Services Configuration (`config/services.yml`)

```yaml
services:
  ragflow:
    host: localhost
    port: 8010
    enabled: true
    config:
      embedding_model: "sentence-transformers/all-MiniLM-L6-v2"
      vector_db_type: "faiss"
      chunk_size: 512
      top_k: 5

  deepwiki:
    host: localhost
    port: 8011
    enabled: true
    config:
      ai_model: "gpt-3.5-turbo"
      auto_link: true
      similarity_threshold: 0.7

features:
  enable_rag: true
  enable_wiki: true
  enable_ai_suggestions: true
```
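
To give a feel for what `chunk_size` and `top_k` control, here is a toy retrieval sketch. It is not how RagFlow works: it assumes `chunk_size` counts whitespace-separated words (the real unit is not stated here) and it ranks chunks by shared-word count as a stand-in for embedding similarity.

```python
from collections import Counter


def chunk_words(text: str, chunk_size: int = 512) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def top_k_chunks(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Return the top_k chunks sharing the most words with the query."""
    query_terms = Counter(query.lower().split())

    def overlap(chunk: str) -> int:
        chunk_terms = Counter(chunk.lower().split())
        # Counter intersection keeps the minimum count of each shared word.
        return sum((query_terms & chunk_terms).values())

    # sorted() is stable, so ties keep their original document order.
    return sorted(chunks, key=overlap, reverse=True)[:top_k]
```

Smaller chunks make each retrieved passage more focused but lose surrounding context; a larger `top_k` returns more passages at the cost of more text to pass to the model.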

### Organization Configuration (`.noteparser-org.yml`)

```yaml
organization:
  name: "my-study-notes"
  base_path: "."
  auto_discovery: true

repositories:
  study-notes:
    type: "notes"
    auto_sync: true
    formats: ["markdown", "latex"]
  noteparser:
    type: "parser"
    auto_sync: false

sync_settings:
  auto_commit: true
  commit_message_template: "Auto-sync: {timestamp} - {file_count} files updated"
  branch: "main"
  push_on_sync: false

cross_references:
  enabled: true
  similarity_threshold: 0.7
  max_suggestions: 5
```
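
The `commit_message_template` placeholders look like Python `str.format` fields. Assuming that is how they are filled in, the sketch below shows what a rendered message would look like; the timestamp format and the `render_commit_message` helper are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Optional

# Template copied from the configuration above.
TEMPLATE = "Auto-sync: {timestamp} - {file_count} files updated"


def render_commit_message(file_count: int, now: Optional[datetime] = None) -> str:
    """Fill the template; the timestamp format here is an assumed choice."""
    now = now or datetime.now(timezone.utc)
    return TEMPLATE.format(
        timestamp=now.strftime("%Y-%m-%d %H:%M:%S"),
        file_count=file_count,
    )


print(render_commit_message(3, datetime(2025, 8, 22, 22, 35, 8)))
# Auto-sync: 2025-08-22 22:35:08 - 3 files updated
```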

### Plugin Configuration

```yaml
plugins:
  math_processor:
    enabled: true
    config:
      equation_numbering: true
      symbol_standardization: true

  cs_processor:
    enabled: true
    config:
      code_line_numbers: true
      auto_language_detection: true
```

## 🔌 Plugin Development

Create custom plugins for specialized course content:

```python
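import re
from typing import Any, Dict, Tuple

from noteparser.plugins import BasePlugin

class ChemistryPlugin(BasePlugin):
    name = "chemistry_processor"
    version = "1.0.0"
    description = "Enhanced processing for chemistry courses"
    course_types = ['chemistry', 'organic', 'biochemistry']

    def enhance_chemical_formulas(self, content: str) -> Tuple[str, int]:
        # Illustrative helper (not provided by BasePlugin): subscript the
        # digits in tokens such as H2O and count the tokens rewritten.
        formula = r'\b(?=[A-Za-z]*\d)(?:[A-Z][a-z]?\d*)+\b'
        return re.subn(
            formula,
            lambda m: re.sub(r'(\d+)', r'<sub>\1</sub>', m.group(0)),
            content,
        )

    def process_content(self, content: str, metadata: Dict[str, Any]) -> Dict[str, Any]:
        # Your custom processing logic here
        processed_content, count = self.enhance_chemical_formulas(content)

        return {
            'content': processed_content,
            'metadata': {**metadata, 'chemical_formulas_found': count}
        }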
```

## 🌊 GitHub Actions Integration

Automatic processing when you push new documents:

```yaml
# .github/workflows/parse-notes.yml
name: Parse and Sync Notes
on:
  push:
    paths: ['input/**', 'raw-notes/**']

jobs:
  parse-notes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install noteparser[all]
      - name: Parse documents
        run: noteparser batch input/ --format markdown
      - name: Sync to study-notes
        run: noteparser sync output/*.md --target-repo study-notes
```

## 🖥️ AI-Enhanced Web Dashboard

Access the AI-powered web interface at `http://localhost:5000`:

```bash
noteparser web
```

### Core Features:
- **Browse** all repositories and courses
- **Search** across all notes with semantic similarity
- **View** documents with syntax highlighting
- **Parse** new documents through the web interface
- **Manage** plugins and configuration
- **Monitor** sync status and cross-references

### AI Features (`/ai` dashboard):
- **🤖 AI Document Analysis**: Upload and analyze documents with AI insights
- **🔍 Knowledge Querying**: Natural language Q&A over your document library
- **📊 Text Analysis**: Extract keywords, topics, and summaries from content
- **🚀 Enhanced Search**: Semantic search with relevance ranking and AI answers
- **💡 Smart Insights**: Automatic topic detection and content relationships
- **📈 Service Health**: Real-time monitoring of AI service status

### Production Deployment:

```bash
# Using Docker Compose (recommended)
docker-compose -f docker-compose.prod.yml up -d

# Using deployment script
./scripts/deploy.sh production 2.1.0

# Access the application (`open` is the macOS command; use xdg-open on Linux)
open http://localhost:5000
open http://localhost:5000/ai  # AI Dashboard
```

## 📊 Use Cases

### 📖 **Individual Student**
```bash
# Daily workflow
noteparser parse "Today's Lecture.pdf"
noteparser sync output/todays-lecture.md --course CS101
```

### 🏫 **Course Organization**
```bash
# Semester setup
noteparser init
noteparser batch course-materials/ --recursive
noteparser index --format json > course-index.json
```

### 👥 **Study Group**
```bash
# Collaborative notes
noteparser parse shared-notes.docx --format markdown
git add . && git commit -m "Add processed notes"
git push origin main  # Triggers auto-sync via GitHub Actions
```

### 🔬 **Research Lab**
```bash
# Research paper processing
noteparser parse "Research Paper.pdf" --format latex
noteparser web  # Browse and cross-reference with existing notes
```

## 📚 Advanced Features

### 🔍 **Smart Content Detection**
- **Mathematical equations**: Automatic LaTeX formatting preservation
- **Code blocks**: Language detection and syntax highlighting
- **Citations**: APA, MLA, IEEE format recognition
- **Figures and tables**: Structured conversion with captions

### 🏷️ **Metadata Extraction**
- **Course identification** from file names and paths
- **Topic extraction** and categorization
- **Author and date** detection
- **Academic keywords** and tagging
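
Course identification from file names and paths could be approximated with a regular expression, as in the sketch below. The naming convention (2 to 4 letters followed by 3 digits, such as `CS101` or `MATH201`) is an assumption drawn from the examples in this README, and `detect_course` is a hypothetical helper rather than NoteParser's own logic.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Assumed convention: 2-4 letters, optional separator, exactly 3 digits.
COURSE_CODE_RE = re.compile(
    r"(?<![A-Za-z])([A-Z]{2,4})[-_ ]?(\d{3})(?!\d)", re.IGNORECASE
)


def detect_course(path: str) -> Optional[str]:
    """Return the first course code found in any path component, or None."""
    for part in PurePosixPath(path).parts:
        match = COURSE_CODE_RE.search(part)
        if match:
            # Normalize to the compact upper-case form, e.g. "cs-101" -> "CS101".
            return (match.group(1) + match.group(2)).upper()
    return None


print(detect_course("courses/CS101/lecture1.pdf"))  # CS101
print(detect_course("notes/misc/lecture1.pdf"))     # None
```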

### 🔗 **Cross-References**
- **Similar content detection** across documents
- **Prerequisite tracking** between topics
- **Citation network** visualization
- **Knowledge graph** construction
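
Similar-content detection can be pictured as scoring document pairs and keeping those above a threshold. The sketch below uses bag-of-words cosine similarity as a simple stand-in (NoteParser's actual method is not described here) with the `similarity_threshold: 0.7` and `max_suggestions: 5` defaults from the `cross_references` configuration. All function names are illustrative.

```python
import math
import re
from collections import Counter


def _vector(text: str) -> Counter:
    """Lower-cased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two word-count vectors."""
    va, vb = _vector(a), _vector(b)
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_related(
    target: str,
    documents: dict[str, str],
    threshold: float = 0.7,
    max_suggestions: int = 5,
) -> list[tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [
        (name, cosine_similarity(target, text))
        for name, text in documents.items()
    ]
    related = [(name, score) for name, score in scored if score >= threshold]
    related.sort(key=lambda item: item[1], reverse=True)
    return related[:max_suggestions]
```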

## 🛠️ Development

### Setup Development Environment

```bash
git clone https://github.com/CollegeNotesOrg/noteparser.git
cd noteparser
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install with all development dependencies (recommended)
pip install -e ".[dev,all]"

# Or install dev tools only
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install
```

### Development Dependencies

The `[dev]` extra includes comprehensive development tools:

- **Testing**: `pytest`, `pytest-cov`, `pytest-mock`, `pytest-asyncio`, `pytest-xdist`
- **Code Quality**: `black`, `ruff`, `mypy`, `isort`, `pylint`, `pre-commit`
- **Documentation**: `sphinx`, `mkdocs-material`, `myst-parser`
- **Development Tools**: `ipython`, `jupyter`, `notebook`
- **Profiling**: `memory-profiler`, `line-profiler`
- **Security**: `bandit`, `safety`

### Run Tests

```bash
pytest tests/ -v --cov=noteparser
```

### Code Quality

```bash
# Auto-formatting (required)
black src/ tests/

# Linting with auto-fixes
ruff check src/ tests/ --fix

# Type checking
mypy src/noteparser/ --ignore-missing-imports

# All quality checks at once
make lint  # Runs black, ruff, and mypy
```

### CI/CD Information

The project uses GitHub Actions for continuous integration with the following jobs:

- **Cross-platform testing** (Ubuntu, Windows, macOS) on Python 3.10-3.12
- **Code quality checks** (black, ruff, mypy)
- **Security scans** (bandit, safety)
- **Performance benchmarking** with pytest-benchmark
- **Docker image testing** and validation
- **Integration testing** with Redis and PostgreSQL services

All dependencies are now managed through `pyproject.toml` for better reproducibility and CI reliability.

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 📦 Dependencies

All dependencies are managed through `pyproject.toml` with the following structure:

### Core Dependencies (included in base installation)
- **markitdown** - Microsoft's document parsing engine
- **Flask** - Web framework for dashboard
- **Click** - CLI interface
- **PyYAML** - Configuration management
- **Pillow** - Image processing
- **OpenCV** - Advanced image operations
- **pytesseract** - OCR capabilities
- **SpeechRecognition** - Audio transcription
- **moviepy** - Video processing
- **pandas** - Data processing
- **requests** - HTTP client
- **gunicorn** - Production WSGI server

### Optional Dependency Groups

#### `[ai]` - Advanced AI/ML Features
- **sentence-transformers** - Semantic embeddings
- **faiss-cpu** - Vector similarity search
- **langchain** - LLM framework integration
- **openai** - OpenAI API client
- **sqlalchemy** - Database ORM
- **elasticsearch** - Full-text search
- **prometheus-client** - Metrics collection
- **pydantic** - Data validation

#### `[dev]` - Development Tools
- **pytest** ecosystem - Testing framework
- **black**, **ruff**, **mypy** - Code quality
- **sphinx**, **mkdocs-material** - Documentation
- **jupyter**, **ipython** - Interactive development
- **bandit**, **safety** - Security scanning

#### `[all]` - All Optional Features
Combines AI and development dependencies for complete functionality.

### Installation Examples
```bash
pip install noteparser           # Core only
pip install "noteparser[ai]"     # Core + AI features
pip install "noteparser[dev]"    # Core + dev tools
pip install "noteparser[all]"    # Everything
```
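
To see whether the optional AI packages are actually available in the current environment, you can probe for their import modules. This is a sketch, not a NoteParser feature: the mapping covers only a few packages listed under `[ai]`, and the import names are the ones those distributions are commonly imported under.

```python
from importlib.util import find_spec

# Distribution name -> import name (assumed) for some [ai] packages.
AI_MODULES = {
    "sentence-transformers": "sentence_transformers",
    "faiss-cpu": "faiss",
    "langchain": "langchain",
    "openai": "openai",
}


def missing_packages(modules: dict[str, str]) -> list[str]:
    """Return the distribution names whose import module cannot be found."""
    return [
        dist for dist, module in modules.items()
        if find_spec(module) is None
    ]


missing = missing_packages(AI_MODULES)
if missing:
    print("Missing AI packages:", ", ".join(missing))
else:
    print("All checked AI packages are installed")
```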

## 🙏 Acknowledgments

- **Microsoft MarkItDown** - The core parsing engine that powers format conversion
- **Academic Community** - For inspiration and requirements gathering
- **Open Source Libraries** - All the amazing Python packages that make this possible

## 📞 Support

- **Documentation**: [https://collegenotesorg.github.io/noteparser/](https://collegenotesorg.github.io/noteparser/)
- **Issues**: [GitHub Issues](https://github.com/CollegeNotesOrg/noteparser/issues)
- **Discussions**: [GitHub Discussions](https://github.com/CollegeNotesOrg/noteparser/discussions)

---

**Made with ❤️ for students, by a student**

*Transform your study materials into a searchable, interconnected knowledge base*

---

**Author**: Suryansh Sijwali
**GitHub**: [@SuryanshSS1011](https://github.com/SuryanshSS1011)
**Organization**: [CollegeNotesOrg](https://github.com/CollegeNotesOrg)

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "noteparser",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "markdown, latex, document, parser, converter, pdf, docx, pptx, ai, semantic-search, knowledge-graph, rag",
    "author": null,
    "author_email": "Suryansh Sijwali <suryanshss1011@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/93/2e/5e21d73d925b33e124c1edf8285a1fd0c6e83a1ca999e2b1461623375f26/noteparser-2.1.2.tar.gz",
    "platform": null,
    "description": "# NoteParser \ud83d\udcda\n\n**A comprehensive AI-powered document parser for converting and analyzing academic materials**\n\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![PyPI version](https://badge.fury.io/py/noteparser.svg)](https://badge.fury.io/py/noteparser)\n[![CI](https://github.com/CollegeNotesOrg/noteparser/workflows/CI/badge.svg)](https://github.com/CollegeNotesOrg/noteparser/actions)\n[![codecov](https://codecov.io/gh/CollegeNotesOrg/noteparser/branch/master/graph/badge.svg)](https://codecov.io/gh/CollegeNotesOrg/noteparser)\n\nNoteParser is a powerful AI-enhanced academic document processing system built on Microsoft's [MarkItDown](https://github.com/microsoft/markitdown) library. It combines traditional document parsing with cutting-edge AI services to provide intelligent document analysis, semantic search, and knowledge extraction for university students and educators.\n\n## \u2728 Key Features\n\n### \ud83d\udd04 **Multi-Format Support**\n- **Documents**: PDF, DOCX, PPTX, XLSX, HTML, EPUB\n- **Media**: Images with OCR, Audio/Video with transcription\n- **Output**: Markdown, LaTeX, HTML\n\n### \ud83c\udf93 **Academic-Focused Processing**\n- **Mathematical equations** preservation and enhancement\n- **Code blocks** with syntax highlighting and language detection\n- **Bibliography** and citation extraction\n- **Chemical formulas** with proper subscript formatting\n- **Academic keyword highlighting** (theorem, proof, definition, etc.)\n\n### \ud83d\udd0c **Extensible Plugin System**\n- **Course-specific processors** (Math, Computer Science, Chemistry)\n- **Custom parser plugins** for specialized content\n- **Easy plugin development** with base classes\n\n### \ud83c\udf10 **Organization Integration**\n- **Multi-repository synchronization** for course 
organization\n- **Cross-reference detection** between related documents\n- **Automated GitHub Actions** for continuous processing\n- **Searchable indexing** across all notes\n\n### \ud83e\udd16 **AI-Powered Intelligence**\n- **Semantic document analysis** with keyword and topic extraction\n- **Natural language Q&A** over your document library\n- **Intelligent summarization** and insight generation\n- **Knowledge graph** construction and navigation\n- **AI-enhanced search** with relevance ranking\n\n### \ud83d\udda5\ufe0f **Multiple Interfaces**\n- **AI-enhanced CLI** with natural language commands\n- **Interactive web dashboard** with AI features\n- **Python API** with async AI integration\n- **REST API** endpoints with AI processing\n\n## \ud83d\ude80 Quick Start\n\n### Installation\n\n#### Option 1: Using pip (Recommended)\n\n```bash\n# Install from PyPI with all features (recommended)\npip install noteparser[all]\n\n# Install with AI features only\npip install noteparser[ai]\n\n# Install basic version\npip install noteparser\n\n# Install from source with all features (recommended)\ngit clone https://github.com/CollegeNotesOrg/noteparser.git\ncd noteparser\npip install -e .[dev,all]\n```\n\n#### Option 2: Development Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/CollegeNotesOrg/noteparser.git\ncd noteparser\n\n# Create a virtual environment (recommended)\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n\n# Install with all dependencies (includes dev tools)\npip install -e .[dev,all]\n\n# Or install with specific feature sets\npip install -e .[dev]     # Development tools only\npip install -e .[ai]      # AI features only\n```\n\n> **Note**: As of v2.1.0, all dependencies are managed through `pyproject.toml`. 
The `requirements.txt` files are maintained for compatibility but using pip extras is the recommended approach.\n\n#### System Dependencies\n\nSome features require system packages:\n\n```bash\n# Ubuntu/Debian\nsudo apt-get update\nsudo apt-get install -y \\\n    tesseract-ocr \\\n    tesseract-ocr-eng \\\n    ffmpeg \\\n    poppler-utils\n\n# macOS\nbrew install tesseract ffmpeg poppler\n\n# Windows (using Chocolatey)\nchoco install tesseract ffmpeg poppler\n```\n\n#### Python Version Compatibility\n\n- **Python 3.10+** is required (updated from 3.9+ due to markitdown dependency)\n- Tested on Python 3.10, 3.11, and 3.12\n- **Python 3.9 and earlier** support was removed due to compatibility requirements with latest dependencies\n\n### Basic Usage\n\n```bash\n# Initialize in your project directory\nnoteparser init\n\n# Parse a single document\nnoteparser parse lecture.pdf --format markdown\n\n# Parse with AI enhancement\nnoteparser ai analyze lecture.pdf --output enhanced-lecture.md\n\n# Query your knowledge base\nnoteparser ai query \"What is machine learning?\" --filters '{\"course\": \"CS101\"}'\n\n# Batch process a directory\nnoteparser batch input/ --recursive --format latex\n\n# Start the AI-enhanced web dashboard\nnoteparser web --host 0.0.0.0 --port 5000\n\n# Check AI services health\nnoteparser ai health --detailed\n\n# Sync to organization repository\nnoteparser sync output/*.md --target-repo study-notes --course CS101\n```\n\n### Python API\n\n```python\nimport asyncio\nfrom noteparser import NoteParser\nfrom noteparser.integration import OrganizationSync\n\n# Initialize parser with AI capabilities\nparser = NoteParser(enable_ai=True, llm_client=your_llm_client)\n\n# Parse single document\nresult = parser.parse_to_markdown(\"lecture.pdf\")\nprint(result['content'])\n\n# Parse with AI enhancement\nasync def ai_parse():\n    result = await parser.parse_to_markdown_with_ai(\"lecture.pdf\")\n    print(f\"Content: {result['content']}\")\n    print(f\"AI 
Insights: {result['ai_processing']}\")\n\nasyncio.run(ai_parse())\n\n# Query knowledge base\nasync def query_knowledge():\n    result = await parser.query_knowledge(\n        \"What are the key concepts in machine learning?\",\n        filters={\"course\": \"CS101\"}\n    )\n    print(f\"Answer: {result['answer']}\")\n    for doc in result['documents']:\n        print(f\"- {doc['title']} (relevance: {doc['score']:.2f})\")\n\nasyncio.run(query_knowledge())\n\n# Batch processing\nresults = parser.parse_batch(\"notes/\", output_format=\"markdown\")\n\n# Organization sync\norg_sync = OrganizationSync()\nsync_result = org_sync.sync_parsed_notes(\n    source_files=[\"note1.md\", \"note2.md\"],\n    target_repo=\"study-notes\",\n    course=\"CS101\"\n)\n```\n\n## \ud83d\udcc1 Project Structure\n\n```\nyour-study-organization/\n\u251c\u2500\u2500 noteparser/                  # This repository - AI-powered parsing engine\n\u251c\u2500\u2500 noteparser-ai-services/     # AI microservices (RagFlow, DeepWiki)\n\u251c\u2500\u2500 study-notes/                # Main notes repository\n\u2502   \u251c\u2500\u2500 courses/\n\u2502   \u2502   \u251c\u2500\u2500 CS101/\n\u2502   \u2502   \u251c\u2500\u2500 MATH201/\n\u2502   \u2502   \u2514\u2500\u2500 PHYS301/\n\u2502   \u2514\u2500\u2500 .noteparser.yml         # Organization configuration\n\u251c\u2500\u2500 note-templates/             # Shared LaTeX/Markdown templates\n\u251c\u2500\u2500 note-extensions/            # Custom plugins\n\u2514\u2500\u2500 note-dashboard/             # Optional: separate web interface\n```\n\n## \ud83e\udd16 AI Services Setup\n\nNoteParser can operate in two modes:\n\n### Standalone Mode (Basic)\nWorks without external AI services - provides core document parsing functionality.\n\n### AI-Enhanced Mode (Recommended)\nRequires the `noteparser-ai-services` repository for full AI capabilities.\n\n```bash\n# Clone and start AI services\ngit clone 
https://github.com/CollegeNotesOrg/noteparser-ai-services.git\ncd noteparser-ai-services\ndocker-compose up -d\n\n# Verify services are running\ncurl http://localhost:8010/health  # RagFlow\ncurl http://localhost:8011/health  # DeepWiki\n\n# Test AI integration\nnoteparser ai health\n```\n\n**AI Services Documentation**: [https://collegenotesorg.github.io/noteparser-ai-services/](https://collegenotesorg.github.io/noteparser-ai-services/)\n\n## \u2699\ufe0f Configuration\n\n### AI Services Configuration (`config/services.yml`)\n\n```yaml\nservices:\n  ragflow:\n    host: localhost\n    port: 8010\n    enabled: true\n    config:\n      embedding_model: \"sentence-transformers/all-MiniLM-L6-v2\"\n      vector_db_type: \"faiss\"\n      chunk_size: 512\n      top_k: 5\n\n  deepwiki:\n    host: localhost\n    port: 8011\n    enabled: true\n    config:\n      ai_model: \"gpt-3.5-turbo\"\n      auto_link: true\n      similarity_threshold: 0.7\n\nfeatures:\n  enable_rag: true\n  enable_wiki: true\n  enable_ai_suggestions: true\n```\n\n### Organization Configuration (`.noteparser-org.yml`)\n\n```yaml\norganization:\n  name: \"my-study-notes\"\n  base_path: \".\"\n  auto_discovery: true\n\nrepositories:\n  study-notes:\n    type: \"notes\"\n    auto_sync: true\n    formats: [\"markdown\", \"latex\"]\n  noteparser:\n    type: \"parser\"\n    auto_sync: false\n\nsync_settings:\n  auto_commit: true\n  commit_message_template: \"Auto-sync: {timestamp} - {file_count} files updated\"\n  branch: \"main\"\n  push_on_sync: false\n\ncross_references:\n  enabled: true\n  similarity_threshold: 0.7\n  max_suggestions: 5\n```\n\n### Plugin Configuration\n\n```yaml\nplugins:\n  math_processor:\n    enabled: true\n    config:\n      equation_numbering: true\n      symbol_standardization: true\n\n  cs_processor:\n    enabled: true\n    config:\n      code_line_numbers: true\n      auto_language_detection: true\n```\n\n## \ud83d\udd0c Plugin Development\n\nCreate custom plugins for specialized 
course content:\n\n```python\nfrom noteparser.plugins import BasePlugin\n\nclass ChemistryPlugin(BasePlugin):\n    name = \"chemistry_processor\"\n    version = \"1.0.0\"\n    description = \"Enhanced processing for chemistry courses\"\n    course_types = ['chemistry', 'organic', 'biochemistry']\n\n    def process_content(self, content: str, metadata: Dict[str, Any]) -> Dict[str, Any]:\n        # Your custom processing logic here\n        processed_content = self.enhance_chemical_formulas(content)\n\n        return {\n            'content': processed_content,\n            'metadata': {**metadata, 'chemical_formulas_found': count}\n        }\n```\n\n## \ud83c\udf0a GitHub Actions Integration\n\nAutomatic processing when you push new documents:\n\n```yaml\n# .github/workflows/parse-notes.yml\nname: Parse and Sync Notes\non:\n  push:\n    paths: ['input/**', 'raw-notes/**']\n\njobs:\n  parse-notes:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n      - name: Set up Python\n        uses: actions/setup-python@v4\n        with:\n          python-version: '3.11'\n      - name: Install dependencies\n        run: pip install noteparser[all]\n      - name: Parse documents\n        run: noteparser batch input/ --format markdown\n      - name: Sync to study-notes\n        run: noteparser sync output/*.md --target-repo study-notes\n```\n\n## \ud83d\udda5\ufe0f AI-Enhanced Web Dashboard\n\nAccess the AI-powered web interface at `http://localhost:5000`:\n\n```bash\nnoteparser web\n```\n\n### Core Features:\n- **Browse** all repositories and courses\n- **Search** across all notes with semantic similarity\n- **View** documents with syntax highlighting\n- **Parse** new documents through the web interface\n- **Manage** plugins and configuration\n- **Monitor** sync status and cross-references\n\n### AI Features (`/ai` dashboard):\n- **\ud83e\udd16 AI Document Analysis**: Upload and analyze documents with AI insights\n- **\ud83d\udd0d Knowledge Querying**: 
Natural language Q&A over your document library\n- **\ud83d\udcca Text Analysis**: Extract keywords, topics, and summaries from content\n- **\ud83d\ude80 Enhanced Search**: Semantic search with relevance ranking and AI answers\n- **\ud83d\udca1 Smart Insights**: Automatic topic detection and content relationships\n- **\ud83d\udcc8 Service Health**: Real-time monitoring of AI service status\n\n### Production Deployment:\n\n```bash\n# Using Docker Compose (recommended)\ndocker-compose -f docker-compose.prod.yml up -d\n\n# Using deployment script\n./scripts/deploy.sh production 2.1.0\n\n# Access the application\nopen http://localhost:5000\nopen http://localhost:5000/ai  # AI Dashboard\n```\n\n## \ud83d\udcca Use Cases\n\n### \ud83d\udcd6 **Individual Student**\n```bash\n# Daily workflow\nnoteparser parse \"Today's Lecture.pdf\"\nnoteparser sync output/todays-lecture.md --course CS101\n```\n\n### \ud83c\udfeb **Course Organization**\n```bash\n# Semester setup\nnoteparser init\nnoteparser batch course-materials/ --recursive\nnoteparser index --format json > course-index.json\n```\n\n### \ud83d\udc65 **Study Group**\n```bash\n# Collaborative notes\nnoteparser parse shared-notes.docx --format markdown\ngit add . 
&& git commit -m \"Add processed notes\"\ngit push origin main  # Triggers auto-sync via GitHub Actions\n```\n\n### \ud83d\udd2c **Research Lab**\n```bash\n# Research paper processing\nnoteparser parse \"Research Paper.pdf\" --format latex\nnoteparser web  # Browse and cross-reference with existing notes\n```\n\n## \ud83d\udcda Advanced Features\n\n### \ud83d\udd0d **Smart Content Detection**\n- **Mathematical equations**: Automatic LaTeX formatting preservation\n- **Code blocks**: Language detection and syntax highlighting\n- **Citations**: APA, MLA, IEEE format recognition\n- **Figures and tables**: Structured conversion with captions\n\n### \ud83c\udff7\ufe0f **Metadata Extraction**\n- **Course identification** from file names and paths\n- **Topic extraction** and categorization\n- **Author and date** detection\n- **Academic keywords** and tagging\n\n### \ud83d\udd17 **Cross-References**\n- **Similar content detection** across documents\n- **Prerequisite tracking** between topics\n- **Citation network** visualization\n- **Knowledge graph** construction\n\n## \ud83d\udee0\ufe0f Development\n\n### Setup Development Environment\n\n```bash\ngit clone https://github.com/CollegeNotesOrg/noteparser.git\ncd noteparser\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n\n# Install with all development dependencies (recommended)\npip install -e .[dev,all]\n\n# Or install dev tools only\npip install -e .[dev]\n\n# Install pre-commit hooks\npre-commit install\n```\n\n### Development Dependencies\n\nThe `[dev]` extra includes comprehensive development tools:\n\n- **Testing**: `pytest`, `pytest-cov`, `pytest-mock`, `pytest-asyncio`, `pytest-xdist`\n- **Code Quality**: `black`, `ruff`, `mypy`, `isort`, `pylint`, `pre-commit`\n- **Documentation**: `sphinx`, `mkdocs-material`, `myst-parser`\n- **Development Tools**: `ipython`, `jupyter`, `notebook`\n- **Profiling**: `memory-profiler`, `line-profiler`\n- **Security**: `bandit`, `safety`\n\n### 
Run Tests\n\n```bash\npytest tests/ -v --cov=noteparser\n```\n\n### Code Quality\n\n```bash\n# Auto-formatting (required)\nblack src/ tests/\n\n# Linting with auto-fixes\nruff check src/ tests/ --fix\n\n# Type checking\nmypy src/noteparser/ --ignore-missing-imports\n\n# All quality checks at once\nmake lint  # Runs black, ruff, and mypy\n```\n\n### CI/CD Information\n\nThe project uses GitHub Actions for continuous integration with the following jobs:\n\n- **Cross-platform testing** (Ubuntu, Windows, macOS) on Python 3.10-3.12\n- **Code quality checks** (black, ruff, mypy)\n- **Security scans** (bandit, safety)\n- **Performance benchmarking** with pytest-benchmark\n- **Docker image testing** and validation\n- **Integration testing** with Redis and PostgreSQL services\n\nAll dependencies are now managed through `pyproject.toml` for better reproducibility and CI reliability.\n\n## \ud83e\udd1d Contributing\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Make your changes\n4. Add tests for new functionality\n5. Run the test suite\n6. Commit your changes (`git commit -m 'Add amazing feature'`)\n7. Push to the branch (`git push origin feature/amazing-feature`)\n8. 
Open a Pull Request\n\n## \ud83d\udcc4 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## \ud83d\udce6 Dependencies\n\nAll dependencies are managed through `pyproject.toml` with the following structure:\n\n### Core Dependencies (included in base installation)\n- **markitdown** - Microsoft's document parsing engine\n- **Flask** - Web framework for dashboard\n- **Click** - CLI interface\n- **PyYAML** - Configuration management\n- **Pillow** - Image processing\n- **OpenCV** - Advanced image operations\n- **pytesseract** - OCR capabilities\n- **SpeechRecognition** - Audio transcription\n- **moviepy** - Video processing\n- **pandas** - Data processing\n- **requests** - HTTP client\n- **gunicorn** - Production WSGI server\n\n### Optional Dependency Groups\n\n#### `[ai]` - Advanced AI/ML Features\n- **sentence-transformers** - Semantic embeddings\n- **faiss-cpu** - Vector similarity search\n- **langchain** - LLM framework integration\n- **openai** - OpenAI API client\n- **sqlalchemy** - Database ORM\n- **elasticsearch** - Full-text search\n- **prometheus-client** - Metrics collection\n- **pydantic** - Data validation\n\n#### `[dev]` - Development Tools\n- **pytest** ecosystem - Testing framework\n- **black**, **ruff**, **mypy** - Code quality\n- **sphinx**, **mkdocs-material** - Documentation\n- **jupyter**, **ipython** - Interactive development\n- **bandit**, **safety** - Security scanning\n\n#### `[all]` - All Optional Features\nCombines AI and development dependencies for complete functionality.\n\n### Installation Examples\n```bash\npip install noteparser           # Core only\npip install noteparser[ai]       # Core + AI features\npip install noteparser[dev]      # Core + dev tools\npip install noteparser[all]      # Everything\n```\n\n## \ud83d\ude4f Acknowledgments\n\n- **Microsoft MarkItDown** - The core parsing engine that powers format conversion\n- **Academic Community** - For inspiration and 
requirements gathering\n- **Open Source Libraries** - All the amazing Python packages that make this possible\n\n## \ud83d\udcde Support\n\n- **Documentation**: [https://collegenotesorg.github.io/noteparser/](https://collegenotesorg.github.io/noteparser/)\n- **Issues**: [GitHub Issues](https://github.com/CollegeNotesOrg/noteparser/issues)\n- **Discussions**: [GitHub Discussions](https://github.com/CollegeNotesOrg/noteparser/discussions)\n\n---\n\n**Made with \u2764\ufe0f for students, by a student**\n\n*Transform your study materials into a searchable, interconnected knowledge base*\n\n---\n\n**Author**: Suryansh Sijwali\n**GitHub**: [@SuryanshSS1011](https://github.com/SuryanshSS1011)\n**Organization**: [CollegeNotesOrg](https://github.com/CollegeNotesOrg)\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "A comprehensive document parser with AI-powered intelligence for converting and analyzing academic materials",
    "version": "2.1.2",
    "project_urls": {
        "Homepage": "https://github.com/CollegeNotesOrg/noteparser",
        "Issues": "https://github.com/CollegeNotesOrg/noteparser/issues"
    },
    "split_keywords": [
        "markdown",
        " latex",
        " document",
        " parser",
        " converter",
        " pdf",
        " docx",
        " pptx",
        " ai",
        " semantic-search",
        " knowledge-graph",
        " rag"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "826b1cccb8cf42542ec638da086d65f478a14492c49154d690f992370d527611",
                "md5": "2cc93082285048b808b350a1dd1c013c",
                "sha256": "1ca4bec0aac8af904c38baa0471196e1e6bfc5e4b88a3281e7d30b33ecf29aa3"
            },
            "downloads": -1,
            "filename": "noteparser-2.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2cc93082285048b808b350a1dd1c013c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 102005,
            "upload_time": "2025-08-22T22:35:06",
            "upload_time_iso_8601": "2025-08-22T22:35:06.692048Z",
            "url": "https://files.pythonhosted.org/packages/82/6b/1cccb8cf42542ec638da086d65f478a14492c49154d690f992370d527611/noteparser-2.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "932e5e21d73d925b33e124c1edf8285a1fd0c6e83a1ca999e2b1461623375f26",
                "md5": "144516e79e14cf5396b5c666fd7f76ba",
                "sha256": "97cdf3b8b3922ce91dc08aacec0bb2e040120e0d5c1742e44541f663cbe1209b"
            },
            "downloads": -1,
            "filename": "noteparser-2.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "144516e79e14cf5396b5c666fd7f76ba",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 140849,
            "upload_time": "2025-08-22T22:35:08",
            "upload_time_iso_8601": "2025-08-22T22:35:08.540006Z",
            "url": "https://files.pythonhosted.org/packages/93/2e/5e21d73d925b33e124c1edf8285a1fd0c6e83a1ca999e2b1461623375f26/noteparser-2.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-22 22:35:08",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "CollegeNotesOrg",
    "github_project": "noteparser",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "click",
            "specs": [
                [
                    ">=",
                    "8.0.0"
                ]
            ]
        },
        {
            "name": "pyyaml",
            "specs": [
                [
                    ">=",
                    "6.0"
                ]
            ]
        },
        {
            "name": "rich",
            "specs": [
                [
                    ">=",
                    "13.0"
                ]
            ]
        },
        {
            "name": "markitdown",
            "specs": [
                [
                    ">=",
                    "0.1.0"
                ]
            ]
        },
        {
            "name": "flask",
            "specs": [
                [
                    ">=",
                    "2.3.0"
                ]
            ]
        },
        {
            "name": "SpeechRecognition",
            "specs": [
                [
                    ">=",
                    "3.10.0"
                ]
            ]
        },
        {
            "name": "moviepy",
            "specs": [
                [
                    ">=",
                    "1.0.3"
                ]
            ]
        },
        {
            "name": "pydub",
            "specs": [
                [
                    ">=",
                    "0.25.1"
                ]
            ]
        },
        {
            "name": "pytesseract",
            "specs": [
                [
                    ">=",
                    "0.3.10"
                ]
            ]
        },
        {
            "name": "pillow",
            "specs": [
                [
                    ">=",
                    "10.0.0"
                ]
            ]
        },
        {
            "name": "opencv-python",
            "specs": [
                [
                    ">=",
                    "4.8.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.24.0"
                ]
            ]
        },
        {
            "name": "python-docx",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "python-pptx",
            "specs": [
                [
                    ">=",
                    "0.6.0"
                ]
            ]
        },
        {
            "name": "openpyxl",
            "specs": [
                [
                    ">=",
                    "3.1.0"
                ]
            ]
        },
        {
            "name": "pypdf",
            "specs": [
                [
                    ">=",
                    "3.17.0"
                ]
            ]
        },
        {
            "name": "pdfplumber",
            "specs": [
                [
                    ">=",
                    "0.10.0"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "7.0.0"
                ]
            ]
        },
        {
            "name": "pytest-cov",
            "specs": [
                [
                    ">=",
                    "4.0.0"
                ]
            ]
        },
        {
            "name": "black",
            "specs": [
                [
                    ">=",
                    "23.0.0"
                ]
            ]
        },
        {
            "name": "ruff",
            "specs": [
                [
                    ">=",
                    "0.1.0"
                ]
            ]
        },
        {
            "name": "mypy",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "pre-commit",
            "specs": [
                [
                    ">=",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "beautifulsoup4",
            "specs": [
                [
                    ">=",
                    "4.12.0"
                ]
            ]
        },
        {
            "name": "lxml",
            "specs": [
                [
                    ">=",
                    "4.9.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    ">=",
                    "3.7.0"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.31.0"
                ]
            ]
        },
        {
            "name": "gunicorn",
            "specs": [
                [
                    ">=",
                    "21.2.0"
                ]
            ]
        },
        {
            "name": "python-dotenv",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "aiohttp",
            "specs": [
                [
                    ">=",
                    "3.9.0"
                ]
            ]
        },
        {
            "name": "asyncio",
            "specs": [
                [
                    ">=",
                    "3.4.3"
                ]
            ]
        },
        {
            "name": "sentence-transformers",
            "specs": [
                [
                    ">=",
                    "2.2.0"
                ]
            ]
        },
        {
            "name": "faiss-cpu",
            "specs": [
                [
                    ">=",
                    "1.7.4"
                ]
            ]
        },
        {
            "name": "tiktoken",
            "specs": [
                [
                    ">=",
                    "0.5.0"
                ]
            ]
        },
        {
            "name": "langchain",
            "specs": [
                [
                    ">=",
                    "0.1.0"
                ]
            ]
        },
        {
            "name": "openai",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "sqlalchemy",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "alembic",
            "specs": [
                [
                    ">=",
                    "1.12.0"
                ]
            ]
        },
        {
            "name": "redis",
            "specs": [
                [
                    ">=",
                    "5.0.0"
                ]
            ]
        },
        {
            "name": "psycopg2-binary",
            "specs": [
                [
                    ">=",
                    "2.9.0"
                ]
            ]
        },
        {
            "name": "elasticsearch",
            "specs": [
                [
                    ">=",
                    "8.11.0"
                ]
            ]
        },
        {
            "name": "prometheus-client",
            "specs": [
                [
                    ">=",
                    "0.19.0"
                ]
            ]
        },
        {
            "name": "opentelemetry-api",
            "specs": [
                [
                    ">=",
                    "1.20.0"
                ]
            ]
        },
        {
            "name": "opentelemetry-sdk",
            "specs": [
                [
                    ">=",
                    "1.20.0"
                ]
            ]
        },
        {
            "name": "opentelemetry-instrumentation-flask",
            "specs": [
                [
                    ">=",
                    "0.42b0"
                ]
            ]
        },
        {
            "name": "structlog",
            "specs": [
                [
                    ">=",
                    "23.2.0"
                ]
            ]
        },
        {
            "name": "pydantic",
            "specs": [
                [
                    ">=",
                    "2.5.0"
                ]
            ]
        },
        {
            "name": "marshmallow",
            "specs": [
                [
                    ">=",
                    "3.20.0"
                ]
            ]
        },
        {
            "name": "fastapi",
            "specs": [
                [
                    ">=",
                    "0.104.0"
                ]
            ]
        }
    ],
    "lcname": "noteparser"
}
