# NoteParser 📚
**A comprehensive AI-powered document parser for converting and analyzing academic materials**
NoteParser is a powerful AI-enhanced academic document processing system built on Microsoft's [MarkItDown](https://github.com/microsoft/markitdown) library. It combines traditional document parsing with cutting-edge AI services to provide intelligent document analysis, semantic search, and knowledge extraction for university students and educators.
## ✨ Key Features
### 🔄 **Multi-Format Support**
- **Documents**: PDF, DOCX, PPTX, XLSX, HTML, EPUB
- **Media**: Images with OCR, Audio/Video with transcription
- **Output**: Markdown, LaTeX, HTML
### 🎓 **Academic-Focused Processing**
- **Mathematical equations** preservation and enhancement
- **Code blocks** with syntax highlighting and language detection
- **Bibliography** and citation extraction
- **Chemical formulas** with proper subscript formatting
- **Academic keyword highlighting** (theorem, proof, definition, etc.)
### 🔌 **Extensible Plugin System**
- **Course-specific processors** (Math, Computer Science, Chemistry)
- **Custom parser plugins** for specialized content
- **Easy plugin development** with base classes
### 🌐 **Organization Integration**
- **Multi-repository synchronization** for course organization
- **Cross-reference detection** between related documents
- **Automated GitHub Actions** for continuous processing
- **Searchable indexing** across all notes
### 🤖 **AI-Powered Intelligence**
- **Semantic document analysis** with keyword and topic extraction
- **Natural language Q&A** over your document library
- **Intelligent summarization** and insight generation
- **Knowledge graph** construction and navigation
- **AI-enhanced search** with relevance ranking
### 🖥️ **Multiple Interfaces**
- **AI-enhanced CLI** with natural language commands
- **Interactive web dashboard** with AI features
- **Python API** with async AI integration
- **REST API** endpoints with AI processing
## 🚀 Quick Start
### Installation
#### Option 1: Using pip (Recommended)
```bash
# Install from PyPI with all features (recommended)
pip install noteparser[all]
# Install with AI features only
pip install noteparser[ai]
# Install basic version
pip install noteparser
# Or install from source (editable), including development tools
git clone https://github.com/CollegeNotesOrg/noteparser.git
cd noteparser
pip install -e .[dev,all]
```
#### Option 2: Development Installation
```bash
# Clone the repository
git clone https://github.com/CollegeNotesOrg/noteparser.git
cd noteparser
# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install with all dependencies (includes dev tools)
pip install -e .[dev,all]
# Or install with specific feature sets
pip install -e .[dev] # Development tools only
pip install -e .[ai] # AI features only
```
> **Note**: As of v2.1.0, all dependencies are managed through `pyproject.toml`. The `requirements.txt` files are maintained for compatibility but using pip extras is the recommended approach.
#### System Dependencies
Some features require system packages:
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-eng \
    ffmpeg \
    poppler-utils
# macOS
brew install tesseract ffmpeg poppler
# Windows (using Chocolatey)
choco install tesseract ffmpeg poppler
```
#### Python Version Compatibility
- **Python 3.10+** is required (updated from 3.9+ due to markitdown dependency)
- Tested on Python 3.10, 3.11, and 3.12
- **Python 3.9 and earlier** are no longer supported because current dependencies require newer versions
### Basic Usage
```bash
# Initialize in your project directory
noteparser init
# Parse a single document
noteparser parse lecture.pdf --format markdown
# Parse with AI enhancement
noteparser ai analyze lecture.pdf --output enhanced-lecture.md
# Query your knowledge base
noteparser ai query "What is machine learning?" --filters '{"course": "CS101"}'
# Batch process a directory
noteparser batch input/ --recursive --format latex
# Start the AI-enhanced web dashboard
noteparser web --host 0.0.0.0 --port 5000
# Check AI services health
noteparser ai health --detailed
# Sync to organization repository
noteparser sync output/*.md --target-repo study-notes --course CS101
```
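The `--filters` value in the query example is a JSON object passed as a single shell argument. When scripting the CLI, hand-written quoting is easy to get wrong, so here is a minimal sketch that serializes the filters with `json.dumps` and builds an argument list. The function name `build_query_command` is hypothetical; only the subcommand and flag shown above are used.

```python
import json
import shlex


def build_query_command(question: str, filters: dict) -> list:
    """Return the argv for `noteparser ai query`, with filters encoded as JSON."""
    return ["noteparser", "ai", "query", question, "--filters", json.dumps(filters)]


argv = build_query_command("What is machine learning?", {"course": "CS101"})

# shlex.join produces a copy-pasteable shell line with safe quoting.
print(shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run` sidesteps shell quoting altogether.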
### Python API
```python
import asyncio

from noteparser import NoteParser
from noteparser.integration import OrganizationSync

# Initialize parser with AI capabilities.
# your_llm_client is a placeholder for an LLM client object you create yourself.
parser = NoteParser(enable_ai=True, llm_client=your_llm_client)

# Parse single document
result = parser.parse_to_markdown("lecture.pdf")
print(result['content'])


# Parse with AI enhancement
async def ai_parse():
    result = await parser.parse_to_markdown_with_ai("lecture.pdf")
    print(f"Content: {result['content']}")
    print(f"AI Insights: {result['ai_processing']}")

asyncio.run(ai_parse())


# Query knowledge base
async def query_knowledge():
    result = await parser.query_knowledge(
        "What are the key concepts in machine learning?",
        filters={"course": "CS101"}
    )
    print(f"Answer: {result['answer']}")
    for doc in result['documents']:
        print(f"- {doc['title']} (relevance: {doc['score']:.2f})")

asyncio.run(query_knowledge())

# Batch processing
results = parser.parse_batch("notes/", output_format="markdown")

# Organization sync
org_sync = OrganizationSync()
sync_result = org_sync.sync_parsed_notes(
    source_files=["note1.md", "note2.md"],
    target_repo="study-notes",
    course="CS101"
)
```
## 📁 Project Structure
```
your-study-organization/
├── noteparser/ # This repository - AI-powered parsing engine
├── noteparser-ai-services/ # AI microservices (RagFlow, DeepWiki)
├── study-notes/ # Main notes repository
│ ├── courses/
│ │ ├── CS101/
│ │ ├── MATH201/
│ │ └── PHYS301/
│ └── .noteparser.yml # Organization configuration
├── note-templates/ # Shared LaTeX/Markdown templates
├── note-extensions/ # Custom plugins
└── note-dashboard/ # Optional: separate web interface
```
## 🤖 AI Services Setup
NoteParser can operate in two modes:
### Standalone Mode (Basic)
Works without external AI services - provides core document parsing functionality.
### AI-Enhanced Mode (Recommended)
Requires the `noteparser-ai-services` repository for full AI capabilities.
```bash
# Clone and start AI services
git clone https://github.com/CollegeNotesOrg/noteparser-ai-services.git
cd noteparser-ai-services
docker-compose up -d
# Verify services are running
curl http://localhost:8010/health # RagFlow
curl http://localhost:8011/health # DeepWiki
# Test AI integration
noteparser ai health
```
**AI Services Documentation**: [https://collegenotesorg.github.io/noteparser-ai-services/](https://collegenotesorg.github.io/noteparser-ai-services/)
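The same health checks can be scripted. The sketch below uses only the standard library and the two ports from the `curl` commands above. It assumes that an HTTP 200 response means "healthy"; the format of the response body is not documented here, so it is ignored. The names `SERVICES`, `health_url`, and `is_healthy` are illustrative, not part of NoteParser.

```python
import urllib.error
import urllib.request

# Ports taken from the curl commands above; the dictionary keys are illustrative.
SERVICES = {"ragflow": 8010, "deepwiki": 8011}


def health_url(port: int, host: str = "localhost") -> str:
    """Build the health endpoint URL for a service port."""
    return f"http://{host}:{port}/health"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status.
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        status = "up" if is_healthy(health_url(port)) else "down"
        print(f"{name}: {status}")
```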
## ⚙️ Configuration
### AI Services Configuration (`config/services.yml`)
```yaml
services:
  ragflow:
    host: localhost
    port: 8010
    enabled: true
    config:
      embedding_model: "sentence-transformers/all-MiniLM-L6-v2"
      vector_db_type: "faiss"
      chunk_size: 512
      top_k: 5

  deepwiki:
    host: localhost
    port: 8011
    enabled: true
    config:
      ai_model: "gpt-3.5-turbo"
      auto_link: true
      similarity_threshold: 0.7

features:
  enable_rag: true
  enable_wiki: true
  enable_ai_suggestions: true
```
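To give a feel for what `chunk_size` and `top_k` control in a retrieval setup, here is a small standalone sketch. It is not the RagFlow implementation: it assumes `chunk_size` counts characters (the service may count tokens instead) and that `top_k` keeps the highest-scoring chunks. The scores in a real system would come from embedding similarity; here they are simply passed in.

```python
import heapq


def chunk_text(text: str, chunk_size: int = 512) -> list:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def top_k(scored_chunks: list, k: int = 5) -> list:
    """Keep the k (score, chunk) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])


chunks = chunk_text("x" * 1100)
print([len(c) for c in chunks])
```

Smaller chunks give more precise matches but less context per hit; a larger `top_k` returns more context at the cost of more text for the language model to read.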
### Organization Configuration (`.noteparser-org.yml`)
```yaml
organization:
  name: "my-study-notes"
  base_path: "."
  auto_discovery: true

repositories:
  study-notes:
    type: "notes"
    auto_sync: true
    formats: ["markdown", "latex"]
  noteparser:
    type: "parser"
    auto_sync: false

sync_settings:
  auto_commit: true
  commit_message_template: "Auto-sync: {timestamp} - {file_count} files updated"
  branch: "main"
  push_on_sync: false

cross_references:
  enabled: true
  similarity_threshold: 0.7
  max_suggestions: 5
```
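The `commit_message_template` value contains `{timestamp}` and `{file_count}` placeholders. The sketch below shows how such a template could be filled in, assuming the placeholders follow Python's `str.format` syntax. The timestamp format and the function name `render_commit_message` are assumptions made for illustration; NoteParser may render the message differently.

```python
from datetime import datetime, timezone

TEMPLATE = "Auto-sync: {timestamp} - {file_count} files updated"


def render_commit_message(template: str, file_count: int, now=None) -> str:
    """Fill the template's placeholders; the timestamp format is an assumption."""
    now = now or datetime.now(timezone.utc)
    return template.format(
        timestamp=now.strftime("%Y-%m-%d %H:%M:%S"),
        file_count=file_count,
    )


fixed = datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(render_commit_message(TEMPLATE, 3, now=fixed))
# prints: Auto-sync: 2025-01-02 03:04:05 - 3 files updated
```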
### Plugin Configuration
```yaml
plugins:
  math_processor:
    enabled: true
    config:
      equation_numbering: true
      symbol_standardization: true

  cs_processor:
    enabled: true
    config:
      code_line_numbers: true
      auto_language_detection: true
```
## 🔌 Plugin Development
Create custom plugins for specialized course content:
```python
from typing import Any, Dict

from noteparser.plugins import BasePlugin


class ChemistryPlugin(BasePlugin):
    name = "chemistry_processor"
    version = "1.0.0"
    description = "Enhanced processing for chemistry courses"
    course_types = ['chemistry', 'organic', 'biochemistry']

    def process_content(self, content: str, metadata: Dict[str, Any]) -> Dict[str, Any]:
        # Your custom processing logic here. enhance_chemical_formulas is a
        # helper you write yourself; this example assumes it returns the
        # rewritten text together with the number of formulas it found.
        processed_content, count = self.enhance_chemical_formulas(content)

        return {
            'content': processed_content,
            'metadata': {**metadata, 'chemical_formulas_found': count}
        }
```
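As a starting point for the chemistry-specific logic, here is one possible way to apply subscript formatting to a chemical formula. It is a standalone sketch, not NoteParser code: it assumes Unicode subscript digits are an acceptable output (HTML `<sub>` tags or LaTeX would work equally well), and the name `subscript_formula` is hypothetical.

```python
import re

# Maps ASCII digits to their Unicode subscript counterparts.
SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

# A run of digits that directly follows a letter or a closing parenthesis.
_COUNT = re.compile(r"(?<=[A-Za-z)])\d+")


def subscript_formula(formula: str) -> str:
    """Turn atom counts into subscripts, leaving leading coefficients alone."""
    return _COUNT.sub(lambda m: m.group().translate(SUBSCRIPTS), formula)


for formula in ("H2SO4", "Ca(OH)2", "2H2O"):
    print(formula, "->", subscript_formula(formula))
```

The pattern is deliberately naive: it would also rewrite a course code such as `CS101`, so it should only be applied to text already identified as a formula.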
## 🌊 GitHub Actions Integration
Automatic processing when you push new documents:
```yaml
# .github/workflows/parse-notes.yml
name: Parse and Sync Notes
on:
  push:
    paths: ['input/**', 'raw-notes/**']

jobs:
  parse-notes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install noteparser[all]
      - name: Parse documents
        run: noteparser batch input/ --format markdown
      - name: Sync to study-notes
        run: noteparser sync output/*.md --target-repo study-notes
```
## 🖥️ AI-Enhanced Web Dashboard
Access the AI-powered web interface at `http://localhost:5000`:
```bash
noteparser web
```
### Core Features:
- **Browse** all repositories and courses
- **Search** across all notes with semantic similarity
- **View** documents with syntax highlighting
- **Parse** new documents through the web interface
- **Manage** plugins and configuration
- **Monitor** sync status and cross-references
### AI Features (`/ai` dashboard):
- **🤖 AI Document Analysis**: Upload and analyze documents with AI insights
- **🔍 Knowledge Querying**: Natural language Q&A over your document library
- **📊 Text Analysis**: Extract keywords, topics, and summaries from content
- **🚀 Enhanced Search**: Semantic search with relevance ranking and AI answers
- **💡 Smart Insights**: Automatic topic detection and content relationships
- **📈 Service Health**: Real-time monitoring of AI service status
### Production Deployment:
```bash
# Using Docker Compose (recommended)
docker-compose -f docker-compose.prod.yml up -d
# Using deployment script
./scripts/deploy.sh production 2.1.0
# Access the application
open http://localhost:5000
open http://localhost:5000/ai # AI Dashboard
```
## 📊 Use Cases
### 📖 **Individual Student**
```bash
# Daily workflow
noteparser parse "Today's Lecture.pdf"
noteparser sync output/todays-lecture.md --course CS101
```
### 🏫 **Course Organization**
```bash
# Semester setup
noteparser init
noteparser batch course-materials/ --recursive
noteparser index --format json > course-index.json
```
### 👥 **Study Group**
```bash
# Collaborative notes
noteparser parse shared-notes.docx --format markdown
git add . && git commit -m "Add processed notes"
git push origin main # Triggers auto-sync via GitHub Actions
```
### 🔬 **Research Lab**
```bash
# Research paper processing
noteparser parse "Research Paper.pdf" --format latex
noteparser web # Browse and cross-reference with existing notes
```
## 📚 Advanced Features
### 🔍 **Smart Content Detection**
- **Mathematical equations**: Automatic LaTeX formatting preservation
- **Code blocks**: Language detection and syntax highlighting
- **Citations**: APA, MLA, IEEE format recognition
- **Figures and tables**: Structured conversion with captions
### 🏷️ **Metadata Extraction**
- **Course identification** from file names and paths
- **Topic extraction** and categorization
- **Author and date** detection
- **Academic keywords** and tagging
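
Course identification from file names and paths can be pictured with a small regular-expression sketch. This is an illustration only, not NoteParser's actual logic: it assumes course codes look like two to four letters followed by three digits (for example `CS101` or `math_201`), and the names `COURSE_CODE` and `find_course` are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# 2-4 letters, an optional separator, then exactly three digits.
COURSE_CODE = re.compile(r"(?<![A-Za-z])([A-Za-z]{2,4})[-_ ]?(\d{3})(?!\d)")


def find_course(path: str) -> Optional[str]:
    """Return a normalized course code found in the path, or None."""
    # Check the file name first, then walk up through parent directories.
    for part in reversed(PurePosixPath(path).parts):
        match = COURSE_CODE.search(part)
        if match:
            return match.group(1).upper() + match.group(2)
    return None


for path in ("courses/CS101/lecture-03.pdf", "raw/math_201_week1.md", "misc/readme.md"):
    print(path, "->", find_course(path))
```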
### 🔗 **Cross-References**
- **Similar content detection** across documents
- **Prerequisite tracking** between topics
- **Citation network** visualization
- **Knowledge graph** construction
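
To make similar-content detection concrete, the sketch below scores documents with cosine similarity over word counts and applies a threshold and a suggestion limit, mirroring the `similarity_threshold` and `max_suggestions` settings from the organization configuration. It is a simplified stand-in: the AI-enhanced mode lists sentence-transformer embeddings among its dependencies, and nothing here is NoteParser's own code. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Count lowercase alphanumeric words in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def suggest_related(query: str, corpus: dict, threshold: float = 0.7, max_suggestions: int = 5) -> list:
    """Return up to max_suggestions (score, name) pairs at or above the threshold."""
    q = term_counts(query)
    scored = [(cosine_similarity(q, term_counts(text)), name) for name, text in corpus.items()]
    hits = [pair for pair in scored if pair[0] >= threshold]
    return sorted(hits, reverse=True)[:max_suggestions]


corpus = {
    "ml-intro.md": "gradient descent minimizes the loss function",
    "chem-notes.md": "covalent bonds share electron pairs",
}
print(suggest_related("gradient descent minimizes the loss function", corpus))
```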
## 🛠️ Development
### Setup Development Environment
```bash
git clone https://github.com/CollegeNotesOrg/noteparser.git
cd noteparser
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install with all development dependencies (recommended)
pip install -e .[dev,all]
# Or install dev tools only
pip install -e .[dev]
# Install pre-commit hooks
pre-commit install
```
### Development Dependencies
The `[dev]` extra includes comprehensive development tools:
- **Testing**: `pytest`, `pytest-cov`, `pytest-mock`, `pytest-asyncio`, `pytest-xdist`
- **Code Quality**: `black`, `ruff`, `mypy`, `isort`, `pylint`, `pre-commit`
- **Documentation**: `sphinx`, `mkdocs-material`, `myst-parser`
- **Development Tools**: `ipython`, `jupyter`, `notebook`
- **Profiling**: `memory-profiler`, `line-profiler`
- **Security**: `bandit`, `safety`
### Run Tests
```bash
pytest tests/ -v --cov=noteparser
```
### Code Quality
```bash
# Auto-formatting (required)
black src/ tests/
# Linting with auto-fixes
ruff check src/ tests/ --fix
# Type checking
mypy src/noteparser/ --ignore-missing-imports
# All quality checks at once
make lint # Runs black, ruff, and mypy
```
### CI/CD Information
The project uses GitHub Actions for continuous integration with the following jobs:
- **Cross-platform testing** (Ubuntu, Windows, macOS) on Python 3.10-3.12
- **Code quality checks** (black, ruff, mypy)
- **Security scans** (bandit, safety)
- **Performance benchmarking** with pytest-benchmark
- **Docker image testing** and validation
- **Integration testing** with Redis and PostgreSQL services
All dependencies are now managed through `pyproject.toml` for better reproducibility and CI reliability.
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 📦 Dependencies
All dependencies are managed through `pyproject.toml` with the following structure:
### Core Dependencies (included in base installation)
- **markitdown** - Microsoft's document parsing engine
- **Flask** - Web framework for dashboard
- **Click** - CLI interface
- **PyYAML** - Configuration management
- **Pillow** - Image processing
- **OpenCV** - Advanced image operations
- **pytesseract** - OCR capabilities
- **SpeechRecognition** - Audio transcription
- **moviepy** - Video processing
- **pandas** - Data processing
- **requests** - HTTP client
- **gunicorn** - Production WSGI server
### Optional Dependency Groups
#### `[ai]` - Advanced AI/ML Features
- **sentence-transformers** - Semantic embeddings
- **faiss-cpu** - Vector similarity search
- **langchain** - LLM framework integration
- **openai** - OpenAI API client
- **sqlalchemy** - Database ORM
- **elasticsearch** - Full-text search
- **prometheus-client** - Metrics collection
- **pydantic** - Data validation
#### `[dev]` - Development Tools
- **pytest** ecosystem - Testing framework
- **black**, **ruff**, **mypy** - Code quality
- **sphinx**, **mkdocs-material** - Documentation
- **jupyter**, **ipython** - Interactive development
- **bandit**, **safety** - Security scanning
#### `[all]` - All Optional Features
Combines AI and development dependencies for complete functionality.
### Installation Examples
```bash
pip install noteparser # Core only
pip install noteparser[ai] # Core + AI features
pip install noteparser[dev] # Core + dev tools
pip install noteparser[all] # Everything
```
## 🙏 Acknowledgments
- **Microsoft MarkItDown** - The core parsing engine that powers format conversion
- **Academic Community** - For inspiration and requirements gathering
- **Open Source Libraries** - All the amazing Python packages that make this possible
## 📞 Support
- **Documentation**: [https://collegenotesorg.github.io/noteparser/](https://collegenotesorg.github.io/noteparser/)
- **Issues**: [GitHub Issues](https://github.com/CollegeNotesOrg/noteparser/issues)
- **Discussions**: [GitHub Discussions](https://github.com/CollegeNotesOrg/noteparser/discussions)
---
**Made with ❤️ for students, by a student**
*Transform your study materials into a searchable, interconnected knowledge base*
---
**Author**: Suryansh Sijwali
**GitHub**: [@SuryanshSS1011](https://github.com/SuryanshSS1011)
**Organization**: [CollegeNotesOrg](https://github.com/CollegeNotesOrg)
Raw data
{
"_id": null,
"home_page": null,
"name": "noteparser",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "markdown, latex, document, parser, converter, pdf, docx, pptx, ai, semantic-search, knowledge-graph, rag",
"author": null,
"author_email": "Suryansh Sijwali <suryanshss1011@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/93/2e/5e21d73d925b33e124c1edf8285a1fd0c6e83a1ca999e2b1461623375f26/noteparser-2.1.2.tar.gz",
"platform": null,
"description": "# NoteParser \ud83d\udcda\n\n**A comprehensive AI-powered document parser for converting and analyzing academic materials**\n\n[](https://www.python.org/downloads/)\n[](https://opensource.org/licenses/MIT)\n[](https://badge.fury.io/py/noteparser)\n[](https://github.com/CollegeNotesOrg/noteparser/actions)\n[](https://codecov.io/gh/CollegeNotesOrg/noteparser)\n\nNoteParser is a powerful AI-enhanced academic document processing system built on Microsoft's [MarkItDown](https://github.com/microsoft/markitdown) library. It combines traditional document parsing with cutting-edge AI services to provide intelligent document analysis, semantic search, and knowledge extraction for university students and educators.\n\n## \u2728 Key Features\n\n### \ud83d\udd04 **Multi-Format Support**\n- **Documents**: PDF, DOCX, PPTX, XLSX, HTML, EPUB\n- **Media**: Images with OCR, Audio/Video with transcription\n- **Output**: Markdown, LaTeX, HTML\n\n### \ud83c\udf93 **Academic-Focused Processing**\n- **Mathematical equations** preservation and enhancement\n- **Code blocks** with syntax highlighting and language detection\n- **Bibliography** and citation extraction\n- **Chemical formulas** with proper subscript formatting\n- **Academic keyword highlighting** (theorem, proof, definition, etc.)\n\n### \ud83d\udd0c **Extensible Plugin System**\n- **Course-specific processors** (Math, Computer Science, Chemistry)\n- **Custom parser plugins** for specialized content\n- **Easy plugin development** with base classes\n\n### \ud83c\udf10 **Organization Integration**\n- **Multi-repository synchronization** for course organization\n- **Cross-reference detection** between related documents\n- **Automated GitHub Actions** for continuous processing\n- **Searchable indexing** across all notes\n\n### \ud83e\udd16 **AI-Powered Intelligence**\n- **Semantic document analysis** with keyword and topic extraction\n- **Natural language Q&A** over your document library\n- **Intelligent 
summarization** and insight generation\n- **Knowledge graph** construction and navigation\n- **AI-enhanced search** with relevance ranking\n\n### \ud83d\udda5\ufe0f **Multiple Interfaces**\n- **AI-enhanced CLI** with natural language commands\n- **Interactive web dashboard** with AI features\n- **Python API** with async AI integration\n- **REST API** endpoints with AI processing\n\n## \ud83d\ude80 Quick Start\n\n### Installation\n\n#### Option 1: Using pip (Recommended)\n\n```bash\n# Install from PyPI with all features (recommended)\npip install noteparser[all]\n\n# Install with AI features only\npip install noteparser[ai]\n\n# Install basic version\npip install noteparser\n\n# Install from source with all features (recommended)\ngit clone https://github.com/CollegeNotesOrg/noteparser.git\ncd noteparser\npip install -e .[dev,all]\n```\n\n#### Option 2: Development Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/CollegeNotesOrg/noteparser.git\ncd noteparser\n\n# Create a virtual environment (recommended)\npython -m venv venv\nsource venv/bin/activate # On Windows: venv\\Scripts\\activate\n\n# Install with all dependencies (includes dev tools)\npip install -e .[dev,all]\n\n# Or install with specific feature sets\npip install -e .[dev] # Development tools only\npip install -e .[ai] # AI features only\n```\n\n> **Note**: As of v2.1.0, all dependencies are managed through `pyproject.toml`. 
The `requirements.txt` files are maintained for compatibility but using pip extras is the recommended approach.\n\n#### System Dependencies\n\nSome features require system packages:\n\n```bash\n# Ubuntu/Debian\nsudo apt-get update\nsudo apt-get install -y \\\n tesseract-ocr \\\n tesseract-ocr-eng \\\n ffmpeg \\\n poppler-utils\n\n# macOS\nbrew install tesseract ffmpeg poppler\n\n# Windows (using Chocolatey)\nchoco install tesseract ffmpeg poppler\n```\n\n#### Python Version Compatibility\n\n- **Python 3.10+** is required (updated from 3.9+ due to markitdown dependency)\n- Tested on Python 3.10, 3.11, and 3.12\n- **Python 3.9 and earlier** support was removed due to compatibility requirements with latest dependencies\n\n### Basic Usage\n\n```bash\n# Initialize in your project directory\nnoteparser init\n\n# Parse a single document\nnoteparser parse lecture.pdf --format markdown\n\n# Parse with AI enhancement\nnoteparser ai analyze lecture.pdf --output enhanced-lecture.md\n\n# Query your knowledge base\nnoteparser ai query \"What is machine learning?\" --filters '{\"course\": \"CS101\"}'\n\n# Batch process a directory\nnoteparser batch input/ --recursive --format latex\n\n# Start the AI-enhanced web dashboard\nnoteparser web --host 0.0.0.0 --port 5000\n\n# Check AI services health\nnoteparser ai health --detailed\n\n# Sync to organization repository\nnoteparser sync output/*.md --target-repo study-notes --course CS101\n```\n\n### Python API\n\n```python\nimport asyncio\nfrom noteparser import NoteParser\nfrom noteparser.integration import OrganizationSync\n\n# Initialize parser with AI capabilities\nparser = NoteParser(enable_ai=True, llm_client=your_llm_client)\n\n# Parse single document\nresult = parser.parse_to_markdown(\"lecture.pdf\")\nprint(result['content'])\n\n# Parse with AI enhancement\nasync def ai_parse():\n result = await parser.parse_to_markdown_with_ai(\"lecture.pdf\")\n print(f\"Content: {result['content']}\")\n print(f\"AI Insights: 
{result['ai_processing']}\")\n\nasyncio.run(ai_parse())\n\n# Query knowledge base\nasync def query_knowledge():\n result = await parser.query_knowledge(\n \"What are the key concepts in machine learning?\",\n filters={\"course\": \"CS101\"}\n )\n print(f\"Answer: {result['answer']}\")\n for doc in result['documents']:\n print(f\"- {doc['title']} (relevance: {doc['score']:.2f})\")\n\nasyncio.run(query_knowledge())\n\n# Batch processing\nresults = parser.parse_batch(\"notes/\", output_format=\"markdown\")\n\n# Organization sync\norg_sync = OrganizationSync()\nsync_result = org_sync.sync_parsed_notes(\n source_files=[\"note1.md\", \"note2.md\"],\n target_repo=\"study-notes\",\n course=\"CS101\"\n)\n```\n\n## \ud83d\udcc1 Project Structure\n\n```\nyour-study-organization/\n\u251c\u2500\u2500 noteparser/ # This repository - AI-powered parsing engine\n\u251c\u2500\u2500 noteparser-ai-services/ # AI microservices (RagFlow, DeepWiki)\n\u251c\u2500\u2500 study-notes/ # Main notes repository\n\u2502 \u251c\u2500\u2500 courses/\n\u2502 \u2502 \u251c\u2500\u2500 CS101/\n\u2502 \u2502 \u251c\u2500\u2500 MATH201/\n\u2502 \u2502 \u2514\u2500\u2500 PHYS301/\n\u2502 \u2514\u2500\u2500 .noteparser.yml # Organization configuration\n\u251c\u2500\u2500 note-templates/ # Shared LaTeX/Markdown templates\n\u251c\u2500\u2500 note-extensions/ # Custom plugins\n\u2514\u2500\u2500 note-dashboard/ # Optional: separate web interface\n```\n\n## \ud83e\udd16 AI Services Setup\n\nNoteParser can operate in two modes:\n\n### Standalone Mode (Basic)\nWorks without external AI services - provides core document parsing functionality.\n\n### AI-Enhanced Mode (Recommended)\nRequires the `noteparser-ai-services` repository for full AI capabilities.\n\n```bash\n# Clone and start AI services\ngit clone https://github.com/CollegeNotesOrg/noteparser-ai-services.git\ncd noteparser-ai-services\ndocker-compose up -d\n\n# Verify services are running\ncurl http://localhost:8010/health # RagFlow\ncurl 
http://localhost:8011/health # DeepWiki\n\n# Test AI integration\nnoteparser ai health\n```\n\n**AI Services Documentation**: [https://collegenotesorg.github.io/noteparser-ai-services/](https://collegenotesorg.github.io/noteparser-ai-services/)\n\n## \u2699\ufe0f Configuration\n\n### AI Services Configuration (`config/services.yml`)\n\n```yaml\nservices:\n ragflow:\n host: localhost\n port: 8010\n enabled: true\n config:\n embedding_model: \"sentence-transformers/all-MiniLM-L6-v2\"\n vector_db_type: \"faiss\"\n chunk_size: 512\n top_k: 5\n\n deepwiki:\n host: localhost\n port: 8011\n enabled: true\n config:\n ai_model: \"gpt-3.5-turbo\"\n auto_link: true\n similarity_threshold: 0.7\n\nfeatures:\n enable_rag: true\n enable_wiki: true\n enable_ai_suggestions: true\n```\n\n### Organization Configuration (`.noteparser-org.yml`)\n\n```yaml\norganization:\n name: \"my-study-notes\"\n base_path: \".\"\n auto_discovery: true\n\nrepositories:\n study-notes:\n type: \"notes\"\n auto_sync: true\n formats: [\"markdown\", \"latex\"]\n noteparser:\n type: \"parser\"\n auto_sync: false\n\nsync_settings:\n auto_commit: true\n commit_message_template: \"Auto-sync: {timestamp} - {file_count} files updated\"\n branch: \"main\"\n push_on_sync: false\n\ncross_references:\n enabled: true\n similarity_threshold: 0.7\n max_suggestions: 5\n```\n\n### Plugin Configuration\n\n```yaml\nplugins:\n math_processor:\n enabled: true\n config:\n equation_numbering: true\n symbol_standardization: true\n\n cs_processor:\n enabled: true\n config:\n code_line_numbers: true\n auto_language_detection: true\n```\n\n## \ud83d\udd0c Plugin Development\n\nCreate custom plugins for specialized course content:\n\n```python\nfrom noteparser.plugins import BasePlugin\n\nclass ChemistryPlugin(BasePlugin):\n name = \"chemistry_processor\"\n version = \"1.0.0\"\n description = \"Enhanced processing for chemistry courses\"\n course_types = ['chemistry', 'organic', 'biochemistry']\n\n def process_content(self, 
content: str, metadata: Dict[str, Any]) -> Dict[str, Any]:\n # Your custom processing logic here\n processed_content = self.enhance_chemical_formulas(content)\n\n return {\n 'content': processed_content,\n 'metadata': {**metadata, 'chemical_formulas_found': count}\n }\n```\n\n## \ud83c\udf0a GitHub Actions Integration\n\nAutomatic processing when you push new documents:\n\n```yaml\n# .github/workflows/parse-notes.yml\nname: Parse and Sync Notes\non:\n push:\n paths: ['input/**', 'raw-notes/**']\n\njobs:\n parse-notes:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v4\n - name: Set up Python\n uses: actions/setup-python@v4\n with:\n python-version: '3.11'\n - name: Install dependencies\n run: pip install noteparser[all]\n - name: Parse documents\n run: noteparser batch input/ --format markdown\n - name: Sync to study-notes\n run: noteparser sync output/*.md --target-repo study-notes\n```\n\n## \ud83d\udda5\ufe0f AI-Enhanced Web Dashboard\n\nAccess the AI-powered web interface at `http://localhost:5000`:\n\n```bash\nnoteparser web\n```\n\n### Core Features:\n- **Browse** all repositories and courses\n- **Search** across all notes with semantic similarity\n- **View** documents with syntax highlighting\n- **Parse** new documents through the web interface\n- **Manage** plugins and configuration\n- **Monitor** sync status and cross-references\n\n### AI Features (`/ai` dashboard):\n- **\ud83e\udd16 AI Document Analysis**: Upload and analyze documents with AI insights\n- **\ud83d\udd0d Knowledge Querying**: Natural language Q&A over your document library\n- **\ud83d\udcca Text Analysis**: Extract keywords, topics, and summaries from content\n- **\ud83d\ude80 Enhanced Search**: Semantic search with relevance ranking and AI answers\n- **\ud83d\udca1 Smart Insights**: Automatic topic detection and content relationships\n- **\ud83d\udcc8 Service Health**: Real-time monitoring of AI service status\n\n### Production Deployment:\n\n```bash\n# Using Docker Compose 
(recommended)\ndocker-compose -f docker-compose.prod.yml up -d\n\n# Using deployment script\n./scripts/deploy.sh production 2.1.0\n\n# Access the application\nopen http://localhost:5000\nopen http://localhost:5000/ai # AI Dashboard\n```\n\n## \ud83d\udcca Use Cases\n\n### \ud83d\udcd6 **Individual Student**\n```bash\n# Daily workflow\nnoteparser parse \"Today's Lecture.pdf\"\nnoteparser sync output/todays-lecture.md --course CS101\n```\n\n### \ud83c\udfeb **Course Organization**\n```bash\n# Semester setup\nnoteparser init\nnoteparser batch course-materials/ --recursive\nnoteparser index --format json > course-index.json\n```\n\n### \ud83d\udc65 **Study Group**\n```bash\n# Collaborative notes\nnoteparser parse shared-notes.docx --format markdown\ngit add . && git commit -m \"Add processed notes\"\ngit push origin main # Triggers auto-sync via GitHub Actions\n```\n\n### \ud83d\udd2c **Research Lab**\n```bash\n# Research paper processing\nnoteparser parse \"Research Paper.pdf\" --format latex\nnoteparser web # Browse and cross-reference with existing notes\n```\n\n## \ud83d\udcda Advanced Features\n\n### \ud83d\udd0d **Smart Content Detection**\n- **Mathematical equations**: Automatic LaTeX formatting preservation\n- **Code blocks**: Language detection and syntax highlighting\n- **Citations**: APA, MLA, IEEE format recognition\n- **Figures and tables**: Structured conversion with captions\n\n### \ud83c\udff7\ufe0f **Metadata Extraction**\n- **Course identification** from file names and paths\n- **Topic extraction** and categorization\n- **Author and date** detection\n- **Academic keywords** and tagging\n\n### \ud83d\udd17 **Cross-References**\n- **Similar content detection** across documents\n- **Prerequisite tracking** between topics\n- **Citation network** visualization\n- **Knowledge graph** construction\n\n## \ud83d\udee0\ufe0f Development\n\n### Setup Development Environment\n\n```bash\ngit clone https://github.com/CollegeNotesOrg/noteparser.git\ncd 
noteparser\npython -m venv venv\nsource venv/bin/activate # On Windows: venv\\Scripts\\activate\n\n# Install with all development dependencies (recommended)\npip install -e .[dev,all]\n\n# Or install dev tools only\npip install -e .[dev]\n\n# Install pre-commit hooks\npre-commit install\n```\n\n### Development Dependencies\n\nThe `[dev]` extra includes comprehensive development tools:\n\n- **Testing**: `pytest`, `pytest-cov`, `pytest-mock`, `pytest-asyncio`, `pytest-xdist`\n- **Code Quality**: `black`, `ruff`, `mypy`, `isort`, `pylint`, `pre-commit`\n- **Documentation**: `sphinx`, `mkdocs-material`, `myst-parser`\n- **Development Tools**: `ipython`, `jupyter`, `notebook`\n- **Profiling**: `memory-profiler`, `line-profiler`\n- **Security**: `bandit`, `safety`\n\n### Run Tests\n\n```bash\npytest tests/ -v --cov=noteparser\n```\n\n### Code Quality\n\n```bash\n# Auto-formatting (required)\nblack src/ tests/\n\n# Linting with auto-fixes\nruff check src/ tests/ --fix\n\n# Type checking\nmypy src/noteparser/ --ignore-missing-imports\n\n# All quality checks at once\nmake lint # Runs black, ruff, and mypy\n```\n\n### CI/CD Information\n\nThe project uses GitHub Actions for continuous integration with the following jobs:\n\n- **Cross-platform testing** (Ubuntu, Windows, macOS) on Python 3.10-3.12\n- **Code quality checks** (black, ruff, mypy)\n- **Security scans** (bandit, safety)\n- **Performance benchmarking** with pytest-benchmark\n- **Docker image testing** and validation\n- **Integration testing** with Redis and PostgreSQL services\n\nAll dependencies are now managed through `pyproject.toml` for better reproducibility and CI reliability.\n\n## \ud83e\udd1d Contributing\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Make your changes\n4. Add tests for new functionality\n5. Run the test suite\n6. Commit your changes (`git commit -m 'Add amazing feature'`)\n7. 
Push to the branch (`git push origin feature/amazing-feature`)\n8. Open a Pull Request\n\n## \ud83d\udcc4 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## \ud83d\udce6 Dependencies\n\nAll dependencies are managed through `pyproject.toml` with the following structure:\n\n### Core Dependencies (included in base installation)\n- **markitdown** - Microsoft's document parsing engine\n- **Flask** - Web framework for dashboard\n- **Click** - Command-line interface\n- **PyYAML** - Configuration management\n- **Pillow** - Image processing\n- **OpenCV** - Advanced image operations\n- **pytesseract** - OCR capabilities\n- **SpeechRecognition** - Audio transcription\n- **moviepy** - Video processing\n- **pandas** - Data processing\n- **requests** - HTTP client\n- **gunicorn** - Production WSGI server\n\n### Optional Dependency Groups\n\n#### `[ai]` - Advanced AI/ML Features\n- **sentence-transformers** - Semantic embeddings\n- **faiss-cpu** - Vector similarity search\n- **langchain** - LLM framework integration\n- **openai** - OpenAI API client\n- **sqlalchemy** - Database ORM\n- **elasticsearch** - Full-text search\n- **prometheus-client** - Metrics collection\n- **pydantic** - Data validation\n\n#### `[dev]` - Development Tools\n- **pytest** ecosystem - Testing framework\n- **black**, **ruff**, **mypy** - Code quality\n- **sphinx**, **mkdocs-material** - Documentation\n- **jupyter**, **ipython** - Interactive development\n- **bandit**, **safety** - Security scanning\n\n#### `[all]` - All Optional Features\nCombines AI and development dependencies for complete functionality.\n\n### Installation Examples\n```bash\n# Extras are quoted so that shells such as zsh do not glob the brackets\npip install noteparser # Core only\npip install \"noteparser[ai]\" # Core + AI features\npip install \"noteparser[dev]\" # Core + dev tools\npip install \"noteparser[all]\" # Everything\n```\n\n## \ud83d\ude4f Acknowledgments\n\n- **Microsoft MarkItDown** - The core parsing engine that powers format conversion\n- **Academic 
Community** - For inspiration and requirements gathering\n- **Open Source Libraries** - All the amazing Python packages that make this possible\n\n## \ud83d\udcde Support\n\n- **Documentation**: [https://collegenotesorg.github.io/noteparser/](https://collegenotesorg.github.io/noteparser/)\n- **Issues**: [GitHub Issues](https://github.com/CollegeNotesOrg/noteparser/issues)\n- **Discussions**: [GitHub Discussions](https://github.com/CollegeNotesOrg/noteparser/discussions)\n\n---\n\n**Made with \u2764\ufe0f for students, by a student**\n\n*Transform your study materials into a searchable, interconnected knowledge base*\n\n---\n\n**Author**: Suryansh Sijwali\n**GitHub**: [@SuryanshSS1011](https://github.com/SuryanshSS1011)\n**Organization**: [CollegeNotesOrg](https://github.com/CollegeNotesOrg)\n",
"bugtrack_url": null,
"license": null,
"summary": "A comprehensive document parser with AI-powered intelligence for converting and analyzing academic materials",
"version": "2.1.2",
"project_urls": {
"Homepage": "https://github.com/CollegeNotesOrg/noteparser",
"Issues": "https://github.com/CollegeNotesOrg/noteparser/issues"
},
"split_keywords": [
"markdown",
" latex",
" document",
" parser",
" converter",
" pdf",
" docx",
" pptx",
" ai",
" semantic-search",
" knowledge-graph",
" rag"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "826b1cccb8cf42542ec638da086d65f478a14492c49154d690f992370d527611",
"md5": "2cc93082285048b808b350a1dd1c013c",
"sha256": "1ca4bec0aac8af904c38baa0471196e1e6bfc5e4b88a3281e7d30b33ecf29aa3"
},
"downloads": -1,
"filename": "noteparser-2.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2cc93082285048b808b350a1dd1c013c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 102005,
"upload_time": "2025-08-22T22:35:06",
"upload_time_iso_8601": "2025-08-22T22:35:06.692048Z",
"url": "https://files.pythonhosted.org/packages/82/6b/1cccb8cf42542ec638da086d65f478a14492c49154d690f992370d527611/noteparser-2.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "932e5e21d73d925b33e124c1edf8285a1fd0c6e83a1ca999e2b1461623375f26",
"md5": "144516e79e14cf5396b5c666fd7f76ba",
"sha256": "97cdf3b8b3922ce91dc08aacec0bb2e040120e0d5c1742e44541f663cbe1209b"
},
"downloads": -1,
"filename": "noteparser-2.1.2.tar.gz",
"has_sig": false,
"md5_digest": "144516e79e14cf5396b5c666fd7f76ba",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 140849,
"upload_time": "2025-08-22T22:35:08",
"upload_time_iso_8601": "2025-08-22T22:35:08.540006Z",
"url": "https://files.pythonhosted.org/packages/93/2e/5e21d73d925b33e124c1edf8285a1fd0c6e83a1ca999e2b1461623375f26/noteparser-2.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-22 22:35:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "CollegeNotesOrg",
"github_project": "noteparser",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "click",
"specs": [
[
">=",
"8.0.0"
]
]
},
{
"name": "pyyaml",
"specs": [
[
">=",
"6.0"
]
]
},
{
"name": "rich",
"specs": [
[
">=",
"13.0"
]
]
},
{
"name": "markitdown",
"specs": [
[
">=",
"0.1.0"
]
]
},
{
"name": "flask",
"specs": [
[
">=",
"2.3.0"
]
]
},
{
"name": "SpeechRecognition",
"specs": [
[
">=",
"3.10.0"
]
]
},
{
"name": "moviepy",
"specs": [
[
">=",
"1.0.3"
]
]
},
{
"name": "pydub",
"specs": [
[
">=",
"0.25.1"
]
]
},
{
"name": "pytesseract",
"specs": [
[
">=",
"0.3.10"
]
]
},
{
"name": "pillow",
"specs": [
[
">=",
"10.0.0"
]
]
},
{
"name": "opencv-python",
"specs": [
[
">=",
"4.8.0"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"1.24.0"
]
]
},
{
"name": "python-docx",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "python-pptx",
"specs": [
[
">=",
"0.6.0"
]
]
},
{
"name": "openpyxl",
"specs": [
[
">=",
"3.1.0"
]
]
},
{
"name": "pypdf",
"specs": [
[
">=",
"3.17.0"
]
]
},
{
"name": "pdfplumber",
"specs": [
[
">=",
"0.10.0"
]
]
},
{
"name": "pytest",
"specs": [
[
">=",
"7.0.0"
]
]
},
{
"name": "pytest-cov",
"specs": [
[
">=",
"4.0.0"
]
]
},
{
"name": "black",
"specs": [
[
">=",
"23.0.0"
]
]
},
{
"name": "ruff",
"specs": [
[
">=",
"0.1.0"
]
]
},
{
"name": "mypy",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "pre-commit",
"specs": [
[
">=",
"3.0.0"
]
]
},
{
"name": "beautifulsoup4",
"specs": [
[
">=",
"4.12.0"
]
]
},
{
"name": "lxml",
"specs": [
[
">=",
"4.9.0"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "matplotlib",
"specs": [
[
">=",
"3.7.0"
]
]
},
{
"name": "requests",
"specs": [
[
">=",
"2.31.0"
]
]
},
{
"name": "gunicorn",
"specs": [
[
">=",
"21.2.0"
]
]
},
{
"name": "python-dotenv",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "aiohttp",
"specs": [
[
">=",
"3.9.0"
]
]
},
{
"name": "asyncio",
"specs": [
[
">=",
"3.4.3"
]
]
},
{
"name": "sentence-transformers",
"specs": [
[
">=",
"2.2.0"
]
]
},
{
"name": "faiss-cpu",
"specs": [
[
">=",
"1.7.4"
]
]
},
{
"name": "tiktoken",
"specs": [
[
">=",
"0.5.0"
]
]
},
{
"name": "langchain",
"specs": [
[
">=",
"0.1.0"
]
]
},
{
"name": "openai",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "sqlalchemy",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "alembic",
"specs": [
[
">=",
"1.12.0"
]
]
},
{
"name": "redis",
"specs": [
[
">=",
"5.0.0"
]
]
},
{
"name": "psycopg2-binary",
"specs": [
[
">=",
"2.9.0"
]
]
},
{
"name": "elasticsearch",
"specs": [
[
">=",
"8.11.0"
]
]
},
{
"name": "prometheus-client",
"specs": [
[
">=",
"0.19.0"
]
]
},
{
"name": "opentelemetry-api",
"specs": [
[
">=",
"1.20.0"
]
]
},
{
"name": "opentelemetry-sdk",
"specs": [
[
">=",
"1.20.0"
]
]
},
{
"name": "opentelemetry-instrumentation-flask",
"specs": [
[
">=",
"0.42b0"
]
]
},
{
"name": "structlog",
"specs": [
[
">=",
"23.2.0"
]
]
},
{
"name": "pydantic",
"specs": [
[
">=",
"2.5.0"
]
]
},
{
"name": "marshmallow",
"specs": [
[
">=",
"3.20.0"
]
]
},
{
"name": "fastapi",
"specs": [
[
">=",
"0.104.0"
]
]
}
],
"lcname": "noteparser"
}