# FastParser 🚀
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
A high-performance, asynchronous content parser that supports both HTML and PDF extraction with special handling for arXiv papers.
## ✨ Features
- 🚄 Asynchronous content fetching
- 📄 PDF extraction support
- 🌐 HTML parsing
- 📚 Special handling for arXiv URLs
- 📦 Batch processing capability
- 🔄 Progress tracking with tqdm
## 🛠️ Installation
```bash
pip install fastparser
# Dependencies (install manually if pip does not pull them in)
pip install aiohttp PyPDF2 tqdm
```
## 🚀 Quick Start
```python
from fastparser import parse
# Single URL parsing
text = parse("https://example.com")
# Batch processing
urls = [
    "https://example.com",
    "https://arxiv.org/abs/2301.01234",
    "https://example.com/document.pdf"
]
texts = parse(urls)
```
## 📖 Detailed Usage
### Basic Parser Configuration
```python
from fastparser import FastParser
# Initialize with PDF extraction (default: True)
parser = FastParser(extract_pdf=True)
# Single URL
content = parser.fetch("https://example.com")
# Multiple URLs
contents = parser.fetch_batch([
    "https://example.com",
    "https://arxiv.org/abs/2301.01234"
])
```
### Working with arXiv Papers
The parser automatically handles different arXiv URL formats:
```python
parser = FastParser()
# These will be automatically converted to appropriate formats
urls = [
    "https://arxiv.org/abs/2301.01234",   # Will fetch PDF if extract_pdf=True
    "http://arxiv.org/html/2301.01234",   # Will fetch HTML or PDF based on settings
]
contents = parser.fetch_batch(urls)
```
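For illustration, the URL normalization involved might look like the sketch below. `_arxiv_url_fix` is internal, so its exact behavior may differ, and the `arxiv_url_to_pdf` helper name here is hypothetical:
```python
# Hypothetical sketch of arXiv URL normalization (not the actual
# _arxiv_url_fix implementation): map an /abs/ or /html/ page to its
# PDF counterpart for use when extract_pdf=True.
def arxiv_url_to_pdf(url: str) -> str:
    for prefix in ("/abs/", "/html/"):
        if prefix in url:
            paper_id = url.split(prefix, 1)[1]
            return f"https://arxiv.org/pdf/{paper_id}"
    return url

print(arxiv_url_to_pdf("https://arxiv.org/abs/2301.01234"))
# -> https://arxiv.org/pdf/2301.01234
```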
### PDF-Only Processing
```python
parser = FastParser(extract_pdf=True)
pdf_urls = [
    "https://example.com/document.pdf",
    "https://arxiv.org/pdf/2301.01234.pdf"
]
pdf_contents = parser.fetch_batch(pdf_urls)
```
## 🔧 API Reference
### FastParser Class
```python
class FastParser:
    def __init__(self, extract_pdf: bool = True)
    def fetch(self, url: str) -> str
    def fetch_batch(self, urls: list) -> list
    def __call__(self, urls: str | list) -> str | list
```
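Because `FastParser` defines `__call__`, an instance can also be invoked directly; per the signature above, a single URL returns one string and a list of URLs returns a list:
```python
parser = FastParser()
# A single URL yields one string; a list yields a list of strings.
text = parser("https://example.com")
texts = parser(["https://example.com", "https://example.org"])
```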
### Main Functions
- `parse(urls: str | list) -> str | list`: Convenience function for quick parsing
- `_async_html_parser(urls: list)`: Internal async processing method
- `_fetch_pdf_content(pdf_urls: list)`: Internal PDF processing method
- `_arxiv_url_fix(url: str)`: Internal arXiv URL formatting method
## ⚡ Performance
The parser uses asynchronous operations for optimal performance:
- Concurrent URL fetching (see the sketch after this list)
- Batch processing capabilities
- Progress tracking with tqdm
- Memory-efficient PDF processing
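The concurrency pattern behind this looks roughly like the sketch below, written against aiohttp directly. It illustrates the approach, not FastParser's internal code, and the `fetch_all` name is hypothetical:
```python
import asyncio
import aiohttp

# Sketch of the concurrent-fetch pattern, not FastParser internals.
async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        async def fetch_one(url):
            try:
                async with session.get(url) as resp:
                    if resp.status == 200:
                        return await resp.text()
            except aiohttp.ClientError:
                pass
            return ""  # failed fetches yield empty strings
        # Launch every request concurrently instead of one at a time.
        return await asyncio.gather(*(fetch_one(u) for u in urls))

texts = asyncio.run(fetch_all(["https://example.com", "https://example.org"]))
```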
## 🔍 Example: Advanced Usage
```python
from fastparser import FastParser

def process_large_dataset(all_urls):
    parser = FastParser(extract_pdf=True)
    batch_size = 50
    results = []
    # Process URLs in batches; fetch_batch drives the async fetching
    # internally, so no event-loop setup is needed here.
    for i in range(0, len(all_urls), batch_size):
        batch = all_urls[i:i + batch_size]
        results.extend(parser.fetch_batch(batch))
    return results

all_urls = ["url1", "url2", ..., "url1000"]  # placeholder URLs
results = process_large_dataset(all_urls)
```
## ⚠️ Error Handling
The parser includes robust error handling:
- Failed URL fetches return empty strings rather than raising (see the example after this list)
- PDF processing errors are caught gracefully
- HTTP status codes are checked before content is parsed
- Invalid URL formats are handled
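Since failures come back as empty strings, callers can pair results with their URLs and keep only the successes (assuming, as the batch examples above suggest, that `fetch_batch` preserves input order):
```python
parser = FastParser()
urls = ["https://example.com", "https://bad.invalid/missing"]
contents = parser.fetch_batch(urls)
# Results line up with the input URLs; empty strings mark failures.
successes = {url: text for url, text in zip(urls, contents) if text}
```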
## 🤝 Contributing
Contributions are welcome! Here's how you can help:
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## 📝 Dependencies
- `aiohttp`: Async HTTP client/server framework
- `PyPDF2`: PDF processing library
- `tqdm`: Progress bar library
- Custom `FastHTMLParserV3` module
## 📋 TODO
- [ ] Add support for more document types
- [ ] Implement caching mechanism
- [ ] Add timeout configurations
- [ ] Improve error reporting
- [ ] Add proxy support
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
Made with ❤️ by [Your Name]