parselite

Name: parselite
Version: 0.3.7
Summary: A powerful web content fetcher and processor
Home page: https://github.com/yourusername/pyopengenai
Author: Kammari Santhosh
Requires Python: >=3.7
Uploaded: 2024-12-20 06:53:45
# FastParser 🚀

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

A high-performance, asynchronous content parser that supports both HTML and PDF extraction with special handling for arXiv papers.

## ✨ Features

- 🚄 Asynchronous content fetching
- 📄 PDF extraction support
- 🌐 HTML parsing
- 📚 Special handling for arXiv URLs
- 📦 Batch processing capability
- 🔄 Progress tracking with tqdm

## 🛠️ Installation

```bash
pip install parselite

# Dependencies (not declared by the package; install them separately)
pip install aiohttp PyPDF2 tqdm
```

Note: the project is published on PyPI as `parselite`, even though this README and the import examples use the `fastparser` name.

## 🚀 Quick Start

```python
from fastparser import parse

# Single URL parsing
text = parse("https://example.com")

# Batch processing
urls = [
    "https://example.com",
    "https://arxiv.org/abs/2301.01234",
    "https://example.com/document.pdf"
]
texts = parse(urls)
```

## 📖 Detailed Usage

### Basic Parser Configuration

```python
from fastparser import FastParser

# Initialize with PDF extraction (default: True)
parser = FastParser(extract_pdf=True)

# Single URL
content = parser.fetch("https://example.com")

# Multiple URLs
contents = parser.fetch_batch([
    "https://example.com",
    "https://arxiv.org/abs/2301.01234"
])
```

### Working with arXiv Papers

The parser automatically handles different arXiv URL formats:

```python
parser = FastParser()

# These will be automatically converted to appropriate formats
urls = [
    "https://arxiv.org/abs/2301.01234",  # Will fetch PDF if extract_pdf=True
    "http://arxiv.org/html/2301.01234",  # Will fetch HTML or PDF based on settings
]
contents = parser.fetch_batch(urls)
```
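
The exact conversion rules aren't documented; below is a minimal sketch of what such URL normalization could look like, assuming `_arxiv_url_fix` rewrites `abs`/`html` links to the matching endpoint. It is illustrative, not the library's actual implementation:

```python
# Hypothetical sketch of arXiv URL normalization; the real
# _arxiv_url_fix may differ, and the rewrite rules here are assumptions.
def arxiv_url_fix(url: str, extract_pdf: bool = True) -> str:
    """Rewrite arXiv abs/html URLs to a pdf or html endpoint."""
    if "arxiv.org" not in url:
        return url
    # Pull out the paper identifier, e.g. "2301.01234".
    paper_id = url.rstrip("/").split("/")[-1].replace(".pdf", "")
    if extract_pdf:
        return f"https://arxiv.org/pdf/{paper_id}.pdf"
    return f"https://arxiv.org/html/{paper_id}"
```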

### PDF-Only Processing

```python
parser = FastParser(extract_pdf=True)

pdf_urls = [
    "https://example.com/document.pdf",
    "https://arxiv.org/pdf/2301.01234.pdf"
]
pdf_contents = parser.fetch_batch(pdf_urls)
```
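
PDF text extraction with PyPDF2 (a listed dependency) typically follows the pattern below. This sketch shows the general approach, not the parser's actual internals:

```python
import io

from PyPDF2 import PdfReader

def extract_pdf_text(pdf_bytes: bytes) -> str:
    """Extract text from an in-memory PDF, page by page."""
    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() can return None for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```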

## 🔧 API Reference

### FastParser Class

```python
class FastParser:
    def __init__(self, extract_pdf: bool = True): ...
    def fetch(self, url: str) -> str: ...
    def fetch_batch(self, urls: list) -> list: ...
    def __call__(self, urls: str | list) -> str | list: ...
```

### Main Functions

- `parse(urls: str | list) -> str | list`: Convenience function for quick parsing (call forms shown below)
- `_async_html_parser(urls: list)`: Internal async processing method
- `_fetch_pdf_content(pdf_urls: list)`: Internal PDF processing method
- `_arxiv_url_fix(url: str)`: Internal arXiv URL formatting method
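
Based on the `__call__` signature above, a parser instance can also be invoked directly, with the return type mirroring the input type:

```python
from fastparser import FastParser

parser = FastParser()

# A single URL returns one string; a list returns a list of strings.
one = parser("https://example.com")
many = parser(["https://example.com", "https://arxiv.org/abs/2301.01234"])
```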

## ⚡ Performance

The parser uses asynchronous operations for optimal performance; the sketch after this list illustrates the pattern:

- Concurrent URL fetching
- Batch processing capabilities
- Progress tracking with tqdm
- Memory-efficient PDF processing
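
A minimal sketch of this concurrent-fetch pattern with aiohttp and tqdm, assuming the internal `_async_html_parser` works along these lines (the helper below is illustrative, not the actual implementation):

```python
import asyncio
from typing import List

import aiohttp
from tqdm.asyncio import tqdm

async def fetch_all(urls: List[str]) -> List[str]:
    """Fetch every URL concurrently, with a tqdm progress bar."""
    async with aiohttp.ClientSession() as session:

        async def fetch(url: str) -> str:
            try:
                async with session.get(url) as resp:
                    resp.raise_for_status()
                    return await resp.text()
            except aiohttp.ClientError:
                return ""  # mirror the parser: failed fetches yield ""

        return await tqdm.gather(*(fetch(u) for u in urls))

texts = asyncio.run(fetch_all(["https://example.com"]))
```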

## 🔍 Example: Advanced Usage

```python
from fastparser import FastParser

def process_large_dataset():
    parser = FastParser(extract_pdf=True)

    # Process URLs in manageable batches rather than all at once.
    all_urls = ["url1", "url2", ..., "url1000"]
    batch_size = 50

    results = []
    for i in range(0, len(all_urls), batch_size):
        batch = all_urls[i:i + batch_size]
        # fetch_batch is called synchronously, as in the examples above;
        # the asynchronous fetching happens inside the parser.
        batch_results = parser.fetch_batch(batch)
        results.extend(batch_results)

    return results

results = process_large_dataset()
```

## ⚠️ Error Handling

The parser includes robust error handling (see the example after this list):

- Failed URL fetches return empty strings
- PDF processing errors are caught gracefully
- HTTP status checks
- Invalid URL format handling
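
Since failed fetches come back as empty strings rather than raised exceptions, callers can separate successes from failures after a batch run:

```python
from fastparser import FastParser

parser = FastParser()
urls = ["https://example.com", "https://arxiv.org/abs/2301.01234"]
results = parser.fetch_batch(urls)

# Pair each URL with its result; an empty string marks a failed fetch.
succeeded = {url: text for url, text in zip(urls, results) if text}
failed = [url for url, text in zip(urls, results) if not text]
print(f"{len(succeeded)} succeeded, {len(failed)} failed")
```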

## 🤝 Contributing

Contributions are welcome! Here's how you can help:

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## 📝 Dependencies

- `aiohttp`: Async HTTP client/server framework
- `PyPDF2`: PDF processing library
- `tqdm`: Progress bar library
- Custom `FastHTMLParserV3` module

## 📋 TODO

- [ ] Add support for more document types
- [ ] Implement caching mechanism
- [ ] Add timeout configurations
- [ ] Improve error reporting
- [ ] Add proxy support

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

Made with ❤️ by [Your Name]

            
