doc-master


- Name: doc-master
- Version: 0.0.2
- Home page: https://github.com/The-Swarm-Corporation/doc-master
- Summary: Paper - Pytorch
- Upload time: 2024-11-07 20:52:35
- Author: Kye Gomez
- Requires Python: >=3.10, <4.0
- License: MIT
- Keywords: document, reader, file, content, extraction, pdf, docx, excel

[![Multi-Modality](agorabanner.png)](https://discord.com/servers/agora-999382051935506503)

# Doc Master 📚

[![Join our Discord](https://img.shields.io/badge/Discord-Join%20our%20server-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/agora-999382051935506503) [![Subscribe on YouTube](https://img.shields.io/badge/YouTube-Subscribe-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@kyegomez3242) [![Connect on LinkedIn](https://img.shields.io/badge/LinkedIn-Connect-blue?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/kye-g-38759a207/) [![Follow on X.com](https://img.shields.io/badge/X.com-Follow-1DA1F2?style=for-the-badge&logo=x&logoColor=white)](https://x.com/kyegomezb)


[![PyPI version](https://badge.fury.io/py/doc-master.svg)](https://badge.fury.io/py/doc-master)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![Discord](https://img.shields.io/discord/999382051935506503?color=7289da&label=Discord&logo=discord&logoColor=white)](https://discord.gg/agora-999382051935506503)

Doc Master is a lightweight Python library for automated file reading and content extraction. It reads a range of file formats into string representations, which makes it a good fit for data processing, content analysis, and document management systems.

## 🚀 Features

- **Universal File Reading**: Seamlessly handle multiple file formats including:
  - PDF documents
  - Microsoft Word documents (.docx)
  - Excel spreadsheets
  - Text files
  - XML documents
  - Images (with base64 encoding)
  - Binary files

- **Smart Format Detection**: Detects the file type automatically and applies the matching reader
- **Flexible Output**: Returns results as either a string or a dictionary
- **Batch Processing**: Reads every document in a folder in a single call
- **Encoding Detection**: Detects the character encoding of text files
- **Enterprise-Ready**: Built with stability and performance in mind
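
To make the format handling listed above more concrete, here is a minimal standard-library sketch of how extension-based dispatch, encoding fallback, and base64 encoding of binary files could fit together. It is not Doc Master's implementation: the names `TEXT_SUFFIXES`, `read_text_with_fallback`, and `read_as_string` are hypothetical, and the real library may inspect file contents rather than extensions.

```python
import base64
from pathlib import Path

# Hypothetical names for illustration only; none of these come from doc-master.
TEXT_SUFFIXES = {".txt", ".md", ".csv", ".xml"}


def read_text_with_fallback(path: Path, encodings=("utf-8", "latin-1")) -> str:
    """Decode a file with the first encoding in `encodings` that succeeds."""
    data = path.read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached when a custom `encodings` tuple rejects the data.
    return data.decode("utf-8", errors="replace")


def read_as_string(path: Path) -> str:
    """Return text files as decoded text and everything else as base64."""
    if path.suffix.lower() in TEXT_SUFFIXES:
        return read_text_with_fallback(path)
    # Images and other binary files become an ASCII-safe base64 string.
    return base64.b64encode(path.read_bytes()).decode("ascii")
```

Calling `read_as_string(Path("photo.png"))` would return a base64 string that `base64.b64decode` turns back into the original bytes, which is one way to keep binary content inside a plain string.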

## 📦 Installation

```bash
pip install -U doc-master
```

## 🔧 Quick Start

```python
from doc_master import doc_master

# Read all files in a folder
results = doc_master(folder_path="path/to/folder", output_type="dict")

# Or read a single file
content = doc_master(file_path="path/to/file.docx")
```

## 📋 Requirements

- Python 3.10 or newer (below 4.0)
- pandas
- pypdf
- python-docx
- Pillow

## 🤝 Contributing

We love your input! We want to make contributing to Doc Master as easy and transparent as possible. Here's how you can help:

1. Fork the repo
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

Check out our [Contributing Guidelines](CONTRIBUTING.md) for more details.

## 🌟 Support the Project

If you find Doc Master useful, please consider:
- Starring the repository ⭐
- Following us on GitHub
- Joining our [Discord community](https://discord.gg/agora-999382051935506503)
- Sharing the project with others

## 📖 Documentation

For detailed documentation, visit our [Wiki](https://github.com/The-Swarm-Corporation/doc-master/wiki).

### Basic Usage Examples

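```python
# Import path is an assumption: this README only confirms `doc_master`
# as a top-level export; adjust if the other names live in a submodule.
from doc_master import AutoFileReader, doc_master, read_single_file

# Read a PDF file
content = read_single_file("document.pdf")

# Read an Excel file, selecting a specific sheet
reader = AutoFileReader()
content = reader.read_file("spreadsheet.xlsx", sheet_name="Data")

# Process a folder of documents
results = doc_master(
    folder_path="documents/",
    output_type="dict"
)
```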
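
The folder call above can be pictured as walking a directory and applying a reader to each file. The sketch below shows that idea with `pathlib` only. It is an illustration rather than the library's code: `read_folder` is a hypothetical helper, and keying the dictionary by relative path is an assumption, since this README does not specify what the `dict` output contains.

```python
from pathlib import Path
from typing import Callable, Dict


def read_folder(folder: str, reader: Callable[[Path], str]) -> Dict[str, str]:
    """Apply `reader` to every regular file under `folder`.

    Keys are POSIX-style paths relative to `folder` (an assumed layout).
    """
    root = Path(folder)
    results: Dict[str, str] = {}
    # sorted() keeps the output order stable across platforms.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            results[path.relative_to(root).as_posix()] = reader(path)
    return results
```

Passing the reader as a parameter keeps the directory walk separate from format handling, so the same loop works with any single-file reader.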

## 🔍 Error Handling

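Wrap reads in a `try`/`except` block so that a file that cannot be processed does not stop your program:

```python
# Import path is an assumption; see the note in the usage examples.
from doc_master import read_single_file

try:
    content = read_single_file("file.pdf")
except Exception as e:
    print(f"Error processing file: {e}")
```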
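
When reading many files, it is often more useful to record failures per file than to stop at the first one. The following sketch shows one way to do that around any single-file reader. `read_many` is a hypothetical helper that is not part of Doc Master, and the exception types a real reader raises are not documented in this README.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def read_many(
    paths: Iterable[str], reader: Callable[[Path], str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Read each path and return (contents, errors), both keyed by path."""
    contents: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for raw_path in paths:
        try:
            contents[str(raw_path)] = reader(Path(raw_path))
        except Exception as exc:  # broad on purpose: reader errors are unknown
            errors[str(raw_path)] = f"{type(exc).__name__}: {exc}"
    return contents, errors
```

The caller can then process `contents` as usual and log or retry whatever ended up in `errors`.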

## 🛣️ Roadmap

- [ ] Add OCR capabilities for image processing
- [ ] Support for additional file formats
- [ ] Performance optimizations for large files
- [ ] Async file processing
- [ ] Command-line interface (CLI)

## 💬 Community and Support

- Join our [Discord server](https://discord.gg/agora-999382051935506503) for discussions and support
- Check out our [GitHub Issues](https://github.com/The-Swarm-Corporation/doc-master/issues) for bug reports and feature requests
- Follow our [GitHub Discussions](https://github.com/The-Swarm-Corporation/doc-master/discussions) for general questions

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- All our amazing contributors
- The open-source community
- The Swarm Corporation team

---

<p align="center">
  Made with ❤️ by The Swarm Corporation
</p>

<p align="center">
  <a href="https://github.com/The-Swarm-Corporation/doc-master/stargazers">⭐ Star us on GitHub!</a>
</p>
            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/The-Swarm-Corporation/doc-master",
    "name": "doc-master",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": null,
    "keywords": "document, reader, file, content, extraction, pdf, docx, excel",
    "author": "Kye Gomez",
    "author_email": "kye@apac.ai",
    "download_url": "https://files.pythonhosted.org/packages/00/77/42cbb470dad68887ba58ef23c182f5af145f8f65de010df8eb3002751f41/doc_master-0.0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Paper - Pytorch",
    "version": "0.0.2",
    "project_urls": {
        "Documentation": "https://github.com/The-Swarm-Corporation/doc-master",
        "Homepage": "https://github.com/The-Swarm-Corporation/doc-master",
        "Repository": "https://github.com/The-Swarm-Corporation/doc-master"
    },
    "split_keywords": [
        "document",
        " reader",
        " file",
        " content",
        " extraction",
        " pdf",
        " docx",
        " excel"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4fa284e87aa54f52aaf749da5aa8a99f1d1ec626c19af60d737e757fd98c3920",
                "md5": "9412114ab30c4915d323076bec35475b",
                "sha256": "4ab872c5eda063c306a55242ab2e18cb5dc34ec47a23a050d0a6d54146a8eb53"
            },
            "downloads": -1,
            "filename": "doc_master-0.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "9412114ab30c4915d323076bec35475b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.10",
            "size": 6861,
            "upload_time": "2024-11-07T20:52:34",
            "upload_time_iso_8601": "2024-11-07T20:52:34.810187Z",
            "url": "https://files.pythonhosted.org/packages/4f/a2/84e87aa54f52aaf749da5aa8a99f1d1ec626c19af60d737e757fd98c3920/doc_master-0.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "007742cbb470dad68887ba58ef23c182f5af145f8f65de010df8eb3002751f41",
                "md5": "e0cc89bd789a97438ccf601793cf0912",
                "sha256": "422757a56ef07f03a58088378c5ccdb3ce7931d45c12c4bbe4f92b51f5d33e46"
            },
            "downloads": -1,
            "filename": "doc_master-0.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "e0cc89bd789a97438ccf601793cf0912",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 6548,
            "upload_time": "2024-11-07T20:52:35",
            "upload_time_iso_8601": "2024-11-07T20:52:35.752132Z",
            "url": "https://files.pythonhosted.org/packages/00/77/42cbb470dad68887ba58ef23c182f5af145f8f65de010df8eb3002751f41/doc_master-0.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-07 20:52:35",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "The-Swarm-Corporation",
    "github_project": "doc-master",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "doc-master"
}
        