filestruct

Name	filestruct
Version	0.2
home_page	https://github.com/ixalodecte/filestruct
Summary	A python package to structure files using visual and style informations
upload_time	2024-02-06 16:37:52
maintainer
docs_url	None
author	léo DECHAUMET
requires_python
license	GPLv3+
keywords	pdf parser layout-analysis
VCS
bugtrack_url
requirements	No requirements were recorded.

            
# FileStruct

**FileStruct** is a high-level Python library that aims to extract the overall structure of documents, particularly PDFs, based on visual information such as size, color and font.

## How does it work?

As human readers, we detect titles, subtitles, and paragraphs from the visual appearance of a document. Large red text almost certainly represents a title (or subtitle). Using such heuristics, we can structure a document: _this paragraph belongs to this section_. This package applies the same method to structure a document automatically yet realistically. The method is described below:

1.  **Text and style extraction:** We rely on lower-level libraries (like PyMuPDF) to extract the text and style information and to order the blocks of text.
2.  **Tree creation:** A tree is created in which each block of text is a node. A child of a node in the tree is a subsection of the corresponding section in the document.
3.  **Data export:** The data can be exported in JSON format.
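The tree-creation and export steps can be illustrated with a small sketch. This is not filestruct's actual implementation: the `Node` class, the `build_tree` function, and the rule "smaller font size nests under larger font size" are assumptions made for illustration, and font size is the only style cue used here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    """One block of text plus the blocks nested under it (hypothetical class)."""
    text: str
    size: float  # font size, the only style cue used in this sketch
    children: list["Node"] = field(default_factory=list)

    def to_dict(self) -> dict:
        return {"text": self.text, "children": [c.to_dict() for c in self.children]}


def build_tree(blocks: list[tuple[str, float]]) -> Node:
    """Nest blocks (given in reading order) so smaller text sits under larger text."""
    root = Node("ROOT", float("inf"))
    stack = [root]  # path from the root to the most recently added block
    for text, size in blocks:
        # Climb until the block on top of the stack is strictly larger.
        while stack[-1].size <= size:
            stack.pop()
        node = Node(text, size)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Hand-written (text, font size) pairs standing in for extracted blocks.
blocks = [("Title", 24), ("Intro", 11), ("Section", 18), ("Body", 11)]
print(json.dumps(build_tree(blocks).to_dict(), indent=2))
```

With these sample blocks, "Intro" and "Section" become children of "Title", and "Body" becomes a child of "Section". A real heuristic would likely also weigh color and font, as described above.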

For now, filestruct can only read the formats supported by PyMuPDF: PDF, EPUB, XPS, MOBI, FB2, CBZ and SVG. I plan to add more file formats in the future.
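To give an idea of the style information PyMuPDF exposes, here is a minimal sketch that walks the dictionary returned by `page.get_text("dict")` and yields the text, font, size and color of each span. The helper names `iter_spans` and `spans_from_file` are hypothetical, and this is not necessarily how filestruct itself reads the data.

```python
def iter_spans(page_dict: dict):
    """Yield (text, font, size, color) for every non-empty text span on a page."""
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span["text"].strip()
                if text:
                    yield text, span["font"], span["size"], span["color"]


def spans_from_file(path: str):
    """Yield styled spans for every page of a file PyMuPDF can open."""
    import fitz  # PyMuPDF; imported lazily so iter_spans has no dependency

    with fitz.open(path) as doc:
        for page in doc:
            yield from iter_spans(page.get_text("dict"))
```

Calling `list(spans_from_file("PATH_TO_YOUR_FILE.pdf"))` requires PyMuPDF to be installed; `iter_spans` alone works on any dictionary with the same shape.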

## Installation

Install **FileStruct** using **pip**:

```sh
pip install filestruct
```


## Getting Started

Below is a basic usage example for a PDF document:

```python
from filestruct.document import PDFDocument

# Build the document tree from the file's text and style information
doc = PDFDocument("PATH_TO_YOUR_FILE.pdf")
data = doc.to_json()   # Export the tree to JSON format
print(data)
print(doc)
```

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ixalodecte/filestruct",
    "name": "filestruct",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "pdf,parser,layout-analysis",
    "author": "l\u00e9o DECHAUMET",
    "author_email": "leo_dechaumet_research@pm.me",
    "download_url": "https://files.pythonhosted.org/packages/94/07/63392fa76d330921f591fd135183e354b75528fc5aab1e0cd0a7dea1018a/filestruct-0.2.tar.gz",
    "platform": null,
    "description": "\n# FileStruct\n\n**FileStruct** is a high-level Python library that aims to extract the overall structure of documents, particularly PDFs, based on visual information such as size, color and font.\n\n## How does it work ?\n\nAs clever human beings, we are able to detect titles, subtitles, and paragraphs using the visual appearence of the document. A big text in red most certainly represent a title (or subtitle). Using these heuristics, we are able to structure a document : _This paragraph belongs to this section_. The same method is used by this package to provide an automated, while realistic way to structure a document. The method is described bellow :\n\n1.  **Text and style extraction :** We rely on lower level librairies (like PyMuPDF) for the extraction of the text and style information, and the ordering of each block of text.\n2.  **Tree creation :** A tree is created, in which each block of text is a node of the tree. A child of a node in the tree is a subsection of a section in the document.\n3.  **Data exportation :** The data can be exported in JSON format.\n\nFor now, filestruct can only read formats that are supported by PyMuPDF. This includes pdf, epub, xps, mobi, fb2, cbz and svg. I plan to add more file formats in the future.\n\n## Installation\n\nInstall **FileStruct** using **pip** :\n\n```sh\npip install filestruct\n```\n\n\n## Getting Started\n\nBellow, a basic usage for a PDF document :\n\n```python\nfrom filestruct.document import PDFDocument\n\ndoc = Document(\"PATH_TO_YOUR_FILE.pdf\")\ndata = doc.to_json()   # Export the tree into json format\nprint(data)\nprint(doc)\n```\n\n",
    "bugtrack_url": null,
    "license": "GPLv3+",
    "summary": "A python package to structure files using visual and style informations",
    "version": "0.2",
    "project_urls": {
        "Download": "https://github.com/ixalodecte/filestruct/archive/refs/tags/v0.2-alpha.tar.gz",
        "Homepage": "https://github.com/ixalodecte/filestruct"
    },
    "split_keywords": [
        "pdf",
        "parser",
        "layout-analysis"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "940763392fa76d330921f591fd135183e354b75528fc5aab1e0cd0a7dea1018a",
                "md5": "1f9e1b1d4b37c1eef349641023d97878",
                "sha256": "bded6c726d9950261020c3d3b8c7cad65bea808101d304f5bd607eeca9481699"
            },
            "downloads": -1,
            "filename": "filestruct-0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "1f9e1b1d4b37c1eef349641023d97878",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 17509,
            "upload_time": "2024-02-06T16:37:52",
            "upload_time_iso_8601": "2024-02-06T16:37:52.482242Z",
            "url": "https://files.pythonhosted.org/packages/94/07/63392fa76d330921f591fd135183e354b75528fc5aab1e0cd0a7dea1018a/filestruct-0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-06 16:37:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ixalodecte",
    "github_project": "filestruct",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "filestruct"
}
