<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# spaCy Layout: Process PDFs, Word documents and more with spaCy
This plugin integrates with [Docling](https://ds4sd.github.io/docling/) to bring structured processing of **PDFs**, **Word documents** and other input formats to your [spaCy](https://spacy.io) pipeline. It outputs clean, **structured data** in a text-based format and creates spaCy's familiar [`Doc`](https://spacy.io/api/doc) objects that let you access labelled text spans like sections, headings, or footnotes.
This workflow makes it easy to apply powerful **NLP techniques** to your documents, including linguistic analysis, named entity recognition, text classification and more. It's also great for implementing **chunking for RAG** pipelines.
[![Test](https://github.com/explosion/spacy-layout/actions/workflows/test.yml/badge.svg)](https://github.com/explosion/spacy-layout/actions/workflows/test.yml)
[![Current Release Version](https://img.shields.io/github/release/explosion/spacy-layout.svg?style=flat-square&logo=github&include_prereleases)](https://github.com/explosion/spacy-layout/releases)
[![pypi Version](https://img.shields.io/pypi/v/spacy-layout.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy-layout/)
[![Built with spaCy](https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg?style=flat-square)](https://spacy.io)
## 📝 Usage
> ⚠️ This package requires **Python 3.10** or above.
```bash
pip install spacy-layout
```
After initializing the `spaCyLayout` preprocessor with an `nlp` object for tokenization, you can call it on a document path to convert it to structured data. The resulting `Doc` object includes layout spans that map into the original raw text and expose various attributes, including the content type and layout features.
```python
import spacy
from spacy_layout import spaCyLayout
nlp = spacy.blank("en")
layout = spaCyLayout(nlp)
# Process a document and create a spaCy Doc object
doc = layout("./starcraft.pdf")
# The text-based contents of the document
print(doc.text)
# Document layout including pages and page sizes
print(doc._.layout)
# Layout spans for different sections
for span in doc.spans["layout"]:
    # Document section and token and character offsets into the text
    print(span.text, span.start, span.end, span.start_char, span.end_char)
    # Section type, e.g. "text", "title", "section_header" etc.
    print(span.label_)
    # Layout features of the section, including bounding box
    print(span._.layout)
    # Closest heading to the span (accuracy depends on document structure)
    print(span._.heading)
```
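Since each layout span exposes its label and closest heading, one way to approach chunking for RAG is to group the body text under its heading. The following is a minimal sketch, not part of the library: `chunks_by_heading` and `_fake_span` are hypothetical names, and the stand-in objects only mimic the `text`, `label_` and `_.heading` attributes described above so that the example runs without a PDF.

```python
from collections import defaultdict
from types import SimpleNamespace


def chunks_by_heading(spans, heading_labels=("section_header", "page_header", "title")):
    """Group the text of layout spans under the text of their closest heading.

    `spans` is any iterable of objects exposing `.text`, `.label_` and
    `._.heading`, e.g. `doc.spans["layout"]`. Spans that are headings
    themselves are skipped; spans without a heading are grouped under None.
    """
    chunks = defaultdict(list)
    for span in spans:
        if span.label_ in heading_labels:
            continue
        heading = span._.heading
        key = heading.text if heading is not None else None
        chunks[key].append(span.text)
    return {key: "\n\n".join(texts) for key, texts in chunks.items()}


def _fake_span(text, label, heading=None):
    # Hypothetical stand-in for a spaCy Span with the layout extension attributes
    return SimpleNamespace(text=text, label_=label, _=SimpleNamespace(heading=heading))


intro = _fake_span("Introduction", "section_header")
spans = [
    _fake_span("Preface.", "text"),
    intro,
    _fake_span("First paragraph.", "text", intro),
    _fake_span("Second paragraph.", "text", intro),
]
chunks = chunks_by_heading(spans)
```

With a processed document, you would call `chunks_by_heading(doc.spans["layout"])` instead. Keying by heading text merges sections that share the same heading, so a real pipeline may prefer to key by `Span.id`.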
If you need to process larger volumes of documents at scale, you can use the `spaCyLayout.pipe` method, which takes an iterable of paths or bytes instead and yields `Doc` objects:
```python
paths = ["one.pdf", "two.pdf", "three.pdf", ...]
for doc in layout.pipe(paths):
    print(doc._.layout)
```
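Because `pipe` accepts any iterable, the paths don't have to be collected in a list up front. As a rough sketch, a generator like the one below could walk a directory and feed matching files to it lazily. The helper name `iter_document_paths` and the default suffixes are assumptions for illustration, not part of spaCy Layout.

```python
from pathlib import Path


def iter_document_paths(root, suffixes=(".pdf", ".docx")):
    """Yield files under `root` whose suffix is in `suffixes`, in sorted order."""
    wanted = {suffix.lower() for suffix in suffixes}
    for path in sorted(Path(root).rglob("*")):
        # Compare suffixes case-insensitively so "report.PDF" is included
        if path.is_file() and path.suffix.lower() in wanted:
            yield path
```

You could then write `for doc in layout.pipe(iter_document_paths("./documents")): ...`, where `./documents` is a placeholder directory.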
After you've processed the documents, you can [serialize](https://spacy.io/usage/saving-loading#docs) the structured `Doc` objects in spaCy's efficient binary format, so you don't have to re-run the resource-intensive conversion.
spaCy also allows you to call the `nlp` object on an already created `Doc`, so you can easily apply a pipeline of components for [linguistic analysis](https://spacy.io/usage/linguistic-features) or [named entity recognition](https://spacy.io/usage/linguistic-features#named-entities), use [rule-based matching](https://spacy.io/usage/rule-based-matching) or anything else you can do with spaCy.
```python
# Load the transformer-based English pipeline
# Installation: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")
layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
# Apply the pipeline to access POS tags, dependencies, entities etc.
doc = nlp(doc)
```
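Once the pipeline has run, the layout spans and the predicted entities share the same token offsets, so they can be related to each other. The sketch below pairs each layout span with the entities that fall entirely inside it. `entities_by_span` is a hypothetical helper, and the stand-in objects with made-up offsets only mimic the `start` and `end` attributes of spaCy spans so that the example runs without a trained pipeline.

```python
from types import SimpleNamespace


def entities_by_span(layout_spans, ents):
    """Pair each layout span with the entities that fall fully inside it.

    Both arguments hold objects with token offsets `.start` and `.end`
    (end exclusive), as spaCy spans do.
    """
    result = []
    for span in layout_spans:
        inside = [ent for ent in ents if span.start <= ent.start and ent.end <= span.end]
        result.append((span, inside))
    return result


# Hypothetical stand-ins for layout spans and entities
section = SimpleNamespace(label_="text", start=0, end=6)
title = SimpleNamespace(label_="title", start=7, end=9)
ents = [
    SimpleNamespace(text="Blizzard", label_="ORG", start=2, end=3),
    SimpleNamespace(text="1998", label_="DATE", start=8, end=9),
    # Crosses a span boundary, so it is assigned to neither span
    SimpleNamespace(text="straddles", label_="MISC", start=5, end=8),
]
pairs = entities_by_span([section, title], ents)
```

With a real document, the call would be `entities_by_span(doc.spans["layout"], doc.ents)`.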
## 🎛️ API
### Data and extension attributes
```python
layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
print(doc._.layout)
for span in doc.spans["layout"]:
    print(span.label_, span._.layout)
```
| Attribute | Type | Description |
| --- | --- | --- |
| `Doc._.layout` | `DocLayout` | Layout features of the document. |
| `Doc._.pages` | `list[tuple[PageLayout, list[Span]]]` | Pages in the document and the spans they contain. |
| `Doc.spans["layout"]` | `spacy.tokens.SpanGroup` | The layout spans in the document. |
| `Span.label_` | `str` | The type of the extracted layout span, e.g. `"text"` or `"section_header"`. [See here](https://github.com/DS4SD/docling-core/blob/14cad33ae7f8dc011a79dd364361d2647c635466/docling_core/types/doc/labels.py) for options. |
| `Span.label` | `int` | The integer ID of the span label. |
| `Span.id` | `int` | Running index of the layout span. |
| `Span._.layout` | `SpanLayout` | Layout features of a layout span. |
| `Span._.heading` | `Span \| None` | Closest heading to a span, if available. |
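
As an illustration of the `Doc._.pages` shape listed above, the following sketch collects the text of each page into a dictionary keyed by page number. `text_by_page` is a hypothetical helper, and the stand-in objects (including the page sizes) are made up to mimic the documented `(PageLayout, list[Span])` tuples.

```python
from types import SimpleNamespace


def text_by_page(pages, separator="\n\n"):
    """Map each page number to the joined text of the spans on that page.

    `pages` has the shape of `Doc._.pages`: a list of
    (page_layout, spans) tuples.
    """
    return {
        page.page_no: separator.join(span.text for span in spans)
        for page, spans in pages
    }


# Hypothetical stand-ins for PageLayout objects and layout spans
pages = [
    (
        SimpleNamespace(page_no=1, width=612.0, height=792.0),
        [SimpleNamespace(text="Title"), SimpleNamespace(text="Intro.")],
    ),
    (SimpleNamespace(page_no=2, width=612.0, height=792.0), []),
]
texts = text_by_page(pages)
```

With a processed document, you would pass `doc._.pages` directly.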
### <kbd>dataclass</kbd> PageLayout
| Attribute | Type | Description |
| --- | --- | --- |
| `page_no` | `int` | The page number (1-indexed). |
| `width` | `float` | Page width in pixels. |
| `height` | `float` | Page height in pixels. |
### <kbd>dataclass</kbd> DocLayout
| Attribute | Type | Description |
| --- | --- | --- |
| `pages` | `list[PageLayout]` | The pages in the document. |
### <kbd>dataclass</kbd> SpanLayout
| Attribute | Type | Description |
| --- | --- | --- |
| `x` | `float` | Horizontal offset of the bounding box in pixels. |
| `y` | `float` | Vertical offset of the bounding box in pixels. |
| `width` | `float` | Width of the bounding box in pixels. |
| `height` | `float` | Height of the bounding box in pixels. |
| `page_no` | `int` | Number of the page the span is on. |
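
The offset and size fields can be combined into corner coordinates, or scaled by the page size from `PageLayout` to get resolution-independent values. The sketch below uses a hypothetical `Box` dataclass that mirrors the `SpanLayout` fields so that it runs on its own. The helper names are assumptions, and the table doesn't state which page corner the offsets are measured from, so check that before relying on it.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in with the same fields as SpanLayout."""

    x: float
    y: float
    width: float
    height: float
    page_no: int


def corners(layout):
    """Return the bounding box as (x0, y0, x1, y1) in pixels."""
    return (layout.x, layout.y, layout.x + layout.width, layout.y + layout.height)


def relative_corners(layout, page_width, page_height):
    """Return the bounding box corners as fractions of the page size."""
    x0, y0, x1, y1 = corners(layout)
    return (x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height)


box = Box(x=50.0, y=100.0, width=100.0, height=200.0, page_no=1)
absolute = corners(box)
relative = relative_corners(box, page_width=200.0, page_height=400.0)
```

With real data, `span._.layout` would take the place of `box`, and the page size would come from the matching entry in `doc._.layout.pages`.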
### <kbd>class</kbd> `spaCyLayout`
#### <kbd>method</kbd> `spaCyLayout.__init__`
Initialize the document processor.
```python
nlp = spacy.blank("en")
layout = spaCyLayout(nlp)
```
| Argument | Type | Description |
| --- | --- | --- |
| `nlp` | `spacy.language.Language` | The initialized `nlp` object to use for tokenization. |
| `separator` | `str \| None` | Token used to separate sections in the created `Doc` object. The separator won't be part of the layout span. If `None`, no separator will be added. Defaults to `"\n\n"`. |
| `attrs` | `dict[str, str]` | Override the custom spaCy attributes. Can include `"doc_layout"`, `"doc_pages"`, `"span_layout"`, `"span_heading"` and `"span_group"`. |
| `headings` | `list[str]` | Labels of headings to consider for `Span._.heading` detection. Defaults to `["section_header", "page_header", "title"]`. |
| `docling_options` | `dict[InputFormat, FormatOption]` | [Format options](https://ds4sd.github.io/docling/usage/#advanced-options) passed to Docling's `DocumentConverter`. |
| **RETURNS** | `spaCyLayout` | The initialized object. |
#### <kbd>method</kbd> `spaCyLayout.__call__`
Process a document and create a spaCy [`Doc`](https://spacy.io/api/doc) object containing the text content and layout spans, available via `Doc.spans["layout"]` by default.
```python
layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
```
| Argument | Type | Description |
| --- | --- | --- |
| `source` | `str \| Path \| bytes` | Path of the document to process, or the document contents as bytes. |
| **RETURNS** | `Doc` | The processed spaCy `Doc` object. |
#### <kbd>method</kbd> `spaCyLayout.pipe`
Process multiple documents and create spaCy [`Doc`](https://spacy.io/api/doc) objects. You should use this method if you're processing larger volumes of documents at scale.
```python
layout = spaCyLayout(nlp)
paths = ["one.pdf", "two.pdf", "three.pdf", ...]
docs = layout.pipe(paths)
```
| Argument | Type | Description |
| --- | --- | --- |
| `sources` | `Iterable[str \| Path \| bytes]` | Paths of the documents to process, or the document contents as bytes. |
| **YIELDS** | `Doc` | The processed spaCy `Doc` object. |
Raw data
{
"_id": null,
"home_page": "https://github.com/explosion/spacy-layout",
"name": "spacy-layout",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": null,
"author": "Explosion",
"author_email": "contact@explosion.ai",
"download_url": "https://files.pythonhosted.org/packages/29/00/41f12302794ae7e1f2e879aa17927ccd86e8e3a7963db78d6d9d2f2e9933/spacy_layout-0.0.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Use spaCy with PDFs, Word docs and other documents",
"version": "0.0.5",
"project_urls": {
"Homepage": "https://github.com/explosion/spacy-layout",
"Release notes": "https://github.com/explosion/spacy-layout/releases",
"Source": "https://github.com/explosion/spacy-layout"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "aa52c325dec52834767a37c93c9c8e6d12576d2c5c52093e499d30989b09f2ae",
"md5": "0151fd97e45cf165737355c570e5d95c",
"sha256": "919b6f49a7a1d8d1dfb21ec60a48248453f45aeada206f2eddbb62fcf2d1ebd3"
},
"downloads": -1,
"filename": "spacy_layout-0.0.5-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "0151fd97e45cf165737355c570e5d95c",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.10",
"size": 7973,
"upload_time": "2024-11-21T13:19:50",
"upload_time_iso_8601": "2024-11-21T13:19:50.998142Z",
"url": "https://files.pythonhosted.org/packages/aa/52/c325dec52834767a37c93c9c8e6d12576d2c5c52093e499d30989b09f2ae/spacy_layout-0.0.5-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "290041f12302794ae7e1f2e879aa17927ccd86e8e3a7963db78d6d9d2f2e9933",
"md5": "cc172dce600f571a8bb2533fd6d6a51b",
"sha256": "8c785c70dd74f70b915109cf466b525c5b53292da1733d330fd705e96b3e5ace"
},
"downloads": -1,
"filename": "spacy_layout-0.0.5.tar.gz",
"has_sig": false,
"md5_digest": "cc172dce600f571a8bb2533fd6d6a51b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 8140,
"upload_time": "2024-11-21T13:19:55",
"upload_time_iso_8601": "2024-11-21T13:19:55.635346Z",
"url": "https://files.pythonhosted.org/packages/29/00/41f12302794ae7e1f2e879aa17927ccd86e8e3a7963db78d6d9d2f2e9933/spacy_layout-0.0.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-21 13:19:55",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "explosion",
"github_project": "spacy-layout",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "spacy-layout"
}