# GrobidArticleExtractor
This Python tool extracts content from PDF files using GROBID and organizes it by sections. It provides a structured way
to extract both metadata and content from academic papers and other structured documents.
## Features
- Direct PDF processing using GROBID API
- Metadata extraction (title, authors, abstract, publication date)
- Hierarchical section organization with subsections
## Prerequisites
1. Install GROBID:
```bash
docker pull lfoppiano/grobid:0.8.0
docker run --init -p 8070:8070 -e JAVA_OPTS="-XX:+UseZGC" lfoppiano/grobid:0.8.0
```
Setting `JAVA_OPTS="-XX:+UseZGC"` helps resolve the following error on macOS (a quick health-check sketch for verifying the server follows the installation steps below):
```bash
[thread 44 also had an error]
A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x00007ffffef8ad07, pid=8, tid=47
JRE version: OpenJDK Runtime Environment (17.0.2+8) (build 17.0.2+8-86)
Java VM: OpenJDK 64-Bit Server VM (17.0.2+8-86, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, parallel gc, linux-amd64)
Problematic frame:
[thread 41 also had an error]
[thread 45 also had an error]
[thread 46 also had an error]
```
2. Install the package:
Install the latest release from PyPI:
```sh
pip install GrobidArticleExtractor
```
Or install the latest development version from GitHub:
```sh
pip install git+https://github.com/sensein/GrobidArticleExtractor.git
```
Note: If upgrading from a previous version, you may need to reinstall the package to ensure the CLI command is properly installed:
```sh
pip uninstall GrobidArticleExtractor
pip install GrobidArticleExtractor
```
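Before processing any PDFs, it can help to confirm that the GROBID container is actually reachable. The sketch below uses the `requests` library and GROBID's `isalive` endpoint; the URL is an assumption based on the default local setup above, so adjust it if your server runs elsewhere.
```python
import requests

# Assumes GROBID is running locally on the default port
GROBID_URL = "http://localhost:8070"

try:
    # GROBID exposes a simple liveness endpoint
    response = requests.get(f"{GROBID_URL}/api/isalive", timeout=5)
    print("GROBID is up:", response.text)
except requests.RequestException as exc:
    print("GROBID is not reachable:", exc)
```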
## Usage
### Command Line Interface
The tool provides a user-friendly command-line interface for batch processing PDF files:
```bash
# Basic usage (processes PDFs from 'pdfs' directory)
grobidextractor

# Process PDFs from a specific directory
grobidextractor path/to/pdfs

# Specify custom output directory
grobidextractor path/to/pdfs -o path/to/output

# Use custom GROBID server and disable content preview
grobidextractor path/to/pdfs --grobid-url http://custom:8070 --no-preview
```
Available options:
```bash
$ grobidextractor --help
Usage: grobidextractor [OPTIONS] [INPUT_FOLDER]

  Process PDF files from INPUT_FOLDER and extract their content using GROBID.

  The extracted content is saved as JSON files in the output directory.
  Each JSON file is named after its source PDF file.

Options:
  -o, --output-dir PATH  Directory to save extracted JSON files (default: output)
  -g, --grobid-url TEXT  GROBID service URL (default: http://localhost:8070)
  --preview / --no-preview
                         Show preview of extracted content (default: True)
  --help                 Show this message and exit.

Example:
  grobidextractor path/to/pdfs -o path/to/output
```
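Since each PDF produces a JSON file named after its source document, the results can be inspected with standard tooling. A minimal sketch, assuming the default `output` directory and a hypothetical `paper.pdf` input (which would yield `paper.json`):
```python
import json
from pathlib import Path

# Hypothetical example: load the JSON produced for "paper.pdf"
output_file = Path("output") / "paper.json"

with output_file.open(encoding="utf-8") as f:
    data = json.load(f)

print(data["metadata"]["title"])
print([section["heading"] for section in data["sections"]])
```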
### Python API Usage
You can also use the tool programmatically in your Python code:
```python
from GrobidArticleExtractor.app import GrobidArticleExtractor

# Initialize extractor (default GROBID URL: http://localhost:8070)
extractor = GrobidArticleExtractor()

# Process a PDF file
xml_content = extractor.process_pdf("path/to/your/paper.pdf")

if xml_content:
    # Extract and organize content
    result = extractor.extract_content(xml_content)

    # Access metadata
    print(result['metadata'])

    # Access sections
    for section in result['sections']:
        print(section['heading'])
        if 'content' in section:
            print(section['content'])
```
Custom GROBID server:
```python
extractor = GrobidArticleExtractor(grobid_url="http://your-grobid-server:8070")
```
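To mirror the CLI's batch behaviour from Python, you can loop over a folder of PDFs and write each result to JSON yourself. This is only a sketch built on the documented `process_pdf` and `extract_content` calls; the folder and file names are placeholders.
```python
import json
from pathlib import Path

from GrobidArticleExtractor.app import GrobidArticleExtractor

extractor = GrobidArticleExtractor()  # default GROBID URL
input_dir = Path("pdfs")              # placeholder input folder
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

for pdf_path in input_dir.glob("*.pdf"):
    xml_content = extractor.process_pdf(str(pdf_path))
    if not xml_content:
        continue  # extraction failed; errors are logged by the extractor
    result = extractor.extract_content(xml_content)
    out_file = output_dir / f"{pdf_path.stem}.json"
    out_file.write_text(json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8")
```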
## Output Structure
The extracted content is organized as follows:
```python
{
    'metadata': {
        'title': 'Paper Title',
        'authors': ['Author 1', 'Author 2'],
        'abstract': 'Paper abstract...',
        'publication_date': '2023'
    },
    'sections': [
        {
            'heading': 'Introduction',
            'content': ['Paragraph 1...', 'Paragraph 2...'],
            'subsections': [
                {
                    'heading': 'Background',
                    'content': ['Subsection content...']
                }
            ]
        }
        # More sections...
    ]
}
```
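Because subsections can nest under `subsections`, a small recursive helper makes it easy to walk the tree, for instance to collect all headings or paragraphs. This is a sketch against the structure shown above; `result` is assumed to be the dictionary returned by `extract_content`.
```python
def iter_sections(sections, level=0):
    """Yield (level, heading, paragraphs) for each section and nested subsection."""
    for section in sections:
        yield level, section.get("heading", ""), section.get("content", [])
        # Recurse into nested subsections, if any
        yield from iter_sections(section.get("subsections", []), level + 1)

# Example: print an indented outline of the document
for level, heading, paragraphs in iter_sections(result["sections"]):
    print("  " * level + heading, f"({len(paragraphs)} paragraph(s))")
```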
## Project Structure
The project is organized into two main files:
- `app.py` - Contains the core `GrobidArticleExtractor` class with all the PDF processing and content extraction
functionality
- `cli.py` - Contains the command-line interface implementation using Click
## Error Handling
The tool includes comprehensive error handling for common scenarios:
- PDF file not found
- GROBID service unavailable
- XML parsing errors
- Invalid content structure
All errors are logged with appropriate messages using Python's logging module.
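Because the messages come from Python's standard `logging` module, enabling them in your own script only requires basic configuration. A minimal sketch:
```python
import logging

# Show the extractor's INFO/ERROR messages (e.g. missing PDFs, unreachable GROBID)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
```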
## Contributing
Feel free to submit issues and enhancement requests!
## License
MIT License