GrobidArticleExtractor


| Field | Value |
|---|---|
| Name | GrobidArticleExtractor |
| Version | 0.7.0 |
| Home page | https://github.com/sensein/GrobidArticleExtractor |
| Summary | GrobidArticleExtractor is a Python package designed to extract and organize content from scientific papers in PDF format. |
| Upload time | 2025-01-07 02:43:53 |
| Maintainer | tekrajchhetri |
| Docs URL | None |
| Author | tekrajchhetri |
| Requires Python | `<4.0,>=3.10` |
| License | MIT |
| Keywords | python, package, template |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |

# GrobidArticleExtractor

This Python tool extracts content from PDF files using GROBID and organizes it by sections. It provides a structured way
to extract both metadata and content from academic papers and other structured documents.

## Features

- Direct PDF processing using GROBID API
- Metadata extraction (title, authors, abstract, publication date)
- Hierarchical section organization with subsections

## Prerequisites

1. Install and start GROBID (for example, with Docker):

   ```bash 
   docker pull lfoppiano/grobid:0.8.0
   docker run --init -p 8070:8070 -e JAVA_OPTS="-XX:+UseZGC" lfoppiano/grobid:0.8.0
   ```
   Setting `JAVA_OPTS="-XX:+UseZGC"` helps resolve the following error on macOS:
    ```text
    [thread 44 also had an error]
    
    A fatal error has been detected by the Java Runtime Environment:
    
    SIGSEGV (0xb) at pc=0x00007ffffef8ad07, pid=8, tid=47
    
    JRE version: OpenJDK Runtime Environment (17.0.2+8) (build 17.0.2+8-86)
    Java VM: OpenJDK 64-Bit Server VM (17.0.2+8-86, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, parallel gc, linux-amd64)
    Problematic frame:
    [thread 41 also had an error]
    [thread 45 also had an error]
    [thread 46 also had an error]
    ```

2. Install the package:

   Install the released version from PyPI:
   
   ```sh
   pip install GrobidArticleExtractor
   ```
   
   Or install the latest development version from GitHub:
   
   ```sh
   pip install git+https://github.com/sensein/GrobidArticleExtractor.git
   ```

   Note: If upgrading from a previous version, you may need to reinstall the package to ensure the CLI command is properly installed:
   ```sh
   pip uninstall GrobidArticleExtractor
   pip install GrobidArticleExtractor
   ```
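
Once the GROBID container from step 1 is running, it can be useful to confirm that the service is reachable before processing any PDFs. The following is a minimal sketch using only the standard library. It assumes the service exposes GROBID's `/api/isalive` endpoint at the default URL; the helper names `isalive_url` and `grobid_is_alive` are illustrative and not part of this package.

```python
import urllib.request


def isalive_url(base_url: str) -> str:
    """Build the URL of GROBID's liveness endpoint from the service base URL."""
    return f"{base_url.rstrip('/')}/api/isalive"


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the GROBID service answers its liveness check."""
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as response:
            body = response.read().strip().lower()
            return response.status == 200 and body == b"true"
    except OSError:
        # Covers connection refused, DNS failures, timeouts, and HTTP errors.
        return False
```

Calling `grobid_is_alive()` before a long batch run gives an early, explicit signal when the container has not finished starting.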

## Usage

### Command Line Interface

The tool provides a user-friendly command-line interface for batch processing PDF files:

```bash
# Basic usage (processes PDFs from 'pdfs' directory)
grobidextractor

# Process PDFs from a specific directory
grobidextractor path/to/pdfs

# Specify custom output directory
grobidextractor path/to/pdfs -o path/to/output

# Use custom GROBID server and disable content preview
grobidextractor path/to/pdfs --grobid-url http://custom:8070 --no-preview
```

Available options:

```bash
$ grobidextractor --help
Usage: grobidextractor [OPTIONS] [INPUT_FOLDER]

  Process PDF files from INPUT_FOLDER and extract their content using GROBID.

  The extracted content is saved as JSON files in the output directory.
  Each JSON file is named after its source PDF file.

Options:
  -o, --output-dir PATH  Directory to save extracted JSON files (default: output)
  -g, --grobid-url TEXT  GROBID service URL (default: http://localhost:8070)
  --preview / --no-preview
                        Show preview of extracted content (default: True)
  --help                Show this message and exit.

Example:
  grobidextractor path/to/pdfs -o path/to/output
```
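
Because the CLI writes one JSON file per PDF, the results can be loaded back for further analysis. The sketch below assumes the files live in the default `output` directory and follow the `metadata`/`sections` layout described under Output Structure; `load_results` and `summarize` are hypothetical helper names, not functions shipped with the package.

```python
import json
from pathlib import Path


def load_results(output_dir: str = "output") -> dict:
    """Map each JSON file's stem (the source PDF name) to its parsed content."""
    results = {}
    for json_path in sorted(Path(output_dir).glob("*.json")):
        with json_path.open(encoding="utf-8") as handle:
            results[json_path.stem] = json.load(handle)
    return results


def summarize(results: dict) -> list:
    """Produce one 'name: title [sections: N]' line per extracted paper."""
    lines = []
    for name, data in results.items():
        # Fall back gracefully if a file lacks the expected keys.
        title = data.get("metadata", {}).get("title", "<no title>")
        section_count = len(data.get("sections", []))
        lines.append(f"{name}: {title} [sections: {section_count}]")
    return lines
```

For example, `print("\n".join(summarize(load_results("path/to/output"))))` lists every processed paper with its title and top-level section count.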

### Python API Usage

You can also use the tool programmatically in your Python code:

```python
from GrobidArticleExtractor.app import GrobidArticleExtractor

# Initialize extractor (default GROBID URL: http://localhost:8070)
extractor = GrobidArticleExtractor()

# Process a PDF file
xml_content = extractor.process_pdf("path/to/your/paper.pdf")

if xml_content:
    # Extract and organize content
    result = extractor.extract_content(xml_content)

    # Access metadata
    print(result['metadata'])

    # Access sections
    for section in result['sections']:
        print(section['heading'])
        if 'content' in section:
            print(section['content'])
```

To use a custom GROBID server, pass its URL to the constructor:

```python
extractor = GrobidArticleExtractor(grobid_url="http://your-grobid-server:8070")
```
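
The same two calls can be looped over a folder to reproduce the CLI's batch behaviour from Python. This is a sketch under stated assumptions: it relies only on the `process_pdf` and `extract_content` methods shown above, assumes `process_pdf` returns a falsy value on failure (as the `if xml_content:` guard suggests), and assumes the result is JSON-serializable. The function name `extract_folder` is hypothetical.

```python
import json
from pathlib import Path


def extract_folder(extractor, input_dir: str, output_dir: str) -> list:
    """Run the extractor on every PDF in input_dir and write one JSON file per paper."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        xml_content = extractor.process_pdf(str(pdf_path))
        if not xml_content:
            # Skip PDFs for which GROBID returned nothing.
            continue
        result = extractor.extract_content(xml_content)
        target = out / f"{pdf_path.stem}.json"
        target.write_text(
            json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(target)
    return written
```

A typical call would be `extract_folder(GrobidArticleExtractor(), "pdfs", "output")`. Passing the extractor in as an argument keeps the loop independent of how the extractor was configured.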

## Output Structure

The extracted content is organized as follows:

```python
{
   'metadata': {
      'title': 'Paper Title',
      'authors': ['Author 1', 'Author 2'],
      'abstract': 'Paper abstract...',
      'publication_date': '2023'
   },
   'sections': [
      {
         'heading': 'Introduction',
         'content': ['Paragraph 1...', 'Paragraph 2...'],
         'subsections': [
            {
               'heading': 'Background',
               'content': ['Subsection content...']
            }
         ]
      }
      # More sections...
   ]
}
```
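
Since sections can nest, a small recursive helper makes the hierarchy easier to traverse. The sketch below assumes the structure shown above and treats `content` and `subsections` as optional keys; `iter_sections` and `outline` are illustrative names, not part of the package.

```python
from typing import Iterator


def iter_sections(sections: list, depth: int = 0) -> Iterator[tuple]:
    """Yield (depth, heading, paragraphs) for every section, parents before children."""
    for section in sections:
        yield depth, section.get("heading", ""), section.get("content", [])
        # 'subsections' may be absent on leaf sections, so default to an empty list.
        yield from iter_sections(section.get("subsections", []), depth + 1)


def outline(result: dict) -> str:
    """Render the section hierarchy as an indented outline, two spaces per level."""
    return "\n".join(
        f"{'  ' * depth}{heading}"
        for depth, heading, _ in iter_sections(result.get("sections", []))
    )
```

Applied to the example above, `outline(result)` produces `Introduction` followed by an indented `Background` line.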

## Project Structure

The project is organized into two main files:

- `app.py` - Contains the core `GrobidArticleExtractor` class with all the PDF processing and content extraction
  functionality
- `cli.py` - Contains the command-line interface implementation using Click

## Error Handling

The tool handles the following common failure scenarios:

- PDF file not found
- GROBID service unavailable
- XML parsing errors
- Invalid content structure

All errors are logged with appropriate messages using Python's logging module.
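
To see these log messages in your own scripts, logging has to be configured in the calling program. The sketch below is one possible setup, not the package's own configuration. It assumes the package's loggers propagate to the root logger and that `process_pdf` returns a falsy value on failure; `enable_extractor_logs` and `try_extract` are hypothetical helpers.

```python
import logging
from typing import Optional


def enable_extractor_logs(level: int = logging.INFO) -> None:
    """Route log records from all libraries to stderr with timestamps."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace any handlers configured earlier
    )


def try_extract(extractor, pdf_path: str) -> Optional[dict]:
    """Return the extracted content, or None if GROBID returned nothing."""
    xml_content = extractor.process_pdf(pdf_path)
    if not xml_content:
        logging.getLogger(__name__).warning("No GROBID output for %s", pdf_path)
        return None
    return extractor.extract_content(xml_content)
```

Calling `enable_extractor_logs(logging.DEBUG)` once at program start should surface the more detailed messages, if the package emits any at that level.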

## Contributing

Feel free to submit issues and enhancement requests!

## License

MIT License

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/sensein/GrobidArticleExtractor",
    "name": "GrobidArticleExtractor",
    "maintainer": "tekrajchhetri",
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": "tekrajchhetri@gmail.com",
    "keywords": "python, package, template",
    "author": "tekrajchhetri",
    "author_email": "tekrajchhetri@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/75/58/96acef08322d2d010ca88a737f3a0fff0a12aee5d448277b536e88f8a7a7/grobidarticleextractor-0.7.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "GrobidArticleExtractor is a Python package designed to extract and organize content from scientific papers in PDF format.",
    "version": "0.7.0",
    "project_urls": {
        "Documentation": "https://tekrajchhetri.github.io/GrobidArticleExtractor",
        "Homepage": "https://github.com/sensein/GrobidArticleExtractor",
        "Repository": "https://github.com/sensein/GrobidArticleExtractor"
    },
    "split_keywords": [
        "python",
        " package",
        " template"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fb4986ac3c28e06b7484d78e0df602bef1dc38c0d20fdfc28c57227b0239e0a6",
                "md5": "8fdc5211d7e1bc5d0278b1754f319ddd",
                "sha256": "7af55a38739c893d58f6578229951cac9d52e94f3ddd867b51609b5640153e3d"
            },
            "downloads": -1,
            "filename": "grobidarticleextractor-0.7.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8fdc5211d7e1bc5d0278b1754f319ddd",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.10",
            "size": 9857,
            "upload_time": "2025-01-07T02:43:52",
            "upload_time_iso_8601": "2025-01-07T02:43:52.044040Z",
            "url": "https://files.pythonhosted.org/packages/fb/49/86ac3c28e06b7484d78e0df602bef1dc38c0d20fdfc28c57227b0239e0a6/grobidarticleextractor-0.7.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "755896acef08322d2d010ca88a737f3a0fff0a12aee5d448277b536e88f8a7a7",
                "md5": "c5ef92012aa377a747470cb3f47114ee",
                "sha256": "069ea23a3b6aab9c5f1aaa2786398383403b70f6efc2db4d6e7f794d73b66f9c"
            },
            "downloads": -1,
            "filename": "grobidarticleextractor-0.7.0.tar.gz",
            "has_sig": false,
            "md5_digest": "c5ef92012aa377a747470cb3f47114ee",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 10013,
            "upload_time": "2025-01-07T02:43:53",
            "upload_time_iso_8601": "2025-01-07T02:43:53.191001Z",
            "url": "https://files.pythonhosted.org/packages/75/58/96acef08322d2d010ca88a737f3a0fff0a12aee5d448277b536e88f8a7a7/grobidarticleextractor-0.7.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-07 02:43:53",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "sensein",
    "github_project": "GrobidArticleExtractor",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "grobidarticleextractor"
}
        