refextract


- Name: refextract
- Version: 1.1.6
- Summary: Small library for extracting references used in scholarly communication.
- Upload time: 2025-10-21 09:48:19
- Docs URL: https://pythonhosted.org/refextract/
- Author: CERN
- Requires Python: <4,>=3.11
- License: GPL-2.0-or-later
- Requirements: none recorded
            
# refextract

## About

A library for extracting references used in scholarly communication.

## Getting Started

Note: because it relies on `mmap` resize functionality, this library cannot be installed locally on macOS.
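
To check whether a given machine is affected, one option is to probe `mmap` resizing directly. The snippet below is a minimal sketch and not part of `refextract`: `mmap_resize_supported` is a hypothetical helper, and the set of exceptions it catches is an assumption, since the exact failure mode depends on the platform and Python version.

```python
import mmap
import tempfile


def mmap_resize_supported():
    """Try to grow a small file-backed mapping and report whether it worked."""
    with tempfile.TemporaryFile() as handle:
        # Back the mapping with one page of zero bytes.
        handle.write(b"\0" * mmap.PAGESIZE)
        handle.flush()
        mapping = mmap.mmap(handle.fileno(), mmap.PAGESIZE)
        try:
            mapping.resize(2 * mmap.PAGESIZE)
            return True
        except (SystemError, OSError, AttributeError):
            # Assumed failure modes on platforms without resize support.
            return False
        finally:
            mapping.close()


print("mmap resize supported:", mmap_resize_supported())
```

If this prints `False`, the Docker setup described in this README is the safer route.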

### Docker Setup

Before first use, and whenever a library or dependency changes, build a new Docker image:
```shell
docker build --target refextract-tests -t refextract .
```

Then start a container named `refextract`, so that the `docker exec` command in the next section can refer to it:
```shell
docker run -it --name refextract -v ./tests:/refextract/tests -v ./refextract:/refextract/refextract refextract
```

### Running tests

Open a shell in the running container:
```shell
docker exec -it refextract /bin/bash
```
Then run the test suite:
```shell
pytest .
```

## Usage

To get structured information from a publication reference:


``` python
>>> from refextract import extract_journal_reference
>>> reference = extract_journal_reference('J.Phys.,A39,13445')
>>> print(reference)
{
'extra_ibids': [],
'is_ibid': False,
'misc_txt': '',
'page': '13445',
'title': 'J. Phys.',
'type': 'JOURNAL',
'volume': 'A39',
'year': '',
}
```
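
Once the fields are separated, they can be reassembled into a display string. The following is a minimal sketch that works on a plain dict shaped like the output above. `format_journal_reference` is a hypothetical helper, not part of `refextract`, and the sample dict is copied from the example rather than produced by calling the library.

```python
def format_journal_reference(ref):
    """Join the non-empty title, volume, year and page fields of a parsed reference."""
    parts = [ref.get("title", ""), ref.get("volume", "")]
    if ref.get("year"):
        # Only add the year when the parser found one.
        parts.append(f"({ref['year']})")
    parts.append(ref.get("page", ""))
    return " ".join(p for p in parts if p)


# Sample copied from the output shown above (not a live library call).
sample = {
    "extra_ibids": [],
    "is_ibid": False,
    "misc_txt": "",
    "page": "13445",
    "title": "J. Phys.",
    "type": "JOURNAL",
    "volume": "A39",
    "year": "",
}
print(format_journal_reference(sample))  # J. Phys. A39 13445
```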

To extract references from a PDF:
``` python
>>> from refextract import extract_references_from_file
>>> references = extract_references_from_file('1503.07589.pdf')
>>> print(references[0])
{
'author': ['F. Englert and R. Brout'],
'doi': ['doi:10.1103/PhysRevLett.13.321'],
'journal_page': ['321'],
'journal_reference': ['Phys. Rev. Lett. 13 (1964) 321'],
'journal_title': ['Phys. Rev. Lett.'],
'journal_volume': ['13'],
'journal_year': ['1964'],
'linemarker': ['1'],
'raw_ref': ['[1] F. Englert and R. Brout, \u201cBroken symmetry and the mass of gauge vector mesons\u201d, Phys. Rev. Lett. 13 (1964) 321, doi:10.1103/PhysRevLett.13.321.'],
'texkey': ['Englert:1964et'],
'year': ['1964'],
}
```
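
Every value in the reference dict above is a list, even when it holds a single item. For downstream use it can be convenient to unwrap those. The sketch below assumes the dict shape shown in the example. `flatten_reference` is a hypothetical helper, not a `refextract` function, and the sample data is copied from the output above.

```python
def flatten_reference(ref):
    """Collapse single-element lists to scalars; leave other values untouched."""
    flat = {}
    for key, values in ref.items():
        if isinstance(values, list) and len(values) == 1:
            flat[key] = values[0]
        else:
            flat[key] = values
    return flat


# Subset of the fields shown above (not a live library call).
sample = {
    "author": ["F. Englert and R. Brout"],
    "journal_title": ["Phys. Rev. Lett."],
    "journal_volume": ["13"],
    "journal_year": ["1964"],
    "journal_page": ["321"],
}
flat = flatten_reference(sample)
print(flat["journal_title"], flat["journal_volume"], flat["journal_page"])
# Phys. Rev. Lett. 13 321
```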

To extract directly from a URL:
``` python
>>> from refextract import extract_references_from_url
>>> references = extract_references_from_url('https://arxiv.org/pdf/1503.07589.pdf')
>>> print(references[0])
{
'author': ['F. Englert and R. Brout'],
'doi': ['doi:10.1103/PhysRevLett.13.321'],
'journal_page': ['321'],
'journal_reference': ['Phys. Rev. Lett. 13 (1964) 321'],
'journal_title': ['Phys. Rev. Lett.'],
'journal_volume': ['13'],
'journal_year': ['1964'],
'linemarker': ['1'],
'raw_ref': ['[1] F. Englert and R. Brout, \u201cBroken symmetry and the mass of gauge vector mesons\u201d, Phys. Rev. Lett. 13 (1964) 321, doi:10.1103/PhysRevLett.13.321.'],
'texkey': ['Englert:1964et'],
'year': ['1964'],
}
```
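
A common follow-up step is to gather the DOIs from the whole list of references. The sketch below assumes each reference is a dict with an optional `doi` list whose entries may carry a `doi:` prefix, as in the output above. `collect_dois` is a hypothetical helper and the sample list is hand-written for illustration.

```python
def collect_dois(references):
    """Return unique DOIs in first-seen order, with any leading 'doi:' removed."""
    seen = []
    for ref in references:
        for raw in ref.get("doi", []):
            doi = raw[4:] if raw.lower().startswith("doi:") else raw
            doi = doi.strip()
            if doi and doi not in seen:
                seen.append(doi)
    return seen


# Hand-written sample in the shape shown above (not a live library call).
references = [
    {"doi": ["doi:10.1103/PhysRevLett.13.321"], "year": ["1964"]},
    {"author": ["F. Englert and R. Brout"]},  # entry without a DOI
    {"doi": ["doi:10.1103/PhysRevLett.13.321"]},  # duplicate
]
print(collect_dois(references))  # ['10.1103/PhysRevLett.13.321']
```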

## Notes
`refextract` depends on the [pdftotext](http://linux.die.net/man/1/pdftotext) command-line tool.
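
Before extracting from PDFs, it can help to confirm that the tool is installed. The check below is a minimal sketch that assumes `pdftotext` is looked up on `PATH`. `pdftotext_available` is a hypothetical helper, not part of `refextract`.

```python
import shutil


def pdftotext_available(executable="pdftotext"):
    """Return True if the named executable can be found on PATH."""
    return shutil.which(executable) is not None


if not pdftotext_available():
    # On Debian-based systems the tool is commonly shipped in poppler-utils.
    print("pdftotext not found on PATH")
```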

## Acknowledgments

`refextract` is based on code and ideas from the following people, who contributed to the `docextract` module in Invenio:
- Alessio Deiana
- Federico Poli
- Gerrit Rindermann
- Graham R. Armstrong
- Grzegorz Szpura
- Jan Aage Lavik
- Javier Martin Montull
- Micha Moskovic
- Samuele Kaplun
- Thorsten Schwander
- Tibor Simko

## License
GPL-2.0-or-later (GNU General Public License, version 2 or later)


            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "refextract",
    "maintainer": null,
    "docs_url": "https://pythonhosted.org/refextract/",
    "requires_python": "<4,>=3.11",
    "maintainer_email": null,
    "keywords": null,
    "author": "CERN",
    "author_email": "admin@inspirehep.net",
    "download_url": "https://files.pythonhosted.org/packages/f2/5d/ec25190dd00f7121eebcde4656402c59ee565f88adcee40e1c8f8e602c00/refextract-1.1.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPL-2.0-or-later",
    "summary": "Small library for extracting references used in scholarly communication.",
    "version": "1.1.6",
    "project_urls": {
        "Homepage": "https://github.com/inspirehep/refextract"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "bc39f00089a804db6b1516568a7479a816dd413f2d12c526d65e746574634f97",
                "md5": "ec803f8993c3e2ec0220679ed4fac2a8",
                "sha256": "8fab1374a91e264dc23fac81f3b7ab31fcd4bd970756b9d4417974640fa03e77"
            },
            "downloads": -1,
            "filename": "refextract-1.1.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ec803f8993c3e2ec0220679ed4fac2a8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4,>=3.11",
            "size": 276145,
            "upload_time": "2025-10-21T09:48:18",
            "upload_time_iso_8601": "2025-10-21T09:48:18.077303Z",
            "url": "https://files.pythonhosted.org/packages/bc/39/f00089a804db6b1516568a7479a816dd413f2d12c526d65e746574634f97/refextract-1.1.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f25dec25190dd00f7121eebcde4656402c59ee565f88adcee40e1c8f8e602c00",
                "md5": "bee3ba760883bd8dce08ad1f9caaa216",
                "sha256": "d1cfd235286f1e77af9992c493a3fab83bd3c6d69e91962f0c8c97dae45dc226"
            },
            "downloads": -1,
            "filename": "refextract-1.1.6.tar.gz",
            "has_sig": false,
            "md5_digest": "bee3ba760883bd8dce08ad1f9caaa216",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4,>=3.11",
            "size": 259512,
            "upload_time": "2025-10-21T09:48:19",
            "upload_time_iso_8601": "2025-10-21T09:48:19.717465Z",
            "url": "https://files.pythonhosted.org/packages/f2/5d/ec25190dd00f7121eebcde4656402c59ee565f88adcee40e1c8f8e602c00/refextract-1.1.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-21 09:48:19",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "inspirehep",
    "github_project": "refextract",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "refextract"
}
        