| Field | Value |
| --- | --- |
| Name | linkrot |
| Version | 5.2.2 |
| Summary | Extract metadata and URLs from PDF files |
| home_page | None |
| upload_time | 2025-07-22 18:53:37 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | None |
| keywords | linkrot, pdf, reference |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# Introduction
linkrot scans PDFs for links written in plaintext, checks whether each link is active, and records any error codes, then generates a report of its findings. It extracts references (PDF, URL, DOI, arXiv) and metadata from a PDF.
**New in v5.2.2**: Retraction checking! linkrot now automatically checks DOIs against retraction databases to identify potentially retracted papers, helping ensure research integrity.
Check out our sister project, [Rotting Research](https://github.com/marshalmiller/rottingresearch), for a web app implementation of this project.
# Features
- Extracts references and metadata from a given PDF.
- Detects PDF, URL, arXiv, and DOI references.
- **Checks DOIs for retracted papers** (using the `-r` flag).
- Archives valid links using the Internet Archive's Wayback Machine (using the `-a` flag).
- Checks for valid SSL certificates.
- Finds broken hyperlinks (using the `-c` flag).
- Outputs as text or JSON (using the `-j` flag).
- Extracts the PDF text (using the `--text` flag).
- Usable as a command-line tool or Python package.
- Works with local and online PDFs.
# Installation
## PyPI (Recommended)
Grab a copy of the code with pip:
```bash
pip install linkrot
```
## Debian/Ubuntu Package
For Debian/Ubuntu systems, you can build and install a .deb package:
```bash
# Install build dependencies
sudo apt-get install dpkg-dev debhelper dh-python python3-setuptools
# Build the package
python3 setup-deb-build.py
./build-deb.sh
# Install the packages
sudo dpkg -i ../python3-linkrot_*.deb ../linkrot_*.deb
sudo apt-get install -f # Fix any dependency issues
```
See `debian/README.md` for detailed packaging instructions.
# Usage
linkrot can be used to extract info from a PDF in two ways:
- Command line/Terminal tool `linkrot`
- Python library `import linkrot`
## 1. Command Line/Terminal tool
```bash
linkrot [pdf-file-or-url]
```
Run `linkrot -h` to see the help output:
```bash
linkrot -h
```
usage:
```bash
linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-r] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf
```
Extract metadata and references from a PDF, and optionally download all
referenced PDFs.
### Arguments
#### positional arguments:
- `pdf`: filename or URL of a PDF file
#### optional arguments:
- `-h, --help`: show this help message and exit
- `-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY`: download all referenced PDFs into the specified directory
- `-c, --check-links`: check for broken links
- `-r, --check-retractions`: check DOIs for retracted papers
- `-j, --json`: output info as JSON (instead of plain text)
- `-v, --verbose`: print all references (instead of only PDFs)
- `-t, --text`: only extract text (no metadata or references)
- `-a, --archive`: archive active links
- `-o OUTPUT_FILE, --output-file OUTPUT_FILE`: output to the specified file instead of the console
- `--version`: show the program's version number and exit
### PDF Samples
For testing purposes, you can find PDF samples in a [shared MEGA folder](https://mega.nz/folder/uwBxVSzS#lpBtSz49E9dqHtmrQwp0Ig).
### Examples
#### Extract text to console.
```bash
linkrot https://example.com/example.pdf -t
```
#### Extract text to file
```bash
linkrot https://example.com/example.pdf -t -o pdf-text.txt
```
#### Check Links
```bash
linkrot https://example.com/example.pdf -c
```
#### Check for Retracted Papers
```bash
linkrot https://example.com/example.pdf -r
```
#### Check Both Links and Retractions
```bash
linkrot https://example.com/example.pdf -c -r
```
#### Get Results as JSON with Retraction Check
```bash
linkrot https://example.com/example.pdf -r -j
```
## 2. Main Python Library
Import the library:
```python
import linkrot
```
Create an instance of the linkrot class like so:
```python
pdf = linkrot.linkrot("filename-or-url.pdf") #pdf is the instance of the linkrot class
```
Now the following functions can be used to extract specific data from the PDF:
### get_metadata()
Arguments: None
Usage:
```python
metadata = pdf.get_metadata() #pdf is the instance of the linkrot class
```
Return type: Dictionary `<class 'dict'>`
Information Provided: All metadata (including hidden metadata) associated with the PDF, such as creation date, creator, title, etc.
### get_text()
Arguments: None
Usage:
```python
text = pdf.get_text() #pdf is the instance of the linkrot class
```
Return type: String `<class 'str'>`
Information Provided: The entire content of the PDF in string form.
### get_references(reftype=None, sort=False)
Arguments:
- `reftype`: the type of reference to extract. Values: `'pdf'`, `'url'`, `'doi'`, `'arxiv'`. Default: all reference types.
- `sort`: whether the references should be sorted. Values: `True` or `False`. Default: not sorted.
Usage:
```python
references_list = pdf.get_references() #pdf is the instance of the linkrot class
```
Return type: Set `<class 'set'>` of `<linkrot.backends.Reference object>`
A `linkrot.backends.Reference` object has three member variables:
- `ref`: the actual URL/PDF/DOI/arXiv reference
- `reftype`: the type of reference
- `page`: the page on which it was referenced
Information Provided: All references with their corresponding type and page number.
### get_references_as_dict(reftype=None, sort=False)
Arguments:
- `reftype`: the type of reference to extract. Values: `'pdf'`, `'url'`, `'doi'`, `'arxiv'`. Default: all reference types.
- `sort`: whether the references should be sorted. Values: `True` or `False`. Default: not sorted.
Usage:
```python
references_dict = pdf.get_references_as_dict() #pdf is the instance of the linkrot class
```
Return type: Dictionary `<class 'dict'>` with keys 'pdf', 'url', 'doi', 'arxiv' that each have a list `<class 'list'>` of refs of that type.
Information Provided: All references in their corresponding type list.
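Since the return shape is easiest to see with concrete data, here is a minimal sketch that models `Reference`-like objects with a hypothetical `Ref` dataclass and groups them into the same dict-of-lists shape (illustrative only; the real objects come from `linkrot.backends`):

```python
from dataclasses import dataclass

# Hypothetical stand-in for linkrot.backends.Reference, used only to
# illustrate the dict-of-lists shape that get_references_as_dict() returns.
@dataclass(frozen=True)
class Ref:
    ref: str      # the actual URL/PDF/DOI/arXiv reference
    reftype: str  # 'pdf', 'url', 'doi', or 'arxiv'
    page: int     # page on which it was referenced

def group_refs(refs):
    """Group a set of Ref objects into {'pdf': [...], 'url': [...], ...}."""
    grouped = {"pdf": [], "url": [], "doi": [], "arxiv": []}
    for r in refs:
        grouped[r.reftype].append(r.ref)
    return grouped

refs = {
    Ref("https://example.com/a.pdf", "pdf", 3),
    Ref("10.1000/182", "doi", 7),
}
print(group_refs(refs))
```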
### download_pdfs(target_dir)
Arguments:
- `target_dir`: the path of the directory to which the referenced PDFs should be downloaded
Usage:
```python
pdf.download_pdfs("target-directory") #pdf is the instance of the linkrot class
```
Return type: None
Information Provided: Downloads all the reference PDFs to the specified directory.
## 3. Linkrot downloader functions
Import:
```python
from linkrot.downloader import sanitize_url, get_status_code, check_refs
```
### sanitize_url(url)
Arguments:
- `url`: the URL to be sanitized
Usage:
```python
new_url = sanitize_url(old_url)
```
Return type: String `<class 'str'>`
Information Provided: Returns the URL prefixed with 'http://' if it did not already have a scheme, and ensures it is UTF-8 encoded.
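As an illustration of that described behavior, here is a rough, hypothetical re-implementation (`sanitize_url_sketch` is an assumed name, not the library's actual code):

```python
def sanitize_url_sketch(url):
    """Approximate the documented behavior of sanitize_url: decode bytes
    as UTF-8 and prepend 'http://' when no scheme is present.
    Illustrative only -- not linkrot's actual implementation."""
    if isinstance(url, bytes):
        url = url.decode("utf-8")
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    return url

print(sanitize_url_sketch("example.com/paper.pdf"))
```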
### get_status_code(url)
Arguments:
- `url`: the URL to be checked for its status
Usage:
```python
status_code = get_status_code(url)
```
Return type: String `<class 'str'>`
Information Provided: Checks if the URL is active or broken.
### check_refs(refs, verbose=True, max_threads=MAX_THREADS_DEFAULT)
Arguments:
- `refs`: set of `linkrot.backends.Reference` objects
- `verbose`: whether to print every reference with its status code, or only the link-checker summary
- `max_threads`: number of threads to use
Usage:
```python
check_refs(pdf.get_references()) #pdf is the instance of the linkrot class
```
Return type: None
Information Provided: Prints each reference with its status code and a summary of all broken/active links to the terminal.
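The concurrent checking pattern behind a multithreaded link check can be sketched as follows; `check_refs_sketch` and the stubbed status lookup are hypothetical stand-ins so the example runs offline:

```python
from concurrent.futures import ThreadPoolExecutor

def check_refs_sketch(urls, get_status, max_threads=8):
    """Fetch each URL's status code concurrently and summarize
    broken vs. active links. `get_status` is injected so the sketch
    can be exercised without network access."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        codes = dict(zip(urls, pool.map(get_status, urls)))
    broken = [u for u, c in codes.items() if c >= 400]
    active = [u for u, c in codes.items() if c < 400]
    return {"active": active, "broken": broken, "codes": codes}

# Stubbed status function standing in for a real HTTP request
fake_status = {"http://ok.example": 200, "http://gone.example": 404}.get
result = check_refs_sketch(["http://ok.example", "http://gone.example"], fake_status)
print(result["broken"])
```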
## 4. Linkrot extractor functions
Import:
```python
from linkrot.extractor import extract_urls, extract_doi, extract_arxiv
```
Get pdf text:
```python
text = pdf.get_text() #pdf is the instance of the linkrot class
```
### extract_urls(text)
Arguments:
- `text`: string of text to extract URLs from
Usage:
```python
urls = extract_urls(text)
```
Return type: Set `<class 'set'>` of URLs
Information Provided: All URLs in the text
### extract_arxiv(text)
Arguments:
- `text`: string of text to extract arXiv references from
Usage:
```python
arxiv = extract_arxiv(text)
```
Return type: Set `<class 'set'>` of arXiv references
Information Provided: All arXiv references in the text
### extract_doi(text)
Arguments:
- `text`: string of text to extract DOIs from
Usage:
```python
doi = extract_doi(text)
```
Return type: Set `<class 'set'>` of DOIs
Information Provided: All DOIs in the text
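The three extractors can be approximated with regular expressions; the patterns and function names below are simplified sketches, not linkrot's actual regexes or code:

```python
import re

# Rough patterns illustrating what the extractor functions look for.
URL_RE = re.compile(r"https?://[^\s<>\"]+")
DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")
ARXIV_RE = re.compile(r"arxiv:\s?(\d{4}\.\d{4,5})", re.IGNORECASE)

def extract_urls_sketch(text):
    return set(URL_RE.findall(text))

def extract_doi_sketch(text):
    return set(DOI_RE.findall(text))

def extract_arxiv_sketch(text):
    return set(ARXIV_RE.findall(text))

sample = "See https://example.com and doi 10.1000/182, also arXiv:2101.00001."
print(extract_doi_sketch(sample))
```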
## 5. Linkrot retraction functions
Import:
```python
from linkrot.retraction import check_dois_for_retractions, RetractionChecker
```
### check_dois_for_retractions(dois, verbose=False)
Arguments:
- `dois`: set of DOI strings to check for retractions
- `verbose`: whether to print detailed results
Usage:
```python
# Get DOIs from PDF text
text = pdf.get_text()
dois = extract_doi(text)
# Check for retractions
result = check_dois_for_retractions(dois, verbose=True)
```
Return type: Dictionary with retraction results and summary
Information Provided: Checks each DOI against retraction databases and provides detailed information about any retracted papers found.
### RetractionChecker class
For more advanced usage, you can use the RetractionChecker class directly:
```python
checker = RetractionChecker()
# Check individual DOI
result = checker.check_doi("10.1000/182")
# Check multiple DOIs
results = checker.check_multiple_dois({"10.1000/182", "10.1038/nature12373"})
# Get summary
summary = checker.get_retraction_summary(results)
```
The retraction checker uses multiple methods to detect retractions:
- CrossRef API for retraction notices in metadata
- Analysis of DOI landing pages for retraction indicators
- Extensible design for adding more retraction databases
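The CrossRef method in the list above could plausibly be implemented as follows. The `filter=updates:` query and the `update-to` field are part of CrossRef's public REST API, but `retraction_notices` and `filter_retractions` are hypothetical names, and linkrot's actual logic may differ:

```python
import json
import urllib.request

def filter_retractions(items):
    """Keep CrossRef work records whose 'update-to' entries mark a retraction."""
    return [
        it for it in items
        if any(u.get("type") == "retraction" for u in it.get("update-to", []))
    ]

def retraction_notices(doi):
    """Look up works that update the given DOI via the CrossRef REST API
    (filter=updates:DOI) and keep only retraction notices.
    Requires network access."""
    url = f"https://api.crossref.org/works?filter=updates:{doi}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        items = json.load(resp)["message"]["items"]
    return filter_retractions(items)
```

The pure filtering step is separated from the network call so the detection logic can be tested offline with synthetic records.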
# Code of Conduct
To view our code of conduct, please visit our [Code of Conduct page](https://github.com/marshalmiller/rottingresearch/blob/main/code_of_conduct.md).
# License
This program is licensed under the [GPLv3 License](https://github.com/marshalmiller/linkrot/blob/main/LICENSE).
# Raw data
{
"_id": null,
"home_page": null,
"name": "linkrot",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "linkrot, pdf, reference",
"author": null,
"author_email": "Marshal Miller <marshal@rottingresearch.org>",
"download_url": "https://files.pythonhosted.org/packages/49/36/fa65fbffbff54df85a4e9d03afdc98e26fdaa345e72707a009aafe5aec90/linkrot-5.2.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Extract metadata and URLs from PDF files",
"version": "5.2.2",
"project_urls": {
"Homepage": "https://github.com/rottingresearch/linkrot"
},
"split_keywords": [
"linkrot",
" pdf",
" reference"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "fbb3eb9b2133c4d2c5c7b4cb0ce06c6d93fed58e0ab25372c6f3d9b1c7180903",
"md5": "0bf3c06812c2ae70cd378ce9704ba947",
"sha256": "073b4413d05f905fd4200ce5849fd57159a0fcdb0733b5a4b50a759b4d043b42"
},
"downloads": -1,
"filename": "linkrot-5.2.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0bf3c06812c2ae70cd378ce9704ba947",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 36932,
"upload_time": "2025-07-22T18:53:36",
"upload_time_iso_8601": "2025-07-22T18:53:36.603955Z",
"url": "https://files.pythonhosted.org/packages/fb/b3/eb9b2133c4d2c5c7b4cb0ce06c6d93fed58e0ab25372c6f3d9b1c7180903/linkrot-5.2.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "4936fa65fbffbff54df85a4e9d03afdc98e26fdaa345e72707a009aafe5aec90",
"md5": "098bd038abad9f325a0c00ab80322820",
"sha256": "b3235fb8d5913cca7a188bcf7a9fe36d2aedb5e617100ae99a29e1236dfee953"
},
"downloads": -1,
"filename": "linkrot-5.2.2.tar.gz",
"has_sig": false,
"md5_digest": "098bd038abad9f325a0c00ab80322820",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 33190,
"upload_time": "2025-07-22T18:53:37",
"upload_time_iso_8601": "2025-07-22T18:53:37.797137Z",
"url": "https://files.pythonhosted.org/packages/49/36/fa65fbffbff54df85a4e9d03afdc98e26fdaa345e72707a009aafe5aec90/linkrot-5.2.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-22 18:53:37",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "rottingresearch",
"github_project": "linkrot",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "linkrot"
}