| Field | Value |
|---|---|
| Name | linkrot |
| Version | 5.2.1 |
| Summary | Extract metadata and URLs from PDF files |
| Upload time | 2024-01-04 02:19:37 |
| Requires Python | >=3.8 |
| Keywords | linkrot, pdf, reference |
![linkrot logo](https://github.com/marshalmiller/linkrot/blob/6e6fb45239f8d06e89671e2ec68a11629747355d/branding/Asset%207@4x.png)
# Introduction
linkrot scans PDFs for links written in plaintext, checks whether each link is active, and records any error codes. It then generates a report of its findings. It can also extract references (PDF, URL, DOI, arXiv) and metadata from a PDF.
Check out our sister project, [Rotting Research](https://github.com/marshalmiller/rottingresearch), for a web app implementation of this project.
# Features
- Extracts references and metadata from a given PDF.
- Detects PDF, URL, arXiv, and DOI references.
- Archives valid links using the Internet Archive's Wayback Machine (with the -a flag).
- Checks for valid SSL certificates.
- Finds broken hyperlinks (with the -c flag).
- Outputs results as text or JSON (with the -j flag).
- Extracts the PDF text (with the --text flag).
- Works as a command-line tool or as a Python package.
- Works with local and online PDFs.
# Installation
Grab a copy of the code with pip:
```bash
pip install linkrot
```
# Usage
linkrot can be used to extract info from a PDF in two ways:
- Command line/Terminal tool `linkrot`
- Python library `import linkrot`
## 1. Command Line/Terminal tool
```bash
linkrot [pdf-file-or-url]
```
Run `linkrot -h` to see the help output:
```bash
linkrot -h
```
usage:
```bash
linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf
```
Extract metadata and references from a PDF, and optionally download all
referenced PDFs.
### Arguments
#### positional arguments:
pdf (Filename or URL of a PDF file)
#### optional arguments:
-h, --help (Show this help message and exit)
-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY (Download all referenced PDFs into specified directory)
-c, --check-links (Check for broken links)
-j, --json (Output info as JSON (instead of plain text))
-v, --verbose (Print all references (instead of only PDFs))
-t, --text (Only extract text (no metadata or references))
-a, --archive (Archive active links)
-o OUTPUT_FILE, --output-file OUTPUT_FILE (Output to specified file instead of console)
--version (Show program's version number and exit)
### PDF Samples
For testing purposes, you can find PDF samples in the [shared MEGA folder](https://mega.nz/folder/uwBxVSzS#lpBtSz49E9dqHtmrQwp0Ig).
### Examples
#### Extract text to console
```bash
linkrot https://example.com/example.pdf -t
```
#### Extract text to file
```bash
linkrot https://example.com/example.pdf -t -o pdf-text.txt
```
#### Check Links
```bash
linkrot https://example.com/example.pdf -c
```
## 2. Main Python Library
Import the library:
```python
import linkrot
```
Create an instance of the linkrot class like so:
```python
pdf = linkrot.linkrot("filename-or-url.pdf") #pdf is the instance of the linkrot class
```
Now the following functions can be used to extract specific data from the PDF:
### get_metadata()
Arguments: None
Usage:
```python
metadata = pdf.get_metadata() #pdf is the instance of the linkrot class
```
Return type: Dictionary `<class 'dict'>`
Information Provided: All metadata associated with the PDF, including hidden metadata, such as creation date, creator, and title.
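The returned dictionary can be inspected like any other. A minimal sketch of iterating over it; the keys shown here (Title, Creator, CreationDate) are typical PDF metadata fields used for illustration, not guaranteed output:

```python
# Illustrative only: a metadata dict with typical PDF fields.
# The actual keys depend on the PDF; these names are assumptions.
metadata = {
    "Title": "Example Paper",
    "Creator": "LaTeX with hyperref",
    "CreationDate": "D:20240104021937",
}

# Print each metadata field on its own line.
for key, value in metadata.items():
    print(f"{key}: {value}")
```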
### get_text()
Arguments: None
Usage:
```python
text = pdf.get_text() #pdf is the instance of the linkrot class
```
Return type: String `<class 'str'>`
Information Provided: The entire content of the PDF in string form.
### get_references(reftype=None, sort=False)
Arguments:
- reftype: The type of reference needed. Values: 'pdf', 'url', 'doi', 'arxiv'. Default: all reference types.
- sort: Whether the references should be sorted. Values: True or False. Default: not sorted.
Usage:
```python
references_list = pdf.get_references() #pdf is the instance of the linkrot class
```
Return type: Set `<class 'set'>` of `<linkrot.backends.Reference object>`
linkrot.backends.Reference object has 3 member variables:
- ref: actual URL/PDF/DOI/ARXIV
- reftype: type of reference
- page: page on which it was referenced
Information Provided: All references with their corresponding type and page number.
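The returned set can be filtered and sorted on those member variables. A self-contained sketch, using a hypothetical `Ref` dataclass as a stand-in for `linkrot.backends.Reference` (same three fields as described above):

```python
from dataclasses import dataclass

# Hypothetical stand-in for linkrot.backends.Reference,
# with the three member variables described above.
@dataclass(frozen=True)  # frozen so instances are hashable (set members)
class Ref:
    ref: str
    reftype: str
    page: int

refs = {
    Ref("https://example.com/a.pdf", "pdf", 2),
    Ref("https://example.com", "url", 3),
    Ref("10.1000/xyz123", "doi", 5),
}

# Keep only URL references, ordered by the page they appear on.
urls = sorted((r for r in refs if r.reftype == "url"), key=lambda r: r.page)
for r in urls:
    print(f"p.{r.page}: {r.ref}")
```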
### get_references_as_dict(reftype=None, sort=False)
Arguments:
- reftype: The type of reference needed. Values: 'pdf', 'url', 'doi', 'arxiv'. Default: all reference types.
- sort: Whether the references should be sorted. Values: True or False. Default: not sorted.
Usage:
```python
references_dict = pdf.get_references_as_dict() #pdf is the instance of the linkrot class
```
Return type: Dictionary `<class 'dict'>` with keys 'pdf', 'url', 'doi', 'arxiv' that each have a list `<class 'list'>` of refs of that type.
Information Provided: All references in their corresponding type list.
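A sketch of working with the dict shape described above, with one list per reference type; the values here are made-up examples, not real output:

```python
# Made-up example of the structure get_references_as_dict() returns:
# one list of references per type key.
references_dict = {
    "pdf": ["https://example.com/a.pdf"],
    "url": ["https://example.com", "http://example.org"],
    "doi": ["10.1000/xyz123"],
    "arxiv": ["1501.00001"],
}

# Count references per type and overall.
for reftype, items in references_dict.items():
    print(f"{reftype}: {len(items)}")
total = sum(len(v) for v in references_dict.values())
print(f"{total} references in total")
```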
### download_pdfs(target_dir)
Arguments:
target_dir: The path of the directory to which the reference PDFs should be downloaded
Usage:
```python
pdf.download_pdfs("target-directory") #pdf is the instance of the linkrot class
```
Return type: None
Information Provided: Downloads all the reference PDFs to the specified directory.
## 3. Linkrot downloader functions
Import:
```python
from linkrot.downloader import sanitize_url, get_status_code, check_refs
```
### sanitize_url(url)
Arguments:
url: The url to be sanitized.
Usage:
```python
new_url = sanitize_url(old_url)
```
Return type: String `<class 'str'>`
Information Provided: Returns the URL prefixed with 'http://' if it had no scheme, and ensures it is in UTF-8 format.
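A minimal sketch of that behavior (not the real implementation), useful for seeing what kind of normalization to expect:

```python
def sanitize_url_sketch(url) -> str:
    """Sketch of the behavior described above: decode bytes to a
    UTF-8 str and prefix a scheme when one is missing."""
    if isinstance(url, bytes):
        url = url.decode("utf-8")
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    return url

print(sanitize_url_sketch("example.com"))      # http://example.com
print(sanitize_url_sketch("https://a.test"))   # https://a.test
```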
### get_status_code(url)
Arguments:
url: The url to be checked for its status.
Usage:
```python
status_code = get_status_code(url)
```
Return type: String `<class 'str'>`
Information Provided: The status code of the URL, indicating whether it is active or broken.
### check_refs(refs, verbose=True, max_threads=MAX_THREADS_DEFAULT)
Arguments:
refs: set of linkrot.backends.Reference objects
verbose: whether it should print every reference with its code or just the summary of the link checker
max_threads: number of threads for multithreading
Usage:
```python
check_refs(pdf.get_references()) #pdf is the instance of the linkrot class
```
Return type: None
Information Provided: Prints references with their status code and a summary of all the broken/active links on terminal.
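The multithreaded checking that `check_refs` performs can be sketched with a thread pool. Here `fake_status` is a stub standing in for a real HTTP request so the example runs offline; real code would use `get_status_code`:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_status(url: str) -> int:
    """Stub for an HTTP status check, so this sketch stays offline."""
    return 404 if "broken" in url else 200

urls = ["https://ok.example", "https://broken.example", "https://ok2.example"]

# Check all URLs concurrently, as a link checker would.
with ThreadPoolExecutor(max_workers=4) as pool:
    codes = list(pool.map(fake_status, urls))

broken = [u for u, c in zip(urls, codes) if c >= 400]
print(f"{len(broken)} broken of {len(urls)} checked")
```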
## 4. Linkrot extractor functions
Import:
```python
from linkrot.extractor import extract_urls, extract_doi, extract_arxiv
```
Get pdf text:
```python
text = pdf.get_text() #pdf is the instance of the linkrot class
```
### extract_urls(text)
Arguments:
text: String of text to extract urls from
Usage:
```python
urls = extract_urls(text)
```
Return type: Set `<class 'set'>` of URLs
Information Provided: All URLs in the text
### extract_arxiv(text)
Arguments:
text: String of text to extract arXiv IDs from
Usage:
```python
arxiv = extract_arxiv(text)
```
Return type: Set `<class 'set'>` of arXiv IDs
Information Provided: All arXiv IDs in the text
### extract_doi(text)
Arguments:
text: String of text to extract DOIs from
Usage:
```python
doi = extract_doi(text)
```
Return type: Set `<class 'set'>` of DOIs
Information Provided: All DOIs in the text
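The idea behind these three extractors can be sketched with simplified regular expressions; the patterns below are illustrative assumptions, and the real linkrot patterns are more thorough:

```python
import re

# Simplified patterns for URLs, DOIs, and arXiv IDs (illustrative only).
URL_RE = re.compile(r"https?://[^\s]+")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")
ARXIV_RE = re.compile(r"arXiv:(\d{4}\.\d{4,5})", re.IGNORECASE)

text = "See https://example.com/paper.pdf and doi 10.1000/xyz123 plus arXiv:1501.00001 ."

# Each extractor returns a set of unique matches.
urls = set(URL_RE.findall(text))
dois = set(DOI_RE.findall(text))
arxivs = set(ARXIV_RE.findall(text))
print(urls, dois, arxivs)
```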
# Code of Conduct
To view our code of conduct, please visit our [Code of Conduct page](https://github.com/marshalmiller/rottingresearch/blob/main/code_of_conduct.md).
# License
This program is licensed under the [GPLv3 License](https://github.com/marshalmiller/linkrot/blob/main/LICENSE).