linkrot


| Field | Value |
| --- | --- |
| Name | linkrot |
| Version | 5.2.1 |
| Summary | Extract metadata and URLs from PDF files |
| upload_time | 2024-01-04 02:19:37 |
| docs_url | None |
| requires_python | >=3.8 |
| keywords | linkrot, pdf, reference |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

![linkrot logo](https://github.com/marshalmiller/linkrot/blob/6e6fb45239f8d06e89671e2ec68a11629747355d/branding/Asset%207@4x.png)
# Introduction

linkrot scans PDFs for links written in plaintext and checks whether each one is active or returns an error code, then generates a report of its findings. It also extracts references (PDF, URL, DOI, arXiv) and metadata from a PDF.

Check out our sister project, [Rotting Research](https://github.com/marshalmiller/rottingresearch), for a web app implementation of this project.

# Features

- Extracts references and metadata from a given PDF.
- Detects PDF, URL, arXiv and DOI references.
- Archives valid links using the Internet Archive's Wayback Machine (using the -a flag).
- Checks for a valid SSL certificate.
- Finds broken hyperlinks (using the -c flag).
- Outputs as text or JSON (using the -j flag).
- Extracts the PDF text (using the --text flag).
- Works as a command-line tool or a Python package.
- Works with local and online PDFs.

# Installation

Grab a copy of the code with pip:
 
```bash
pip install linkrot
```

# Usage

linkrot can be used to extract info from a PDF in two ways:
- Command line/Terminal tool `linkrot`
- Python library `import linkrot`

## 1. Command Line/Terminal tool

```bash
linkrot [pdf-file-or-url]
```

Run `linkrot -h` to see the help output:
```bash
linkrot -h
```

Usage:
```bash
linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-a] [-o OUTPUT_FILE] [--version] pdf
```

Extract metadata and references from a PDF, and optionally download all
referenced PDFs.

### Arguments

#### positional arguments:
- `pdf`: Filename or URL of a PDF file

#### optional arguments:
- `-h, --help`: Show this help message and exit
- `-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY`: Download all referenced PDFs into the specified directory
- `-c, --check-links`: Check for broken links
- `-j, --json`: Output information as JSON (instead of plain text)
- `-v, --verbose`: Print all references (instead of only PDFs)
- `-t, --text`: Only extract text (no metadata or references)
- `-a, --archive`: Archive active links
- `-o OUTPUT_FILE, --output-file OUTPUT_FILE`: Output to the specified file instead of the console
- `--version`: Show program's version number and exit

### PDF Samples

For testing purposes, you can find PDF samples in the [shared MEGA folder](https://mega.nz/folder/uwBxVSzS#lpBtSz49E9dqHtmrQwp0Ig).

### Examples

#### Extract text to console
```bash
linkrot https://example.com/example.pdf -t
```

#### Extract text to file
```bash
linkrot https://example.com/example.pdf -t -o pdf-text.txt
```

#### Check Links
```bash
linkrot https://example.com/example.pdf -c
```
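
When scripting the command-line tool from Python, the flags shown in this section can be assembled into an argument list. The following is a minimal sketch: `build_linkrot_command` is a hypothetical helper, not part of linkrot, and it only builds the argument list without running anything.

```python
def build_linkrot_command(pdf, check_links=False, as_json=False,
                          text_only=False, output_file=None):
    """Build an argument list for the linkrot CLI (hypothetical helper)."""
    command = ["linkrot", pdf]
    if check_links:
        command.append("-c")  # check for broken links
    if as_json:
        command.append("-j")  # JSON instead of plain text
    if text_only:
        command.append("-t")  # only extract text
    if output_file is not None:
        command.extend(["-o", output_file])  # write to a file, not the console
    return command


print(build_linkrot_command("https://example.com/example.pdf",
                            text_only=True, output_file="pdf-text.txt"))
```

With linkrot installed, the resulting list can be handed to `subprocess.run(command, check=True)`.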

## 2. Main Python Library

Import the library: 
```python
import linkrot
```

Create an instance of the linkrot class like so: 
```python
pdf = linkrot.linkrot("filename-or-url.pdf") #pdf is the instance of the linkrot class
```

Now the following functions can be used to extract specific data from the PDF:

### get_metadata()
Arguments: None

Usage: 
```python
metadata = pdf.get_metadata() #pdf is the instance of the linkrot class
``` 

Return type: Dictionary `<class 'dict'>`

Information Provided: All metadata, including secret metadata, associated with the PDF, such as Creation date, Creator, and Title.

### get_text()
Arguments: None

Usage: 
```python
text = pdf.get_text() #pdf is the instance of the linkrot class
```

Return type: String `<class 'str'>`

Information Provided: The entire content of the PDF in string form.

### get_references(reftype=None, sort=False)
Arguments: 

	reftype: The type of reference that is needed 
		 values: 'pdf', 'url', 'doi', 'arxiv'. 
		 default: Provides all reference types.
	
	sort: Whether reference should be sorted or not
	      values: True or False. 
	      default: Is not sorted.
	
Usage: 
```python
references_list = pdf.get_references() #pdf is the instance of the linkrot class
```

Return type: Set `<class 'set'>` of `linkrot.backends.Reference` objects

A `linkrot.backends.Reference` object has 3 member variables:
- ref: the actual URL/PDF/DOI/arXiv reference
- reftype: the type of reference
- page: the page on which it was referenced

Information Provided: All references with their corresponding type and page number. 
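
The three member variables make it easy to regroup the references, for example by page. The sketch below uses a `namedtuple` as a stand-in for `linkrot.backends.Reference` so that it runs without a PDF; `group_by_page` and the sample data are made up for illustration.

```python
from collections import defaultdict, namedtuple

# Stand-in for linkrot.backends.Reference, carrying only the three
# member variables described above (ref, reftype, page).
Reference = namedtuple("Reference", ["ref", "reftype", "page"])


def group_by_page(references):
    """Map each page number to a sorted list of (reftype, ref) pairs."""
    pages = defaultdict(list)
    for reference in references:
        pages[reference.page].append((reference.reftype, reference.ref))
    return {page: sorted(items) for page, items in sorted(pages.items())}


# Made-up sample data for illustration only.
sample = {
    Reference("https://example.com/a.pdf", "pdf", 2),
    Reference("https://example.com", "url", 1),
    Reference("10.1000/182", "doi", 2),
}
for page, items in group_by_page(sample).items():
    print(page, items)
```

The same function should work on the set returned by `pdf.get_references()`, assuming the objects expose `ref`, `reftype` and `page` as listed above.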

### get_references_as_dict(reftype=None, sort=False)
Arguments: 

	reftype: The type of reference that is needed 
		 values: 'pdf', 'url', 'doi', 'arxiv'. 
		 default: Provides all reference types.
	
	sort: Whether reference should be sorted or not
	      values: True or False. 
	      default: Is not sorted.
	
Usage: 
```python
references_dict = pdf.get_references_as_dict() #pdf is the instance of the linkrot class
```

Return type: Dictionary `<class 'dict'>` with keys 'pdf', 'url', 'doi', 'arxiv' that each have a list `<class 'list'>` of refs of that type.

Information Provided: All references in their corresponding type list.
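
Because the result is a plain dictionary of lists, it can be summarized or serialized with the standard library. A small sketch, using a made-up dictionary shaped like the return value described above; `count_references` is a hypothetical helper.

```python
import json


def count_references(references_dict):
    """Return the number of references found for each reference type."""
    return {reftype: len(refs) for reftype, refs in references_dict.items()}


# Made-up sample shaped like the return value described above.
sample = {
    "pdf": ["https://example.com/a.pdf"],
    "url": ["https://example.com", "https://example.org"],
    "doi": [],
    "arxiv": ["1234.5678"],
}
print(json.dumps(count_references(sample), indent=2))
```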


### download_pdfs(target_dir)
Arguments: 

	target_dir: The path of the directory to which the reference PDFs should be downloaded 
	
Usage: 
```python
pdf.download_pdfs("target-directory") #pdf is the instance of the linkrot class
```

Return type: None

Information Provided: Downloads all the reference PDFs to the specified directory.

## 3. Linkrot downloader functions

Import:
```python
from linkrot.downloader import sanitize_url, get_status_code, check_refs
```
### sanitize_url(url)
Arguments: 

	url: The url to be sanitized.
	
Usage: 
```python
new_url = sanitize_url(old_url) 
```

Return type: String `<class 'str'>`

Information Provided: The URL, prefixed with 'http://' if that prefix was missing and guaranteed to be in UTF-8 format.
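
To make the described behavior concrete, here is a rough standalone sketch. It is not linkrot's implementation: `sanitize_url_sketch` is a hypothetical name, and treating `https://` as an acceptable existing prefix is an assumption.

```python
def sanitize_url_sketch(url):
    """Prefix 'http://' when missing and round-trip the URL through UTF-8."""
    # Assumption: URLs that already start with http:// or https:// are kept.
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    # Encoding raises UnicodeEncodeError for text that cannot be
    # represented in UTF-8 (for example lone surrogates).
    return url.encode("utf-8").decode("utf-8")


print(sanitize_url_sketch("example.com/paper.pdf"))
print(sanitize_url_sketch("https://example.com"))
```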

### get_status_code(url)
Arguments: 

	url: The url to be checked for its status. 
	
Usage: 
```python
status_code = get_status_code(url) 
```

Return type: String `<class 'str'>`

Information Provided: The status code of the URL, which shows whether it is active or broken.
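
Since the status comes back as a string, a caller usually has to decide what counts as active. The sketch below shows one possible rule; it assumes the string holds a numeric HTTP status when the request succeeded, which the text above does not spell out, and `is_active` is a hypothetical helper.

```python
def is_active(status_code):
    """Treat a 2xx status string as active; anything else as broken."""
    try:
        return 200 <= int(status_code) < 300
    except ValueError:
        # Assumption: non-numeric strings describe an error, so the
        # link is counted as broken.
        return False


for code in ("200", "404", "timeout"):
    print(code, is_active(code))
```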

### check_refs(refs, verbose=True, max_threads=MAX_THREADS_DEFAULT)
Arguments: 

	refs: set of linkrot.backends.Reference objects
	verbose: whether it should print every reference with its code or just the summary of the link checker
	max_threads: number of threads for multithreading
	
Usage: 
```python
check_refs(pdf.get_references()) #pdf is the instance of the linkrot class
```

Return type: None

Information Provided: Prints the references with their status codes and a summary of all the broken/active links to the terminal.
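
The `max_threads` argument implies that the checks run concurrently. The general pattern can be sketched with the standard library's thread pool; this is an illustration of the idea, not linkrot's code. `check_all` and `fake_status` are hypothetical, and `fake_status` replaces real network access so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, get_status, max_threads=4):
    """Call get_status(url) for every URL on a pool of worker threads."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() preserves input order, so codes line up with urls.
        codes = list(pool.map(get_status, urls))
    return dict(zip(urls, codes))


def fake_status(url):
    """Offline stand-in for a real status lookup (no network access)."""
    return "200" if url.endswith(".org") else "404"


results = check_all(["https://example.org", "https://example.com/gone"], fake_status)
for url, code in results.items():
    print(code, url)
```

In real use, `get_status_code` from `linkrot.downloader` could be passed in place of `fake_status`, since it takes a URL and returns a string.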

## 4. Linkrot extractor functions

Import:
```python
from linkrot.extractor import extract_urls, extract_doi, extract_arxiv
```

Get pdf text:
```python
text = pdf.get_text() #pdf is the instance of the linkrot class
```

### extract_urls(text)
Arguments: 

	text: String of text to extract urls from
	
Usage: 
```python
urls = extract_urls(text)
```

Return type: Set `<class 'set'>` of URLs

Information Provided: All URLs in the text
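
For readers curious how plaintext URL extraction can work, here is a deliberately simplified regex sketch. It is not the pattern linkrot uses: it only matches URLs with an explicit `http://` or `https://` scheme, and `find_urls` is a hypothetical name.

```python
import re

# Simplified pattern: only matches URLs with an explicit http(s) scheme.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def find_urls(text):
    """Return the set of http(s) URLs in text, minus trailing punctuation."""
    return {match.rstrip(".,;:)") for match in URL_PATTERN.findall(text)}


sample = "See https://example.com/paper.pdf, or (http://example.org)."
print(sorted(find_urls(sample)))
```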

### extract_arxiv(text)
Arguments: 

	text: String of text to extract arXivs from
	
Usage: 
```python
arxiv = extract_arxiv(text)
```

Return type: Set `<class 'set'>` of arXiv identifiers

Information Provided: All arXiv identifiers in the text
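
As an illustration of what such an extraction involves, the sketch below matches identifiers written as `arXiv:` followed by four digits, a dot, and four or five digits, with an optional version suffix. This is a simplified assumption about the format and not linkrot's own pattern; `find_arxiv_ids` and the sample identifiers are made up.

```python
import re

# Simplified pattern for identifiers such as "arXiv:1234.5678" or
# "arXiv:1234.56789v2"; other identifier styles are not covered.
ARXIV_PATTERN = re.compile(r"arXiv:\s?(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def find_arxiv_ids(text):
    """Return the set of arXiv identifiers found in text."""
    return set(ARXIV_PATTERN.findall(text))


# Made-up identifiers for illustration only.
sample = "Preprints: arXiv:1234.5678 and arxiv: 2101.01234v2."
print(sorted(find_arxiv_ids(sample)))
```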

### extract_doi(text)
Arguments: 

	text: String of text to extract DOIs from
	
Usage: 
```python
doi = extract_doi(text)
```

Return type: Set `<class 'set'>` of DOIs

Information Provided: All DOIs in the text
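
A DOI starts with `10.`, a numeric registrant code, and a slash, which makes a rough regex sketch possible. The pattern below is a simplification for illustration, not the one linkrot ships; `find_dois` and the sample DOIs are made up.

```python
import re

# Simplified pattern: "10.", 4 to 9 digits, a slash, then any run of
# characters up to whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_dois(text):
    """Return the set of DOIs in text, minus trailing punctuation."""
    return {match.rstrip(".,;") for match in DOI_PATTERN.findall(text)}


# Made-up DOIs for illustration only.
sample = "Cited as doi:10.1000/182 and https://doi.org/10.1234/abc.def-5."
print(sorted(find_dois(sample)))
```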

# Code of Conduct
To view our code of conduct, please visit our [Code of Conduct page](https://github.com/marshalmiller/rottingresearch/blob/main/code_of_conduct.md).

# License
This program is licensed under the [GPLv3 License](https://github.com/marshalmiller/linkrot/blob/main/LICENSE).

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "linkrot",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "linkrot,pdf,reference",
    "author": "",
    "author_email": "Marshal Miller <marshal@rottingresearch.org>",
    "download_url": "https://files.pythonhosted.org/packages/fb/86/9d76739582bc8a0f2e5e2f0400c086ee89bfa4917d0206e29339aceef190/linkrot-5.2.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Extract metadata and URLs from PDF files",
    "version": "5.2.1",
    "project_urls": {
        "Homepage": "https://github.com/rottingresearch/linkrot"
    },
    "split_keywords": [
        "linkrot",
        "pdf",
        "reference"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "452adaeb932e2c36e890635898f5ac0c8840bea0fa6e42c179c96208b7d1294f",
                "md5": "c5175bde7889520b1669fc418f5414e1",
                "sha256": "68502965636ac2e2e5e7d11b3b05084fe3568c73a78878d28cb59f39861791b3"
            },
            "downloads": -1,
            "filename": "linkrot-5.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c5175bde7889520b1669fc418f5414e1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 30081,
            "upload_time": "2024-01-04T02:19:35",
            "upload_time_iso_8601": "2024-01-04T02:19:35.598817Z",
            "url": "https://files.pythonhosted.org/packages/45/2a/daeb932e2c36e890635898f5ac0c8840bea0fa6e42c179c96208b7d1294f/linkrot-5.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fb869d76739582bc8a0f2e5e2f0400c086ee89bfa4917d0206e29339aceef190",
                "md5": "6607fd02bfc5cfd7697212f7eb6d0a7a",
                "sha256": "6544d6f6004547ba12f03476f7931082d3241f12d81de899bf0e3a3ead1459ff"
            },
            "downloads": -1,
            "filename": "linkrot-5.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "6607fd02bfc5cfd7697212f7eb6d0a7a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 27074,
            "upload_time": "2024-01-04T02:19:37",
            "upload_time_iso_8601": "2024-01-04T02:19:37.631857Z",
            "url": "https://files.pythonhosted.org/packages/fb/86/9d76739582bc8a0f2e5e2f0400c086ee89bfa4917d0206e29339aceef190/linkrot-5.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-04 02:19:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "rottingresearch",
    "github_project": "linkrot",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "linkrot"
}
        