# getpaper
Paper downloader
WARNING: temporarily broken by the last update
# Getting started
Install the library with:
```bash
pip install getpaper
```
If you want to edit the getpaper repository, consider installing it locally in editable mode:
```bash
pip install -e .
```
On Linux systems you may need to make sure that build essentials are installed:
```bash
sudo apt install build-essential
```
It is also recommended to use micromamba, conda, anaconda, or another environment manager to avoid bloating the system Python with too many dependencies.
# Usage
## Downloading papers
After installation you can either import the library into your Python code or use the console scripts.
If you install from pip, the _download_ command maps to getpaper/download.py, _parse_ to getpaper/parse.py, and _index_ to getpaper/index.py.
```bash
download download_pubmed --pubmed 22266545 --folder "data/output/test/papers" --name pmid --loglevel info --scihub_on_fail True
```
This downloads the paper with the given PubMed ID into the folder 'papers', using the PubMed ID as the file name.
```bash
download download_doi --doi 10.1038/s41597-020-00710-z --folder "data/output/test/papers" --scihub_on_fail True
```
This downloads the paper with the given DOI into the folder 'papers'; since --name is not specified, the DOI is used as the file name.
It is also possible to download many papers in parallel with the download_papers(dois: List[str], destination: Path, threads: int) function, for example:
```python
from pathlib import Path
from typing import List
from getpaper.download import download_papers
dois: List[str] = ["10.3390/ijms22031073", "10.1038/s41597-020-00710-z", "wrong"]
destination: Path = Path("./data/output/test/papers").absolute().resolve()
threads: int = 5
# download_papers returns a tuple: (OrderedDict of doi -> pdf path, list of failed dois)
successful, failed = download_papers(dois, destination, threads)
```
Here the function returns a tuple: an OrderedDict[str, Path] mapping each successfully downloaded DOI to its paper path, and a List[str] of failed DOIs. In the current example:
```
(OrderedDict([('10.3390/ijms22031073',
PosixPath('/home/antonkulaga/sources/getpaper/notebooks/data/output/test/papers/10.3390/ijms22031073.pdf')),
('10.1038/s41597-020-00710-z',
PosixPath('/home/antonkulaga/sources/getpaper/notebooks/data/output/test/papers/10.1038/s41597-020-00710-z.pdf'))]),
['wrong'])
```
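As a rough sketch of how this return value might be consumed (the retry pass below is hypothetical and not part of getpaper; it simply calls download_papers again on the failed DOIs):
```python
from pathlib import Path
from getpaper.download import download_papers

destination = Path("./data/output/test/papers").absolute().resolve()
successful, failed = download_papers(["10.3390/ijms22031073", "wrong"], destination, 5)

# Report where each successfully downloaded paper ended up
for doi, pdf_path in successful.items():
    print(f"{doi} -> {pdf_path}")

# Hypothetical retry pass: some failures are transient, so try the failed DOIs once more
if failed:
    retried, still_failed = download_papers(failed, destination, 5)
    print(f"Could not download: {still_failed}")
```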
The same function can be called from the command line:
```bash
download download_papers --dois "10.3390/ijms22031073" --dois "10.1038/s41597-020-00710-z" --dois "wrong" --folder "data/output/test/papers" --threads 5
```
You can also call the download.py script directly:
```bash
python getpaper/download.py download_papers --dois "10.3390/ijms22031073" --dois "10.1038/s41597-020-00710-z" --dois "wrong" --folder "data/output/test/papers" --threads 5
```
## Parsing the papers
You can parse the downloaded papers with the unstructured library. For example, if the papers are in the folder data/output/test/papers, you can run:
```bash
getpaper/parse.py parse_folder --folder data/output/test/papers --cores 5
```
You can also switch between different PDF parsers:
```bash
getpaper/parse.py parse_folder --folder data/output/test/papers --parser pdf_miner --cores 5
```
You can also parse papers on a per-file basis, for example:
```bash
getpaper/parse.py parse_paper --paper data/output/test/papers/10.3390/ijms22031073.pdf
```
## Combining parsing and downloading
```bash
getpaper/parse.py download_and_parse --doi 10.1038/s41597-020-00710-z
```
## Count tokens
To decide how to split texts and estimate how much embeddings will cost, it is useful to compute the number of tokens:
```bash
getpaper/parse.py count_tokens --path /home/antonkulaga/sources/non-animal-models/data/inputs/datasets
```
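If you prefer to estimate token counts in plain Python, a minimal standalone sketch using the tiktoken library (not part of getpaper; the per-token price is a placeholder assumption, check your embedding provider's current pricing) could look like this:
```python
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with a tiktoken encoding; cl100k_base is used by many OpenAI models."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

text = "Example paragraph taken from a parsed paper."
n_tokens = count_tokens(text)
price_per_1k_tokens = 0.0001  # hypothetical embedding price in USD per 1000 tokens
print(f"{n_tokens} tokens, estimated embedding cost: ${n_tokens / 1000 * price_per_1k_tokens:.6f}")
```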
# Examples
You can run examples.py to see usage examples.
# Additional requirements
index.py has local dependencies on other modules; for this reason, if you are running it inside the getpaper project folder, consider installing the package locally:
```bash
pip install -e .
```
Detectron2 is required for using models from the layoutparser model zoo but is not automatically installed with this package.
For macOS and Linux, build from source with:
```bash
pip install 'git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2'
```
# Notes
Semantic Scholar sometimes changes its APIs, so if the library stops working for you, please open an issue.
Since version 0.3.0, all indexing features have been moved to the indexpaper library.