# getpaper
Paper downloader
WARNING: temporarily broken by the last update
# Getting started
Install the library with:
```bash
pip install getpaper
```
If you want to edit the getpaper repository, consider installing it locally in editable mode:
```bash
pip install -e .
```
On Linux systems you may need to make sure that build essentials are installed:
```bash
sudo apt install build-essential
```
It is also recommended to use micromamba, conda, anaconda, or another environment manager to avoid bloating the system Python with too many dependencies.
# Usage
## Downloading papers
After installation you can either import the library into your Python code or use the console scripts.
If you install from pip, the _download_ command maps to getpaper/download.py, _parse_ to getpaper/parse.py, and _index_ to getpaper/index.py.
```bash
download download_pubmed --pubmed 22266545 --folder "data/output/test/papers" --name pmid --loglevel info --scihub_on_fail True
```
This downloads the paper with the given PubMed ID into the folder 'papers', using the PubMed ID as the file name.
```bash
download download_doi --doi 10.1038/s41597-020-00710-z --folder "data/output/test/papers" --scihub_on_fail True
```
This downloads the paper with the given DOI into the folder 'papers'; since --name is not specified, the DOI is used as the file name.
It is also possible to download many papers in parallel with the download_papers(dois: List[str], destination: Path, threads: int) function, for example:
```python
from pathlib import Path
from typing import List
from getpaper.download import download_papers
dois: List[str] = ["10.3390/ijms22031073", "10.1038/s41597-020-00710-z", "wrong"]
destination: Path = Path("./data/output/test/papers").absolute().resolve()
threads: int = 5
# download_papers returns a tuple: (OrderedDict of doi -> pdf path, list of failed dois)
successful, failed = download_papers(dois, destination, threads)
```
Here the function returns a tuple: an OrderedDict[str, Path] mapping each successfully downloaded DOI to its paper path, and a List[str] of failed DOIs. In the current example:
```
(OrderedDict([('10.3390/ijms22031073',
PosixPath('/home/antonkulaga/sources/getpaper/notebooks/data/output/test/papers/10.3390/ijms22031073.pdf')),
('10.1038/s41597-020-00710-z',
PosixPath('/home/antonkulaga/sources/getpaper/notebooks/data/output/test/papers/10.1038/s41597-020-00710-z.pdf'))]),
['wrong'])
```
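As a rough sketch of how this return value might be consumed (the retry pass below is hypothetical and not part of getpaper; it simply calls download_papers again on the failed DOIs):
```python
from pathlib import Path
from getpaper.download import download_papers

destination = Path("./data/output/test/papers").absolute().resolve()
successful, failed = download_papers(["10.3390/ijms22031073", "wrong"], destination, 5)

# Report where each successfully downloaded paper ended up
for doi, pdf_path in successful.items():
    print(f"{doi} -> {pdf_path}")

# Hypothetical retry pass: some failures are transient, so try the failed DOIs once more
if failed:
    retried, still_failed = download_papers(failed, destination, 5)
    print(f"Could not download: {still_failed}")
```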
The same function can be called from the command line:
```bash
download download_papers --dois "10.3390/ijms22031073" --dois "10.1038/s41597-020-00710-z" --dois "wrong" --folder "data/output/test/papers" --threads 5
```
You can also call the download.py script directly:
```bash
python getpaper/download.py download_papers --dois "10.3390/ijms22031073" --dois "10.1038/s41597-020-00710-z" --dois "wrong" --folder "data/output/test/papers" --threads 5
```
## Parsing the papers
You can parse the downloaded papers with the unstructured library. For example, if the papers are in the folder data/output/test/papers, you can run:
```bash
getpaper/parse.py parse_folder --folder data/output/test/papers --cores 5
```
You can also switch between different PDF parsers:
```bash
getpaper/parse.py parse_folder --folder data/output/test/papers --parser pdf_miner --cores 5
```
You can also parse papers on a per-file basis, for example:
```bash
getpaper/parse.py parse_paper --paper data/output/test/papers/10.3390/ijms22031073.pdf
```
## Combining parsing and downloading
```bash
getpaper/parse.py download_and_parse --doi 10.1038/s41597-020-00710-z
```
## Count tokens
To decide how to split texts and estimate how much embeddings will cost, it is useful to compute the number of tokens:
```bash
getpaper/parse.py count_tokens --path /home/antonkulaga/sources/non-animal-models/data/inputs/datasets
```
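If you prefer to estimate token counts in plain Python, a minimal standalone sketch using the tiktoken library (not part of getpaper; the per-token price is a placeholder assumption, check your embedding provider's current pricing) could look like this:
```python
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with a tiktoken encoding; cl100k_base is used by many OpenAI models."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

text = "Example paragraph taken from a parsed paper."
n_tokens = count_tokens(text)
price_per_1k_tokens = 0.0001  # hypothetical embedding price in USD per 1000 tokens
print(f"{n_tokens} tokens, estimated embedding cost: ${n_tokens / 1000 * price_per_1k_tokens:.6f}")
```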
# Examples
You can run examples.py to see usage examples.
# Additional requirements
index.py has local dependencies on other modules; for this reason, if you are running it inside the getpaper project folder, consider installing the package locally:
```bash
pip install -e .
```
Detectron2 is required for using models from the layoutparser model zoo but is not automatically installed with this package.
For macOS and Linux, build from source with:
```bash
pip install 'git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2'
```
# Notes
Semantic Scholar sometimes changes its APIs, so if the library stops working for you, please open an issue.
Since version 0.3.0, all indexing features have been moved to the indexpaper library.