getpaper

Name: getpaper
Version: 0.4.9
Summary: getpaper - papers download made easy!
Author: antonkulaga (Anton Kulaga)
Keywords: python, utils, files, papers, download
Upload time: 2024-02-25 12:28:16
Requirements: No requirements were recorded.

# getpaper
Paper downloader

WARNING: temporarily broken by the last update

# Getting started

Install the library with:
```bash
pip install getpaper
```
If you want to edit the getpaper repository, consider installing it locally in editable mode:
```bash
pip install -e .
```

On Linux systems you sometimes need to check that build essentials are installed:
```bash
sudo apt install build-essential
```
It is also recommended to use micromamba, conda, anaconda, or another environment manager to avoid bloating the system Python with too many dependencies.

# Usage
## Downloading papers

After installation you can either import the library into your Python code or use the console scripts.

If you install from pip, the _download_ command runs getpaper/download.py, _parse_ runs getpaper/parse.py, and _index_ runs getpaper/index.py.

```bash
download download_pubmed --pubmed 22266545 --folder "data/output/test/papers" --name pmid
```
Downloads the paper with the given PubMed ID into the papers folder and uses the PubMed ID as the file name.
```bash
download download_doi --doi 10.1038/s41597-020-00710-z --folder "data/output/test/papers"
```
Downloads the paper with the given DOI into the papers folder; since --name is not specified, the DOI is used as the name.
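
As the example output further below suggests, when no explicit name is given the DOI itself becomes the file name, so the '/' inside a DOI produces a subfolder. A minimal sketch of that path derivation (an illustration only, not getpaper's actual code):

```python
from pathlib import Path

def derived_paper_path(folder: Path, doi: str) -> Path:
    # When --name is omitted, the DOI is used as the file name.
    # The '/' in a DOI means the paper lands in a subfolder,
    # e.g. papers/10.3390/ijms22031073.pdf.
    return folder / f"{doi}.pdf"

print(derived_paper_path(Path("data/output/test/papers"), "10.1038/s41597-020-00710-z"))
```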

It is also possible to download many papers in parallel with the download_papers(dois: List[str], destination: Path, threads: int) function, for example:
```python
from pathlib import Path
from typing import List
from getpaper.download import download_papers
dois: List[str] = ["10.3390/ijms22031073", "10.1038/s41597-020-00710-z", "wrong"]
destination: Path = Path("./data/output/test/papers").absolute().resolve()
threads: int = 5
results = download_papers(dois, destination, threads)
successful = results[0]
failed = results[1]
```
Here results is a tuple: an OrderedDict[str, Path] mapping each successfully downloaded DOI to its paper path, and a List[str] of failed DOIs. In the current example:
```
(OrderedDict([('10.3390/ijms22031073',
               PosixPath('/home/antonkulaga/sources/getpaper/notebooks/data/output/test/papers/10.3390/ijms22031073.pdf')),
              ('10.1038/s41597-020-00710-z',
               PosixPath('/home/antonkulaga/sources/getpaper/notebooks/data/output/test/papers/10.1038/s41597-020-00710-z.pdf'))]),
 ['wrong'])
```
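Because download_papers returns this (OrderedDict, list) pair, downstream code can unpack and report it directly. A small helper sketch; the data here is hand-built to mirror the example output above and does not hit the network:

```python
from collections import OrderedDict
from pathlib import Path

def summarize(results):
    # results has the shape returned by download_papers:
    # (OrderedDict mapping DOI -> downloaded PDF path, list of failed DOIs)
    succeeded, failed = results
    for doi, path in succeeded.items():
        print(f"OK   {doi} -> {path}")
    for doi in failed:
        print(f"FAIL {doi}")
    return len(succeeded), len(failed)

example = (
    OrderedDict({"10.3390/ijms22031073": Path("papers/10.3390/ijms22031073.pdf")}),
    ["wrong"],
)
counts = summarize(example)
```

A natural follow-up is to feed the failed list back into download_papers for a second attempt, since transient network errors sometimes resolve on retry.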
The same function can be called from the command line:
```bash
download download_papers --dois "10.3390/ijms22031073" --dois "10.1038/s41597-020-00710-z" --dois "wrong" --folder "data/output/test/papers" --threads 5
```
You can also call download.py script directly:
```bash
python getpaper/download.py download_papers --dois "10.3390/ijms22031073" --dois "10.1038/s41597-020-00710-z" --dois "wrong" --folder "data/output/test/papers" --threads 5
```

## Parsing the papers

You can parse the downloaded papers with the unstructured library. For example, if the papers are in the folder data/output/test/papers, you can run:
```bash
getpaper/parse.py parse_folder --folder data/output/test/papers --cores 5
```
You can also switch between different PDF parsers:
```bash
getpaper/parse.py parse_folder --folder data/output/test/papers --parser pdf_miner --cores 5
```
You can also parse papers on a per-file basis, for example:
```bash
getpaper/parse.py parse_paper --paper data/output/test/papers/10.3390/ijms22031073.pdf
```

## Combining parsing and downloading

```bash
getpaper/parse.py download_and_parse --doi 10.1038/s41597-020-00710-z
```

## Count tokens

To decide how finely to split texts and to estimate how much embeddings will cost, it is useful to count tokens:

```bash
getpaper/parse.py count_tokens --path /home/antonkulaga/sources/non-animal-models/data/inputs/datasets
```
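The exact tokenizer behind count_tokens is not shown here; as a rough, library-independent heuristic, English text averages about four characters per token. A hypothetical stdlib-only sketch (an estimate, not getpaper's implementation):

```python
from pathlib import Path

def approx_token_count(text: str) -> int:
    # Heuristic only: English text averages roughly 4 characters per token.
    # This approximates, but does not reproduce, getpaper's count_tokens.
    return max(1, len(text) // 4)

def approx_tokens_in_folder(folder: Path) -> int:
    # Sum the estimate over every .txt file under the folder.
    return sum(
        approx_token_count(p.read_text(encoding="utf-8", errors="ignore"))
        for p in folder.rglob("*.txt")
    )

print(approx_token_count("Paper downloader made easy"))
```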
# Examples

You can run examples.py to see usage examples.

# Additional requirements

index.py has local dependencies on other modules; for this reason, if you are running it inside the getpaper project folder, consider installing the package locally:
```bash
pip install -e .
```

Detectron2 is required for using models from the layoutparser model zoo but is not automatically installed with this package.
For macOS and Linux, build it from source with:

```bash
pip install 'git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2'
```

# Note

Since version 0.3.0, all indexing features have been moved to the indexpaper library.

            
