[![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.me/ferru97)
# NEWS: PyPaperBot development is back on track!
### Join the [Telegram](https://t.me/pypaperbotdatawizards) channel to stay updated, report bugs, or request custom data mining scripts.
---
# PyPaperBot
PyPaperBot is a Python tool for **downloading scientific papers and BibTeX entries** using Google Scholar, Crossref, SciHub, and SciDB.
The tool tries to download papers from several sources, such as the PDFs provided by Scholar, Scholar-related links, and SciHub.
PyPaperBot can also download the **BibTeX** entry of each paper.
## Features
- Download papers given a query
- Download papers given their DOIs
- Download papers given a Google Scholar link
- Generate the BibTeX of the downloaded papers
- Filter downloaded papers by year, journal, and number of citations
## Installation
### For normal users
Use `pip` to install from PyPI:
```bash
pip install PyPaperBot
```
If on Windows you get an error saying *error: Microsoft Visual C++ 14.0 is required*, try installing [Microsoft C++ Build Tools](https://visualstudio.microsoft.com/it/visual-cpp-build-tools/) or [Visual Studio](https://visualstudio.microsoft.com/it/downloads/)
### For Termux users
Since numpy cannot be installed directly on Termux, set up the pointless repository first, then install the dependencies:
```bash
pkg install wget
wget https://its-pointless.github.io/setup-pointless-repo.sh
bash setup-pointless-repo.sh  # enables the repository that provides numpy
pkg install numpy
export CFLAGS="-Wno-deprecated-declarations -Wno-unreachable-code"
pip install pandas
```
and finally:
```bash
pip install PyPaperBot
```
## How to use
PyPaperBot arguments:
| Argument | Description | Type |
|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------|
| \-\-query | Query to run on Google Scholar, or a Google Scholar page link | string |
| \-\-skip-words | Comma-separated list of words, e.g. "word1,word2 word3,word4". Articles containing any of these words in the title or Google Scholar summary are ignored | string |
| \-\-cites | Paper ID (from the Scholar address bar when you search citations) to download only the papers citing that paper | string |
| \-\-doi | DOI of the paper to download (this option uses only SciHub) | string |
| \-\-doi-file | Path of a .txt file containing the DOIs of the papers to download | string |
| \-\-scholar-pages | Number or range of Google Scholar pages to inspect. Each page holds a maximum of 10 papers | string |
| \-\-dwn-dir | Directory path in which to save the results | string |
| \-\-min-year | Minimum publication year of the papers to download | int |
| \-\-max-dwn-year | Maximum number of papers to download, sorted by year | int |
| \-\-max-dwn-cites | Maximum number of papers to download, sorted by number of citations | int |
| \-\-journal-filter | CSV file path of the journal filter (see the Note section below) | string |
| \-\-restrict | 0: download only BibTeX - 1: download only paper PDFs | int |
| \-\-scihub-mirror | Mirror for downloading papers from SciHub. If not set, it is selected automatically | string |
| \-\-annas-archive-mirror | Mirror for downloading papers from Anna's Archive (SciDB). If not set, https://annas-archive.se is used | string |
| \-\-scholar-results | Number of Scholar results to be downloaded when \-\-scholar-pages=1 | int |
| \-\-proxy | Comma-separated list of proxies to use. Specify the protocol for each | string |
| \-\-single-proxy | Use a single proxy. Recommended if using \-\-proxy gives errors | string |
| \-\-selenium-chrome-version | First three digits of the Chrome version installed on your machine. If provided, Selenium is used for the Scholar search; it helps avoid bot detection, but Chrome must be installed | int |
| \-\-use-doi-as-filename | If provided, files are saved using the unique DOI as the filename rather than the default paper title | bool |
| \-h | Shows the help | -- |
### Note
You can use only one of the arguments in each of the following groups:
- *\-\-query*, *\-\-doi-file*, and *\-\-doi*
- *\-\-max-dwn-year* and *\-\-max-dwn-cites*

One of the arguments *\-\-query*, *\-\-doi-file*, or *\-\-doi* is mandatory.
The argument *\-\-scholar-pages* is mandatory when using *\-\-query*.
The argument *\-\-dwn-dir* is mandatory.
The argument *\-\-journal-filter* requires the path of a CSV containing a list of journal names, each paired with a boolean indicating whether to consider that journal (0: don't consider / 1: consider). [Example](https://github.com/ferru97/PyPaperBot/blob/master/file_examples/jurnals.csv)
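For illustration only, such a filter file might look like the fragment below, assuming a simple comma-separated layout with one journal per line (the journal names here are placeholders; check the linked example file for the exact format expected by the tool):

```
Nature,1
Journal of Machine Learning Research,1
Unwanted Journal,0
```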
The argument *\-\-doi-file* requires the path of a .txt file containing the DOIs of the papers to download, one DOI per line. [Example](https://github.com/ferru97/PyPaperBot/blob/master/file_examples/papers.txt)
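A minimal DOI file therefore looks like the sketch below (these DOIs use the DOI Foundation's example prefix and are placeholders, not real papers):

```
10.1000/example.doi.1
10.1000/example.doi.2
```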
Place the *\-\-proxy* argument after all other arguments, and specify the protocol to be used. See the examples below to understand how to use the option.
## SciHub access
If access to SciHub is blocked in your country, consider using a free VPN service such as [ProtonVPN](https://protonvpn.com/).
You can also use the proxy options described above.
## Example
Download a maximum of 30 papers from the first 3 pages given a query and starting from 2018 using the mirror https://sci-hub.do:
```bash
python -m PyPaperBot --query="Machine learning" --scholar-pages=3 --min-year=2018 --dwn-dir="C:\User\example\papers" --scihub-mirror="https://sci-hub.do"
```
Download papers from pages 4 to 7 (7th included) given a query and skip words:
```bash
python -m PyPaperBot --query="Machine learning" --scholar-pages=4-7 --dwn-dir="C:\User\example\papers" --skip-words="ai,decision tree,bot"
```
Download a paper given the DOI:
```bash
python -m PyPaperBot --doi="10.0086/s41037-711-0132-1" --dwn-dir="C:\User\example\papers" --use-doi-as-filename
```
Download papers given a file containing the DOIs:
```bash
python -m PyPaperBot --doi-file="C:\User\example\papers\file.txt" --dwn-dir="C:\User\example\papers"
```
If it doesn't work, try to use *py* instead of *python* i.e.
```bash
py -m PyPaperBot --doi="10.0086/s41037-711-0132-1" --dwn-dir="C:\User\example\papers"
```
Search papers that cite another (find ID in scholar address bar when you search citations):
```bash
python -m PyPaperBot --cites=3120460092236365926
```
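The filters from the argument table can also be combined in a single run. As an illustrative sketch (the query and download path are placeholders), the following downloads only the BibTeX entries (`--restrict=0`) of at most 5 papers sorted by citation count:

```bash
python -m PyPaperBot --query="Deep learning" --scholar-pages=1 --max-dwn-cites=5 --restrict=0 --dwn-dir="C:\User\example\papers"
```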
Using proxies:
```bash
python -m PyPaperBot --query=rheumatoid+arthritis --scholar-pages=1 --scholar-results=7 --dwn-dir=/download --proxy="http://1.1.1.1::8080,https://8.8.8.8::8080"
```
```bash
python -m PyPaperBot --query=rheumatoid+arthritis --scholar-pages=1 --scholar-results=7 --dwn-dir=/download --single-proxy=http://1.1.1.1::8080
```
In Termux, you can use `PyPaperBot` directly, followed by its arguments.
## Contributions
Feel free to contribute to this project by proposing changes, fixes, and enhancements on the **dev** branch.
### To do
- Tests
- Code documentation
- General improvements
## Disclaimer
This application is for educational purposes only. I do not take responsibility for what you choose to do with this application.
## Donation
If you like this project, you can give me a cup of coffee :)
[![paypal](https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif)](https://www.paypal.me/ferru97)