[![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml?query=branch%3Amain)
[![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml?query=branch%3Amain)
[![License:
MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/paperscraper.svg)](https://badge.fury.io/py/paperscraper)
[![Downloads](https://static.pepy.tech/badge/paperscraper)](https://pepy.tech/project/paperscraper)
[![Downloads](https://static.pepy.tech/badge/paperscraper/month)](https://pepy.tech/project/paperscraper)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![codecov](https://codecov.io/github/jannisborn/paperscraper/branch/main/graph/badge.svg?token=Clwi0pu61a)](https://codecov.io/github/jannisborn/paperscraper)
# paperscraper
`paperscraper` is a `python` package for scraping publication metadata or full PDF files from
**PubMed** or preprint servers such as **arXiv**, **medRxiv**, **bioRxiv** and **chemRxiv**.
It provides a streamlined interface for scraping metadata, lets you retrieve citation counts
from Google Scholar and impact factors for journals, and comes with simple postprocessing functions
and plotting routines for meta-analyses.
## Getting started
```console
pip install paperscraper
```
This is enough to query **PubMed**, **arXiv** or Google Scholar.
#### Download X-rxiv Dumps
However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire dump is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line).
```py
from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
medrxiv() # Takes ~30min and should result in ~35 MB file
biorxiv() # Takes ~1h and should result in ~350 MB file
chemrxiv() # Takes ~45min and should result in ~20 MB file
```
*NOTE*: Once the dumps are stored, please make sure to restart the Python interpreter so that the changes take effect.
*NOTE*: If you experience API connection issues (`ConnectionError`): since v0.2.12 failed requests are retried automatically. You can raise the retry limit from its default of 10, e.g. `biorxiv(max_retries=20)`.
Since v0.2.5, `paperscraper` can also scrape {med/bio/chem}rxiv for specific date ranges.
```py
medrxiv(begin_date="2023-04-01", end_date="2023-04-08")
```
But be careful: the resulting `.jsonl` file is named after the current date, and all subsequent searches will be based on this file **only**. If you use this option, keep an eye on the source files (`paperscraper/server_dumps/*jsonl`) to make sure they contain the metadata for all papers you are interested in; a quick way to check is sketched below.
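The following is a minimal sketch (plain Python, not a `paperscraper` API) that lists the dump files currently installed, assuming they live in the `server_dumps` folder next to the installed package as described above:
```py
import glob
import os

import paperscraper

# List all local dump files together with their approximate size
dump_dir = os.path.join(os.path.dirname(paperscraper.__file__), 'server_dumps')
for path in sorted(glob.glob(os.path.join(dump_dir, '*.jsonl'))):
    print(os.path.basename(path), f'{os.path.getsize(path) / 1e6:.1f} MB')
```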
## Examples
`paperscraper` is built on top of the packages [arxiv](https://pypi.org/project/arxiv/), [pymed](https://pypi.org/project/pymed-paperscraper/), and [scholarly](https://pypi.org/project/scholarly/).
### Publication keyword search
Suppose you want to perform a publication keyword search with the query:
`COVID-19` **AND** `Artificial Intelligence` **AND** `Medical Imaging`.
A query is given as a list of term lists: terms within an inner list are treated as synonyms (**OR**), and the inner lists themselves are combined with **AND**.
* Scrape papers from PubMed:
```py
from paperscraper.pubmed import get_and_dump_pubmed_papers
covid19 = ['COVID-19', 'SARS-CoV-2']
ai = ['Artificial intelligence', 'Deep learning', 'Machine learning']
mi = ['Medical imaging']
query = [covid19, ai, mi]
get_and_dump_pubmed_papers(query, output_filepath='covid19_ai_imaging.jsonl')
```
* Scrape papers from arXiv:
```py
from paperscraper.arxiv import get_and_dump_arxiv_papers
get_and_dump_arxiv_papers(query, output_filepath='covid19_ai_imaging.jsonl')
```
* Scrape papers from bioRxiv, medRxiv or chemRxiv:
```py
from paperscraper.xrxiv.xrxiv_query import XRXivQuery
querier = XRXivQuery('server_dumps/chemrxiv_2020-11-10.jsonl')
querier.search_keywords(query, output_filepath='covid19_ai_imaging.jsonl')
```
You can also use `dump_queries` to iterate over a bunch of queries for all available databases.
```py
from paperscraper import dump_queries
queries = [[covid19, ai, mi], [covid19, ai], [ai]]
dump_queries(queries, '.')
```
Or use the harmonized interface of `QUERY_FN_DICT` to query multiple databases of your choice:
```py
from paperscraper.load_dumps import QUERY_FN_DICT
print(QUERY_FN_DICT.keys())
QUERY_FN_DICT['biorxiv'](query, output_filepath='biorxiv_covid_ai_imaging.jsonl')
QUERY_FN_DICT['medrxiv'](query, output_filepath='medrxiv_covid_ai_imaging.jsonl')
```
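If you want the same query dumped once per database, you can also iterate over the dictionary directly; this sketch assumes that every entry follows the same `(query, output_filepath=...)` signature as the two calls above:
```py
# Run the same keyword query against every available database
for db, query_fn in QUERY_FN_DICT.items():
    query_fn(query, output_filepath=f'{db}_covid_ai_imaging.jsonl')
```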
* Scrape papers from Google Scholar:
Thanks to [scholarly](https://pypi.org/project/scholarly/), there is an endpoint for Google Scholar too.
Unlike the endpoints above, it does not understand Boolean expressions; use it just as you would
the [Google Scholar search field](https://scholar.google.com).
```py
from paperscraper.scholar import get_and_dump_scholar_papers
topic = 'Machine Learning'
get_and_dump_scholar_papers(topic)
```
### Scrape PDFs
`paperscraper` also allows you to download the full PDF files.
```py
from paperscraper.pdf import save_pdf
paper_data = {'doi': "10.48550/arXiv.2207.03928"}
save_pdf(paper_data, filepath='gt4sd_paper.pdf')
```
If you want to batch-download all PDFs for a previous metadata search, use the wrapper `save_pdf_from_dump`.
Here we scrape the PDFs for the metadata obtained in the earlier medRxiv example.
```py
from paperscraper.pdf import save_pdf_from_dump
# Save PDFs in current folder and name the files by their DOI
save_pdf_from_dump('medrxiv_covid_ai_imaging.jsonl', pdf_path='.', key_to_save='doi')
```
*NOTE*: This works robustly for preprint servers, but if you use it on a PubMed dump, don't expect to obtain all PDFs.
Many publishers detect and block scraping, and many publications are simply behind paywalls.
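To get a quick sense of how many PDFs were actually retrieved, you can compare the dump against the downloaded files. A minimal sketch (plain Python, not a `paperscraper` API), assuming the PDFs were saved into the current folder as in the example above:
```py
import glob

# The dump contains one metadata record per line
with open('medrxiv_covid_ai_imaging.jsonl') as f:
    n_papers = sum(1 for line in f if line.strip())

n_pdfs = len(glob.glob('*.pdf'))
print(f'Retrieved {n_pdfs}/{n_papers} PDFs')
```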
### Citation search
An advantage of the Scholar endpoint is that it can fetch the number of citations of a paper:
```py
from paperscraper.scholar import get_citations_from_title
title = 'Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I.'
get_citations_from_title(title)
```
*NOTE*: The Scholar endpoint does not require authentication, but since it regularly
prompts with captchas, it is difficult to use at scale.
### Journal impact factor
You can also look up the impact factor of a journal:
```py
>>> from paperscraper.impact import Impactor
>>> i = Impactor()
>>> i.search("Nat Comms", threshold=85, sort_by='impact')
[
{'journal': 'Nature Communications', 'factor': 17.694, 'score': 94},
{'journal': 'Natural Computing', 'factor': 1.504, 'score': 88}
]
```
This performs a fuzzy search with a matching threshold of 85; `threshold` defaults to 100, in which case an exact search
is performed. You can also search by journal abbreviation, [E-ISSN](https://portal.issn.org) or [NLM ID](https://www.ncbi.nlm.nih.gov/nlmcatalog).
```py
i.search("Nat Rev Earth Environ") # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]
i.search("101771060") # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]
i.search('2662-138X') # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]
# Filter results by impact factor
i.search("Neural network", threshold=85, min_impact=1.5, max_impact=20)
# [
# {'journal': 'IEEE Transactions on Neural Networks and Learning Systems', 'factor': 14.255, 'score': 93},
# {'journal': 'NEURAL NETWORKS', 'factor': 9.657, 'score': 91},
# {'journal': 'WORK-A Journal of Prevention Assessment & Rehabilitation', 'factor': 1.803, 'score': 86},
# {'journal': 'NETWORK-COMPUTATION IN NEURAL SYSTEMS', 'factor': 1.5, 'score': 92}
# ]
# Show all fields
i.search("quantum information", threshold=90, return_all=True)
# [
# {'factor': 10.758, 'jcr': 'Q1', 'journal_abbr': 'npj Quantum Inf', 'eissn': '2056-6387', 'journal': 'npj Quantum Information', 'nlm_id': '101722857', 'issn': '', 'score': 92},
# {'factor': 1.577, 'jcr': 'Q3', 'journal_abbr': 'Nation', 'eissn': '0027-8378', 'journal': 'NATION', 'nlm_id': '9877123', 'issn': '0027-8378', 'score': 91}
# ]
```
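The search results are plain dictionaries, so they can be post-processed with standard Python. For example, a small sketch using only the `journal` and `factor` fields shown above:
```py
# Keep the three matches with the highest impact factor
results = i.search("Neural network", threshold=85)
top = sorted(results, key=lambda r: r['factor'], reverse=True)[:3]
for r in top:
    print(f"{r['journal']}: {r['factor']}")
```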
### Plotting
When multiple query searches are performed, two types of plots can be generated
automatically: Venn diagrams and bar plots.
#### Barplots
Compare the temporal evolution of different queries across different servers.
```py
import os

from paperscraper import QUERY_FN_DICT
from paperscraper.postprocessing import aggregate_paper
from paperscraper.utils import get_filename_from_query, load_jsonl

# Define search terms and their synonyms
ml = ['Deep learning', 'Neural Network', 'Machine learning']
mol = ['molecule', 'molecular', 'drug', 'ligand', 'compound']
gnn = ['gcn', 'gnn', 'graph neural', 'graph convolutional', 'molecular graph']
smiles = ['SMILES', 'Simplified molecular']
fp = ['fingerprint', 'molecular fingerprint', 'fingerprints']

# Define queries
queries = [[ml, mol, smiles], [ml, mol, fp], [ml, mol, gnn]]

root = '../keyword_dumps'

data_dict = dict()
for query in queries:
    filename = get_filename_from_query(query)
    data_dict[filename] = dict()
    for db in QUERY_FN_DICT:
        # Assuming the keyword search has been performed already
        data = load_jsonl(os.path.join(root, db, filename))

        # Unstructured matches are aggregated into 6 bins, 1 per year
        # from 2015 to 2020. A sanity check is performed by setting
        # `filtering=True`, which removes papers that don't contain all of
        # the keywords in the query.
        data_dict[filename][db], filtered = aggregate_paper(
            data, 2015, bins_per_year=1, filtering=True,
            filter_keys=query, return_filtered=True
        )

# Plotting is now very simple
from paperscraper.plotting import plot_comparison

data_keys = [
    'deeplearning_molecule_fingerprint.jsonl',
    'deeplearning_molecule_smiles.jsonl',
    'deeplearning_molecule_gcn.jsonl'
]
plot_comparison(
    data_dict,
    data_keys,
    title_text="'Deep Learning' AND 'Molecule' AND X",
    keyword_text=['Fingerprint', 'SMILES', 'Graph'],
    figpath='mol_representation'
)
```
![molreps](https://github.com/jannisborn/paperscraper/blob/main/assets/molreps.png?raw=true "MolReps")
#### Venn Diagrams
```py
from paperscraper.plotting import (
    plot_venn_two, plot_venn_three, plot_multiple_venn
)
sizes_2020 = (30842, 14474, 2292, 35476, 1904, 1408, 376)
sizes_2019 = (55402, 11899, 2563)
labels_2020 = ('Medical\nImaging', 'Artificial\nIntelligence', 'COVID-19')
labels_2019 = ['Medical Imaging', 'Artificial\nIntelligence']
plot_venn_two(sizes_2019, labels_2019, title='2019', figname='ai_imaging')
```
![2019](https://github.com/jannisborn/paperscraper/blob/main/assets/ai_imaging.png?raw=true "2019")
```py
plot_venn_three(
    sizes_2020, labels_2020, title='2020', figname='ai_imaging_covid'
)
```
![2020](https://github.com/jannisborn/paperscraper/blob/main/assets/ai_imaging_covid.png?raw=true "2020")
Or plot both together:
```py
plot_multiple_venn(
    [sizes_2019, sizes_2020], [labels_2019, labels_2020],
    titles=['2019', '2020'], suptitle='Keyword search comparison',
    gridspec_kw={'width_ratios': [1, 2]}, figsize=(10, 6),
    figname='both'
)
```
![both](https://github.com/jannisborn/paperscraper/blob/main/assets/both.png?raw=true "Both")
## Citation
If you use `paperscraper`, please cite a paper that motivated our development of this tool.
```bib
@article{born2021trends,
    title={Trends in Deep Learning for Property-driven Drug Design},
    author={Born, Jannis and Manica, Matteo},
    journal={Current Medicinal Chemistry},
    volume={28},
    number={38},
    pages={7862--7886},
    year={2021},
    publisher={Bentham Science Publishers}
}
```
## Contributions
Thanks to the following contributors:
- [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
- [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
- [@daenuprobst](https://github.com/daenuprobst): Since `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`)
- [@oppih](https://github.com/oppih): Since `v0.2.3` chemRxiv API also provides DOI and URL if available
- [@lukasschwab](https://github.com/lukasschwab): Bumped `arxiv` dependency to >`1.4.2` in paperscraper `v0.1.0`.
- [@juliusbierk](https://github.com/juliusbierk): Bugfixes