[![build](https://github.com/PhosphorylatedRabbits/paperscraper/actions/workflows/build.yml/badge.svg)](https://github.com/PhosphorylatedRabbits/paperscraper/actions/workflows/build.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/paperscraper.svg)](https://badge.fury.io/py/paperscraper)
[![Downloads](https://static.pepy.tech/badge/paperscraper)](https://pepy.tech/project/paperscraper)
[![Downloads](https://static.pepy.tech/badge/paperscraper/month)](https://pepy.tech/project/paperscraper)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

# paperscraper

## Overview

`paperscraper` is a Python package, distributed via PyPI, for scraping publication metadata
as well as full PDF files from **PubMed** and from preprint servers such as **arXiv**,
**medRxiv**, **bioRxiv** and **chemRxiv**. It provides a streamlined interface for scraping metadata and comes
with simple postprocessing functions and plotting routines for meta-analysis.

Since v0.2.4, `paperscraper` also supports scraping PDF files directly! Thanks to [@daenuprobst](https://github.com/daenuprobst) for suggestions!

## Getting started

```console
pip install paperscraper
```

This is enough to query **PubMed**, **arXiv** or Google Scholar.

#### Download X-rxiv Dumps

However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire dump is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line).

```py
from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
medrxiv()   # Takes ~30min and should result in a ~35 MB file
biorxiv()   # Takes ~1h and should result in a ~350 MB file
chemrxiv()  # Takes ~45min and should result in a ~20 MB file
```
*NOTE*: Once the dumps are stored, make sure to restart the Python interpreter
so that the changes take effect.
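
Each dump is a plain `.jsonl` file with one JSON record per line, so you can inspect it with a few lines of standard Python. A minimal sketch (the filename below is hypothetical; your dump will be named after its download date, and the exact metadata fields may vary):

```py
import json

# Hypothetical path: dumps live in paperscraper's `server_dumps` folder
dump_path = 'server_dumps/medrxiv_2023-04-08.jsonl'

with open(dump_path) as f:
    papers = [json.loads(line) for line in f]

print(len(papers), 'papers in dump')
print(papers[0].keys())  # inspect which metadata fields are available
```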

Since v0.2.5, `paperscraper` also allows scraping {med/bio/chem}rxiv for specific date ranges! Thanks to [@achouhan93](https://github.com/achouhan93) for contributing!
```py
medrxiv(begin_date="2023-04-01", end_date="2023-04-08")
```
But watch out: the resulting `.jsonl` file will be labelled with the current date, and all your subsequent searches will be based on this file **only**. If you use this option, you may want to keep an eye on the source files (`paperscraper/server_dumps/*jsonl`) to ensure they contain the metadata for all papers you're interested in.
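
For example, here is a quick way to list your dump files and the date range each one covers (a sketch, assuming each record carries a `date` field as the x-rxiv dumps typically do; adjust the glob pattern to your installation):

```py
import glob
import json

for path in glob.glob('paperscraper/server_dumps/*jsonl'):
    with open(path) as f:
        dates = sorted(json.loads(line)['date'] for line in f)
    print(f'{path}: {dates[0]} to {dates[-1]} ({len(dates)} papers)')
```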

## Examples

`paperscraper` is built on top of the packages [pymed](https://pypi.org/project/pymed/),
[arxiv](https://pypi.org/project/arxiv/) and [scholarly](https://pypi.org/project/scholarly/).

### Publication keyword search

Suppose you want to perform a publication keyword search with the query:
`COVID-19` **AND** `Artificial Intelligence` **AND** `Medical Imaging`.
Queries are nested lists: the outer list combines terms with **AND**, while each inner list contains synonyms that are combined with **OR**.

* Scrape papers from PubMed:

```py
from paperscraper.pubmed import get_and_dump_pubmed_papers
covid19 = ['COVID-19', 'SARS-CoV-2']
ai = ['Artificial intelligence', 'Deep learning', 'Machine learning']
mi = ['Medical imaging']
query = [covid19, ai, mi]

get_and_dump_pubmed_papers(query, output_filepath='covid19_ai_imaging.jsonl')
```

* Scrape papers from arXiv:

```py
from paperscraper.arxiv import get_and_dump_arxiv_papers

get_and_dump_arxiv_papers(query, output_filepath='covid19_ai_imaging.jsonl')
```

* Scrape papers from bioRxiv, medRxiv or chemRxiv:

```py
from paperscraper.xrxiv.xrxiv_query import XRXivQuery

querier = XRXivQuery('server_dumps/chemrxiv_2020-11-10.jsonl')
querier.search_keywords(query, output_filepath='covid19_ai_imaging.jsonl')
```

You can also use `dump_queries` to iterate over a list of queries across all available databases.

```py
from paperscraper import dump_queries

queries = [[covid19, ai, mi], [covid19, ai], [ai]]
dump_queries(queries, '.')
```

Or use the harmonized interface of `QUERY_FN_DICT` to query multiple databases of your choice:
```py
from paperscraper.load_dumps import QUERY_FN_DICT
print(QUERY_FN_DICT.keys())

QUERY_FN_DICT['biorxiv'](query, output_filepath='biorxiv_covid_ai_imaging.jsonl')
QUERY_FN_DICT['medrxiv'](query, output_filepath='medrxiv_covid_ai_imaging.jsonl')
```
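
Since every function in `QUERY_FN_DICT` shares the same signature, you can also loop over all available databases in one go (a sketch; the x-rxiv entries require the corresponding dumps to have been downloaded first):

```py
for db, query_fn in QUERY_FN_DICT.items():
    query_fn(query, output_filepath=f'{db}_covid_ai_imaging.jsonl')
```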

* Scrape papers from Google Scholar:

Thanks to [scholarly](https://pypi.org/project/scholarly/), there is an endpoint for Google Scholar too.
Unlike the others, it does not understand Boolean expressions; use it just like the
[Google Scholar search field](https://scholar.google.com).

```py
from paperscraper.scholar import get_and_dump_scholar_papers
topic = 'Machine Learning'
get_and_dump_scholar_papers(topic)
```

### Scrape PDFs

`paperscraper` also allows you to download the PDF files.

```py
from paperscraper.pdf import save_pdf
paper_data = {'doi': "10.48550/arXiv.2207.03928"}
save_pdf(paper_data, filepath='gt4sd_paper.pdf')
```

If you want to batch download all PDFs for your previous metadata search, use the wrapper.
Here we scrape the PDFs for the metadata obtained in the previous example.

```py
from paperscraper.pdf import save_pdf_from_dump

# Save PDFs in current folder and name the files by their DOI
save_pdf_from_dump('medrxiv_covid_ai_imaging.jsonl', pdf_path='.', key_to_save='doi')
```
*NOTE*: This works robustly for preprint servers, but if you use it on a PubMed dump, don't expect to obtain all PDFs:
many publishers detect and block scraping, and many publications are simply behind paywalls.
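
To see how much of your dump was actually covered, you can compare the downloaded files against the metadata (a minimal sketch, assuming the PDFs were saved into the current folder as in the example above):

```py
import glob

# Count the metadata records (one JSON record per line) ...
with open('medrxiv_covid_ai_imaging.jsonl') as f:
    n_papers = sum(1 for _ in f)

# ... and compare against the PDFs that actually arrived
n_pdfs = len(glob.glob('*.pdf'))
print(f'Downloaded {n_pdfs}/{n_papers} PDFs')
```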


### Citation search

A plus of the Scholar endpoint is that the number of citations of a paper can be fetched:

```py
from paperscraper.scholar import get_citations_from_title
title = 'Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I.'
get_citations_from_title(title)
```

*NOTE*: The Scholar endpoint does not require authentication, but since it regularly
prompts with captchas, it's difficult to apply at large scale.

### Journal impact factor

You can also retrieve the impact factor for all journals:
```py
>>> from paperscraper.impact import Impactor
>>> i = Impactor()
>>> i.search("Nat Comms", threshold=85, sort_by='impact')
[
    {'journal': 'Nature Communications', 'factor': 17.694, 'score': 94}, 
    {'journal': 'Natural Computing', 'factor': 1.504, 'score': 88}
]
```
This performs a fuzzy search with a threshold of 85. `threshold` defaults to 100, in which case an exact search
is performed. You can also search by journal abbreviation, [E-ISSN](https://portal.issn.org) or [NLM ID](https://www.ncbi.nlm.nih.gov/nlmcatalog).
```py
i.search("Nat Rev Earth Environ") # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]
i.search("101771060") # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]
i.search('2662-138X') # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]

# Filter results by impact factor
i.search("Neural network", threshold=85, min_impact=1.5, max_impact=20)
# [
#   {'journal': 'IEEE Transactions on Neural Networks and Learning Systems', 'factor': 14.255, 'score': 93}, 
#   {'journal': 'NEURAL NETWORKS', 'factor': 9.657, 'score': 91},
#   {'journal': 'WORK-A Journal of Prevention Assessment & Rehabilitation', 'factor': 1.803, 'score': 86}, 
#   {'journal': 'NETWORK-COMPUTATION IN NEURAL SYSTEMS', 'factor': 1.5, 'score': 92}
# ]

# Show all fields
i.search("quantum information", threshold=90, return_all=True)
# [
#   {'factor': 10.758, 'jcr': 'Q1', 'journal_abbr': 'npj Quantum Inf', 'eissn': '2056-6387', 'journal': 'npj Quantum Information', 'nlm_id': '101722857', 'issn': '', 'score': 92},
#   {'factor': 1.577, 'jcr': 'Q3', 'journal_abbr': 'Nation', 'eissn': '0027-8378', 'journal': 'NATION', 'nlm_id': '9877123', 'issn': '0027-8378', 'score': 91}
# ]
```

### Plotting

When multiple query searches are performed, two types of plots can be generated
automatically: Venn diagrams and bar plots.

#### Barplots

Compare the temporal evolution of different queries across different servers.

```py
import os

from paperscraper import QUERY_FN_DICT
from paperscraper.postprocessing import aggregate_paper
from paperscraper.utils import get_filename_from_query, load_jsonl

# Define search terms and their synonyms
ml = ['Deep learning', 'Neural Network', 'Machine learning']
mol = ['molecule', 'molecular', 'drug', 'ligand', 'compound']
gnn = ['gcn', 'gnn', 'graph neural', 'graph convolutional', 'molecular graph']
smiles = ['SMILES', 'Simplified molecular']
fp = ['fingerprint', 'molecular fingerprint', 'fingerprints']

# Define queries
queries = [[ml, mol, smiles], [ml, mol, fp], [ml, mol, gnn]]

root = '../keyword_dumps'

data_dict = dict()
for query in queries:
    filename = get_filename_from_query(query)
    data_dict[filename] = dict()
    for db in QUERY_FN_DICT:
        # Assuming the keyword search has been performed already
        data = load_jsonl(os.path.join(root, db, filename))

        # Unstructured matches are aggregated into 6 bins, one per year,
        # from 2015 to 2020. A sanity check is performed via
        # `filtering=True`, which removes papers that don't contain all
        # of the keywords in the query.
        data_dict[filename][db], filtered = aggregate_paper(
            data, 2015, bins_per_year=1, filtering=True,
            filter_keys=query, return_filtered=True
        )

# Plotting is now very simple
from paperscraper.plotting import plot_comparison

data_keys = [
    'deeplearning_molecule_fingerprint.jsonl',
    'deeplearning_molecule_smiles.jsonl', 
    'deeplearning_molecule_gcn.jsonl'
]
plot_comparison(
    data_dict,
    data_keys,
    title_text="'Deep Learning' AND 'Molecule' AND X",
    keyword_text=['Fingerprint', 'SMILES', 'Graph'],
    figpath='mol_representation'
)
```

![molreps](https://github.com/PhosphorylatedRabbits/paperscraper/blob/master/assets/molreps.png "MolReps")


#### Venn Diagrams

```py
from paperscraper.plotting import (
    plot_venn_two, plot_venn_three, plot_multiple_venn
)
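
# Subset sizes: (Ab, aB, AB) for two sets; for three sets, presumably the
# matplotlib-venn ordering (Abc, aBc, ABc, abC, AbC, aBC, ABC).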

sizes_2020 = (30842, 14474, 2292, 35476, 1904, 1408, 376)
sizes_2019 = (55402, 11899, 2563)
labels_2020 = ('Medical\nImaging', 'Artificial\nIntelligence', 'COVID-19')
labels_2019 = ['Medical Imaging', 'Artificial\nIntelligence']

plot_venn_two(sizes_2019, labels_2019, title='2019', figname='ai_imaging')
```

![2019](https://github.com/PhosphorylatedRabbits/paperscraper/blob/master/assets/ai_imaging.png "2019")


```py
plot_venn_three(
    sizes_2020, labels_2020, title='2020', figname='ai_imaging_covid'
)
```

![2020](https://github.com/PhosphorylatedRabbits/paperscraper/blob/master/assets/ai_imaging_covid.png "2020")

Or plot both together:

```py
plot_multiple_venn(
    [sizes_2019, sizes_2020], [labels_2019, labels_2020], 
    titles=['2019', '2020'], suptitle='Keyword search comparison', 
    gridspec_kw={'width_ratios': [1, 2]}, figsize=(10, 6),
    figname='both'
)
```

![both](https://github.com/PhosphorylatedRabbits/paperscraper/blob/master/assets/both.png "Both")



## Citation
If you use `paperscraper`, please cite the papers that motivated our development of this tool.

```bib
@article{born2021trends,
  title={Trends in Deep Learning for Property-driven Drug Design},
  author={Born, Jannis and Manica, Matteo},
  journal={Current Medicinal Chemistry},
  volume={28},
  number={38},
  pages={7862--7886},
  year={2021},
  publisher={Bentham Science Publishers}
}

@article{born2021on,
	title = {On the role of artificial intelligence in medical imaging of COVID-19},
	journal = {Patterns},
	volume = {2},
	number = {6},
	pages = {100269},
	year = {2021},
	issn = {2666-3899},
	url = {https://doi.org/10.1016/j.patter.2021.100269},
	author = {Jannis Born and David Beymer and Deepta Rajan and Adam Coy and Vandana V. Mukherjee and Matteo Manica and Prasanth Prasanna and Deddeh Ballah and Michal Guindy and Dorith Shaham and Pallav L. Shah and Emmanouil Karteris and Jan L. Robertus and Maria Gabrani and Michal Rosen-Zvi}
}
```

            
