paperscraper

- Name: paperscraper
- Version: 0.2.14
- Home page: https://github.com/jannisborn/paperscraper
- Summary: paperscraper: Package to scrape papers.
- Upload time: 2024-10-30 09:24:16
- Author: Jannis Born, Matteo Manica
- License: MIT
- Keywords: academics, science, publication, search, pubmed, arxiv, medrxiv, biorxiv, chemrxiv, google scholar
[![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml?query=branch%3Amain)
[![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml?query=branch%3Amain)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/paperscraper.svg)](https://badge.fury.io/py/paperscraper)
[![Downloads](https://static.pepy.tech/badge/paperscraper)](https://pepy.tech/project/paperscraper)
[![Downloads](https://static.pepy.tech/badge/paperscraper/month)](https://pepy.tech/project/paperscraper)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![codecov](https://codecov.io/github/jannisborn/paperscraper/branch/main/graph/badge.svg?token=Clwi0pu61a)](https://codecov.io/github/jannisborn/paperscraper)
# paperscraper

`paperscraper` is a `python` package for scraping publication metadata or full PDF files from
**PubMed** or preprint servers such as **arXiv**, **medRxiv**, **bioRxiv** and **chemRxiv**.
It provides a streamlined interface to scrape metadata, lets you retrieve citation counts
from Google Scholar and impact factors for journals, and comes with simple postprocessing functions
and plotting routines for meta-analysis.


## Getting started

```console
pip install paperscraper
```

This is enough to query **PubMed**, **arXiv** or Google Scholar.

#### Download X-rxiv Dumps

However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire dump is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line).

```py
from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
medrxiv()  #  Takes ~30min and should result in ~35 MB file
biorxiv()  # Takes ~1h and should result in ~350 MB file
chemrxiv()  #  Takes ~45min and should result in ~20 MB file
```
*NOTE*: Once the dumps are stored, make sure to restart the Python interpreter so that the changes take effect.
*NOTE*: If you experience API connection issues (`ConnectionError`): since v0.2.12 requests are retried automatically, and you can raise the retry count from its default of 10, e.g. `biorxiv(max_retries=20)`.

Since v0.2.5, `paperscraper` also supports scraping {med/bio/chem}rxiv for specific date ranges.
```py
medrxiv(begin_date="2023-04-01", end_date="2023-04-08")
```
But be careful: the resulting `.jsonl` file will be labelled according to the current date, and all your subsequent searches will be based on this file **only**. If you use this option, keep an eye on the source files (`paperscraper/server_dumps/*jsonl`) to make sure they contain the metadata for all papers you are interested in.
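
To see which dump files your installation will actually search against, you can list the contents of the `server_dumps` folder. A minimal sketch, assuming only that the dumps sit inside the installed package as described above:

```py
import glob
import os

import paperscraper

# Dumps are stored inside the installed package, under paperscraper/server_dumps/.
dump_dir = os.path.join(os.path.dirname(paperscraper.__file__), "server_dumps")
for path in sorted(glob.glob(os.path.join(dump_dir, "*.jsonl"))):
    print(f"{os.path.basename(path)}: {os.path.getsize(path) / 1e6:.1f} MB")
```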

## Examples

`paperscraper` is built on top of the packages [arxiv](https://pypi.org/project/arxiv/), [pymed](https://pypi.org/project/pymed-paperscraper/), and [scholarly](https://pypi.org/project/scholarly/).

### Publication keyword search

Suppose you want to perform a publication keyword search with the query:
`COVID-19` **AND** `Artificial Intelligence` **AND** `Medical Imaging`.

* Scrape papers from PubMed:

```py
from paperscraper.pubmed import get_and_dump_pubmed_papers
covid19 = ['COVID-19', 'SARS-CoV-2']
ai = ['Artificial intelligence', 'Deep learning', 'Machine learning']
mi = ['Medical imaging']
query = [covid19, ai, mi]

get_and_dump_pubmed_papers(query, output_filepath='covid19_ai_imaging.jsonl')
```

* Scrape papers from arXiv:

```py
from paperscraper.arxiv import get_and_dump_arxiv_papers

get_and_dump_arxiv_papers(query, output_filepath='covid19_ai_imaging.jsonl')
```

* Scrape papers from bioRxiv, medRxiv or chemRxiv:

```py
from paperscraper.xrxiv.xrxiv_query import XRXivQuery

querier = XRXivQuery('server_dumps/chemrxiv_2020-11-10.jsonl')
querier.search_keywords(query, output_filepath='covid19_ai_imaging.jsonl')
```

You can also use `dump_queries` to iterate over a bunch of queries for all available databases.

```py
from paperscraper import dump_queries

queries = [[covid19, ai, mi], [covid19, ai], [ai]]
dump_queries(queries, '.')
```
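
If you want to check which output files `dump_queries` produced, here is a minimal sketch, assuming it follows the database-named subfolder layout and `get_filename_from_query` naming that the Plotting example below relies on:

```py
import os

from paperscraper.load_dumps import QUERY_FN_DICT
from paperscraper.utils import get_filename_from_query

# Assumed layout: one subfolder per database, one .jsonl per query.
for db in QUERY_FN_DICT:
    for q in queries:
        path = os.path.join('.', db, get_filename_from_query(q))
        print(path, '->', 'found' if os.path.exists(path) else 'missing')
```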

Or use the harmonized interface of `QUERY_FN_DICT` to query multiple databases of your choice:
```py
from paperscraper.load_dumps import QUERY_FN_DICT
print(QUERY_FN_DICT.keys())

QUERY_FN_DICT['biorxiv'](query, output_filepath='biorxiv_covid_ai_imaging.jsonl')
QUERY_FN_DICT['medrxiv'](query, output_filepath='medrxiv_covid_ai_imaging.jsonl')
```
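
All of the searches above write plain JSON-lines files, so you can inspect the results directly. A minimal sketch for loading one dump; field names such as `title` are an assumption about the metadata schema:

```py
import json

# One paper per line, each line a JSON object.
with open('covid19_ai_imaging.jsonl') as f:
    papers = [json.loads(line) for line in f]

print(f"{len(papers)} papers found")
print(papers[0].get('title'))  # 'title' is an assumed field name
```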

* Scrape papers from Google Scholar:

Thanks to [scholarly](https://pypi.org/project/scholarly/), there is an endpoint for Google Scholar too.
Unlike the other endpoints, it does not understand Boolean expressions; use it just as you would the
[Google Scholar search field](https://scholar.google.com).

```py
from paperscraper.scholar import get_and_dump_scholar_papers
topic = 'Machine Learning'
get_and_dump_scholar_papers(topic)
```

### Scrape PDFs

`paperscraper` also allows you to download the PDF files.

```py
from paperscraper.pdf import save_pdf
paper_data = {'doi': "10.48550/arXiv.2207.03928"}
save_pdf(paper_data, filepath='gt4sd_paper.pdf')
```

If you want to batch download all PDFs for your previous metadata search, use the wrapper.
Here we scrape the PDFs for the metadata obtained in the previous example.

```py
from paperscraper.pdf import save_pdf_from_dump

# Save PDFs in current folder and name the files by their DOI
save_pdf_from_dump('medrxiv_covid_ai_imaging.jsonl', pdf_path='.', key_to_save='doi')
```
*NOTE*: This works robustly for preprint servers, but if you use it on a PubMed dump, don't expect to obtain all PDFs.
Many publishers detect and block scraping, and many publications are simply behind paywalls.
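
After a batch download it can be useful to check how many PDFs actually arrived. A minimal sketch; the filename convention (DOI with `/` replaced) is an assumption, so adjust it to however the files are named on your system:

```py
import json
import os

with open('medrxiv_covid_ai_imaging.jsonl') as f:
    dois = [json.loads(line).get('doi') for line in f]

# Assumed naming scheme: one PDF per paper, named after the DOI with '/' replaced.
expected = [doi.replace('/', '_') + '.pdf' for doi in dois if doi]
found = [name for name in expected if os.path.exists(name)]
print(f"{len(found)}/{len(expected)} PDFs present")
```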


### Citation search

An advantage of the Scholar endpoint is that the number of citations of a paper can be fetched:

```py
from paperscraper.scholar import get_citations_from_title
title = 'Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I.'
get_citations_from_title(title)
```

*NOTE*: The Scholar endpoint does not require authentication, but since it regularly
prompts with captchas, it is difficult to use at large scale.
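
If you still want to fetch citation counts for more than a handful of titles, spacing out the requests helps. A minimal sketch that simply loops over `get_citations_from_title` with a pause; the titles (taken from elsewhere in this README) and the delay are illustrative:

```py
import time

from paperscraper.scholar import get_citations_from_title

titles = [
    'Trends in Deep Learning for Property-driven Drug Design',
    'Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I.',
]
for title in titles:
    print(title, get_citations_from_title(title))
    time.sleep(30)  # illustrative pause; does not guarantee you avoid captchas
```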

### Journal impact factor

You can also retrieve journal impact factors:
```py
>>> from paperscraper.impact import Impactor
>>> i = Impactor()
>>> i.search("Nat Comms", threshold=85, sort_by='impact')
[
    {'journal': 'Nature Communications', 'factor': 17.694, 'score': 94}, 
    {'journal': 'Natural Computing', 'factor': 1.504, 'score': 88}
]
```
This performs a fuzzy search with a threshold of 85. `threshold` defaults to 100 in which case an exact search
is performed. You can also search by journal abbreviation, [E-ISSN](https://portal.issn.org) or [NLM ID](https://portal.issn.org).
```py
i.search("Nat Rev Earth Environ") # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]
i.search("101771060") # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]
i.search('2662-138X') # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]

# Filter results by impact factor
i.search("Neural network", threshold=85, min_impact=1.5, max_impact=20)
# [
#   {'journal': 'IEEE Transactions on Neural Networks and Learning Systems', 'factor': 14.255, 'score': 93}, 
#   {'journal': 'NEURAL NETWORKS', 'factor': 9.657, 'score': 91},
#   {'journal': 'WORK-A Journal of Prevention Assessment & Rehabilitation', 'factor': 1.803, 'score': 86}, 
#   {'journal': 'NETWORK-COMPUTATION IN NEURAL SYSTEMS', 'factor': 1.5, 'score': 92}
# ]

# Show all fields
i.search("quantum information", threshold=90, return_all=True)
# [
#   {'factor': 10.758, 'jcr': 'Q1', 'journal_abbr': 'npj Quantum Inf', 'eissn': '2056-6387', 'journal': 'npj Quantum Information', 'nlm_id': '101722857', 'issn': '', 'score': 92},
#   {'factor': 1.577, 'jcr': 'Q3', 'journal_abbr': 'Nation', 'eissn': '0027-8378', 'journal': 'NATION', 'nlm_id': '9877123', 'issn': '0027-8378', 'score': 91}
# ]
```
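
One way to combine this with a metadata dump is to annotate each paper with the impact factor of its journal. A minimal sketch, assuming each record carries a `journal` field (preprint entries typically will not) and the `factor` key shown in the output above:

```py
import json

from paperscraper.impact import Impactor

i = Impactor()

annotated = []
with open('covid19_ai_imaging.jsonl') as f:
    for line in f:
        paper = json.loads(line)
        journal = paper.get('journal')  # assumed field; may be missing for preprints
        hits = i.search(journal, threshold=90) if journal else []
        paper['impact_factor'] = hits[0]['factor'] if hits else None
        annotated.append(paper)

print(sum(p['impact_factor'] is not None for p in annotated), 'papers matched a journal')
```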

### Plotting

When multiple query searches are performed, two types of plots can be generated
automatically: Venn diagrams and bar plots.

#### Barplots

Compare the temporal evolution of different queries across different servers.

```py
import os

from paperscraper import QUERY_FN_DICT
from paperscraper.postprocessing import aggregate_paper
from paperscraper.utils import get_filename_from_query, load_jsonl

# Define search terms and their synonyms
ml = ['Deep learning', 'Neural Network', 'Machine learning']
mol = ['molecule', 'molecular', 'drug', 'ligand', 'compound']
gnn = ['gcn', 'gnn', 'graph neural', 'graph convolutional', 'molecular graph']
smiles = ['SMILES', 'Simplified molecular']
fp = ['fingerprint', 'molecular fingerprint', 'fingerprints']

# Define queries
queries = [[ml, mol, smiles], [ml, mol, fp], [ml, mol, gnn]]

root = '../keyword_dumps'

data_dict = dict()
for query in queries:
    filename = get_filename_from_query(query)
    data_dict[filename] = dict()
    for db, _ in QUERY_FN_DICT.items():
        # Assuming the keyword search has been performed already
        data = load_jsonl(os.path.join(root, db, filename))

        # Unstructured matches are aggregated into 6 bins, 1 per year
        # from 2015 to 2020. Sanity check is performed by having 
        # `filtering=True`, removing papers that don't contain all of
        # the keywords in query.
        data_dict[filename][db], filtered = aggregate_paper(
            data, 2015, bins_per_year=1, filtering=True,
            filter_keys=query, return_filtered=True
        )

# Plotting is now very simple
from paperscraper.plotting import plot_comparison

data_keys = [
    'deeplearning_molecule_fingerprint.jsonl',
    'deeplearning_molecule_smiles.jsonl', 
    'deeplearning_molecule_gcn.jsonl'
]
plot_comparison(
    data_dict,
    data_keys,
    title_text="'Deep Learning' AND 'Molecule' AND X",
    keyword_text=['Fingerprint', 'SMILES', 'Graph'],
    figpath='mol_representation'
)
```

![molreps](https://github.com/jannisborn/paperscraper/blob/main/assets/molreps.png?raw=true "MolReps")


#### Venn Diagrams

```py
from paperscraper.plotting import (
    plot_venn_two, plot_venn_three, plot_multiple_venn
)

sizes_2020 = (30842, 14474, 2292, 35476, 1904, 1408, 376)
sizes_2019 = (55402, 11899, 2563)
labels_2020 = ('Medical\nImaging', 'Artificial\nIntelligence', 'COVID-19')
labels_2019 = ['Medical Imaging', 'Artificial\nIntelligence']

plot_venn_two(sizes_2019, labels_2019, title='2019', figname='ai_imaging')
```

![2019](https://github.com/jannisborn/paperscraper/blob/main/assets/ai_imaging.png?raw=true "2019")


```py
plot_venn_three(
    sizes_2020, labels_2020, title='2020', figname='ai_imaging_covid'
)
```

![2020](https://github.com/jannisborn/paperscraper/blob/main/assets/ai_imaging_covid.png?raw=true "2020")

Or plot both together:

```py
plot_multiple_venn(
    [sizes_2019, sizes_2020], [labels_2019, labels_2020], 
    titles=['2019', '2020'], suptitle='Keyword search comparison', 
    gridspec_kw={'width_ratios': [1, 2]}, figsize=(10, 6),
    figname='both'
)
```

![both](https://github.com/jannisborn/paperscraper/blob/main/assets/both.png?raw=true "Both")



## Citation
If you use `paperscraper`, please cite a paper that motivated our development of this tool.

```bib
@article{born2021trends,
  title={Trends in Deep Learning for Property-driven Drug Design},
  author={Born, Jannis and Manica, Matteo},
  journal={Current Medicinal Chemistry},
  volume={28},
  number={38},
  pages={7862--7886},
  year={2021},
  publisher={Bentham Science Publishers}
}
```

## Contributions
Thanks to the following contributors:
- [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
- [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
- [@daenuprobst](https://github.com/daenuprobst): Since `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`).
- [@oppih](https://github.com/oppih): Since `v0.2.3` chemRxiv API also provides DOI and URL if available
- [@lukasschwab](https://github.com/lukasschwab): Bumped `arxiv` dependency to >`1.4.2` in paperscraper `v0.1.0`.
- [@juliusbierk](https://github.com/juliusbierk): Bugfixes

            
