pygetpapers

- Name: pygetpapers
- Version: 1.2.5
- Home page: https://github.com/petermr/pygetpapers
- Summary: Automated Download of Research Papers from various scientific repositories
- Upload time: 2022-12-21 12:33:40
- Author: Ayush Garg
- Requires Python: >=3.5
- License: Apache License
- Keywords: research, automation
- Requirements: coloredlogs, ConfigArgParse, dict2xml, habanero, lxml, pandas, pytest, requests, setuptools, sphinx_rtd_theme, tqdm, xmltodict
            <p align="center">
  <img src="https://raw.githubusercontent.com/petermr/pygetpapers/main/resources/pygetpapers_logo.png" alt="pygetpapers" height="50%" width="50%">
  <h2 align="center">Research Papers right from python</h2>
</p>

# What is pygetpapers


- pygetpapers is a tool to assist text miners. It makes requests to open access scientific text repositories, analyses the hits, and systematically downloads the articles without further interaction.

- Comes with the packages `pygetpapers` and `downloadtools` which provide various functions to download, process and save research papers and their metadata.

- The main medium of its interaction with users is through a command-line interface.

- `pygetpapers` has a modular design which makes maintenance easy and simple. It also makes adding support for more repositories simple.

<br>
<p>
<a href="https://github.com/petermr/pygetpapers/actions"><img src="https://img.shields.io/pypi/dm/pygetpapers" alt="img" width="180px" height="25px"></a>
<a href="https://github.com/petermr/pygetpapers/issues"><img src="https://img.shields.io/github/issues-raw/petermr/pygetpapers?color=blue&style=for-the-badge" alt="img" width="180px" height="30px"></a>
<a href="https://github.com/petermr/pygetpapers/issues?q=is%3Aclosed"><img src="https://img.shields.io/github/issues-closed-raw/petermr/pygetpapers?color=blue&style=for-the-badge" alt="img"  width="180px" height="30px"></a>
<a href="https://github.com/petermr/pygetpapers/commits/main"><img src="https://img.shields.io/github/last-commit/petermr/pygetpapers.svg?color=blue&style=for-the-badge" alt="img" width="180px" height="30px"></a>
<a href="https://github.com/petermr/pygetpapers/stargazers"><img src="https://img.shields.io/github/stars/petermr/pygetpapers.svg?style=social&label=Star&maxAge=2592000" alt="img" width="120px" height="30px"></a>
<a href='https://pygetpapers.readthedocs.io/en/latest/?badge=latest'>
    <img src='https://readthedocs.org/projects/pygetpapers/badge/?version=latest' alt='Documentation Status' height="25px" />
</a>
<a style="border-width:0" href="https://doi.org/10.21105/joss.04451">
  <img src="https://joss.theoj.org/papers/10.21105/joss.04451/status.svg" alt="DOI badge" >
</a>
</p>


<p align="center">
  <img src="https://github-readme-stats.vercel.app/api/pin/?username=petermr&repo=pygetpapers">
</p>

<p>
The developer documentation has been set up at <a href="https://pygetpapers.readthedocs.io/en/latest/#">readthedocs</a>
</p>

## History

`getpapers` is a tool written by Rik Smith-Unna and funded by ContentMine at https://github.com/ContentMine/getpapers. The OpenVirus community required a Python version, and Ayush Garg wrote an implementation from scratch, with some enhancements.

## Formats supported by pygetpapers

- pygetpapers gives fulltexts in XML and PDF formats.
- The metadata for papers can be saved in many formats including JSON, CSV, and HTML.
- Queries can be saved in the form of an INI configuration file.
- Supplementary files for papers can also be downloaded. References and citations for papers are given in XML format.
- Log files can be saved in TXT format.

## Repository Structure

The main code is located in the pygetpapers directory. The supporting modules for the different repositories are located in the pygetpapers/repository directory.

## Architecture

<p align="center">
<img src="https://github.com/petermr/pygetpapers/raw/main/resources/archietecture.png" >
</p>

## About the author and community

`pygetpapers` has been developed by Ayush Garg under the guidance of the OpenVirus community and Peter Murray Rust. Ayush is currently a high school student who believes that the world can only truly progress when knowledge is open and accessible to all.

Testers from OpenVirus have given a lot of useful feedback to Ayush without which this project would not have been possible.

The community has taken time to ensure that everyone can contribute to this project. So, YOU, the developer, reader and researcher can also contribute by testing, developing, and sharing.

# Installation

Ensure that `pip` is installed along with Python. Download Python from https://www.python.org/downloads/ and select the option "Add Python to PATH" while installing.

Check out https://pip.pypa.io/en/stable/installing/ if you have difficulties installing pip. Also, check out https://packaging.python.org/en/latest/tutorials/installing-packages/ to learn more about installing packages in Python.

<hr>

## Method one (recommended):

- Enter the command: `pip install pygetpapers`

- Ensure pygetpapers has been installed by reopening the terminal and typing the command `pygetpapers`

- You should see a help message come up.

<hr>

## Method two (Install Directly From Head):

- Ensure the git CLI is installed and available on your PATH. Check out https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

- Enter the command: `pip install git+https://github.com/petermr/pygetpapers.git`

- Ensure pygetpapers has been installed by reopening the terminal and typing the command `pygetpapers`

- You should see a help message come up.

<hr>

# Usage
`pygetpapers` is a command-line tool. You can ask for help by running:
```
pygetpapers --help
```

```
usage: pygetpapers [-h] [--config CONFIG] [-v] [-q QUERY] [-o OUTPUT]
                   [--save_query] [-x] [-p] [-s] [-z] [--references REFERENCES]
                   [-n] [--citations CITATIONS] [-l LOGLEVEL] [-f LOGFILE]
                   [-k LIMIT] [-r] [-u] [--onlyquery] [-c] [--makehtml]
                   [--synonym] [--startdate STARTDATE] [--enddate ENDDATE]
                   [--terms TERMS] [--notterms NOTTERMS] [--api API]
                   [--filter FILTER]

Welcome to Pygetpapers version 0.0.9.3. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file path to read query for pygetpapers
  -v, --version         output the version number
  -q QUERY, --query QUERY
                        query string transmitted to repository API, e.g.
                        "Artificial Intelligence" or "Plant Parts". To escape
                        special characters within the quotes, use backslash.
                        In case of nested quotes, ensure that the outer quotes
                        are double and the quotes inside are single. For example,
                        `'(LICENSE:"cc by" OR LICENSE:"cc-by") AND
                        METHODS:"transcriptome assembly"'` is wrong. We should
                        instead use `"(LICENSE:'cc by' OR LICENSE:'cc-by') AND
                        METHODS:'transcriptome assembly'"`
  -o OUTPUT, --output OUTPUT
                        output directory (Default: Folder inside current working directory named current date and time)
  --save_query          saves the passed query in a config file
  -x, --xml             download fulltext XMLs if available or save metadata as
                        XML
  -p, --pdf             [E][A] download fulltext PDFs if available (only eupmc
                        and arxiv supported)
  -s, --supp            [E] download supplementary files if available (only eupmc
                        supported)
  -z, --zip             [E] download files from ftp endpoint if available (only
                        eupmc supported)
  --references REFERENCES
                        [E] Download references if available (only eupmc
                        supported). Requires source for references
                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -n, --noexecute       [ALL] report how many results match the query, but don't
                        actually download anything
  --citations CITATIONS
                        [E] Download citations if available (only eupmc
                        supported). Requires source for citations
                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -l LOGLEVEL, --loglevel LOGLEVEL
                        [All] Provide logging level. Example --log warning
                        <<info,warning,debug,error,critical>>, default='info'
  -f LOGFILE, --logfile LOGFILE
                        [All] save log to specified file in output directory as
                        well as printing to terminal
  -k LIMIT, --limit LIMIT
                        [All] maximum number of hits (default: 100)
  -r, --restart         [E] Downloads the missing flags for the corpus. Searches
                        for already existing corpus in the output directory
  -u, --update          [E][B][M][C] Updates the corpus by downloading new
                        papers. Requires -k or --limit (If not provided, default
                        will be used) and -q or --query (must be provided) to be
                        given. Searches for already existing corpus in the output
                        directory
  --onlyquery           [E] Saves json file containing the result of the query in
                        storage (only eupmc supported). The json file can be given
                        to --restart to download the papers later.
  -c, --makecsv         [All] Stores the per-document metadata as csv.
  --makehtml            [All] Stores the per-document metadata as html.
  --synonym             [E] Results contain synonyms as well.
  --startdate STARTDATE
                        [E][B][M] Gives papers starting from given date. Format:
                        YYYY-MM-DD
  --enddate ENDDATE     [E][B][M] Gives papers till given date. Format: YYYY-MM-
                        DD
  --terms TERMS         [All] Location of the file which contains terms
                        separated by a comma, or an ami dict, which will be OR'ed
                        among themselves and AND'ed with the query
  --notterms NOTTERMS   [All] Location of the txt file which contains terms
                        separated by a comma, or an ami dict, which will be OR'ed
                        among themselves and NOT'ed with the query
  --api API             API to search [eupmc,
                        crossref,arxiv,biorxiv,medrxiv,rxivist] (default: eupmc)
  --filter FILTER       [C] filter by key value pair (only crossref supported)
```

Queries are built using the `-q` flag. The query format can be found at http://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf. A condensed guide can be found at https://github.com/petermr/pygetpapers/wiki/query-format

## Repository-specific flags
To convey repository specificity, the first letter of the repository appears in square brackets in a flag's description: [E] for eupmc, [A] for arxiv, [B] for biorxiv, [M] for medrxiv, and [C] for crossref. [All] means the flag applies to every repository.

# What is CProject?
A CProject is a directory structure that the [AMI](https://github.com/petermr/ami3) toolset uses to gather and process data. Each paper gets its own folder.
<img src = "resources/cproject_structure.png">
A CTree is a subdirectory of a CProject that deals with a single paper.
<img src = "resources/PMC_folder_inside.png">

# Tutorial
`pygetpapers` was on version `0.0.9.3` when the tutorials were documented. 

`pygetpapers` supports multiple APIs including eupmc, crossref, arxiv, biorxiv, medrxiv, rxivist-bio, and rxivist-med. By default, it queries EPMC. You can specify the API using the `--api` flag.

You can also follow this [colab notebook](https://colab.research.google.com/drive/18SJ9H4Hm_7Y2rJENXdEhmJMS59Ojm2SK?usp=sharing) as part of the tutorial. 

|Features          |EPMC           |crossref      |arxiv                |biorxiv        |medrxiv        |rxivist        |
|------------------|---------------|--------------|---------------------|---------------|---------------|---------------|
|Fulltext formats  |xml, pdf       |NA            |pdf                  |xml            |xml            |NA             |
|Metadata formats  |json, html, csv|json, xml, csv|json, csv, html, xml |json, csv, html|json, csv, html|json, html, csv|
|`--query`         |yes            |yes           |yes                  |NA             |NA             |NA             |
|`--update`        |yes            |NA            |NA                   |yes            |yes            |yes            |
|`--restart`       |yes            |NA            |NA                   |NA             |NA             |NA             |
|`--citations`     |yes            |NA            |NA                   |NA             |NA             |NA             |
|`--references`    |yes            |NA            |NA                   |NA             |NA             |NA             |
|`--terms`         |yes            |yes           |yes                  |NA             |NA             |NA             |

## EPMC (Default API)
### Example Query
Let's break down the following query:   
```
pygetpapers -q "METHOD: invasive plant species" -k 10 -o "invasive_plant_species_test" -c --makehtml -x --save_query
```

|Flag|What it does|In this case `pygetpapers`...|
|---|---|---|
|`-q`|specifies the query|queries for 'invasive plant species' in the METHODS section|
|`-k`|number of hits (default 100)|limits hits to 10|
|`-o`|specifies output directory|outputs to invasive_plant_species_test|
|`-x`|downloads fulltext xml|saves each paper's `fulltext.xml` to its folder|
|`-c`|saves per-paper metadata into a single csv|saves a single CSV named [`europe_pmc.csv`](resources/invasive_plant_species_test/europe_pmc.csv)|
|`--makehtml`|saves per-paper metadata into a single HTML file|saves a single HTML file named [`eupmc_results.html`](resources/invasive_plant_species_test/eupmc_results.html)|
|`--save_query`|saves the given query in a `config.ini` in output directory|saves query to [`saved_config.ini`](resources/invasive_plant_species_test/saved_config.ini)|


`pygetpapers`, by default, writes metadata to a JSON file within:
- each individual paper's directory (`eupmc_result.json`)
- the working directory, covering all downloaded papers ([`eupmc_results.json`](resources/invasive_plant_species_test/eupmc_results.json))


OUTPUT:
```  
INFO: Final query is METHOD: invasive plant species
INFO: Total Hits are 17910
0it [00:00, ?it/s]WARNING: Keywords not found for paper 1
WARNING: Keywords not found for paper 4
1it [00:00, 164.87it/s]
INFO: Saving XML files to C:\Users\shweata\invasive_plant_species_test\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:21<00:00,  2.11s/it]
```
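
The same example query can also be run from Python via the `run_command` method (its full signature is listed under "Run `pygetpapers` within the module" below). A minimal sketch, assuming the keyword arguments mirror the CLI flags:
```
from pygetpapers import Pygetpapers

# equivalent of:
# pygetpapers -q "METHOD: invasive plant species" -k 10 -o "invasive_plant_species_test" -c --makehtml -x --save_query
pygetpapers_call = Pygetpapers()
pygetpapers_call.run_command(
    query="METHOD: invasive plant species",
    limit=10,
    output="invasive_plant_species_test",
    makecsv=True,
    makehtml=True,
    xml=True,
    save_query=True,
)
```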

### Scope the number of hits for a query 
If you are just scoping the number of hits for a given query, you can use the `-n` flag as shown below.

```
pygetpapers -n -q "essential oil"
```
OUTPUT:
```
INFO: Final query is essential oil
INFO: Total number of hits for the query are 190710
```

### Update an existing CProject with **new papers** by feeding the metadata JSON
The `--update` flag is used to update a CProject with a new set of papers on the same or a different query.
Say you have a corpus of 30 papers on 'essential oil' (like before) and would like to download 20 more papers into the same CProject directory; `--update` does exactly that.

To update your CProject, give the `-o` flag the name of the already existing CProject. Additionally, add the `--update` flag.
INPUT:
```
pygetpapers -q "invasive plant species" -k 10 -x -o lantana_test_5 --update
```
OUTPUT:
```
INFO: Final query is invasive plant species
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Total Hits are 32956
0it [00:00, ?it/s]WARNING: html url not found for paper 5
WARNING: pdf url not found for paper 5
WARNING: Keywords not found for paper 6
WARNING: Keywords not found for paper 7
WARNING: Author list not found for paper 10
1it [00:00, 166.68it/s]
INFO: Saving XML files to C:\Users\shweata\lantana_test_5\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [01:03<00:00,  3.16s/it]
```

#### How is `--update` different from just downloading x number of papers to the same output directory?
Using the `--update` command ensures that you don't overwrite the existing papers.
### Restart downloading papers to an existing CProject
The `--restart` flag can be used for two purposes:
- To download papers in a different format. Let's say you downloaded XMLs in the first round; if you want to download PDFs for the same set of papers, you use this flag.
- To continue the download from the stage where it broke. This comes in handy if you are on a poor connection.

Let's start off by forcefully interrupting the download.
INPUT:
```
pygetpapers -q "pinus" -k 10 -o pinus_10 -x
```
OUTPUT:
```
INFO: Final query is pinus
INFO: Total Hits are 32086
0it [00:00, ?it/s]WARNING: html url not found for paper 10
WARNING: pdf url not found for paper 10
1it [00:00, 63.84it/s]
INFO: Saving XML files to C:\Users\shweata\pinus_10\*\fulltext.xml
 60%|██████████████████████████████████████████████████████████████████████████████▌                                                    | 6/10 [00:20<00:13,  3.42s/it]
Traceback (most recent call last):
...
KeyboardInterrupt
^C
```
If you take a look at the CProject directory, there are 6 papers downloaded so far. 
```
C:.
│   eupmc_results.json
│
├───PMC8157994
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8180188
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8198815
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8216501
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8309040
│       eupmc_result.json
│       fulltext.xml
│
└───PMC8325914
        eupmc_result.json
        fulltext.xml
```
To download the rest, we can use the `--restart` flag.
INPUT
```
pygetpapers -q "pinus" -o pinus_10 --restart -x
```
OUTPUT:
```
INFO: Saving XML files to C:\Users\shweata\pinus_10\*\fulltext.xml
 80%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊                          | 8/10 [00:27<00:07,  3.51s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:40<00:00,  4.01s/it]
```
Now if we inspect the CProject directory, we see that we have 10 papers as specified.
```
C:.
│   eupmc_results.json
│
├───PMC8157994
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8180188
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8198815
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8199922
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8216501
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8309040
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8309338
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8325914
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8399312
│       eupmc_result.json
│       fulltext.xml
│
└───PMC8400686
        eupmc_result.json
        fulltext.xml
```
Under the hood, `pygetpapers` looks for `eupmc_results.json`, reads it, and resumes the download.

You could also use `--restart` to download the fulltext or metadata in a different format from the ones you've already downloaded. For example, if I want all the fulltext PDFs of the 10 papers on `pinus`, I can run:

INPUT: 
```
pygetpapers -q "pinus" -o pinus_10 --restart -p --makehtml
```
OUTPUT:
```
100%|█████████████████████████████████████████████| 10/10 [03:26<00:00, 20.68s/it]
```
Now, if we take a look at the CProject:
```
C:.
│   eupmc_results.json
│
├───PMC8157994
│       eupmc_result.html
│       eupmc_result.json
│       fulltext.pdf
│       fulltext.xml
│
├───PMC8180188
│       eupmc_result.html
│       eupmc_result.json
│       fulltext.pdf
│       fulltext.xml
│
├───PMC8198815
│       eupmc_result.html
│       eupmc_result.json
│       fulltext.pdf
│       fulltext.xml
...
```
We find that each paper now has a fulltext PDF and metadata in HTML.

#### Difference between `--restart` and `--update`
- If you aren't looking to download a new set of papers but want the existing papers in a different format, `--restart` is the flag you'd want to use.
- If you are looking to download a new set of papers into an existing CProject, use `--update`. Note that the format you choose only applies to the new set of papers, not the old ones.
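
Both flags map onto the module API as well. A sketch, assuming the `restart` and `update` keyword arguments behave like their CLI counterparts:
```
from pygetpapers import Pygetpapers

pygetpapers_call = Pygetpapers()
# --update: add 10 new papers to an existing CProject
pygetpapers_call.run_command(query="invasive plant species", limit=10, output="lantana_test_5", xml=True, update=True)
# --restart: fetch PDFs for the papers already in the CProject
pygetpapers_call.run_command(query="pinus", output="pinus_10", restart=True, pdf=True)
```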

### Downloading citations and references for papers, if available
- The `--references` and `--citations` flags can be used to download references and citations respectively.
- Both require a source (AGR, CBA, CTX, ETH, HIR, MED, PAT, PMC, PPR).

   `pygetpapers -q "lantana" -k 10 -o "test" -c -x --citations PMC`
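
The module API exposes the same options. A sketch, assuming the citation source string is passed as the `citations` value just as on the command line:
```
from pygetpapers import Pygetpapers

# equivalent of: pygetpapers -q "lantana" -k 10 -o "test" -c -x --citations PMC
pygetpapers_call = Pygetpapers()
pygetpapers_call.run_command(query="lantana", limit=10, output="test", makecsv=True, xml=True, citations="PMC")
```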

### Downloading only the metadata
If you are looking to download just the metadata in the supported formats, `--onlyquery` is the flag to use. It saves the metadata in the output directory.

You can use the `--restart` feature to download the fulltexts for these papers later.
INPUT:
```
pygetpapers --onlyquery -q "lantana" -k 10 -o "lantana_test" -c
```
OUTPUT:
```
INFO: Final query is lantana
INFO: Total Hits are 1909
0it [00:00, ?it/s]WARNING: html url not found for paper 1
WARNING: pdf url not found for paper 1
WARNING: Keywords not found for paper 2
WARNING: Keywords not found for paper 3
WARNING: Author list not found for paper 5
WARNING: Author list not found for paper 8
WARNING: Keywords not found for paper 9
1it [00:00, 407.69it/s]
```


### Download papers within certain start and end date range
By using `--startdate` and `--enddate` you can specify the date range within which the papers you want to download were first published. 

```
pygetpapers -q "METHOD:essential oil" --startdate "2020-01-02" --enddate "2021-09-09"
```
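
From Python, the date range can be passed through the `startdate` and `enddate` keyword arguments of `run_command` (see the signature below); a minimal sketch:
```
from pygetpapers import Pygetpapers

# only papers first published between the two dates are returned
pygetpapers_call = Pygetpapers()
pygetpapers_call.run_command(query="METHOD:essential oil", startdate="2020-01-02", enddate="2021-09-09")
```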

### Saving query for later use
To save a query for later use, you can use `--save_query`. It saves the query in a `.ini` file in the output directory.
```
pygetpapers -q "lantana" -k 10 -o "lantana_query_config" --save_query
```
[Here](resources/invasive_plant_species_test/saved_config.ini) is an example config file that `pygetpapers` outputs.
### Feed query using `config.ini` file
You can re-run a query using the `config.ini` file created by `--save_query`. To do so, give the `--config` flag the absolute path of the `saved_config.ini` file.

`pygetpapers --config "C:\Users\shweata\lantana_query_config\saved_config.ini"`

### Querying using a term list
#### `--terms` flag
If your query is complex with multiple ORs, you can use the `--terms` feature. To do so, you will:
- Create a `.txt` file with a list of terms separated by commas, or an `ami-dictionary` ([Click here](https://github.com/petermr/tigr2ess/blob/master/dictionaries/TUTORIAL.md) to learn how to create dictionaries).
- Give the `--terms` flag the absolute path of the `.txt` file or the `ami-dictionary` (XML).

`-q` is optional. The terms are OR'ed with each other and AND'ed with the query, if given.

INPUT:
```
pygetpapers -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil" -x  
```
OUTPUT:
```
C:\Users\shweata>pygetpapers -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil"
INFO: Final query is (essential oil AND (antioxidant OR  antibacterial OR  antifungal OR  antiseptic OR  antitrichomonal agent))
INFO: Total Hits are 43397
0it [00:00, ?it/s]WARNING: Author list not found for paper 9
1it [00:00, 1064.00it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:19<00:00,  1.99s/it]
```
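
The same terms-based query can be run from Python. A sketch, assuming the `terms` keyword argument takes the file path just like the CLI flag; the terms file written here is only illustrative:
```
from pathlib import Path
from pygetpapers import Pygetpapers

# hypothetical terms file: a single comma-separated list of terms
terms_file = Path("essential_oil_terms.txt")
terms_file.write_text("antioxidant, antibacterial, antifungal, antiseptic, antitrichomonal agent")

pygetpapers_call = Pygetpapers()
pygetpapers_call.run_command(query="essential oil", terms=str(terms_file), limit=10, output="terms_test_essential_oil", xml=True)
```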

You can also use this feature to download papers by their PMC IDs. Feed the `.txt` file with comma-separated PMC IDs, and make sure to give a large enough hit number to download all the papers specified in the text file.

An example text file can be found [here](resources/PMCID_pygetpapers_text.txt).
INPUT:
```
pygetpapers --terms C:\Users\shweata\PMCID_pygetpapers_text.txt -k 100 -o "PMCID_test"
```

OUTPUT:
```
INFO: Final query is (PMC6856665 OR  PMC6877543 OR  PMC6927906 OR  PMC7008714 OR  PMC7040181 OR  PMC7080866 OR  PMC7082878 OR  PMC7096589 OR  PMC7111464 OR  PMC7142259 OR  PMC7158757 OR  PMC7174509 OR  PMC7193700 OR  PMC7198785 OR  PMC7201129 OR  PMC7203781 OR  PMC7206980 OR  PMC7214627 OR  PMC7214803 OR  PMC7220991
)
INFO: Total Hits are 20
WARNING: Could not find more papers
1it [00:00, 505.46it/s]
100%|█████████████████████████████████████████████| 20/20 [00:32<00:00,  1.61s/it]
```
#### `--notterms`
Excluding papers that contain certain keywords might also be of interest to you. For example, if you want papers on essential oil which don't mention `antibacterial`, `antiseptic` or `antimicrobial`, you can create either a dictionary or a text file with these terms (comma-separated) and give its absolute path to the `--notterms` flag.

INPUT:
```
pygetpapers -q "essential oil" -k 10 -o essential_oil_not_terms_test --notterms C:\Users\shweata\not_terms_test.txt
```
OUTPUT:
```
INFO: Final query is (essential oil AND NOT (antimicrobial OR  antiseptic OR  antibacterial))
INFO: Total Hits are 165557
1it [00:00, ?it/s]
100%|█| 10/10 [00:49<00:00,  4.95s/
```
The number of hits is reduced by some proportion. For comparison, a plain "essential oil" query gives us 193922 hits.
```
C:\Users\shweata>pygetpapers -q "essential oil" -n
INFO: Final query is essential oil
INFO: Total number of hits for the query are 193922
```
#### Using `--terms` with dictionaries
We will take the same example as before. 
- Assuming you have [`ami3`](https://github.com/petermr/ami3) installed, you can create `ami-dictionaries`
    - Start off by listing the terms in a `.txt` file
    ```
    antimicrobial
    antiseptic
    antibacterial
    ```
    - Run the following command from the directory in which the text file exists
    ```
    amidict -v --dictionary pygetpapers_terms --directory pygetpapers_terms --input pygetpapers_terms.txt create --informat list --outformats xml
    ```
That's it! You've now created a simple `ami-dictionary`. There are ways of creating dictionaries from Wikidata as well. You can learn more about how to do that in this [Wiki](https://github.com/petermr/openVirus/wiki/Dictionaries:-creation-from-Wikidata) page. 
- You can also use [standard dictionaries](https://github.com/petermr/dictionary/) that are available. 
- We then pass the absolute path of the dictionary to the `--terms` flag.

INPUT:
```
pygetpapers -q "essential oil" --terms C:\Users\shweata\pygetpapers_terms\pygetpapers_terms.xml -k 10 -o pygetpapers_dictionary_test -x
```
OUTPUT:
```
INFO: Final query is (essential oil AND (antibacterial OR antimicrobial OR antiseptic))
INFO: Total Hits are 28365
0it [00:00, ?it/s]WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 7
1it [00:00, ?it/s]
INFO: Saving XML files to C:\Users\shweata\pygetpapers_dictionary_test\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:36<00:00,  3.67s/it]
```

### Log levels
You can specify the log level using the `-l` flag. The default, as you've seen so far, is `info`.

INPUT:
```
pygetpapers -q "lantana" -k 10 -o lantana_test_10_2 --loglevel debug -x
```
### Log file
You can also choose to write the log to a `.txt` file in the output directory, while simultaneously printing it to the terminal.

INPUT:
```
pygetpapers -q "lantana" -k 10 -o lantana_test_10_4 --loglevel debug -x --logfile test_log.txt
```
[Here](resources/test_log.txt) is the log file. 
## Crossref
You can query the crossref API for metadata only.
### Sample query
- The metadata format flags work as described in the EPMC tutorial.
- `--terms` and `-q` are also applicable to crossref.
INPUT:
```
pygetpapers --api crossref -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil_crossref_3" -x -c --makehtml
```
OUTPUT:
```
INFO: Final query is (essential oil AND (antioxidant OR  antibacterial OR  antifungal OR  antiseptic OR  antitrichomonal agent))
INFO: Making request to crossref
INFO: Got request result from crossref
INFO: Making csv files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 185.52it/s]
INFO: Making html files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 87.98it/s]
INFO: Making xml files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 366.97it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 996.82it/s]
```
We have 10 folders in the CProject directory. 
```
C:\Users\shweata>cd terms_test_essential_oil_crossref_3

C:\Users\shweata\terms_test_essential_oil_crossref_3>tree
Folder PATH listing for volume Windows-SSD
Volume serial number is D88A-559A
C:.
├───10.1016_j.bcab.2021.101913
├───10.1055_s-0029-1234896
├───10.1080_0972060x.2016.1231597
├───10.1080_10412905.1989.9697767
├───10.1111_j.1745-4565.2012.00378.x
├───10.17795_bhs-24733
├───10.23880_oajmms-16000131
├───10.34302_crpjfst_2019.11.2.8
├───10.5220_0008855200960099
└───10.5220_0009957801190122
```
### `--update` 
`--update` works the same as in `EPMC`. You can use this flag to increase the number of papers in a given CProject. 
INPUT
```
pygetpapers --api crossref -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 5 -o "terms_test_essential_oil_crossref_3" -x -c --makehtml --update
```
OUTPUT: 
```
INFO: Final query is (essential oil AND (antioxidant OR  antibacterial OR  antifungal OR  antiseptic OR  antitrichomonal agent))
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making request to crossref
INFO: Got request result from crossref
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████████████████████████████████████████| 5/5 [00:00<00:00, 346.84it/s]
```
The CProject after updating: 
```
C:.
├───10.1002_mbo3.459
├───10.1016_j.bcab.2021.101913
├───10.1055_s-0029-1234896
├───10.1080_0972060x.2014.895156
├───10.1080_0972060x.2016.1231597
├───10.1080_0972060x.2017.1345329
├───10.1080_10412905.1989.9697767
├───10.1080_10412905.2021.1941338
├───10.1111_j.1745-4565.2012.00378.x
├───10.15406_oajs.2019.03.00121
├───10.17795_bhs-24733
├───10.23880_oajmms-16000131
├───10.34302_crpjfst_2019.11.2.8
├───10.5220_0008855200960099
└───10.5220_0009957801190122
```
We started off with 10 paper folders, and increased the number to 15. 
### Filter
The `--filter` flag (crossref only) filters results by a key-value pair, as noted in the help text above.
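
Through the module API this maps onto the `filter` keyword argument of `run_command`. A sketch; the `"type:journal-article"` value is an assumed Crossref-style filter string, so check the Crossref API documentation for the filters your version supports:
```
from pygetpapers import Pygetpapers

pygetpapers_call = Pygetpapers()
# "type:journal-article" is an assumed key:value filter string, not confirmed by this README
pygetpapers_call.run_command(api="crossref", query="essential oil", limit=10, output="crossref_filter_test", makecsv=True, filter="type:journal-article")
```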

## arxiv
`pygetpapers` allows you to query `arxiv` for fulltext PDFs and metadata in all supported formats.
### Sample query
INPUT
```
pygetpapers --api arxiv -k 10 -o arxiv_test_3 -q "artificial intelligence" -x -p --makehtml -c
```
OUTPUT
```

INFO: Final query is artificial intelligence
INFO: Making request to Arxiv through pygetpapers
INFO: Got request result from Arxiv through pygetpapers
INFO: Requesting 10 results at offset 0
INFO: Requesting page of results
INFO: Got first page; 10 of 10 results available
INFO: Downloading Pdfs for papers
100%|█████████████████████████████████████████████| 10/10 [01:02<00:00,  6.27s/it]
INFO: Making csv files for metadata at C:\Users\shweata\arxiv_test_3
100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 187.31it/s]
INFO: Making html files for metadata at C:\Users\shweata\arxiv_test_3
100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 161.87it/s]
INFO: Making xml files for metadata at C:\Users\shweata\arxiv_test_3
100%|█████████████████████████████████████████████████████| 10/10 [00:00<?, ?it/s]
100%|███████████████████████████████████████████| 10/10 [00:00<00:00, 1111.22it/s]
```
Note: `--update` isn't supported for `arxiv`
## Biorxiv and Medrxiv
You can query `biorxiv` and `medrxiv` for fulltext and metadata (in all supported formats). However, passing a query string using the `-q` flag isn't supported for either repository; you can only provide a date range.
### Sample Query - biorxiv
INPUT:
```
pygetpapers --api biorxiv -k 10 -x --startdate 2021-01-01 -o biorxiv_test_20210831
```
OUTPUT:
```
WARNING: Currently biorxiv api is malfunctioning and returning wrong DOIs
INFO: Making Request to rxiv
INFO: Making xml for paper
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:23<00:00,  2.34s/it]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biorxiv_test_20210831
100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 684.72it/s]
```
### `--update` command
INPUT
```
pygetpapers --api biorxiv -k 10 -x --startdate 2021-01-01 -o biorxiv_test_20210831 --update
```
OUTPUT
```
WARNING: Currently biorxiv api is malfunctioning and returning wrong DOIs
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making Request to rxiv
INFO: Making xml for paper
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:22<00:00,  2.23s/it]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biorxiv_test_20210831
100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 492.39it/s]
```
After updating, the CProject has 20 papers in total.
```
├───10.1101_008326
├───10.1101_010553
├───10.1101_035972
├───10.1101_046052
├───10.1101_060012
├───10.1101_067736
├───10.1101_086710
├───10.1101_092205
├───10.1101_092619
├───10.1101_093237
├───10.1101_121061
├───10.1101_135749
├───10.1101_145664
├───10.1101_145896
├───10.1101_165845
├───10.1101_180273
├───10.1101_181198
├───10.1101_191858
├───10.1101_194266
└───10.1101_196105
```
`medrxiv` works the same way as `biorxiv`.
## rxivist
`rxivist` lets you specify a query string to both `biorxiv` and `medrxiv`. The results you get will be a mixture of papers from both repositories, since `rxivist` doesn't differentiate.

Another caveat here is that you can only retrieve metadata from `rxivist`. 

INPUT:
```
pygetpapers --api rxivist -q "biomedicine" -k 10 -c -x -o "biomedicine_rxivist" --makehtml -p
```
OUTPUT:
```
WARNING: Pdf is not supported for this api
INFO: Final query is biomedicine
INFO: Making Request to rxivist
INFO: Making csv files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 125.54it/s]
INFO: Making html files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 124.71it/s]
INFO: Making xml files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 633.38it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 751.09it/s]
```
### Query hits only 
Like the other repositories under `pygetpapers`, you can use the `-n` flag to get only the number of hits.
INPUT: 
```
C:\Users\shweata>pygetpapers --api rxivist -q "biomedical sciences" -n
```
OUTPUT:
```
INFO: Final query is biomedical sciences
INFO: Making Request to rxivist
INFO: Total number of hits for the query are 62
```
### Update
`--update` works the same as for the other repositories. Make sure to provide `rxivist` as the api.

INPUT: 
```
pygetpapers --api rxivist -q "biomedical sciences" -k 20 -c -x -o "biomedicine_rxivist" --update
```
OUTPUT:
```
INFO: Final query is biomedical sciences
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making Request to rxivist
INFO: Making csv files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 203.69it/s]
INFO: Making xml files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1059.17it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biomedicine_rxivist
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1077.12it/s]
```
## Run `pygetpapers` within the module

```
def run_command(output=False, query=False, save_query=False, xml=False, pdf=False, supp=False, zip=False, references=False, noexecute=False, citations=False, limit=100, restart=False, update=False, onlyquery=False, makecsv=False, makehtml=False, synonym=False, startdate=False, enddate=False, terms=False, notterms=False, api='europe_pmc', filter=None, loglevel='info', logfile=False, version=False)
```

Here's an example script to download 50 papers from EPMC on 'lantana camara'.

```
from pygetpapers import Pygetpapers

pygetpapers_call = Pygetpapers()
pygetpapers_call.run_command(query='lantana camara', limit=50, output='lantana_camara', xml=True)
```
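
You can also scope hits from Python without downloading anything, mirroring the `-n` flag:
```
from pygetpapers import Pygetpapers

pygetpapers_call = Pygetpapers()
# report how many papers match the query, but download nothing
pygetpapers_call.run_command(query='lantana camara', noexecute=True)
```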

## Test `pygetpapers` 
To run automated testing on `pygetpapers`, do the following:
1) Install `pygetpapers`
2) Clone the `pygetpapers` repository
3) Install pytest
4) Run the command `pytest` from the repository root

# Contributions

https://github.com/petermr/pygetpapers/blob/main/resources/CONTRIBUTING.md

# Feature Requests

To request features, please file them as [issues](https://github.com/petermr/pygetpapers/issues).

# Legal Implications

If you use `pygetpapers`, you should be careful to understand the law as it applies to your content mining, as you assume full responsibility for your actions when using the software.

## Countries with copyright exceptions for content mining:

- UK
- Japan

## Countries with proposed copyright exceptions:

- Ireland
- EU countries 

## Countries with permissive interpretations of 'fair use' that might allow content mining:

- Israel
- USA
- Canada

## General summaries and guides:

- _"The legal framework of text and data mining (TDM)"_, carried out for the European Commission in March 2014 ([PDF](http://ec.europa.eu/internal_market/copyright/docs/studies/1403_study2_en.pdf))
- _"Standardisation in the area of innovation and technological development, notably in the field of Text and Data Mining"_, carried out for the European Commission in 2014 ([PDF](http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_expert_group-042014.pdf))





            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/petermr/pygetpapers",
    "name": "pygetpapers",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.5",
    "maintainer_email": "",
    "keywords": "research automation",
    "author": "Ayush Garg",
    "author_email": "ayush@science.org.in",
    "download_url": "",
    "platform": null,
    "description": "<p align=\"center\">\n  <img src=\"https://raw.githubusercontent.com/petermr/pygetpapers/main/resources/pygetpapers_logo.png\" alt=\"pygetpapers\" height=\"50%\" width=\"50%\">\n  <h2 align=\"center\">Research Papers right from python</h2>\n</p>\n\n# What is pygetpapers\n\n\n- pygetpapers is a tool to assist text miners. It makes requests to open access scientific text repositories, analyses the hits, and systematically downloads the articles without further interaction.\n\n- Comes with the packages `pygetpapers` and `downloadtools` which provide various functions to download, process and save research papers and their metadata.\n\n- The main medium of its interaction with users is through a command-line interface.\n\n- `pygetpapers` has a modular design which makes maintenance easy and simple. This also allows adding support for more repositories simple.\n\n<br>\n<p>\n<a href=\"https://github.com/petermr/pygetpapers/actions\"><img src=\"https://img.shields.io/pypi/dm/pygetpapers\" alt=\"img\" width=\"180px\" height=\"25px\"></a>\n<a href=\"https://github.com/petermr/pygetpapers/issues\"><img src=\"https://img.shields.io/github/issues-raw/petermr/pygetpapers?color=blue&style=for-the-badge\" alt=\"img\" width=\"180px\" height=\"30px\"></a>\n<a href=\"https://github.com/petermr/pygetpapers/issues?q=is%3Aclosed\"><img src=\"https://img.shields.io/github/issues-closed-raw/petermr/pygetpapers?color=blue&style=for-the-badge\" alt=\"img\"  width=\"180px\" height=\"30px\"></a>\n<a href=\"https://github.com/petermr/pygetpapers/commits/main\"><img src=\"https://img.shields.io/github/last-commit/petermr/pygetpapers.svg?color=blue&style=for-the-badge\" alt=\"img\" width=\"180px\" height=\"30px\"></a>\n<a href=\"https://github.com/petermr/pygetpapers/stargazers\"><img src=\"https://img.shields.io/github/stars/petermr/pygetpapers.svg?style=social&label=Star&maxAge=2592000\" alt=\"img\" width=\"120px\" height=\"30px\"></a>\n<a href='https://pygetpapers.readthedocs.io/en/latest/?badge=latest'>\n    <img src='https://readthedocs.org/projects/pygetpapers/badge/?version=latest' alt='Documentation Status' height=\"25px\" />\n</a>\n<a style=\"border-width:0\" href=\"https://doi.org/10.21105/joss.04451\">\n  <img src=\"https://joss.theoj.org/papers/10.21105/joss.04451/status.svg\" alt=\"DOI badge\" >\n</a>\n</p>\n\n\n<p align=\"center\">\n  <img src=\"https://github-readme-stats.vercel.app/api/pin/?username=petermr&repo=pygetpapers\">\n</p>\n\n<p>\nThe developer documentation has been setup at <a href=\"https://pygetpapers.readthedocs.io/en/latest/#\">readthedocs</a>\n</p>\n\n## History\n\n`getpapers` is a tool written by Rik Smith-Unna funded by ContentMine at https://github.com/ContentMine/getpapers. The OpenVirus community requires a Python version and Ayush Garg has written an implementation from scratch, with some enhancements.\n\n## Formats supported by pygetpapers\n\n- pygetpapers gives fulltexts in xml and pdf format. \n- The metadata for papers can be saved in many formats including JSON, CSV, HTML.\n- Queries can be saved in form of an ini configuration file. \n- The additional files for papers can also be downloaded. References and citations for papers are given in XML format. \n- Log files can be saved in txt format.\n\n## Repository Structure\n\nThe main code is located in the pygetpapers directory. 
All the supporting modules for different repositories are described in the pygetpapers/repository directory.\n\n## Architecture\n\n<p align=\"center\">\n<img src=\"https://github.com/petermr/pygetpapers/raw/main/resources/archietecture.png\" >\n</p>\n\n## About the author and community\n\n`pygetpapers` has been developed by Ayush Garg under the dear guidance of the OpenVirus community and Peter Murray Rust. Ayush is currently a high school student who believes that the world can only truly progress when knowledge is open and accessible by all.\n\nTesters from OpenVirus have given a lot of useful feedback to Ayush without which this project would not have been possible.\n\nThe community has taken time to ensure that everyone can contribute to this project. So, YOU, the developer, reader and researcher can also contribute by testing, developing, and sharing.\n\n# Installation\n\nEnsure that `pip` is installed along with python. Download python from: https://www.python.org/downloads/ and select the option Add Python to Path while installing.\n\nCheck out https://pip.pypa.io/en/stable/installing/ if difficulties installing pip. Also, checkout https://packaging.python.org/en/latest/tutorials/installing-packages/ to learn more about installing packages in `python`.\n\n<hr>\n\n## Method one (recommended):\n\n- Enter the command: `pip install pygetpapers`\n\n- Ensure pygetpapers has been installed by reopening the terminal and typing the command `pygetpapers`\n\n- You should see a help message come up.\n\n<hr>\n\n## Method two (Install Directly From Head):\n\n- Ensure git cli is installed and is available in path. Check out (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)\n\n- Enter the command: `pip install git+https://github.com/petermr/pygetpapers.git`\n\n- Ensure pygetpapers has been installed by reopening the terminal and typing the command `pygetpapers`\n\n- You should see a help message come up.\n\n<hr>\n\n# Usage\n`pygetpapers` is a commandline tool. You can ask for help by running:\n```\npygetpapers --help\n```\n\n```\nusage: pygetpapers [-h] [--config CONFIG] [-v] [-q QUERY] [-o OUTPUT]\n                   [--save_query] [-x] [-p] [-s] [-z] [--references REFERENCES]\n                   [-n] [--citations CITATIONS] [-l LOGLEVEL] [-f LOGFILE]\n                   [-k LIMIT] [-r] [-u] [--onlyquery] [-c] [--makehtml]\n                   [--synonym] [--startdate STARTDATE] [--enddate ENDDATE]\n                   [--terms TERMS] [--notterms NOTTERMS] [--api API]\n                   [--filter FILTER]\n\nWelcome to Pygetpapers version 0.0.9.3. -h or --help for help\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --config CONFIG       config file path to read query for pygetpapers\n  -v, --version         output the version number\n  -q QUERY, --query QUERY\n                        query string transmitted to repository API. Eg.\n                        \"Artificial Intelligence\" or \"Plant Parts\". To escape\n                        special characters within the quotes, use backslash.\n                        Incase of nested quotes, ensure that the initial quotes\n                        are double and the qutoes inside are single. For eg:\n                        `'(LICENSE:\"cc by\" OR LICENSE:\"cc-by\") AND\n                        METHODS:\"transcriptome assembly\"' ` is wrong. 
We should\n                        instead use `\"(LICENSE:'cc by' OR LICENSE:'cc-by') AND\n                        METHODS:'transcriptome assembly'\"`\n  -o OUTPUT, --output OUTPUT\n                        output directory (Default: Folder inside current working directory named current date and time)\n  --save_query          saved the passed query in a config file\n  -x, --xml             download fulltext XMLs if available or save metadata as\n                        XML\n  -p, --pdf             [E][A] download fulltext PDFs if available (only eupmc\n                        and arxiv supported)\n  -s, --supp            [E] download supplementary files if available (only eupmc\n                        supported)\n  -z, --zip             [E] download files from ftp endpoint if available (only\n                        eupmc supported)\n  --references REFERENCES\n                        [E] Download references if available. (only eupmc\n                        supported)Requires source for references\n                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).\n  -n, --noexecute       [ALL] report how many results match the query, but don't\n                        actually download anything\n  --citations CITATIONS\n                        [E] Download citations if available (only eupmc\n                        supported). Requires source for citations\n                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).\n  -l LOGLEVEL, --loglevel LOGLEVEL\n                        [All] Provide logging level. Example --log warning\n                        <<info,warning,debug,error,critical>>, default='info'\n  -f LOGFILE, --logfile LOGFILE\n                        [All] save log to specified file in output directory as\n                        well as printing to terminal\n  -k LIMIT, --limit LIMIT\n                        [All] maximum number of hits (default: 100)\n  -r, --restart         [E] Downloads the missing flags for the corpus.Searches\n                        for already existing corpus in the output directory\n  -u, --update          [E][B][M][C] Updates the corpus by downloading new\n                        papers. Requires -k or --limit (If not provided, default\n                        will be used) and -q or --query (must be provided) to be\n                        given. Searches for already existing corpus in the output\n                        directory\n  --onlyquery           [E] Saves json file containing the result of the query in\n                        storage. (only eupmc supported)The json file can be given\n                        to --restart to download the papers later.\n  -c, --makecsv         [All] Stores the per-document metadata as csv.\n  --makehtml            [All] Stores the per-document metadata as html.\n  --synonym             [E] Results contain synonyms as well.\n  --startdate STARTDATE\n                        [E][B][M] Gives papers starting from given date. Format:\n                        YYYY-MM-DD\n  --enddate ENDDATE     [E][B][M] Gives papers till given date. 
Format: YYYY-MM-\n                        DD\n  --terms TERMS         [All] Location of the file which contains terms\n                        serperated by a comma or an ami dict which will beOR'ed\n                        among themselves and AND'ed with the query\n  --notterms NOTTERMS   [All] Location of the txt file which contains terms\n                        serperated by a comma or an ami dict which will beOR'ed\n                        among themselves and NOT'ed with the query\n  --api API             API to search [eupmc,\n                        crossref,arxiv,biorxiv,medrxiv,rxivist] (default: eupmc)\n  --filter FILTER       [C] filter by key value pair (only crossref supported)\n```\n\nQueries are build using `-q` flag. The query format can be found at http://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf A condensed guide can be found at https://github.com/petermr/pygetpapers/wiki/query-format\n\n## Repository-specific flags\nTo convey the repository specificity, we've used the first letter of the repository in square brackets in its description. \n\n# What is CProject?\nA CProject is a directory structure that the [AMI](https://github.com/petermr.ami3) toolset uses to gather and process data. Each paper gets its folder. \n<img src = \"resources/cproject_structure.png\">\nA CTree is a subdirectory of a CProject that deals with a single paper.\n<img src = \"resources/PMC_folder_inside.png\">\n\n# Tutorial\n`pygetpapers` was on version `0.0.9.3` when the tutorials were documented. \n\n`pygetpapers` supports multiple APIs including eupmc, crossref,arxiv,biorxiv,medrxiv,rxivist-bio,rxivist-med. By default, it queries EPMC. You can specify the API by using `--api` flag.  \n\nYou can also follow this [colab notebook](https://colab.research.google.com/drive/18SJ9H4Hm_7Y2rJENXdEhmJMS59Ojm2SK?usp=sharing) as part of the tutorial. 
\n\n|Features        |EPMC           |crossref      |arxiv|biorxiv        |medarxiv|rxvist|\n|----------------|---------------|--------------|-----|---------------|--------|------|\n|Fulltext formats|xml, pdf       |NA            |pdf   |xml            |xml     |xml   |\n|Metdata formats |json, html, csv|json, xml, csv|json, csv, html, xml |json, csv, html|json, csv, html    |json, html, csv  |\n|`--query`       |yes            |yes           |yes  |NA             |NA      |NA    |\n|`--update`      |yes            |NA            |NA   |yes            |yes     |      |\n|`--restart`     |yes            |NA          |NA     |NA               |NA        |NA      |\n|`--citation`    |yes            |NA            |NA   |NA             |NA     |NA    |\n|`--references`  |yes            |NA            |NA   |NA             |NA     |NA    |\n|`--terms`       |yes            |yes           |yes  |NA             |NA      |NA    |\n\n## EPMC (Default API)\n### Example Query\nLet's break down the following query:   \n```\npygetpapers -q \"METHOD: invasive plant species\" -k 10 -o \"invasive_plant_species_test\" -c --makehtml -x --save_query\n```\n\n|Flag|What it does|In this case `pygetpapers`...|\n|---|---|---|\n|`-q`|specifies the query|queries for 'invasive plant species' in METHODS section|\n|`-k`|number of hits (default 100)|limits hits to 10|\n|`-o`|specifies output directory|outputs to invasive_plant_species_test|\n|`-x`|downloads fulltext xml||\n|`-c`|saves per-paper metadata into a single csv|saves single  CSV named [`europe_pmc.csv`](resources/invasive_plant_species_test/europe_pmc.csv)|\n|`--makehtml`|saves per-paper metadata into a single HTML file|saves single HTML named [`europe_pmc.html`](resources/invasive_plant_species_test/eupmc_results.html)|\n|`--save_query`|saves the given query in a `config.ini` in output directory|saves query to [`saved_config.ini`](resources/invasive_plant_species_test/saved_config.ini)|\n\n\n`pygetpapers`, by default, writes metadata to a JSON file within:\n- individual paper directory for corresponding paper (`epmc_result.json`)\n- working directory for all downloaded papers ([`epmc_results.json`](resources/invasiv_plant_species_test/eupmc_results.json))\n\n\nOUTPUT:\n```  \nINFO: Final query is METHOD: invasive plant species\nINFO: Total Hits are 17910\n0it [00:00, ?it/s]WARNING: Keywords not found for paper 1\nWARNING: Keywords not found for paper 4\n1it [00:00, 164.87it/s]\nINFO: Saving XML files to C:\\Users\\shweata\\invasive_plant_species_test\\*\\fulltext.xml\n100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 10/10 [00:21<00:00,  2.11s/it]\n```\n\n### Scope the number of hits for a query \nIf you are just scoping the number of hits for a given query, you can use `-n` flag as shown below. 
\n\n   ```\n   pygetpapers -n -q \"essential oil\"\n   ```\nOUTPUT:\n```\nINFO: Final query is essential oil\nINFO: Total number of hits for the query are 190710\n```\n\n### Update an existing CProject with **new papers** by feeding the metadata JSON\nThe `--update` command is used to update a CProject with a new set of papers on same or different query. \nIf let's say you have a corpus of a 30 papers on 'essential oil' (like before) and would like to download 20 more papers to the same CProject directory, you use `--update` command. \n\nTo update your Cproject, you would give it the `-o` flag the already existing CProject name. Additionally, you should also add `--update ` flag. \nINPUT:\n```\npygetpapers -q \"invasive plant species\" -k 10 -x -o lantana_test_5 --update\n```\nOUTPUT:\n```\nINFO: Final query is invasive plant species\nINFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors\nINFO: Total Hits are 32956\n0it [00:00, ?it/s]WARNING: html url not found for paper 5\nWARNING: pdf url not found for paper 5\nWARNING: Keywords not found for paper 6\nWARNING: Keywords not found for paper 7\nWARNING: Author list not found for paper 10\n1it [00:00, 166.68it/s]\nINFO: Saving XML files to C:\\Users\\shweata\\lantana_test_5\\*\\fulltext.xml\n100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 20/20 [01:03<00:00,  3.16s/it]\n```\n\n#### How is `--update` different from just downloading x number of papers to the same output directory?\nBy using `--update` command you can be sure that you don't overwrite the existing papers. \n### Restart downloading papers to an existing CProject\n`--restart` flag can be used for two purposes:\n- To download papers in different format. Let's say you downloaded XMLs in the first round. If you want to download pdfs for same set of papers, you use this flag. \n- Continue the download from the stage where it broke. This feature would particularly come in handy if you are on poor lines. \nLet's start off by forcefully interrupting the download.   
### Update an existing CProject with **new papers** by feeding the metadata JSON
The `--update` flag adds a new set of papers, on the same or a different query, to an existing CProject.
Say you have a corpus of 30 papers on 'essential oil' (like before) and would like to download 20 more papers into the same CProject directory: this is what `--update` is for.

To update your CProject, pass the name of the existing CProject to the `-o` flag and add the `--update` flag.
INPUT:
```
pygetpapers -q "invasive plant species" -k 10 -x -o lantana_test_5 --update
```
OUTPUT:
```
INFO: Final query is invasive plant species
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Total Hits are 32956
0it [00:00, ?it/s]WARNING: html url not found for paper 5
WARNING: pdf url not found for paper 5
WARNING: Keywords not found for paper 6
WARNING: Keywords not found for paper 7
WARNING: Author list not found for paper 10
1it [00:00, 166.68it/s]
INFO: Saving XML files to C:\Users\shweata\lantana_test_5\*\fulltext.xml
100%|██████████| 20/20 [01:03<00:00,  3.16s/it]
```

#### How is `--update` different from just downloading x number of papers to the same output directory?
With `--update`, you can be sure that you don't overwrite the papers that are already there.

### Restart downloading papers to an existing CProject
The `--restart` flag can be used for two purposes:
- To download papers in a different format. Say you downloaded XMLs in the first round; if you then want PDFs for the same set of papers, you use this flag.
- To continue a download from the point where it broke off. This comes in handy if you are on a poor connection.

Let's start off by forcefully interrupting a download.
INPUT:
```
pygetpapers -q "pinus" -k 10 -o pinus_10 -x
```
OUTPUT:
```
INFO: Final query is pinus
INFO: Total Hits are 32086
0it [00:00, ?it/s]WARNING: html url not found for paper 10
WARNING: pdf url not found for paper 10
1it [00:00, 63.84it/s]
INFO: Saving XML files to C:\Users\shweata\pinus_10\*\fulltext.xml
 60%|██████    | 6/10 [00:20<00:13,  3.42s/it]
Traceback (most recent call last):
...
KeyboardInterrupt
^C
```
If you take a look at the CProject directory, there are 6 papers downloaded so far.
```
C:.
│   eupmc_results.json
│
├───PMC8157994
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8180188
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8198815
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8216501
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8309040
│       eupmc_result.json
│       fulltext.xml
│
└───PMC8325914
        eupmc_result.json
        fulltext.xml
```
To download the rest, we can use the `--restart` flag.
INPUT:
```
pygetpapers -q "pinus" -o pinus_10 --restart -x
```
OUTPUT:
```
INFO: Saving XML files to C:\Users\shweata\pinus_10\*\fulltext.xml
100%|██████████| 10/10 [00:40<00:00,  4.01s/it]
```
Now if we inspect the CProject directory, we see that we have 10 papers as specified.
```
C:.
│   eupmc_results.json
│
├───PMC8157994
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8180188
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8198815
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8199922
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8216501
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8309040
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8309338
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8325914
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8399312
│       eupmc_result.json
│       fulltext.xml
│
└───PMC8400686
        eupmc_result.json
        fulltext.xml
```
Under the hood, `pygetpapers` looks for `eupmc_results.json`, reads it, and resumes the download.
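The idea behind the resumption can be pictured with the sketch below. This is an illustration only, not `pygetpapers`' actual code: a paper whose directory already holds the requested fulltext does not need to be fetched again.

```
from pathlib import Path

cproject = Path("pinus_10")

# pygetpapers reads the corpus-level eupmc_results.json to know which
# papers belong to the corpus; here we only approximate the second step:
# a paper directory that already holds fulltext.xml is treated as done
pending = [d.name for d in cproject.iterdir()
           if d.is_dir() and not (d / "fulltext.xml").exists()]
print(f"{len(pending)} paper(s) still to download:", pending)
```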
You can also use `--restart` to download the fulltext or metadata in formats other than the ones you've already downloaded. For example, if I want fulltext PDFs for all 10 papers on `pinus`, I can run:

INPUT:
```
pygetpapers -q "pinus" -o pinus_10 --restart -p --makehtml
```
OUTPUT:
```
>pygetpapers -q "pinus" -o pinus_10 --restart -p --makehtml
100%|██████████| 10/10 [03:26<00:00, 20.68s/it]
```
Now, if we take a look at the CProject:
```
C:.
│   eupmc_results.json
│
├───PMC8157994
│       eupmc_result.html
│       eupmc_result.json
│       fulltext.pdf
│       fulltext.xml
│
├───PMC8180188
│       eupmc_result.html
│       eupmc_result.json
│       fulltext.pdf
│       fulltext.xml
│
├───PMC8198815
│       eupmc_result.html
│       eupmc_result.json
│       fulltext.pdf
│       fulltext.xml
...
```
We find that each paper now has a fulltext PDF and metadata in HTML.

#### Difference between `--restart` and `--update`
- If you aren't looking to download a new set of papers but want existing papers in a different format, `--restart` is the flag to use.
- If you are looking to download a new set of papers into an existing CProject, use `--update`. Note that the format in which you download papers applies only to the new set of papers, not to the old ones.

### Downloading citations and references for papers, if available
- The `--citation` and `--references` flags download the citations and references, respectively.
- Both require a source (AGR, CBA, CTX, ETH, HIR, MED, PAT, PMC, PPR).

   `pygetpapers -q "lantana" -k 10 -o "test" -c -x --citation PMC`

### Downloading only the metadata
If you want to download just the metadata in the supported formats, `--onlyquery` is the flag to use. It saves the metadata in the output directory.

You can then use the `--restart` feature to download the fulltexts for these papers.
INPUT:
```
pygetpapers --onlyquery -q "lantana" -k 10 -o "lantana_test" -c
```
OUTPUT:
```
INFO: Final query is lantana
INFO: Total Hits are 1909
0it [00:00, ?it/s]WARNING: html url not found for paper 1
WARNING: pdf url not found for paper 1
WARNING: Keywords not found for paper 2
WARNING: Keywords not found for paper 3
WARNING: Author list not found for paper 5
WARNING: Author list not found for paper 8
WARNING: Keywords not found for paper 9
1it [00:00, 407.69it/s]
```

### Download papers within a certain start and end date range
Using `--startdate` and `--enddate`, you can specify the date range within which the papers you want to download were first published.

```
pygetpapers -q "METHOD:essential oil" --startdate "2020-01-02" --enddate "2021-09-09"
```
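From Python, the same date filtering might look like the sketch below; it assumes the `startdate`/`enddate` parameters of `run_command` (see the signature near the end of this document) accept the same `YYYY-MM-DD` strings as the CLI flags, and the output directory name is made up for the example.

```
from pygetpapers import Pygetpapers

# startdate/enddate assumed to take the same YYYY-MM-DD strings as the CLI
Pygetpapers().run_command(
    query="METHOD:essential oil",
    startdate="2020-01-02",
    enddate="2021-09-09",
    limit=100,
    output="essential_oil_2020_2021",  # hypothetical directory name
    xml=True,
)
```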
\n```\npygetpapers -q \"lantana\" -k 10 -o \"lantana_query_config\"--save_query\n```\n[Here](resources/invasive_plant_species_test/saved_config.ini) is an example config file `pygetpapers` outputs\n### Feed query using `config.ini` file\nUsing can use the `config.ini` file you created using `--save_query`, you re-run the query. To do so, you will give `--config` flag the absolute path of the `saved_config.ini` file.\n\n`pygetpapers --config \"C:\\Users\\shweata\\lantana_query_config\\saved_config.ini\"`\n\n### Querying using a term list\n#### `--terms` flag\nIf your query is complex with multiple ORs, you can use `--terms` feature. To do so, you will:\n- Create a `.txt` file with list of terms separated by commas or an `ami-dictionary` ([Click here](https://github.com/petermr/tigr2ess/blob/master/dictionaries/TUTORIAL.md) to learn how to create dictionaries). \n- Give the `--terms` flag the absolute path of the `.txt` file or `ami-dictionary` (XML)\n\n`-q` is optional.The terms would be OR'ed with each other ANDed with the query, if given. \n\nINPUT:\n```\npygetpapers -q \"essential oil\" --terms C:\\Users\\shweata\\essential_oil_terms.txt -k 10 -o \"terms_test_essential_oil\" -x  \n```\nOUTPUT:\n```\nC:\\Users\\shweata>pygetpapers -q \"essential oil\" --terms C:\\Users\\shweata\\essential_oil_terms.txt -k 10 -o \"terms_test_essential_oil\"\nINFO: Final query is (essential oil AND (antioxidant OR  antibacterial OR  antifungal OR  antiseptic OR  antitrichomonal agent))\nINFO: Total Hits are 43397\n0it [00:00, ?it/s]WARNING: Author list not found for paper 9\n1it [00:00, 1064.00it/s]\n100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 10/10 [00:19<00:00,  1.99s/it]\n```\n\nYou can also use this feature to download papers by using the PMC Ids. You can feed the `.txt` file with PMC ids comman-separated. Make sure to give a large enough hit number to download all the papers specified in the text file. 
You can also use this feature to download papers by their PMC IDs: feed the `.txt` file with comma-separated PMC IDs. Make sure to give a hit number large enough to cover all the papers specified in the text file.

An example text file can be found [here](resources/PMCID_pygetpapers_text.txt).
INPUT:
```
pygetpapers --terms C:\Users\shweata\PMCID_pygetpapers_text.txt -k 100 -o "PMCID_test"
```

OUTPUT:
```
INFO: Final query is (PMC6856665 OR  PMC6877543 OR  PMC6927906 OR  PMC7008714 OR  PMC7040181 OR  PMC7080866 OR  PMC7082878 OR  PMC7096589 OR  PMC7111464 OR  PMC7142259 OR  PMC7158757 OR  PMC7174509 OR  PMC7193700 OR  PMC7198785 OR  PMC7201129 OR  PMC7203781 OR  PMC7206980 OR  PMC7214627 OR  PMC7214803 OR  PMC7220991
)
INFO: Total Hits are 20
WARNING: Could not find more papers
1it [00:00, 505.46it/s]
100%|██████████| 20/20 [00:32<00:00,  1.61s/it]
```
#### `--notterms`
You may also want to exclude papers that contain certain keywords. For example, if you want papers on essential oil that don't mention `antibacterial`, `antiseptic`, or `antimicrobial`, create either a dictionary or a text file with these terms (comma-separated) and give its absolute path to the `--notterms` flag.

INPUT:
```
pygetpapers -q "essential oil" -k 10 -o essential_oil_not_terms_test --notterms C:\Users\shweata\not_terms_test.txt
```
OUTPUT:
```
INFO: Final query is (essential oil AND NOT (antimicrobial OR  antiseptic OR  antibacterial))
INFO: Total Hits are 165557
1it [00:00, ?it/s]
100%|██████████| 10/10 [00:49<00:00,  4.95s/it]
```
The number of hits is noticeably reduced. For comparison, the plain "essential oil" query gives 193922 hits.
```
C:\Users\shweata>pygetpapers -q "essential oil" -n
INFO: Final query is essential oil
INFO: Total number of hits for the query are 193922
```
#### Using `--terms` with dictionaries
We will take the same example as before.
- Assuming you have [`ami3`](https://github.com/petermr/ami3) installed, you can create `ami-dictionaries`:
    - Start off by listing the terms in a `.txt` file
    ```
    antimicrobial
    antiseptic
    antibacterial
    ```
    - Run the following command from the directory in which the text file exists
    ```
    amidict -v --dictionary pygetpapers_terms --directory pygetpapers_terms --input pygetpapers_terms.txt create --informat list --outformats xml
    ```
That's it! You've now created a simple `ami-dictionary`. There are ways of creating dictionaries from Wikidata as well; you can learn more in this [Wiki](https://github.com/petermr/openVirus/wiki/Dictionaries:-creation-from-Wikidata) page.
- You can also use the [standard dictionaries](https://github.com/petermr/dictionary/) that are available.

We then pass the absolute path of the dictionary to the `--terms` flag.
INPUT:
```
pygetpapers -q "essential oil" --terms C:\Users\shweata\pygetpapers_terms\pygetpapers_terms.xml -k 10 -o pygetpapers_dictionary_test -x
```
OUTPUT:
```
INFO: Final query is (essential oil AND (antibacterial OR antimicrobial OR antiseptic))
INFO: Total Hits are 28365
0it [00:00, ?it/s]WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 7
1it [00:00, ?it/s]
INFO: Saving XML files to C:\Users\shweata\pygetpapers_dictionary_test\*\fulltext.xml
100%|██████████| 10/10 [00:36<00:00,  3.67s/it]
```
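If you want to check which terms a dictionary contains before (or after) running such a query, a few lines of XML parsing suffice. The sketch below assumes the common `ami-dictionary` layout of `<entry term="..."/>` elements; adjust it if your dictionary differs.

```
import xml.etree.ElementTree as ET

# assumed layout: <dictionary><entry term="..."/>...</dictionary>
tree = ET.parse(r"C:\Users\shweata\pygetpapers_terms\pygetpapers_terms.xml")
terms = [entry.get("term") for entry in tree.iter("entry")]
print(terms)  # e.g. ['antimicrobial', 'antiseptic', 'antibacterial']
```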
### Log levels
You can specify the log level using the `-l` (`--loglevel`) flag. The default, as you've already seen, is `info`.

INPUT:
```
pygetpapers -q "lantana" -k 10 -o lantana_test_10_2 --loglevel debug -x
```
### Log file
You can also write the log to a `.txt` file in your HOME directory while it is simultaneously printed to the console.

INPUT:
```
pygetpapers -q "lantana" -k 10 -o lantana_test_10_4 --loglevel debug -x --logfile test_log.txt
```
[Here](resources/test_log.txt) is the log file.

## Crossref
You can query the Crossref API for metadata only.

### Sample query
- The metadata format flags are applicable as described in the EPMC tutorial.
- `--terms` and `-q` are also applicable to Crossref.

INPUT:
```
pygetpapers --api crossref -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil_crossref_3" -x -c --makehtml
```
OUTPUT:
```
INFO: Final query is (essential oil AND (antioxidant OR  antibacterial OR  antifungal OR  antiseptic OR  antitrichomonal agent))
INFO: Making request to crossref
INFO: Got request result from crossref
INFO: Making csv files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████| 10/10 [00:00<00:00, 185.52it/s]
INFO: Making html files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████| 10/10 [00:00<00:00, 87.98it/s]
INFO: Making xml files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████| 10/10 [00:00<00:00, 366.97it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████| 10/10 [00:00<00:00, 996.82it/s]
```
We have 10 folders in the CProject directory.
```
C:\Users\shweata>cd terms_test_essential_oil_crossref_3

C:\Users\shweata\terms_test_essential_oil_crossref_3>tree
Folder PATH listing for volume Windows-SSD
Volume serial number is D88A-559A
C:.
├───10.1016_j.bcab.2021.101913
├───10.1055_s-0029-1234896
├───10.1080_0972060x.2016.1231597
├───10.1080_10412905.1989.9697767
├───10.1111_j.1745-4565.2012.00378.x
├───10.17795_bhs-24733
├───10.23880_oajmms-16000131
├───10.34302_crpjfst_2019.11.2.8
├───10.5220_0008855200960099
└───10.5220_0009957801190122
```
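Since Crossref gives metadata only, a natural next step is to load the metadata files from the CProject. The sketch below (not part of `pygetpapers`) globs for CSVs rather than hard-coding a filename, because the exact names vary by API; check your own output directory.

```
from pathlib import Path

import pandas as pd

cproject = Path("terms_test_essential_oil_crossref_3")

# metadata CSV filenames vary by API, so glob instead of hard-coding
for csv_file in cproject.glob("*.csv"):
    df = pd.read_csv(csv_file)
    print(csv_file.name, df.shape)
```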
### `--update`
`--update` works the same as in `EPMC`; you can use this flag to increase the number of papers in a given CProject.
INPUT:
```
pygetpapers --api crossref -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 5 -o "terms_test_essential_oil_crossref_3" -x -c --makehtml --update
```
OUTPUT:
```
INFO: Final query is (essential oil AND (antioxidant OR  antibacterial OR  antifungal OR  antiseptic OR  antitrichomonal agent))
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making request to crossref
INFO: Got request result from crossref
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████| 5/5 [00:00<00:00, 346.84it/s]
```
The CProject after updating:
```
C:.
├───10.1002_mbo3.459
├───10.1016_j.bcab.2021.101913
├───10.1055_s-0029-1234896
├───10.1080_0972060x.2014.895156
├───10.1080_0972060x.2016.1231597
├───10.1080_0972060x.2017.1345329
├───10.1080_10412905.1989.9697767
├───10.1080_10412905.2021.1941338
├───10.1111_j.1745-4565.2012.00378.x
├───10.15406_oajs.2019.03.00121
├───10.17795_bhs-24733
├───10.23880_oajmms-16000131
├───10.34302_crpjfst_2019.11.2.8
├───10.5220_0008855200960099
└───10.5220_0009957801190122
```
We started off with 10 paper folders and increased the number to 15.
### Filter

## arxiv
`pygetpapers` allows you to query `arxiv` for fulltext PDFs and metadata in all supported formats.
### Sample query
INPUT:
```
pygetpapers --api arxiv -k 10 -o arxiv_test_3 -q "artificial intelligence" -x -p --makehtml -c
```
OUTPUT:
```
INFO: Final query is artificial intelligence
INFO: Making request to Arxiv through pygetpapers
INFO: Got request result from Arxiv through pygetpapers
INFO: Requesting 10 results at offset 0
INFO: Requesting page of results
INFO: Got first page; 10 of 10 results available
INFO: Downloading Pdfs for papers
100%|██████████| 10/10 [01:02<00:00,  6.27s/it]
INFO: Making csv files for metadata at C:\Users\shweata\arxiv_test_3
100%|██████████| 10/10 [00:00<00:00, 187.31it/s]
INFO: Making html files for metadata at C:\Users\shweata\arxiv_test_3
100%|██████████| 10/10 [00:00<00:00, 161.87it/s]
INFO: Making xml files for metadata at C:\Users\shweata\arxiv_test_3
100%|██████████| 10/10 [00:00<?, ?it/s]
100%|██████████| 10/10 [00:00<00:00, 1111.22it/s]
```
Note: `--update` isn't supported for `arxiv`.
## Biorxiv and Medrxiv
You can query `biorxiv` and `medrxiv` for fulltext and metadata (in all supported formats). However, passing a query string with the `-q` flag isn't supported for either repository; you can only provide a date range, as in the sample query below.
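Programmatically, the sample query below might be expressed roughly as follows; the `api` string is assumed to match the value passed to `--api` (see the `run_command` signature near the end of this document).

```
from pygetpapers import Pygetpapers

# the api string is assumed to match the --api CLI value
Pygetpapers().run_command(
    api="biorxiv",
    startdate="2021-01-01",
    limit=10,
    xml=True,
    output="biorxiv_test_20210831",
)
```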
### Sample Query - biorxiv
INPUT:
```
pygetpapers --api biorxiv -k 10 -x --startdate 2021-01-01 -o biorxiv_test_20210831
```
OUTPUT:
```
WARNING: Currently biorxiv api is malfunctioning and returning wrong DOIs
INFO: Making Request to rxiv
INFO: Making xml for paper
100%|██████████| 10/10 [00:23<00:00,  2.34s/it]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biorxiv_test_20210831
100%|██████████| 10/10 [00:00<00:00, 684.72it/s]
```
### `--update` command
INPUT:
```
pygetpapers --api biorxiv -k 10 -x --startdate 2021-01-01 -o biorxiv_test_20210831 --update
```
OUTPUT:
```
WARNING: Currently biorxiv api is malfunctioning and returning wrong DOIs
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making Request to rxiv
INFO: Making xml for paper
100%|██████████| 10/10 [00:22<00:00,  2.23s/it]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biorxiv_test_20210831
100%|██████████| 10/10 [00:00<00:00, 492.39it/s]
```
The CProject now has 20 papers in total after updating.
```
├───10.1101_008326
├───10.1101_010553
├───10.1101_035972
├───10.1101_046052
├───10.1101_060012
├───10.1101_067736
├───10.1101_086710
├───10.1101_092205
├───10.1101_092619
├───10.1101_093237
├───10.1101_121061
├───10.1101_135749
├───10.1101_145664
├───10.1101_145896
├───10.1101_165845
├───10.1101_180273
├───10.1101_181198
├───10.1101_191858
├───10.1101_194266
└───10.1101_196105
```
`medrxiv` works the same way as `biorxiv`.
## rxivist
`rxivist` lets you send a query string to both `biorxiv` and `medrxiv`. The results will be a mixture of papers from both repositories, since `rxivist` doesn't differentiate between them.

Another caveat: you can only retrieve metadata from `rxivist`.

INPUT:
```
pygetpapers --api rxivist -q "biomedicine" -k 10 -c -x -o "biomedicine_rxivist" --makehtml -p
```
OUTPUT:
```
WARNING: Pdf is not supported for this api
INFO: Final query is biomedicine
INFO: Making Request to rxivist
INFO: Making csv files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|██████████| 10/10 [00:00<00:00, 125.54it/s]
INFO: Making html files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|██████████| 10/10 [00:00<00:00, 124.71it/s]
INFO: Making xml files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|██████████| 10/10 [00:00<00:00, 633.38it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biomedicine_rxivist
100%|██████████| 10/10 [00:00<00:00, 751.09it/s]
```
### Query hits only
As with the other repositories, you can use the `-n` flag to get only the hit count.
INPUT:
```
C:\Users\shweata>pygetpapers --api rxivist -q "biomedical sciences" -n
```
OUTPUT:
```
INFO: Final query is biomedical sciences
INFO: Making Request to rxivist
INFO: Total number of hits for the query are 62
```
### Update
`--update` works the same as for the other repositories. Make sure to provide `rxivist` as the API.
INPUT:
```
pygetpapers --api rxivist -q "biomedical sciences" -k 20 -c -x -o "biomedicine_rxivist" --update
```
OUTPUT:
```
INFO: Final query is biomedical sciences
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making Request to rxivist
INFO: Making csv files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|██████████| 10/10 [00:00<00:00, 203.69it/s]
INFO: Making xml files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|██████████| 10/10 [00:00<00:00, 1059.17it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biomedicine_rxivist
100%|██████████| 10/10 [00:00<00:00, 1077.12it/s]
```
## Run `pygetpapers` within the module

```
def run_command(output=False, query=False, save_query=False, xml=False, pdf=False, supp=False, zip=False, references=False, noexecute=False, citations=False, limit=100, restart=False, update=False, onlyquery=False, makecsv=False, makehtml=False, synonym=False, startdate=False, enddate=False, terms=False, notterms=False, api='europe_pmc', filter=None, loglevel='info', logfile=False, version=False)
```
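The parameters appear to mirror the CLI flags described in this tutorial: for instance, `limit` corresponds to `-k` (note the matching default of 100), `output` to `-o`, `xml` to `-x`, `makecsv` to `-c`, and `noexecute` to `-n`.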
Here's an example script to download 50 papers from EPMC on 'lantana camara'.

```
from pygetpapers import Pygetpapers

pygetpapers_call = Pygetpapers()
pygetpapers_call.run_command(query='lantana camara', limit=50, output='lantana_camara', xml=True)
```

## Test `pygetpapers`
To run automated testing on `pygetpapers`, do the following:
1) Install `pygetpapers`
2) Clone the `pygetpapers` repository
3) Install pytest
4) Run the command `pytest`

# Contributions

https://github.com/petermr/pygetpapers/blob/main/resources/CONTRIBUTING.md

# Feature Requests

To request features, please put them in issues.

# Legal Implications

If you use `pygetpapers`, you should be careful to understand the law as it applies to your content mining, as you assume full responsibility for your actions when using the software.

## Countries with copyright exceptions for content mining:

- UK
- Japan

## Countries with proposed copyright exceptions:

- Ireland
- EU countries

## Countries with permissive interpretations of 'fair use' that might allow content mining:

- Israel
- USA
- Canada

## General summaries and guides:

- _"The legal framework of text and data mining (TDM)"_, carried out for the European Commission in March 2014 ([PDF](http://ec.europa.eu/internal_market/copyright/docs/studies/1403_study2_en.pdf))
- _"Standardisation in the area of innovation and technological development, notably in the field of Text and Data Mining"_, carried out for the European Commission in 2014 ([PDF](http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_expert_group-042014.pdf))
    "bugtrack_url": null,
    "license": "Apache License",
    "summary": "Automated Download of Research Papers from various scientific repositories",
    "version": "1.2.5",
    "split_keywords": [
        "research",
        "automation"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "7d7624e395b1a25dbbe0304a199c6ba1",
                "sha256": "8da63ba8fd71b2f197bdfe3bcf589aa4159822da122d8c81988ee5480f25025c"
            },
            "downloads": -1,
            "filename": "pygetpapers-1.2.5-py3.8.egg",
            "has_sig": false,
            "md5_digest": "7d7624e395b1a25dbbe0304a199c6ba1",
            "packagetype": "bdist_egg",
            "python_version": "1.2.5",
            "requires_python": ">=3.5",
            "size": 127258,
            "upload_time": "2022-12-21T12:33:40",
            "upload_time_iso_8601": "2022-12-21T12:33:40.444913Z",
            "url": "https://files.pythonhosted.org/packages/69/49/6d88fc507e6854677b52299f90dd4e3c947d62f3f1c26a1bf2323b1c19a1/pygetpapers-1.2.5-py3.8.egg",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "7bd2287260e254f7476400d4081a8fe1",
                "sha256": "4d680d3852d8e324b7f8cb2e1e74be932fc85ad409ae69f244f2cae49834260c"
            },
            "downloads": -1,
            "filename": "pygetpapers-1.2.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7bd2287260e254f7476400d4081a8fe1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.5",
            "size": 111427,
            "upload_time": "2022-12-21T12:33:38",
            "upload_time_iso_8601": "2022-12-21T12:33:38.326109Z",
            "url": "https://files.pythonhosted.org/packages/f5/dc/c29d1cfca7cf6517c6fa25f5b2449a1200f6f11fecf68e69cdcbb692d64f/pygetpapers-1.2.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-21 12:33:40",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "petermr",
    "github_project": "pygetpapers",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "coloredlogs",
            "specs": [
                [
                    ">=",
                    "15.0.1"
                ]
            ]
        },
        {
            "name": "ConfigArgParse",
            "specs": [
                [
                    ">=",
                    "1.5.3"
                ]
            ]
        },
        {
            "name": "dict2xml",
            "specs": [
                [
                    ">=",
                    "1.7.0"
                ]
            ]
        },
        {
            "name": "habanero",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "lxml",
            "specs": [
                [
                    ">=",
                    "4.7.1"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.3.5"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "6.2.5"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.27.1"
                ]
            ]
        },
        {
            "name": "setuptools",
            "specs": [
                [
                    ">=",
                    "60.3.1"
                ]
            ]
        },
        {
            "name": "sphinx_rtd_theme",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.62.3"
                ]
            ]
        },
        {
            "name": "xmltodict",
            "specs": [
                [
                    ">=",
                    "0.12.0"
                ]
            ]
        }
    ],
    "lcname": "pygetpapers"
}
        
Elapsed time: 0.02413s