paperai

Name	paperai
Version	2.3.0
home_page	https://github.com/neuml/paperai
Summary	Semantic search and workflows for medical/scientific papers
upload_time	2024-12-28 20:34:16
maintainer	None
docs_url	None
author	NeuML
requires_python	>=3.9
license	Apache 2.0: http://www.apache.org/licenses/LICENSE-2.0
keywords	search embedding machine-learning nlp covid-19 medical scientific papers

<p align="center">
    <img src="https://raw.githubusercontent.com/neuml/paperai/master/logo.png"/>
</p>

<p align="center">
    <b>Semantic search and workflows for medical/scientific papers</b>
</p>

<p align="center">
    <a href="https://github.com/neuml/paperai/releases">
        <img src="https://img.shields.io/github/release/neuml/paperai.svg?style=flat&color=success" alt="Version"/>
    </a>
    <a href="https://github.com/neuml/paperai/releases">
        <img src="https://img.shields.io/github/release-date/neuml/paperai.svg?style=flat&color=blue" alt="GitHub Release Date"/>
    </a>
    <a href="https://github.com/neuml/paperai/issues">
        <img src="https://img.shields.io/github/issues/neuml/paperai.svg?style=flat&color=success" alt="GitHub issues"/>
    </a>
    <a href="https://github.com/neuml/paperai">
        <img src="https://img.shields.io/github/last-commit/neuml/paperai.svg?style=flat&color=blue" alt="GitHub last commit"/>
    </a>
    <a href="https://github.com/neuml/paperai/actions?query=workflow%3Abuild">
        <img src="https://github.com/neuml/paperai/workflows/build/badge.svg" alt="Build Status"/>
    </a>
    <a href="https://coveralls.io/github/neuml/paperai?branch=master">
        <img src="https://img.shields.io/coverallsCoverage/github/neuml/paperai" alt="Coverage Status">
    </a>
</p>

-------------------------------------------------------------------------------------------------------------------------------------------------------

paperai is a semantic search and workflow application for medical/scientific papers.

![demo](https://raw.githubusercontent.com/neuml/paperai/master/demo.png)

Applications range from semantic search indexes that find matches for medical/scientific queries to full-fledged reporting applications powered by machine learning.

![architecture](https://raw.githubusercontent.com/neuml/paperai/master/images/architecture.png#gh-light-mode-only)

paperai and/or NeuML have been recognized in the following articles:

- [Machine-Learning Experts Delve Into 47,000 Papers on Coronavirus Family](https://www.wsj.com/articles/machine-learning-experts-delve-into-47-000-papers-on-coronavirus-family-11586338201)
- [Data scientists assist medical researchers in the fight against COVID-19](https://cloud.google.com/blog/products/ai-machine-learning/how-kaggle-data-scientists-help-with-coronavirus)
- [CORD-19 Kaggle Challenge Awards](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/161447)

## Installation

The easiest way to install is via pip and PyPI.

```
pip install paperai
```

Python 3.9+ is supported. Using a Python [virtual environment](https://docs.python.org/3/library/venv.html) is recommended.

paperai can also be installed directly from GitHub to access the latest, unreleased features.

```
pip install git+https://github.com/neuml/paperai
```

See [this link](https://neuml.github.io/txtai/install/#environment-specific-prerequisites) to help resolve environment-specific install issues.

### Docker

Run the steps below to build a Docker image with paperai and all dependencies.

```
wget https://raw.githubusercontent.com/neuml/paperai/master/docker/Dockerfile
docker build -t paperai .
docker run --name paperai --rm -it paperai
```

paperetl can be added to produce a single image that both indexes and queries content. Follow the instructions to build a [paperetl docker image](https://github.com/neuml/paperetl#docker) and then run the following.

```
docker build -t paperai --build-arg BASE_IMAGE=paperetl --build-arg START=/scripts/start.sh .
docker run --name paperai --rm -it paperai
```

## Examples

The following notebooks and applications demonstrate the capabilities provided by paperai.

### Notebooks

| Notebook  | Description  |       |
|:----------|:-------------|------:|
| [Introducing paperai](https://github.com/neuml/paperai/blob/master/examples/01_Introducing_paperai.ipynb) | Overview of the functionality provided by paperai | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/paperai/blob/master/examples/01_Introducing_paperai.ipynb) |

### Applications

| Application  | Description  |
|:----------|:-------------|
| [Search](https://github.com/neuml/paperai/blob/master/examples/search.py) | Search a paperai index. Set query parameters, execute searches and display results. |

## Building a model

paperai indexes databases previously built with [paperetl](https://github.com/neuml/paperetl). The following shows how to create a new paperai index.

1. (Optional) Create an index.yml file

    paperai uses the default txtai embeddings configuration when none is specified. Alternatively, an index.yml file can be supplied; it takes the same options as a txtai embeddings instance. See the [txtai documentation](https://neuml.github.io/txtai/embeddings/configuration) for more on the possible options. A simple example is shown below.

    ```
    path: sentence-transformers/all-MiniLM-L6-v2
    content: True
    ```

2. Build embeddings index

    ```
    python -m paperai.index <path to input data> <optional index configuration>
    ```

The paperai.index process requires an input data path and optionally takes index configuration. This configuration can either be a vector model path or an index.yml configuration file.
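The two steps above can also be scripted. The following is a minimal sketch, not part of paperai itself: it writes the sample index.yml shown earlier and assembles the documented `python -m paperai.index` command line. The helper names `write_index_config` and `build_index_command` are hypothetical, and the sketch assumes paperai is installed for the interpreter that runs it.

```python
import sys
from pathlib import Path

# Same content as the sample index.yml above
INDEX_YML = "path: sentence-transformers/all-MiniLM-L6-v2\ncontent: True\n"


def write_index_config(directory):
    """Writes the sample index.yml into directory and returns its path (hypothetical helper)."""
    config = Path(directory) / "index.yml"
    config.write_text(INDEX_YML, encoding="utf-8")
    return config


def build_index_command(input_path, config=None):
    """Builds the argument list for: python -m paperai.index <input> <optional config>."""
    command = [sys.executable, "-m", "paperai.index", str(input_path)]
    if config is not None:
        # Either a vector model path or an index.yml configuration file
        command.append(str(config))
    return command
```

Passing the returned list to `subprocess.run(command, check=True)` would start the build. Returning a list rather than a shell string avoids quoting problems with paths that contain spaces.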

## Running queries

The fastest way to run queries is to start a paperai shell.

```
paperai <path to model directory>
```

A prompt will come up. Queries can be typed directly into the console.

## Building a report file

Reports can be generated in multiple formats. An example report call:

```
python -m paperai.report report.yml 50 md <path to model directory>
```

The following report formats are supported:

- Markdown (Default) - Renders a Markdown report. Columns and answers are extracted from articles with the results stored in a Markdown file.
- CSV - Renders a CSV report. Columns and answers are extracted from articles with the results stored in a CSV file.
- Annotation - Columns and answers are extracted from articles with the results annotated over the original PDF files. Requires passing in a path with the original PDF files.

In the example above, a file named report.md will be created. Example report configuration files can be found [here](https://github.com/neuml/cord19q/tree/master/tasks).
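For batch runs, the report call can be assembled from Python. This is a hedged sketch rather than a paperai API: `build_report_command` is a hypothetical helper that mirrors the argument order of the example call. The name `topn` for the numeric argument (`50` in the example) is an assumption, and `md` is the only format identifier confirmed by the example, so the format string is passed through unchecked.

```python
import sys


def build_report_command(task, topn, render, model_path):
    """Builds the argument list for: python -m paperai.report <task> <topn> <format> <model>."""
    if int(topn) <= 0:
        raise ValueError("topn must be a positive integer")

    # Argument order follows the example call in this section
    return [sys.executable, "-m", "paperai.report", str(task), str(topn), render, str(model_path)]


# Mirrors the example call; "models" is a placeholder for the model directory
command = build_report_command("report.yml", 50, "md", "models")
print(" ".join(command[1:]))
```

The list can be handed to `subprocess.run(command, check=True)` once paperai is installed and a model directory exists.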

## Tech Overview

paperai is a combination of a [txtai](https://github.com/neuml/txtai) embeddings index and a SQLite database with the articles. Each article is parsed into sentences and stored in SQLite along with the article metadata. Embeddings are built over the full corpus.
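The storage layout described above can be pictured with a small, self-contained SQLite sketch. The table and column names here are hypothetical and simplified; they are not the actual paperetl/paperai schema. The embeddings index is left out, and a plain sentence id stands in for a search result.

```python
import sqlite3

# Hypothetical, simplified schema: one row per article, one row per sentence
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT, published TEXT);
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, article TEXT REFERENCES articles(id), text TEXT);
""")

con.execute("INSERT INTO articles VALUES (?, ?, ?)", ("a1", "Example paper", "2020-04-01"))
con.executemany(
    "INSERT INTO sentences (article, text) VALUES (?, ?)",
    [("a1", "First sentence of the paper."), ("a1", "Second sentence of the paper.")],
)


def lookup(con, sentence_id):
    """Resolves a sentence id to its text plus the parent article's metadata."""
    return con.execute(
        "SELECT s.text, a.title, a.published FROM sentences s "
        "JOIN articles a ON a.id = s.article WHERE s.id = ?",
        (sentence_id,),
    ).fetchone()


print(lookup(con, 2))
# ('Second sentence of the paper.', 'Example paper', '2020-04-01')
```

Keeping sentences and article metadata in separate tables means a sentence-level match can be traced back to its article with a single join.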

Multiple entry points exist to interact with the model.

- paperai.report - Builds a report for a series of queries. For each query, the top scoring articles are shown along with matches from those articles. There is also a highlights section showing the most relevant results.
- paperai.query - Runs a single query from the terminal.
- paperai.shell - Allows running multiple queries from the terminal.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/neuml/paperai",
    "name": "paperai",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "search embedding machine-learning nlp covid-19 medical scientific papers",
    "author": "NeuML",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/38/8b/2ae424c07e21b1a84b2f06486f9ff396e4dda9a4f622078ec45a462c8b05/paperai-2.3.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache 2.0: http://www.apache.org/licenses/LICENSE-2.0",
    "summary": "Semantic search and workflows for medical/scientific papers",
    "version": "2.3.0",
    "project_urls": {
        "Documentation": "https://github.com/neuml/paperai",
        "Homepage": "https://github.com/neuml/paperai",
        "Issue Tracker": "https://github.com/neuml/paperai/issues",
        "Source Code": "https://github.com/neuml/paperai"
    },
    "split_keywords": [
        "search",
        "embedding",
        "machine-learning",
        "nlp",
        "covid-19",
        "medical",
        "scientific",
        "papers"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "023b014524951cef55128c2ebfe0ecbd0c133b94d5b83ec27f94c611365f0a50",
                "md5": "6bbc715fee81b28532a33c00c1891a3d",
                "sha256": "9875659e86cb047b84468bf1e87a8138746b5bef93938495f418649cd52712db"
            },
            "downloads": -1,
            "filename": "paperai-2.3.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6bbc715fee81b28532a33c00c1891a3d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 30626,
            "upload_time": "2024-12-28T20:34:14",
            "upload_time_iso_8601": "2024-12-28T20:34:14.140603Z",
            "url": "https://files.pythonhosted.org/packages/02/3b/014524951cef55128c2ebfe0ecbd0c133b94d5b83ec27f94c611365f0a50/paperai-2.3.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "388b2ae424c07e21b1a84b2f06486f9ff396e4dda9a4f622078ec45a462c8b05",
                "md5": "c79e7435509ddc91de0e75db22c5cf67",
                "sha256": "e68d1f861037ca32a035d4ebc6091468fa446e32ed0b55dcded31df166896947"
            },
            "downloads": -1,
            "filename": "paperai-2.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "c79e7435509ddc91de0e75db22c5cf67",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 27580,
            "upload_time": "2024-12-28T20:34:16",
            "upload_time_iso_8601": "2024-12-28T20:34:16.691553Z",
            "url": "https://files.pythonhosted.org/packages/38/8b/2ae424c07e21b1a84b2f06486f9ff396e4dda9a4f622078ec45a462c8b05/paperai-2.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-28 20:34:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "neuml",
    "github_project": "paperai",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "lcname": "paperai"
}
