curategpt

- Name: curategpt
- Version: 0.2.2
- Summary: CurateGPT
- Author: Chris Mungall
- License: BSD-3
- Requires Python: >=3.9
- Upload time: 2024-11-15 17:42:05
- Home page, docs URL, maintainer, keywords: none recorded
- Requirements: none recorded
- CI / coverage: none recorded
# CurateGPT

[![DOI](https://zenodo.org/badge/645996391.svg)](https://zenodo.org/doi/10.5281/zenodo.8293691)


CurateGPT is a prototype web application and framework for performing general-purpose AI-guided curation
and curation-related operations over *collections* of objects.


See also the hosted app at [curategpt.io](https://curategpt.io) (note: it is sometimes down, and may offer only a
subset of the local app's functionality).


## Getting started

### User installation

CurateGPT is available on PyPI and may be installed with `pip`:

`pip install curategpt`

### Developer installation

You will first need to [install Poetry](https://python-poetry.org/docs/#installation).

Then clone this repo:

```
git clone https://github.com/monarch-initiative/curategpt.git
cd curategpt
```

and install the dependencies:


```
poetry install
```

### API keys

To get the best performance from CurateGPT, we recommend obtaining an OpenAI API key and setting it:

```
export OPENAI_API_KEY=<your key>
```

(for members of Monarch: ask on Slack if you would like to use the group key)

CurateGPT will also work with other large language models - see "Selecting models" below.

## Loading example data and running the app

You start with an empty database, and you can load whatever you like into it:
any JSON, YAML, or CSV is accepted.
CurateGPT comes with *wrappers* for some existing local and remote sources, including
ontologies. The [Makefile](Makefile) contains some examples of how to load these. You can
load any ontology using the `ont-<name>` target, e.g.:

```
make ont-cl
```

This loads CL (via OAK) into a collection called `ont_cl`.

Note that by default this loads into a collection set stored at `stagedb`, whereas the app works off
of `db`. You can copy the collection set to the db with:

```
cp -r stagedb/* db/
```


You can then run the streamlit app with:

```
make app
```

## Building Indexes

CurateGPT depends on vector database indexes of the databases/ontologies you want to curate.

The flagship application is ontology curation, so to build an index for an OBO ontology like CL:

```
make ont-cl
```

This requires an OpenAI key.

(You can build indexes with an open embedding model by modifying the command to leave off
the `-m` option, but this is not recommended, as OpenAI embeddings currently seem to work best.)


To load the default ontologies:

```
make all
```

(this may take some time)

To load different databases:

```
make load-db-hpoa
make load-db-reactome
```



You can load an arbitrary JSON, YAML, or CSV file:

```
curategpt view index -c my_foo foo.json
```

(You will need to run this inside the Poetry shell.)
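For illustration, a file like the hypothetical `foo.json` above is just a collection of objects. A minimal sketch of producing one (the field names here are arbitrary assumptions, not a required schema):

```python
import json

# Hypothetical records to index; CurateGPT accepts arbitrary objects,
# so the field names below are illustrative, not a required schema.
records = [
    {"id": "FOO:1", "label": "example term", "definition": "An example."},
    {"id": "FOO:2", "label": "another term", "definition": "Another example."},
]

with open("foo.json", "w") as f:
    json.dump(records, f, indent=2)
```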

To load a GitHub repo of issues:

```
curategpt -v view index -c gh_uberon -m openai:  --view github --init-with "{repo: obophenotype/uberon}"
```

The following are also supported:

- Google Drives
- Google Sheets
- Markdown files
- LinkML Schemas
- HPOA files
- GOCAMs
- MAXOA files
- Many more

## Notebooks

- See [notebooks](notebooks) for examples.

## Selecting models

Currently this tool works best with the OpenAI `gpt-4` model (for instruction tasks) and OpenAI `text-embedding-ada-002` for embeddings.

CurateGPT is layered on top of [simonw/llm](https://github.com/simonw/llm) which has a plugin
architecture for using alternative models. In theory you can use any of these plugins.

Additionally, you can set up an openai-emulating proxy using [litellm](https://github.com/BerriAI/litellm/).

The `litellm` proxy may be installed with `pip` as `pip install litellm[proxy]` (quote the brackets, e.g. `pip install 'litellm[proxy]'`, if your shell treats them as globs).

Let's say you want to run mixtral locally using ollama. You start up ollama (you may have to run `ollama serve` first):

```
ollama run mixtral
```

Then start up litellm:

```
litellm -m ollama/mixtral
```

Next edit your `extra-openai-models.yaml` as detailed in [the llm docs](https://llm.datasette.io/en/stable/other-models.html):

```
- model_name: ollama/mixtral
  model_id: litellm-mixtral
  api_base: "http://0.0.0.0:8000"
```

You can now use this:

```bash
curategpt ask -m litellm-mixtral -c ont_cl "What neurotransmitter is released by the hippocampus?"
```

But be warned that many of the prompts in CurateGPT were engineered
against OpenAI models, and they may give suboptimal results, or fail
entirely, on other models. For example, `ask` seems to work quite
well with Mixtral, but `complete` works poorly. We haven't yet
investigated whether the issue is the model, our prompts, or the overall
approach.

Welcome to the world of AI engineering!

## Using the command line

```bash
curategpt --help
```

You will see various commands for working with indexes, searching, extracting, generating, etc.

These functions are generally also available through the UI; documenting them is a current priority.

### Chatting with a knowledge base

```
curategpt ask -c ont_cl "What neurotransmitter is released by the hippocampus?"
```

may yield something like:

```
The hippocampus releases gamma-aminobutyric acid (GABA) as a neurotransmitter [1](#ref-1).

...

## 1

id: GammaAminobutyricAcidSecretion_neurotransmission
label: gamma-aminobutyric acid secretion, neurotransmission
definition: The regulated release of gamma-aminobutyric acid by a cell, in which the
  gamma-aminobutyric acid acts as a neurotransmitter.
...
```

### Chatting with pubmed

```
curategpt view ask -V pubmed "what neurons express VIP?"
```

### Chatting with a GitHub issue tracker

```
curategpt ask -c gh_obi "what are some new term requests for electrophysiology terms?"
```

### Term Autocompletion (DRAGON-AI)

```
curategpt complete -c ont_cl  "mesenchymal stem cell of the apical papilla"
```

yields

```yaml
id: MesenchymalStemCellOfTheApicalPapilla
definition: A mesenchymal cell that is part of the apical papilla of a tooth and has
  the ability to self-renew and differentiate into various cell types such as odontoblasts,
  fibroblasts, and osteoblasts.
relationships:
- predicate: PartOf
  target: ApicalPapilla
- predicate: subClassOf
  target: MesenchymalCell
- predicate: subClassOf
  target: StemCell
original_id: CL:0007045
label: mesenchymal stem cell of the apical papilla
```
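Because the output is YAML, it can be post-processed programmatically. A sketch using PyYAML, assuming the structure shown above (the definition is abbreviated here):

```python
import yaml

# An autocompletion result shaped like the `curategpt complete` output above.
result_yaml = """
id: MesenchymalStemCellOfTheApicalPapilla
definition: A mesenchymal cell that is part of the apical papilla of a tooth.
relationships:
- predicate: PartOf
  target: ApicalPapilla
- predicate: subClassOf
  target: MesenchymalCell
- predicate: subClassOf
  target: StemCell
original_id: CL:0007045
label: mesenchymal stem cell of the apical papilla
"""

obj = yaml.safe_load(result_yaml)
# Collect the proposed superclasses from the relationships list.
parents = [r["target"] for r in obj["relationships"] if r["predicate"] == "subClassOf"]
print(parents)  # ['MesenchymalCell', 'StemCell']
```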

### All-by-all comparisons

You can compare all objects in one collection against those in another:

`curategpt all-by-all --threshold 0.80 -c ont_hp -X ont_mp --ids-only -t csv > ~/tmp/allxall.mp.hp.csv`

This takes 1-2s, as it involves comparison over pre-computed vectors. It reports top hits above a threshold.

Results may vary. You may want to try different texts for embeddings
(the default is the entire JSON object; for ontologies it is a
concatenation of labels, definitions, and aliases).
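The comparison itself is cosine similarity over the pre-computed embedding vectors. A minimal sketch of the metric (the toy vectors below are made up; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for two similar terms.
v1 = [0.1, 0.9, 0.2]
v2 = [0.2, 0.8, 0.1]
print(round(cosine_similarity(v1, v2), 3))
```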

Sample output:

```
HP:5200068,Socially innappropriate questioning,MP:0001361,social withdrawal,0.844015132437909
HP:5200069,Spinning,MP:0001411,spinning,0.9077306606290237
HP:5200071,Delayed Echolalia,MP:0013140,excessive vocalization,0.8153252835818089
HP:5200072,Immediate Echolalia,MP:0001410,head bobbing,0.8348177036912526
HP:5200073,Excessive cleaning,MP:0001412,excessive scratching,0.8699103725005582
HP:5200104,Abnormal play,MP:0020437,abnormal social play behavior,0.8984862078522344
HP:5200105,Reduced imaginative play skills,MP:0001402,decreased locomotor activity,0.85571629684631
HP:5200108,Nonfunctional or atypical use of objects in play,MP:0003908,decreased stereotypic behavior,0.8586700411012859
HP:5200129,Abnormal rituals,MP:0010698,abnormal impulsive behavior control,0.8727804272023427
HP:5200134,Jumping,MP:0001401,jumpy,0.9011393233129765
```
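The CSV output can then be filtered downstream with the stdlib `csv` module. A sketch over rows shaped like the sample above (the file has no header row; the column order is taken from the sample):

```python
import csv
import io

# Two rows shaped like the all-by-all sample output above (no header row):
# subject ID, subject label, object ID, object label, similarity score.
sample = """\
HP:5200069,Spinning,MP:0001411,spinning,0.9077306606290237
HP:5200072,Immediate Echolalia,MP:0001410,head bobbing,0.8348177036912526
"""

# Keep only hits above a stricter threshold than the one used at generation time.
strong_hits = [
    row for row in csv.reader(io.StringIO(sample))
    if float(row[4]) >= 0.9
]
print(strong_hits)  # one row: the HP:5200069 / MP:0001411 pair
```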

Note that CurateGPT has a separate component that uses an LLM to evaluate candidate matches (see also https://arxiv.org/abs/2310.03666). This is
not enabled by default, as it would be expensive to run over a whole ontology.


            
