| Name | cpr-sdk |
| Version | 0.5.6 |
| home_page | None |
| Summary | None |
| upload_time | 2024-04-03 09:58:26 |
| maintainer | None |
| docs_url | None |
| author | CPR Tech |
| requires_python | <4.0,>=3.9 |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# cpr-sdk
Internal library for persistent access to text data.
> **Warning**
> This library is heavily under construction and doesn't work with any of our open data yet. We're working on making it usable for anyone.
## Documents and Datasets
The base document model of this library is `BaseDocument`, which contains only the metadata fields that are used in the parser.
### Loading from Huggingface Hub (recommended)
The `Dataset` class is automatically configured with the Huggingface repos we use. You can optionally provide a document limit and a dataset version, and you can override the repo that the data is loaded from.
If the repository is private you must provide a [user access token](https://huggingface.co/docs/hub/security-tokens), either in your environment as `HUGGINGFACE_TOKEN`, or as an argument to `from_huggingface`.
```py
from cpr_sdk.models import Dataset, GSTDocument
dataset = Dataset(GSTDocument).from_huggingface(
version="d8363af072d7e0f87ec281dd5084fb3d3f4583a9", # commit hash, optional
limit=1000,
token="my-huggingface-token", # required for private repos if not in env
)
```
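Alternatively, the token can be supplied via the environment and omitted from the call. A minimal sketch of that approach, using the `HUGGINGFACE_TOKEN` variable mentioned above (in practice you would normally export it in your shell rather than set it in code):
```py
import os

from cpr_sdk.models import Dataset, GSTDocument

# Setting the variable in code is only for illustration; normally it would
# already be exported in the environment.
os.environ["HUGGINGFACE_TOKEN"] = "my-huggingface-token"

# No `token` argument needed now; it is picked up from the environment.
dataset = Dataset(GSTDocument).from_huggingface(limit=1000)
```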
### Loading from local storage or s3
```py
from cpr_sdk.models import BaseDocument

# document_id is also the filename stem
document = BaseDocument.load_from_local(folder_path="path/to/data/", document_id="document_1234")
document = BaseDocument.load_from_remote(dataset_key="s3://cpr-data", document_id="document_1234")
```
To manage metadata, documents need to be loaded into a `Dataset` object.
```py
from cpr_sdk.models import Dataset, CPRDocument, GSTDocument
dataset = Dataset().load_from_local("path/to/data", limit=1000)
assert all([isinstance(document, BaseDocument) for document in dataset])
dataset_with_metadata = dataset.add_metadata(
target_model=CPRDocument,
metadata_csv="path/to/metadata.csv",
)
assert all([isinstance(document, CPRDocument) for document in dataset_with_metadata])
```
Datasets have a number of methods for filtering and accessing documents.
```py
len(dataset)
>>> 1000
dataset[0]
>>> CPRDocument(...)
# Filtering
dataset.filter("document_id", "1234")
>>> Dataset()
dataset.filter_by_language("en")
>>> Dataset()
# Filtering using a function
dataset.filter("document_id", lambda x: x in ["1234", "5678"])
>>> Dataset()
```
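Since the examples above show each filter returning a `Dataset`, the calls can plausibly be chained. A short sketch under that assumption:
```py
# Assumes filter methods return a new Dataset rather than mutating the
# original, as the `>>> Dataset()` outputs above suggest.
filtered = dataset.filter_by_language("en").filter(
    "document_id", lambda x: x in ["1234", "5678"]
)
len(filtered)
```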
## Search
This library can also be used to run searches against CPR documents and passages in Vespa.
```python
from cpr_sdk.search_adaptors import VespaSearchAdapter
from cpr_sdk.models.search import SearchParameters
adaptor = VespaSearchAdapter(instance_url="YOUR_INSTANCE_URL")
request = SearchParameters(query_string="forest fires")
response = adaptor.search(request)
```
The above example returns a `SearchResponse` object, which contains some basic information about the request, along with the results arranged as a list of Families, each of which contains the relevant Documents and/or Passages.
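For orientation, a small sketch of walking that structure; only the `families` attribute (used again in the continuation examples below) is taken from this README, so treat the rest as illustrative:
```python
response = adaptor.search(SearchParameters(query_string="forest fires"))

# Each family groups the documents and/or passages that matched the query.
for family in response.families:
    print(family)
```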
### Sorting
By default, results are sorted by relevance, but they can also be sorted by date or name, e.g.:
```python
request = SearchParameters(
query_string="forest fires",
sort_by="date",
sort_order="descending",
)
```
### Filters
Matching documents can also be filtered by keyword field and by publication date:
```python
request = SearchParameters(
query_string="forest fires",
filters={
"language": ["English", "French"],
"category": ["Executive"],
},
year_range=(2010, 2020)
)
```
### Search within families or documents
A subset of families or documents can be retrieved for search using their IDs:
```python
request = SearchParameters(
query_string="forest fires",
family_ids=["CCLW.family.10121.0", "CCLW.family.4980.0"],
)
```
```python
request = SearchParameters(
query_string="forest fires",
document_ids=["CCLW.executive.10121.4637", "CCLW.legislative.4980.1745"],
)
```
### Types of query
The default search approach uses a nearest neighbour search ranking.
It's also possible to search for exact matches instead:
```python
request = SearchParameters(
query_string="forest fires",
exact_match=True,
)
```
Or to ignore the query string and search the whole database instead:
```python
request = SearchParameters(
year_range=(2020, 2024),
sort_by="date",
sort_order="descending",
)
```
### Continuing results
The response objects include continuation tokens, which can be used to get more results.
For the next selection of families:
```python
response = adaptor.search(SearchParameters(query_string="forest fires"))
follow_up_request = SearchParameters(
    query_string="forest fires",
    continuation_tokens=[response.continuation_token],
)
follow_up_response = adaptor.search(follow_up_request)
```
It is also possible to get more hits within a family by using the continuation token on the family object, rather than the one at the response's root.
Note that `this_continuation_token` is used to mark the current continuation of the families, so getting more passages for a family after getting more families would look like this:
```python
follow_up_response = adaptor.search(follow_up_request)
this_token = follow_up_response.this_continuation_token
passage_token = follow_up_response.families[0].continuation_token
follow_up_request = SearchParameters(
    query_string="forest fires",
    continuation_tokens=[this_token, passage_token],
)
```
## Get a specific document
Users can also fetch single documents directly from Vespa by document ID:
```python
adaptor.get_by_id(document_id="id:YOUR_NAMESPACE:YOUR_SCHEMA_NAME::SOME_DOCUMENT_ID")
```
All of the above search functionality assumes that a valid set of Vespa credentials is available in `~/.vespa`, or in a directory supplied directly to the `VespaSearchAdapter` constructor. See [the docs](docs/vespa-auth.md) for more information on how Vespa expects credentials.
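A sketch of the second option, passing a credentials directory to the constructor; the keyword name `cert_directory` is an assumption here and may differ from the actual signature:
```python
from cpr_sdk.search_adaptors import VespaSearchAdapter

# `cert_directory` is a hypothetical parameter name; the README only states
# that a directory can be supplied to the constructor.
adaptor = VespaSearchAdapter(
    instance_url="YOUR_INSTANCE_URL",
    cert_directory="/path/to/vespa/credentials",
)
```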
# Test setup
Some tests rely on a locally running instance of Vespa.
This requires the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html) to be installed.
Setup can then be run with:
```
poetry install --all-extras --with dev
poetry shell
make vespa_dev_setup
make test
```
Alternatively, to only run non-vespa tests:
```
make test_not_vespa
```
To clean up:
```
make vespa_dev_down
```
Raw data
{
"_id": null,
"home_page": null,
"name": "cpr-sdk",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": null,
"keywords": null,
"author": "CPR Tech",
"author_email": "tech@climatepolicyradar.org",
"download_url": "https://files.pythonhosted.org/packages/7f/24/0b070787d783bf11ff0f728c7c196582398053317306efcb11762b6aa1fa/cpr_sdk-0.5.6.tar.gz",
"platform": null,
"description": "# cpr-sdk\n\nInternal library for persistent access to text data.\n\n> **Warning**\n> This library is heavily under construction and doesn't work with any of our open data yet. We're working on making it usable for anyone.\n\n## Documents and Datasets\n\nThe base document model of this library is `BaseDocument`, which contains only the metadata fields that are used in the parser.\n\n### Loading from Huggingface Hub (recommended)\n\nThe `Dataset` class is automatically configured with the Huggingface repos we use. You can optionally provide a document limit, a dataset version, and override the repo that the data is loaded from.\n\nIf the repository is private you must provide a [user access token](https://huggingface.co/docs/hub/security-tokens), either in your environment as `HUGGINGFACE_TOKEN`, or as an argument to `from_huggingface`.\n\n```py\nfrom cpr_sdk.models import Dataset, GSTDocument\n\ndataset = Dataset(GSTDocument).from_huggingface(\n version=\"d8363af072d7e0f87ec281dd5084fb3d3f4583a9\", # commit hash, optional\n limit=1000,\n token=\"my-huggingface-token\", # required for private repos if not in env\n)\n```\n\n### Loading from local storage or s3\n\n```py\n# document_id is also the filename stem\n\ndocument = BaseDocument.load_from_local(folder_path=\"path/to/data/\", document_id=\"document_1234\")\n\ndocument = BaseDocument.load_from_remote(dataset_key\"s3://cpr-data\", document_id=\"document_1234\")\n```\n\nTo manage metadata, documents need to be loaded into a `Dataset` object.\n\n```py\nfrom cpr_sdk.models import Dataset, CPRDocument, GSTDocument\n\ndataset = Dataset().load_from_local(\"path/to/data\", limit=1000)\nassert all([isinstance(document, BaseDocument) for document in dataset])\n\ndataset_with_metadata = dataset.add_metadata(\n target_model=CPRDocument,\n metadata_csv=\"path/to/metadata.csv\",\n)\n\nassert all([isinstance(document, CPRDocument) for document in dataset_with_metadata])\n```\n\nDatasets have a number of methods for filtering and accessing documents.\n\n```py\nlen(dataset)\n>>> 1000\n\ndataset[0]\n>>> CPRDocument(...)\n\n# Filtering\ndataset.filter(\"document_id\", \"1234\")\n>>> Dataset()\n\ndataset.filter_by_language(\"en\")\n>>> Dataset()\n\n# Filtering using a function\ndataset.filter(\"document_id\", lambda x: x in [\"1234\", \"5678\"])\n>>> Dataset()\n```\n\n## Search\n\nThis library can also be used to run searches against CPR documents and passages in Vespa.\n\n```python\nfrom src.cpr_sdk.search_adaptors import VespaSearchAdapter\nfrom src.cpr_sdk.models.search import SearchParameters\n\nadaptor = VespaSearchAdapter(instance_url=\"YOUR_INSTANCE_URL\")\n\nrequest = SearchParameters(query_string=\"forest fires\")\n\nresponse = adaptor.search(request)\n```\n\nThe above example will return a `SearchResponse` object, which lists some basic information about the request, and the results, arranged as a list of Families, which each contain relevant Documents and/or Passages.\n\n### Sorting\n\nBy default, results are sorted by relevance, but can be sorted by date, or name, eg\n\n```python\nrequest = SearchParameters(\n query_string=\"forest fires\",\n sort_by=\"date\",\n sort_order=\"descending\",\n)\n```\n\n### Filters\n\nMatching documents can also be filtered by keyword field, and by publication date\n\n```python\nrequest = SearchParameters(\n query_string=\"forest fires\",\n filters={\n \"language\": [\"English\", \"French\"],\n \"category\": [\"Executive\"],\n },\n year_range=(2010, 2020)\n)\n```\n\n### Search within families or 
documents\n\nA subset of families or documents can be retrieved for search using their ids\n\n```python\nrequest = SearchParameters(\n query_string=\"forest fires\",\n family_ids=[\"CCLW.family.10121.0\", \"CCLW.family.4980.0\"],\n)\n```\n\n```python\nrequest = SearchParameters(\n query_string=\"forest fires\",\n document_ids=[\"CCLW.executive.10121.4637\", \"CCLW.legislative.4980.1745\"],\n)\n```\n\n### Types of query\n\nThe default search approach uses a nearest neighbour search ranking.\n\nIts also possible to search for exact matches instead:\n\n```python\nrequest = SearchParameters(\n query_string=\"forest fires\",\n exact_match=True,\n)\n```\n\nOr to ignore the query string and search the whole database instead:\n\n```python\nrequest = SearchParameters(\n year_range=(2020, 2024),\n sort_by=\"date\",\n sort_order=\"descending\",\n)\n```\n\n### Continuing results\n\nThe response objects include continuation tokens, which can be used to get more results.\n\nFor the next selection of families:\n\n```python\nresponse = adaptor.search(SearchParameters(query_string=\"forest fires\"))\n\nfollow_up_request = SearchParameters(\n query_string=\"forest fires\"\n continuation_tokens=[response.continuation_token],\n\n)\nfollow_up_response = adaptor.search(follow_up_request)\n```\n\nIt is also possible to get more hits within families by using the continuation token on the family object, rather than at the responses root\n\nNote that `this_continuation_token` is used to mark the current continuation of the families, so getting more passages for a family after getting more families would look like this:\n\n```python\nfollow_up_response = adaptor.search(follow_up_request)\n\nthis_token = follow_up_response.this_continuation_token\npassage_token = follow_up_response.families[0].continuation_token\n\nfollow_up_request = SearchParameters(\n query_string=\"forest fires\"\n continuation_tokens=[this_token, passage_token],\n)\n```\n\n## Get a specific document\n\nUsers can also fetch single documents directly from Vespa, by document ID\n\n```python\nadaptor.get_by_id(document_id=\"id:YOUR_NAMESPACE:YOUR_SCHEMA_NAME::SOME_DOCUMENT_ID\")\n```\n\nAll of the above search functionality assumes that a valid set of vespa credentials is available in `~/.vespa`, or in a directory supplied to the `VespaSearchAdapter` constructor directly. See [the docs](docs/vespa-auth.md) for more information on how vespa expects credentials.\n\n# Test setup\n\nSome tests rely on a local running instance of vespa.\n\nThis requires the [vespa cli](https://docs.vespa.ai/en/vespa-cli.html) to be installed.\n\nSetup can then be run with:\n\n```\npoetry install --all-extras --with dev\npoetry shell\nmake vespa_dev_setup\nmake test\n```\n\nAlternatively, to only run non-vespa tests:\n\n```\nmake test_not_vespa\n```\n\nFor clean up:\n\n```\nmake vespa_dev_down\n```\n",
"bugtrack_url": null,
"license": null,
"summary": null,
"version": "0.5.6",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "dbe3f6c1c98385fc8061d495a7a0fb19edd6f6712561cd04441ffc8dbd8c4128",
"md5": "1a4d6caebf07e8744889f1985f03233d",
"sha256": "10afd58b1b8a8f45ec1852b41dc520e11886eb13db745db7116f74700a693c37"
},
"downloads": -1,
"filename": "cpr_sdk-0.5.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1a4d6caebf07e8744889f1985f03233d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 52205,
"upload_time": "2024-04-03T09:58:25",
"upload_time_iso_8601": "2024-04-03T09:58:25.327761Z",
"url": "https://files.pythonhosted.org/packages/db/e3/f6c1c98385fc8061d495a7a0fb19edd6f6712561cd04441ffc8dbd8c4128/cpr_sdk-0.5.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7f240b070787d783bf11ff0f728c7c196582398053317306efcb11762b6aa1fa",
"md5": "01d592ea96b1bd117ea8c6aba97c1ea8",
"sha256": "975f40b4642d83fd21c2545a9256dc6dc7dbd651f72b4fae745fa25f9308641e"
},
"downloads": -1,
"filename": "cpr_sdk-0.5.6.tar.gz",
"has_sig": false,
"md5_digest": "01d592ea96b1bd117ea8c6aba97c1ea8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 51827,
"upload_time": "2024-04-03T09:58:26",
"upload_time_iso_8601": "2024-04-03T09:58:26.933792Z",
"url": "https://files.pythonhosted.org/packages/7f/24/0b070787d783bf11ff0f728c7c196582398053317306efcb11762b6aa1fa/cpr_sdk-0.5.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-03 09:58:26",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "cpr-sdk"
}