| Name | cpr-sdk |
| Version | 0.5.6 |
| home_page | None |
| Summary | None |
| upload_time | 2024-04-03 09:58:26 |
| maintainer | None |
| docs_url | None |
| author | CPR Tech |
| requires_python | <4.0,>=3.9 |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# cpr-sdk
Internal library for persistent access to text data.
> **Warning**
> This library is heavily under construction and doesn't work with any of our open data yet. We're working on making it usable for anyone.
## Documents and Datasets
The base document model of this library is `BaseDocument`, which contains only the metadata fields that are used in the parser.
### Loading from Huggingface Hub (recommended)
The `Dataset` class is automatically configured with the Huggingface repos we use. You can optionally provide a document limit and a dataset version, and you can override the repo that the data is loaded from.
If the repository is private you must provide a [user access token](https://huggingface.co/docs/hub/security-tokens), either in your environment as `HUGGINGFACE_TOKEN`, or as an argument to `from_huggingface`.
```py
from cpr_sdk.models import Dataset, GSTDocument
dataset = Dataset(GSTDocument).from_huggingface(
version="d8363af072d7e0f87ec281dd5084fb3d3f4583a9", # commit hash, optional
limit=1000,
token="my-huggingface-token", # required for private repos if not in env
)
```
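Alternatively, the token can be supplied via the environment and omitted from the call. A minimal sketch of that approach, using the `HUGGINGFACE_TOKEN` variable mentioned above (in practice you would normally export it in your shell rather than set it in code):
```py
import os

from cpr_sdk.models import Dataset, GSTDocument

# Setting the variable in code is only for illustration; normally it would
# already be exported in the environment.
os.environ["HUGGINGFACE_TOKEN"] = "my-huggingface-token"

# No `token` argument needed now; it is picked up from the environment.
dataset = Dataset(GSTDocument).from_huggingface(limit=1000)
```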
### Loading from local storage or s3
```py
from cpr_sdk.models import BaseDocument

# document_id is also the filename stem
document = BaseDocument.load_from_local(folder_path="path/to/data/", document_id="document_1234")
document = BaseDocument.load_from_remote(dataset_key="s3://cpr-data", document_id="document_1234")
```
To manage metadata, documents need to be loaded into a `Dataset` object.
```py
from cpr_sdk.models import Dataset, CPRDocument, GSTDocument
dataset = Dataset().load_from_local("path/to/data", limit=1000)
assert all([isinstance(document, BaseDocument) for document in dataset])
dataset_with_metadata = dataset.add_metadata(
target_model=CPRDocument,
metadata_csv="path/to/metadata.csv",
)
assert all([isinstance(document, CPRDocument) for document in dataset_with_metadata])
```
Datasets have a number of methods for filtering and accessing documents.
```py
len(dataset)
>>> 1000
dataset[0]
>>> CPRDocument(...)
# Filtering
dataset.filter("document_id", "1234")
>>> Dataset()
dataset.filter_by_language("en")
>>> Dataset()
# Filtering using a function
dataset.filter("document_id", lambda x: x in ["1234", "5678"])
>>> Dataset()
```
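Since the examples above show each filter returning a `Dataset`, the calls can plausibly be chained. A short sketch under that assumption:
```py
# Assumes filter methods return a new Dataset rather than mutating the
# original, as the `>>> Dataset()` outputs above suggest.
filtered = dataset.filter_by_language("en").filter(
    "document_id", lambda x: x in ["1234", "5678"]
)
len(filtered)
```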
## Search
This library can also be used to run searches against CPR documents and passages in Vespa.
```python
from cpr_sdk.search_adaptors import VespaSearchAdapter
from cpr_sdk.models.search import SearchParameters
adaptor = VespaSearchAdapter(instance_url="YOUR_INSTANCE_URL")
request = SearchParameters(query_string="forest fires")
response = adaptor.search(request)
```
The above example returns a `SearchResponse` object, which contains some basic information about the request, along with the results arranged as a list of Families, each of which contains the relevant Documents and/or Passages.
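For orientation, a small sketch of walking that structure; only the `families` attribute (used again in the continuation examples below) is taken from this README, so treat the rest as illustrative:
```python
response = adaptor.search(SearchParameters(query_string="forest fires"))

# Each family groups the documents and/or passages that matched the query.
for family in response.families:
    print(family)
```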
### Sorting
By default, results are sorted by relevance, but they can also be sorted by date or name, e.g.:
```python
request = SearchParameters(
query_string="forest fires",
sort_by="date",
sort_order="descending",
)
```
### Filters
Matching documents can also be filtered by keyword field and by publication date:
```python
request = SearchParameters(
query_string="forest fires",
filters={
"language": ["English", "French"],
"category": ["Executive"],
},
year_range=(2010, 2020)
)
```
### Search within families or documents
A subset of families or documents can be retrieved for search using their IDs:
```python
request = SearchParameters(
query_string="forest fires",
family_ids=["CCLW.family.10121.0", "CCLW.family.4980.0"],
)
```
```python
request = SearchParameters(
query_string="forest fires",
document_ids=["CCLW.executive.10121.4637", "CCLW.legislative.4980.1745"],
)
```
### Types of query
The default search approach uses a nearest neighbour search ranking.
It's also possible to search for exact matches instead:
```python
request = SearchParameters(
query_string="forest fires",
exact_match=True,
)
```
Or to ignore the query string and search the whole database instead:
```python
request = SearchParameters(
year_range=(2020, 2024),
sort_by="date",
sort_order="descending",
)
```
### Continuing results
The response objects include continuation tokens, which can be used to get more results.
For the next selection of families:
```python
response = adaptor.search(SearchParameters(query_string="forest fires"))
follow_up_request = SearchParameters(
    query_string="forest fires",
    continuation_tokens=[response.continuation_token],
)
follow_up_response = adaptor.search(follow_up_request)
```
It is also possible to get more hits within a family by using the continuation token on the family object, rather than the one at the response's root.
Note that `this_continuation_token` is used to mark the current continuation of the families, so getting more passages for a family after getting more families would look like this:
```python
follow_up_response = adaptor.search(follow_up_request)
this_token = follow_up_response.this_continuation_token
passage_token = follow_up_response.families[0].continuation_token
follow_up_request = SearchParameters(
    query_string="forest fires",
    continuation_tokens=[this_token, passage_token],
)
```
## Get a specific document
Users can also fetch single documents directly from Vespa by document ID:
```python
adaptor.get_by_id(document_id="id:YOUR_NAMESPACE:YOUR_SCHEMA_NAME::SOME_DOCUMENT_ID")
```
All of the above search functionality assumes that a valid set of Vespa credentials is available in `~/.vespa`, or in a directory supplied directly to the `VespaSearchAdapter` constructor. See [the docs](docs/vespa-auth.md) for more information on how Vespa expects credentials.
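A sketch of the second option, passing a credentials directory to the constructor; the keyword name `cert_directory` is an assumption here and may differ from the actual signature:
```python
from cpr_sdk.search_adaptors import VespaSearchAdapter

# `cert_directory` is a hypothetical parameter name; the README only states
# that a directory can be supplied to the constructor.
adaptor = VespaSearchAdapter(
    instance_url="YOUR_INSTANCE_URL",
    cert_directory="/path/to/vespa/credentials",
)
```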
# Test setup
Some tests rely on a locally running instance of Vespa.
This requires the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html) to be installed.
Setup can then be run with:
```
poetry install --all-extras --with dev
poetry shell
make vespa_dev_setup
make test
```
Alternatively, to only run non-vespa tests:
```
make test_not_vespa
```
To clean up:
```
make vespa_dev_down
```
Raw data
{
"_id": null,
"home_page": null,
"name": "cpr-sdk",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": null,
"keywords": null,
"author": "CPR Tech",
"author_email": "tech@climatepolicyradar.org",
"download_url": "https://files.pythonhosted.org/packages/7f/24/0b070787d783bf11ff0f728c7c196582398053317306efcb11762b6aa1fa/cpr_sdk-0.5.6.tar.gz",
"platform": null,
"description": "# cpr-sdk\n\nInternal library for persistent access to text data.\n\n> **Warning**\n> This library is heavily under construction and doesn't work with any of our open data yet. We're working on making it usable for anyone.\n\n## Documents and Datasets\n\nThe base document model of this library is `BaseDocument`, which contains only the metadata fields that are used in the parser.\n\n### Loading from Huggingface Hub (recommended)\n\nThe `Dataset` class is automatically configured with the Huggingface repos we use. You can optionally provide a document limit, a dataset version, and override the repo that the data is loaded from.\n\nIf the repository is private you must provide a [user access token](https://huggingface.co/docs/hub/security-tokens), either in your environment as `HUGGINGFACE_TOKEN`, or as an argument to `from_huggingface`.\n\n```py\nfrom cpr_sdk.models import Dataset, GSTDocument\n\ndataset = Dataset(GSTDocument).from_huggingface(\n version=\"d8363af072d7e0f87ec281dd5084fb3d3f4583a9\", # commit hash, optional\n limit=1000,\n token=\"my-huggingface-token\", # required for private repos if not in env\n)\n```\n\n### Loading from local storage or s3\n\n```py\n# document_id is also the filename stem\n\ndocument = BaseDocument.load_from_local(folder_path=\"path/to/data/\", document_id=\"document_1234\")\n\ndocument = BaseDocument.load_from_remote(dataset_key\"s3://cpr-data\", document_id=\"document_1234\")\n```\n\nTo manage metadata, documents need to be loaded into a `Dataset` object.\n\n```py\nfrom cpr_sdk.models import Dataset, CPRDocument, GSTDocument\n\ndataset = Dataset().load_from_local(\"path/to/data\", limit=1000)\nassert all([isinstance(document, BaseDocument) for document in dataset])\n\ndataset_with_metadata = dataset.add_metadata(\n target_model=CPRDocument,\n metadata_csv=\"path/to/metadata.csv\",\n)\n\nassert all([isinstance(document, CPRDocument) for document in dataset_with_metadata])\n```\n\nDatasets have a number of methods for filtering and accessing documents.\n\n```py\nlen(dataset)\n>>> 1000\n\ndataset[0]\n>>> CPRDocument(...)\n\n# Filtering\ndataset.filter(\"document_id\", \"1234\")\n>>> Dataset()\n\ndataset.filter_by_language(\"en\")\n>>> Dataset()\n\n# Filtering using a function\ndataset.filter(\"document_id\", lambda x: x in [\"1234\", \"5678\"])\n>>> Dataset()\n```\n\n## Search\n\nThis library can also be used to run searches against CPR documents and passages in Vespa.\n\n```python\nfrom src.cpr_sdk.search_adaptors import VespaSearchAdapter\nfrom src.cpr_sdk.models.search import SearchParameters\n\nadaptor = VespaSearchAdapter(instance_url=\"YOUR_INSTANCE_URL\")\n\nrequest = SearchParameters(query_string=\"forest fires\")\n\nresponse = adaptor.search(request)\n```\n\nThe above example will return a `SearchResponse` object, which lists some basic information about the request, and the results, arranged as a list of Families, which each contain relevant Documents and/or Passages.\n\n### Sorting\n\nBy default, results are sorted by relevance, but can be sorted by date, or name, eg\n\n```python\nrequest = SearchParameters(\n query_string=\"forest fires\",\n sort_by=\"date\",\n sort_order=\"descending\",\n)\n```\n\n### Filters\n\nMatching documents can also be filtered by keyword field, and by publication date\n\n```python\nrequest = SearchParameters(\n query_string=\"forest fires\",\n filters={\n \"language\": [\"English\", \"French\"],\n \"category\": [\"Executive\"],\n },\n year_range=(2010, 2020)\n)\n```\n\n### Search within families or 
documents\n\nA subset of families or documents can be retrieved for search using their ids\n\n```python\nrequest = SearchParameters(\n query_string=\"forest fires\",\n family_ids=[\"CCLW.family.10121.0\", \"CCLW.family.4980.0\"],\n)\n```\n\n```python\nrequest = SearchParameters(\n query_string=\"forest fires\",\n document_ids=[\"CCLW.executive.10121.4637\", \"CCLW.legislative.4980.1745\"],\n)\n```\n\n### Types of query\n\nThe default search approach uses a nearest neighbour search ranking.\n\nIts also possible to search for exact matches instead:\n\n```python\nrequest = SearchParameters(\n query_string=\"forest fires\",\n exact_match=True,\n)\n```\n\nOr to ignore the query string and search the whole database instead:\n\n```python\nrequest = SearchParameters(\n year_range=(2020, 2024),\n sort_by=\"date\",\n sort_order=\"descending\",\n)\n```\n\n### Continuing results\n\nThe response objects include continuation tokens, which can be used to get more results.\n\nFor the next selection of families:\n\n```python\nresponse = adaptor.search(SearchParameters(query_string=\"forest fires\"))\n\nfollow_up_request = SearchParameters(\n query_string=\"forest fires\"\n continuation_tokens=[response.continuation_token],\n\n)\nfollow_up_response = adaptor.search(follow_up_request)\n```\n\nIt is also possible to get more hits within families by using the continuation token on the family object, rather than at the responses root\n\nNote that `this_continuation_token` is used to mark the current continuation of the families, so getting more passages for a family after getting more families would look like this:\n\n```python\nfollow_up_response = adaptor.search(follow_up_request)\n\nthis_token = follow_up_response.this_continuation_token\npassage_token = follow_up_response.families[0].continuation_token\n\nfollow_up_request = SearchParameters(\n query_string=\"forest fires\"\n continuation_tokens=[this_token, passage_token],\n)\n```\n\n## Get a specific document\n\nUsers can also fetch single documents directly from Vespa, by document ID\n\n```python\nadaptor.get_by_id(document_id=\"id:YOUR_NAMESPACE:YOUR_SCHEMA_NAME::SOME_DOCUMENT_ID\")\n```\n\nAll of the above search functionality assumes that a valid set of vespa credentials is available in `~/.vespa`, or in a directory supplied to the `VespaSearchAdapter` constructor directly. See [the docs](docs/vespa-auth.md) for more information on how vespa expects credentials.\n\n# Test setup\n\nSome tests rely on a local running instance of vespa.\n\nThis requires the [vespa cli](https://docs.vespa.ai/en/vespa-cli.html) to be installed.\n\nSetup can then be run with:\n\n```\npoetry install --all-extras --with dev\npoetry shell\nmake vespa_dev_setup\nmake test\n```\n\nAlternatively, to only run non-vespa tests:\n\n```\nmake test_not_vespa\n```\n\nFor clean up:\n\n```\nmake vespa_dev_down\n```\n",
"bugtrack_url": null,
"license": null,
"summary": null,
"version": "0.5.6",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "dbe3f6c1c98385fc8061d495a7a0fb19edd6f6712561cd04441ffc8dbd8c4128",
"md5": "1a4d6caebf07e8744889f1985f03233d",
"sha256": "10afd58b1b8a8f45ec1852b41dc520e11886eb13db745db7116f74700a693c37"
},
"downloads": -1,
"filename": "cpr_sdk-0.5.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1a4d6caebf07e8744889f1985f03233d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 52205,
"upload_time": "2024-04-03T09:58:25",
"upload_time_iso_8601": "2024-04-03T09:58:25.327761Z",
"url": "https://files.pythonhosted.org/packages/db/e3/f6c1c98385fc8061d495a7a0fb19edd6f6712561cd04441ffc8dbd8c4128/cpr_sdk-0.5.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7f240b070787d783bf11ff0f728c7c196582398053317306efcb11762b6aa1fa",
"md5": "01d592ea96b1bd117ea8c6aba97c1ea8",
"sha256": "975f40b4642d83fd21c2545a9256dc6dc7dbd651f72b4fae745fa25f9308641e"
},
"downloads": -1,
"filename": "cpr_sdk-0.5.6.tar.gz",
"has_sig": false,
"md5_digest": "01d592ea96b1bd117ea8c6aba97c1ea8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 51827,
"upload_time": "2024-04-03T09:58:26",
"upload_time_iso_8601": "2024-04-03T09:58:26.933792Z",
"url": "https://files.pythonhosted.org/packages/7f/24/0b070787d783bf11ff0f728c7c196582398053317306efcb11762b6aa1fa/cpr_sdk-0.5.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-03 09:58:26",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "cpr-sdk"
}