| Field | Value |
| --- | --- |
| Name | cpr_sdk |
| Version | 1.13.0 |
| home_page | None |
| Summary | None |
| upload_time | 2025-01-14 09:38:50 |
| maintainer | None |
| docs_url | None |
| author | CPR Tech |
| requires_python | <4.0,>=3.10 |
| license | LICENSE |
# cpr-sdk
Internal library for persistent access to text data.
> **Warning**
> This library is heavily under construction and doesn't work with any of our open data yet. We're working on making it usable for anyone.
## Documents and Datasets
The base document model of this library is `BaseDocument`, which contains only the metadata fields that are used in the parser.
### Loading from Huggingface Hub (recommended)
The `Dataset` class is automatically configured with the Huggingface repos we use. You can optionally provide a document limit, a dataset version, and override the repo that the data is loaded from.
If the repository is private, you must provide a [user access token](https://huggingface.co/docs/hub/security-tokens), either in your environment as `HUGGINGFACE_TOKEN` or as an argument to `from_huggingface`.
```py
from cpr_sdk.models import Dataset, GSTDocument

dataset = Dataset(GSTDocument).from_huggingface(
    version="d8363af072d7e0f87ec281dd5084fb3d3f4583a9",  # commit hash, optional
    limit=1000,
    token="my-huggingface-token",  # required for private repos if not in env
)
```
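The two ways of supplying a token described above (explicit argument or the `HUGGINGFACE_TOKEN` environment variable) can be illustrated with a small helper. This is only a sketch: `resolve_hf_token` is a hypothetical function, not part of `cpr_sdk`, and the library's own lookup may behave differently.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    explicit_token: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Hypothetical helper: prefer an explicit token, else read the environment.

    Returns None when no token is available, which is only acceptable
    for public repositories.
    """
    if explicit_token:
        return explicit_token
    return env.get("HUGGINGFACE_TOKEN")
```

The result could then be passed as the `token` argument of `from_huggingface`.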
Set the `passage_level_and_flat` flag when loading the passage-level flat dataset.
```py
from cpr_sdk.models import BaseDocument, Dataset

dataset = Dataset(document_model=BaseDocument).from_huggingface(
    dataset_name="ClimatePolicyRadar/passage-level-flat-dataset",
    passage_level_and_flat=True,
)
```
### Loading from local storage or S3
```py
from cpr_sdk.models import BaseDocument

# document_id is also the filename stem
document = BaseDocument.load_from_local(folder_path="path/to/data/", document_id="document_1234")
document = BaseDocument.load_from_remote(dataset_key="s3://cpr-data", document_id="document_1234")
```
To manage metadata, documents need to be loaded into a `Dataset` object.
```py
from cpr_sdk.models import BaseDocument, CPRDocument, Dataset

# Documents loaded without metadata are plain BaseDocument instances
dataset = Dataset().load_from_local("path/to/data", limit=1000)
assert all(isinstance(document, BaseDocument) for document in dataset)

# Adding metadata converts each document to the target model
dataset_with_metadata = dataset.add_metadata(
    target_model=CPRDocument,
    metadata_csv="path/to/metadata.csv",
)

assert all(isinstance(document, CPRDocument) for document in dataset_with_metadata)
```
Datasets have a number of methods for filtering and accessing documents.
```py
len(dataset)  # 1000

dataset[0]  # CPRDocument(...)

# Filtering by value: each call returns a new Dataset
dataset.filter("document_id", "1234")

dataset.filter_by_language("en")

# Filtering using a function
dataset.filter("document_id", lambda x: x in ["1234", "5678"])
```
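To make the two filter forms above concrete (a literal value or a function), here is a minimal, self-contained sketch. `Record` and `filter_records` are hypothetical stand-ins written for illustration; they are not part of `cpr_sdk`, and the real `Dataset.filter` may differ in detail.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Union


@dataclass
class Record:
    """Hypothetical stand-in for a document with a couple of attributes."""

    document_id: str
    language: str


def filter_records(
    records: List[Record],
    attribute: str,
    value: Union[Any, Callable[[Any], bool]],
) -> List[Record]:
    """Keep records whose attribute equals `value`, or satisfies it if callable."""
    if callable(value):
        return [r for r in records if value(getattr(r, attribute))]
    return [r for r in records if getattr(r, attribute) == value]


records = [Record("1234", "en"), Record("5678", "fr"), Record("9999", "en")]
by_value = filter_records(records, "document_id", "1234")
by_function = filter_records(records, "document_id", lambda x: x in ["1234", "5678"])
```

Here `by_value` holds one record and `by_function` holds two; neither call mutates the input list.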
## Search
This library can also be used to run searches against CPR documents and passages in Vespa.
```python
from cpr_sdk.search_adaptors import VespaSearchAdapter
from cpr_sdk.models.search import SearchParameters

adaptor = VespaSearchAdapter(instance_url="YOUR_INSTANCE_URL")

request = SearchParameters(query_string="forest fires")

response = adaptor.search(request)
```
The example above returns a `SearchResponse` object, which contains basic information about the request together with the results. The results are arranged as a list of Families, each containing the relevant Documents and/or Passages.
### Sorting
By default, results are sorted by relevance, but they can also be sorted by date or name, for example:
```python
request = SearchParameters(
    query_string="forest fires",
    sort_by="date",
    sort_order="descending",
)
```
### Filters
Matching documents can also be filtered by keyword field and by publication date:
```python
request = SearchParameters(
    query_string="forest fires",
    filters={
        "language": ["English", "French"],
        "category": ["Executive"],
    },
    year_range=(2010, 2020),
)
```
### Search within families or documents
A search can be restricted to a subset of families or documents by passing their IDs:
```python
request = SearchParameters(
    query_string="forest fires",
    family_ids=["CCLW.family.10121.0", "CCLW.family.4980.0"],
)
```
```python
request = SearchParameters(
    query_string="forest fires",
    document_ids=["CCLW.executive.10121.4637", "CCLW.legislative.4980.1745"],
)
```
### Types of query
The default search approach ranks results using a nearest neighbour search.
It's also possible to search for exact matches instead:
```python
request = SearchParameters(
    query_string="forest fires",
    exact_match=True,
)
```
Or to ignore the query string and search the whole database instead:
```python
request = SearchParameters(
    year_range=(2020, 2024),
    sort_by="date",
    sort_order="descending",
)
```
### Continuing results
The response objects include continuation tokens, which can be used to get more results.
For the next selection of families:
```python
response = adaptor.search(SearchParameters(query_string="forest fires"))

# Pass the token from the first response to fetch the next set of families
follow_up_request = SearchParameters(
    query_string="forest fires",
    continuation_tokens=[response.continuation_token],
)
follow_up_response = adaptor.search(follow_up_request)
```
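Repeating the family-level follow-up request until no token remains gives a simple pagination loop. The sketch below is self-contained and uses a fake search function instead of a live Vespa instance; `FakeResponse`, `fake_search` and `collect_families` are hypothetical names, and it assumes that the continuation token is `None` once there are no further results, which the library may signal differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in exposing the two attributes used above."""

    families: List[str]
    continuation_token: Optional[str] = None


def collect_families(
    search: Callable[[Optional[str]], FakeResponse],
    max_pages: int = 10,
) -> List[str]:
    """Follow continuation tokens until they run out or max_pages is reached."""
    families: List[str] = []
    token: Optional[str] = None
    for _ in range(max_pages):
        response = search(token)
        families.extend(response.families)
        token = response.continuation_token
        if token is None:  # assumption: no token means no more results
            break
    return families


def fake_search(token: Optional[str]) -> FakeResponse:
    """Two canned pages keyed by the token that requests them."""
    pages = {
        None: FakeResponse(["family-a", "family-b"], "TOKEN-1"),
        "TOKEN-1": FakeResponse(["family-c"], None),
    }
    return pages[token]


all_families = collect_families(fake_search)
```

With the real adaptor, `search` would wrap `adaptor.search` and pass the token through `continuation_tokens`. The `max_pages` cap is a safety limit so that a broad query cannot loop indefinitely.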
It is also possible to get more hits within a family by using the continuation token on the family object, rather than the one at the response's root.
Note that `this_continuation_token` marks the current continuation of the families, so getting more passages for a family after getting more families would look like this:
```python
follow_up_response = adaptor.search(follow_up_request)

# this_token keeps the current page of families; passage_token advances within a family
this_token = follow_up_response.this_continuation_token
passage_token = follow_up_response.families[0].continuation_token

follow_up_request = SearchParameters(
    query_string="forest fires",
    continuation_tokens=[this_token, passage_token],
)
```
## Get a specific document
Users can also fetch single documents directly from Vespa by document ID:
```python
adaptor.get_by_id(document_id="id:YOUR_NAMESPACE:YOUR_SCHEMA_NAME::SOME_DOCUMENT_ID")
```
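The ID passed above follows the pattern `id:<namespace>:<schema name>::<document id>`. A small helper can assemble it from its parts; `build_vespa_document_id` is a hypothetical function written for illustration, not part of `cpr_sdk`, and the namespace and schema values used with it below are placeholders.

```python
def build_vespa_document_id(namespace: str, schema_name: str, document_id: str) -> str:
    """Assemble an ID of the form id:<namespace>:<schema_name>::<document_id>."""
    for label, part in (("namespace", namespace), ("schema_name", schema_name)):
        # A colon inside these parts would shift the fields of the ID
        if not part or ":" in part:
            raise ValueError(f"{label} must be non-empty and contain no ':'")
    if not document_id:
        raise ValueError("document_id must be non-empty")
    return f"id:{namespace}:{schema_name}::{document_id}"


# Placeholder namespace and schema; substitute the values for your instance
full_id = build_vespa_document_id("my_namespace", "my_schema", "CCLW.executive.10121.4637")
```

The resulting string can be passed as the `document_id` argument of `get_by_id`.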
All of the above search functionality assumes that a valid set of Vespa credentials is available in `~/.vespa`, or in a directory supplied directly to the `VespaSearchAdapter` constructor. See [the docs](docs/vespa-auth.md) for more information on how Vespa expects credentials.
# Test setup
Some tests rely on a locally running instance of Vespa.
This requires the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html) to be installed.
Setup can then be run with:
```
poetry install --all-extras --with dev
poetry shell
make vespa_dev_setup
make test
```
Alternatively, to run only the non-Vespa tests:
```
make test_not_vespa
```
For clean up:
```
make vespa_dev_down
```
## Release flow
- Make updates to the package.
- Bump the package version in the `cpr_sdk/version.py` module.
- Make a PR.
  - In CI/CD we will check that the version is greater than the latest release.
- Merge.
- Tag a release manually in GitHub with a version that matches the latest on main that you just merged.
  - In CI/CD we will check that the latest release matches the versions defined in code.
- Check that the new version appears on PyPI.
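
The "version is greater than the latest release" check in the list above can be sketched as a comparison of numeric version tuples. This is an illustrative sketch only, not the project's actual CI script; `parse_version` and `is_newer` are hypothetical helpers that assume plain dotted numeric versions such as `1.13.0`.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.13.0' (or 'v1.13.0') into (1, 13, 0) for numeric comparison."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_newer(candidate: str, latest_release: str) -> bool:
    """True if candidate is strictly greater than the latest released version."""
    return parse_version(candidate) > parse_version(latest_release)


# Comparing as tuples avoids the string-ordering trap where "1.9.0" > "1.13.0"
ok_to_release = is_newer("1.13.0", "1.12.9")
```

For versions with pre-release or local suffixes, a dedicated version-parsing library would be a safer choice than this tuple comparison.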
Raw data
{
"_id": null,
"home_page": null,
"name": "cpr_sdk",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.10",
"maintainer_email": null,
"keywords": null,
"author": "CPR Tech",
"author_email": "tech@climatepolicyradar.org",
"download_url": "https://files.pythonhosted.org/packages/d4/48/5577ea4d22db15521bc066a688b8a0e28afc2f351a99f2db229c032316c6/cpr_sdk-1.13.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "LICENSE",
"summary": null,
"version": "1.13.0",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "38a9475deffc4c0a5d61a0ca434b47fa7a2044d608f1ca88c26c65dd4457a35b",
"md5": "ade5fb8c30a3065e07f8b81c0a7d9504",
"sha256": "4eff02c839ffe0b9b95701fa76def13601ed663fabd0ea27be60a0356093ee21"
},
"downloads": -1,
"filename": "cpr_sdk-1.13.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ade5fb8c30a3065e07f8b81c0a7d9504",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.10",
"size": 60449,
"upload_time": "2025-01-14T09:38:49",
"upload_time_iso_8601": "2025-01-14T09:38:49.356001Z",
"url": "https://files.pythonhosted.org/packages/38/a9/475deffc4c0a5d61a0ca434b47fa7a2044d608f1ca88c26c65dd4457a35b/cpr_sdk-1.13.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "d4485577ea4d22db15521bc066a688b8a0e28afc2f351a99f2db229c032316c6",
"md5": "7869b781db52c19aae64f53eb5a9d13e",
"sha256": "7e339ac449cf8475cedf9d55e4cf5bdf6f8dd1d72801fb94067a8ced9af3fabe"
},
"downloads": -1,
"filename": "cpr_sdk-1.13.0.tar.gz",
"has_sig": false,
"md5_digest": "7869b781db52c19aae64f53eb5a9d13e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.10",
"size": 58676,
"upload_time": "2025-01-14T09:38:50",
"upload_time_iso_8601": "2025-01-14T09:38:50.502315Z",
"url": "https://files.pythonhosted.org/packages/d4/48/5577ea4d22db15521bc066a688b8a0e28afc2f351a99f2db229c032316c6/cpr_sdk-1.13.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-14 09:38:50",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "cpr_sdk"
}