cpr_sdk


- Name: cpr_sdk
- Version: 1.13.0
- Upload time: 2025-01-14 09:38:50
- Author: CPR Tech
- Requires Python: <4.0,>=3.10
- License: LICENSE

# cpr-sdk

Internal library for persistent access to text data.

> **Warning**
> This library is heavily under construction and doesn't work with any of our open data yet. We're working on making it usable for anyone.

## Documents and Datasets

The base document model of this library is `BaseDocument`, which contains only the metadata fields that are used in the parser.

### Loading from Huggingface Hub (recommended)

The `Dataset` class is automatically configured with the Huggingface repos we use. You can optionally provide a document limit, a dataset version, and override the repo that the data is loaded from.

If the repository is private, you must provide a [user access token](https://huggingface.co/docs/hub/security-tokens), either in your environment as `HUGGINGFACE_TOKEN` or as an argument to `from_huggingface`.

```py
from cpr_sdk.models import Dataset, GSTDocument

dataset = Dataset(GSTDocument).from_huggingface(
    version="d8363af072d7e0f87ec281dd5084fb3d3f4583a9", # commit hash, optional
    limit=1000,
    token="my-huggingface-token", # required for private repos if not in env
)
```
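
The token lookup described above can be sketched as follows. This is a minimal illustration and not the library's implementation: `resolve_token` is a hypothetical helper, and the precedence (explicit argument first, then the `HUGGINGFACE_TOKEN` environment variable) is an assumption.

```python
import os
from typing import Optional


def resolve_token(token: Optional[str] = None) -> Optional[str]:
    """Return an explicit token if given, else fall back to the environment.

    Hypothetical helper for illustration only; not part of cpr_sdk.
    """
    if token:
        return token
    # Returns None when the variable is unset, which is fine for public repos.
    return os.environ.get("HUGGINGFACE_TOKEN")
```

The result could then be passed as the `token` argument of `from_huggingface`.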

To load the passage-level flat dataset, set the `passage_level_and_flat` flag:

```py
from cpr_sdk.models import BaseDocument, Dataset

dataset = Dataset(
    document_model=BaseDocument
).from_huggingface(
    dataset_name="ClimatePolicyRadar/passage-level-flat-dataset",
    passage_level_and_flat=True,
)
```

### Loading from local storage or S3

```py
from cpr_sdk.models import BaseDocument

# document_id is also the filename stem
document = BaseDocument.load_from_local(folder_path="path/to/data/", document_id="document_1234")

document = BaseDocument.load_from_remote(dataset_key="s3://cpr-data", document_id="document_1234")
```

To manage metadata, documents need to be loaded into a `Dataset` object.

```py
from cpr_sdk.models import BaseDocument, CPRDocument, Dataset

dataset = Dataset().load_from_local("path/to/data", limit=1000)
assert all([isinstance(document, BaseDocument) for document in dataset])

dataset_with_metadata = dataset.add_metadata(
    target_model=CPRDocument,
    metadata_csv="path/to/metadata.csv",
)

assert all([isinstance(document, CPRDocument) for document in dataset_with_metadata])
```

Datasets have a number of methods for filtering and accessing documents.

```py
len(dataset)  # 1000

dataset[0]  # CPRDocument(...)

# Filtering by an exact value returns a Dataset
dataset.filter("document_id", "1234")

dataset.filter_by_language("en")

# Filtering using a function also returns a Dataset
dataset.filter("document_id", lambda x: x in ["1234", "5678"])
```
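
The two filtering styles above (matching an exact value, or passing a function) can be illustrated with a small standalone sketch. The `Doc` class and `filter_docs` function are hypothetical stand-ins written for this example; they are not how `Dataset.filter` is implemented, only one way such a value-or-callable filter could behave.

```python
from dataclasses import dataclass
from typing import Any, Callable, Union


@dataclass
class Doc:
    """Stand-in for a document model; only the fields used below."""

    document_id: str
    language: str


def filter_docs(
    docs: list[Doc],
    attribute: str,
    value: Union[Any, Callable[[Any], bool]],
) -> list[Doc]:
    """Keep docs whose attribute equals `value`, or satisfies it if callable."""
    if callable(value):
        return [d for d in docs if value(getattr(d, attribute))]
    return [d for d in docs if getattr(d, attribute) == value]


docs = [Doc("1234", "en"), Doc("5678", "fr"), Doc("9999", "en")]

# Exact match on a single ID
by_id = filter_docs(docs, "document_id", "1234")

# Function-based match on several IDs
by_ids = filter_docs(docs, "document_id", lambda x: x in ["1234", "5678"])
```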

## Search

This library can also be used to run searches against CPR documents and passages in Vespa.

```python
from cpr_sdk.search_adaptors import VespaSearchAdapter
from cpr_sdk.models.search import SearchParameters

adaptor = VespaSearchAdapter(instance_url="YOUR_INSTANCE_URL")

request = SearchParameters(query_string="forest fires")

response = adaptor.search(request)
```

The example above returns a `SearchResponse` object containing basic information about the request and the results. Results are arranged as a list of Families, each of which contains the relevant Documents and/or Passages.

### Sorting

By default, results are sorted by relevance, but they can also be sorted by date or name, e.g.:

```python
request = SearchParameters(
    query_string="forest fires",
    sort_by="date",
    sort_order="descending",
)
```

### Filters

Matching documents can also be filtered by keyword field and by publication date:

```python
request = SearchParameters(
    query_string="forest fires",
    filters={
        "language": ["English", "French"],
        "category": ["Executive"],
    },
    year_range=(2010, 2020)
)
```

### Search within families or documents

Search can be restricted to a subset of families or documents by passing their IDs:

```python
request = SearchParameters(
    query_string="forest fires",
    family_ids=["CCLW.family.10121.0", "CCLW.family.4980.0"],
)
```

```python
request = SearchParameters(
    query_string="forest fires",
    document_ids=["CCLW.executive.10121.4637", "CCLW.legislative.4980.1745"],
)
```

### Types of query

The default search approach uses a nearest neighbour search ranking.

It's also possible to search for exact matches instead:

```python
request = SearchParameters(
    query_string="forest fires",
    exact_match=True,
)
```

Or to ignore the query string and search the whole database instead:

```python
request = SearchParameters(
    year_range=(2020, 2024),
    sort_by="date",
    sort_order="descending",
)
```

### Continuing results

The response objects include continuation tokens, which can be used to get more results.

For the next selection of families:

```python
response = adaptor.search(SearchParameters(query_string="forest fires"))

follow_up_request = SearchParameters(
    query_string="forest fires",
    continuation_tokens=[response.continuation_token],
)
follow_up_response = adaptor.search(follow_up_request)
```

It is also possible to get more hits within families by using the continuation token on the family object, rather than the one at the response's root.

Note that `this_continuation_token` marks the current continuation of the families, so getting more passages for a family after getting more families would look like this:

```python
follow_up_response = adaptor.search(follow_up_request)

this_token = follow_up_response.this_continuation_token
passage_token = follow_up_response.families[0].continuation_token

follow_up_request = SearchParameters(
    query_string="forest fires",
    continuation_tokens=[this_token, passage_token],
)
```
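
Following continuation tokens until the results run out can be sketched as a simple loop. The example below is self-contained and uses stand-ins: `FakeResponse`, `PAGES` and `fake_search` are hypothetical and only mimic the shape of `adaptor.search(...)` and `SearchResponse`. It also assumes that the last page comes back with no continuation token, which is not confirmed by the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResponse:
    """Stand-in for SearchResponse with only the fields used here."""

    families: list[str]
    continuation_token: Optional[str] = None


# Two canned pages of results, keyed by the token that requests them.
PAGES = {
    None: FakeResponse(["family-1", "family-2"], continuation_token="page-2"),
    "page-2": FakeResponse(["family-3"], continuation_token=None),
}


def fake_search(continuation_tokens: Optional[list[str]] = None) -> FakeResponse:
    """Stub standing in for adaptor.search(SearchParameters(...))."""
    token = continuation_tokens[0] if continuation_tokens else None
    return PAGES[token]


def all_families(max_pages: int = 10) -> list[str]:
    """Follow continuation tokens until none is returned or max_pages is hit."""
    collected: list[str] = []
    tokens: Optional[list[str]] = None
    for _ in range(max_pages):
        response = fake_search(tokens)
        collected.extend(response.families)
        if response.continuation_token is None:
            break
        tokens = [response.continuation_token]
    return collected
```

The `max_pages` cap is a design choice for the sketch: it keeps the loop bounded if a backend keeps returning tokens.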

## Get a specific document

Users can also fetch single documents directly from Vespa by document ID:

```python
adaptor.get_by_id(document_id="id:YOUR_NAMESPACE:YOUR_SCHEMA_NAME::SOME_DOCUMENT_ID")
```
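
The ID passed above follows Vespa's document ID scheme, `id:<namespace>:<document-type>::<user-specified>`, where the empty segment is the optional key/value pair. A small helper for assembling such an ID might look like the sketch below. `build_vespa_id` is a hypothetical function, not part of this library, and the namespace and schema names in the usage line are placeholders.

```python
def build_vespa_id(namespace: str, schema: str, document_id: str) -> str:
    """Assemble a Vespa document ID with an empty key/value segment.

    Hypothetical helper for illustration only.
    """
    for name, part in (("namespace", namespace), ("schema", schema)):
        # A colon in these parts would shift the ID's segments.
        if not part or ":" in part:
            raise ValueError(f"{name} must be non-empty and contain no ':'")
    if not document_id:
        raise ValueError("document_id must be non-empty")
    return f"id:{namespace}:{schema}::{document_id}"


# Placeholder namespace and schema names, not real configuration values.
full_id = build_vespa_id("my_namespace", "my_schema", "CCLW.family.10121.0")
```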

All of the above search functionality assumes that a valid set of Vespa credentials is available in `~/.vespa`, or in a directory passed directly to the `VespaSearchAdapter` constructor. See [the docs](docs/vespa-auth.md) for more information on how Vespa expects credentials.

## Test setup

Some tests rely on a locally running instance of Vespa.

This requires the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html) to be installed.

Setup can then be run with:

```
poetry install --all-extras --with dev
poetry shell
make vespa_dev_setup
make test
```

Alternatively, to only run non-vespa tests:

```
make test_not_vespa
```

For clean up:

```
make vespa_dev_down
```

## Release flow

- Make updates to the package.
- Bump the package version in the `cpr_sdk/version.py` module.
- Make a PR.
  - CI/CD checks that the version is greater than the latest release.
- Merge.
- Manually tag a release in GitHub with a version that matches the one on `main` that you just merged.
  - CI/CD checks that the latest release matches the version defined in code.
- Check that the release appears on PyPI.
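
The "version is greater than the latest release" check in the flow above can be approximated locally before opening a PR. The sketch below is an assumption about what such a check involves, not the actual CI/CD job. `parse_version` and `is_valid_bump` are hypothetical helpers that only handle plain dotted numeric versions such as `1.13.0`.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.13.0' (or 'v1.13.0') into (1, 13, 0) for numeric comparison."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_valid_bump(new_version: str, latest_release: str) -> bool:
    """True when new_version is strictly greater than latest_release."""
    return parse_version(new_version) > parse_version(latest_release)
```

Comparing integer tuples rather than strings matters here: as plain strings, `"1.10.0"` sorts before `"1.9.0"`.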
            
