ckanext-resource-indexer

Name	ckanext-resource-indexer
Version	0.4.2.post1
upload_time	2024-10-17 14:21:45
license	AGPL
keywords	ckan

# ckanext-resource_indexer

Discover more results in the Dataset search by searching through the content of resources.

This extension indexes the content of files attached to resources, so users
have a better chance of finding relevant results when using the site search.

The process of indexation can be customized for each file format via [resource
indexers](#indexers). The following formats are supported out of the box:
* Plain text
* PDF
* JSON

## Structure
* [Installation](#installation)
* [Configuration](#configuration)
* [Indexers](#indexers)
  * [Register own indexer](#register-own-indexer)
  * [Built-in indexers](#built-in-indexers)

## Installation

1. Install the package as a CKAN extension
   ```sh
   pip install ckanext-resource-indexer
   ```
1. Add `resource_indexer` to the list of enabled plugins
1. **Optional**. Enable built-in indexers by adding the following items to the list of enabled plugins:
   * [`plain_resource_indexer`](#plain-indexer)
   * [`pdf_resource_indexer`](#pdf-indexer)
   * [`json_resource_indexer`](#json-indexer)


## Configuration
```ini
# Attempt to index remote files (fetched into a temporary folder
# using the resource URL)
# (optional, default: false).
ckanext.resource_indexer.allow_remote = 1

# Timeout for the attempt to download a remote file
# (optional, default: 2).
ckanext.resource_indexer.remote_timeout = 10

# The size threshold (MB) for remote resources
# (optional, default: 4).
ckanext.resource_indexer.max_remote_size = 4

# Space-separated list of resource formats (lowercase) that should be
# indexed.
# (optional, default: None)
ckanext.resource_indexer.indexable_formats = txt pdf

# Store the data extracted from resource inside specified field in the index.
# If empty, store data inside the general-purpose `text` field.
# (optional, default: text)
ckanext.resoruce_indexer.index_field = extras_res_attachment

# Boost matches by resource content. Set a value greater than 1 to
# promote such matches, or a value between 0 and 1 to push such
# matches further down in search results. Works only when using a custom
# index field (`ckanext.resoruce_indexer.index_field`)
# (optional, default: 1)
ckanext.resoruce_indexer.search_boost = 0.5

##### Indexer-specific options ###############

### Plain
# Space-separated list of formats that can be indexed as a plain text
# (optional, default: txt csv json yaml yml html)
ckanext.resource_indexer.plain.indexable_formats = xml txt csv

### PDF
# Transform the text of a single page before it is added to the index
# (optional, default: builtins:str)
ckanext.resoruce_indexer.pdf.page_processor = custom.module:value_processor

### JSON
# Index JSON files as plain text (in addition to indexing as a mapping)
# (optional, default: false)
ckanext.resoruce_indexer.json.add_as_plain = true

# Change a key before it's used for patching the package dictionary
# (optional, default: builtins:str)
ckanext.resoruce_indexer.json.key_processor = custom.module:key_processor

# Change a value before it's used for patching the package dictionary
# (optional, default: builtins:str)
ckanext.resoruce_indexer.json.value_processor = custom.module:value_processor
```

## Indexers

To extract data from resources, this extension uses
**Indexers**. These are CKAN plugins implementing the `IResourceIndexer` interface.

For every resource whose format is listed in the
`ckanext.resource_indexer.indexable_formats` config option, the extension looks
for an appropriate indexer. If no indexer is found (or the resource format is
missing from the `ckanext.resource_indexer.indexable_formats` config option),
the resource is skipped.

:information_source: Indexation can be temporarily disabled using one of the
following approaches:
* Set the environment variable `CKANEXT_RESOURCE_INDEXER_BYPASS` (any non-empty
  value), and the plugin won't interfere with the standard dataset indexation
  process.
* Use the `ckanext.resource_indexer.utils.disabled_indexation` context manager:
  ```python
  from ckanext.resource_indexer.utils import disabled_indexation

  with disabled_indexation():
      here_indexation_does_not_happen()

  here_indexation_happens()
  ```
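
When the environment-variable approach is preferred inside a Python script, it can be wrapped in a small helper. The sketch below is only an illustration: `bypass_resource_indexing` is a hypothetical name that is not part of the extension, and it assumes the plugin checks `CKANEXT_RESOURCE_INDEXER_BYPASS` in the same process at indexing time.

```python
import os
from contextlib import contextmanager

BYPASS_VAR = "CKANEXT_RESOURCE_INDEXER_BYPASS"


@contextmanager
def bypass_resource_indexing():
    """Hypothetical helper: keep the bypass variable set inside a `with` block."""
    previous = os.environ.get(BYPASS_VAR)
    os.environ[BYPASS_VAR] = "1"  # any non-empty value is enough
    try:
        yield
    finally:
        # Restore the state that existed before the block was entered.
        if previous is None:
            os.environ.pop(BYPASS_VAR, None)
        else:
            os.environ[BYPASS_VAR] = previous
```

The `try`/`finally` ensures the variable is restored even if the wrapped code raises.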


Every indexer has a weight (priority). The indexer with the highest weight is
used to index the resource.
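
The selection rule can be pictured with a short, self-contained sketch. It is not the extension's code: `select_indexer` and the two weight functions are hypothetical stand-ins for enabled plugins, and the weight of 20 for the JSON handler is an assumption made only for the example.

```python
from typing import Any, Callable, Optional

WeightFn = Callable[[dict[str, Any]], int]


def plain_weight(resource: dict[str, Any]) -> int:
    # Stand-in for a fallback handler (weight 10).
    fmt = (resource.get("format") or "").lower()
    return 10 if fmt in {"txt", "csv", "json"} else 0


def json_weight(resource: dict[str, Any]) -> int:
    # Stand-in for a more specific handler (weight 20, assumed).
    fmt = (resource.get("format") or "").lower()
    return 20 if fmt == "json" else 0


def select_indexer(
    resource: dict[str, Any],
    indexers: dict[str, WeightFn],
    indexable_formats: set[str],
) -> Optional[str]:
    """Return the name of the highest-weight indexer, or None to skip."""
    fmt = (resource.get("format") or "").lower()
    if fmt not in indexable_formats or not indexers:
        return None
    weights = {name: fn(resource) for name, fn in indexers.items()}
    name, weight = max(weights.items(), key=lambda item: item[1])
    return name if weight > 0 else None  # weight 0 means "skip handler"
```

With both handlers enabled, a JSON resource goes to the more specific handler, while a `txt` resource falls back to the plain one.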

Indexation consists of two steps:

* meaningful data segments are extracted from the resource
* these data segments are merged into the package dictionary consumed by the
  search engine (Solr) for indexing

This means that the format of the extracted segments must be compatible with the
merging logic of the second step. Other than that, there are no
particular requirements for the format of the extracted data.
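
A minimal sketch of the two steps, using plain functions rather than a CKAN plugin, may make the contract clearer. The function names are hypothetical, and storing the chunks as a list under the `text` key is an assumption about the package dictionary made only for this example.

```python
from typing import Any


def extract_lines(path: str) -> list[str]:
    """Step 1: pull non-empty lines out of a text file."""
    with open(path, encoding="utf-8") as src:
        return [line.strip() for line in src if line.strip()]


def merge_lines(pkg_dict: dict[str, Any], chunks: list[str], field: str = "text") -> None:
    """Step 2: append the chunks to a list-valued field of the package dict."""
    pkg_dict.setdefault(field, []).extend(chunks)
```

The only coupling between the two functions is the chunk format: `merge_lines` expects exactly the list of strings that `extract_lines` produces.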

Data extraction happens locally. If the resource was uploaded to the local
filesystem, data is extracted directly from the resource's file. If the
resource is stored remotely (either uploaded to the cloud or linked via a remote
URL), it can be temporarily downloaded to the local filesystem and removed
after processing. By default, non-local resources are ignored, but this can be
changed via the `ckanext.resource_indexer.allow_remote` config option.
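
The "download temporarily, respect a size threshold, remove afterwards" idea can be sketched as follows. This is not the extension's download logic: `copy_with_limit` is a hypothetical helper, it works on any binary stream instead of performing an HTTP request, and it assumes 1 MB means 1024 * 1024 bytes.

```python
import os
import tempfile
from typing import BinaryIO, Optional


def copy_with_limit(
    stream: BinaryIO, max_size_mb: float, chunk_size: int = 64 * 1024
) -> Optional[str]:
    """Copy `stream` into a temporary file; return its path, or None if too large."""
    limit = int(max_size_mb * 1024 * 1024)
    written = 0
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            written += len(chunk)
            if written > limit:
                break  # stop reading as soon as the threshold is exceeded
            tmp.write(chunk)
    if written > limit:
        os.remove(path)  # do not leave partial files behind
        return None
    return path
```

The caller is expected to process the returned file and then delete it with `os.remove(path)`.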

### Register own indexer

Implement `ckanext.resource_indexer.interface.IResourceIndexer` by providing the following methods:

```python
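from typing import Any

import ckan.plugins as plugins
from ckanext.resource_indexer.interface import IResourceIndexer

# NOTE: `Weight` (used below) is provided by the extension; its import path is
# not shown in this README, so it is left out here.


class CustomIndexerPlugin(plugins.SingletonPlugin):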
    plugins.implements(IResourceIndexer)

    def get_resource_indexer_weight(self, resource: dict[str, Any]) -> int:
        """Define priority of the indexer

        Args:
            resource: resource's details

        Returns:
            the weight of the indexer
            Expected values:
               0: skip handler
               10: use handler if no other handlers found
               20: use handler as a default one for the resource
               30: use handler as an optimal one for the resource
               40: use handler as a special-case handler for the resource
               50: ignore all the other handlers and use this one instead
        """
        return Weight.fallback

    def extract_indexable_chunks(self, path: str) -> Any:
        """Extract indexable data from the resource

        The result can have any form as long as it can be merged into the
        package dictionary by the implementation of `merge_chunks_into_index`.

        Args:
            path: path to the resource file

        Returns:
            all meaningful pieces of data, with no assumption about their type

        """
        return []

    def merge_chunks_into_index(self, pkg_dict: dict[str, Any], chunks: Any):
        """Merge data into the package dictionary.

        Args:
            pkg_dict: package that is going to be indexed
            chunks: collection of data fragments extracted from the resource
        """
        pass
```
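
As a concrete illustration of the three methods, here is a hedged sketch of an indexer for Markdown files. Everything in it is hypothetical: the class name, the `md` format check, and the `extras_md_headings` key are assumptions made for the example, and the CKAN plugin wiring (`plugins.SingletonPlugin`, `plugins.implements`) is left out so the snippet runs on its own.

```python
from typing import Any


class HeadingIndexer:
    """Hypothetical indexer that collects Markdown headings."""

    def get_resource_indexer_weight(self, resource: dict[str, Any]) -> int:
        # 20 = "default handler for the resource", 0 = "skip handler".
        fmt = (resource.get("format") or "").lower()
        return 20 if fmt == "md" else 0

    def extract_indexable_chunks(self, path: str) -> list[str]:
        # Keep only heading lines and strip the leading `#` markers.
        with open(path, encoding="utf-8") as src:
            return [line.lstrip("#").strip() for line in src if line.startswith("#")]

    def merge_chunks_into_index(self, pkg_dict: dict[str, Any], chunks: list[str]) -> None:
        # Append the headings to whatever is already stored under the key.
        existing = pkg_dict.get("extras_md_headings", "")
        pkg_dict["extras_md_headings"] = " ".join(filter(None, [existing, *chunks]))
```

The extraction result is a plain list of strings only because `merge_chunks_into_index` of the same class knows how to consume it; any other shape would work equally well if both methods agree on it.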

### Built-in indexers

#### Plain indexer
Indexes resources whose format is listed in both `ckanext.resource_indexer.indexable_formats` and `ckanext.resource_indexer.plain.indexable_formats`, unless another handler with a non-fallback weight (>10) is found.

Resources are indexed as-is: the file is read and sent to the index without any additional changes.

Enable it by adding `plain_resource_indexer` to the list of enabled plugins.


#### PDF indexer

Extract and index text from the PDF file.

To enable it:
* install the extension with the `pdf` extra:
  ```sh
  pip install 'ckanext-resource-indexer[pdf]'
  ```
  or, if you've already installed the extension itself, just install `pdftotext`:
  ```sh
  pip install pdftotext
  ```
* add `pdf_resource_indexer` to the list of enabled plugins
* install system packages for PDF processing. These differ from system to system. Examples:
  * CentOS
     ```sh
     yum install -y pulseaudio-libs-devel \
        gcc-c++ pkgconfig \
        python3-devel \
        libxml2-devel libxslt-devel \
        poppler poppler-utils poppler-cpp-devel
     ```

  * Debian
    ```sh
    apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
    ```

  * macOS
    ```sh
    brew install pkg-config poppler python
    ```

If PDF content requires preprocessing, specify a function that converts the text of
each separate page via the `ckanext.resoruce_indexer.pdf.page_processor` option. It uses
the standard import-string format: `module.import.path:function`
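
A page processor could look like the sketch below. It assumes the processor is called with the text of one page and returns a string, which is consistent with the default of `builtins:str`. The function name and the module path mentioned afterwards are hypothetical.

```python
import re


def normalize_page(text: str) -> str:
    """Re-join words hyphenated across line breaks and collapse whitespace."""
    text = re.sub(r"-\n(?=\w)", "", text)  # "Data-\nset" -> "Dataset"
    return re.sub(r"\s+", " ", text).strip()
```

If the function lived in a module importable as `my_site.pdf_utils`, the option would be set to `my_site.pdf_utils:normalize_page`.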

#### JSON indexer

Read a dictionary from the JSON file, convert all non-string values into
strings (i.e., no nested values are allowed), and apply it as a patch to the indexed
dataset.
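
The conversion described above can be sketched in a few lines. This is an illustration under assumptions, not the indexer's source: `json_to_patch` is a hypothetical name, and using `str()` for non-string values mirrors the default processors rather than a confirmed implementation detail.

```python
import json


def json_to_patch(raw: str) -> dict[str, str]:
    """Turn a JSON object into a flat patch of string keys and string values."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    # Strings are kept as-is; everything else (including nested values) is stringified.
    return {str(key): value if isinstance(value, str) else str(value)
            for key, value in data.items()}
```

The resulting mapping can then be applied with `pkg_dict.update(patch)`. Note that a nested list or object ends up as its Python string representation, not as a structured value.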

Optionally, if the `ckanext.resoruce_indexer.json.add_as_plain` flag is enabled, index
the content of the file as plain text (similar to the [plain
indexer](#plain-indexer)).

If a key or value requires preprocessing, specify a function that converts the data via
the `ckanext.resoruce_indexer.json.key_processor` or
`ckanext.resoruce_indexer.json.value_processor` option. Both use the standard import-string
format: `module.import.path:function`
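
For readers unfamiliar with the `module.import.path:function` notation, the sketch below shows how such a string can be resolved with the standard library. `resolve_import_string` is a hypothetical helper written for this example; the extension may resolve the option differently.

```python
from importlib import import_module
from typing import Any, Callable


def resolve_import_string(spec: str) -> Callable[..., Any]:
    """Resolve `package.module:attribute` into the object it names."""
    module_path, _, attr = spec.partition(":")
    if not module_path or not attr:
        raise ValueError(f"expected 'module.path:function', got {spec!r}")
    return getattr(import_module(module_path), attr)
```

With this helper, the default value `builtins:str` resolves to the built-in `str`, which explains why values are stringified when no processor is configured.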


Enable it by adding `json_resource_indexer` to the list of enabled plugins.

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "ckanext-resource-indexer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": "DataShades <datashades@linkdigital.com.au>",
    "keywords": "CKAN",
    "author": null,
    "author_email": "DataShades <datashades@linkdigital.com.au>, Sergey Motornyuk <sergey.motornyuk@linkdigital.com.au>",
    "download_url": "https://files.pythonhosted.org/packages/25/52/36719580135ff384e52e8d1f18c0882cc47a747d1218e9b3bbb05d57af7e/ckanext_resource_indexer-0.4.2.post1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "AGPL",
    "summary": null,
    "version": "0.4.2.post1",
    "project_urls": {
        "Homepage": "https://github.com/DataShades/ckanext-resource_indexer"
    },
    "split_keywords": [
        "ckan"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "72bf672bb8a077dabd7098392a46172f70a6eb0d43f034b04be5215f10649e4b",
                "md5": "9b5f7d608201e0f6acfef4e0bb08ca01",
                "sha256": "0396fc52ecc1ab01e9c8f379b9aa088895532c22b60f30d173d857f635c7a74c"
            },
            "downloads": -1,
            "filename": "ckanext_resource_indexer-0.4.2.post1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "9b5f7d608201e0f6acfef4e0bb08ca01",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 28351,
            "upload_time": "2024-10-17T14:21:43",
            "upload_time_iso_8601": "2024-10-17T14:21:43.285967Z",
            "url": "https://files.pythonhosted.org/packages/72/bf/672bb8a077dabd7098392a46172f70a6eb0d43f034b04be5215f10649e4b/ckanext_resource_indexer-0.4.2.post1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "255236719580135ff384e52e8d1f18c0882cc47a747d1218e9b3bbb05d57af7e",
                "md5": "5916a2424f2b9a8f89f5e384ba24a211",
                "sha256": "5cf95c6a68c7d50a3620c61d83e8fdc13a37cf5256c6c8f31e5df11fda154c6a"
            },
            "downloads": -1,
            "filename": "ckanext_resource_indexer-0.4.2.post1.tar.gz",
            "has_sig": false,
            "md5_digest": "5916a2424f2b9a8f89f5e384ba24a211",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 27949,
            "upload_time": "2024-10-17T14:21:45",
            "upload_time_iso_8601": "2024-10-17T14:21:45.130003Z",
            "url": "https://files.pythonhosted.org/packages/25/52/36719580135ff384e52e8d1f18c0882cc47a747d1218e9b3bbb05d57af7e/ckanext_resource_indexer-0.4.2.post1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-17 14:21:45",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "DataShades",
    "github_project": "ckanext-resource_indexer",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "requirements": [],
    "lcname": "ckanext-resource-indexer"
}
