arekit-ss

Name: arekit-ss
Version: 0.24.0
Home page: https://github.com/nicolay-r/arekit-ss
Summary: Low Resource Context Relation Sampler for contexts with relations for fact-checking and fine-tuning your LLM models, powered by AREkit
Upload time: 2023-11-07 12:26:24
Author: Nicolay Rusnachenko
License: MIT License
Keywords: relation extraction, data processing
Requirements: no requirements were recorded.
## arekit-ss 0.24.0

![](https://img.shields.io/badge/Python-3.9-brightgreen.svg)
![](https://img.shields.io/badge/AREkit-0.24.0-orange.svg)
[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nicolay-r/arekit-ss/blob/master/arekit_ss.ipynb)

<p align="center">
    <img src="logo.png"/>
</p>

`arekit-ss` [AREkit double "s"] is an object-pair context sampler
for [datasources](https://github.com/nicolay-r/AREkit/wiki/Binded-Sources),
powered by [AREkit](https://github.com/nicolay-r/AREkit).

> **NOTE:** For custom text sampling, please refer to the [ARElight](https://github.com/nicolay-r/ARElight) project.

## Installation

Install dependencies:
```bash
pip install git+https://github.com/nicolay-r/arekit-ss.git@0.24.0
```

Download the AREkit-related data that provides the required `sources`:
```bash
python -m arekit.download_data
```

## Usage
[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nicolay-r/arekit-ss/blob/master/arekit_ss.ipynb)

Example of composing prompts:
```bash
python -m arekit_ss.sample --writer csv --source rusentrel --sampler prompt \
  --prompt "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'" \
  --dest_lang en --docs_limit 1
```
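
The same data can also be exported in other formats. Below is a sketch that reuses only flags from the example above, together with writer and sampler values from the parameter list, to write single-input BERT-style samples as JSON lines for OpenNRE:

```bash
# Sketch: single-input BERT-style samples written as JSON lines (OpenNRE-compatible)
python -m arekit_ss.sample --writer jsonl --source rusentrel --sampler bert \
  --dest_lang en --docs_limit 1
```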

> **Mind the case (issue [#18](https://github.com/nicolay-r/arekit-ss/issues/18)):**
> switching to another language may affect the amount of extracted data because of the `terms_per_context`
> parameter, which crops the context to a fixed, predefined number of words.
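
If translation moves object pairs outside the default window, one possible workaround is to widen the window. The sketch below assumes `terms_per_context` is exposed as a CLI flag under that exact name, which may differ in practice:

```bash
# Sketch (assumed flag name --terms_per_context): widen the context window
python -m arekit_ss.sample --writer csv --source rusentrel --sampler prompt \
  --prompt "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'" \
  --dest_lang en --docs_limit 1 --terms_per_context 100
```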

<details>
<summary>

## Parameters
</summary>

* `source` -- source name from the list of the [supported sources](https://github.com/nicolay-r/arekit-ss/blob/master/arekit_ss/sources/src_list.py).
    * `terms_per_context` -- number of words (terms) between the SOURCE and TARGET objects.
    * `object-source-types` -- filter by specific source object types.
    * `object-target-types` -- filter by specific target object types.
    * `relation_types` -- list of relation types, with items separated by the `|` character; all types by default.
    * `splits` -- manual selection of the data splits to be used for sampling;
      splits are separated by the `:` sign, for example `train:test` (see the combined example after this list).
* `sampler` -- the sampler to use; supported samplers:
    * `nn` -- for CNN/LSTM architectures, including frames annotation from [RuSentiFrames](https://github.com/nicolay-r/RuSentiFrames).
        * `no-vectorize` -- flag applicable only to `nn`; skips generating embeddings for features.
    * `bert` -- BERT-based, single-input sequence.
    * `prompt` -- prompt-based sampler for LLM systems [[prompt engineering guide]](https://github.com/dair-ai/Prompt-Engineering-Guide).
        * `prompt` -- text of the prompt, which may include the following placeholders:
          * `{text}` -- the original text of the sample;
          * `{s_val}` and `{t_val}` -- values of the source and target of the pair, respectively;
          * `{label_val}` -- value of the label.
* `writer` -- the output format of samples:
    * `csv` -- for the [AREnets](https://github.com/nicolay-r/AREnets) framework;
    * `jsonl` -- for the [OpenNRE](https://github.com/thunlp/OpenNRE) framework;
    * `sqlite` -- SQLite-3.0 database.
* `mask_entities` -- entity masking mode.
* Text translation parameters:
    * `src_lang` -- original language of the text.
    * `dest_lang` -- target language of the text.
* `output_dir` -- target directory for storing samples.
* Limiting the number of documents taken from the source:
    * `docs_limit` -- number of documents to be considered for sampling from the whole source.
    * `doc_ids` -- list of specific document IDs to be considered.
</details>
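
For reference, here is a combined invocation sketch. The `--relation_types` and `--splits` flag spellings, as well as the relation-type names, are assumptions based on the parameter list above rather than a documented command:

```bash
# Sketch: NN-oriented sampling restricted to selected relation types and splits.
# Replace "pos|neg" with relation types valid for the chosen source.
python -m arekit_ss.sample --writer csv --source rusentrel --sampler nn \
  --relation_types "pos|neg" --splits "train:test" --docs_limit 5
```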

![output_prompts](https://github.com/nicolay-r/arekit-ss/assets/14871187/d1499f24-b2df-410b-98cc-8d4018de8c65)

## Powered by

* [AREkit framework](https://github.com/nicolay-r/AREkit)



            
