# ij
Text feature extraction
To install: ```pip install ij```
# Examples
## TreeTokenizer
A tokenizer composed of multiple tokenizers that are either all applied to the
same text, or applied recursively to break the text into finer pieces.
In a deep tokenization, `tokenizers[0]` tokenizes the text, then the next tokenizer, `tokenizers[1]`, is applied to these
tokens, and so forth. By default, the union of the tokens from all levels is returned. If `token_prefixes` is specified (usually
a different one for each tokenizer), they are prepended to the tokens to indicate which level of tokenization
they come from.
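
To make the recursion concrete, here is a minimal, self-contained sketch of the deep-tokenization logic (`deep_tokenize` is a hypothetical helper written for this illustration, not part of `ij`'s API):

```python
import re

def deep_tokenize(text, tokenizers, token_prefixes):
    """Apply each tokenizer to the previous level's pieces; return all prefixed tokens."""
    all_tokens = []
    pieces = [text]
    for tokenizer, prefix in zip(tokenizers, token_prefixes):
        # Each level re-tokenizes the pieces produced by the previous level.
        pieces = [tok for piece in pieces for tok in tokenizer(piece)]
        all_tokens.extend(prefix + tok for tok in pieces)
    return all_tokens

tokenizers = [re.compile(r'[\w-]+').findall, re.compile(r'\w+').findall]
prefixes = ['level_1=', 'level_2=']
print(deep_tokenize('A-B C B', tokenizers, prefixes))
# ['level_1=A-B', 'level_1=C', 'level_1=B', 'level_2=A', 'level_2=B', 'level_2=C', 'level_2=B']
```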
If `text_collection` is specified, along with `max_df` and/or `min_df`, the `text_collection` serves to learn a
vocabulary for each tokenizer, keeping only those tokens whose frequency is at least `min_df` and at most
`max_df`.
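
The frequency filtering can be sketched as follows; `learn_vocab` is a hypothetical helper, and here frequency is read as the total token count over the collection, which may differ in detail from the library's exact counting rules:

```python
import re
from collections import Counter

def learn_vocab(text_collection, tokenizer, min_df=1, max_df=float('inf')):
    # Count token occurrences over the whole collection.
    counts = Counter(tok for text in text_collection for tok in tokenizer(text))
    total = sum(counts.values())
    # A value below 1 is read as a ratio of the total count, otherwise as a raw count.
    lo = min_df * total if 0 < min_df < 1 else min_df
    hi = max_df * total if 0 < max_df < 1 else max_df
    return {tok for tok, c in counts.items() if lo <= c <= hi}

s = ['A-B-C A-B A B', 'A-B C B']
print(sorted(learn_vocab(s, re.compile(r'[\w-]+').findall, min_df=2)))
# ['A-B', 'B'] -- only tokens occurring at least twice survive
```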
Params:
* `tokenizers`: A list of tokenizers (functions taking text and outputting a list of strings).
* `token_prefixes`: A list of prefixes to add in front of the tokens matched by each tokenizer
  (or a single string that will be used for all tokenizers).
* `max_df` and `min_df`: Only relevant when a `text_collection` to learn from is specified.
  These are respectively the maximum and minimum frequency that tokens must have to be included.
  The frequency can be expressed as a count, or as a ratio of the total count.
  Note that in the case of `max_df`, it is always relative to the total count of tokens at the current level.
* `return_tokenizer_info`: Boolean (default False) indicating whether to also return the `tokenizer_info_list`.
Fit input `X` is a collection of texts to learn the vocabulary from.
```python
>>> from ij import TreeTokenizer
>>> import re
>>>
>>> t = [re.compile(r'[\w-]+').findall, re.compile(r'\w+').findall]
>>> p = ['level_1=', 'level_2=']
>>> ttok = TreeTokenizer(tokenizers=t, token_prefixes=p)
>>> ttok.tokenize('A-B C B')
['level_1=A-B', 'level_1=C', 'level_1=B', 'level_2=A', 'level_2=B', 'level_2=C', 'level_2=B']
>>> s = ['A-B-C A-B A B', 'A-B C B']
>>> ttok = TreeTokenizer(tokenizers=t, token_prefixes=p, min_df=2)
>>> _ = ttok.fit(text_collection=s)
>>> ttok.transform(['A-B C B']).tolist()
[['level_1=A-B', 'level_1=B', 'level_2=C']]
```