# Code for Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
This paper was accepted to Findings of [ACL 2023](https://aclanthology.org/2023.findings-acl.426/).
## Getting Started
This codebase is available on [pypi.org](https://pypi.org/project/npc-gzip) and can be installed via:
```sh
pip install npc-gzip
```
## Usage
See the [examples](./examples/) directory for example usage.
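The core idea behind the package, as described in the paper, is a parameter-free classifier built from a compressor-based distance (normalized compression distance, NCD) plus a k-nearest-neighbor vote. The following is an illustrative sketch of that idea, not the package's actual API:

```python
import gzip
from collections import Counter

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings, using gzip."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_classify(test_text: str, train_pairs: list[tuple[str, str]], k: int = 2) -> str:
    """Predict a label for test_text by majority vote over the k nearest
    training texts under NCD. train_pairs holds (label, text) tuples."""
    nearest = sorted(train_pairs, key=lambda pair: ncd(test_text, pair[1]))[:k]
    return Counter(label for label, _ in nearest).most_common(1)[0][0]
```

Swapping `gzip` for `lzma` or `bz2` (mirroring the `--compressor` flag below) leaves the rest of the sketch unchanged.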
## Testing
This package uses `poetry` to manage its dependencies and `pytest` to run its tests. To run the tests:
```sh
poetry shell
poetry install
pytest
```
-------------------------
### Original Codebase
#### Require
See `requirements.txt`.
Install requirements in a clean environment:
```sh
conda create -n npc python=3.7
conda activate npc
pip install -r requirements.txt
```
#### Run
```sh
python main_text.py
```
By default, this uses only 100 test and 100 training samples per class as a quick demo. These values can be changed with `--num_test` and `--num_train`.
```text
--compressor <gzip, lzma, bz2>
--dataset <AG_NEWS, SogouNews, DBpedia, YahooAnswers, 20News, Ohsumed_single, R8, R52, kinnews, kirnews, swahili, filipino> [For small datasets like kinnews, the default 100-shot setting is too large; set --num_test and --num_train accordingly.]
--num_train <INT>
--num_test <INT>
--data_dir <DIR> [This needs to be specified for R8, R52 and Ohsumed.]
--all_test [This will use the whole test dataset.]
--all_train
--record [This records the distance matrix and saves it for future use. It's helpful when you want to run on the whole dataset.]
--test_idx_start <INT>
--test_idx_end <INT> [These two arguments restrict the run to a specific range of the test set. Also helpful for computing the distance matrix on the whole dataset.]
--para [This uses multiprocessing to accelerate computation.]
--output_dir <DIR> [The output directory for saving tested indices or the distance matrix.]
```
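Conceptually, the recorded distance matrix is an n_test × n_train table of pairwise compressor distances. A minimal sketch of building such a matrix, for illustration only (the repository's actual storage format may differ):

```python
import numpy as np

def distance_matrix(test_texts, train_texts, dist):
    """Build an (n_test, n_train) matrix of pairwise distances."""
    matrix = np.empty((len(test_texts), len(train_texts)))
    for i, t in enumerate(test_texts):
        for j, s in enumerate(train_texts):
            matrix[i, j] = dist(t, s)
    return matrix

# Once saved (e.g. with np.save), the matrix can be re-scored later without
# recomputing distances, which is what --record and --score enable.
```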
#### Calculate Accuracy (Optional)
If you want to calculate accuracy from a recorded distance file `<DISTANCE DIR>`, use
```sh
python main_text.py --record --score --distance_fn <DISTANCE DIR>
```
to calculate it. Otherwise, accuracy is computed automatically by the command in the previous section.
#### Use Custom Dataset
You can use your own dataset by passing `custom` to `--dataset`, the data directory containing `train.txt` and `test.txt` to `--data_dir`, and the number of classes to `--class_num`.
Both `train.txt` and `test.txt` are expected to have the format `{label}\t{text}` per line.
You can change the delimiter to match your dataset by editing `delimiter` in `load_custom_dataset()` in `data.py`.
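A minimal loader for this format might look like the following. This is a hypothetical helper for illustration only; the repository's own loading logic lives in `load_custom_dataset()`:

```python
def load_pairs(path: str, delimiter: str = "\t") -> list[tuple[str, str]]:
    """Read (label, text) pairs from a file with one {label}{delimiter}{text}
    entry per line. Hypothetical helper, not part of the codebase."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split only on the first delimiter so the text may contain tabs.
            label, text = line.split(delimiter, 1)
            pairs.append((label, text))
    return pairs
```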