# Code for Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
This paper was accepted to Findings of [ACL 2023](https://aclanthology.org/2023.findings-acl.426/).
## Getting Started
This codebase is available on [pypi.org](https://pypi.org/project/npc-gzip) and can be installed via:
```sh
pip install npc-gzip
```
## Usage
See the [examples](./examples/) directory for example usage.
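The core idea behind the package, as described in the paper, is a parameter-free classifier built from a compressor-based distance (normalized compression distance, NCD) plus a k-nearest-neighbor vote. The following is an illustrative sketch of that idea, not the package's actual API:

```python
import gzip
from collections import Counter

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings, using gzip."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_classify(test_text: str, train_pairs: list[tuple[str, str]], k: int = 2) -> str:
    """Predict a label for test_text by majority vote over the k nearest
    training texts under NCD. train_pairs holds (label, text) tuples."""
    nearest = sorted(train_pairs, key=lambda pair: ncd(test_text, pair[1]))[:k]
    return Counter(label for label, _ in nearest).most_common(1)[0][0]
```

Swapping `gzip` for `lzma` or `bz2` (mirroring the `--compressor` flag below) leaves the rest of the sketch unchanged.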
## Testing
This package uses `poetry` to manage its dependencies and `pytest` to run its tests. To run the tests:
```sh
poetry shell
poetry install
pytest
```
-------------------------
### Original Codebase
#### Require
See `requirements.txt`.
Install requirements in a clean environment:
```sh
conda create -n npc python=3.7
conda activate npc
pip install -r requirements.txt
```
#### Run
```sh
python main_text.py
```
By default, this uses only 100 test and 100 training samples per class as a quick demo. These values can be changed with `--num_test` and `--num_train`.
```text
--compressor <gzip, lzma, bz2>
--dataset <AG_NEWS, SogouNews, DBpedia, YahooAnswers, 20News, Ohsumed_single, R8, R52, kinnews, kirnews, swahili, filipino> [For small datasets like kinnews, the default 100-shot setting is too large; set --num_test and --num_train accordingly.]
--num_train <INT>
--num_test <INT>
--data_dir <DIR> [This needs to be specified for R8, R52 and Ohsumed.]
--all_test [This will use the whole test dataset.]
--all_train
--record [This records the distance matrix and saves it for future use. It's helpful when you want to run on the whole dataset.]
--test_idx_start <INT>
--test_idx_end <INT> [These two arguments restrict the run to a specific range of the test set. Also helpful for computing the distance matrix on the whole dataset.]
--para [This uses multiprocessing to accelerate computation.]
--output_dir <DIR> [The output directory for saving tested indices or the distance matrix.]
```
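Conceptually, the recorded distance matrix is an n_test × n_train table of pairwise compressor distances. A minimal sketch of building such a matrix, for illustration only (the repository's actual storage format may differ):

```python
import numpy as np

def distance_matrix(test_texts, train_texts, dist):
    """Build an (n_test, n_train) matrix of pairwise distances."""
    matrix = np.empty((len(test_texts), len(train_texts)))
    for i, t in enumerate(test_texts):
        for j, s in enumerate(train_texts):
            matrix[i, j] = dist(t, s)
    return matrix

# Once saved (e.g. with np.save), the matrix can be re-scored later without
# recomputing distances, which is what --record and --score enable.
```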
#### Calculate Accuracy (Optional)
If you want to calculate accuracy from a recorded distance file `<DISTANCE DIR>`, use
```sh
python main_text.py --record --score --distance_fn <DISTANCE DIR>
```
to calculate it. Otherwise, accuracy is computed automatically by the command in the previous section.
#### Use Custom Dataset
You can use your own dataset by passing `custom` to `--dataset`, the data directory containing `train.txt` and `test.txt` to `--data_dir`, and the number of classes to `--class_num`.
Both `train.txt` and `test.txt` are expected to have the format `{label}\t{text}` per line.
You can change the delimiter to match your dataset by editing `delimiter` in `load_custom_dataset()` in `data.py`.
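A minimal loader for this format might look like the following. This is a hypothetical helper for illustration only; the repository's own loading logic lives in `load_custom_dataset()`:

```python
def load_pairs(path: str, delimiter: str = "\t") -> list[tuple[str, str]]:
    """Read (label, text) pairs from a file with one {label}{delimiter}{text}
    entry per line. Hypothetical helper, not part of the codebase."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split only on the first delimiter so the text may contain tabs.
            label, text = line.split(delimiter, 1)
            pairs.append((label, text))
    return pairs
```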