- Name: text-selection
- Version: 0.0.3
- Summary: Command-line interface (CLI) to select lines of a text file.
- Upload time: 2023-05-30 08:33:55
- Requires Python: <4,>=3.8
- License: MIT
- Keywords: text-to-speech, speech synthesis, corpus, utils, language, linguistics

# text-selection

[![PyPI](https://img.shields.io/pypi/v/text-selection.svg)](https://pypi.python.org/pypi/text-selection)
[![PyPI](https://img.shields.io/pypi/pyversions/text-selection.svg)](https://pypi.python.org/pypi/text-selection)
[![MIT](https://img.shields.io/github/license/stefantaubert/text-selection.svg)](https://github.com/stefantaubert/text-selection/blob/master/LICENSE)
[![PyPI](https://img.shields.io/pypi/wheel/text-selection.svg)](https://pypi.python.org/pypi/text-selection/#files)
![PyPI](https://img.shields.io/pypi/implementation/text-selection.svg)
[![PyPI](https://img.shields.io/github/commits-since/stefantaubert/text-selection/latest/master.svg)](https://github.com/stefantaubert/text-selection/compare/v0.0.3...master)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7984739.svg)](https://doi.org/10.5281/zenodo.7984739)

Command-line interface (CLI) to select lines of a text file.

## Features

- dataset
  - `create`: create a dataset based on a text file
  - `export-statistics`: export statistics to a CSV file
- subsets
  - `add`: add subsets
  - `remove`: remove subsets
  - `rename`: rename subset
  - `select-all`: select all lines
  - `select-fifo`: select lines FIFO-style
  - `select-greedily`: select lines greedily regarding units
  - `select-greedily-ep`: select lines greedily regarding units (epoch-based)
  - `select-uniformly`: select lines with units uniformly distributed
  - `select-randomly`: select lines randomly
  - `filter-duplicates`: filter duplicate lines
  - `filter-by-regex`: filter lines by regex
  - `filter-by-text`: filter lines by text
  - `filter-by-weight`: filter lines by weight
  - `filter-by-vocabulary`: filter lines by unit vocabulary
  - `filter-by-count`: filter lines by global unit frequencies
  - `filter-by-unit-freq`: filter lines by unit frequencies per line
  - `filter-by-line-nr`: filter lines by line number
  - `sort-by-line-nr`: sort lines by line number
  - `sort-by-text`: sort lines by text
  - `sort-by-weight`: sort lines by weights
  - `sort-by-shuffle`: shuffle lines
  - `reverse`: reverse lines
  - `export`: export lines
- weights
  - `create-from-file`: create weights from file
  - `create-uniform`: create uniform weights
  - `create-from-count`: create weights from unit count
  - `divide`: divide weights
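
To make the idea behind greedy selection "regarding units" more concrete, the following is a minimal sketch and not the package's implementation. It assumes that units are whitespace-separated words and that "greedy" means repeatedly picking the line that adds the most units not yet covered; the function name `select_greedily` and its `limit` parameter are hypothetical and only serve this illustration.

```python
from typing import Dict, List, Set


def select_greedily(lines: List[str], limit: int) -> List[int]:
    """Return indices of up to `limit` lines, each adding the most unseen units."""
    # Assumption: a "unit" is a whitespace-separated word.
    units_per_line: Dict[int, Set[str]] = {
        i: set(line.split()) for i, line in enumerate(lines)
    }
    covered: Set[str] = set()
    selected: List[int] = []
    while units_per_line and len(selected) < limit:
        # Pick the line with the most uncovered units; ties go to the lowest index.
        best = max(
            units_per_line,
            key=lambda i: (len(units_per_line[i] - covered), -i),
        )
        if not units_per_line[best] - covered:
            break  # no remaining line contributes a new unit
        selected.append(best)
        covered |= units_per_line.pop(best)
    return selected


lines = ["a b c", "a b", "c d e f", "f"]
print(select_greedily(lines, limit=2))  # prints: [2, 0]
```

Line 2 is chosen first because it covers four units, then line 0 because it adds two further units; the loop stops early once no line contributes anything new.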

## Roadmap

- add tests
- refactor the code
- outsource the greedy and KLD iterators
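
For readers unfamiliar with the term, a KLD iterator presumably selects lines so that the unit distribution of the selection stays close to a target distribution, measured by the Kullback-Leibler divergence. The sketch below illustrates that idea under several assumptions: units are whitespace-separated words, the target is the uniform distribution, and each step adds the line that minimizes the divergence. The names `kld_to_uniform` and `select_uniformly` are hypothetical and do not come from the package.

```python
import math
from collections import Counter
from typing import List


def kld_to_uniform(counts: Counter, vocabulary: List[str]) -> float:
    """KL divergence D(P || U) of the unit distribution P from the uniform U."""
    total = sum(counts[u] for u in vocabulary)
    if total == 0:
        return math.inf  # an empty selection has no distribution
    uniform = 1.0 / len(vocabulary)
    # Units with zero probability contribute 0 to the sum by convention.
    return sum(
        (counts[u] / total) * math.log((counts[u] / total) / uniform)
        for u in vocabulary
        if counts[u] > 0
    )


def select_uniformly(lines: List[str], limit: int) -> List[int]:
    """Greedily pick up to `limit` lines that keep unit counts close to uniform."""
    vocabulary = sorted({u for line in lines for u in line.split()})
    remaining = {i: Counter(line.split()) for i, line in enumerate(lines)}
    current: Counter = Counter()
    selected: List[int] = []
    while remaining and len(selected) < limit:
        # Choose the line whose addition yields the smallest divergence.
        best = min(
            remaining,
            key=lambda i: (kld_to_uniform(current + remaining[i], vocabulary), i),
        )
        current += remaining.pop(best)
        selected.append(best)
    return selected


print(select_uniformly(["a a a", "a b", "b c c", "c"], limit=2))  # prints: [1, 3]
```

In the example, `"a b"` is picked first because it is closest to uniform on its own, and `"c"` second because together they cover every unit exactly once.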

## Installation

```sh
pip install text-selection --user
```

## Usage

```txt
usage: text-selection-cli [-h] [-v] {dataset,subsets,weights} ...

CLI to select lines of a text file.

positional arguments:
  {dataset,subsets,weights}  description
    dataset                  dataset commands
    subsets                  subsets commands
    weights                  weights commands

optional arguments:
  -h, --help                 show this help message and exit
  -v, --version              show program's version number and exit
```
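
The help text above has the shape produced by Python's `argparse` with nested subcommands. As an illustration of how such a two-level layout (`dataset`, `subsets`, `weights`, each with its own commands) can be built, here is a minimal sketch. It is not the package's source code: the function `build_parser`, the `dest` names, the version string format and the selection of registered commands are assumptions made for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a two-level parser: <group> <command>."""
    parser = argparse.ArgumentParser(
        prog="text-selection-cli",
        description="CLI to select lines of a text file.",
    )
    # Assumption: the real tool may format its version output differently.
    parser.add_argument("-v", "--version", action="version", version="0.0.3")
    groups = parser.add_subparsers(dest="group", required=True)
    # Only a few commands from the feature list are registered here.
    commands = {
        "dataset": ["create", "export-statistics"],
        "subsets": ["add", "remove", "select-all"],
        "weights": ["create-uniform", "divide"],
    }
    for group_name, command_names in commands.items():
        group = groups.add_parser(group_name, help=f"{group_name} commands")
        sub = group.add_subparsers(dest="command", required=True)
        for name in command_names:
            sub.add_parser(name)
    return parser


args = build_parser().parse_args(["subsets", "select-all"])
print(args.group, args.command)  # prints: subsets select-all
```

With this layout, each group and each command gets its own `-h`/`--help` output, which is the usual way to explore the options of an `argparse`-based tool.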

## Dependencies

- `tqdm`
- `numpy`
- `scipy`
- `pandas`
- `ordered_set>=4.1.0`

## Contributing

If you notice an error, please don't hesitate to open an issue.

### Development setup

```sh
# update
sudo apt update
# install Python 3.8, 3.9, 3.10 & 3.11 so that the tests can be run against each version
sudo apt install python3-pip \
  python3.8 python3.8-dev python3.8-distutils python3.8-venv \
  python3.9 python3.9-dev python3.9-distutils python3.9-venv \
  python3.10 python3.10-dev python3.10-distutils python3.10-venv \
  python3.11 python3.11-dev python3.11-distutils python3.11-venv
# install pipenv for creation of virtual environments
python3.8 -m pip install pipenv --user

# check out repo
git clone https://github.com/stefantaubert/text-selection.git
cd text-selection
# create virtual environment
python3.8 -m pipenv install --dev
```

## Running the tests

```sh
# first install the tool as described in "Development setup"
# then, navigate into the directory of the repo (if not already done)
cd text-selection
# activate environment
python3.8 -m pipenv shell
# run tests
tox
```

Final lines of test result output:

```log
  py38: commands succeeded
  py39: commands succeeded
  py310: commands succeeded
  py311: commands succeeded
  congratulations :)
```

## License

MIT License

## Acknowledgments

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410

## Citation

If you want to cite this repo, you can use the BibTeX entry generated by GitHub (see *About => Cite this repository*).

## Changelog

- v0.0.3 (2023-05-30)
  - Changed
    - Improved the speed of filtering out-of-vocabulary (OOV) and in-vocabulary (IV) words by up to ~20k words/s
  - Added
    - Added `subsets select-randomly`
    - Added `subsets sort-by-shuffle`
    - Added `subsets add` option `--skip-existing`
  - Bugfixes
    - Fixed evaluation of "from subsets" to ensure that the subsets exist
    - Fixed `subsets remove`, which did not work
- v0.0.2 (2023-01-13)
  - Added
    - Added creation of weights from lines
    - Added `--limit` to select duplicates
    - Added exit code
  - Changed
    - Made `--limit` a positional argument where applicable
    - Suppressed the expected `numpy` warning during KLD selection
  - Bugfixes
- v0.0.1 (2022-05-25)
  - Initial release

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "text-selection",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "<4,>=3.8",
    "maintainer_email": "Stefan Taubert <pypi@stefantaubert.com>",
    "keywords": "Text-to-speech,Speech synthesis,Corpus,Utils,Language,Linguistics",
    "author": "",
    "author_email": "Stefan Taubert <pypi@stefantaubert.com>",
    "download_url": "https://files.pythonhosted.org/packages/03/32/434bd8f13bfb547fe1b75b0047949ec3377715221290f395fd901815d370/text-selection-0.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Command-line interface (CLI) to select lines of a text file.",
    "version": "0.0.3",
    "project_urls": {
        "Homepage": "https://github.com/stefantaubert/text-selection",
        "Issues": "https://github.com/stefantaubert/text-selection/issues"
    },
    "split_keywords": [
        "text-to-speech",
        "speech synthesis",
        "corpus",
        "utils",
        "language",
        "linguistics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bc36fb7b12e17fe9ea7638697d06291abf2f2ada109f9b68d24e2469b0ba2c60",
                "md5": "c83d6beb1b98c747816a1086b3881b4f",
                "sha256": "ad75d5f83557f7e635dbcaae247ebf8f41067ea436cc5a2a04095f083adab3ff"
            },
            "downloads": -1,
            "filename": "text_selection-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c83d6beb1b98c747816a1086b3881b4f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4,>=3.8",
            "size": 152147,
            "upload_time": "2023-05-30T08:33:53",
            "upload_time_iso_8601": "2023-05-30T08:33:53.635506Z",
            "url": "https://files.pythonhosted.org/packages/bc/36/fb7b12e17fe9ea7638697d06291abf2f2ada109f9b68d24e2469b0ba2c60/text_selection-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0332434bd8f13bfb547fe1b75b0047949ec3377715221290f395fd901815d370",
                "md5": "eaeed18a58a69e84525438adea981c75",
                "sha256": "7bce559179c8a1254059e29c32526da4435685ac8ff6ded63434ddd63ac4d6c9"
            },
            "downloads": -1,
            "filename": "text-selection-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "eaeed18a58a69e84525438adea981c75",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4,>=3.8",
            "size": 112568,
            "upload_time": "2023-05-30T08:33:55",
            "upload_time_iso_8601": "2023-05-30T08:33:55.862607Z",
            "url": "https://files.pythonhosted.org/packages/03/32/434bd8f13bfb547fe1b75b0047949ec3377715221290f395fd901815d370/text-selection-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-05-30 08:33:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "stefantaubert",
    "github_project": "text-selection",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "text-selection"
}
        