hojichar

Name	hojichar
Version	0.9.0
home_page	https://github.com/HojiChar/HojiChar
Summary	Text preprocessing management system.
upload_time	2023-08-08 11:44:44
author	kenta.shinzato
requires_python	>=3.8,<4.0
license	Apache-2.0

# HojiChar

[![PyPI version](https://badge.fury.io/py/hojichar.svg)](https://badge.fury.io/py/hojichar)
[![Python Versions](https://img.shields.io/pypi/pyversions/hojichar.svg)](https://pypi.org/project/hojichar/)
[![CI workflow](https://github.com/HojiChar/HojiChar/actions/workflows/ci.yaml/badge.svg)](https://github.com/HojiChar/HojiChar/actions/workflows/ci.yaml)
[![codecov](https://codecov.io/gh/HojiChar/HojiChar/branch/main/graph/badge.svg?token=16928I9U9Y)](https://codecov.io/gh/HojiChar/HojiChar)
![PyPI - Downloads](https://img.shields.io/pypi/dm/hojichar)

Official docs: <https://hojichar.github.io/HojiChar/hojichar.html>

## Features

- HojiChar provides a way to combine multiple arbitrary text processing tasks into a streamlined pipeline.
- The sequence of operations can be described declaratively, ensuring portability.
- HojiChar allows users to gather detailed statistical information from large amounts of text during processing.
- It enables management of any Python text processing tasks, providing a Command Line Interface (CLI) capable of parallel processing.

## Background and what HojiChar is for

Text preprocessing is far from a one-size-fits-all process. Depending on the data source and the specific task at hand, various steps including normalization, noise removal, and filtering may be necessary. Not all texts require the same level of preprocessing. For instance, relatively clean texts may only need minimal filtering, while "dirtier" sources like Common Crawl data often require more thorough processing. As a result, the preprocessing profile has to be tailored to each specific domain.

Many preprocessing operations can be viewed as filters: each takes a string as input, applies a transformation, and outputs the processed string. Although these operations are straightforward individually, managing them efficiently across many layers can be challenging.

Inspired by [`torchvision.transforms`](https://pytorch.org/vision/stable/transforms.html) and [iver56/audiomentations](https://github.com/iver56/audiomentations), HojiChar addresses these challenges. It enables users to define each text processing step as a class inheriting from `hojichar.Filter` and use `hojichar.Compose` to chain them together into a single filter. By writing out the `Compose` recipe as a profile, the preprocessing process for a specific domain's text can be made portable. Moreover, `Compose` automatically logs various metrics for each filter, such as byte changes, processing time, and number of rejected texts. This allows users to assess the validity of each operation and consider trade-offs between computation time and performance.

While there are other text normalization tools available, most are designed to perform a specific set of operations. Text preprocessing, despite its importance, is often considered a mundane task compared to machine learning or artificial intelligence tasks. As a result, many existing solutions can be ad hoc, poorly maintained, or inadequately tested. Recognizing these issues, we developed HojiChar as a robust tool for configuring text preprocessing.

## Install

```
pip install hojichar
```

## Defining a Compose Object

The [`Compose`](https://hojichar.github.io/HojiChar/hojichar.html#Compose) class in HojiChar allows you to create a sequence of text processing filters.

```Python
from hojichar import Compose, document_filters

cleaner = Compose([
    document_filters.JSONLoader(key="text"),
    document_filters.AcceptJapanese(),
    document_filters.DocumentLengthFilter(min_doc_len=0, max_doc_len=1000),
    document_filters.ExampleHojiChar(),
    document_filters.JSONDumper()
])
```

When a [`Compose`](https://hojichar.github.io/HojiChar/hojichar.html#Compose) object is called, it accepts a string and returns the processed string.

```Python
>>> cleaner('{"text": "こんにちは、"}')
{"text": "こんにちは、<hojichar>"}
```

The filter pipeline above accomplishes the following steps:

1. Extracts the value from the `'text'` key in the JSON object.
2. Discards the string if it's not in Japanese.
3. Rejects any text shorter than 0 characters or longer than 1000 characters.
4. Appends `<hojichar>` to the string.
5. Outputs the processed string as JSON with the key "text".

The filters used in the pipeline are predefined filters found in [`hojichar.filters`](https://hojichar.github.io/HojiChar/hojichar/filters.html).

While HojiChar provides some fundamental text processing filters and plans to add more in the future, users can also define their own custom filters.

## User-defined Filters

Each filter in a [`Compose`](https://hojichar.github.io/HojiChar/hojichar.html#Compose) object is a class that inherits from the [`Filter`](https://hojichar.github.io/HojiChar/hojichar.html#Filter) class and implements its text processing in the `apply` method.

```Python
from hojichar.core.filter_interface import Filter

class YourFilter(Filter):
    def apply(self, document):
        text = document.text
        # Write your text transformation here.
        document.text = text
        return document
```

The `apply` method accepts a `hojichar.Document` type as an argument and returns it after the transformations. The [`Document`](https://hojichar.github.io/HojiChar/hojichar.html#Document) is a class that encapsulates a string.

**Reject documents**

- The `hojichar.Document` has an `is_rejected` attribute. If a filter sets this flag to `True`, `Compose` will discard the document during processing.
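
As a minimal sketch of the rejection flag, the filter below marks documents shorter than a threshold as rejected. `MinLengthFilter` and `min_len` are hypothetical names chosen for illustration. The `try`/`except` fallback defines bare stand-ins for `Document` and `Filter` only so that the snippet also runs without HojiChar installed; the stand-ins are not HojiChar's implementation.

```python
try:
    from hojichar import Document, Filter
except ImportError:
    # Minimal stand-ins so the sketch runs without HojiChar installed.
    # They are NOT HojiChar's implementation.
    class Document:
        def __init__(self, text):
            self.text = text
            self.is_rejected = False

    class Filter:
        def __init__(self, *args, **kwargs):
            pass


class MinLengthFilter(Filter):  # hypothetical example filter
    def __init__(self, min_len, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.min_len = min_len

    def apply(self, document):
        # Flag the document rather than deleting it; the text is left untouched.
        if len(document.text) < self.min_len:
            document.is_rejected = True
        return document


short = MinLengthFilter(min_len=5).apply(Document("abc"))
long_enough = MinLengthFilter(min_len=5).apply(Document("abcdefg"))
print(short.is_rejected, long_enough.is_rejected)
```

The filter only sets the flag; dropping the flagged document is left to `Compose`, as described above.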

**Definition of `__init__` for custom filter**

When creating a user-defined class and applying a custom constructor, make sure to initialize the parent class.

```python
class YourFilter(Filter):
    def __init__(self, your_param, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.your_param = your_param

    def apply(self, document):
        text = document.text
        text = process(text, self.your_param)
        document.text = text
        return document
```

This is necessary because the `Filter` class takes several arguments of its own, one of which is `p`.

```python
cleaner = Compose([
    document_filters.JSONLoader(key="text"),
    document_filters.AcceptJapanese(p=0.5),
    document_filters.JSONDumper()
])
```

The `p` argument passed to the `document_filters.AcceptJapanese` constructor determines the probability of applying the filter; with a probability of `1-p`, it acts as an identity function. This behavior is defined in the parent class `hojichar.Filter`.
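
The semantics of `p` can be illustrated without HojiChar at all. The sketch below uses a hypothetical helper, `apply_with_probability`, that applies a transformation with probability `p` and otherwise returns the input unchanged. It only mirrors the behavior described above and is not how `hojichar.Filter` is implemented.

```python
import random


def apply_with_probability(transform, text, p, rng):
    """Apply `transform` with probability p; otherwise return `text` unchanged."""
    if rng.random() < p:
        return transform(text)
    return text  # identity branch, taken with probability 1 - p


rng = random.Random(0)  # fixed seed so repeated runs behave the same
results = [apply_with_probability(str.upper, "abc", 0.5, rng) for _ in range(1000)]
applied = results.count("ABC")
print(f"transformed {applied} of {len(results)} inputs")
```

With `p=1` the transformation is always applied and with `p=0` it never is, because `rng.random()` returns a value in `[0, 1)`.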

## Additional Notes on Compose

- Although calling a `Compose` object is a text-in, text-out operation, `Compose` itself inherits from the `Filter` class. Calling its `apply` method therefore takes a `hojichar.Document` as input and returns one as output.
- Because `Compose` behaves like a `Filter`, a `Compose` object can be passed as one of the filters in the constructor of another `Compose`; the nested filters are unfolded recursively.
- You can access various statistics regarding the processing performed by `Compose` through `Compose.statistics` or `Compose.statistics_obj`.
  - `Compose.statistics` is a dictionary like the one below.

    ```json
    {
        "total_info": {
            "processed_num": 10928,
            "discard_num": 5513,
            "input_MB": 104.514584,
            "output_MB": 25.33024,
            "cumulative_time": 114.071047143,
            "total_token_num": 0
        },
        "layers_info": [
            {
                "name": "0-JSONLoader",
                "discard_num": 0,
                "diff_MB": -1.9647932052612305,
                "cumulative_time": 0.420034328,
                "params": {
                    "name": "JSONLoader",
                    "p": 1,
                    "skip_rejected": true,
                    "key": "text",
                    "ignore": true
                }
            },
            {
                "name": "1-DocumentNormalizer",
                "discard_num": 0,
                "diff_MB": -1.5221118927001953,
                "cumulative_time": 8.286988707,
                "params": {
                    "name": "DocumentNormalizer",
                    "p": 1,
                    "skip_rejected": true
                }
            },
            {
                "name": "2-DocumentLengthFilter",
                "discard_num": 344,
                "diff_MB": -0.05566596984863281,
                "cumulative_time": 0.093768306,
                "params": {
                    "name": "DocumentLengthFilter",
                    "p": 1,
                    "skip_rejected": true,
                    "min_doc_len": 100,
                    "max_doc_len": null
                }
            }
        ]
    }
    ```

  - `Compose.statistics_obj` is a `hojichar.StatsContainer` object. `StatsContainer` stores the raw values behind the statistics dictionary and defines addition, so statistics from separate runs of the same filter can be summed. Call `Compose.statistics_obj.get_human_readable_values()` to get the statistics dictionary.
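
A statistics dictionary of this shape can be summarized with plain Python to weigh each layer's cost against its effect. The sketch below copies a subset of the sample values shown above; in practice the dictionary would come from `Compose.statistics`. The helper `summarize_layers` is hypothetical, and dividing by the total `processed_num` is a simplifying assumption, since later layers may see fewer documents than earlier ones.

```python
# Subset of the sample statistics shown above.
stats = {
    "total_info": {"processed_num": 10928, "discard_num": 5513},
    "layers_info": [
        {"name": "0-JSONLoader", "discard_num": 0, "cumulative_time": 0.420034328},
        {"name": "1-DocumentNormalizer", "discard_num": 0, "cumulative_time": 8.286988707},
        {"name": "2-DocumentLengthFilter", "discard_num": 344, "cumulative_time": 0.093768306},
    ],
}


def summarize_layers(stats):
    """Return (name, discard share, time per processed document) for each layer."""
    processed = stats["total_info"]["processed_num"]
    return [
        (
            layer["name"],
            layer["discard_num"] / processed,
            layer["cumulative_time"] / processed,
        )
        for layer in stats["layers_info"]
    ]


rows = summarize_layers(stats)
for name, discard_share, time_per_doc in rows:
    print(f"{name}: discarded {discard_share:.2%}, {time_per_doc:.6f} time units per document")

# The layer with the largest time per document is the first candidate for optimization.
slowest = max(rows, key=lambda row: row[2])
print("slowest layer:", slowest[0])
```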

## Parallel application of `Compose`

The `hojichar.Parallel` class applies a `Compose` object to an iterable of `Document` objects concurrently, so large collections of documents can be processed on multiple CPU cores.

Example usage of the `Parallel` class to process a very large JSON Lines file concurrently:

```python
import hojichar

input_file = "your_text.jsonl"
input_doc_iter = (hojichar.Document(line) for line in open(input_file))

cleaner = hojichar.Compose([
    hojichar.document_filters.JSONLoader(),
    hojichar.document_filters.DocumentNormalizer(),
    # Insert your filters
    hojichar.document_filters.JSONDumper(),
])

with hojichar.Parallel(cleaner, num_jobs=10) as pfilter:
    out_doc_iter = pfilter.imap_apply(input_doc_iter)
    with open("your_processed_text.jsonl", "w") as fp:
        for doc in out_doc_iter:
            fp.write(doc.text + "\n")

```

- Always use the `Parallel` class within a `with` statement.
- `Parallel.imap_apply(doc_iter)` processes an iterator of `Document` and returns an iterator of the processed documents.
- For additional options and details about the `Parallel` class, please refer to the official documentation.

## CLI tool and preprocessing profile

- HojiChar provides a CLI tool for running text preprocessing pipelines.
- Users define the series of preprocessing steps in a Python file called a profile.

- Example:

  ```bash
  cat your_text.jsonl | hojichar -p your_preprocessing_profile.py -o your_text_preprocessed.jsonl
  ```

- `hojichar --help`

  ```man
    usage: hojichar [-h] --profile <profile.py> [--args ARGS [ARGS ...]] [--output OUTPUT] [--input INPUT] [--dump-stats <path to stats.json>] [--exit-on-error] [--all] [--jobs JOBS]

    options:
    -h, --help            show this help message and exit
    --profile <profile.py>, -p <profile.py>
                            Path to a Python file that implements your custom filter.
    --args ARGS [ARGS ...]
                            Pass additional arguments to the profile. Use it like `--args arg1 arg2` etc. The arguments should be space-separated.
    --output OUTPUT, -o OUTPUT
                            Specifies the path for the output file. Defaults to standard output.
    --input INPUT, -i INPUT
                            Specifies the path for the input file. Defaults to standard input. If set this path, the progress bar is enabled.
    --dump-stats <path to stats.json>
                            Dump statistics to file. If the file exists, it will be appended.
    --exit-on-error       Exit if an exception occurs during filtering. Useful for debugging custom filters.
    --all                 A flag that specifies whether to include discarded samples. This is useful when inspecting discarded samples.
    --jobs JOBS, -j JOBS  The number ob parallel jobs. By default, the nuber of the CPU core.
  ```

## Definition of Profile

- The HojiChar CLI receives the series of preprocessing steps as a profile.
- The preprocessing profile is provided as a Python file. Two forms of profile are supported.
- `hojichar.utils.load_compose.load_compose()` loads both forms of profile.
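
To make the two forms concrete, here is a rough sketch of how a loader could tell them apart using only the standard library. It is not the implementation of `load_compose()`: the function `load_profile`, the module name, and the choice to check `FILTER` before `FACTORY` are all assumptions made for illustration, and plain functions stand in for `Compose` objects so that the snippet runs without HojiChar.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_profile(path, *args):
    """Import a profile file; return FILTER, or FACTORY(*args) if FILTER is absent."""
    spec = importlib.util.spec_from_file_location("profile_sketch", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    if hasattr(module, "FILTER"):
        return module.FILTER
    if hasattr(module, "FACTORY"):
        return module.FACTORY(*args)
    raise AttributeError("profile must define FILTER or FACTORY")


# Stand-in profiles: plain callables take the place of Compose objects.
with tempfile.TemporaryDirectory() as tmp:
    filter_profile = Path(tmp) / "filter_profile.py"
    filter_profile.write_text("FILTER = str.strip\n")

    factory_profile = Path(tmp) / "factory_profile.py"
    factory_profile.write_text(
        "def callback(suffix):\n"
        "    return lambda text: text + suffix\n"
        "\n"
        "FACTORY = callback\n"
    )

    strip = load_profile(filter_profile)
    add_suffix = load_profile(factory_profile, "!")

print(repr(strip("  hello  ")), repr(add_suffix("hello")))
```

The two forms are described in detail in the sections that follow.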

### `FILTER` profile

- A `hojichar.Compose` object must be assigned to a variable named `FILTER`.
- Example:

    ```python
    import json
    
    from hojichar import Compose, Filter
    from hojichar.filters.document_filters import ExampleHojiChar, JSONLoader
    
    
    class JSONDumper(Filter):
        def apply(self, document):
            text = document.text
            document.text = json.dumps({"text": text}, ensure_ascii=False)
            return document
    
    # FILTER must be a Compose object.
    FILTER = Compose(
        [
            JSONLoader(),
            ExampleHojiChar(),
            JSONDumper(),
        ]
    )
    ```

  - Pass the texts to the filter you have defined using a pipe as follows.

    ```bash
    cat your_file | hojichar -p example_profile.py
    ```

- `hojichar.utils.load_compose.load_filter_from_file()` loads this type of profile.

### `FACTORY` profile

- A callable that returns a `hojichar.Compose` object must be assigned to a variable named `FACTORY`.
- The callable can receive arguments, so parameters can be passed to the profile.
  - Some values are better not hard-coded, for example random seeds or flags that modify a filter's behavior.
  - `FACTORY` provides a mechanism to pass those values as arguments to the preprocessing.
- Example:

  ```python
  import json
  
  from hojichar import Compose, Filter
  from hojichar.filters.document_filters import JSONLoader
  

  class AddSomething(Filter):  # Append a fixed value to every document.
      def __init__(self, something: str, *args, **kwargs) -> None:
          super().__init__(*args, **kwargs)  # initialize Filter (handles `p`, etc.)
          self.something = something

      def apply(self, document):
          text = document.text + self.something
          document.text = text
          return document

  class JSONDumper(Filter):
      def apply(self, document):
          text = document.text
          document.text = json.dumps({"text": text}, ensure_ascii=False)
          return document
  
  
  def callback(something):
      return Compose(
          [
              JSONLoader(),
              AddSomething(something),
              JSONDumper(),
          ]
      )
  
  # FACTORY must be a callable that returns a Compose object.
  FACTORY = callback
  ```

- Using a `FACTORY` profile with arguments from the CLI:

    ```bash
    cat your_file | hojichar -p example_profile.py --args arg1 arg2
    ```

- `hojichar.utils.load_compose.load_parametrized_filter_from_file()` or `load_factory_from_file` loads this type of profile.
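
Values given on a command line normally arrive as strings, so a `FACTORY` callable that expects a number or a flag may need to convert its arguments first. The sketch below assumes that each value after `--args` reaches the callable as a string positional argument; this is an assumption, not something stated above. The helpers `parse_bool` and `parse_factory_args` and the parameter names `seed` and `verbose` are hypothetical.

```python
def parse_bool(value: str) -> bool:
    """Interpret common command-line spellings of a boolean."""
    lowered = value.strip().lower()
    if lowered in {"1", "true", "yes", "on"}:
        return True
    if lowered in {"0", "false", "no", "off"}:
        return False
    raise ValueError(f"not a boolean: {value!r}")


def parse_factory_args(seed: str, verbose: str = "false") -> dict:
    """Convert raw string arguments into typed values for filter constructors."""
    return {"seed": int(seed), "verbose": parse_bool(verbose)}


# Inside a profile, the FACTORY callable would convert its arguments first
# and then build its Compose object from the typed values, for example:
#
#   def callback(seed, verbose="false"):
#       config = parse_factory_args(seed, verbose)
#       return Compose([...])  # filters constructed from config
#
#   FACTORY = callback
config = parse_factory_args("42", "True")
print(config)
```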

## For Developers

### Installing from the Source Directory

To install the package, execute the following commands:

```
git clone https://github.com/HojiChar/HojiChar.git
cd HojiChar
poetry install
```

To install packages related to development, use:

```
poetry install --extras "dev lint test doc"
```

### Testing

Some filters incorporate doctests. You can run these tests with the command:

```
pytest --doctest-modules .
```

This command should be executed from the root of the project.

### Code style

- HojiChar requires type hints for all code. Type checking is performed in continuous integration (CI) in addition to the pytest tests.
- HojiChar code is subject to inspection by the Flake8 Linter and is formatted using Black and isort. For configuration details, please refer to `pyproject.toml`. You can perform linting and formatting from the root of the project using the following commands:

Linting

```
poetry run task lint
```

Formatting

```
poetry run task format
```

### Building the Documentation

We use Pdoc for building the documentation. You can build the documentation using the following command:

```
pdoc -o docs hojichar
```

Run this command from the project root.

In practice, the process of building the documentation is automated by CI. When a Pull Request is merged into the main branch, the documentation is built in the `docs/` directory of the `docs` branch. This directory is then deployed to the official documentation site by GitHub Pages.

### Creating a Source Tarball

To create a source tarball, for instance, for packaging or distribution, run the following command:

```
poetry build
```

The tarball is created in the `dist` directory. The command packages the source code, and the resulting tarball can be installed with no additional dependencies other than the Python standard library.

### Creating a Release and Uploading it to PyPI

This procedure is primarily used by the project manager to create a release and upload it to PyPI.

Versions uploaded to PyPI are identified by git tags. Both the `__version__` variable in `__init__.py` and the `version` entry in `pyproject.toml` are ignored. The `poetry-dynamic-versioning` Poetry plugin is used to implement this process.

To add the plugin, use:

```
poetry self add "poetry-dynamic-versioning[plugin]"
```

The steps to push to PyPI are as follows. In practice, the process is automated by CI when a GitHub release is created from the tag.

```
git checkout v0.1.2
poetry config pypi-token.pypi <API TOKEN>
poetry build 
poetry publish
```

The actual task for the manager is to apply the appropriate tag to the commit to be released and to create the release from GitHub:

```
git tag -a v0.1.2 -m "Version 0.1.2"
git push origin v0.1.2
```

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/HojiChar/HojiChar",
    "name": "hojichar",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "",
    "keywords": "",
    "author": "kenta.shinzato",
    "author_email": "hoppiece@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/fc/02/112d8f751a88473800c9f3e2c1cebd7288d6aaea939d816e2ae40a1ddda3/hojichar-0.9.0.tar.gz",
    "platform": null,
    "description": "# HojiChar\n\n[![PyPI version](https://badge.fury.io/py/hojichar.svg)](https://badge.fury.io/py/hojichar)\n[![Python Versions](https://img.shields.io/pypi/pyversions/hojichar.svg)](https://pypi.org/project/hojichar/)\n[![CI wowkflow](https://github.com/HojiChar/HojiChar/actions/workflows/ci.yaml/badge.svg)](https://github.com/HojiChar/HojiChar/actions/workflows/ci.yaml)\n[![codecov](https://codecov.io/gh/HojiChar/HojiChar/branch/main/graph/badge.svg?token=16928I9U9Y)](https://codecov.io/gh/HojiChar/HojiChar)\n![PyPI - Downloads](https://img.shields.io/pypi/dm/hojichar)\n\nOfficial docs: <https://hojichar.github.io/HojiChar/hojichar.html>\n\n## Features\n\n- HojiChar provides a way to combine multiple arbitrary text processing tasks into a streamlined pipeline.\n- The sequence of operations can be described declaratively, ensuring portability.\n- HojiChar allows users to gather detailed statistical information from large amounts of text during processing.\n- It enables management of any Python text processing tasks, providing a Command Line Interface (CLI) capable of parallel processing.\n\n## Background and what is for HojiChar\n\nText preprocessing is far from a one-size-fits-all process. Depending on the data source and the specific task at hand, various steps including normalization, noise removal, and filtering may be necessary. Not all texts require the same level of preprocessing. For instance, relatively clean texts may only need minimal filtering, while \"dirtier\" sources like Common Crawl data often require more thorough processing. As a result, the preprocessing profile has to be tailored to each specific domain.\n\nMany preprocessing operations can be viewed as filters, taking string as input, applying a transformation, and outputting the processed string. 
Even though these operations might seem straightforward individually, managing them in a multi-layered, efficient manner can be challenging.\n\nInspired by [`torchvision.transforms`](https://pytorch.org/vision/stable/transforms.html) and [iver56/audiomentations](https://github.com/iver56/audiomentations), HojiChar addresses these challenges. It enables users to define each text processing step as a class inheriting from `hojichar.Filter` and use `hojichar.Compose` to chain them together into a single filter. By writing out the `Compose` recipe as a profile, the preprocessing process for a specific domain's text can be made portable. Moreover, `Compose` automatically logs various metrics for each filter, such as byte changes, processing time, and number of rejected texts. This allows users to assess the validity of each operation and consider trade-offs between computation time and performance.\n\nWhile there are other text normalization tools available, most are designed to perform a specific set of operations. Text preprocessing, despite its importance, is often considered a mundane task compared to machine learning or artificial intelligence tasks. As a result, many existing solutions can be ad hoc, poorly maintained, or inadequately tested. 
Recognizing these issues, we developed HojiChar as a robust tool for configuring text preprocessing.\n\n## Install\n\n```\npip install hojichar\n```\n\n## Defining a Compose Object\n\nThe [`Compose`](https://hojichar.github.io/HojiChar/hojichar.html#Compose) class in HojiChar allows you to create a sequence of text processing filters.\n\n```Python\nfrom hojichar import Compose, document_filters\n\ncleaner = Compose([\n    document_filters.JSONLoader(key=\"text\"),\n    document_filters.AcceptJapanese(),\n    document_filters.DocumentLengthFilter(min_doc_len=0,max_doc_len=1000),\n    document_filters.ExampleHojiChar(),\n    document_filters.JSONDumper()\n])\n```\n\nWhen a [`Compose`](https://hojichar.github.io/HojiChar/hojichar.html#Compose) object is called, it accepts a string and returns the processed string.\n\n```Python\n>>> cleaner('{\"text\": \"\u3053\u3093\u306b\u3061\u306f\u3001\"}')\n{\"text\": \"\u3053\u3093\u306b\u3061\u306f\u3001<hojichar>\"}\n```\n\nThe filter pipeline above accomplishes the following steps:\n\n1. Extracts the value from the `'text'` key in the JSON object.\n2. Discards the string if it's not in Japanese.\n3. Rejects any text shorter than 0 characters or longer than 1000 characters.\n4. Appends `<hojichar>` to the string.\n5. 
Outputs the processed string as JSON with the key \"text\".\n\nThe filters used in the pipeline are predefined filters found in [`hojichar.filters`](https://hojichar.github.io/HojiChar/hojichar/filters.html).\n\nWhile HojiChar provides some fundamental text processing filters and plans to add more in the future, users can also define their custom filters.\n\n## User-defined Filters\n\nA filter composing a [`Compose`](https://hojichar.github.io/HojiChar/hojichar.html#Compose) object is a class that inherits the [`Filter`](https://hojichar.github.io/HojiChar/hojichar.html#Filter) class and implements the text processing within the `apply` function.\n\n```Python\nfrom hojichar.core.filter_interface import Filter\n\nclass YourFilter(Filter):\n    def apply(self, document):\n        text = document.text\n        \"\"\"\n        Write your text transformation...\n        \"\"\"\n        document.text = text\n        return document\n```\n\nThe `apply` method accepts a `hojichar.Document` type as an argument and returns it after the transformations. The [`Document`](https://hojichar.github.io/HojiChar/hojichar.html#Document) is a class that encapsulates a string.\n\n**Reject documents**\n\n- The `hojichar.Document` has an `is_rejected` attribute. 
If a filter sets this flag to `True`, `Compose` will discard the document during processing.\n\n**Definition of `__init__` for custom filter**\n\nWhen creating a user-defined class and applying a custom constructor, make sure to initialize the parent class.\n\n```python\nclass YourFilter(Filter):\n    def __init__(self, your_param, *args, **kwargs) -> None:\n        super().__init__(*args, **kwargs)\n        self.your_param = your_param\n\n    def apply(self, document):\n        text = document.text\n        text = process(text, self.your_param)\n        document.text = text\n        return document\n```\n\nThis is because The `Filter` class implicitly has several arguments, one of which is `p`.\n\n```python\ncleaner = Compose([\n    document_filters.JSONLoader(key=\"text\"),\n    document_filters.AcceptJapanese(p=0.5),\n    document_filters.JSONDumper()\n])\n```\n\nThe `p` argument passed to the `document_filters.AcceptJapanese` constructor determines the probability of applying the filter; with a probability of `1-p`, it acts as an identity function. This behavior is defined in the parent class `hojichar.Filter`.\n\n## Additional Notes on Compose\n\n- Even though the behavior of a `Compose` object when called is a text-in, text-out function, `Compose` itself also inherits from the `Filter` class. Therefore, applying the `apply` method to a `Compose` object results in `hojihcar.Document` class being used as input and output.\n- `Compose` class behaves like a Filter. 
If you add a Compose object as one of the filters in the constructor of Compose, the filter will be unfolded recursively.\n- You can access various statistics regarding the processing performed by `Compose` through `Compose.statistics` or `Compose.statistics_obj`.\n  - `Compose.statistics` is a dictionary like above.\n\n    ```json\n    {\n    \"total_info\": {\n        \"processed_num\": 10928,\n        \"discard_num\": 5513,\n        \"input_MB\": 104.514584,\n        \"output_MB\": 25.33024,\n        \"cumulative_time\": 114.071047143,\n        \"total_token_num\": 0\n    },\n    \"layers_info\": [\n        {\n        \"name\": \"0-JSONLoader\",\n        \"discard_num\": 0,\n        \"diff_MB\": -1.9647932052612305,\n        \"cumulative_time\": 0.420034328,\n        \"params\": {\n            \"name\": \"JSONLoader\",\n            \"p\": 1,\n            \"skip_rejected\": true,\n            \"key\": \"text\",\n            \"ignore\": true\n        }\n        },\n        {\n        \"name\": \"1-DocumentNormalizer\",\n        \"discard_num\": 0,\n        \"diff_MB\": -1.5221118927001953,\n        \"cumulative_time\": 8.286988707,\n        \"params\": {\n            \"name\": \"DocumentNormalizer\",\n            \"p\": 1,\n            \"skip_rejected\": true\n        }\n        },\n        {\n        \"name\": \"2-DocumentLengthFilter\",\n        \"discard_num\": 344,\n        \"diff_MB\": -0.05566596984863281,\n        \"cumulative_time\": 0.093768306,\n        \"params\": {\n            \"name\": \"DocumentLengthFilter\",\n            \"p\": 1,\n            \"skip_rejected\": true,\n            \"min_doc_len\": 100,\n            \"max_doc_len\": null\n        }\n        },\n    ]\n    }\n    ```\n\n- `Compose.statistics_obj` is a `hojichar.StatsContainer` class. The `hojichar.StatsContainer` class stores the raw values of the statistics dictionary, and addition operations are defined to easily calculate the total statistics processed with the same filter. 
You can get the statistics dictionary by calling `Compose.statistics_obj.get_human_readable_values()`.\n\n## Parallel application of `Compose`\n\nThe `hojichar.Parallel` class allows for the application of `Compose` to an iterable of `Document` concurrently. This class empowers users to process vast collections of documents by harnessing the power of multiple CPU cores.\n\nExample usage of `Parallel` class to proces a very large JSON Lines file concurrently.\n\n```python\nimport hojichar\n\ninput_file = \"your_text.jsonl\"\ninput_doc_iter = (hojichar.Document(line) for line in open(input_file))\n\ncleaner = hojichar.Compose([\n    hojichar.document_filters.JSONLoader(),\n    hojichar.document_filters.DocumentNormalizer(),\n    # Insert your filters\n    hojichar.document_filters.JSONDumper(),\n])\n\nwith hojichar.Parallel(cleaner, num_jobs=10) as pfilter:\n    out_doc_iter = pfilter.imap_apply(input_doc_iter)\n    with open(\"your_processed_text.jsonl\", \"w\") as fp:\n        for doc in out_doc_iter:\n            fp.write(doc.text + \"\\n\")\n\n```\n\n- Always use the `Parallel` class within a `with` statement.\n- `Parallel.imap_apply(doc_iter)` processes an iterator of `Document` and returns an iterator of the processed documents.\n- For additional options and details about the `Parallel` class, please refer to the official documentation.\n\n## CLI tool and preprocessing profile\n\n- HojiChar provides CLI tools for text preprocess pipeline.\n- User defines a series of preprocessing into a python file as profile.\n\n- Example:\n\n  ```bash\n  cat <your_text.jsonl> | hojichar -p your_preprocessing_profile.py -o your_text_preprocessed.jsonl\n  ```\n\n- `hojichar --help`\n\n  ```man\n    usage: hojichar [-h] --profile <profile.py> [--args ARGS [ARGS ...]] [--output OUTPUT] [--input INPUT] [--dump-stats <path to stats.json>] [--exit-on-error] [--all] [--jobs JOBS]\n\n    options:\n    -h, --help            show this help message and exit\n    --profile <profile.py>, 
-p <profile.py>\n                            Path to a Python file that implements your custom filter.\n    --args ARGS [ARGS ...]\n                            Pass additional arguments to the profile. Use it like `--args arg1 arg2` etc. The arguments should be space-separated.\n    --output OUTPUT, -o OUTPUT\n                            Specifies the path for the output file. Defaults to standard output.\n    --input INPUT, -i INPUT\n                            Specifies the path for the input file. Defaults to standard input. If set this path, the progress bar is enabled.\n    --dump-stats <path to stats.json>\n                            Dump statistics to file. If the file exists, it will be appended.\n    --exit-on-error       Exit if an exception occurs during filtering. Useful for debugging custom filters.\n    --all                 A flag that specifies whether to include discarded samples. This is useful when inspecting discarded samples.\n    --jobs JOBS, -j JOBS  The number ob parallel jobs. By default, the nuber of the CPU core.\n  ```\n\n## Definition of Profile\n\n- HojiChar CLI receives a series of preprocessing as a profile.\n- The preprocessing profile is provided as a Python file. 
Two patterns of the file are allowed.\n- hojichar.utils.load_compose.load_compose() loads these profile.\n\n### `FILTER` profile\n\n- `hojichar.Compose` must be defined as `FILTER` variable.\n- Example.\n\n    ```python\n    import json\n    \n    from hojichar import Compose, Filter\n    from hojichar.filters.document_filters import ExampleHojiChar, JSONLoader\n    \n    \n    class JSONDumper(Filter):\n        def apply(self, document):\n            text = document.text\n            document.text = json.dumps({\"text\": text}, ensure_ascii=False)\n            return document\n    \n    # FILTER must define Compose object.\n    FILTER = Compose(\n        [\n            JSONLoader(),\n            ExampleHojiChar(),\n            JSONDumper(),\n        ]\n    )\n    ```\n\n  - Pass the texts to the filter you have defined using a pipe as follows.\n\n    ```bash\n    cat <your_file> | hojichar -p example_profile.py\n    ```\n\n- `hojichar.utils.load_compose.load_filter_from_file()` loads this type of profile.\n\n### `FACTORY` profile\n\n- A callable function that returns `hojichar.Compose` must be defined as `FACTORY` variable.\n- The callable can receive arguments. In this way, parameters can be passed to the profile.\n  - Some kinds of value are not preferred to static. 
For example, random seeds and some flags modify the behavior of a filter, etc\n  - `FACTORY` provides a mechanism to pass those values as arguments to the preprocessing.\n- Example.\n\n  ```python\n  import json\n  \n  from hojichar import Compose, Filter\n  from hojichar.filters.document_filters import JSONLoader\n  \n\n  class AddSomething(Filter): #  Concat some value after every document.\n      def __init__(self, something: str, *args, **kwargs) -> None:\n          self.something = something\n\n      def apply(self, document):\n          text = document.text + self.something\n          document.text = text\n          return document\n\n  class JSONDumper(Filter):\n      def apply(self, document):\n          text = document.text\n          document.text = json.dumps({\"text\": text}, ensure_ascii=False)\n          return document\n  \n  \n  def callback(something):\n      return Compose(\n          [\n              JSONLoader(),\n              AddSomething(something),\n              JSONDumper(),\n          ]\n      )\n  \n  # FACTORY must be callable which returns Compose object.\n  FACTORY = callback\n  ```\n\n- Using `FACTORY` profile with arguments in CLI.\n\n    ```bash\n    cat <your_file> | hojichar -p example_profile.py --args arg1 arg2\n    ```\n\n- `hojichar.utils.load_compose.load_parametrized_filter_from_file()` or `load_factory_from_file` loads this type of profile.\n\n## For Developers\n\n### Installing from the Source Directory\n\nTo install the package, execute the following commands:\n\n```\ngit clone https://github.com/HojiChar/HojiChar.git\ncd HojiChar\npoetry install\n```\n\nTo install packages related to development, use:\n\n```\npoetry install --extras \"dev lint test doc\"\n```\n\n### Testing\n\nSome filters incorporate doctests. You can run these tests with the command:\n\n```\npytest --doctest-modules .\n```\n\nThis command should be executed from the root of the project.\n\n### Code style\n\n- HojiChar requires type hints for all code. 
Type checking is performed in continuous integration (CI) in addition to the pytest tests.\n- HojiChar code is checked by the Flake8 linter and formatted with Black and isort. For configuration details, please refer to `pyproject.toml`. You can perform linting and formatting from the root of the project using the following commands:\n\nLinting\n\n```\npoetry run task lint\n```\n\nFormatting\n\n```\npoetry run task format\n```\n\n### Building the Documentation\n\nWe use Pdoc for building the documentation. You can build the documentation using the following command:\n\n```\npdoc -o docs hojichar\n```\n\nRun this command from the project root.\n\nIn practice, the process of building the documentation is automated by CI. When a Pull Request is merged into the main branch, the documentation is built in the `docs/` directory of the `docs` branch. This directory is then deployed to the official documentation site by GitHub Pages.\n\n### Creating a Source Tarball\n\nTo create a source tarball, for instance, for packaging or distribution, run the following command:\n\n```\npoetry build\n```\n\nThe tarball will be created in the `dist` directory. This command will compile the source code, and the resulting tarball can be installed with no additional dependencies other than the Python standard library.\n\n### Creating a Release and Uploading it to PyPI\n\nThis procedure is primarily used by the project manager to create a release and upload it to PyPI.\n\nVersions uploaded to PyPI are identified by git tags. The `__version__` variable in `__init__.py` and the `version` entry in `pyproject.toml` are ignored. 
The `poetry-dynamic-versioning` Poetry plugin is used to implement this process.\n\nTo add the plugin, use:\n\n```\npoetry self add \"poetry-dynamic-versioning[plugin]\"\n```\n\nThe steps to push to PyPI are as follows, although in actuality, the process is automated by CI when a GitHub release is created from the tag.\n\n```\ngit checkout v0.1.2\npoetry config pypi-token.pypi <API TOKEN>\npoetry build \npoetry publish\n```\n\nThe actual task for the manager is to apply the appropriate tag to the commit to be released and to create the release from GitHub:\n\n```\ngit tag -a v0.1.2 -m \"Version 0.1.2\"\ngit push origin v0.1.2\n```\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Text preprocessing management system.",
    "version": "0.9.0",
    "project_urls": {
        "Homepage": "https://github.com/HojiChar/HojiChar",
        "Repository": "https://github.com/HojiChar/HojiChar"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ae0cbefafadadb6e57eb1f2196a03f62f9e9348bb1ac55c6e966f4b2dcc7dcff",
                "md5": "a77e70b46d54a1183082aa0db3a06ce9",
                "sha256": "968b4b80fada3c7a4d7cd48929d58ae96396a8cd916fab4d6776dd75e82d255b"
            },
            "downloads": -1,
            "filename": "hojichar-0.9.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a77e70b46d54a1183082aa0db3a06ce9",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 49220,
            "upload_time": "2023-08-08T11:44:43",
            "upload_time_iso_8601": "2023-08-08T11:44:43.246480Z",
            "url": "https://files.pythonhosted.org/packages/ae/0c/befafadadb6e57eb1f2196a03f62f9e9348bb1ac55c6e966f4b2dcc7dcff/hojichar-0.9.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fc02112d8f751a88473800c9f3e2c1cebd7288d6aaea939d816e2ae40a1ddda3",
                "md5": "141b2c64ff7002d1bc7d0803c3355a46",
                "sha256": "e4bbab088f13808a8a25e568970c848c838bdc6ec3a941393b4246ab4ad006ac"
            },
            "downloads": -1,
            "filename": "hojichar-0.9.0.tar.gz",
            "has_sig": false,
            "md5_digest": "141b2c64ff7002d1bc7d0803c3355a46",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 48086,
            "upload_time": "2023-08-08T11:44:44",
            "upload_time_iso_8601": "2023-08-08T11:44:44.861501Z",
            "url": "https://files.pythonhosted.org/packages/fc/02/112d8f751a88473800c9f3e2c1cebd7288d6aaea939d816e2ae40a1ddda3/hojichar-0.9.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-08 11:44:44",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "HojiChar",
    "github_project": "HojiChar",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "hojichar"
}
