squeakily
================
<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
This repository is heavily inspired by BigScience’s [ROOTs
project](https://github.com/bigscience-workshop/data-preparation) and
EleutherAI’s [The Pile](https://github.com/EleutherAI/the-pile).
The overall pipeline is as follows:
<div>
<p>
<img src="index_files/figure-commonmark/mermaid-figure-1.png"
style="width:5.53in;height:0.72in" />
</p>
</div>
In this library, we define filtering as removing data instances from
the dataset based on some criteria, and cleaning as modifying data
instances in some way.
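As a schematic illustration in plain Python (not squeakily's actual API), a filter decides whether a document is kept, while a cleaner returns a modified copy of it:

``` python
def long_enough(text: str) -> bool:
    # Filter: return True to keep the document, False to remove it.
    return len(text.split()) >= 5

def collapse_spaces(text: str) -> str:
    # Cleaner: return a modified copy of the document.
    return " ".join(text.split())

docs = ["too short", "this   document has    enough words to keep"]
kept = [collapse_spaces(d) for d in docs if long_enough(d)]
```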
## Install
``` sh
pip install squeakily
```
## How to use
### Using the API
First, we need to define a datasource. `squeakily` accepts any `Dataset`
object from the [HuggingFace
Datasets](https://huggingface.co/docs/datasets/index) library. For
example, we can use the
[wikitext](https://huggingface.co/datasets/wikitext) dataset:
``` python
from datasets import load_dataset
ds = load_dataset("wikitext", "wikitext-103-v1", split="train[:1%]")
```
We simply need to wrap each `Dataset` object in a dictionary that
specifies the dataset itself, the columns to process, and the filters
and cleaners to apply. For example:
``` python
from squeakily.filter import check_char_repetition, check_flagged_words
from squeakily.clean import remove_empty_lines, normalize_whitespace
datasources = [
    {
        "dataset": ds,
        "columns": ["text"],
        "filters": [check_char_repetition, check_flagged_words],
        "cleaners": [remove_empty_lines, normalize_whitespace],
    },
    # ...
]
```
<div>
> **Warning**
>
> The order of the filters and cleaning functions matters: filters
> and cleaners are applied in the order in which they are defined.
</div>
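To see why the order matters, here is a toy illustration in plain Python (not the library's internals): a length filter keeps a whitespace-padded document when run before cleaning, but drops it when run after:

``` python
def keep_long(text: str) -> bool:
    # Toy filter: keep documents with at least 5 characters.
    return len(text) >= 5

def normalize(text: str) -> str:
    # Toy cleaner: collapse runs of whitespace.
    return " ".join(text.split())

doc = "a    b"

# Filter first, then clean: the raw document passes the length check.
filtered_then_cleaned = normalize(doc) if keep_long(doc) else None

# Clean first, then filter: the normalized document is now too short.
cleaned = normalize(doc)
cleaned_then_filtered = cleaned if keep_long(cleaned) else None
```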
<div>
> **Important**
>
> As of now, only the first of the given column names is used. This is
> because `squeakily` is designed to work with language datasets, which
> usually have a single column of text. Future versions will support
> multiple columns.
</div>
Finally, we can apply the filters and cleaners to the datasources using a
[`Pipeline`](https://CarperAI.github.io/squeakily/core.html#pipeline)
object:
``` python
from squeakily.core import Pipeline
pipeline = Pipeline(datasources)
pipeline.run()
```
<pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"><span style="color: #7fbfbf; text-decoration-color: #7fbfbf">[11/16/22 04:32:57] </span><span style="color: #000080; text-decoration-color: #000080">INFO </span> Running datasource: wikitext <a href="file:///fsx/home-nathan/work/squeakily/squeakily/core.py" target="_blank"><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">core.py</span></a><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">:</span><a href="file:///fsx/home-nathan/work/squeakily/squeakily/core.py#41" target="_blank"><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">41</span></a>
</pre>
<pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"><span style="color: #7fbfbf; text-decoration-color: #7fbfbf"> </span><span style="color: #000080; text-decoration-color: #000080">INFO </span> Running filter: check_char_repetition on text <a href="file:///fsx/home-nathan/work/squeakily/squeakily/core.py" target="_blank"><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">core.py</span></a><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">:</span><a href="file:///fsx/home-nathan/work/squeakily/squeakily/core.py#54" target="_blank"><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">54</span></a>
</pre>
<pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"><span style="color: #7fbfbf; text-decoration-color: #7fbfbf"> </span><span style="color: #000080; text-decoration-color: #000080">INFO </span> Running filter: check_flagged_words on text <a href="file:///fsx/home-nathan/work/squeakily/squeakily/core.py" target="_blank"><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">core.py</span></a><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">:</span><a href="file:///fsx/home-nathan/work/squeakily/squeakily/core.py#54" target="_blank"><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">54</span></a>
</pre>
<pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"><span style="color: #7fbfbf; text-decoration-color: #7fbfbf"> </span><span style="color: #000080; text-decoration-color: #000080">INFO </span> Running cleaner: remove_empty_lines on text <a href="file:///fsx/home-nathan/work/squeakily/squeakily/core.py" target="_blank"><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">core.py</span></a><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">:</span><a href="file:///fsx/home-nathan/work/squeakily/squeakily/core.py#57" target="_blank"><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">57</span></a>
</pre>
<pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"><span style="color: #7fbfbf; text-decoration-color: #7fbfbf">[11/16/22 04:32:59] </span><span style="color: #000080; text-decoration-color: #000080">INFO </span> Running cleaner: normalize_whitespace on text <a href="file:///fsx/home-nathan/work/squeakily/squeakily/core.py" target="_blank"><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">core.py</span></a><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">:</span><a href="file:///fsx/home-nathan/work/squeakily/squeakily/core.py#57" target="_blank"><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">57</span></a>
</pre>
<div>
> **Note**
>
> If you want to run the cleaners first, you can pass
> `cleaning_first=True` to the `run` function.
>
> ``` python
> pipeline.run(cleaning_first=True)
> ```
</div>
If you need to run a filter or cleaner at the dataset level rather than
the example level, you can pass `global_filters` or `global_cleaners` to
the
[`Pipeline.run`](https://CarperAI.github.io/squeakily/core.html#pipeline.run)
function. For example:
``` python
from squeakily.filter import minhash_dedup
pipeline.run(global_filters=[minhash_dedup])
```
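Deduplication at the dataset level works on document similarity. As a rough, plain-Python illustration of the underlying idea (MinHash approximates the Jaccard similarity of documents' shingle sets; this is not `minhash_dedup`'s actual implementation):

``` python
def shingles(text: str, n: int = 3) -> set:
    # Break a document into overlapping word n-grams.
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    # Jaccard similarity: |intersection| / |union|.
    return len(a & b) / len(a | b) if a | b else 0.0

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumps over a lazy dog"
sim = jaccard(shingles(d1), shingles(d2))
```

Deduplication then drops one document from each pair whose (approximate) similarity exceeds a threshold.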
<div>
> **Note**
>
> If you use global filters or cleaners, all datasets must have a
> common column name so that they can be properly concatenated.
</div>
<div>
> **Note**
>
> You can also specify that a particular dataset should be skipped by
> setting the `skip_global` parameter to `True` when defining the
> datasource.
>
> ``` python
> datasources = [
>     {
>         "dataset": ds,
>         "columns": ["text"],
>         "filters": [check_char_repetition, check_flagged_words],
>         "cleaners": [remove_empty_lines, normalize_whitespace],
>         "skip_global": True,
>     },
>     # ...
> ]
> ```
</div>
Additionally, you can run the pipeline in dry-run mode by passing
`dry_run=True` to the `run` function. This makes no modifications to
the datasets’ documents, but adds extra columns to the datasets with
the results of the filters and cleaners. For example, if you ran the
pipeline with the
[`check_char_repetition`](https://CarperAI.github.io/squeakily/filter.html#check_char_repetition)
filter, you would get a new column called
[`check_char_repetition`](https://CarperAI.github.io/squeakily/filter.html#check_char_repetition)
containing a float between 0 and 1 indicating the fraction of
characters that are repeated in the document.
``` python
pipeline = Pipeline(datasources)
pipeline.run(dry_run=True)
pipeline.datasources[0]["dataset"].features
```
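A repetition score of this kind can be sketched as follows (an illustration of the idea, not the library's actual implementation, assuming the score is the fraction of character n-grams that occur more than once):

``` python
from collections import Counter

def char_repetition_ratio(text: str, n: int = 5) -> float:
    # Fraction of character n-grams that occur more than once.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)

highly_repetitive = char_repetition_ratio("spam spam spam spam")
mostly_unique = char_repetition_ratio("a varied sentence with little repetition")
```

A filter would then drop documents whose score exceeds some threshold.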