# semantic-cleaning
<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
<a href="https://colab.research.google.com/github/yuval6957/semantic-cleaning/blob/main/nbs/index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
Tools for semantic deduplication and cleaning of text datasets.
## Install
``` sh
pip install semantic_cleaning
```
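To try the latest code instead of the released package, you can also install directly from the repository:

``` sh
pip install git+https://github.com/yuval6957/semantic-cleaning.git
```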
## How to use
``` python
import os
from tqdm.auto import tqdm
from typing import List, Dict, Set, Union, Callable
import torch
from torch.utils.data import DataLoader
from datasets import Dataset, load_dataset
import numpy as np
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
import torch.nn.functional as F
import transformers

# The cleaning utilities used below come from this package
# (import path assumed; adjust if your installed layout differs).
from semantic_cleaning import preprocess_data, compute_embeddings, deduplicate_embeddings, deduplicate_dataset
```
Process the dataset to build a single string per example, e.g. a question and its answer, or a comment and its response:
``` python
data = load_dataset("0-hero/OIG-small-chip2")
_ = preprocess_data(data, schema=":{user} :{chip2}")
data['train']['_merged'][0]
```
    2
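To make the schema explicit, here is a small illustration of how such a template is presumably filled from a row's columns (a standalone sketch with a made-up row, not the library's internal code):

``` python
# Hypothetical row from the OIG-small-chip2 dataset; the schema's "{user}"
# and "{chip2}" placeholders are assumed to map to these column names.
row = {"user": "How do I sort a list in Python?", "chip2": "Use sorted() or list.sort()."}
schema = ":{user} :{chip2}"
print(schema.format(**row))
# :How do I sort a list in Python? :Use sorted() or list.sort().
```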
Compute the embeddings for the sentences:
``` python
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda')
embedding = compute_embeddings(data=data, embedding_model=model, tokenizer=tokenizer, batch_size=64, num_workers=16, dataset_feature='_merged')
```
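For context, sentence embeddings for all-mpnet-base-v2 are usually obtained by mean pooling the token outputs. The sketch below shows that idea using the `model` and `tokenizer` defined above; it is an assumption about the general approach, not necessarily what `compute_embeddings` does internally:

``` python
import torch
import torch.nn.functional as F

def mean_pool(last_hidden_state, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

with torch.no_grad():
    batch = tokenizer(["example sentence"], padding=True, truncation=True, return_tensors="pt").to(model.device)
    out = model(**batch)
    # L2-normalise so cosine similarity reduces to a dot product.
    sentence_emb = F.normalize(mean_pool(out.last_hidden_state, batch["attention_mask"]), dim=-1)
```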
We can get the indices of all the duplicate rows with the following
command:
``` python
to_delete = deduplicate_embeddings(embedded=embedding, epsilon=1e-2, batch_size=20000)
```
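To actually drop those rows, one option is the standard `datasets` `select` method. The filtering below is an illustration and assumes `to_delete` is a collection of row indices:

``` python
# Keep every row whose index was not flagged as a duplicate.
to_delete_set = {int(i) for i in to_delete}
keep = [i for i in range(len(data['train'])) if i not in to_delete_set]
cleaned = data['train'].select(keep)
```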
The full process can be run like this:
``` python
deduplicated = deduplicate_dataset(
    dataset=data['train'],
    model=model,
    tokenizer=tokenizer,
    epsilon=1e-2,
    model_batch_size=64,
    deduplication_batch_size=20000,
    num_workers=16,
    dataset_feature='_merged',
)
print(f"cleaned: {(1 - len(deduplicated) / len(data['train'])) * 100:.2f}%")
```
The `deduplicated` dataset can then be pushed back to the Hugging Face Hub or saved to a local drive.
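For example, with the standard `datasets` methods (the repository name and path below are placeholders):

``` python
# Push the cleaned split to the Hugging Face Hub (log in first with `huggingface-cli login`).
deduplicated.push_to_hub("your-username/oig-small-chip2-deduplicated")

# Or keep a local copy.
deduplicated.save_to_disk("./oig-small-chip2-deduplicated")
```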