scikit-embeddings


Name: scikit-embeddings
Version: 0.2.0
Summary: Tools for training word and document embeddings in scikit-learn.
Upload time: 2023-09-01 13:44:41
Author: Márton Kardos
Requires Python: >=3.9,<4.0
License: MIT
# scikit-embeddings
Utilities for training word, document and sentence embeddings in scikit-learn pipelines.

## Features
 - Train Word, Paragraph or Sentence embeddings in scikit-learn compatible pipelines.
 - Stream texts easily from disk and chunk them so you can use large datasets for training embeddings.
 - spaCy tokenizers with lemmatization, stop-word removal and augmentation with POS tags/morphological information, for high-quality embeddings in literary analysis.
 - Fast and performant trainable tokenizer components from `tokenizers`.
 - Components and pipelines that are easy to integrate into your scikit-learn workflows and machine learning pipelines.
 - Easy serialization and integration with the Hugging Face Hub for quickly publishing your embedding pipelines.

### What scikit-embeddings is not for:
 - Using pretrained embeddings in scikit-learn pipelines (for these purposes I recommend [embetter](https://github.com/koaning/embetter/tree/main))
 - Training transformer models and deep neural language models (if you want to do this, do it with [transformers](https://huggingface.co/docs/transformers/index))


## Examples

### Streams

scikit-embeddings comes with a handful of utilities for streaming data from disk or other sources,
chunking and filtering. Here's an example of how you would obtain chunks of text from jsonl files that contain a "content" field.

```python
from skembeddings.streams import Stream

# let's say you have a list of file paths
files: list[str] = [...]

# Stream text chunks from jsonl files with a 'content' field.
text_chunks = (
    Stream(files)
    .read_files(lines=True)
    .json()
    .grab("content")
    .chunk(10_000)
)
```
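For reference, the stream above expects newline-delimited JSON where every line carries a `"content"` field. Here is a minimal sketch of producing such a file with the standard library; the file name and documents are just placeholders:

```python
import json

# Hypothetical example: write a handful of documents to a jsonl file,
# one JSON object per line, each with a "content" field.
docs = ["First document...", "Second document..."]
with open("corpus_000.jsonl", "w", encoding="utf-8") as out_file:
    for doc in docs:
        out_file.write(json.dumps({"content": doc}) + "\n")
```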

### Word Embeddings

You can train classic vanilla word embeddings by building a pipeline that contains a `WordLevel` tokenizer and an embedding model:

```python
from skembeddings.tokenizers import WordLevelTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline

embedding_pipe = EmbeddingPipeline(
    WordLevelTokenizer(),
    Word2VecEmbedding(n_components=100, algorithm="cbow")
)
embedding_pipe.fit(texts)
```
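Here, `texts` is simply an iterable of raw strings. For a quick experiment you could, for example, pull a small corpus from scikit-learn; the dataset choice below is purely illustrative:

```python
from sklearn.datasets import fetch_20newsgroups

# A small throwaway corpus of raw strings to fit the pipeline on.
texts = fetch_20newsgroups(subset="train").data[:1000]
```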

### Fasttext-like

You can train an embedding pipeline that uses subword information by choosing a tokenizer that produces subword units,
such as `Unigram`, `BPE` or `WordPiece`.
fastText also uses skip-gram by default, so let's switch to that.

```python
from skembeddings.tokenizers import UnigramTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline

embedding_pipe = EmbeddingPipeline(
    UnigramTokenizer(),
    Word2VecEmbedding(n_components=250, algorithm="sg")
)
embedding_pipe.fit(texts)
```

### Sense2Vec

We provide a spaCy tokenizer that can lemmatize tokens and append morphological information so you can get fine-grained
semantic information even on relatively small corpora. I recommend using this for literary analysis.

```python
from skembeddings.models import Word2VecEmbedding
from skembeddings.tokenizers import SpacyTokenizer
from skembeddings.pipeline import EmbeddingPipeline

# Single token pattern that lets alphabetical tokens pass, but not stopwords
pattern = [[{"IS_ALPHA": True, "IS_STOP": False}]]

# Build tokenizer that lemmatizes and appends POS-tags to the lemmas
tokenizer = SpacyTokenizer(
    "en_core_web_sm",
    out_attrs=("LEMMA", "UPOS"),
    patterns=pattern,
)

# Build a pipeline
embedding_pipeline = EmbeddingPipeline(
    tokenizer,
    Word2VecEmbedding(50, algorithm="cbow")
)

# Fitting pipeline on corpus
embedding_pipeline.fit(corpus)
```
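To build an intuition for what the embedding model sees, here is a rough pure-spaCy sketch of the lemma/POS tokens such a tokenizer produces. The exact joining format used by `SpacyTokenizer` may differ; this only illustrates the idea:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The protagonists were walking through the foggy streets.")

# Keep alphabetic, non-stopword tokens and tag each lemma with its POS.
tokens = [
    f"{token.lemma_}_{token.pos_}"
    for token in doc
    if token.is_alpha and not token.is_stop
]
print(tokens)  # e.g. ['protagonist_NOUN', 'walk_VERB', 'foggy_ADJ', 'street_NOUN']
```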

### Paragraph Embeddings

You can train Doc2Vec paragraph embeddings with your choice of tokenization.

```python
from skembeddings.tokenizers import WordPieceTokenizer
from skembeddings.models import ParagraphEmbedding
from skembeddings.pipeline import EmbeddingPipeline

embedding_pipe = EmbeddingPipeline(
    WordPieceTokenizer(),
    ParagraphEmbedding(n_components=250, algorithm="dm")
)
embedding_pipe.fit(texts)
```

### Iterative training

For large datasets you can train on individual chunks with `partial_fit()`.

```python
for chunk in text_chunks:
    embedding_pipe.partial_fit(chunk)
```

### Serialization

Pipelines can be safely serialized to disk:

```python
embedding_pipe.to_disk("output_folder/")

embedding_pipe = EmbeddingPipeline.from_disk("output_folder/")
```

Or published to the Hugging Face Hub:

```python
from huggingface_hub import login

login()
embedding_pipe.to_hub("username/name_of_pipeline")

embedding_pipe = EmbeddingPipeline.from_hub("username/name_of_pipeline")
```

### Text Classification

You can include an embedding model in your classification pipelines by adding a classification head on top.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y)

cls_pipe = make_pipeline(embedding_pipe, LogisticRegression())
cls_pipe.fit(X_train, y_train)

y_pred = cls_pipe.predict(X_test)
print(classification_report(y_test, y_pred))
```
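Because the result is an ordinary scikit-learn pipeline, the usual model-selection utilities should also apply, assuming the embedding components support cloning. For example, a quick cross-validated score:

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the full embedding + classifier pipeline.
scores = cross_val_score(cls_pipe, X, y, cv=5)
print(scores.mean(), scores.std())
```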


### Feature Extraction

If you intend to use the features produced by tokenizers in other text pipelines, such as topic models,
you can use `ListCountVectorizer` or `Joiner`.

Here's an example of an NMF topic model that uses lemmata enriched with POS tags.

```python
from sklearn.decomposition import NMF
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from skembeddings.tokenizers import SpacyTokenizer
from skembeddings.feature_extraction import ListCountVectorizer
from skembeddings.preprocessing import Joiner

# Single token pattern that lets alphabetical tokens pass, but not stopwords
pattern = [[{"IS_ALPHA": True, "IS_STOP": False}]]

# Build tokenizer that lemmatizes and appends POS-tags to the lemmas
tokenizer = SpacyTokenizer(
    "en_core_web_sm",
    out_attrs=("LEMMA", "UPOS"),
    patterns=pattern,
)

# Example with ListCountVectorizer
topic_pipeline = make_pipeline(
    tokenizer,
    ListCountVectorizer(),
    TfidfTransformer(), # tf-idf weighting (optional)
    NMF(15), # 15 topics in the model 
)

# Alternatively you can just join the tokens together with whitespace
topic_pipeline = make_pipeline(
    tokenizer,
    Joiner(),
    TfidfVectorizer(),
    NMF(15), 
)
```
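Once the second variant (with `Joiner` and `TfidfVectorizer`) has been fitted on a corpus of raw texts, the topics can be inspected with standard scikit-learn attributes. The step names below follow `make_pipeline`'s convention of lower-cased class names; `corpus` is a list of raw strings, as in the earlier examples:

```python
import numpy as np

topic_pipeline.fit(corpus)

# Look up the fitted vectorizer and NMF model inside the pipeline.
vectorizer = topic_pipeline.named_steps["tfidfvectorizer"]
nmf = topic_pipeline.named_steps["nmf"]
vocab = vectorizer.get_feature_names_out()

# Print the ten highest-weighted terms for each topic.
for topic_id, weights in enumerate(nmf.components_):
    top_terms = vocab[np.argsort(weights)[::-1][:10]]
    print(f"Topic {topic_id}: {', '.join(top_terms)}")
```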

            
