TuoTuo


| Field | Value |
| --- | --- |
| Name | TuoTuo |
| Version | 0.2.7 |
| download | |
| home_page | https://github.com/RobbenRibery/TuoTuo |
| Summary | LDA & Neura based topic modelling library |
| upload_time | 2023-05-16 21:46:43 |
| maintainer | |
| docs_url | None |
| author | tuotuo Superman |
| requires_python | |
| license | MIT |
| keywords | generative topic modelling, latent dirichlet allocation |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# TuoTuo
TuoTuo is a topic modelling library written in Python. TuoTuo is also a cute boy, my son, who is now 6 months old.

<br/>

## Installation 
Use the package manager [pip](https://pip.pypa.io/en/stable/) to install TuoTuo. You can find the PyPI distribution [here](https://pypi.org/project/TuoTuo/).

```bash
pip install TuoTuo --upgrade
```

## Usage 
Currently, the library only supports topic modelling via Latent Dirichlet Allocation (LDA). LDA can be fitted with either Gibbs sampling or variational inference; we choose the latter because it is mathematically more sophisticated.
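
For readers new to variational inference, the per-document update for smoothed LDA can be sketched as below. This is a minimal NumPy/SciPy illustration of the textbook coordinate-ascent updates, not TuoTuo's actual implementation; the function name `e_step`, its arguments, and the random inputs are all hypothetical.

```python
import numpy as np
from scipy.special import digamma


def e_step(word_ids, lam, alpha, n_iter=50):
    """Coordinate-ascent update of (gamma, phi) for a single document.

    word_ids : 1-D int array, vocabulary index of each token in the document
    lam      : (K, V) variational Dirichlet parameters of the topic-word distributions
    alpha    : scalar symmetric topic prior (an assumption of this sketch)
    """
    K = lam.shape[0]
    # E[log beta_kw] under q(beta_k) = Dirichlet(lam_k)
    e_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    gamma = np.full(K, alpha + len(word_ids) / K)  # common initialisation
    for _ in range(n_iter):
        # phi_nk is proportional to exp(E[log theta_k] + E[log beta_{k, w_n}]);
        # the -digamma(sum(gamma)) term is constant in k and cancels on normalisation
        log_phi = digamma(gamma)[None, :] + e_log_beta[:, word_ids].T
        log_phi -= log_phi.max(axis=1, keepdims=True)  # numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi


rng = np.random.default_rng(0)
lam = rng.gamma(shape=1.0, scale=1.0, size=(5, 40)) + 0.1  # (topics, vocabulary)
doc = rng.integers(0, 40, size=20)                         # 20 token ids
gamma, phi = e_step(doc, lam, alpha=1.0)
print(gamma.shape, phi.shape)
```

Each row of `phi` is a distribution over topics for one token, and `gamma` sums to `K * alpha` plus the document length, which makes a quick sanity check.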

- Generate some documents based on pre-defined Dirichlet Parameters over 5 different topics and 40 unique words

```python
import torch as tr  # assumption: `tr` in the original snippet is the PyTorch alias

from tuotuo.generator import doc_generator

gen = doc_generator(
    M=100,  # sample 100 documents
    L=20,   # each document contains 20 pre-defined words
    # use an exchangeable Dirichlet distribution as the topic prior,
    # that is, a uniform distribution over 5 topics
    topic_prior=tr.tensor([1, 1, 1, 1, 1], dtype=tr.double),
)
train_docs = gen.generate_doc()
```
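
The README does not show how `doc_generator` works internally, so the following is only a sketch of the standard LDA generative process in plain NumPy. The function `generate_docs` and the randomly drawn topic-word matrix `beta` are hypothetical and stand in for the library's pre-defined word lists.

```python
import numpy as np


def generate_docs(M, L, alpha, beta, seed=0):
    """Sample M documents of L tokens each from the LDA generative process.

    alpha : (K,) Dirichlet parameter of the per-document topic mixture
    beta  : (K, V) topic-word probabilities, each row sums to 1
    """
    rng = np.random.default_rng(seed)
    K, V = beta.shape
    docs = np.empty((M, L), dtype=int)
    for d in range(M):
        theta = rng.dirichlet(alpha)            # topic mixture of document d
        z = rng.choice(K, size=L, p=theta)      # topic assignment of each token
        for n in range(L):
            docs[d, n] = rng.choice(V, p=beta[z[n]])  # word drawn from its topic
    return docs


K, V = 5, 40
beta = np.random.default_rng(1).dirichlet(np.ones(V), size=K)  # hypothetical topics
docs = generate_docs(M=100, L=20, alpha=np.ones(K), beta=beta)
print(docs.shape)
```

With `alpha = np.ones(K)` the topic mixture of each document is drawn uniformly from the simplex, which matches the exchangeable prior used in the example above.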

- Form the training documents and fit the variational inference parameters of LDA

```python
from tuotuo.lda_model import LDASmoothed 
import matplotlib.pyplot as plt 

lda = LDASmoothed(
    num_topics = 5, 
)

perplexes = lda.fit(
    train_docs,
    sampling= False,
    verbose=True, 
    return_perplexities=True,
)
plt.plot(perplexes)
```

Output:

```text
Topic Dirichlet Prior, Alpha
1

Exchangeable Word Dirichlet Prior, Eta 
1

Var Inf - Word Dirichlet prior, Lambda
(5, 40)

Var Inf - Topic Dirichlet prior, Gamma
(100, 5)

Init perplexity = 84.99592157507153
End perplexity = 45.96696541539976
```
![Perplexity over 100 iterations](images/generated_doc_perplexities.png)
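
Perplexity is the exponentiated negative average log-likelihood per word, so lower is better. The README does not state exactly how TuoTuo computes the values above (they may be derived from the variational bound). The sketch below uses point estimates of the document-topic and topic-word probabilities instead, and the function `perplexity` is a hypothetical helper, not part of the library.

```python
import numpy as np


def perplexity(counts, theta, beta):
    """Per-word perplexity of a corpus under point estimates of LDA.

    counts : (D, V) document-term count matrix
    theta  : (D, K) document-topic probabilities, rows sum to 1
    beta   : (K, V) topic-word probabilities, rows sum to 1
    """
    word_probs = theta @ beta  # (D, V): probability of each word in each document
    log_lik = np.sum(counts * np.log(word_probs + 1e-12))  # small constant avoids log(0)
    return float(np.exp(-log_lik / counts.sum()))


D, K, V = 3, 2, 4
counts = np.ones((D, V))
theta = np.full((D, K), 1.0 / K)
beta = np.full((K, V), 1.0 / V)
print(perplexity(counts, theta, beta))
```

When every word is equally likely, the perplexity is approximately the vocabulary size (here about 4), which gives a useful upper reference point for a 40-word vocabulary.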


- Check out the top 5 words for each topic according to the variational inference parameter: $\lambda$ 

```python
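import numpy as np

for topic_index in range(lda._lambda_.shape[0]):
    # np.argsort sorts ascending, so the last five indices are the five
    # largest weights and "Top 5" below is the highest-weight word
    top5 = np.argsort(lda._lambda_[topic_index, :])[-5:]
    print(f"Topic {topic_index}")
    for i, idx in enumerate(top5):
        print(f"Top {i+1} -> {lda.train_doc.idx_to_vocab[idx]}")
    print()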
```

Output:

```text
Topic 0 
Top 1 -> physical
Top 2 -> quantum
Top 3 -> research
Top 4 -> scientst
Top 5 -> astrophysics

Topic 1
Top 1 -> divorce
Top 2 -> attorney
Top 3 -> court
Top 4 -> bankrupt
Top 5 -> contract

Topic 2
Top 1 -> content
Top 2 -> Craftsmanship
Top 3 -> concert
Top 4 -> asymmetrical
Top 5 -> Symmetrical

Topic 3
Top 1 -> recreation
Top 2 -> FIFA
Top 3 -> football
Top 4 -> Olympic
Top 5 -> athletics

Topic 4
Top 1 -> fever
Top 2 -> appetite
Top 3 -> contagious
Top 4 -> decongestant
Top 5 -> injection
```
From the top 5 words, we can easily infer the following mapping:

- Topic 0 -> science
- Topic 1 -> law
- Topic 2 -> art
- Topic 3 -> sport
- Topic 4 -> health
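
Because `np.argsort` sorts in ascending order, a small helper that returns the words best-first can make this inspection less error-prone. The sketch below is a hypothetical utility, not part of TuoTuo; it assumes the topic-word weights are available as a `(topics, vocabulary)` array and the vocabulary as an index-to-word mapping, and the toy inputs are made up.

```python
import numpy as np


def top_words(lam, idx_to_vocab, k=5):
    """Return, for each topic, the k highest-weight words, best first."""
    lam = np.asarray(lam)
    order = np.argsort(lam, axis=1)[:, ::-1][:, :k]  # descending by weight
    return [[idx_to_vocab[int(i)] for i in row] for row in order]


lam = np.array([[0.1, 3.0, 0.5, 2.0],
                [4.0, 0.2, 0.3, 1.0]])
vocab = {0: "court", 1: "quantum", 2: "fever", 3: "research"}
print(top_words(lam, vocab, k=2))
# [['quantum', 'research'], ['court', 'research']]
```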


## Contributing & References

Pull requests are welcome. For major changes, please open an issue first
to discuss what you would like to change.

As there is no mature topic modeling library available, we are also looking for collaborators who would like to contribute in the following directions: 

1. A variational inference version of batch and online LDA, following the [original LDA paper by David Blei et al. from 2003](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) and the [online LDA paper from NeurIPS](https://papers.nips.cc/paper_files/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf).

Most of the work for this part is complete; we still need to work on:
- computational optimization 
- online LDA implementation 
- efficient Newton's update on priors, namely $\alpha$ and $\eta$ 
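
To make the last item concrete, here is a minimal sketch of a Newton update for a scalar, exchangeable $\alpha$, given the variational parameters $\gamma$. It is not TuoTuo's code: the helper names are hypothetical, the iteration is carried out on $\log \alpha$ (a common trick to keep $\alpha$ positive), and a vector-valued prior would need the linear-time Hessian inversion described in the LDA paper, which is not shown here.

```python
import numpy as np
from scipy.special import digamma, polygamma


def expected_log_theta_sum(gamma):
    """Sum over documents and topics of E_q[log theta_dk], q = Dirichlet(gamma_d)."""
    return float(np.sum(digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))))


def newton_alpha(ss, D, K, alpha_init=1.0, n_iter=100, tol=1e-10):
    """Maximise the alpha-dependent part of the bound for a symmetric Dirichlet prior.

    ss : sufficient statistic, e.g. from expected_log_theta_sum(gamma)
    D  : number of documents, K : number of topics
    """
    log_a = np.log(alpha_init)
    for _ in range(n_iter):
        a = np.exp(log_a)
        grad = D * K * (digamma(K * a) - digamma(a)) + ss
        hess = D * (K**2 * polygamma(1, K * a) - K * polygamma(1, a))
        step = grad / (hess * a + grad)  # Newton step with respect to log(alpha)
        log_a -= step
        if abs(step) < tol:
            break
    return float(np.exp(log_a))


# Sanity check: build the statistic that a true alpha of 0.5 would produce
D, K, true_alpha = 100, 5, 0.5
ss = D * K * (digamma(true_alpha) - digamma(K * true_alpha))
alpha_hat = newton_alpha(ss, D, K)
print(alpha_hat)
```

In practice `ss` would come from `expected_log_theta_sum(gamma)` after each E-step; the synthetic statistic above is only there to show that the iteration recovers a known value.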

2. Extend the library to support neural variational inference, following this [ICML paper: Neural Variational Inference for Text Processing](https://arxiv.org/pdf/1511.06038.pdf).

3. Extend the training to support reinforcement learning, following this [ACL Anthology paper: Neural Topic Model with Reinforcement Learning](https://aclanthology.org/D19-1350.pdf).

## License

[MIT](https://choosealicense.com/licenses/mit/)

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/RobbenRibery/TuoTuo",
    "name": "TuoTuo",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "Generative Topic Modelling,Latent Dirichlet Allocation",
    "author": "tuotuo Superman",
    "author_email": "tuotuo@HanwellSquare.BigForce.com",
    "download_url": "https://github.com/RobbenRibery/TuoTuo/archive/refs/tags/pypi-0.2.7.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "LDA & Neura based topic modelling library",
    "version": "0.2.7",
    "project_urls": {
        "Download": "https://github.com/RobbenRibery/TuoTuo/archive/refs/tags/pypi-0.2.7.tar.gz",
        "Homepage": "https://github.com/RobbenRibery/TuoTuo"
    },
    "split_keywords": [
        "generative topic modelling",
        "latent dirichlet allocation"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b80b884abe6a83999c8ae71a1a5b6470a825319c3069c13cb21db5b3e714ca83",
                "md5": "cff1dde52b5f6f7dd300b2793c64d13d",
                "sha256": "f7fbca02365539d95b4c3563ee3881599e96d7b0c14fa63c11ab3de74d110b3a"
            },
            "downloads": -1,
            "filename": "TuoTuo-0.2.7-cp38-cp38-macosx_10_14_arm64.whl",
            "has_sig": false,
            "md5_digest": "cff1dde52b5f6f7dd300b2793c64d13d",
            "packagetype": "bdist_wheel",
            "python_version": "cp38",
            "requires_python": null,
            "size": 489786,
            "upload_time": "2023-05-16T21:46:43",
            "upload_time_iso_8601": "2023-05-16T21:46:43.054703Z",
            "url": "https://files.pythonhosted.org/packages/b8/0b/884abe6a83999c8ae71a1a5b6470a825319c3069c13cb21db5b3e714ca83/TuoTuo-0.2.7-cp38-cp38-macosx_10_14_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-05-16 21:46:43",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "RobbenRibery",
    "github_project": "TuoTuo",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "tuotuo"
}
        