tomotopy


:Name: tomotopy
:Version: 0.12.7
:Home page: https://github.com/bab2min/tomotopy
:Summary: Tomoto, Topic Modeling Tool for Python
:Upload time: 2023-12-18 15:25:46
:Author: bab2min
:License: MIT License
:Keywords: nlp, topic model

What is tomotopy?
------------------
`tomotopy` is a Python extension of `tomoto` (Topic Modeling Tool), a Gibbs-sampling based topic model library written in C++.
It uses the vectorization (SIMD) capabilities of modern CPUs to maximize speed.
The current version of `tomoto` supports several major topic models, including:

* Latent Dirichlet Allocation (`tomotopy.LDAModel`)
* Labeled LDA (`tomotopy.LLDAModel`)
* Partially Labeled LDA (`tomotopy.PLDAModel`)
* Supervised LDA (`tomotopy.SLDAModel`)
* Dirichlet Multinomial Regression (`tomotopy.DMRModel`)
* Generalized Dirichlet Multinomial Regression (`tomotopy.GDMRModel`)
* Hierarchical Dirichlet Process (`tomotopy.HDPModel`)
* Hierarchical LDA (`tomotopy.HLDAModel`)
* Multi Grain LDA (`tomotopy.MGLDAModel`) 
* Pachinko Allocation (`tomotopy.PAModel`)
* Hierarchical PA (`tomotopy.HPAModel`)
* Correlated Topic Model (`tomotopy.CTModel`)
* Dynamic Topic Model (`tomotopy.DTModel`)
* Pseudo-document based Topic Model (`tomotopy.PTModel`).

.. image:: https://badge.fury.io/py/tomotopy.svg

Getting Started
---------------
You can install tomotopy easily using pip. (https://pypi.org/project/tomotopy/)
::

    $ pip install --upgrade pip
    $ pip install tomotopy

The supported OS and Python versions are:

* Linux (x86-64) with Python >= 3.6 
* macOS >= 10.13 with Python >= 3.6
* Windows 7 or later (x86, x86-64) with Python >= 3.6
* Other OS with Python >= 3.6: Compilation from source code required (with a C++14-compatible compiler)

After installation, you can start using tomotopy simply by importing it.
::

    import tomotopy as tp
    print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'

Currently, tomotopy can exploit the AVX2, AVX or SSE2 SIMD instruction sets to maximize performance.
When the package is imported, it checks the available instruction sets and selects the best option.
If `tp.isa` reports `none`, training iterations may take a long time.
However, since most modern Intel and AMD CPUs provide a SIMD instruction set, SIMD acceleration is usually available and can yield a large improvement.

Here is sample code for simple LDA training on texts from the file 'sample.txt'.
::

    import tomotopy as tp
    mdl = tp.LDAModel(k=20)
    for line in open('sample.txt'):
        mdl.add_doc(line.strip().split())
    
    for i in range(0, 100, 10):
        mdl.train(10)
        print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))
    
    for k in range(mdl.k):
        print('Top 10 words of topic #{}'.format(k))
        print(mdl.get_topic_words(k, top_n=10))
    
    mdl.summary()
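
The per-word log-likelihood printed during training is often easier to read as a perplexity. The following is a minimal sketch, assuming the usual definition of perplexity as the exponential of the negative per-word log-likelihood. The helper name `perplexity_from_ll` and the input values are hypothetical and not part of `tomotopy`.

```python
import math

def perplexity_from_ll(ll_per_word: float) -> float:
    """Convert a per-word log-likelihood into a perplexity value."""
    return math.exp(-ll_per_word)

# Example values only, not measurements from a real model.
for ll in (-9.0, -8.5, -8.0):
    print('ll_per_word: {:.1f}\tperplexity: {:.1f}'.format(ll, perplexity_from_ll(ll)))
```

A lower perplexity corresponds to a higher log-likelihood. A trained model also exposes `tomotopy.LDAModel.perplexity`, so this helper only shows how the two numbers relate under the assumed definition.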

Performance of tomotopy
-----------------------
`tomotopy` uses Collapsed Gibbs Sampling (CGS) to infer the distribution of topics and the distribution of words.
Generally, CGS converges more slowly than the Variational Bayes (VB) that [gensim's LdaModel] uses, but each of its iterations can be computed much faster.
In addition, `tomotopy` can take advantage of multicore CPUs with a SIMD instruction set, which can result in faster iterations.

[gensim's LdaModel]: https://radimrehurek.com/gensim/models/ldamodel.html 

The following charts compare the running time of the LDA model in `tomotopy` and `gensim`.
The input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB).
`tomotopy` was trained for 200 iterations and `gensim` for 10 iterations.

.. image:: https://bab2min.github.io/tomotopy/images/tmt_i5.png

↑ Performance in Intel i5-6600, x86-64 (4 cores)

.. image:: https://bab2min.github.io/tomotopy/images/tmt_xeon.png

↑ Performance in Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)

.. image:: https://bab2min.github.io/tomotopy/images/tmt_r7_3700x.png

↑ Performance in AMD Ryzen7 3700X, x86-64 (8 cores, 16 threads)

Although `tomotopy` ran 20 times more iterations, its overall running time was 5 to 10 times shorter than that of `gensim`, and it yields a stable result.

It is difficult to compare CGS and VB directly because they are totally different techniques.
But from a practical point of view, we can compare their speed and their results.
The following chart shows the log-likelihood per word of the two models' results.

.. image:: https://bab2min.github.io/tomotopy/images/LLComp.png




The SIMD instruction set has a great effect on performance. The following chart compares the SIMD instruction sets.

.. image:: https://bab2min.github.io/tomotopy/images/SIMDComp.png

Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.

Vocabulary controlling using CF and DF
---------------------------------------
CF (collection frequency) and DF (document frequency) are concepts used in information retrieval.
CF is the total number of times a word appears in the corpus,
and DF is the number of documents in the corpus in which the word appears.
`tomotopy` exposes these two measures as the parameters `min_cf` and `min_df`, which trim low-frequency words when building the corpus.

For example, let's say we have 5 documents #0 ~ #4 which are composed of the following words:
::

    #0 : a, b, c, d, e, c
    #1 : a, b, e, f
    #2 : c, d, c
    #3 : a, e, f, g
    #4 : a, b, g

The CF of `a` and the CF of `c` are both 4 because each appears 4 times in the entire corpus.
But the DF of `a` is 4 and the DF of `c` is 2 because `a` appears in #0, #1, #3 and #4, while `c` appears only in #0 and #2.
So if we trim low-frequency words using `min_cf=3`, the result is as follows:
::

    (d, f and g are removed.)
    #0 : a, b, c, e, c
    #1 : a, b, e
    #2 : c, c
    #3 : a, e
    #4 : a, b

However, when `min_df=3`, the result is:
::

    (c, d, f and g are removed.)
    #0 : a, b, e
    #1 : a, b, e
    #2 : (empty doc)
    #3 : a, e
    #4 : a, b

As we can see, `min_df` is a stronger criterion than `min_cf`. 
In performing topic modeling, words that appear repeatedly in only one document do not contribute to estimating the topic-word distribution. 
So, removing words with low `df` is a good way to reduce model size while preserving the results of the final model.
In short, please prefer using `min_df` to `min_cf`.
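
The counts in the example above can be reproduced with a few lines of plain Python. This is only an illustrative sketch that does not use `tomotopy`. It assumes that a word is kept when its frequency is at least the threshold, which is consistent with the example (`b` has a CF of 3 and survives `min_cf=3`). The helper name `trim` is hypothetical.

```python
from collections import Counter

docs = [
    ['a', 'b', 'c', 'd', 'e', 'c'],  # document #0
    ['a', 'b', 'e', 'f'],            # document #1
    ['c', 'd', 'c'],                 # document #2
    ['a', 'e', 'f', 'g'],            # document #3
    ['a', 'b', 'g'],                 # document #4
]

# CF: total number of occurrences of each word in the corpus.
cf = Counter(w for doc in docs for w in doc)
# DF: number of documents that contain each word at least once.
df = Counter(w for doc in docs for w in set(doc))

def trim(docs, freq, threshold):
    """Keep only the words whose frequency is at least `threshold`."""
    return [[w for w in doc if freq[w] >= threshold] for doc in docs]

print('CF:', cf['a'], cf['c'])
print('DF:', df['a'], df['c'])
print('min_cf=3:', trim(docs, cf, 3))
print('min_df=3:', trim(docs, df, 3))
```

With `min_df=3`, document #2 becomes empty, which matches the trimmed corpus shown above.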

Model Save and Load
-------------------
`tomotopy` provides `save` and `load` methods for each topic model class,
so you can save a model to a file whenever you want and reload it from that file later.
::

    import tomotopy as tp
    
    mdl = tp.HDPModel()
    for line in open('sample.txt'):
        mdl.add_doc(line.strip().split())
    
    for i in range(0, 100, 10):
        mdl.train(10)
        print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))
    
    # save into file
    mdl.save('sample_hdp_model.bin')
    
    # load from file
    mdl = tp.HDPModel.load('sample_hdp_model.bin')
    for k in range(mdl.k):
        if not mdl.is_live_topic(k): continue
        print('Top 10 words of topic #{}'.format(k))
        print(mdl.get_topic_words(k, top_n=10))
    
    # the saved model is HDP model, 
    # so when you load it by LDA model, it will raise an exception
    mdl = tp.LDAModel.load('sample_hdp_model.bin')

When you load a model from a file, the model type stored in the file must match the class whose `load` method you call.

See more at `tomotopy.LDAModel.save` and `tomotopy.LDAModel.load` methods.

Documents in the Model and out of the Model
-------------------------------------------
Topic models can be used for two major purposes.
The basic one is to discover topics from a set of documents as the result of a trained model,
and the more advanced one is to infer topic distributions for unseen documents by using a trained model.

We call a document used for the former purpose (model training) a **document in the model**,
and a document used for the latter purpose (unseen during training) a **document out of the model**.

In `tomotopy`, these two kinds of document are created differently.
A **document in the model** is created by the `tomotopy.LDAModel.add_doc` method.
`add_doc` can be called only before `tomotopy.LDAModel.train` starts.
In other words, after `train` has been called, `add_doc` cannot add a document to the model because the set of documents used for training has become fixed.

To acquire the instance of the created document, you should use `tomotopy.LDAModel.docs` like:

::

    mdl = tp.LDAModel(k=20)
    idx = mdl.add_doc(words)
    if idx < 0: raise RuntimeError("Failed to add doc")
    doc_inst = mdl.docs[idx]
    # doc_inst is an instance of the added document

A **document out of the model** is created by the `tomotopy.LDAModel.make_doc` method. `make_doc` should be called only after `train` starts.
If you use `make_doc` before the set of documents used for training has become fixed, you may get wrong results.
Since `make_doc` returns the instance directly, you can use its return value for further operations.

::

    mdl = tp.LDAModel(k=20)
    # add_doc ...
    mdl.train(100)
    doc_inst = mdl.make_doc(unseen_doc) # doc_inst is an instance of the unseen document

Inference for Unseen Documents
------------------------------
If a new document is created by `tomotopy.LDAModel.make_doc`, its topic distribution can be inferred by the model.
Inference for an unseen document should be performed using the `tomotopy.LDAModel.infer` method.

::

    mdl = tp.LDAModel(k=20)
    # add_doc ...
    mdl.train(100)
    doc_inst = mdl.make_doc(unseen_doc)
    topic_dist, ll = mdl.infer(doc_inst)
    print("Topic Distribution for Unseen Docs: ", topic_dist)
    print("Log-likelihood of inference: ", ll)

The `infer` method accepts either a single instance of `tomotopy.Document` or a `list` of such instances.
See more at `tomotopy.LDAModel.infer`.
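
For a single document, the first value returned by `infer` is a sequence of per-topic probabilities. A common next step is to rank the topics by probability. The sketch below shows one way to do that in plain Python. The helper `top_topics` and the example distribution are hypothetical, and the numbers are made up rather than produced by a model.

```python
def top_topics(topic_dist, n=3):
    """Return the `n` most probable (topic_id, probability) pairs."""
    ranked = sorted(enumerate(topic_dist), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Made-up distribution over 5 topics, standing in for `topic_dist` from `infer`.
example_dist = [0.05, 0.60, 0.10, 0.20, 0.05]
for topic_id, prob in top_topics(example_dist, n=2):
    print('topic #{}: {:.2f}'.format(topic_id, prob))
```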

Corpus and transform
--------------------
Every topic model in `tomotopy` has its own internal document type.
A document of the type suitable for each model can be created and added through that model's `add_doc` method.
However, adding the same list of documents to several models becomes quite inconvenient,
because `add_doc` must be called for every document once per model.
Thus, `tomotopy` provides the `tomotopy.utils.Corpus` class, which holds a list of documents.
A `tomotopy.utils.Corpus` can be inserted into any model by passing it as the argument `corpus` to the `__init__` or `add_corpus` method of the model.
Inserting a `tomotopy.utils.Corpus` has the same effect as inserting the documents that the corpus holds.

Some topic models require different data for their documents.
For example, `tomotopy.DMRModel` requires the argument `metadata` of type `str`,
but `tomotopy.PLDAModel` requires the argument `labels` of type `List[str]`.
Since `tomotopy.utils.Corpus` holds an independent set of documents rather than being tied to a specific topic model,
the data types required by a topic model may be inconsistent with the corpus when it is added to that topic model.
In this case, the miscellaneous data can be transformed to fit the target topic model using the argument `transform`.
See the following code for details:

::

    from tomotopy import DMRModel
    from tomotopy.utils import Corpus

    corpus = Corpus()
    corpus.add_doc("a b c d e".split(), a_data=1)
    corpus.add_doc("e f g h i".split(), a_data=2)
    corpus.add_doc("i j k l m".split(), a_data=3)

    model = DMRModel(k=10)
    model.add_corpus(corpus) 
    # You lose `a_data` field in `corpus`, 
    # and `metadata` that `DMRModel` requires is filled with the default value, empty str.

    assert model.docs[0].metadata == ''
    assert model.docs[1].metadata == ''
    assert model.docs[2].metadata == ''

    def transform_a_data_to_metadata(misc: dict):
        return {'metadata': str(misc['a_data'])}
    # this function transforms `a_data` to `metadata`

    model = DMRModel(k=10)
    model.add_corpus(corpus, transform=transform_a_data_to_metadata)
    # Now the docs in `model` have non-default `metadata`, generated from the `a_data` field.

    assert model.docs[0].metadata == '1'
    assert model.docs[1].metadata == '2'
    assert model.docs[2].metadata == '3'
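
Because a `transform` is an ordinary function from a `dict` to a `dict`, it can be checked on its own before it is passed to `add_corpus`. The following is a small sketch, assuming that some documents may lack the `a_data` field. The function name and the fallback to an empty string are illustrative choices, not library behavior.

```python
def a_data_to_metadata(misc: dict) -> dict:
    """Map the corpus field `a_data` to the `metadata` argument, with a fallback."""
    value = misc.get('a_data')
    # Fall back to an empty string when the document has no `a_data` field.
    return {'metadata': '' if value is None else str(value)}

print(a_data_to_metadata({'a_data': 1}))
print(a_data_to_metadata({}))
```

Such a function would then be passed as `transform=a_data_to_metadata`, in the same way as in the example above.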


Parallel Sampling Algorithms
----------------------------
Since version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm.
The algorithm provided in versions up to 0.4.2 is `COPY_MERGE`, which is available for all topic models.
The new algorithm `PARTITION`, available since 0.5.0, generally makes training faster and more memory-efficient, but it is not available for all topic models.

The following chart shows the speed difference between the two algorithms based on the number of topics and the number of workers.

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp.png

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png

Performance by Version
----------------------
Performance changes by version are shown in the following graph. 
The time taken to train the LDA model for 1000 iterations was measured.
(Docs: 11314, Vocab: 60382, Words: 2364724, Intel Xeon Gold 5120 @2.2GHz)

.. image:: https://bab2min.github.io/tomotopy/images/lda-perf-t1.png

.. image:: https://bab2min.github.io/tomotopy/images/lda-perf-t4.png

.. image:: https://bab2min.github.io/tomotopy/images/lda-perf-t8.png

Pinning Topics using Word Priors
--------------------------------
Since version 0.6.0, the method `tomotopy.LDAModel.set_word_prior` has been available. It allows you to control the word prior for each topic.
For example, the following code sets the weight of the word 'church' to 1.0 in topic 0 and to 0.1 in the rest of the topics.
This means that the probability that the word 'church' is assigned to topic 0 is 10 times higher than the probability that it is assigned to any other topic.
Therefore, most occurrences of 'church' are assigned to topic 0, so topic 0 comes to contain many words related to 'church'.
This makes it possible to pin a particular topic to a specific topic number.

::

    import tomotopy as tp
    mdl = tp.LDAModel(k=20)
    
    # add documents into `mdl`

    # setting word prior
    mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])

See `word_prior_example` in `example.py` for more details.
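
When several words need to be pinned, the weight list can be built by a small helper. The sketch below is plain Python. The function `pinned_prior` and its parameter names are hypothetical, and the default weights simply repeat the 1.0 and 0.1 used above.

```python
def pinned_prior(k: int, pinned_topic: int, high: float = 1.0, low: float = 0.1):
    """Build a per-topic weight list that favors `pinned_topic`."""
    return [high if t == pinned_topic else low for t in range(k)]

prior = pinned_prior(k=20, pinned_topic=0)
print(prior[:3])
# Ratio between the pinned topic's weight and any other topic's weight.
print(prior[0] / prior[1])
```

The resulting list can be passed as the second argument of `mdl.set_word_prior`, as in the example above.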

Examples
--------
You can find example Python code for tomotopy at https://github.com/bab2min/tomotopy/blob/main/examples/ .

You can also get the data file used in the example code at https://drive.google.com/file/d/18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J/view .

License
---------
`tomotopy` is licensed under the terms of MIT License, 
meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

History
-------
* 0.12.7 (2023-12-19)
    * New features
        * Added Topic Model Viewer `tomotopy.viewer.open_viewer()`
        * Optimized the performance of `tomotopy.utils.Corpus.process()`
    * Bug fixes
        * `Document.span` now returns the ranges in character unit, not in byte unit.

* 0.12.6 (2023-12-11)
    * New features
        * Added some convenience features to `tomotopy.LDAModel.train` and `tomotopy.LDAModel.set_word_prior`.
        * `LDAModel.train` now has new arguments `callback`, `callback_interval` and `show_progress` to monitor the training progress.
        * `LDAModel.set_word_prior` now can accept `Dict[int, float]` type as its argument `prior`.

* 0.12.5 (2023-08-03)
    * New features
        * Added support for Linux ARM64 architecture.

* 0.12.4 (2023-01-22)
    * New features
        * Added support for macOS ARM64 architecture.
    * Bug fixes
        * Fixed an issue where `tomotopy.Document.get_sub_topic_dist()` raises a bad argument exception.
        * Fixed an issue where exception raising sometimes causes crashes.

* 0.12.3 (2022-07-19)
    * New features
        * Now, inserting an empty document using `tomotopy.LDAModel.add_doc()` just ignores it instead of raising an exception. If the newly added argument `ignore_empty_words` is set to False, an exception is raised as before.
        * `tomotopy.HDPModel.purge_dead_topics()` method is added to remove non-live topics from the model.
    * Bug fixes
        * Fixed an issue that prevents setting user defined values for nuSq in `tomotopy.SLDAModel` (by @jucendrero).
        * Fixed an issue where `tomotopy.utils.Coherence` did not work for `tomotopy.DTModel`.
        * Fixed an issue that often crashed when calling `make_dic()` before calling `train()`.
        * Resolved the problem that the results of `tomotopy.DMRModel` and `tomotopy.GDMRModel` are different even when the seed is fixed.
        * The parameter optimization process of `tomotopy.DMRModel` and `tomotopy.GDMRModel` has been improved.
        * Fixed an issue that sometimes crashed when calling `tomotopy.PTModel.copy()`.

* 0.12.2 (2021-09-06)
    * An issue where calling `convert_to_lda` of `tomotopy.HDPModel` with `min_cf > 0`, `min_df > 0` or `rm_top > 0` causes a crash has been fixed.
    * A new argument `from_pseudo_doc` is added to `tomotopy.Document.get_topics` and `tomotopy.Document.get_topic_dist`.
      This argument is only valid for documents of `PTModel`; it controls the source used for computing the topic distribution.
    * A default value for argument `p` of `tomotopy.PTModel` has been changed. The new default value is `k * 10`.
    * Using documents generated by `make_doc` without calling `infer` no longer causes a crash, but just prints warning messages.
    * An issue where the internal C++ code failed to compile in a clang C++17 environment has been fixed.

* 0.12.1 (2021-06-20)
    * An issue where `tomotopy.LDAModel.set_word_prior()` causes a crash has been fixed.
    * Now `tomotopy.LDAModel.perplexity` and `tomotopy.LDAModel.ll_per_word` return the accurate value when `TermWeight` is not `ONE`.
    * `tomotopy.LDAModel.used_vocab_weighted_freq` was added, which returns term-weighted frequencies of words.
    * Now `tomotopy.LDAModel.summary()` shows not only the entropy of words, but also the entropy of term-weighted words.

* 0.12.0 (2021-04-26)
    * Now `tomotopy.DMRModel` and `tomotopy.GDMRModel` support multiple values of metadata (see https://github.com/bab2min/tomotopy/blob/main/examples/dmr_multi_label.py )
    * The performance of `tomotopy.GDMRModel` was improved.
    * A `copy()` method has been added for all topic models to do a deep copy.
    * An issue was fixed where words that are excluded from training (by `min_cf`, `min_df`) have incorrect topic id. Now all excluded words have `-1` as topic id.
    * Now all exceptions and warnings generated by `tomotopy` follow standard Python types.
    * Compiler requirements have been raised to C++14.

* 0.11.1 (2021-03-28)
    * A critical bug of asymmetric alphas was fixed. Due to this bug, version 0.11.0 has been removed from releases.

* 0.11.0 (2021-03-26) (removed)
    * A new topic model `tomotopy.PTModel` for short texts was added into the package.
    * An issue was fixed where `tomotopy.HDPModel.infer` causes a segmentation fault sometimes.
    * A mismatch of numpy API version was fixed.
    * Now asymmetric document-topic priors are supported.
    * Serializing topic models to `bytes` in memory is supported.
    * An argument `normalize` was added to `get_topic_dist()`, `get_topic_word_dist()` and `get_sub_topic_dist()` for controlling normalization of results.
    * Now `tomotopy.DMRModel.lambdas` and `tomotopy.DMRModel.alpha` give correct values.
    * Categorical metadata supports for `tomotopy.GDMRModel` were added (see https://github.com/bab2min/tomotopy/blob/main/examples/gdmr_both_categorical_and_numerical.py ).
    * Python3.5 support was dropped.

* 0.10.2 (2021-02-16)
    * An issue was fixed where `tomotopy.CTModel.train` fails with large K.
    * An issue was fixed where `tomotopy.utils.Corpus` loses their `uid` values.

* 0.10.1 (2021-02-14)
    * An issue was fixed where `tomotopy.utils.Corpus.extract_ngrams` crashes with empty input.
    * An issue was fixed where `tomotopy.LDAModel.infer` raises exception with valid input.
    * An issue was fixed where `tomotopy.HLDAModel.infer` generates wrong `tomotopy.Document.path`.
    * Since a new parameter `freeze_topics` for `tomotopy.HLDAModel.train` was added, you can control whether to create a new topic or not when training.

* 0.10.0 (2020-12-19)
    * The interfaces of `tomotopy.utils.Corpus` and of `tomotopy.LDAModel.docs` were unified. Now you can access the documents in a corpus in the same manner.
    * __getitem__ of `tomotopy.utils.Corpus` was improved. In addition to indexing by int, indexing by Iterable[int] and slicing are supported. Indexing by uid is also supported.
    * New methods `tomotopy.utils.Corpus.extract_ngrams` and `tomotopy.utils.Corpus.concat_ngrams` were added. They extract n-gram collocations using PMI and concatenate them into single words.
    * A new method `tomotopy.LDAModel.add_corpus` was added, and `tomotopy.LDAModel.infer` can receive corpus as input. 
    * A new module `tomotopy.coherence` was added. It provides the way to calculate coherence of the model.
    * A parameter `window_size` was added to `tomotopy.label.FoRelevance`.
    * An issue was fixed where NaN often occurs when training `tomotopy.HDPModel`.
    * Now Python3.9 is supported.
    * A dependency to py-cpuinfo was removed and the initializing of the module was improved.

* 0.9.1 (2020-08-08)
    * Memory leaks in version 0.9.0 were fixed.
    * `tomotopy.CTModel.summary()` was fixed.

* 0.9.0 (2020-08-04)
    * The `tomotopy.LDAModel.summary()` method, which prints human-readable summary of the model, has been added.
    * The random number generator of package has been replaced with [EigenRand]. It speeds up the random number generation and solves the result difference between platforms.
    * Due to above, even if `seed` is the same, the model training result may be different from the version before 0.9.0.
    * Fixed a training error in `tomotopy.HDPModel`.
    * `tomotopy.DMRModel.alpha` now shows Dirichlet prior of per-document topic distribution by metadata.
    * `tomotopy.DTModel.get_count_by_topics()` has been modified to return a 2-dimensional `ndarray`.
    * `tomotopy.DTModel.alpha` has been modified to return the same value as `tomotopy.DTModel.get_alpha()`.
    * Fixed an issue where the `metadata` value could not be obtained for the document of `tomotopy.GDMRModel`.
    * `tomotopy.HLDAModel.alpha` now shows Dirichlet prior of per-document depth distribution.
    * `tomotopy.LDAModel.global_step` has been added.
    * `tomotopy.MGLDAModel.get_count_by_topics()` now returns the word count for both global and local topics.
    * `tomotopy.PAModel.alpha`, `tomotopy.PAModel.subalpha`, and `tomotopy.PAModel.get_count_by_super_topic()` have been added.

[EigenRand]: https://github.com/bab2min/EigenRand

* 0.8.2 (2020-07-14)
    * New properties `tomotopy.DTModel.num_timepoints` and `tomotopy.DTModel.num_docs_by_timepoint` have been added.
    * A bug which causes different results with the different platform even if `seeds` were the same was partially fixed. 
      As a result of this fix, now `tomotopy` in 32 bit yields different training results from earlier version.

* 0.8.1 (2020-06-08)
    * A bug where `tomotopy.LDAModel.used_vocabs` returned an incorrect value was fixed.
    * Now `tomotopy.CTModel.prior_cov` returns a covariance matrix with shape `[k, k]`.
    * Now `tomotopy.CTModel.get_correlations` with empty arguments returns a correlation matrix with shape `[k, k]`.

* 0.8.0 (2020-06-06)
    * Since NumPy was introduced in tomotopy, many methods and properties of tomotopy return not just `list`, but `numpy.ndarray` now.
    * Tomotopy has a new dependency `NumPy >= 1.10.0`.
    * A wrong estimation of `tomotopy.HDPModel.infer` was fixed.
    * A new method about converting HDPModel to LDAModel was added.
    * New properties including `tomotopy.LDAModel.used_vocabs`, `tomotopy.LDAModel.used_vocab_freq` and `tomotopy.LDAModel.used_vocab_df` were added into topic models.
    * A new g-DMR topic model(`tomotopy.GDMRModel`) was added.
    * An error at initializing `tomotopy.label.FoRelevance` in macOS was fixed.
    * An error that occurred when using `tomotopy.utils.Corpus` created without `raw` parameters was fixed.

* 0.7.1 (2020-05-08)
    * `tomotopy.Document.path` was added for `tomotopy.HLDAModel`.
    * A memory corruption bug in `tomotopy.label.PMIExtractor` was fixed.
    * A compile error in gcc 7 was fixed.

* 0.7.0 (2020-04-18)
    * `tomotopy.DTModel` was added into the package.
    * A bug in `tomotopy.utils.Corpus.save` was fixed.
    * A new method `tomotopy.Document.get_count_vector` was added into Document class.
    * Now linux distributions use manylinux2010 and an additional optimization is applied.

* 0.6.2 (2020-03-28)
    * A critical bug related to `save` and `load` was fixed. Version 0.6.0 and 0.6.1 have been removed from releases.

* 0.6.1 (2020-03-22) (removed)
    * A bug related to module loading was fixed.

* 0.6.0 (2020-03-22) (removed)
    * `tomotopy.utils.Corpus` class that manages multiple documents easily was added.
    * `tomotopy.LDAModel.set_word_prior` method that controls word-topic priors of topic models was added.
    * A new argument `min_df` that filters words based on document frequency was added into every topic model's __init__.
    * `tomotopy.label`, the submodule about topic labeling was added. Currently, only `tomotopy.label.FoRelevance` is provided.

* 0.5.2 (2020-03-01)
    * A segmentation fault problem was fixed in `tomotopy.LLDAModel.add_doc`.
    * A bug was fixed that `infer` of `tomotopy.HDPModel` sometimes crashes the program.
    * A crash issue was fixed of `tomotopy.LDAModel.infer` with ps=tomotopy.ParallelScheme.PARTITION, together=True.

* 0.5.1 (2020-01-11)
    * A bug was fixed that `tomotopy.SLDAModel.make_doc` doesn't support missing values for `y`.
    * Now `tomotopy.SLDAModel` fully supports missing values for response variables `y`. Documents with missing values (NaN) are included in modeling topic, but excluded from regression of response variables.

* 0.5.0 (2019-12-30)
    * Now `tomotopy.PAModel.infer` returns both topic distribution and sub-topic distribution.
    * New methods get_sub_topics and get_sub_topic_dist were added into `tomotopy.Document`. (for PAModel)
    * New parameter `parallel` was added for `tomotopy.LDAModel.train` and `tomotopy.LDAModel.infer` method. You can select parallelism algorithm by changing this parameter.
    * `tomotopy.ParallelScheme.PARTITION`, a new algorithm, was added. It works efficiently when the number of workers is large, the number of topics or the size of vocabulary is big.
    * A bug where `rm_top` didn't work at `min_cf` < 2 was fixed.

* 0.4.2 (2019-11-30)
    * Wrong topic assignments of `tomotopy.LLDAModel` and `tomotopy.PLDAModel` were fixed.
    * Readable __repr__ of `tomotopy.Document` and `tomotopy.Dictionary` was implemented.

* 0.4.1 (2019-11-27)
    * A bug at init function of `tomotopy.PLDAModel` was fixed.

* 0.4.0 (2019-11-18)
    * New models including `tomotopy.PLDAModel` and `tomotopy.HLDAModel` were added into the package.

* 0.3.1 (2019-11-05)
    * An issue where `get_topic_dist()` returns incorrect value when `min_cf` or `rm_top` is set was fixed.
    * The return value of `get_topic_dist()` of `tomotopy.MGLDAModel` document was fixed to include local topics.
    * The estimation speed with `tw=ONE` was improved.

* 0.3.0 (2019-10-06)
    * A new model, `tomotopy.LLDAModel` was added into the package.
    * A crashing issue of `HDPModel` was fixed.
    * Since hyperparameter estimation for `HDPModel` was implemented, the result of `HDPModel` may differ from previous versions.
        If you want to turn off hyperparameter estimation of HDPModel, set `optim_interval` to zero.

* 0.2.0 (2019-08-18)
    * New models including `tomotopy.CTModel` and `tomotopy.SLDAModel` were added into the package.
    * A new parameter option `rm_top` was added for all topic models.
    * The problems in `save` and `load` method for `PAModel` and `HPAModel` were fixed.
    * An occasional crash in loading `HDPModel` was fixed.
    * The problem that `ll_per_word` was calculated incorrectly when `min_cf` > 0 was fixed.

* 0.1.6 (2019-08-09)
    * Compiling errors at clang with macOS environment were fixed.

* 0.1.4 (2019-08-05)
    * The issue when `add_doc` receives an empty list as input was fixed.
    * The issue that `tomotopy.PAModel.get_topic_words` doesn't extract the word distribution of subtopic was fixed.

* 0.1.3 (2019-05-19)
    * The parameter `min_cf` and its stopword-removing function were added for all topic models.

* 0.1.0 (2019-05-12)
    * First version of **tomotopy**

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/bab2min/tomotopy",
    "name": "tomotopy",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "NLP,Topic Model",
    "author": "bab2min",
    "author_email": "bab2min@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/e7/67/c5183deb13264ffb3b83866ffef7606f6b35cf8880b1770bfc85dccd618f/tomotopy-0.12.7.tar.gz",
    "platform": null,
    "description": "What is tomotopy?\n------------------\n`tomotopy` is a Python extension of `tomoto` (Topic Modeling Tool) which is a Gibbs-sampling based topic model library written in C++.\nIt utilizes a vectorization of modern CPUs for maximizing speed. \nThe current version of `tomoto` supports several major topic models including \n\n* Latent Dirichlet Allocation (`tomotopy.LDAModel`)\n* Labeled LDA (`tomotopy.LLDAModel`)\n* Partially Labeled LDA (`tomotopy.PLDAModel`)\n* Supervised LDA (`tomotopy.SLDAModel`)\n* Dirichlet Multinomial Regression (`tomotopy.DMRModel`)\n* Generalized Dirichlet Multinomial Regression (`tomotopy.GDMRModel`)\n* Hierarchical Dirichlet Process (`tomotopy.HDPModel`)\n* Hierarchical LDA (`tomotopy.HLDAModel`)\n* Multi Grain LDA (`tomotopy.MGLDAModel`) \n* Pachinko Allocation (`tomotopy.PAModel`)\n* Hierarchical PA (`tomotopy.HPAModel`)\n* Correlated Topic Model (`tomotopy.CTModel`)\n* Dynamic Topic Model (`tomotopy.DTModel`)\n* Pseudo-document based Topic Model (`tomotopy.PTModel`).\n\n.. image:: https://badge.fury.io/py/tomotopy.svg\n\nGetting Started\n---------------\nYou can install tomotopy easily using pip. (https://pypi.org/project/tomotopy/)\n::\n\n    $ pip install --upgrade pip\n    $ pip install tomotopy\n\nThe supported OS and Python versions are:\n\n* Linux (x86-64) with Python >= 3.6 \n* macOS >= 10.13 with Python >= 3.6\n* Windows 7 or later (x86, x86-64) with Python >= 3.6\n* Other OS with Python >= 3.6: Compilation from source code required (with c++14 compatible compiler)\n\nAfter installing, you can start tomotopy by just importing.\n::\n\n    import tomotopy as tp\n    print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'\n\nCurrently, tomotopy can exploits AVX2, AVX or SSE2 SIMD instruction set for maximizing performance.\nWhen the package is imported, it will check available instruction sets and select the best option.\nIf `tp.isa` tells `none`, iterations of training may take a long time. 
\nBut, since most of modern Intel or AMD CPUs provide SIMD instruction set, the SIMD acceleration could show a big improvement.\n\nHere is a sample code for simple LDA training of texts from 'sample.txt' file.\n::\n\n    import tomotopy as tp\n    mdl = tp.LDAModel(k=20)\n    for line in open('sample.txt'):\n        mdl.add_doc(line.strip().split())\n    \n    for i in range(0, 100, 10):\n        mdl.train(10)\n        print('Iteration: {}\\tLog-likelihood: {}'.format(i, mdl.ll_per_word))\n    \n    for k in range(mdl.k):\n        print('Top 10 words of topic #{}'.format(k))\n        print(mdl.get_topic_words(k, top_n=10))\n    \n    mdl.summary()\n\nPerformance of tomotopy\n-----------------------\n`tomotopy` uses Collapsed Gibbs-Sampling(CGS) to infer the distribution of topics and the distribution of words.\nGenerally CGS converges more slowly than Variational Bayes(VB) that [gensim's LdaModel] uses, but its iteration can be computed much faster.\nIn addition, `tomotopy` can take advantage of multicore CPUs with a SIMD instruction set, which can result in faster iterations.\n\n[gensim's LdaModel]: https://radimrehurek.com/gensim/models/ldamodel.html \n\nFollowing chart shows the comparison of LDA model's running time between `tomotopy` and `gensim`. \nThe input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB).\n`tomotopy` trains 200 iterations and `gensim` trains 10 iterations.\n\n.. image:: https://bab2min.github.io/tomotopy/images/tmt_i5.png\n\n\u2191 Performance in Intel i5-6600, x86-64 (4 cores)\n\n.. image:: https://bab2min.github.io/tomotopy/images/tmt_xeon.png\n\n\u2191 Performance in Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)\n\n.. image:: https://bab2min.github.io/tomotopy/images/tmt_r7_3700x.png\n\n\u2191 Performance in AMD Ryzen7 3700X, x86-64 (8 cores, 16 threads)\n\nAlthough `tomotopy` iterated 20 times more, the overall running time was 5~10 times faster than `gensim`. 
It also yields a stable result.\n\nIt is difficult to compare CGS and VB directly because they are totally different techniques.\nBut from a practical point of view, we can compare the speed and the result between them.\nThe following chart shows the log-likelihood per word of the two models' results. \n\n.. image:: https://bab2min.github.io/tomotopy/images/LLComp.png\n\n\n\n\nThe SIMD instruction set has a great effect on performance. The following is a comparison between SIMD instruction sets.\n\n.. image:: https://bab2min.github.io/tomotopy/images/SIMDComp.png\n\nFortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.\n\nVocabulary controlling using CF and DF\n---------------------------------------\nCF(collection frequency) and DF(document frequency) are concepts used in information retrieval, \nand they represent the total number of times the word appears in the corpus \nand the number of documents in which the word appears within the corpus, respectively.\n`tomotopy` provides these two measures under the parameters of `min_cf` and `min_df` to trim low frequency words when building the corpus.\n\nFor example, let's say we have 5 documents #0 ~ #4 which are composed of the following words:\n::\n\n    #0 : a, b, c, d, e, c\n    #1 : a, b, e, f\n    #2 : c, d, c\n    #3 : a, e, f, g\n    #4 : a, b, g\n\nBoth CF of `a` and CF of `c` are 4 because each appears 4 times in the entire corpus. 
\nBut DF of `a` is 4 and DF of `c` is 2 because `a` appears in #0, #1, #3 and #4 and `c` only appears in #0 and #2.\nSo if we trim low frequency words using `min_cf=3`, the result becomes follows:\n::\n\n    (d, f and g are removed.)\n    #0 : a, b, c, e, c\n    #1 : a, b, e\n    #2 : c, c\n    #3 : a, e\n    #4 : a, b\n\nHowever when `min_df=3` the result is like :\n::\n\n    (c, d, f and g are removed.)\n    #0 : a, b, e\n    #1 : a, b, e\n    #2 : (empty doc)\n    #3 : a, e\n    #4 : a, b\n\nAs we can see, `min_df` is a stronger criterion than `min_cf`. \nIn performing topic modeling, words that appear repeatedly in only one document do not contribute to estimating the topic-word distribution. \nSo, removing words with low `df` is a good way to reduce model size while preserving the results of the final model.\nIn short, please prefer using `min_df` to `min_cf`.\n\nModel Save and Load\n-------------------\n`tomotopy` provides `save` and `load` method for each topic model class, \nso you can save the model into the file whenever you want, and re-load it from the file.\n::\n\n    import tomotopy as tp\n    \n    mdl = tp.HDPModel()\n    for line in open('sample.txt'):\n        mdl.add_doc(line.strip().split())\n    \n    for i in range(0, 100, 10):\n        mdl.train(10)\n        print('Iteration: {}\\tLog-likelihood: {}'.format(i, mdl.ll_per_word))\n    \n    # save into file\n    mdl.save('sample_hdp_model.bin')\n    \n    # load from file\n    mdl = tp.HDPModel.load('sample_hdp_model.bin')\n    for k in range(mdl.k):\n        if not mdl.is_live_topic(k): continue\n        print('Top 10 words of topic #{}'.format(k))\n        print(mdl.get_topic_words(k, top_n=10))\n    \n    # the saved model is HDP model, \n    # so when you load it by LDA model, it will raise an exception\n    mdl = tp.LDAModel.load('sample_hdp_model.bin')\n\nWhen you load the model from a file, a model type in the file should match the class of methods.\n\nSee more at 
`tomotopy.LDAModel.save` and `tomotopy.LDAModel.load` methods.\n\nDocuments in the Model and out of the Model\n-------------------------------------------\nA topic model serves two major purposes. \nThe basic one is to discover topics from a set of documents as the result of a trained model,\nand the more advanced one is to infer topic distributions for unseen documents by using a trained model.\n\nWe call a document used for the former purpose (model training) a **document in the model**,\nand a document used for the latter purpose (unseen during training) a **document out of the model**.\n\nIn `tomotopy`, these two kinds of document are created differently.\nA **document in the model** is created by the `tomotopy.LDAModel.add_doc` method.\n`add_doc` can be called only before `tomotopy.LDAModel.train` starts. \nIn other words, after `train` has been called, `add_doc` cannot add a document into the model because the set of documents used for training has become fixed.\n\nTo acquire the instance of the created document, use `tomotopy.LDAModel.docs` like:\n\n::\n\n    mdl = tp.LDAModel(k=20)\n    idx = mdl.add_doc(words)\n    if idx < 0: raise RuntimeError(\"Failed to add doc\")\n    doc_inst = mdl.docs[idx]\n    # doc_inst is an instance of the added document\n\nA **document out of the model** is generated by the `tomotopy.LDAModel.make_doc` method. 
`make_doc` can be called only after `train` starts.\nIf you use `make_doc` before the set of documents used for training has become fixed, you may get wrong results.\nSince `make_doc` returns the instance directly, you can use its return value for other manipulations.\n\n::\n\n    mdl = tp.LDAModel(k=20)\n    # add_doc ...\n    mdl.train(100)\n    doc_inst = mdl.make_doc(unseen_doc) # doc_inst is an instance of the unseen document\n\nInference for Unseen Documents\n------------------------------\nIf a new document is created by `tomotopy.LDAModel.make_doc`, its topic distribution can be inferred by the model.\nInference for unseen documents should be performed using the `tomotopy.LDAModel.infer` method.\n\n::\n\n    mdl = tp.LDAModel(k=20)\n    # add_doc ...\n    mdl.train(100)\n    doc_inst = mdl.make_doc(unseen_doc)\n    topic_dist, ll = mdl.infer(doc_inst)\n    print(\"Topic Distribution for Unseen Docs: \", topic_dist)\n    print(\"Log-likelihood of inference: \", ll)\n\nThe `infer` method accepts either a single instance of `tomotopy.Document` or a `list` of instances of `tomotopy.Document`. \nSee more at `tomotopy.LDAModel.infer`.\n\nCorpus and transform\n--------------------\nEvery topic model in `tomotopy` has its own internal document type.\nA document can be created in the form suitable for each model and added to it through that model's `add_doc` method. \nHowever, trying to add the same list of documents to different models becomes quite inconvenient, \nbecause `add_doc` must be called on the same list of documents for each model.\nThus, `tomotopy` provides the `tomotopy.utils.Corpus` class that holds a list of documents. \n`tomotopy.utils.Corpus` can be inserted into any model by passing it as the argument `corpus` to the `__init__` or `add_corpus` method of each model. \nSo, inserting `tomotopy.utils.Corpus` has the same effect as inserting the documents the corpus holds.\n\nSome topic models require different data for their documents. 
\nFor example, `tomotopy.DMRModel` requires argument `metadata` in `str` type, \nbut `tomotopy.PLDAModel` requires argument `labels` in `List[str]` type. \nSince `tomotopy.utils.Corpus` holds an independent set of documents rather than being tied to a specific topic model, \ndata types required by a topic model may be inconsistent when a corpus is added into that topic model. \nIn this case, miscellaneous data can be transformed to be fitted target topic model using argument `transform`. \nSee more details in the following code:\n\n::\n\n    from tomotopy import DMRModel\n    from tomotopy.utils import Corpus\n\n    corpus = Corpus()\n    corpus.add_doc(\"a b c d e\".split(), a_data=1)\n    corpus.add_doc(\"e f g h i\".split(), a_data=2)\n    corpus.add_doc(\"i j k l m\".split(), a_data=3)\n\n    model = DMRModel(k=10)\n    model.add_corpus(corpus) \n    # You lose `a_data` field in `corpus`, \n    # and `metadata` that `DMRModel` requires is filled with the default value, empty str.\n\n    assert model.docs[0].metadata == ''\n    assert model.docs[1].metadata == ''\n    assert model.docs[2].metadata == ''\n\n    def transform_a_data_to_metadata(misc: dict):\n        return {'metadata': str(misc['a_data'])}\n    # this function transforms `a_data` to `metadata`\n\n    model = DMRModel(k=10)\n    model.add_corpus(corpus, transform=transform_a_data_to_metadata)\n    # Now docs in `model` has non-default `metadata`, that generated from `a_data` field.\n\n    assert model.docs[0].metadata == '1'\n    assert model.docs[1].metadata == '2'\n    assert model.docs[2].metadata == '3'\n\n\nParallel Sampling Algorithms\n----------------------------\nSince version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm. 
\nThe algorithm provided in versions prior to 0.4.2 is `COPY_MERGE`, which is provided for all topic models.\nThe new algorithm `PARTITION`, available since 0.5.0, makes training generally faster and more memory-efficient, but it is not available for all topic models.\n\nThe following chart shows the speed difference between the two algorithms based on the number of topics and the number of workers.\n\n.. image:: https://bab2min.github.io/tomotopy/images/algo_comp.png\n\n.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png\n\nPerformance by Version\n----------------------\nPerformance changes by version are shown in the following graph. \nThe time taken to train an LDA model for 1000 iterations was measured. \n(Docs: 11314, Vocab: 60382, Words: 2364724, Intel Xeon Gold 5120 @2.2GHz)\n\n.. image:: https://bab2min.github.io/tomotopy/images/lda-perf-t1.png\n\n.. image:: https://bab2min.github.io/tomotopy/images/lda-perf-t4.png\n\n.. image:: https://bab2min.github.io/tomotopy/images/lda-perf-t8.png\n\nPinning Topics using Word Priors\n--------------------------------\nSince version 0.6.0, a new method `tomotopy.LDAModel.set_word_prior` has been added. It allows you to control the word prior for each topic.\nFor example, we can set the weight of the word 'church' to 1.0 in topic 0, and the weight to 0.1 in the rest of the topics with the following code.\nThis means that the probability that the word 'church' is assigned to topic 0 is 10 times higher than the probability of being assigned to another topic.\nTherefore, most occurrences of 'church' are assigned to topic 0, so topic 0 contains many words related to 'church'. 
\nThis lets you pin particular topics to a specific topic number.\n\n::\n\n    import tomotopy as tp\n    mdl = tp.LDAModel(k=20)\n    \n    # add documents into `mdl`\n\n    # setting word prior\n    mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])\n\nSee `word_prior_example` in `example.py` for more details.\n\nExamples\n--------\nYou can find example Python code for tomotopy at https://github.com/bab2min/tomotopy/blob/main/examples/ .\n\nYou can also get the data file used in the example code at https://drive.google.com/file/d/18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J/view .\n\nLicense\n---------\n`tomotopy` is licensed under the terms of MIT License, \nmeaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.\n\nHistory\n-------\n* 0.12.7 (2023-12-19)\n    * New features\n        * Added Topic Model Viewer `tomotopy.viewer.open_viewer()`\n        * Optimized the performance of `tomotopy.utils.Corpus.process()`\n    * Bug fixes\n        * `Document.span` now returns the ranges in character unit, not in byte unit.\n\n* 0.12.6 (2023-12-11)\n    * New features\n        * Added some convenience features to `tomotopy.LDAModel.train` and `tomotopy.LDAModel.set_word_prior`.\n        * `LDAModel.train` now has new arguments `callback`, `callback_interval` and `show_progres` to monitor the training progress.\n        * `LDAModel.set_word_prior` now can accept `Dict[int, float]` type as its argument `prior`.\n\n* 0.12.5 (2023-08-03)\n    * New features\n        * Added support for Linux ARM64 architecture.\n\n* 0.12.4 (2023-01-22)\n    * New features\n        * Added support for macOS ARM64 architecture.\n    * Bug fixes\n        * Fixed an issue where `tomotopy.Document.get_sub_topic_dist()` raises a bad argument exception.\n        * Fixed an issue where exception raising sometimes causes crashes.\n\n* 0.12.3 (2022-07-19)\n    * New features\n        * Now, inserting an 
empty document using `tomotopy.LDAModel.add_doc()` just ignores it instead of raising an exception. If the newly added argument `ignore_empty_words` is set to False, an exception is raised as before.\n        * `tomotopy.HDPModel.purge_dead_topics()` method is added to remove non-live topics from the model.\n    * Bug fixes\n        * Fixed an issue that prevents setting user defined values for nuSq in `tomotopy.SLDAModel` (by @jucendrero).\n        * Fixed an issue where `tomotopy.utils.Coherence` did not work for `tomotopy.DTModel`.\n        * Fixed an issue that often crashed when calling `make_dic()` before calling `train()`.\n        * Resolved the problem that the results of `tomotopy.DMRModel` and `tomotopy.GDMRModel` are different even when the seed is fixed.\n        * The parameter optimization process of `tomotopy.DMRModel` and `tomotopy.GDMRModel` has been improved.\n        * Fixed an issue that sometimes crashed when calling `tomotopy.PTModel.copy()`.\n\n* 0.12.2 (2021-09-06)\n    * An issue where calling `convert_to_lda` of `tomotopy.HDPModel` with `min_cf > 0`, `min_df > 0` or `rm_top > 0` causes a crash has been fixed.\n    * A new argument `from_pseudo_doc` is added to `tomotopy.Document.get_topics` and `tomotopy.Document.get_topic_dist`.\n      This argument is only valid for documents of `PTModel`, it enables to control a source for computing topic distribution.\n    * A default value for argument `p` of `tomotopy.PTModel` has been changed. 
The new default value is `k * 10`.\n    * Using documents generated by `make_doc` without calling `infer` doesn't cause a crash anymore, but just print warning messages.\n    * An issue where the internal C++ code isn't compiled at clang c++17 environment has been fixed.\n\n* 0.12.1 (2021-06-20)\n    * An issue where `tomotopy.LDAModel.set_word_prior()` causes a crash has been fixed.\n    * Now `tomotopy.LDAModel.perplexity` and `tomotopy.LDAModel.ll_per_word` return the accurate value when `TermWeight` is not `ONE`.\n    * `tomotopy.LDAModel.used_vocab_weighted_freq` was added, which returns term-weighted frequencies of words.\n    * Now `tomotopy.LDAModel.summary()` shows not only the entropy of words, but also the entropy of term-weighted words.\n\n* 0.12.0 (2021-04-26)\n    * Now `tomotopy.DMRModel` and `tomotopy.GDMRModel` support multiple values of metadata (see https://github.com/bab2min/tomotopy/blob/main/examples/dmr_multi_label.py )\n    * The performance of `tomotopy.GDMRModel` was improved.\n    * A `copy()` method has been added for all topic models to do a deep copy.\n    * An issue was fixed where words that are excluded from training (by `min_cf`, `min_df`) have incorrect topic id. Now all excluded words have `-1` as topic id.\n    * Now all exceptions and warnings that generated by `tomotopy` follow standard Python types.\n    * Compiler requirements have been raised to C++14.\n\n* 0.11.1 (2021-03-28)\n    * A critical bug of asymmetric alphas was fixed. 
Due to this bug, version 0.11.0 has been removed from releases.\n\n* 0.11.0 (2021-03-26) (removed)\n    * A new topic model `tomotopy.PTModel` for short texts was added into the package.\n    * An issue was fixed where `tomotopy.HDPModel.infer` causes a segmentation fault sometimes.\n    * A mismatch of numpy API version was fixed.\n    * Now asymmetric document-topic priors are supported.\n    * Serializing topic models to `bytes` in memory is supported.\n    * An argument `normalize` was added to `get_topic_dist()`, `get_topic_word_dist()` and `get_sub_topic_dist()` for controlling normalization of results.\n    * Now `tomotopy.DMRModel.lambdas` and `tomotopy.DMRModel.alpha` give correct values.\n    * Support for categorical metadata in `tomotopy.GDMRModel` was added (see https://github.com/bab2min/tomotopy/blob/main/examples/gdmr_both_categorical_and_numerical.py ).\n    * Python3.5 support was dropped.\n\n* 0.10.2 (2021-02-16)\n    * An issue was fixed where `tomotopy.CTModel.train` fails with large K.\n    * An issue was fixed where `tomotopy.utils.Corpus` loses its `uid` values.\n\n* 0.10.1 (2021-02-14)\n    * An issue was fixed where `tomotopy.utils.Corpus.extract_ngrams` crashes with empty input.\n    * An issue was fixed where `tomotopy.LDAModel.infer` raises an exception with valid input.\n    * An issue was fixed where `tomotopy.HLDAModel.infer` generates wrong `tomotopy.Document.path`.\n    * Since a new parameter `freeze_topics` for `tomotopy.HLDAModel.train` was added, you can control whether to create a new topic or not when training.\n\n* 0.10.0 (2020-12-19)\n    * The interface of `tomotopy.utils.Corpus` and of `tomotopy.LDAModel.docs` were unified. Now you can access the document in corpus in the same manner.\n    * __getitem__ of `tomotopy.utils.Corpus` was improved. Not only indexing by int, but also indexing by Iterable[int] and slicing are supported. 
Also indexing by uid is supported.\n    * New methods `tomotopy.utils.Corpus.extract_ngrams` and `tomotopy.utils.Corpus.concat_ngrams` were added. They extract n-gram collocations using PMI and concatenate them into single words.\n    * A new method `tomotopy.LDAModel.add_corpus` was added, and `tomotopy.LDAModel.infer` can receive corpus as input. \n    * A new module `tomotopy.coherence` was added. It provides a way to calculate coherence of the model.\n    * A parameter `window_size` was added to `tomotopy.label.FoRelevance`.\n    * An issue was fixed where NaN often occurs when training `tomotopy.HDPModel`.\n    * Now Python3.9 is supported.\n    * A dependency on py-cpuinfo was removed and the initialization of the module was improved.\n\n* 0.9.1 (2020-08-08)\n    * Memory leaks of version 0.9.0 were fixed.\n    * `tomotopy.CTModel.summary()` was fixed.\n\n* 0.9.0 (2020-08-04)\n    * The `tomotopy.LDAModel.summary()` method, which prints a human-readable summary of the model, has been added.\n    * The random number generator of the package has been replaced with [EigenRand]. 
It speeds up the random number generation and solves the result difference between platforms.\n    * Due to above, even if `seed` is the same, the model training result may be different from the version before 0.9.0.\n    * Fixed a training error in `tomotopy.HDPModel`.\n    * `tomotopy.DMRModel.alpha` now shows Dirichlet prior of per-document topic distribution by metadata.\n    * `tomotopy.DTModel.get_count_by_topics()` has been modified to return a 2-dimensional `ndarray`.\n    * `tomotopy.DTModel.alpha` has been modified to return the same value as `tomotopy.DTModel.get_alpha()`.\n    * Fixed an issue where the `metadata` value could not be obtained for the document of `tomotopy.GDMRModel`.\n    * `tomotopy.HLDAModel.alpha` now shows Dirichlet prior of per-document depth distribution.\n    * `tomotopy.LDAModel.global_step` has been added.\n    * `tomotopy.MGLDAModel.get_count_by_topics()` now returns the word count for both global and local topics.\n    * `tomotopy.PAModel.alpha`, `tomotopy.PAModel.subalpha`, and `tomotopy.PAModel.get_count_by_super_topic()` have been added.\n\n[EigenRand]: https://github.com/bab2min/EigenRand\n\n* 0.8.2 (2020-07-14)\n    * New properties `tomotopy.DTModel.num_timepoints` and `tomotopy.DTModel.num_docs_by_timepoint` have been added.\n    * A bug which causes different results with the different platform even if `seeds` were the same was partially fixed. 
\n      As a result of this fix, now `tomotopy` in 32 bit yields different training results from earlier versions.\n\n* 0.8.1 (2020-06-08)\n    * A bug where `tomotopy.LDAModel.used_vocabs` returned an incorrect value was fixed.\n    * Now `tomotopy.CTModel.prior_cov` returns a covariance matrix with shape `[k, k]`.\n    * Now `tomotopy.CTModel.get_correlations` with empty arguments returns a correlation matrix with shape `[k, k]`.\n\n* 0.8.0 (2020-06-06)\n    * Since NumPy was introduced in tomotopy, many methods and properties of tomotopy return not just `list`, but `numpy.ndarray` now.\n    * Tomotopy has a new dependency `NumPy >= 1.10.0`.\n    * A wrong estimation of `tomotopy.HDPModel.infer` was fixed.\n    * A new method for converting HDPModel to LDAModel was added.\n    * New properties including `tomotopy.LDAModel.used_vocabs`, `tomotopy.LDAModel.used_vocab_freq` and `tomotopy.LDAModel.used_vocab_df` were added into topic models.\n    * A new g-DMR topic model (`tomotopy.GDMRModel`) was added.\n    * An error at initializing `tomotopy.label.FoRelevance` in macOS was fixed.\n    * An error that occurred when using `tomotopy.utils.Corpus` created without `raw` parameters was fixed.\n\n* 0.7.1 (2020-05-08)\n    * `tomotopy.Document.path` was added for `tomotopy.HLDAModel`.\n    * A memory corruption bug in `tomotopy.label.PMIExtractor` was fixed.\n    * A compile error in gcc 7 was fixed.\n\n* 0.7.0 (2020-04-18)\n    * `tomotopy.DTModel` was added into the package.\n    * A bug in `tomotopy.utils.Corpus.save` was fixed.\n    * A new method `tomotopy.Document.get_count_vector` was added into Document class.\n    * Now Linux distributions use manylinux2010 and an additional optimization is applied.\n\n* 0.6.2 (2020-03-28)\n    * A critical bug related to `save` and `load` was fixed. 
Versions 0.6.0 and 0.6.1 have been removed from releases.\n\n* 0.6.1 (2020-03-22) (removed)\n    * A bug related to module loading was fixed.\n\n* 0.6.0 (2020-03-22) (removed)\n    * `tomotopy.utils.Corpus` class that manages multiple documents easily was added.\n    * `tomotopy.LDAModel.set_word_prior` method that controls word-topic priors of topic models was added.\n    * A new argument `min_df` that filters words based on document frequency was added into every topic model's __init__.\n    * `tomotopy.label`, a submodule for topic labeling, was added. Currently, only `tomotopy.label.FoRelevance` is provided.\n\n* 0.5.2 (2020-03-01)\n    * A segmentation fault problem was fixed in `tomotopy.LLDAModel.add_doc`.\n    * A bug was fixed where `infer` of `tomotopy.HDPModel` sometimes crashes the program.\n    * A crash in `tomotopy.LDAModel.infer` with ps=tomotopy.ParallelScheme.PARTITION, together=True was fixed.\n\n* 0.5.1 (2020-01-11)\n    * A bug was fixed where `tomotopy.SLDAModel.make_doc` doesn't support missing values for `y`.\n    * Now `tomotopy.SLDAModel` fully supports missing values for response variables `y`. Documents with missing values (NaN) are included in topic modeling, but excluded from regression of response variables.\n\n* 0.5.0 (2019-12-30)\n    * Now `tomotopy.PAModel.infer` returns both topic distribution and sub-topic distribution.\n    * New methods get_sub_topics and get_sub_topic_dist were added into `tomotopy.Document`. (for PAModel)\n    * New parameter `parallel` was added for `tomotopy.LDAModel.train` and `tomotopy.LDAModel.infer` method. You can select the parallelism algorithm by changing this parameter.\n    * `tomotopy.ParallelScheme.PARTITION`, a new algorithm, was added. 
It works efficiently when the number of workers is large, or when the number of topics or the vocabulary size is large.\n    * A bug where `rm_top` didn't work when `min_cf` < 2 was fixed.\n\n* 0.4.2 (2019-11-30)\n    * Wrong topic assignments of `tomotopy.LLDAModel` and `tomotopy.PLDAModel` were fixed.\n    * Readable __repr__ of `tomotopy.Document` and `tomotopy.Dictionary` was implemented.\n\n* 0.4.1 (2019-11-27)\n    * A bug in the init function of `tomotopy.PLDAModel` was fixed.\n\n* 0.4.0 (2019-11-18)\n    * New models including `tomotopy.PLDAModel` and `tomotopy.HLDAModel` were added into the package.\n\n* 0.3.1 (2019-11-05)\n    * An issue where `get_topic_dist()` returns an incorrect value when `min_cf` or `rm_top` is set was fixed.\n    * The return value of `get_topic_dist()` of `tomotopy.MGLDAModel` document was fixed to include local topics.\n    * The estimation speed with `tw=ONE` was improved.\n\n* 0.3.0 (2019-10-06)\n    * A new model, `tomotopy.LLDAModel`, was added into the package.\n    * A crashing issue of `HDPModel` was fixed.\n    * Since hyperparameter estimation for `HDPModel` was implemented, the result of `HDPModel` may differ from previous versions.\n        If you want to turn off hyperparameter estimation of HDPModel, set `optim_interval` to zero.\n\n* 0.2.0 (2019-08-18)\n    * New models including `tomotopy.CTModel` and `tomotopy.SLDAModel` were added into the package.\n    * A new parameter option `rm_top` was added for all topic models.\n    * The problems in the `save` and `load` methods for `PAModel` and `HPAModel` were fixed.\n    * An occasional crash in loading `HDPModel` was fixed.\n    * The problem that `ll_per_word` was calculated incorrectly when `min_cf` > 0 was fixed.\n\n* 0.1.6 (2019-08-09)\n    * Compiling errors at clang with macOS environment were fixed.\n\n* 0.1.4 (2019-08-05)\n    * The issue when `add_doc` receives an empty list as input was fixed.\n    * The issue that `tomotopy.PAModel.get_topic_words` doesn't extract the word 
distribution of subtopic was fixed.\n\n* 0.1.3 (2019-05-19)\n    * The parameter `min_cf` and its stopword-removing function were added for all topic models.\n\n* 0.1.0 (2019-05-12)\n    * First version of **tomotopy**\n",
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "Tomoto, Topic Modeling Tool for Python",
    "version": "0.12.7",
    "project_urls": {
        "Homepage": "https://github.com/bab2min/tomotopy"
    },
    "split_keywords": [
        "nlp",
        "topic model"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e7715bd0c9453a6bdf29b640cd9d2cd0d92e40925d7618555e628e874016deb5",
                "md5": "88b3c9fb909c2176d561262f7501b25c",
                "sha256": "1338a46ea8bd9263e58abf8d886e96cc8ac2aae56184dfd3992b046b5aa933e1"
            },
            "downloads": -1,
            "filename": "tomotopy-0.12.7-cp38-cp38-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "88b3c9fb909c2176d561262f7501b25c",
            "packagetype": "bdist_wheel",
            "python_version": "cp38",
            "requires_python": null,
            "size": 3452956,
            "upload_time": "2023-12-18T15:42:50",
            "upload_time_iso_8601": "2023-12-18T15:42:50.594111Z",
            "url": "https://files.pythonhosted.org/packages/e7/71/5bd0c9453a6bdf29b640cd9d2cd0d92e40925d7618555e628e874016deb5/tomotopy-0.12.7-cp38-cp38-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "83b4fb1ef45cba9e5d24f73211864523e844970ee95cd56d6bdeb5f8240dae90",
                "md5": "2c7f162ab9519ceb1f5c17da539be363",
                "sha256": "d43b7476c3f9077d77f64f51fce6e8e1a58d7d7f504acf0f2297af916640e2ea"
            },
            "downloads": -1,
            "filename": "tomotopy-0.12.7-cp38-cp38-win32.whl",
            "has_sig": false,
            "md5_digest": "2c7f162ab9519ceb1f5c17da539be363",
            "packagetype": "bdist_wheel",
            "python_version": "cp38",
            "requires_python": null,
            "size": 3396266,
            "upload_time": "2023-12-18T15:43:44",
            "upload_time_iso_8601": "2023-12-18T15:43:44.059136Z",
            "url": "https://files.pythonhosted.org/packages/83/b4/fb1ef45cba9e5d24f73211864523e844970ee95cd56d6bdeb5f8240dae90/tomotopy-0.12.7-cp38-cp38-win32.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c8bd2a9999b3270c3d67d80b02548f6885a716eaddf54854ae72eb427b192be8",
                "md5": "0107a1b2b6d8ed8d23673389841357c5",
                "sha256": "66624abd5a9cbd8969a97725e5046de744d236330927a66d31a62f2146c1b115"
            },
            "downloads": -1,
            "filename": "tomotopy-0.12.7-cp39-cp39-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "0107a1b2b6d8ed8d23673389841357c5",
            "packagetype": "bdist_wheel",
            "python_version": "cp39",
            "requires_python": null,
            "size": 3441283,
            "upload_time": "2023-12-18T15:41:32",
            "upload_time_iso_8601": "2023-12-18T15:41:32.480867Z",
            "url": "https://files.pythonhosted.org/packages/c8/bd/2a9999b3270c3d67d80b02548f6885a716eaddf54854ae72eb427b192be8/tomotopy-0.12.7-cp39-cp39-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "71577fe6696a17144c04461e7fe17e935de827a8f795ca78ed31f73c28297e87",
                "md5": "bc25caaa1cb8fc33aeb528e6126fe898",
                "sha256": "33c468b1f5ec18c4e871739b6c91141c3391c7996283b38a3907dfdba6ca78fd"
            },
            "downloads": -1,
            "filename": "tomotopy-0.12.7-cp39-cp39-win32.whl",
            "has_sig": false,
            "md5_digest": "bc25caaa1cb8fc33aeb528e6126fe898",
            "packagetype": "bdist_wheel",
            "python_version": "cp39",
            "requires_python": null,
            "size": 3396698,
            "upload_time": "2023-12-18T15:43:08",
            "upload_time_iso_8601": "2023-12-18T15:43:08.303107Z",
            "url": "https://files.pythonhosted.org/packages/71/57/7fe6696a17144c04461e7fe17e935de827a8f795ca78ed31f73c28297e87/tomotopy-0.12.7-cp39-cp39-win32.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e767c5183deb13264ffb3b83866ffef7606f6b35cf8880b1770bfc85dccd618f",
                "md5": "3d621e6507d4387247ae7ec43920004c",
                "sha256": "e1a4e0b6426489ed11cb1940e17d6704d77672dd86314418c0108a6ffb9a78f6"
            },
            "downloads": -1,
            "filename": "tomotopy-0.12.7.tar.gz",
            "has_sig": false,
            "md5_digest": "3d621e6507d4387247ae7ec43920004c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 1349771,
            "upload_time": "2023-12-18T15:25:46",
            "upload_time_iso_8601": "2023-12-18T15:25:46.442854Z",
            "url": "https://files.pythonhosted.org/packages/e7/67/c5183deb13264ffb3b83866ffef7606f6b35cf8880b1770bfc85dccd618f/tomotopy-0.12.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-18 15:25:46",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "bab2min",
    "github_project": "tomotopy",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "tomotopy"
}
        