=====
SINr
=====
|languages| |downloads| |license| |version| |cpython| |wheel| |python| |activity| |contributors|
*SINr* is an open-source tool to efficiently compute graph and word
embeddings. Its aim is to provide sparse interpretable vectors from a
graph structure. The dimensions of the vector produced are related to
the community structure detected in the graph. By leveraging the
relative connection of vertices to communities, *SINr* builds an
interpretable space. *SINr* is focused on providing tools to build and
interpret the embeddings produced.
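
To make the idea of "relative connection of vertices to communities" more
concrete, here is a minimal, dependency-free sketch. It is not *SINr*'s
implementation: the toy graph, the hard-coded community assignment and the
helper name ``community_share`` are assumptions made for illustration only.
For each vertex, it computes the fraction of the vertex's weighted degree
that falls into each community, which yields one readable dimension per
community.

.. code:: python

    from collections import defaultdict

    # Toy weighted, undirected graph given as an edge list (hypothetical data).
    edges = [
        ("apple", "pear", 3.0),
        ("apple", "fruit", 2.0),
        ("pear", "fruit", 1.0),
        ("apple", "laptop", 1.0),
        ("laptop", "screen", 4.0),
    ]

    # Hard-coded community assignment; a real pipeline would detect it.
    community_of = {"apple": 0, "pear": 0, "fruit": 0, "laptop": 1, "screen": 1}


    def community_share(edges, community_of):
        """Return {vertex: {community: share of the vertex's weighted degree}}."""
        weight_to = defaultdict(lambda: defaultdict(float))
        degree = defaultdict(float)
        for u, v, w in edges:
            # Each undirected edge contributes to both of its endpoints.
            weight_to[u][community_of[v]] += w
            weight_to[v][community_of[u]] += w
            degree[u] += w
            degree[v] += w
        return {
            node: {c: w / degree[node] for c, w in by_comm.items()}
            for node, by_comm in weight_to.items()
        }


    vectors = community_share(edges, community_of)
    print(vectors["apple"])

In this toy example most of the weight around ``apple`` goes to community 0,
so that dimension dominates its vector, while ``screen`` is connected to
community 1 only.
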
*SINr* is a Python module relying on
`Networkit <https://networkit.github.io>`__ for the graph structure and
community detection. *SINr* also provides efficient implementations to
extract word co-occurrence graphs from large text corpora. One of the
strengths of *SINr* is its ability to work with text and produce
interpretable word embeddings that are competitive with similar
approaches. For more details on the performance of *SINr* on downstream
evaluation tasks, please refer to the `Publications <#publications>`__
section.
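
As a rough illustration of what a word co-occurrence graph is, the sketch
below counts how often two words appear within a fixed window of each other.
It is a simplified stand-in and not the implementation shipped with *SINr*:
the function name ``cooccurrence_counts``, the toy sentences and the choice
of a forward-looking window are assumptions. Each counted pair can be read as
a weighted edge between two word vertices.

.. code:: python

    from collections import Counter


    def cooccurrence_counts(sentences, window=2):
        """Count unordered word pairs occurring within `window` tokens of each other."""
        counts = Counter()
        for tokens in sentences:
            for i, word in enumerate(tokens):
                # Only look forward so that each pair of positions is counted once.
                for other in tokens[i + 1 : i + 1 + window]:
                    if other != word:
                        counts[tuple(sorted((word, other)))] += 1
        return counts


    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
    ]
    counts = cooccurrence_counts(sentences, window=2)
    print(counts[("cat", "sat")], counts[("on", "sat")])
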
Requirements
============
- As SINr relies on libraries implemented using C/C++, a modern C++
  compiler is required.
- OpenMP (required for `Networkit <https://networkit.github.io>`__ and
  compiling *SINr*\ ’s Cython)
- Python 3.9
- Pip
- Cython
- Conda (recommended)
Install
=======
SINr can be installed through ``pip``.
pip
---

.. code:: bash

    conda activate sinr  # activate conda environment
    pip install sinr
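
After installing, one way to check that the package is visible to the
current interpreter is to query its metadata with the standard library. This
is only a convenience sketch: the helper name ``installed_version`` is
hypothetical and nothing here is specific to *SINr*.

.. code:: python

    from importlib import metadata


    def installed_version(package):
        """Return the installed version of `package`, or None if it is missing."""
        try:
            return metadata.version(package)
        except metadata.PackageNotFoundError:
            return None


    # Prints a version string if the install succeeded, None otherwise.
    print(installed_version("sinr"))
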
Usage example
=============
To get started using *SINr* to build graph and word embeddings, have a
look at the `notebooks <https://github.com/SINr-Embeddings/sinr/tree/main/notebooks>`_
directory.

Here is a minimal working example of *SINr*:

.. code:: python

    import nltk  # For textual resources

    import sinr.text.preprocess as ppcs
    from sinr.text.cooccurrence import Cooccurrence
    from sinr.text.pmi import pmi_filter
    import sinr.graph_embeddings as ge
    import sinr.text.evaluate as ev

    # Get a textual corpus
    # For example, texts from the Project Gutenberg electronic text archive,
    # hosted at http://www.gutenberg.org/
    nltk.download('gutenberg')
    gutenberg = nltk.corpus.gutenberg  # a small selection of Project Gutenberg texts
    # Write the raw text to disk; the context manager closes the file for us
    with open("my_corpus.txt", "w", encoding="utf-8") as file:
        file.write(gutenberg.raw())

    # Preprocess corpus
    vrt_maker = ppcs.VRTMaker(ppcs.Corpus(ppcs.Corpus.REGISTER_WEB,
                                          ppcs.Corpus.LANGUAGE_EN,
                                          "my_corpus.txt"),
                              ".", n_jobs=8)
    vrt_maker.do_txt_to_vrt()
    sentences = ppcs.extract_text("my_corpus.vrt", min_freq=20)

    # Construct cooccurrence matrix
    c = Cooccurrence()
    c.fit(sentences, window=5)
    c.matrix = pmi_filter(c.matrix)
    c.save("my_cooc_matrix.pk")

    # Train SINr model
    model = ge.SINr.load_from_cooc_pkl("my_cooc_matrix.pk")
    commu = model.detect_communities(gamma=10)
    model.extract_embeddings(commu)

    # Construct SINrVectors to manipulate the model
    sinr_vec = ge.InterpretableWordsModelBuilder(model,
                                                 'my_sinr_vectors',
                                                 n_jobs=8,
                                                 n_neighbors=25).build()
    sinr_vec.save()

    # Sparsify vectors for better interpretability and performance
    sinr_vec.sparsify(100)

    # Evaluate the model with the similarity task
    print('\nResults of the similarity evaluation:')
    print(ev.similarity_MEN_WS353_SCWS(sinr_vec))

    # Explore word vectors and dimensions of the model
    print("\nDimensions activated by the word 'apple':")
    print(sinr_vec.get_obj_stereotypes('apple', topk_dim=5, topk_val=3))

    print("\nWords similar to 'apple':")
    print(sinr_vec.most_similar('apple'))

    # Load an existing SINrVectors object
    sinr_vec = ge.SINrVectors('my_sinr_vectors')
    sinr_vec.load()
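
The example above sparsifies the vectors to make them easier to interpret.
How ``sparsify`` selects values is not described here, so the sketch below is
only an assumption about one common strategy: keep the ``k`` largest values
of a vector and set the others to zero. The helper name ``keep_top_k`` and
the toy vector are hypothetical and not part of the *SINr* API.

.. code:: python

    import heapq


    def keep_top_k(vector, k):
        """Return a copy of `vector` in which only the k largest values are kept."""
        if k <= 0:
            return [0.0] * len(vector)
        # Indices of the k largest values.
        top = set(heapq.nlargest(k, range(len(vector)), key=vector.__getitem__))
        return [value if i in top else 0.0 for i, value in enumerate(vector)]


    dense = [0.05, 0.40, 0.10, 0.30, 0.15]
    sparse = keep_top_k(dense, k=2)
    print(sparse)

With fewer active dimensions per word, each remaining dimension carries more
of the word's description, which is the intuition behind sparsifying for
interpretability.
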
Documentation
=============
The documentation for *SINr* is `available
online <https://sinr-embeddings.github.io/sinr/index.html>`__.
Contributing
============
Pull requests are welcome. For major changes, please open an issue first
to discuss the changes to be made.
License
=======
Released under `CeCILL 2.1 <https://cecill.info/>`__, see `LICENSE <https://github.com/SINr-Embeddings/sinr/blob/main/LICENSE>`__ for more details.
Publications
============
*SINr* is currently maintained at the *University of Le Mans*. If you
find *SINr* useful for your own research, please cite the appropriate
papers from the list below. Publications can also be found on the
`publications page in the
documentation <https://sinr-embeddings.github.io/sinr/publications.html>`__.

**Initial SINr paper, 2021**

- Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez,
  Jean-Charles Lamirel, et al. SINr: Fast Computing of Sparse
  Interpretable Node Representations is not a Sin! Advances in
  Intelligent Data Analysis XIX, 19th International Symposium on
  Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal.
  pp. 325-337,
  ⟨\ `10.1007/978-3-030-74251-5_26 <https://dx.doi.org/10.1007/978-3-030-74251-5_26>`__\ ⟩.
  `⟨hal-03197434⟩ <https://hal.science/hal-03197434>`__

**Interpretability of SINr embedding**

- Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier.
  Are Embedding Spaces Interpretable? Results of an Intrusion Detection
  Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille,
  France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`__

**Sparsity of SINr embedding**

- Simon Guillot, Thibault Prouteau, Nicolas Dugué.
  Sparser is better: one step closer to word embedding interpretability.
  IWCS 2023, Nancy, France.
  `⟨hal-04321407⟩ <https://hal.science/hal-04321407>`__

**Filtering dimensions of SINr embedding**

- Anna Béranger, Nicolas Dugué, Simon Guillot, Thibault Prouteau.
  Filtering communities in word co-occurrence networks to foster the
  emergence of meaning. Complex Networks 2023, Menton, France.
  `⟨hal-04398742⟩ <https://hal.science/hal-04398742>`__

.. |languages| image:: https://img.shields.io/github/languages/count/SINr-Embeddings/sinr
.. |downloads| image:: https://img.shields.io/pypi/dm/sinr
.. |license| image:: https://img.shields.io/pypi/l/sinr?color=green
.. |version| image:: https://img.shields.io/pypi/v/sinr
.. |cpython| image:: https://img.shields.io/pypi/implementation/sinr
.. |wheel| image:: https://img.shields.io/pypi/wheel/sinr
.. |python| image:: https://img.shields.io/pypi/pyversions/sinr
.. |activity| image:: https://img.shields.io/github/commit-activity/y/SINr-Embeddings/sinr
.. |contributors| image:: https://img.shields.io/github/contributors/SINr-Embeddings/sinr