cntext

Name: cntext
Version: 1.9.0
Home page: https://github.com/hidadeng/cntext
Summary: Chinese text analysis library that supports word frequency statistics, dictionary expansion, sentiment analysis, text similarity, readability, co-occurrence analysis, and social computation (attitudes, prejudice, culture).
Upload time: 2023-12-28 04:21:41
Docs URL: None
Author: 大邓
Requires Python: >=3.5
License: MIT
Keywords: chinese, text mining, sentiment, sentiment analysis, natural language processing, sentiment dictionary development, text similarity
            [![DOI](https://zenodo.org/badge/487297608.svg)](https://zenodo.org/badge/latestdoi/487297608)


<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
**Table of Contents**

- [Installation](#installation)
- [QuickStart](#quickstart)
- [1. Basic](#1-basic)
  - [1.1  readability](#11--readability)
  - [1.2  term_freq(text, lang)](#12--term_freqtext-lang)
  - [1.3 dict_pkl_list](#13-dict_pkl_list)
  - [1.4 load_pkl_dict](#14-load_pkl_dict)
  - [1.5 sentiment](#15-sentiment)
  - [1.6 sentiment_by_valence()](#16-sentiment_by_valence)
- [2. dictionary](#2-dictionary)
  - [2.1 SoPmi](#21-sopmi)
  - [2.2 W2VModels](#22-w2vmodels)
  - [Note](#note)
  - [2.3 co_occurrence_matrix](#23-co_occurrence_matrix)
  - [2.4  Glove](#24--glove)
- [3. similarity](#3-similarity)
- [4. Text2Mind](#4-text2mind)
  - [4.1 tm.sematic_distance(words, c_words1, c_words2)](#41-tmsematic_distancewords-c_words1-c_words2)
  - [4.2 tm.sematic_projection(words, c_words1, c_words2)](#42-tmsematic_projectionwords-c_words1-c_words2)
- [Citation](#citation)
  - [apalike](#apalike)
  - [bibtex](#bibtex)
  - [endnote](#endnote)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->


![](img/logo.png)



[中文文档](chinese_readme.md)

[中文博客](https://hidadeng.github.io/blog/)

**cntext** is a text analysis package that provides traditional text analysis methods, such as word counting, readability, document similarity, and sentiment analysis. It has multiple built-in Chinese and English sentiment dictionaries, supports training and using word embedding models, and now provides semantic distance and semantic projection.

- [github repo](https://github.com/hidadeng/cntext) ``https://github.com/hidadeng/cntext``

- [pypi link](https://pypi.org/project/cntext/)  ``https://pypi.org/project/cntext/``

As of 2023-11-20, cumulative downloads of cntext had reached 36,581.

[![](img/cntext-stats.png)](https://www.pepy.tech/projects/cntext)




<br>

## Installation

```
pip install cntext --upgrade
pip install numpy==1.24.2
pip install gensim==4.2.0
pip install scikit-learn==1.1.2
```



<br>

## QuickStart 

```python
import cntext as ct

help(ct)
```

Run

```
Help on package cntext:

NAME
    cntext

PACKAGE CONTENTS
    bias
    dictionary
    similarity
    stats
```



<br>



## 1. Basic

Currently, the built-in functions of stats.py are:

- **readability()**  compute the readability of a text; supports Chinese and English
- **term_freq()**  word counts
- **dict_pkl_list()**  get the list of built-in dictionaries (pkl format) in cntext
- **load_pkl_dict()**  load a pkl dictionary file
- **sentiment()**  sentiment analysis
- **sentiment_by_valence()**  valence-based sentiment analysis



```python
import cntext as ct

text = 'What a sunny day!'


diction = {'Pos': ['sunny', 'good'],
           'Neg': ['bad', 'terrible'],
           'Adv': ['very']}

ct.sentiment(text=text,
             diction=diction,
             lang='english')
```

Run

```
{'Pos_num': 1,
 'Neg_num': 0,
 'Adv_num': 0,
 'stopword_num': 1,
 'word_num': 5,
 'sentence_num': 1}
```

<br>



### 1.1  readability

The larger the indicator, the higher the complexity of the article and the worse the readability.

**readability(text, lang='chinese')**

- text:  text string
- lang:  "chinese" or "english", default is "chinese"



```python
import cntext as ct

text = 'Committed to publishing quality research software with zero article processing charges or subscription fees.'

ct.readability(text=text, 
               lang='english')
```

Run

```
{'readability': 19.982}
```

<br>



### 1.2  term_freq(text, lang)

Word frequency counting function; returns a Counter.

```python
import cntext as ct

text = 'Committed to publishing quality research software with zero article processing charges or subscription fees.'

ct.term_freq(text=text, lang='english')
```

Run

```
Counter({'committed': 1, 
         'publishing': 1, 
         'quality': 1, 
         'research': 1, 
         'software': 1, 
         'zero': 1, 
         'article': 1, 
         'processing': 1, 
         'charges': 1, 
         'subscription': 1, 
         'fees.': 1})
```

<br>



### 1.3 dict_pkl_list  

get the list of built-in dictionaries (pkl format) in cntext

```python
import cntext as ct

ct.dict_pkl_list()
```

Run

```
['DUTIR.pkl',
 'HOWNET.pkl',
 'sentiws.pkl',
 'Chinese_Digitalization.pkl',
 'ChineseFinancialFormalUnformalSentiment.pkl',
 'Concreteness.pkl',
 'ANEW.pkl',
 'LSD2015.pkl',
 'NRC.pkl',
 'geninqposneg.pkl',
 'HuLiu.pkl',
 'AFINN.pkl',
 'ChineseEmoBank.pkl',
 'ADV_CONJ.pkl',
 'Loughran_McDonald_Financial_Sentiment.pkl',
 'Chinese_Loughran_McDonald_Financial_Sentiment.pkl',
 'STOPWORDS.pkl']
```

The built-in pkl dictionaries are listed below; some of the English dictionaries are organized from [quanteda.sentiment](https://github.com/quanteda/quanteda.sentiment).

| pkl file | Dictionary | Language | Categories |
| --- | --- | --- | --- |
| ChineseEmoBank.pkl | Chinese sentiment dictionary with "valence" and "arousal" ratings. In cntext, only the Chinese valence-arousal words (CVAW, single words) are used; CVAP, CVAS, and CVAT are ignored. | Chinese | valence, arousal |
| DUTIR.pkl | DUTIR | Chinese | Seven emotion categories: 哀 (sorrow), 好 (liking), 惊 (surprise), 惧 (fear), 乐 (joy), 怒 (anger), 恶 (disgust) |
| HOWNET.pkl | Hownet | Chinese | Positive, Negative |
| SentiWS.pkl | SentimentWortschatz (SentiWS) | German | Positive, Negative |
| ChineseFinancialFormalUnformalSentiment.pkl | Chinese finance dictionary, containing formal, informal, positive, and negative words | Chinese | formal-pos,<br>formal-neg,<br>unformal-pos,<br>unformal-neg |
| ANEW.pkl | Affective Norms for English Words (ANEW) | English | |
| LSD2015.pkl | Lexicoder Sentiment Dictionary (2015) | English | Positive, Negative |
| NRC.pkl | NRC Word-Emotion Association Lexicon | English | fine-grained sentiment words |
| HuLiu.pkl | Hu & Liu (2004) | English | Positive, Negative |
| AFINN.pkl | AFINN sentiment lexicon | English | |
| ADV_CONJ.pkl | adverbs & conjunctions | Chinese | |
| STOPWORDS.pkl | | English & Chinese | stopword list |
| Concreteness.pkl | Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46, 904–911 | English | word & concreteness score |
| Chinese_Loughran_McDonald_Financial_Sentiment.pkl | 曾庆生, 周波, 张程, and 陈信元. "年报语调与内部人交易: 表里如一还是口是心非?." 管理世界 34, no. 09 (2018): 143-160. | Chinese | Positive and negative words |
| Chinese_Digitalization.pkl | 吴非,胡慧芷,林慧妍,任晓怡. 企业数字化转型与资本市场表现——来自股票流动性的经验证据[J]. 管理世界,2021,37(07):130-144+10. | Chinese | Chinese digitalization dictionary built from this paper, with keyword lists covering artificial intelligence, big data, cloud computing, blockchain, and digital technology applications |
| Loughran_McDonald_Financial_Sentiment.pkl | Loughran, Tim, and Bill McDonald. "When is a liability not a liability? Textual analysis, dictionaries, and 10‐Ks." The Journal of Finance 66, no. 1 (2011): 35-65. | English | Positive and negative words for the financial domain. The 2018 version also adds ``Uncertainty, Litigious, StrongModal, WeakModal, Constraining`` |
| Chinese_FLS.pkl | 许帅,邵帅,何贤杰.业绩说明会前瞻性信息对分析师盈余预测准确性的影响——信口雌黄还是言而有征[J].中国管理科学:1-15. | Chinese | Forward-looking statements dictionary containing 174 words |
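
Each pkl file above can be loaded with ``ct.load_pkl_dict`` (see section 1.4). As a minimal sketch, assuming the built-in dictionaries share the layout of the NRC example in section 1.4 (a top-level key matching the dictionary name), the categories of HuLiu.pkl could be inspected like this:

```python
import cntext as ct

# Sketch: inspect the categories of a built-in dictionary.
# Assumption: HuLiu.pkl mirrors the NRC.pkl layout shown in section 1.4,
# i.e. the top-level key matches the dictionary name.
huliu = ct.load_pkl_dict('HuLiu.pkl')
print(huliu.keys())           # e.g. dict_keys(['HuLiu', 'Desc', 'Referer'])
print(huliu['HuLiu'].keys())  # the sentiment categories of the lexicon
```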






<br>

### 1.4 load_pkl_dict 

load the pkl dictionary file and return dict type data.

```python
import cntext as ct

print(ct.__version__)
# load the pkl dictionary file
print(ct.load_pkl_dict('NRC.pkl'))
```

Run

```
1.8.0

{'NRC': {'anger': ['abandoned', 'abandonment', 'abhor', 'abhorrent', ...],
         'anticipation': ['accompaniment','achievement','acquiring', ...],
         'disgust': ['abject', 'abortion', 'abundance', 'abuse', ...],
         'fear': ['anxiety', 'anxious', 'apache', 'appalling', ...],
         ......
 
 'Desc': 'NRC Word-Emotion Association Lexicon', 
 'Referer': 'Mohammad, Saif M., and Peter D. Turney. "Nrc emotion lexicon." National Research Council, Canada 2 (2013).'
         }
```

<br>



### 1.5 sentiment

**sentiment(text, diction, lang='chinese')**

Count the occurrences of words from each emotion category in the text. The influence of adverbs and negation words on sentiment is not taken into account.

- **text**:  text string
- **diction**:  emotion dictionary data; supports DIY or built-in dictionaries
- **lang**: "chinese" or "english", default is "chinese"



We can use a built-in dictionary in cntext, such as NRC.pkl

```python
import cntext as ct

text = 'What a happy day!'

ct.sentiment(text=text,
             diction=ct.load_pkl_dict('NRC.pkl')['NRC'],
             lang='english')
```

Run

```
{'anger_num': 0,
 'anticipation_num': 1,
 'disgust_num': 0,
 'fear_num': 0,
 'joy_num': 1,
 'negative_num': 0,
 'positive_num': 1,
 'sadness_num': 0,
 'surprise_num': 0,
 'trust_num': 1,
 'stopword_num': 1,
 'word_num': 5,
 'sentence_num': 1}
```

We can also use a DIY dictionary, like this:

```python
import cntext as ct

text = 'What a happy day!'

diction = {'Pos': ['happy', 'good'],
           'Neg': ['bad', 'terrible'],
           'Adv': ['very']}

ct.sentiment(text=text,
             diction=diction,
             lang='english')
```

Run

```
{'Pos_num': 1,
 'Neg_num': 0,
 'Adv_num': 0,
 'stopword_num': 1,
 'word_num': 5,
 'sentence_num': 1}
```

<br>



### 1.6 sentiment_by_valence()

**sentiment_by_valence(text, diction, lang='english')**

Calculate a valence score for the text using a dictionary of words with valence ratings. The influence of intensity adverbs and negation words is not taken into account.

- text:  text string
- diction:  sentiment dictionary data with valence scores
- lang: "chinese" or "english"; default is "english"



Here we want to study the concreteness of text. The **concreteness.pkl** dictionary comes from Brysbaert et al. (2014).

>Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46, 904–911

```python
import cntext as ct

# load the concreteness.pkl dictionary file;  cntext version >=1.7.1
concreteness_df = ct.load_pkl_dict('concreteness.pkl')['concreteness']
concreteness_df.head()
```

Run

|| word | valence |
| ---: | :-------------- | ----------: |
|  0 | roadsweeper   |      4.85 |
|  1 | traindriver   |      4.54 |
|  2 | tush          |      4.45 |
|  3 | hairdress     |      3.93 |
|  4 | pharmaceutics |      3.77 |

<br>

```python
reply = "I'll go look for that"

score=ct.sentiment_by_valence(text=reply, 
                              diction=concreteness_df, 
                              lang='english')
score
```

Run

```
1.85
```



<br>

```python
employee_replys = ["I'll go look for that",
                   "I'll go search for that",
                   "I'll go search for that top",
                   "I'll go search for that t-shirt",
                   "I'll go look for that t-shirt in grey",
                   "I'll go search for that t-shirt in grey"]

for idx, reply in enumerate(employee_replys):
    score=ct.sentiment_by_valence(text=reply, 
                                  diction=concreteness_df, 
                                  lang='english')
    
    template = "Concreteness Score: {score:.2f} | Example-{idx}: {exmaple}"
    print(template.format(score=score, 
                          idx=idx, 
                          exmaple=reply))
    
ct.sentiment_by_valence(text=text, diction=concreteness_df, lang='english')
```

Run

```
Concreteness Score: 1.55 | Example-0: I'll go look for that
Concreteness Score: 1.55 | Example-1: I'll go search for that
Concreteness Score: 1.89 | Example-2: I'll go search for that top
Concreteness Score: 2.04 | Example-3: I'll go search for that t-shirt
Concreteness Score: 2.37 | Example-4: I'll go look for that t-shirt in grey
Concreteness Score: 2.37 | Example-5: I'll go search for that t-shirt in grey
```



<br><br>





## 2. dictionary

This module is used to build or expand a vocabulary (dictionary). It includes:

- **SoPmi**  co-occurrence (SO-PMI) algorithm to extend a vocabulary (dictionary); only supports Chinese
- **W2VModels**  uses word2vec to extend a vocabulary (dictionary); supports English and Chinese

### 2.1 SoPmi 

```python
import cntext as ct
import os

sopmier = ct.SoPmi(cwd=os.getcwd(),
                   # raw corpus data, a txt file; only Chinese data is supported for now
                   input_txt_file='data/sopmi_corpus.txt', 
                   # manually selected seed words (manually annotated initial seed words)
                   seedword_txt_file='data/sopmi_seed_words.txt', 
                   )   

sopmier.sopmi()
```

Run

```
Step 1/4:...Preprocess   Corpus ...
Step 2/4:...Collect co-occurrency information ...
Step 3/4:...Calculate   mutual information ...
Step 4/4:...Save    candidate words ...
Finish! used 44.49 s
```
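
For intuition, SoPmi is named after the SO-PMI (semantic orientation from pointwise mutual information) approach: a candidate word's orientation is its summed PMI with the positive seed words minus its summed PMI with the negative seed words. Below is a minimal standalone sketch of that scoring step; it is an illustration only, not cntext's internal implementation, and the count structures and seed lists are hypothetical.

```python
import math

def pmi(co_count, count_a, count_b, total):
    """Pointwise mutual information computed from raw counts."""
    if co_count == 0 or count_a == 0 or count_b == 0:
        return 0.0
    return math.log2((co_count * total) / (count_a * count_b))

def so_pmi(word, pos_seeds, neg_seeds, co_counts, word_counts, total):
    """SO-PMI score: summed PMI with positive seeds minus summed PMI with negative seeds."""
    score = 0.0
    for seed, sign in [(s, 1) for s in pos_seeds] + [(s, -1) for s in neg_seeds]:
        score += sign * pmi(co_counts.get((word, seed), 0),
                            word_counts.get(word, 0),
                            word_counts.get(seed, 0),
                            total)
    return score
```

Candidate words with a strongly positive score lean toward the positive seeds and can be added to that side of the expanded dictionary; strongly negative scores lean the other way.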



<br>

### 2.2 W2VModels 

**In particular, note that the lang parameter must be set in the code.**

```python
import cntext as ct
import os

#init W2VModels, corpus data w2v_corpus.txt
model = ct.W2VModels(cwd=os.getcwd(), lang='english')  
model.train(input_txt_file='data/w2v_corpus.txt')


# For each seed-word file, find the top 100 words most similar to that category's seed words
model.find(seedword_txt_file='data/w2v_seeds/integrity.txt', 
           topn=100)
model.find(seedword_txt_file='data/w2v_seeds/innovation.txt', 
           topn=100)
model.find(seedword_txt_file='data/w2v_seeds/quality.txt', 
           topn=100)
model.find(seedword_txt_file='data/w2v_seeds/respect.txt', 
           topn=100)
model.find(seedword_txt_file='data/w2v_seeds/teamwork.txt', 
           topn=100)
```

Run

```
Step 1/4:...Preprocess   corpus ...
Step 2/4:...Train  word2vec model
            used   174 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 187 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 187 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 187 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 187 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 187 s

```

<br>

### Note

After W2VModels finishes running, a file called **w2v.model** appears in the **output/w2v_candi_words** directory. This w2v model file can be reused later.

```python
from gensim.models import KeyedVectors

w2v_model = KeyedVectors.load("the path of w2v.model")
#to extract the vector for a word
#w2v_model.get_vector(word)
#for more information about the usage of w2v_model, use the help function
#help(w2v_model)
```

For example, we load the ``output/w2v_candi_words/w2v.model`` 

```python
from gensim.models import KeyedVectors

w2v_model = KeyedVectors.load('output/w2v_candi_words/w2v.model')
# find the most similar word in w2v.model
w2v_model.most_similar('innovation')
```

Run

```
[('technology', 0.689210832118988),
 ('infrastructure', 0.669672966003418),
 ('resources', 0.6695448160171509),
 ('talent', 0.6627111434936523),
 ('execution', 0.6549549102783203),
 ('marketing', 0.6533523797988892),
 ('merchandising', 0.6504817008972168),
 ('diversification', 0.6479553580284119),
 ('expertise', 0.6446896195411682),
 ('digital', 0.6326863765716553)]
```

<br>

```python
#to extract vector for "innovation"
w2v_model.get_vector('innovation')
```

Run

```
array([-0.45616838, -0.7799563 ,  0.56367606, -0.8570078 ,  0.600359  ,
       -0.6588043 ,  0.31116748, -0.11956959, -0.47599426,  0.21840936,
       -0.02268819,  0.1832016 ,  0.24452794,  0.01084935, -1.4213187 ,
        0.22840202,  0.46387577,  1.198386  , -0.621511  , -0.51598716,
        0.13352732,  0.04140598, -0.23470387,  0.6402956 ,  0.20394802,
        0.10799981,  0.24908689, -1.0117126 , -2.3168423 , -0.0402851 ,
        1.6886286 ,  0.5357047 ,  0.22932841, -0.6094084 ,  0.4515793 ,
       -0.5900931 ,  1.8684244 , -0.21056202,  0.29313338, -0.221067  ,
       -0.9535679 ,  0.07325   , -0.15823542,  1.1477109 ,  0.6716076 ,
       -1.0096023 ,  0.10605699,  1.4148282 ,  0.24576302,  0.5740349 ,
        0.19984631,  0.53964925,  0.41962907,  0.41497853, -1.0322098 ,
        0.01090925,  0.54345983,  0.806317  ,  0.31737605, -0.7965337 ,
        0.9282971 , -0.8775608 , -0.26852605, -0.06743863,  0.42815775,
       -0.11774074, -0.17956367,  0.88813037, -0.46279573, -1.0841943 ,
       -0.06798118,  0.4493006 ,  0.71962464, -0.02876493,  1.0282255 ,
       -1.1993176 , -0.38734904, -0.15875885, -0.81085825, -0.07678922,
       -0.16753489,  0.14065655, -1.8609751 ,  0.03587054,  1.2792674 ,
        1.2732009 , -0.74120265, -0.98000383,  0.4521185 , -0.26387128,
        0.37045383,  0.3680011 ,  0.7197629 , -0.3570571 ,  0.8016917 ,
        0.39243212, -0.5027844 , -1.2106236 ,  0.6412354 , -0.878307  ],
      dtype=float32)
```

<br><br>



### 2.3 co_occurrence_matrix

Generate a word co-occurrence matrix.

```python
import cntext as ct

documents = ["I go to school every day by bus .",
         "i go to theatre every night by bus"]

ct.co_occurrence_matrix(documents, 
                        window_size=2, 
                        lang='english')
```

![](img/co_occurrence1.png)
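
For intuition about the ``window_size`` parameter, here is a minimal standalone sketch of window-based co-occurrence counting; it is a generic illustration, not cntext's implementation.

```python
from collections import Counter

def co_occurrence_counts(documents, window_size=2):
    """Count how often two lowercase tokens appear within `window_size` positions of each other."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            # pair the current token with the next `window_size` tokens
            for j in range(i + 1, min(i + window_size + 1, len(tokens))):
                counts[tuple(sorted((word, tokens[j])))] += 1
    return counts

docs = ["I go to school every day by bus .",
        "i go to theatre every night by bus"]
print(co_occurrence_counts(docs, window_size=2).most_common(5))
```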



<br><br>



### 2.4  Glove

Build a GloVe model from English corpus data. The corpus file path is ``data/brown_corpus.txt``.

```python
import cntext as ct
import os

model = ct.Glove(cwd=os.getcwd(), lang='english')
model.create_vocab(file='data/brown_corpus.txt', min_count=5)
model.cooccurrence_matrix()
model.train_embeddings(vector_size=50, max_iter=25)
model.save()
```

Run

```
Step 1/4: ...Create vocabulary for Glove.
Step 2/4: ...Create cooccurrence matrix.
Step 3/4: ...Train glove embeddings. 
             Note, this part takes a long time to run
Step 3/4: ... Finish! Use 175.98 s
```

The generated word embedding model files are located in ``output/Glove``.
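
If the saved embeddings are in word2vec text format (an assumption; check the actual filename and format under ``output/Glove``), they could be loaded with gensim, similar to the w2v.model example above:

```python
from gensim.models import KeyedVectors

# Hypothetical path and format: adjust to the file actually produced under output/Glove.
glove_vectors = KeyedVectors.load_word2vec_format('output/Glove/glove_embeddings.txt',
                                                  binary=False)
# query a word that exists in the corpus vocabulary
print(glove_vectors.most_similar('government'))
```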

<br><br>



## 3. similarity

Four text similarity functions

- **cosine_sim(text1, text2)**
- **jaccard_sim(text1, text2)**   
- **minedit_sim(text1, text2)**  
- **simple_sim(text1, text2)** 

The algorithm implementations follow ``Cohen, Lauren, Christopher Malloy, and Quoc Nguyen. Lazy prices. No. w25084. National Bureau of Economic Research, 2018.``
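
As an illustration of one of these measures, below is a minimal sketch of Jaccard similarity over word sets; cntext's own tokenization and preprocessing may differ.

```python
def jaccard_similarity(text1, text2):
    """Jaccard similarity: |intersection| / |union| of the two word sets."""
    set1 = set(text1.lower().split())
    set2 = set(text2.lower().split())
    if not set1 and not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)

print(jaccard_similarity('Programming is fun!', 'Programming is interesting!'))
```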



<br>

```python
import cntext as ct 


text1 = 'Programming is fun!'
text2 = 'Programming is interesting!'

print(ct.cosine_sim(text1, text2))
print(ct.jaccard_sim(text1, text2))
print(ct.minedit_sim(text1, text2))
print(ct.simple_sim(text1, text2))
```

Run

```
0.67
0.50
1.00
0.90
```

<br><br>

## 4. Text2Mind

Word embeddings contain human cognitive information. 

- **tm.sematic_distance(words, c_words1, c_words2)**  
- **tm.sematic_projection(words, c_words1, c_words2)**  



### 4.1 tm.sematic_distance(words, c_words1, c_words2) 

Calculate two semantic distances and return the difference between them.

- **words**   concept words, words = ['program', 'software', 'computer']
- **c_words1**  concept words1,  c_words1 = ["man", "he", "him"]
- **c_words2**  concept words2, c_words2 = ["woman", "she", "her"]



For example, 

```
male_concept = ['male', 'man', 'he', 'him']

female_concept = ['female', 'woman', 'she', 'her']

software_engineer_concept  = ['engineer',  'programming',  'software']

d1 = distance(male_concept,  software_engineer_concept)

d2 = distance(female_concept,  software_engineer_concept)
```

If d1 - d2 < 0, it means that, in the semantic space, software_engineer_concept is closer to male_concept than to female_concept.

In other words, the corpus carries a stereotype (bias) associating software engineering with men rather than women.
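
A minimal sketch of this distance idea, using averaged word vectors and cosine distance (an illustration only; ``sematic_distance`` may be implemented differently):

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def group_vector(words, vectors):
    """Average the vectors of a word group; `vectors` maps word -> np.ndarray."""
    return np.mean([vectors[w] for w in words], axis=0)

def distance_gap(words, c_words1, c_words2, vectors):
    """d1 - d2: a negative value means `words` sit closer to c_words1."""
    target = group_vector(words, vectors)
    d1 = cosine_distance(target, group_vector(c_words1, vectors))
    d2 = cosine_distance(target, group_vector(c_words2, vectors))
    return d1 - d2
```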

[Download glove_w2v.6B.100d.txt from Google Drive](https://drive.google.com/file/d/1tuQB9PDx42z67ScEQrg650aDTYPz-elJ/view?usp=sharing)



```python
import cntext as ct

#Note: this is a word2vec format model
tm = ct.Text2Mind(w2v_model_path='glove_w2v.6B.100d.txt')

engineer = ['program', 'software', 'computer']
mans =  ["man", "he", "him"]
womans = ["woman", "she", "her"]


tm.sematic_distance(words=engineer, 
                    c_words1=mans, 
                    c_words2=womans)
```

Run

```
-0.38
```

A result of -0.38 means that, in the semantic space, engineer is closer to man than to woman.

<br>

### 4.2 tm.sematic_projection(words, c_words1, c_words2) 

To explain semantic projection on a word vector model, I use a figure from a 2022 Nature paper [@Grand2022SemanticPR]. Human cognitive information about animal size is hidden in corpus text. By projecting the vectors of different **animals** onto the direction defined by **LARGE WORDS** and **SMALL WORDS**, we obtain each animal's projection on the **size vector** (the red line in the picture below), so animal sizes can be compared by computation.

Calculate the projected length of each word vector onto the concept vector. Note that the result reflects the direction of the concept: **greater than 0 means semantically closer to c_words2**.
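
A minimal sketch of the projection idea following Grand et al. (an illustration only, not necessarily cntext's exact implementation): the concept direction is the difference between the averaged c_words2 and c_words1 vectors, and each word is scored by projecting its vector onto that direction.

```python
import numpy as np

def semantic_projection(words, c_words1, c_words2, vectors):
    """Project each word onto the direction from c_words1 (e.g. small) to c_words2 (e.g. large)."""
    small = np.mean([vectors[w] for w in c_words1], axis=0)
    large = np.mean([vectors[w] for w in c_words2], axis=0)
    direction = large - small
    direction = direction / np.linalg.norm(direction)
    scores = [(w, float(np.dot(vectors[w], direction))) for w in words]
    return sorted(scores, key=lambda pair: pair[1])
```

Words with larger projection values sit closer to the c_words2 end of the concept (here, "large").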



> Grand, G., Blank, I.A., Pereira, F. and Fedorenko, E., 2022. Semantic projection recovers rich human knowledge of multiple object features from word embeddings. _Nature Human Behaviour_, pp.1-13.







![](img/Nature_Semantic_projection_recovering_human_knowledge_of.png)

For example, the corpus may show that human beings perceive animals as having different sizes.

```python
animals = ['mouse', 'cat', 'horse',  'pig', 'whale']
small_words = ["small", "little", "tiny"]
large_words = ["large", "big", "huge"]

tm.sematic_projection(words=animals, 
                      c_words1=small_words, 
                      c_words2=large_words)
```

Run

```
[('mouse', -1.68),
 ('cat', -0.92),
 ('pig', -0.46),
 ('whale', -0.24),
 ('horse', 0.4)]
```

Regarding the perception of size, the text implies that humans see mice as smaller and horses as larger.

<br><br>



## Citation
If you use **cntext** in your research or in your project, please cite:


### apalike
```
Deng X., Nan P. (2022). cntext: a Python tool for text mining (version 1.7.9). DOI: 10.5281/zenodo.7063523 URL: https://github.com/hiDaDeng/cntext
```

### bibtex

```
@misc{YourReferenceHere,
author = {Deng, Xudong and Nan, Peng},
doi = {10.5281/zenodo.7063523},
month = {9},
title = {cntext: a Python tool for text mining},
url = {https://github.com/hiDaDeng/cntext},
year = {2022}
}
```

### endnote

```
%0 Generic
%A Deng, Xudong
%A Nan, Peng
%D 2022
%K text mining
%K text analysis
%K social science
%K management science
%K semantic analysis
%R 10.5281/zenodo.7063523
%T cntext: a Python tool for text mining
%U https://github.com/hiDaDeng/cntext
```

            
