envtext 0.1.4

- Home page: https://github.com/celtics1863/envtext
- Summary: envtext for Chinese text analysis in the environment domain
- Author: Bi Huaibin
- Requires Python: >=3.6
- Keywords: NLP, bert, Chinese, LSTM, RNN, domain text analysis
- Uploaded: 2023-04-19 04:47:30
# envText

[English](README-en.md)


The **first** text-analysis toolkit for the Chinese environmental domain.

Features:
1. :one: Supports **envBert**, a large-scale pretrained model for the Chinese environment domain!

2. :two: Supports large-scale pretrained **word vectors** for the Chinese environment domain!

3. :three: Supports an expert-curated **vocabulary** for the Chinese environment domain!

4. :four: **Everything is designed to serve domain experts' research**:
    - Streamlined neural-network interfaces that expose only essential parameters such as batch_size and learning_rate
    - Refined input/output interfaces on top of huggingface transformers, supporting 20+ dataset formats
    - One-line model usage, so domain experts can focus on the analysis itself

5. :five: Built on the transformers interface, making custom models easy to define


If you find this project useful or helpful, please click the star :star: in the top-right corner. Your support is our greatest motivation to keep maintaining it :metal:!



# Quick Start

## 1. Installation


```bash
pip install envtext
```
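
A quick sanity check that the package is importable after installation:

```bash
python -c "import envtext"
```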


## 2. Inference (without training)

Supported pretrained models:
```python
from envtext import Config
print(Config.pretrained_models)
```

| Task | Backbone | Model name | Number of labels | Description |
| ---- | ---- | ---- | ---- |  ---- |
| Masked LM | env-bert | celtics1863/env-bert-chinese | --- | [link](https://huggingface.co/celtics1863/env-bert-chinese) |
| News classification | env-bert | celtics1863/env-news-cls-bert | 8 classes | [link](https://huggingface.co/celtics1863/env-news-cls-bert) |
| Paper classification | env-bert | celtics1863/env-news-cls-bert | 10 classes | [link](https://huggingface.co/celtics1863/env-news-cls-bert) |
| Policy classification | env-bert | celtics1863/env-news-cls-bert | 15 classes | [link](https://huggingface.co/celtics1863/env-news-cls-bert) |
| Topic classification | env-bert | celtics1863/env-topic | 63 classes | [link](https://huggingface.co/celtics1863/env-topic) |
| POS/entity/term recognition | env-bert | celtics1863/pos-bert | 41 classes | [link](https://huggingface.co/celtics1863/pos-bert) |
| Masked LM | env-albert | celtics1863/env-albert-chinese | --- | [link](https://huggingface.co/celtics1863/env-albert-chinese) |
| News classification | env-albert | celtics1863/env-news-cls-albert | 8 classes | [link](https://huggingface.co/celtics1863/env-news-cls-albert) |
| Paper classification | env-albert | celtics1863/env-paper-cls-albert | 10 classes | [link](https://huggingface.co/celtics1863/env-paper-cls-albert) |
| Policy classification | env-albert | celtics1863/env-policy-cls-albert | 15 classes | [link](https://huggingface.co/celtics1863/env-policy-cls-albert) |
| Topic classification | env-albert | celtics1863/env-topic | 63 classes | [link](https://huggingface.co/celtics1863/env-topic-albert) |
| POS/entity/term recognition | env-albert | celtics1863/pos-ner-albert | 41 classes | [link](https://huggingface.co/celtics1863/pos-ner-albert) |
| Word vectors | word2vec | word2vec | ---- | [link](https://links.jianshu.com/go?to=https%3A%2F%2Farxiv.org%2Fabs%2F1301.3781v3) |
| Word vectors | env-bert | bert2vec | ---- | [link](https://huggingface.co/celtics1863/env-bert-chinese) |



#### 2.1 Environmental topic classification
```python
from envtext import AlbertCLS,Config
model = AlbertCLS(Config.albert.topic_cls)
model("在全球气候大会上,气候变化是各国政府都关心的话题")
```
<!-- ![](./fig/topic_albert.html) -->

![](./fig/topic_albert.png)


#### 2.2 Environmental news classification

```python
from envtext import AlbertCLS,Config
model = AlbertCLS(Config.albert.news_cls)
model("清洁能源基地建设对国家能源安全具有战略支撑作用。打造高质量的清洁能源基地的同时,也面临着一系列挑战,比如如何持续降低光储系统的度电成本、如何通过数字化的手段进一步提升运营与运维效率,如何更有效地提升光储系统的安全防护水平、如何在高比例新能源条件下实现稳定并网与消纳等。")
```

<!-- ![](./fig/news_albert.html) -->
![](./fig/news_albert.png)

#### 2.3 Environmental policy classification

```python
from envtext import AlbertCLS,Config
model = AlbertCLS(Config.albert.policy_cls)  # policy-classification config; attribute name assumed by analogy with news_cls/topic_cls (the original snippet reused news_cls)
model("两个《办法》适用于行政主管部门在依法行使监督管理职责中,对建设用地和农用地土壤污染责任人不明确或者存在争议的情况下,开展的土壤污染责任人认定活动。这是当前土壤污染责任人认定工作的重点。涉及民事纠纷的责任人认定应当依据民事法律予以确定,不适用本《办法》。")
```
<!-- ![](./fig/policy_albert.html) -->
![](./fig/policy_albert.png)


#### 2.4 Environmental term/entity/POS recognition

```python
from envtext import AlbertNER,Config
model = AlbertNER(Config.albert.pos_ner)
model("在全球气候大会上,气候变化是各国政府都关心的话题")
```
<!-- ![](./fig/pos_albert.svg) -->
![](./fig/pos_albert.png)

#### 2.5 word2vec word vectors

Load the model:
```python
from envtext.models import load_word2vec
model = load_word2vec()
```

Get a vector:
```python
model.get_vector('环境保护')
```
results:
```bash
array([-13.304651  ,  -3.1560812 ,   6.4074125 ,  -3.6906316 ,
        -1.4232658 ,   4.7912726 ,  -0.8003967 ,   4.0756955 ,
        -2.7932549 ,   4.029449  ,  -1.9410586 ,  -6.844793  ,
        -8.859059  ,  -0.93295586,   6.1359916 ,   1.9588425 ,
         2.625194  ,  -4.3848248 ,  -6.4393744 ,   6.0373173 ,
        -6.155831  ,  -6.4436955 ,   5.107795  , -11.209849  ,
         0.04123919,   1.286314  , -11.320914  ,  -6.475419  ,
         0.8528328 ,  -6.1932034 ,   2.0541244 ,  -3.3850324 ,
         4.284287  ,  -7.197888  ,  -2.6205683 ,   0.31572345,
         5.227246  ,   3.903521  ,  -2.5171268 ,   2.4655945 ,
        -5.5421305 ,   5.5044537 ,   6.984615  ,  -7.6862364 ,
         0.87583727,   0.03240405,   2.3616972 ,  -0.9396556 ,
         3.9617348 ,   0.6690969 , -10.708663  ,  -2.8534212 ,
        -0.8638448 ,  12.048176  ,   5.5968127 ,  -6.834452  ,
         6.9515004 ,   3.948555  ,  -4.527055  ,   4.389503  ,
        -0.47533572,   6.79178   ,  -0.8689579 ,  -2.7712438 ],
      dtype=float32)
```

Compute similarity:
```python
model.most_similar('环境保护')
```
results:
```bash
[('环保', 0.8425659537315369),
 ('生态环境保护', 0.7966809868812561),
 ('土壤环境保护', 0.7429764270782471),
 ('环境污染防治', 0.7383896708488464),
 ('生态保护', 0.6929160952568054),
 ('大气环境保护', 0.6914916634559631),
 ('应对气候变化', 0.6642681956291199),
 ('水污染防治', 0.6642411947250366),
 ('大气污染防治', 0.6606612801551819),
 ('环境管理', 0.6518533825874329)]
```
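
If, as the scores above suggest, `most_similar` is cosine-based (as in gensim, which `load_word2vec` presumably wraps), a minimal numpy sketch reproduces one of them from `get_vector` alone:

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = model.get_vector('环境保护')
v2 = model.get_vector('环保')
print(cosine_similarity(v1, v2))  # should roughly match the 0.8426 reported by most_similar
```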

#### 2.6 env-bert word vectors

Load the model:
```python
from envtext import Bert2Vec,Config
model = Bert2Vec(Config.bert.bert_mlm)
```
Get vectors:
```python
# get a word vector
model.get_vector('环境保护')
# get a sentence vector; the input is a pre-tokenized sentence
model.get_vector(["环境保护","人人有责"])
```
results:
```
array([ 1.4521e+00, -3.4131e-01,  6.8420e-02, -6.1371e-02,  2.9004e-01,
        1.8872e-01, -4.0405e-01,  4.1138e-01, -5.0000e-01,  5.2344e-01,
        5.9814e-01, -3.1396e-01,  3.0029e-01,  3.2959e-02,  1.6553e+00,
       -4.4800e-01,  1.0195e+00, -6.4697e-01,  3.0200e-01,  5.7080e-01,
        7.6599e-02,  3.4155e-01,  1.2805e-01, -2.1863e-01, -3.3398e-01,
        6.9092e-01,  4.2725e-01, -4.8364e-01,  7.8760e-01,  3.8940e-01,
        4.9927e-02, -7.1106e-02, -5.3271e-01, -4.8486e-01,  3.1665e-01,
        5.1367e-01,  8.8477e-01, -2.2302e-01,  1.9943e-02,  7.3047e-01,
       -1.5417e-01, -1.4206e-02, -5.2881e-01,  4.0674e-01,  2.7466e-01,
       -1.3940e-01,  5.2490e-01, -1.1514e+00, -4.2676e-01,  9.5508e-01,
       ...])
```

Compute similarity:
```python
model.add_words(
    [
        "环境污染",
        "水污染",
        "大气污染",
        "北京市",
        "上海市",
        "兰州市"
    ])
model.most_similar("郑州市")
```
results:
```bash
[('兰州市', 0.8755860328674316),
 ('北京市', 0.7335232496261597),
 ('上海市', 0.7241109013557434),
 ('大气污染', 0.471857488155365),
 ('水污染', 0.4557272493839264)]
```

#### 2.7 Cloze (masked-token prediction)

Mark the tokens to fill in with `[MASK]`:
```python
from envtext import BertMLM,Config
model = BertMLM(Config.bert_mlm)
model("在全球气候大会上,[MASK][MASK][MASK][MASK]是各国政府都关心的话题")
```
results:
```bash
text:在全球气候大会上,[MASK][MASK][MASK][MASK]是各国政府都关心的话题 
  predict: ['气', '体', '减', '少'] ; probability: 0.5166 
  predict: ['气', '体', '减', '排'] ; probability: 0.5166 
  predict: ['气', '体', '减', '碳'] ; probability: 0.5166 
  predict: ['气', '体', '减', '缓'] ; probability: 0.5166 
  predict: ['气', '体', '减', '量'] ; probability: 0.5166 
```


#### 2.8 Sentiment analysis

Predict sentiment intensity:
```python
from envtext import BertSA,Config
model = BertSA(Config.intensity_sa)
model("中国到现在都没有达到3000年的平均气温,现在就把近期时间气温上升跟工业革命联系起来是不是为时尚早?即便没有工业革命1743年中国北方的罕见高温,1743年7月20至25日,华北地区下午的气温均高于40℃。其中7月25日最热,气温高达44.4℃。这样的极端高温纪录,迄今从未被超越。民国三十一年(公元1942年)和公元1999年夏季,华北地区先后出现了两次极端高温纪录,分别为42.6℃、42.2℃,均低于乾隆八年的温度。又要算到什么头上呢?!!!")
```
results:
![](./fig/sa_bert.png)


#### 2.9 Entity extraction

Using a model trained on the CLUENER dataset:
```python
from envtext import BertNER,Config
model = BertNER(Config.bert.clue_ner)
model([
"生生不息CSOL生化狂潮让你填弹狂扫",
"那不勒斯vs锡耶纳以及桑普vs热那亚之上呢?",
"加勒比海盗3:世界尽头》的去年同期成绩死死甩在身后,后者则即将赶超《变形金刚》,",
"布鲁京斯研究所桑顿中国中心研究部主任李成说,东亚的和平与安全,是美国的“核心利益”之一。",
"此数据换算成亚洲盘罗马客场可让平半低水。",
],print_result=True)
```
results:
![](fig/cluener_bert.png)


## 3. Training and inference

With envtext, you can label a few examples, train your own model, and then use it to infer over the rest of your texts.

Currently supported models:

| Task | Bert models | Albert models | RNN models | Others |
| ------ | ------ | ------ | ------ | ----- | 
| Cloze (MLM) | BertMLM  | ------  |  ------  | ----- |
| Classification | BertCLS | AlbertCLS |   RNNCLS  |  CNNCLS, TFIDFCLS  |
| Sentiment analysis | BertSA  |  ----    |  RNNSA  |  ------  |
| Multiple choice |BertMultiChoice | AlbertMultiChoice  | RNNMultiChoice | ----- |
| Named entity recognition | BertNER  | AlbertNER   | RNNNER  | -----    |
| Nested NER | BertGP  | -----   | -----  | -----    |
| Relation classification | BertRelation  | ----   | ----  | -----    |
| Joint entity-relation extraction | BertTriple  | ----   | ----  | -----    |
| Word vectors |  Bert2vec  |  -----   |----- | Word2Vec |

Apart from text generation, most mainstream NLP tasks are covered.

The Bert and Albert models support the large-scale environment-domain pretrained models `envBert` and `envalbert`, as well as any other Bert model from huggingface transformers.

The RNN models (`LSTM`, `GRU`, and vanilla `RNN`) can be initialized either with environment-domain pretrained word vectors or with one-hot encodings.
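
As a concrete picture of that label-train-infer loop, the following sketch chains only calls that appear elsewhere in this README, using the built-in `isclimate` dataset and the CPU-friendly RNN classifier:

```python
from envtext.models import RNNCLS

model = RNNCLS()                    # lightweight baseline, fine on CPU
model.load_dataset('isclimate')     # built-in demo dataset (or your own labeled file)
model.train()                       # train on the labeled examples
model.save_model('classification')  # directory to save into

# infer over the remaining, unlabeled texts and export the predictions
model('气候变化是各国政府都关心的话题')
model.save_result('result.csv')
```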

### 3.1 Training

#### 3.1.1 Bert/Albert model training

```python
# load a bert model (e.g. a classification model)
from envtext.models import BertCLS
model = BertCLS('celtics1863/env-bert-chinese')

# # to use a custom dataset:
# model.load_dataset(file_path,task = 'cls',format = 'datasets-format')
# # use one of envtext's built-in datasets
model.load_dataset('isclimate')

# train the model
model.train()

# save the model
model.save_model('classification') # directory to save into
```
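
The commented `load_dataset` line above takes a file path plus `task` and `format` arguments. Purely as an illustrative assumption (the exact field names and format tag depend on which of the supported dataset formats you choose, so treat this jsonl layout as hypothetical rather than envtext's canonical schema), a tiny classification dataset could be written like this:

```python
import json

# hypothetical jsonl layout: one {"text": ..., "label": ...} record per line
samples = [
    {"text": "气候变化是各国政府都关心的话题", "label": "climate"},
    {"text": "本市将加强土壤污染防治工作", "label": "soil"},
]
with open("my_cls_dataset.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")

# model.load_dataset("my_cls_dataset.jsonl", task="cls", format="...")  # format name depends on your envtext version
```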

#### 3.1.2 RNN training

```python
# load an RNN model (e.g. a classification model)
from envtext.models import RNNCLS
model = RNNCLS()

# # to use a custom dataset:
# model.load_dataset(file_path,task = 'cls',format = 'datasets-format')
# # use one of envtext's built-in datasets
model.load_dataset('isclimate')

# train the model
model.train()

# save the model
model.save_model('classification') # directory to save into
```


### 3.2 Inference with your own models

#### 3.2.1 Inference with a custom bert model
```python
# load the model (from a directory or hub name)
from envtext.models import BertMLM
model = BertMLM('celtics1863/env-bert-chinese')

# predict; the input can be a str or List[str]
model('[MASK][MASK][MASK][MASK]是各国政府都关心的话题')

# export the results
model.save_result('result.csv')
```
#### 3.2.2 Inference with an RNN model

Run inference from a directory containing `pytorch_model.bin`:

```python
from envtext.models import RNNCLS

model = RNNCLS('local directory')

#predict
model('气候变化是各国政府都关心的话题')

#save result
model.save_result('result.csv')
```


## 4. Custom models

#### 4.1 Custom bert model

Define a regressor on top of a bert model:

```python
from envtext.models.bert_base import BertBase
import torch
from transformers import BertPreTrainedModel,BertModel

class MyBert(BertPreTrainedModel):
    def __init__(self, config):
        super(MyBert, self).__init__(config)
        self.bert = BertModel(config) #bert model
        self.regressor = torch.nn.Linear(config.hidden_size, 1) #regressor
        self.loss = torch.nn.MSELoss() #loss function
        
    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None,
              position_ids=None, inputs_embeds=None, head_mask=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask,
                            inputs_embeds=inputs_embeds)
        # use the [CLS] token representation
        cls_output = outputs[0][:,0,:] 

        #get logits 
        logits = self.regressor(cls_output)

        outputs = (logits,)
        
        # keep the interface consistent with bert: (loss, logits) with labels, (logits,) without
        if labels is not None: 
            loss = self.loss(logits.squeeze(),labels)
            outputs = (loss,) + outputs
        return outputs

```
Align with EnvText's interface:

```python
class MyBertModel(BertBase):
    # Override the initialization function
    def initialize_bert(self, path=None, config=None, **kwargs):
        super().initialize_bert(path, config, **kwargs)
        self.model = MyBert.from_pretrained(self.model_path)

    # [Optional] override the preprocessing function
    def preprocess(self, text, logits, **kwargs):
        text = text.replace("\n", "")
        return text

    # [Optional] override the postprocessing function
    def postprocess(self, text, logits, **kwargs):
        logits = logits.squeeze()
        return logits.tolist()

    # [Optional] called during training to compute metrics besides the loss
    def compute_metrics(eval_pred):
        from envtext.utils.metrics import metrics_for_reg
        return metrics_for_reg(eval_pred)

    # [Optional] align the parameters in the config
    def align_config(self):
        super().align_config()
        ## the config can be reset via self.update_config() or self.set_attribute()
        pass
```
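
Once aligned, the custom class can be driven like any built-in model. A hedged sketch, where the dataset path and the `task` tag are placeholders rather than envtext built-ins:

```python
model = MyBertModel('celtics1863/env-bert-chinese')
model.load_dataset('my_regression_data.json', task='reg', format='datasets-format')  # placeholder path and task tag
model.train()

# the regressor output comes back through postprocess() as a list of floats
model('该地区近十年的气温变化趋势明显')
```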

#### 4.2 Custom RNN model

RNN models are defined in much the same way.

First, implement the LSTM classification model itself:

```python
from torch import nn
import torch
class MyRNN(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.rnn = nn.LSTM(config.embed_size, config.hidden_size, config.num_layers, batch_first=True)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.loss_fn = nn.CrossEntropyLoss()  # loss used in forward()

    def forward(self, X, labels=None):
        X, _ = self.rnn(X)
        logits = self.classifier(X)

        # Align interfaces: output (loss, logits) when labels are present, (logits,) otherwise
        if labels is not None:
            loss = self.loss_fn(logits, labels)
            return (loss, logits)
        return (logits,)
```
Align with EnvText's interface:

```python
import numpy as np
from envtext.models.bert_base import BertBase

class MyRNNModel(BertBase):
    # Override the initialization function
    def initialize_bert(self, path=None, config=None, **kwargs):
        super().initialize_bert(path, config, **kwargs)  # keep unchanged
        self.model = MyRNN.from_pretrained(self.model_path)

    # [Optional] override postprocessing of the prediction results
    def postprocess(self, text, logits, print_result=True, save_result=True):
        pred = np.argmax(logits, axis=-1)
        return pred.tolist()

    # [Optional] override metrics to add metrics besides the loss during training
    def compute_metrics(eval_pred):
        return {}  # return a dict

    # [Optional] override align_config
    # Sometimes a model needs several inputs (e.g. the number of categories, or a
    # list of category names for classification); use this interface to align them.
    def align_config(self):
        super().align_config()
```

More detailed tutorials and examples will be added as [jupyter notebooks](notebooks).


## 5. Usage tips

1. Bert models are large. If you only have a CPU, start with an RNN model to get a first result and check whether your dataset's size and quality are adequate, then decide whether a Bert model is worth it. The envbert model typically leads RNN models by around 10 points, and its advantage grows as the dataset gets smaller.
2. Neural networks are sensitive to weight initialization, so every training run turns out differently; run several times and keep the best result (see the sketch after this list).
3. Learning rate, epochs, and batch size are the three most critical hyperparameters and must be tuned carefully for each dataset. The defaults reach reasonably good values in most cases, but certainly not the best possible.
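
Acting on tips 2 and 3 can be as simple as looping over several runs with the same calls used earlier and keeping each checkpoint (a sketch; picking the best run by its validation metrics is left manual here):

```python
from envtext.models import RNNCLS

for run in range(3):
    model = RNNCLS()
    model.load_dataset('isclimate')
    model.train()
    # keep every run in its own directory, then pick the best by its validation metrics
    model.save_model(f'classification_run{run}')
```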

# LICENSE
Apache License





            
