bkit

Name	bkit JSON
Version	0.0.8 JSON
	download
home_page	None
Summary	A python tool for Bangla text processing
upload_time	2025-01-22 04:59:34
maintainer	None
docs_url	None
author	None
requires_python	>=3.7
license	None
keywords	bangla kit bangla text processing banpipeline ner pos shallow parsing dependency parsing
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # bangla-text-processing-kit

A python tool kit for processing Bangla texts.

- [bangla-text-processing-kit](#bangla-text-processing-kit)
  - [How to use](#how-to-use)
    - [Installing](#installing)
    - [Checking text](#checking-text)
    - [Transforming text](#transforming-text)
      - [Normalizer](#normalizer)
      - [Character Normalization](#character-normalization)
      - [Punctuation space normalization](#punctuation-space-normalization)
      - [Zero width characters normalization](#zero-width-characters-normalization)
      - [Halant (হসন্ত) normalization](#halant-হসন্ত-normalization)
      - [Kar ambiguity](#kar-ambiguity)
      - [Clean text](#clean-text)
      - [Clean punctuations](#clean-punctuations)
      - [Clean digits](#clean-digits)
      - [Multiple spaces](#multiple-spaces)
      - [URLs](#urls)
      - [Emojis](#emojis)
      - [HTML tags](#html-tags)
      - [Multiple punctuations](#multiple-punctuations)
      - [Special characters](#special-characters)
      - [Non Bangla characters](#non-bangla-characters)
    - [Text Analysis](#text-analysis)
      - [Word count](#word-count)
      - [Sentence Count](#sentence-count)
    - [Lemmatization](#lemmatization)
      - [Lemmatize text](#lemmatize-text)
      - [Lemmatize word](#lemmatize-word)
    - [Tokenization](#tokenization)
      - [Word tokenization](#word-tokenization)
      - [Word and Punctuation tokenization](#word-and-punctuation-tokenization)
      - [Sentence tokenization](#sentence-tokenization)
    - [Named Entity Recognition (NER)](#named-entity-recognition-ner)
      - [Named Entity Recognition (NER) Visualization](#named-entity-recognition-ner-visualization)
    - [Parts of Speech (PoS) tagging](#parts-of-speech-pos-tagging)
      - [Parts of Speech (PoS) Visualization](#parts-of-speech-pos-visualization)
    - [Shallow Parsing (Constituency Parsing)](#shallow-parsing-constituency-parsing)
        - [Shallow Parsing Visualization](#shallow-parsing-visualization)
    - [Dependency Parsing](#dependency-parsing)
        - [Dependency Parsing Visualization](#dependency-parsing-visualization)
    - [Coreference Resolution](#coref-resolution)
      - [Coreference Resolution Visualization](#coreference-resolution-visualization)
    

## How to use<a id="how-to-use"></a>

### Installing<a id="installing"></a>

There are three installation options of the bkit package. These are:

1. `bkit`: The most basic version of bkit with the normalization, cleaning and tokenization capabilities.

```bash
pip install bkit
```
2. `bkit[lemma]`: Everything in the basic version plus lemmatization capability.

```bash
pip install bkit[lemma]
```
3. `bkit[all]`: Everything that are available in bkit including normalization, cleaning, tokenization, lemmatization, NER, POS and shallow parsing. 

```bash
pip install bkit[all]
```

### Checking text<a id="checking-text"></a>

- `bkit.utils.is_bangla(text) -> bool`: Checks if text contains only Bangla characters, digits, spaces, punctuations and some symbols. Returns true if so, else return false.
- `bkit.utils.is_digit(text) -> bool`: Checks if text contains only **Bangla digit** characters. Returns true if so, else return false.
- `bkit.utils.contains_digit(text, check_english_digits) -> bool`: Checks if text contains **any digits**. By default checks only Bangla digits. Returns true if so, else return false.
- `bkit.utils.contains_bangla(text) -> bool`: Checks if text contains **any Bangla character**. Returns true if so, else return false.

### Transforming text<a id="transforming-text"></a>

Text transformation includes the normalization and cleaning procedures. To transform text, use the `bkit.transform` module. Supported functionalities are:

#### Normalizer<a id="normalizer"></a>

This module normalize Bangla text using the following steps:

<!-- no toc -->
1. [Character normalization](#character-normalization)
2. [Zero width character normalization](#zero-width-characters-normalization)
3. [Halant normalization](#halant-হসন্ত-normalization)
4. [Vowel-kar normalization](#kar-ambiguity)
5. [Punctuation space normalization](#punctuation-space-normalization)

```python
import bkit

text = 'অাামাব় । '
print(list(text))
# >>> ['অ', 'া', 'া', 'ম', 'া', 'ব', '়', ' ', '।', ' ']

normalizer = bkit.transform.Normalizer(
    normalize_characters=True,
    normalize_zw_characters=True,
    normalize_halant=True,
    normalize_vowel_kar=True,
    normalize_punctuation_spaces=True
)

clean_text = normalizer(text)
print(clean_text, list(clean_text))
# >>> আমার। ['আ', 'ম', 'া', 'র', '।']
```

#### Character Normalization<a id="character-normalization"></a>

This module performs character normalization in Bangla text. It performs nukta normalization, Assamese normalization, Kar normalization, legacy character normalization and Punctuation normalization sequentially.

```python
import bkit

text = 'আমাব়'
print(list(text))
# >>> ['আ', 'ম', 'া', 'ব', '়']

text = bkit.transform.normalize_characters(text)

print(list(text))
# >>> ['আ', 'ম', 'া', 'র']
```

#### Punctuation space normalization<a id="punctuation-space-normalization"></a>

Normalizes punctuation spaces i.e. adds necessary spaces before or after specific punctuations, also removes if necessary.

```python
import bkit

text = 'রহিম(২৩)এ কথা বলেন   ।তিনি (    রহিম ) আরও জানান, ১,২৪,৩৫,৬৫৪.৩২৩ কোটি টাকা ব্যায়ে...'

clean_text = bkit.transform.normalize_punctuation_spaces(text)
print(clean_text)
# >>> রহিম (২৩) এ কথা বলেন। তিনি (রহিম) আরও জানান, ১,২৪,৩৫,৬৫৪.৩২৩ কোটি টাকা ব্যায়ে...
```

#### Zero width characters normalization<a id="zero-width-characters-normalization"></a>

There are two zero-width characters. These are Zero Width Joiner (ZWJ) and Zero Width Non Joiner (ZWNJ) characters. Generally ZWNJ is not used with Bangla texts and ZWJ joiner is used with `র` only. So, these characters are normalized based on these intuitions.

```python
import bkit

text = 'র‍্য‌াকেট'
print(f"text: {text} \t Characters: {list(text)}")
# >>> text: র‍্য‌াকেট     Characters: ['র', '\u200d', '্', 'য', '\u200c', 'া', 'ক', 'ে', 'ট']

clean_text = bkit.transform.normalize_zero_width_chars(text)
print(f"text: {clean_text} \t Characters: {list(clean_text)}")
# >>> text: র‍্যাকেট     Characters: ['র', '\u200d', '্', 'য', 'া', 'ক', 'ে', 'ট']
```

#### Halant (হসন্ত) normalization<a id="halant-হসন্ত-normalization"></a>

This function normalizes halant (হসন্ত) [`0x09CD`] in Bangla text. While using this function, it is recommended to normalize the zero width characters at first, e.g. using the `bkit.transform.normalize_zero_width_chars()` function.

During the normalization it also handles the `ত্ -> ৎ` conversion. For a valid conjunct letter (যুক্তবর্ণ) where 'ত' is the former character, can take one of 'ত', 'থ', 'ন', 'ব', 'ম', 'য', and 'র' as the next character. The conversion is perform based on this intuition.

During the halant normalization, the following cases are handled.

- Remove any leading and tailing halant of a word and/or text.
- Replace two or more consecutive occurrences of halant by a single halant.
- Remove halant between any characters that do not follow or precede a halant character. Like a halant that follows or precedes a vowel, kar, য়, etc will be removed.
- Remove multiple fola (multiple ref, ro-fola and jo-fola)

```python
import bkit

text = 'আসন্্্ন আসফাকুল্লাহ্‌ আলবত্‍ আলবত্ র‍্যাব ই্সি'
print(list(text))
# >>> ['আ', 'স', 'ন', '্', '্', '্', 'ন', ' ', 'আ', 'স', 'ফ', 'া', 'ক', 'ু', 'ল', '্', 'ল', 'া', 'হ', '্', '\u200c', ' ', 'আ', 'ল', 'ব', 'ত', '্', '\u200d', ' ', 'আ', 'ল', 'ব', 'ত', '্', ' ', 'র', '\u200d', '্', 'য', 'া', 'ব', ' ', 'ই', '্', 'স', 'ি']

clean_text = bkit.transform.normalize_zero_width_chars(text)
clean_text = bkit.transform.normalize_halant(clean_text)
print(clean_text, list(clean_text))
# >>> আসন্ন আসফাকুল্লাহ আলবৎ আলবৎ র‍্যাব ইসি ['আ', 'স', 'ন', '্', 'ন', ' ', 'আ', 'স', 'ফ', 'া', 'ক', 'ু', 'ল', '্', 'ল', 'া', 'হ', ' ', 'আ', 'ল', 'ব', 'ৎ', ' ', 'আ', 'ল', 'ব', 'ৎ', ' ', 'র', '\u200d', '্', 'য', 'া', 'ব', ' ', 'ই', 'স', 'ি']
```

#### Kar ambiguity<a id="kar-ambiguity"></a>

Normalizes kar ambiguity with vowels, ঁ, ং, and ঃ. It removes any kar that is preceded by a vowel or consonant diacritics like: `আা` will be normalized to `আ`. In case of consecutive occurrence of kars like: `কাাাী`, only the first kar will be kept like: `কা`.

```python
import bkit

text = 'অংশইে অংশগ্রহণইে আাারো এখনওো আলবার্তোে সাধুু কাাাী'
print(list(text))
# >>> ['অ', 'ং', 'শ', 'ই', 'ে', ' ', 'অ', 'ং', 'শ', 'গ', '্', 'র', 'হ', 'ণ', 'ই', 'ে', ' ', 'আ', 'া', 'া', 'র', 'ো', ' ', 'এ', 'খ', 'ন', 'ও', 'ো', ' ', 'আ', 'ল', 'ব', 'া', 'র', '্', 'ত', 'ো', 'ে', ' ', 'স', 'া', 'ধ', 'ু', 'ু', ' ', 'ক', 'া', 'া', 'া', 'ী']

clean_text = bkit.transform.normalize_kar_ambiguity(text)
print(clean_text, list(clean_text))
# >>> অংশই অংশগ্রহণই আরো এখনও আলবার্তো সাধু কা ['অ', 'ং', 'শ', 'ই', ' ', 'অ', 'ং', 'শ', 'গ', '্', 'র', 'হ', 'ণ', 'ই', ' ', 'আ', 'র', 'ো', ' ', 'এ', 'খ', 'ন', 'ও', ' ', 'আ', 'ল', 'ব', 'া', 'র', '্', 'ত', 'ো', ' ', 'স', 'া', 'ধ', 'ু', ' ', 'ক', 'া']
```

#### Clean text<a id="clean-text"></a>

Clean text using the following steps sequentially:

<!-- no toc -->
1. [Removes all HTML tags](#html-tags)
2. [Removes all URLs](#urls)
3. [Removes all emojis (optional)](#emojis)
4. [Removes all digits (optional)](#clean-digits)
5. [Removes all punctuations (optional)](#clean-punctuations)
6. [Removes all extra spaces](#multiple-spaces)
7. [Removes all non bangla characters](#non-bangla-characters)

```python
import bkit

text = '<a href=some_URL>বাংলাদেশ</a>\nবাংলাদেশের   আয়তন ১.৪৭ লক্ষ কিলোমিটার!!!'

clean_text = bkit.transform.clean_text(text)
print(clean_text)
# >>> বাংলাদেশ বাংলাদেশের আয়তন লক্ষ কিলোমিটার
```

#### Clean punctuations<a id="clean-punctuations"></a>

Remove punctuations with the given `replace_with` character/string.

```python
import bkit

text = 'আমরা মাঠে ফুটবল খেলতে পছন্দ করি!'

clean_text = bkit.transform.clean_punctuations(text)
print(clean_text)
# >>> আমরা মাঠে ফুটবল খেলতে পছন্দ করি

clean_text = bkit.transform.clean_punctuations(text, replace_with=' PUNC ')
print(clean_text)
# >>> আমরা মাঠে ফুটবল খেলতে পছন্দ করি PUNC
```

#### Clean digits<a id="clean-digits"></a>

Remove any bangla digit from text by replacing with the given `replace_with` character/string.

```python
import bkit

text = 'তার বাসা ৭৯ নাম্বার রোডে।'

clean_text = bkit.transform.clean_digits(text)
print(clean_text)
# >>> তার বাসা    নাম্বার রোডে।

clean_text = bkit.transform.clean_digits(text, replace_with='#')
print(clean_text)
# >>> তার বাসা ## নাম্বার রোডে।
```

#### Multiple spaces<a id="multiple-spaces"></a>

Clean multiple consecutive whitespace characters including space, newlines, tabs, vertical tabs, etc. It also removes leading and trailing whitespace characters.

```python
import bkit

text = 'তার বাসা ৭৯   \t\t নাম্বার   রোডে।\nসে খুব \v ভালো ছেলে।'

clean_text = bkit.transform.clean_multiple_spaces(text)
print(clean_text)
# >>> তার বাসা ৭৯ নাম্বার রোডে। সে খুব ভালো ছেলে।

clean_text = bkit.transform.clean_multiple_spaces(text, keep_new_line=True)
print(clean_text)
# >>> তার বাসা ৭৯ নাম্বার রোডে।\nসে খুব \n ভালো ছেলে।
```

#### URLs<a id="urls"></a>

Clean URLs from text and replace the URLs with any given string.

```python
import bkit

text = 'আমি https://xyz.abc সাইটে ব্লগ লিখি। এই ftp://10.17.5.23/books সার্ভার থেকে আমার বইগুলো পাবে। এই https://bn.wikipedia.org/wiki/%E0%A6%A7%E0%A6%BE%E0%A6%A4%E0%A7%81_(%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE_%E0%A6%AC%E0%A7%8D%E0%A6%AF%E0%A6%BE%E0%A6%95%E0%A6%B0%E0%A6%A3) লিঙ্কটিতে ভালো তথ্য আছে।'

clean_text = bkit.transform.clean_urls(text)
print(clean_text)
# >>> আমি   সাইটে ব্লগ লিখি। এই   সার্ভার থেকে আমার বইগুলো পাবে। এই   লিঙ্কটিতে ভালো তথ্য আছে।

clean_text = bkit.transform.clean_urls(text, replace_with='URL')
print(clean_text)
# >>> আমি URL সাইটে ব্লগ লিখি। এই URL সার্ভার থেকে আমার বইগুলো পাবে। এই URL লিঙ্কটিতে ভালো তথ্য আছে।
```

#### Emojis<a id="emojis"></a>

Clean emoji and emoticons from text and replace those with any given string.

```python
import bkit

text = 'কিছু ইমোজি হল: 😀🫅🏾🫅🏿🫃🏼🫃🏽🫃🏾🫃🏿🫄🫄🏻🫄🏼🫄🏽🫄🏾🫄🏿🧌🪸🪷🪹🪺🫘🫗🫙🛝🛞🛟🪬🪩🪫🩼🩻🫧🪪🟰'

clean_text = bkit.transform.clean_emojis(text, replace_with='<EMOJI>')
print(clean_text)
# >>> কিছু ইমোজি হল: <EMOJI>
```

#### HTML tags<a id="html-tags"></a>

Clean HTML tags from text and replace those with any given string.

```python
import bkit

text = '<a href=some_URL>বাংলাদেশ</a>'

clean_text = bkit.transform.clean_html(text)
print(clean_text)
# >>> বাংলাদেশ
```

#### Multiple punctuations<a id="multiple-punctuations"></a>

Remove multiple consecutive punctuations and keep the first punctuation only.

```python
import bkit

text = 'কি আনন্দ!!!!!'

clean_text = bkit.transform.clean_multiple_punctuations(text)
print(clean_text)
# >>> কি আনন্দ!
```

#### Special characters<a id="special-characters"></a>

Remove special characters like `$`, `#`, `@`, etc and replace them with the given string. If no character list is passed, `[$, #,  &, %, @]` are removed by default.

```python
import bkit

text = '#বাংলাদেশ$'

clean_text = bkit.transform.clean_special_characters(text, characters=['#', '$'], replace_with='')
print(clean_text)
# >>> বাংলাদেশ
```

#### Non Bangla characters<a id="non-bangla-characters"></a>

Non Bangla characters include characters and punctuation not used in Bangla like english or other language's alphabets  and replace them with the given string.

```python
import bkit

text = 'এই শূককীট হাতিশুঁড় Heliotropium indicum, অতসী, আকন্দ Calotropis gigantea গাছের পাতার রসালো অংশ আহার করে।'

clean_text = bkit.transform.clean_non_bangla(text, replace_with='')
print(clean_text)
# >>> এই শূককীট হাতিশুঁড়  , অতসী, আকন্দ  গাছের পাতার রসালো অংশ আহার করে
```

### Text Analysis<a id="text-analysis"></a>

#### Word count<a id="word-count"></a>

The `bkit.analysis.count_words` function can be used to get the word counts. It has the following paramerts:

```python
"""
Args:
  text (Tuple[str, List[str]]): The text to count words from. If a string is provided,
    it will be split into words. If a list of strings is provided, each string will
    be split into words and counted separately.
  clean_punctuation (bool, optional): Whether to clean punctuation from the words count. Defaults to False.
  punct_replacement (str, optional): The replacement for the punctuation. Only applicable if
    clean_punctuation is True. Defaults to "".
  return_dict (bool, optional): Whether to return the word count as a dictionary.
    Defaults to False.
  ordered (bool, optional): Whether to return the word count in descending order. Only
    applicable if return_dict is True. Defaults to False.

Returns:
  Tuple[int, Dict[str, int]]: If return_dict is True, returns a tuple containing the
    total word count and a dictionary where the keys are the words and the values
    are their respective counts. If return_dict is False, returns only the total
    word count as an integer.
"""

# examples

import bkit

text='অভিষেকের আগের দিন গতকাল রোববার ওয়াশিংটনে বিশাল এক সমাবেশে হাজির হন ট্রাম্প। তিনি উচ্ছ্বসিত ভক্ত-সমর্থকদের আমেরিকার পতনের যবনিকা ঘটানোর অঙ্গীকার করেন।'
total_words=bkit.analysis.count_words(text)
print(total_words)
# >>> 21

```

#### Sentence Count<a id="sentence-count"></a>

The bkit.analysis.count_sentences function can be used to get the word counts. It has the following paramerts:

```python
"""
Counts the number of sentences in the given text or list of texts.

Args:
  text (Tuple[str, List[str]]): The text or list of texts to count sentences from.
  return_dict (bool, optional): Whether to return the result as a dictionary. Defaults to False.
  ordered (bool, optional): Whether to order the result in descending order.
    Only applicable if return_dict is True. Defaults to False.

Returns:
  int or dict: The count of sentences. If return_dict is True, returns a dictionary with sentences as keys
    and their counts as values. If return_dict is False, returns the total count of sentences.

Raises:
  AssertionError: If ordered is True but return_dict is False.
"""

# examples
import bkit

text = 'তুমি কোথায় থাক? ঢাকা বাংলাদেশের\n রাজধানী। কি অবস্থা তার! ১২/০৩/২০২২ তারিখে সে ৪/ক ঠিকানায় গিয়ে ১২,৩৪৫.২৩ টাকা দিয়েছিল।'

count = bkit.analysis.count_sentences(text)
print(count)
# >>> 5

count = bkit.analysis.count_sentences(text, return_dict=True, ordered=True)
print(count)
# >>> {'তুমি কোথায় থাক?': 1, 'ঢাকা বাংলাদেশের\n': 1, 'রাজধানী।': 1, 'কি অবস্থা তার!': 1, '১২/০৩/২০২২ তারিখে সে ৪/ক ঠিকানায় গিয়ে ১২,৩৪৫.২৩ টাকা দিয়েছিল।': 1}
```

### Lemmatization<a id="lemmatization"></a>
Lemmatization is implemented based on our this paper **BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer**  

[For more details](https://arxiv.org/pdf/2311.03078)
#### Lemmatize text<a id="lemmatize-text"></a>

Lemmatize a given text. Generally expects the text to be a sentence.

```python
import bkit

text = 'পৃথিবীর জনসংখ্যা ৮ বিলিয়নের কিছু কম'

lemmatized = bkit.lemmatizer.lemmatize(text)

print(lemmatized)
# >>> পৃথিবী জনসংখ্যা ৮ বিলিয়ন কিছু কম
```

#### Lemmatize word<a id="lemmatize-word"></a>

Lemmatize a word given the PoS information.

```python
import bkit

text = 'পৃথিবীর'

lemmatized = bkit.lemmatizer.lemmatize_word(text, 'noun')

print(lemmatized)
# >>> পৃথিবী
```

### Tokenization<a id="tokenization"></a>

Tokenize a given text. The `bkit.tokenizer` module is used to tokenizer text into tokens. It supports three types of tokenization.

#### Word tokenization<a id="word-tokenization"></a>

Tokenize text into words. Also separates some punctuations including comma, danda (।), question mark, etc.

```python
import bkit

text = 'তুমি কোথায় থাক? ঢাকা বাংলাদেশের রাজধানী। কি অবস্থা তার! ১২/০৩/২০২২ তারিখে সে ৪/ক ঠিকানায় গিয়ে ১২,৩৪৫ টাকা দিয়েছিল।'

tokens = bkit.tokenizer.tokenize(text)

print(tokens)
# >>> ['তুমি', 'কোথায়', 'থাক', '?', 'ঢাকা', 'বাংলাদেশের', 'রাজধানী', '।', 'কি', 'অবস্থা', 'তার', '!', '১২/০৩/২০২২', 'তারিখে', 'সে', '৪/ক', 'ঠিকানায়', 'গিয়ে', '১২,৩৪৫', 'টাকা', 'দিয়েছিল', '।']
```

#### Word and Punctuation tokenization<a id="word-and-punctuation-tokenization"></a>

Tokenize text into words and any punctuation.

```python
import bkit

text = 'তুমি কোথায় থাক? ঢাকা বাংলাদেশের রাজধানী। কি অবস্থা তার! ১২/০৩/২০২২ তারিখে সে ৪/ক ঠিকানায় গিয়ে ১২,৩৪৫ টাকা দিয়েছিল।'

tokens = bkit.tokenizer.tokenize_word_punctuation(text)

print(tokens)
# >>> ['তুমি', 'কোথায়', 'থাক', '?', 'ঢাকা', 'বাংলাদেশের', 'রাজধানী', '।', 'কি', 'অবস্থা', 'তার', '!', '১২', '/', '০৩', '/', '২০২২', 'তারিখে', 'সে', '৪', '/', 'ক', 'ঠিকানায়', 'গিয়ে', '১২', ',', '৩৪৫', 'টাকা', 'দিয়েছিল', '।']
```

#### Sentence tokenization<a id="sentence-tokenization"></a>

Tokenize text into sentences.

```python
import bkit

text = 'তুমি কোথায় থাক? ঢাকা বাংলাদেশের রাজধানী। কি অবস্থা তার! ১২/০৩/২০২২ তারিখে সে ৪/ক ঠিকানায় গিয়ে ১২,৩৪৫ টাকা দিয়েছিল।'

tokens = bkit.tokenizer.tokenize_sentence(text)

print(tokens)
# >>> ['তুমি কোথায় থাক?', 'ঢাকা বাংলাদেশের রাজধানী।', 'কি অবস্থা তার!', '১২/০৩/২০২২ তারিখে সে ৪/ক ঠিকানায় গিয়ে ১২,৩৪৫ টাকা দিয়েছিল।']
```

### Named Entity Recognition (NER)<a id="named-entity-recognition-ner"></a>

Predicts the tags of the Named Entities of a given text.

```python
import bkit

text = 'তুমি কোথায় থাক? ঢাকা বাংলাদেশের রাজধানী। কি অবস্থা তার! ১২/০৩/২০২২ তারিখে সে ৪/ক ঠিকানায় গিয়ে ১২,৩৪৫.২৩ টাকা দিয়েছিল।'

ner = bkit.ner.Infer('ner-noisy-label')
predictions = ner(text)

print(predictions)
# >>> [('তুমি', 'O', 0.9998692), ('কোথায়', 'O', 0.99988306), ('থাক?', 'O', 0.99983954), ('ঢাকা', 'B-GPE', 0.99891424), ('বাংলাদেশের', 'B-GPE', 0.99710876), ('রাজধানী।', 'O', 0.9995414), ('কি', 'O', 0.99989176), ('অবস্থা', 'O', 0.99980336), ('তার!', 'O', 0.99983263), ('১২/০৩/২০২২', 'B-D&T', 0.97921854), ('তারিখে', 'O', 0.9271435), ('সে', 'O', 0.99934834), ('৪/ক', 'B-NUM', 0.8297553), ('ঠিকানায়', 'O', 0.99728775), ('গিয়ে', 'O', 0.9994825), ('১২,৩৪৫.২৩', 'B-NUM', 0.99740463), ('টাকা', 'B-UNIT', 0.99914896), ('দিয়েছিল।', 'O', 0.9998908)]
```

#### Named Entity Recognition (NER) Visualization<a id="named-entity-recognition-ner-visualization"></a>
It takes the model's output and visualizes the NER tag for every word in the text.

```python
import bkit

text = 'তুমি কোথায় থাক? ঢাকা বাংলাদেশের রাজধানী। কি অবস্থা তার! ১২/০৩/২০২২ তারিখে সে ৪/ক ঠিকানায় গিয়ে ১২,৩৪৫.২৩ টাকা দিয়েছিল।'
ner = bkit.ner.Infer('ner-noisy-label')
predictions = ner(text)
bkit.ner.visualize(predictions)
```
[![NER.png](https://i.postimg.cc/Zq2RrF0z/NER.png)](https://postimg.cc/SXLk49Wg)


### Parts of Speech (PoS) tagging<a id="parts-of-speech-pos-tagging"></a>

Predicts the tags of the parts of speech of a given text.

```python
import bkit

text = 'গত কিছুদিন ধরেই জ্বালানিহীন অবস্থায় একটি ছোট মাছ ধরার নৌকায় ১৫০ জন রোহিঙ্গা আন্দামান সাগরে ভাসমান অবস্থায় রয়েছে ।'
pos = bkit.pos.Infer('pos-noisy-label')
predictions = pos(text)

print(predictions)
# >>> [('গত', 'ADJ', 0.98674506), ('কিছুদিন', 'NNC', 0.97954935), ('ধরেই', 'PP', 0.96124), ('জ্বালানিহীন', 'ADJ', 0.93195957), ('অবস্থায়', 'NNC', 0.9960413), ('একটি', 'QF', 0.9912915), ('ছোট', 'ADJ', 0.9810739), ('মাছ', 'NNC', 0.97365385), ('ধরার', 'NNC', 0.96641904), ('নৌকায়', 'NNC', 0.99680626), ('১৫০', 'QF', 0.996005), ('জন', 'NNC', 0.99434316), ('রোহিঙ্গা', 'NNP', 0.9141038), ('আন্দামান', 'NNP', 0.9856694), ('সাগরে', 'NNP', 0.7122378), ('ভাসমান', 'ADJ', 0.93841994), ('অবস্থায়', 'NNC', 0.9965629), ('রয়েছে', 'VF', 0.99680847), ('।', 'PUNCT', 0.9963098)]

```
#### Parts of Speech (PoS) Visualization<a id="parts-of-speech-pos-visualization"></a>
"It takes the model's output and visualizes the Part-of-Speech tag for every word in the text.

```python
import bkit

text = 'গত কিছুদিন ধরেই জ্বালানিহীন অবস্থায় একটি ছোট মাছ ধরার নৌকায় ১৫০ জন রোহিঙ্গা আন্দামান সাগরে ভাসমান অবস্থায় রয়েছে ।'
pos = bkit.pos.Infer('pos-noisy-label')
predictions = pos(text)
bkit.pos.visualize(predictions)
```
[![pos.png](https://i.postimg.cc/8CjcJnpb/pos.png)](https://postimg.cc/3yQYz1Dy)


### Shallow Parsing (Constituency Parsing)<a id="shallow-parsing-constituency-parsing"></a>

Predicts the shallow parsing tags of a given text.

```python
import bkit

text = 'তুমি কোথায় থাক? ঢাকা বাংলাদেশের রাজধানী। কি অবস্থা তার! ১২/০৩/২০২২ তারিখে সে ৪/ক ঠিকানায় গিয়ে ১২,৩৪৫.২৩ টাকা দিয়েছিল।'
shallow = bkit.shallow.Infer(pos_model='pos-noisy-label')
predictions = shallow(text)
print(predictions)
# >>> (S (VP (NP (PRO তুমি)) (VP (ADVP (ADV কোথায়)) (VF থাক))) (NP (NNP ?) (NNP ঢাকা) (NNC বাংলাদেশের)) (ADVP (ADV রাজধানী)) (NP (NP (NP (NNC ।)) (NP (PRO কি))) (NP (QF অবস্থা) (NNC তার)) (NP (PRO !))) (NP (NP (QF ১২/০৩/২০২২) (NNC তারিখে)) (VNF সে) (NP (QF ৪/ক) (NNC ঠিকানায়))) (VF গিয়ে))
```

#### Shallow Parsing Visualization<a id="shallow-parsing-visualization"></a>
It converts model predictions into an interactive shallow parsing Tree for clear and intuitive analysis

```python
from bkit import shallow
text = "কাতার বিশ্বকাপে আর্জেন্টিনার বিশ্বকাপ জয়ে মার্তিনেজের অবদান অনেক।"
shallow = shallow.Infer(pos_model='pos-noisy-label')
predictions = shallow(text)
shallow.visualize(predictions)
```

[![shallow.png](https://i.postimg.cc/sgfvygbq/shallow.png)](https://postimg.cc/vcjQtbYt)


### Dependency Parsing<a id="dependency-parsing"></a>

Predicts the dependency parsing tags of a given text.

```python
from bkit import dependency

text = "কাতার বিশ্বকাপে আর্জেন্টিনার বিশ্বকাপ জয়ে মার্তিনেজের অবদান অনেক।"
dep =dependency.Infer('dependency-parsing')
predictions = dep(text)
print(predictions)
# >>>[{'text': 'কাতার বিশ্বকাপে আর্জেন্টিনার বিশ্বকাপ জয়ে মার্তিনেজের অবদান অনেক ।', 'predictions': [{'token_start': 1, 'token_end': 0, 'label': 'compound'}, {'token_start': 7, 'token_end': 1, 'label': 'obl'}, {'token_start': 4, 'token_end': 2, 'label': 'nmod'}, {'token_start': 4, 'token_end': 3, 'label': 'nmod'}, {'token_start': 7, 'token_end': 4, 'label': 'obl'}, {'token_start': 6, 'token_end': 5, 'label': 'nmod'}, {'token_start': 7, 'token_end': 6, 'label': 'nsubj'}, {'token_start': 7, 'token_end': 7, 'label': 'root'}, {'token_start': 7, 'token_end': 8, 'label': 'punct'}]}]
```

#### Dependency Parsing Visualization<a id="dependency-parsing-visualization"></a>
It converts model predictions into an interactive dependency graph for clear and intuitive analysis

```python
from bkit import dependency
text = "কাতার বিশ্বকাপে আর্জেন্টিনার বিশ্বকাপ জয়ে মার্তিনেজের অবদান অনেক।"
dep = dependency.Infer('dependency-parsing')
predictions = dep(text)
dependency.visualize(predictions)
```

[![dependency-visu.png](https://i.postimg.cc/MpsXG8ST/dependency-visu.png)](https://postimg.cc/K1Mm9w4S)

### Coreference Resolution<a id="coref-resolution"></a>

Predicts the coreferent clusters of a given text.

```python
import bkit

text = "তারাসুন্দরী ( ১৮৭৮ - ১৯৪৮ ) অভিনেত্রী । ১৮৮৪ সালে বিনোদিনীর সহায়তায় স্টার থিয়েটারে যোগদানের মাধ্যমে তিনি অভিনয় শুরু করেন । প্রথমে তিনি গিরিশচন্দ্র ঘোষের চৈতন্যলীলা নাটকে এক বালক ও সরলা নাটকে গোপাল চরিত্রে অভিনয় করেন ।"
coref = bkit.coref.Infer('coref')
predictions = coref(text)
print(predictions)
# >>> {'text': ['তারাসুন্দরী', '(', '১৮৭৮', '-', '১৯৪৮', ')', 'অভিনেত্রী', '।', '১৮৮৪', 'সালে', 'বিনোদিনীর', 'সহায়তায়', 'স্টার', 'থিয়েটারে', 'যোগদানের', 'মাধ্যমে', 'তিনি', 'অভিনয়', 'শুরু', 'করেন', '।', 'প্রথমে', 'তিনি', 'গিরিশচন্দ্র', 'ঘোষের', 'চৈতন্যলীলা', 'নাটকে', 'এক', 'বালক', 'ও', 'সরলা', 'নাটকে', 'গোপাল', 'চরিত্রে', 'অভিনয়', 'করেন', '।'], 'mention_indices': {0: [{'start_token': 0, 'end_token': 0}, {'start_token': 6, 'end_token': 6}, {'start_token': 10, 'end_token': 10}, {'start_token': 16, 'end_token': 16}, {'start_token': 22, 'end_token': 22}]}}
```

#### Coreference Resolution Visualization<a id="coreference-resolution-visualization"></a>
It takes the model's output and creates an interactive visualization to clearly depict coreference resolution, highlighting the relationships between entities in the text

```python
from bkit import coref

text = "তারাসুন্দরী ( ১৮৭৮ - ১৯৪৮ ) অভিনেত্রী । ১৮৮৪ সালে বিনোদিনীর সহায়তায় স্টার থিয়েটারে যোগদানের মাধ্যমে তিনি অভিনয় শুরু করেন । প্রথমে তিনি গিরিশচন্দ্র ঘোষের চৈতন্যলীলা নাটকে এক বালক ও সরলা নাটকে গোপাল চরিত্রে অভিনয় করেন ।"
coref = coref.Infer('coref')
predictions = coref(text)
coref.visualize(predictions)
```

[![coref.png](https://i.postimg.cc/26W5TbBb/coref.png)](https://postimg.cc/cgsZLJh0)

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "bkit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "bangla kit, bangla text processing, banpipeline, ner, pos, shallow parsing, dependency parsing",
    "author": null,
    "author_email": "banglagov <manage.bangla@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/00/c9/fcc1a9372b6aa3cb2ff348b0c0d15ed6864d75feac2899561e79eace7cfd/bkit-0.0.8.tar.gz",
    "platform": null,
    "description": "# bangla-text-processing-kit\n\nA python tool kit for processing Bangla texts.\n\n- [bangla-text-processing-kit](#bangla-text-processing-kit)\n  - [How to use](#how-to-use)\n    - [Installing](#installing)\n    - [Checking text](#checking-text)\n    - [Transforming text](#transforming-text)\n      - [Normalizer](#normalizer)\n      - [Character Normalization](#character-normalization)\n      - [Punctuation space normalization](#punctuation-space-normalization)\n      - [Zero width characters normalization](#zero-width-characters-normalization)\n      - [Halant (\u09b9\u09b8\u09a8\u09cd\u09a4) normalization](#halant-\u09b9\u09b8\u09a8\u09cd\u09a4-normalization)\n      - [Kar ambiguity](#kar-ambiguity)\n      - [Clean text](#clean-text)\n      - [Clean punctuations](#clean-punctuations)\n      - [Clean digits](#clean-digits)\n      - [Multiple spaces](#multiple-spaces)\n      - [URLs](#urls)\n      - [Emojis](#emojis)\n      - [HTML tags](#html-tags)\n      - [Multiple punctuations](#multiple-punctuations)\n      - [Special characters](#special-characters)\n      - [Non Bangla characters](#non-bangla-characters)\n    - [Text Analysis](#text-analysis)\n      - [Word count](#word-count)\n      - [Sentence Count](#sentence-count)\n    - [Lemmatization](#lemmatization)\n      - [Lemmatize text](#lemmatize-text)\n      - [Lemmatize word](#lemmatize-word)\n    - [Tokenization](#tokenization)\n      - [Word tokenization](#word-tokenization)\n      - [Word and Punctuation tokenization](#word-and-punctuation-tokenization)\n      - [Sentence tokenization](#sentence-tokenization)\n    - [Named Entity Recognition (NER)](#named-entity-recognition-ner)\n      - [Named Entity Recognition (NER) Visualization](#named-entity-recognition-ner-visualization)\n    - [Parts of Speech (PoS) tagging](#parts-of-speech-pos-tagging)\n      - [Parts of Speech (PoS) Visualization](#parts-of-speech-pos-visualization)\n    - [Shallow Parsing (Constituency Parsing)](#shallow-parsing-constituency-parsing)\n        - [Shallow Parsing Visualization](#shallow-parsing-visualization)\n    - [Dependency Parsing](#dependency-parsing)\n        - [Dependency Parsing Visualization](#dependency-parsing-visualization)\n    - [Coreference Resolution](#coref-resolution)\n      - [Coreference Resolution Visualization](#coreference-resolution-visualization)\n    \n\n## How to use<a id=\"how-to-use\"></a>\n\n### Installing<a id=\"installing\"></a>\n\nThere are three installation options of the bkit package. These are:\n\n1. `bkit`: The most basic version of bkit with the normalization, cleaning and tokenization capabilities.\n\n```bash\npip install bkit\n```\n2. `bkit[lemma]`: Everything in the basic version plus lemmatization capability.\n\n```bash\npip install bkit[lemma]\n```\n3. `bkit[all]`: Everything that are available in bkit including normalization, cleaning, tokenization, lemmatization, NER, POS and shallow parsing. \n\n```bash\npip install bkit[all]\n```\n\n### Checking text<a id=\"checking-text\"></a>\n\n- `bkit.utils.is_bangla(text) -> bool`: Checks if text contains only Bangla characters, digits, spaces, punctuations and some symbols. Returns true if so, else return false.\n- `bkit.utils.is_digit(text) -> bool`: Checks if text contains only **Bangla digit** characters. Returns true if so, else return false.\n- `bkit.utils.contains_digit(text, check_english_digits) -> bool`: Checks if text contains **any digits**. By default checks only Bangla digits. Returns true if so, else return false.\n- `bkit.utils.contains_bangla(text) -> bool`: Checks if text contains **any Bangla character**. Returns true if so, else return false.\n\n### Transforming text<a id=\"transforming-text\"></a>\n\nText transformation includes the normalization and cleaning procedures. To transform text, use the `bkit.transform` module. Supported functionalities are:\n\n#### Normalizer<a id=\"normalizer\"></a>\n\nThis module normalize Bangla text using the following steps:\n\n<!-- no toc -->\n1. [Character normalization](#character-normalization)\n2. [Zero width character normalization](#zero-width-characters-normalization)\n3. [Halant normalization](#halant-\u09b9\u09b8\u09a8\u09cd\u09a4-normalization)\n4. [Vowel-kar normalization](#kar-ambiguity)\n5. [Punctuation space normalization](#punctuation-space-normalization)\n\n```python\nimport bkit\n\ntext = '\u0985\u09be\u09be\u09ae\u09be\u09ac\u09bc \u0964 '\nprint(list(text))\n# >>> ['\u0985', '\u09be', '\u09be', '\u09ae', '\u09be', '\u09ac', '\u09bc', ' ', '\u0964', ' ']\n\nnormalizer = bkit.transform.Normalizer(\n    normalize_characters=True,\n    normalize_zw_characters=True,\n    normalize_halant=True,\n    normalize_vowel_kar=True,\n    normalize_punctuation_spaces=True\n)\n\nclean_text = normalizer(text)\nprint(clean_text, list(clean_text))\n# >>> \u0986\u09ae\u09be\u09b0\u0964 ['\u0986', '\u09ae', '\u09be', '\u09b0', '\u0964']\n```\n\n#### Character Normalization<a id=\"character-normalization\"></a>\n\nThis module performs character normalization in Bangla text. It performs nukta normalization, Assamese normalization, Kar normalization, legacy character normalization and Punctuation normalization sequentially.\n\n```python\nimport bkit\n\ntext = '\u0986\u09ae\u09be\u09ac\u09bc'\nprint(list(text))\n# >>> ['\u0986', '\u09ae', '\u09be', '\u09ac', '\u09bc']\n\ntext = bkit.transform.normalize_characters(text)\n\nprint(list(text))\n# >>> ['\u0986', '\u09ae', '\u09be', '\u09b0']\n```\n\n#### Punctuation space normalization<a id=\"punctuation-space-normalization\"></a>\n\nNormalizes punctuation spaces i.e. adds necessary spaces before or after specific punctuations, also removes if necessary.\n\n```python\nimport bkit\n\ntext = '\u09b0\u09b9\u09bf\u09ae(\u09e8\u09e9)\u098f \u0995\u09a5\u09be \u09ac\u09b2\u09c7\u09a8   \u0964\u09a4\u09bf\u09a8\u09bf (    \u09b0\u09b9\u09bf\u09ae ) \u0986\u09b0\u0993 \u099c\u09be\u09a8\u09be\u09a8, \u09e7,\u09e8\u09ea,\u09e9\u09eb,\u09ec\u09eb\u09ea.\u09e9\u09e8\u09e9 \u0995\u09cb\u099f\u09bf \u099f\u09be\u0995\u09be \u09ac\u09cd\u09af\u09be\u09df\u09c7...'\n\nclean_text = bkit.transform.normalize_punctuation_spaces(text)\nprint(clean_text)\n# >>> \u09b0\u09b9\u09bf\u09ae (\u09e8\u09e9) \u098f \u0995\u09a5\u09be \u09ac\u09b2\u09c7\u09a8\u0964 \u09a4\u09bf\u09a8\u09bf (\u09b0\u09b9\u09bf\u09ae) \u0986\u09b0\u0993 \u099c\u09be\u09a8\u09be\u09a8, \u09e7,\u09e8\u09ea,\u09e9\u09eb,\u09ec\u09eb\u09ea.\u09e9\u09e8\u09e9 \u0995\u09cb\u099f\u09bf \u099f\u09be\u0995\u09be \u09ac\u09cd\u09af\u09be\u09df\u09c7...\n```\n\n#### Zero width characters normalization<a id=\"zero-width-characters-normalization\"></a>\n\nThere are two zero-width characters. These are Zero Width Joiner (ZWJ) and Zero Width Non Joiner (ZWNJ) characters. Generally ZWNJ is not used with Bangla texts and ZWJ joiner is used with `\u09b0` only. So, these characters are normalized based on these intuitions.\n\n```python\nimport bkit\n\ntext = '\u09b0\u200d\u09cd\u09af\u200c\u09be\u0995\u09c7\u099f'\nprint(f\"text: {text} \\t Characters: {list(text)}\")\n# >>> text: \u09b0\u200d\u09cd\u09af\u200c\u09be\u0995\u09c7\u099f     Characters: ['\u09b0', '\\u200d', '\u09cd', '\u09af', '\\u200c', '\u09be', '\u0995', '\u09c7', '\u099f']\n\nclean_text = bkit.transform.normalize_zero_width_chars(text)\nprint(f\"text: {clean_text} \\t Characters: {list(clean_text)}\")\n# >>> text: \u09b0\u200d\u09cd\u09af\u09be\u0995\u09c7\u099f     Characters: ['\u09b0', '\\u200d', '\u09cd', '\u09af', '\u09be', '\u0995', '\u09c7', '\u099f']\n```\n\n#### Halant (\u09b9\u09b8\u09a8\u09cd\u09a4) normalization<a id=\"halant-\u09b9\u09b8\u09a8\u09cd\u09a4-normalization\"></a>\n\nThis function normalizes halant (\u09b9\u09b8\u09a8\u09cd\u09a4) [`0x09CD`] in Bangla text. While using this function, it is recommended to normalize the zero width characters at first, e.g. using the `bkit.transform.normalize_zero_width_chars()` function.\n\nDuring the normalization it also handles the `\u09a4\u09cd -> \u09ce` conversion. For a valid conjunct letter (\u09af\u09c1\u0995\u09cd\u09a4\u09ac\u09b0\u09cd\u09a3) where '\u09a4' is the former character, can take one of '\u09a4', '\u09a5', '\u09a8', '\u09ac', '\u09ae', '\u09af', and '\u09b0' as the next character. The conversion is perform based on this intuition.\n\nDuring the halant normalization, the following cases are handled.\n\n- Remove any leading and tailing halant of a word and/or text.\n- Replace two or more consecutive occurrences of halant by a single halant.\n- Remove halant between any characters that do not follow or precede a halant character. Like a halant that follows or precedes a vowel, kar, \u09df, etc will be removed.\n- Remove multiple fola (multiple ref, ro-fola and jo-fola)\n\n```python\nimport bkit\n\ntext = '\u0986\u09b8\u09a8\u09cd\u09cd\u09cd\u09a8 \u0986\u09b8\u09ab\u09be\u0995\u09c1\u09b2\u09cd\u09b2\u09be\u09b9\u09cd\u200c \u0986\u09b2\u09ac\u09a4\u09cd\u200d \u0986\u09b2\u09ac\u09a4\u09cd \u09b0\u200d\u09cd\u09af\u09be\u09ac \u0987\u09cd\u09b8\u09bf'\nprint(list(text))\n# >>> ['\u0986', '\u09b8', '\u09a8', '\u09cd', '\u09cd', '\u09cd', '\u09a8', ' ', '\u0986', '\u09b8', '\u09ab', '\u09be', '\u0995', '\u09c1', '\u09b2', '\u09cd', '\u09b2', '\u09be', '\u09b9', '\u09cd', '\\u200c', ' ', '\u0986', '\u09b2', '\u09ac', '\u09a4', '\u09cd', '\\u200d', ' ', '\u0986', '\u09b2', '\u09ac', '\u09a4', '\u09cd', ' ', '\u09b0', '\\u200d', '\u09cd', '\u09af', '\u09be', '\u09ac', ' ', '\u0987', '\u09cd', '\u09b8', '\u09bf']\n\nclean_text = bkit.transform.normalize_zero_width_chars(text)\nclean_text = bkit.transform.normalize_halant(clean_text)\nprint(clean_text, list(clean_text))\n# >>> \u0986\u09b8\u09a8\u09cd\u09a8 \u0986\u09b8\u09ab\u09be\u0995\u09c1\u09b2\u09cd\u09b2\u09be\u09b9 \u0986\u09b2\u09ac\u09ce \u0986\u09b2\u09ac\u09ce \u09b0\u200d\u09cd\u09af\u09be\u09ac \u0987\u09b8\u09bf ['\u0986', '\u09b8', '\u09a8', '\u09cd', '\u09a8', ' ', '\u0986', '\u09b8', '\u09ab', '\u09be', '\u0995', '\u09c1', '\u09b2', '\u09cd', '\u09b2', '\u09be', '\u09b9', ' ', '\u0986', '\u09b2', '\u09ac', '\u09ce', ' ', '\u0986', '\u09b2', '\u09ac', '\u09ce', ' ', '\u09b0', '\\u200d', '\u09cd', '\u09af', '\u09be', '\u09ac', ' ', '\u0987', '\u09b8', '\u09bf']\n```\n\n#### Kar ambiguity<a id=\"kar-ambiguity\"></a>\n\nNormalizes kar ambiguity with vowels, \u0981, \u0982, and \u0983. It removes any kar that is preceded by a vowel or consonant diacritics like: `\u0986\u09be` will be normalized to `\u0986`. In case of consecutive occurrence of kars like: `\u0995\u09be\u09be\u09be\u09c0`, only the first kar will be kept like: `\u0995\u09be`.\n\n```python\nimport bkit\n\ntext = '\u0985\u0982\u09b6\u0987\u09c7 \u0985\u0982\u09b6\u0997\u09cd\u09b0\u09b9\u09a3\u0987\u09c7 \u0986\u09be\u09be\u09b0\u09cb \u098f\u0996\u09a8\u0993\u09cb \u0986\u09b2\u09ac\u09be\u09b0\u09cd\u09a4\u09cb\u09c7 \u09b8\u09be\u09a7\u09c1\u09c1 \u0995\u09be\u09be\u09be\u09c0'\nprint(list(text))\n# >>> ['\u0985', '\u0982', '\u09b6', '\u0987', '\u09c7', ' ', '\u0985', '\u0982', '\u09b6', '\u0997', '\u09cd', '\u09b0', '\u09b9', '\u09a3', '\u0987', '\u09c7', ' ', '\u0986', '\u09be', '\u09be', '\u09b0', '\u09cb', ' ', '\u098f', '\u0996', '\u09a8', '\u0993', '\u09cb', ' ', '\u0986', '\u09b2', '\u09ac', '\u09be', '\u09b0', '\u09cd', '\u09a4', '\u09cb', '\u09c7', ' ', '\u09b8', '\u09be', '\u09a7', '\u09c1', '\u09c1', ' ', '\u0995', '\u09be', '\u09be', '\u09be', '\u09c0']\n\nclean_text = bkit.transform.normalize_kar_ambiguity(text)\nprint(clean_text, list(clean_text))\n# >>> \u0985\u0982\u09b6\u0987 \u0985\u0982\u09b6\u0997\u09cd\u09b0\u09b9\u09a3\u0987 \u0986\u09b0\u09cb \u098f\u0996\u09a8\u0993 \u0986\u09b2\u09ac\u09be\u09b0\u09cd\u09a4\u09cb \u09b8\u09be\u09a7\u09c1 \u0995\u09be ['\u0985', '\u0982', '\u09b6', '\u0987', ' ', '\u0985', '\u0982', '\u09b6', '\u0997', '\u09cd', '\u09b0', '\u09b9', '\u09a3', '\u0987', ' ', '\u0986', '\u09b0', '\u09cb', ' ', '\u098f', '\u0996', '\u09a8', '\u0993', ' ', '\u0986', '\u09b2', '\u09ac', '\u09be', '\u09b0', '\u09cd', '\u09a4', '\u09cb', ' ', '\u09b8', '\u09be', '\u09a7', '\u09c1', ' ', '\u0995', '\u09be']\n```\n\n#### Clean text<a id=\"clean-text\"></a>\n\nClean text using the following steps sequentially:\n\n<!-- no toc -->\n1. [Removes all HTML tags](#html-tags)\n2. [Removes all URLs](#urls)\n3. [Removes all emojis (optional)](#emojis)\n4. [Removes all digits (optional)](#clean-digits)\n5. [Removes all punctuations (optional)](#clean-punctuations)\n6. [Removes all extra spaces](#multiple-spaces)\n7. [Removes all non bangla characters](#non-bangla-characters)\n\n```python\nimport bkit\n\ntext = '<a href=some_URL>\u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6</a>\\n\u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0   \u0986\u09df\u09a4\u09a8 \u09e7.\u09ea\u09ed \u09b2\u0995\u09cd\u09b7 \u0995\u09bf\u09b2\u09cb\u09ae\u09bf\u099f\u09be\u09b0!!!'\n\nclean_text = bkit.transform.clean_text(text)\nprint(clean_text)\n# >>> \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6 \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0 \u0986\u09df\u09a4\u09a8 \u09b2\u0995\u09cd\u09b7 \u0995\u09bf\u09b2\u09cb\u09ae\u09bf\u099f\u09be\u09b0\n```\n\n#### Clean punctuations<a id=\"clean-punctuations\"></a>\n\nRemove punctuations with the given `replace_with` character/string.\n\n```python\nimport bkit\n\ntext = '\u0986\u09ae\u09b0\u09be \u09ae\u09be\u09a0\u09c7 \u09ab\u09c1\u099f\u09ac\u09b2 \u0996\u09c7\u09b2\u09a4\u09c7 \u09aa\u099b\u09a8\u09cd\u09a6 \u0995\u09b0\u09bf!'\n\nclean_text = bkit.transform.clean_punctuations(text)\nprint(clean_text)\n# >>> \u0986\u09ae\u09b0\u09be \u09ae\u09be\u09a0\u09c7 \u09ab\u09c1\u099f\u09ac\u09b2 \u0996\u09c7\u09b2\u09a4\u09c7 \u09aa\u099b\u09a8\u09cd\u09a6 \u0995\u09b0\u09bf\n\nclean_text = bkit.transform.clean_punctuations(text, replace_with=' PUNC ')\nprint(clean_text)\n# >>> \u0986\u09ae\u09b0\u09be \u09ae\u09be\u09a0\u09c7 \u09ab\u09c1\u099f\u09ac\u09b2 \u0996\u09c7\u09b2\u09a4\u09c7 \u09aa\u099b\u09a8\u09cd\u09a6 \u0995\u09b0\u09bf PUNC\n```\n\n#### Clean digits<a id=\"clean-digits\"></a>\n\nRemove any bangla digit from text by replacing with the given `replace_with` character/string.\n\n```python\nimport bkit\n\ntext = '\u09a4\u09be\u09b0 \u09ac\u09be\u09b8\u09be \u09ed\u09ef \u09a8\u09be\u09ae\u09cd\u09ac\u09be\u09b0 \u09b0\u09cb\u09a1\u09c7\u0964'\n\nclean_text = bkit.transform.clean_digits(text)\nprint(clean_text)\n# >>> \u09a4\u09be\u09b0 \u09ac\u09be\u09b8\u09be    \u09a8\u09be\u09ae\u09cd\u09ac\u09be\u09b0 \u09b0\u09cb\u09a1\u09c7\u0964\n\nclean_text = bkit.transform.clean_digits(text, replace_with='#')\nprint(clean_text)\n# >>> \u09a4\u09be\u09b0 \u09ac\u09be\u09b8\u09be ## \u09a8\u09be\u09ae\u09cd\u09ac\u09be\u09b0 \u09b0\u09cb\u09a1\u09c7\u0964\n```\n\n#### Multiple spaces<a id=\"multiple-spaces\"></a>\n\nClean multiple consecutive whitespace characters including space, newlines, tabs, vertical tabs, etc. It also removes leading and trailing whitespace characters.\n\n```python\nimport bkit\n\ntext = '\u09a4\u09be\u09b0 \u09ac\u09be\u09b8\u09be \u09ed\u09ef   \\t\\t \u09a8\u09be\u09ae\u09cd\u09ac\u09be\u09b0   \u09b0\u09cb\u09a1\u09c7\u0964\\n\u09b8\u09c7 \u0996\u09c1\u09ac \\v \u09ad\u09be\u09b2\u09cb \u099b\u09c7\u09b2\u09c7\u0964'\n\nclean_text = bkit.transform.clean_multiple_spaces(text)\nprint(clean_text)\n# >>> \u09a4\u09be\u09b0 \u09ac\u09be\u09b8\u09be \u09ed\u09ef \u09a8\u09be\u09ae\u09cd\u09ac\u09be\u09b0 \u09b0\u09cb\u09a1\u09c7\u0964 \u09b8\u09c7 \u0996\u09c1\u09ac \u09ad\u09be\u09b2\u09cb \u099b\u09c7\u09b2\u09c7\u0964\n\nclean_text = bkit.transform.clean_multiple_spaces(text, keep_new_line=True)\nprint(clean_text)\n# >>> \u09a4\u09be\u09b0 \u09ac\u09be\u09b8\u09be \u09ed\u09ef \u09a8\u09be\u09ae\u09cd\u09ac\u09be\u09b0 \u09b0\u09cb\u09a1\u09c7\u0964\\n\u09b8\u09c7 \u0996\u09c1\u09ac \\n \u09ad\u09be\u09b2\u09cb \u099b\u09c7\u09b2\u09c7\u0964\n```\n\n#### URLs<a id=\"urls\"></a>\n\nClean URLs from text and replace the URLs with any given string.\n\n```python\nimport bkit\n\ntext = '\u0986\u09ae\u09bf https://xyz.abc \u09b8\u09be\u0987\u099f\u09c7 \u09ac\u09cd\u09b2\u0997 \u09b2\u09bf\u0996\u09bf\u0964 \u098f\u0987 ftp://10.17.5.23/books \u09b8\u09be\u09b0\u09cd\u09ad\u09be\u09b0 \u09a5\u09c7\u0995\u09c7 \u0986\u09ae\u09be\u09b0 \u09ac\u0987\u0997\u09c1\u09b2\u09cb \u09aa\u09be\u09ac\u09c7\u0964 \u098f\u0987 https://bn.wikipedia.org/wiki/%E0%A6%A7%E0%A6%BE%E0%A6%A4%E0%A7%81_(%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE_%E0%A6%AC%E0%A7%8D%E0%A6%AF%E0%A6%BE%E0%A6%95%E0%A6%B0%E0%A6%A3) \u09b2\u09bf\u0999\u09cd\u0995\u099f\u09bf\u09a4\u09c7 \u09ad\u09be\u09b2\u09cb \u09a4\u09a5\u09cd\u09af \u0986\u099b\u09c7\u0964'\n\nclean_text = bkit.transform.clean_urls(text)\nprint(clean_text)\n# >>> \u0986\u09ae\u09bf   \u09b8\u09be\u0987\u099f\u09c7 \u09ac\u09cd\u09b2\u0997 \u09b2\u09bf\u0996\u09bf\u0964 \u098f\u0987   \u09b8\u09be\u09b0\u09cd\u09ad\u09be\u09b0 \u09a5\u09c7\u0995\u09c7 \u0986\u09ae\u09be\u09b0 \u09ac\u0987\u0997\u09c1\u09b2\u09cb \u09aa\u09be\u09ac\u09c7\u0964 \u098f\u0987   \u09b2\u09bf\u0999\u09cd\u0995\u099f\u09bf\u09a4\u09c7 \u09ad\u09be\u09b2\u09cb \u09a4\u09a5\u09cd\u09af \u0986\u099b\u09c7\u0964\n\nclean_text = bkit.transform.clean_urls(text, replace_with='URL')\nprint(clean_text)\n# >>> \u0986\u09ae\u09bf URL \u09b8\u09be\u0987\u099f\u09c7 \u09ac\u09cd\u09b2\u0997 \u09b2\u09bf\u0996\u09bf\u0964 \u098f\u0987 URL \u09b8\u09be\u09b0\u09cd\u09ad\u09be\u09b0 \u09a5\u09c7\u0995\u09c7 \u0986\u09ae\u09be\u09b0 \u09ac\u0987\u0997\u09c1\u09b2\u09cb \u09aa\u09be\u09ac\u09c7\u0964 \u098f\u0987 URL \u09b2\u09bf\u0999\u09cd\u0995\u099f\u09bf\u09a4\u09c7 \u09ad\u09be\u09b2\u09cb \u09a4\u09a5\u09cd\u09af \u0986\u099b\u09c7\u0964\n```\n\n#### Emojis<a id=\"emojis\"></a>\n\nClean emoji and emoticons from text and replace those with any given string.\n\n```python\nimport bkit\n\ntext = '\u0995\u09bf\u099b\u09c1 \u0987\u09ae\u09cb\u099c\u09bf \u09b9\u09b2: \ud83d\ude00\ud83e\udec5\ud83c\udffe\ud83e\udec5\ud83c\udfff\ud83e\udec3\ud83c\udffc\ud83e\udec3\ud83c\udffd\ud83e\udec3\ud83c\udffe\ud83e\udec3\ud83c\udfff\ud83e\udec4\ud83e\udec4\ud83c\udffb\ud83e\udec4\ud83c\udffc\ud83e\udec4\ud83c\udffd\ud83e\udec4\ud83c\udffe\ud83e\udec4\ud83c\udfff\ud83e\uddcc\ud83e\udeb8\ud83e\udeb7\ud83e\udeb9\ud83e\udeba\ud83e\uded8\ud83e\uded7\ud83e\uded9\ud83d\udedd\ud83d\udede\ud83d\udedf\ud83e\udeac\ud83e\udea9\ud83e\udeab\ud83e\ude7c\ud83e\ude7b\ud83e\udee7\ud83e\udeaa\ud83d\udff0'\n\nclean_text = bkit.transform.clean_emojis(text, replace_with='<EMOJI>')\nprint(clean_text)\n# >>> \u0995\u09bf\u099b\u09c1 \u0987\u09ae\u09cb\u099c\u09bf \u09b9\u09b2: <EMOJI>\n```\n\n#### HTML tags<a id=\"html-tags\"></a>\n\nClean HTML tags from text and replace those with any given string.\n\n```python\nimport bkit\n\ntext = '<a href=some_URL>\u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6</a>'\n\nclean_text = bkit.transform.clean_html(text)\nprint(clean_text)\n# >>> \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\n```\n\n#### Multiple punctuations<a id=\"multiple-punctuations\"></a>\n\nRemove multiple consecutive punctuations and keep the first punctuation only.\n\n```python\nimport bkit\n\ntext = '\u0995\u09bf \u0986\u09a8\u09a8\u09cd\u09a6!!!!!'\n\nclean_text = bkit.transform.clean_multiple_punctuations(text)\nprint(clean_text)\n# >>> \u0995\u09bf \u0986\u09a8\u09a8\u09cd\u09a6!\n```\n\n#### Special characters<a id=\"special-characters\"></a>\n\nRemove special characters like `$`, `#`, `@`, etc and replace them with the given string. If no character list is passed, `[$, #,  &, %, @]` are removed by default.\n\n```python\nimport bkit\n\ntext = '#\u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6$'\n\nclean_text = bkit.transform.clean_special_characters(text, characters=['#', '$'], replace_with='')\nprint(clean_text)\n# >>> \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\n```\n\n#### Non Bangla characters<a id=\"non-bangla-characters\"></a>\n\nNon Bangla characters include characters and punctuation not used in Bangla like english or other language's alphabets  and replace them with the given string.\n\n```python\nimport bkit\n\ntext = '\u098f\u0987 \u09b6\u09c2\u0995\u0995\u09c0\u099f \u09b9\u09be\u09a4\u09bf\u09b6\u09c1\u0981\u09a1\u09bc Heliotropium indicum, \u0985\u09a4\u09b8\u09c0, \u0986\u0995\u09a8\u09cd\u09a6 Calotropis gigantea \u0997\u09be\u099b\u09c7\u09b0 \u09aa\u09be\u09a4\u09be\u09b0 \u09b0\u09b8\u09be\u09b2\u09cb \u0985\u0982\u09b6 \u0986\u09b9\u09be\u09b0 \u0995\u09b0\u09c7\u0964'\n\nclean_text = bkit.transform.clean_non_bangla(text, replace_with='')\nprint(clean_text)\n# >>> \u098f\u0987 \u09b6\u09c2\u0995\u0995\u09c0\u099f \u09b9\u09be\u09a4\u09bf\u09b6\u09c1\u0981\u09a1\u09bc  , \u0985\u09a4\u09b8\u09c0, \u0986\u0995\u09a8\u09cd\u09a6  \u0997\u09be\u099b\u09c7\u09b0 \u09aa\u09be\u09a4\u09be\u09b0 \u09b0\u09b8\u09be\u09b2\u09cb \u0985\u0982\u09b6 \u0986\u09b9\u09be\u09b0 \u0995\u09b0\u09c7\n```\n\n### Text Analysis<a id=\"text-analysis\"></a>\n\n#### Word count<a id=\"word-count\"></a>\n\nThe `bkit.analysis.count_words` function can be used to get the word counts. It has the following paramerts:\n\n```python\n\"\"\"\nArgs:\n  text (Tuple[str, List[str]]): The text to count words from. If a string is provided,\n    it will be split into words. If a list of strings is provided, each string will\n    be split into words and counted separately.\n  clean_punctuation (bool, optional): Whether to clean punctuation from the words count. Defaults to False.\n  punct_replacement (str, optional): The replacement for the punctuation. Only applicable if\n    clean_punctuation is True. Defaults to \"\".\n  return_dict (bool, optional): Whether to return the word count as a dictionary.\n    Defaults to False.\n  ordered (bool, optional): Whether to return the word count in descending order. Only\n    applicable if return_dict is True. Defaults to False.\n\nReturns:\n  Tuple[int, Dict[str, int]]: If return_dict is True, returns a tuple containing the\n    total word count and a dictionary where the keys are the words and the values\n    are their respective counts. If return_dict is False, returns only the total\n    word count as an integer.\n\"\"\"\n\n# examples\n\nimport bkit\n\ntext='\u0985\u09ad\u09bf\u09b7\u09c7\u0995\u09c7\u09b0 \u0986\u0997\u09c7\u09b0 \u09a6\u09bf\u09a8 \u0997\u09a4\u0995\u09be\u09b2 \u09b0\u09cb\u09ac\u09ac\u09be\u09b0 \u0993\u09df\u09be\u09b6\u09bf\u0982\u099f\u09a8\u09c7 \u09ac\u09bf\u09b6\u09be\u09b2 \u098f\u0995 \u09b8\u09ae\u09be\u09ac\u09c7\u09b6\u09c7 \u09b9\u09be\u099c\u09bf\u09b0 \u09b9\u09a8 \u099f\u09cd\u09b0\u09be\u09ae\u09cd\u09aa\u0964 \u09a4\u09bf\u09a8\u09bf \u0989\u099a\u09cd\u099b\u09cd\u09ac\u09b8\u09bf\u09a4 \u09ad\u0995\u09cd\u09a4-\u09b8\u09ae\u09b0\u09cd\u09a5\u0995\u09a6\u09c7\u09b0 \u0986\u09ae\u09c7\u09b0\u09bf\u0995\u09be\u09b0 \u09aa\u09a4\u09a8\u09c7\u09b0 \u09af\u09ac\u09a8\u09bf\u0995\u09be \u0998\u099f\u09be\u09a8\u09cb\u09b0 \u0985\u0999\u09cd\u0997\u09c0\u0995\u09be\u09b0 \u0995\u09b0\u09c7\u09a8\u0964'\ntotal_words=bkit.analysis.count_words(text)\nprint(total_words)\n# >>> 21\n\n```\n\n#### Sentence Count<a id=\"sentence-count\"></a>\n\nThe bkit.analysis.count_sentences function can be used to get the word counts. It has the following paramerts:\n\n```python\n\"\"\"\nCounts the number of sentences in the given text or list of texts.\n\nArgs:\n  text (Tuple[str, List[str]]): The text or list of texts to count sentences from.\n  return_dict (bool, optional): Whether to return the result as a dictionary. Defaults to False.\n  ordered (bool, optional): Whether to order the result in descending order.\n    Only applicable if return_dict is True. Defaults to False.\n\nReturns:\n  int or dict: The count of sentences. If return_dict is True, returns a dictionary with sentences as keys\n    and their counts as values. If return_dict is False, returns the total count of sentences.\n\nRaises:\n  AssertionError: If ordered is True but return_dict is False.\n\"\"\"\n\n# examples\nimport bkit\n\ntext = '\u09a4\u09c1\u09ae\u09bf \u0995\u09cb\u09a5\u09be\u09df \u09a5\u09be\u0995? \u09a2\u09be\u0995\u09be \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0\\n \u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0\u0964 \u0995\u09bf \u0985\u09ac\u09b8\u09cd\u09a5\u09be \u09a4\u09be\u09b0! \u09e7\u09e8/\u09e6\u09e9/\u09e8\u09e6\u09e8\u09e8 \u09a4\u09be\u09b0\u09bf\u0996\u09c7 \u09b8\u09c7 \u09ea/\u0995 \u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df \u0997\u09bf\u09df\u09c7 \u09e7\u09e8,\u09e9\u09ea\u09eb.\u09e8\u09e9 \u099f\u09be\u0995\u09be \u09a6\u09bf\u09df\u09c7\u099b\u09bf\u09b2\u0964'\n\ncount = bkit.analysis.count_sentences(text)\nprint(count)\n# >>> 5\n\ncount = bkit.analysis.count_sentences(text, return_dict=True, ordered=True)\nprint(count)\n# >>> {'\u09a4\u09c1\u09ae\u09bf \u0995\u09cb\u09a5\u09be\u09df \u09a5\u09be\u0995?': 1, '\u09a2\u09be\u0995\u09be \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0\\n': 1, '\u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0\u0964': 1, '\u0995\u09bf \u0985\u09ac\u09b8\u09cd\u09a5\u09be \u09a4\u09be\u09b0!': 1, '\u09e7\u09e8/\u09e6\u09e9/\u09e8\u09e6\u09e8\u09e8 \u09a4\u09be\u09b0\u09bf\u0996\u09c7 \u09b8\u09c7 \u09ea/\u0995 \u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df \u0997\u09bf\u09df\u09c7 \u09e7\u09e8,\u09e9\u09ea\u09eb.\u09e8\u09e9 \u099f\u09be\u0995\u09be \u09a6\u09bf\u09df\u09c7\u099b\u09bf\u09b2\u0964': 1}\n```\n\n### Lemmatization<a id=\"lemmatization\"></a>\nLemmatization is implemented based on our this paper **BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer**  \n\n[For more details](https://arxiv.org/pdf/2311.03078)\n#### Lemmatize text<a id=\"lemmatize-text\"></a>\n\nLemmatize a given text. Generally expects the text to be a sentence.\n\n```python\nimport bkit\n\ntext = '\u09aa\u09c3\u09a5\u09bf\u09ac\u09c0\u09b0 \u099c\u09a8\u09b8\u0982\u0996\u09cd\u09af\u09be \u09ee \u09ac\u09bf\u09b2\u09bf\u09df\u09a8\u09c7\u09b0 \u0995\u09bf\u099b\u09c1 \u0995\u09ae'\n\nlemmatized = bkit.lemmatizer.lemmatize(text)\n\nprint(lemmatized)\n# >>> \u09aa\u09c3\u09a5\u09bf\u09ac\u09c0 \u099c\u09a8\u09b8\u0982\u0996\u09cd\u09af\u09be \u09ee \u09ac\u09bf\u09b2\u09bf\u09df\u09a8 \u0995\u09bf\u099b\u09c1 \u0995\u09ae\n```\n\n#### Lemmatize word<a id=\"lemmatize-word\"></a>\n\nLemmatize a word given the PoS information.\n\n```python\nimport bkit\n\ntext = '\u09aa\u09c3\u09a5\u09bf\u09ac\u09c0\u09b0'\n\nlemmatized = bkit.lemmatizer.lemmatize_word(text, 'noun')\n\nprint(lemmatized)\n# >>> \u09aa\u09c3\u09a5\u09bf\u09ac\u09c0\n```\n\n### Tokenization<a id=\"tokenization\"></a>\n\nTokenize a given text. The `bkit.tokenizer` module is used to tokenizer text into tokens. It supports three types of tokenization.\n\n#### Word tokenization<a id=\"word-tokenization\"></a>\n\nTokenize text into words. Also separates some punctuations including comma, danda (\u0964), question mark, etc.\n\n```python\nimport bkit\n\ntext = '\u09a4\u09c1\u09ae\u09bf \u0995\u09cb\u09a5\u09be\u09df \u09a5\u09be\u0995? \u09a2\u09be\u0995\u09be \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0 \u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0\u0964 \u0995\u09bf \u0985\u09ac\u09b8\u09cd\u09a5\u09be \u09a4\u09be\u09b0! \u09e7\u09e8/\u09e6\u09e9/\u09e8\u09e6\u09e8\u09e8 \u09a4\u09be\u09b0\u09bf\u0996\u09c7 \u09b8\u09c7 \u09ea/\u0995 \u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df \u0997\u09bf\u09df\u09c7 \u09e7\u09e8,\u09e9\u09ea\u09eb \u099f\u09be\u0995\u09be \u09a6\u09bf\u09df\u09c7\u099b\u09bf\u09b2\u0964'\n\ntokens = bkit.tokenizer.tokenize(text)\n\nprint(tokens)\n# >>> ['\u09a4\u09c1\u09ae\u09bf', '\u0995\u09cb\u09a5\u09be\u09df', '\u09a5\u09be\u0995', '?', '\u09a2\u09be\u0995\u09be', '\u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0', '\u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0', '\u0964', '\u0995\u09bf', '\u0985\u09ac\u09b8\u09cd\u09a5\u09be', '\u09a4\u09be\u09b0', '!', '\u09e7\u09e8/\u09e6\u09e9/\u09e8\u09e6\u09e8\u09e8', '\u09a4\u09be\u09b0\u09bf\u0996\u09c7', '\u09b8\u09c7', '\u09ea/\u0995', '\u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df', '\u0997\u09bf\u09df\u09c7', '\u09e7\u09e8,\u09e9\u09ea\u09eb', '\u099f\u09be\u0995\u09be', '\u09a6\u09bf\u09df\u09c7\u099b\u09bf\u09b2', '\u0964']\n```\n\n#### Word and Punctuation tokenization<a id=\"word-and-punctuation-tokenization\"></a>\n\nTokenize text into words and any punctuation.\n\n```python\nimport bkit\n\ntext = '\u09a4\u09c1\u09ae\u09bf \u0995\u09cb\u09a5\u09be\u09df \u09a5\u09be\u0995? \u09a2\u09be\u0995\u09be \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0 \u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0\u0964 \u0995\u09bf \u0985\u09ac\u09b8\u09cd\u09a5\u09be \u09a4\u09be\u09b0! \u09e7\u09e8/\u09e6\u09e9/\u09e8\u09e6\u09e8\u09e8 \u09a4\u09be\u09b0\u09bf\u0996\u09c7 \u09b8\u09c7 \u09ea/\u0995 \u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df \u0997\u09bf\u09df\u09c7 \u09e7\u09e8,\u09e9\u09ea\u09eb \u099f\u09be\u0995\u09be \u09a6\u09bf\u09df\u09c7\u099b\u09bf\u09b2\u0964'\n\ntokens = bkit.tokenizer.tokenize_word_punctuation(text)\n\nprint(tokens)\n# >>> ['\u09a4\u09c1\u09ae\u09bf', '\u0995\u09cb\u09a5\u09be\u09df', '\u09a5\u09be\u0995', '?', '\u09a2\u09be\u0995\u09be', '\u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0', '\u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0', '\u0964', '\u0995\u09bf', '\u0985\u09ac\u09b8\u09cd\u09a5\u09be', '\u09a4\u09be\u09b0', '!', '\u09e7\u09e8', '/', '\u09e6\u09e9', '/', '\u09e8\u09e6\u09e8\u09e8', '\u09a4\u09be\u09b0\u09bf\u0996\u09c7', '\u09b8\u09c7', '\u09ea', '/', '\u0995', '\u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df', '\u0997\u09bf\u09df\u09c7', '\u09e7\u09e8', ',', '\u09e9\u09ea\u09eb', '\u099f\u09be\u0995\u09be', '\u09a6\u09bf\u09df\u09c7\u099b\u09bf\u09b2', '\u0964']\n```\n\n#### Sentence tokenization<a id=\"sentence-tokenization\"></a>\n\nTokenize text into sentences.\n\n```python\nimport bkit\n\ntext = '\u09a4\u09c1\u09ae\u09bf \u0995\u09cb\u09a5\u09be\u09df \u09a5\u09be\u0995? \u09a2\u09be\u0995\u09be \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0 \u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0\u0964 \u0995\u09bf \u0985\u09ac\u09b8\u09cd\u09a5\u09be \u09a4\u09be\u09b0! \u09e7\u09e8/\u09e6\u09e9/\u09e8\u09e6\u09e8\u09e8 \u09a4\u09be\u09b0\u09bf\u0996\u09c7 \u09b8\u09c7 \u09ea/\u0995 \u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df \u0997\u09bf\u09df\u09c7 \u09e7\u09e8,\u09e9\u09ea\u09eb \u099f\u09be\u0995\u09be \u09a6\u09bf\u09df\u09c7\u099b\u09bf\u09b2\u0964'\n\ntokens = bkit.tokenizer.tokenize_sentence(text)\n\nprint(tokens)\n# >>> ['\u09a4\u09c1\u09ae\u09bf \u0995\u09cb\u09a5\u09be\u09df \u09a5\u09be\u0995?', '\u09a2\u09be\u0995\u09be \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0 \u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0\u0964', '\u0995\u09bf \u0985\u09ac\u09b8\u09cd\u09a5\u09be \u09a4\u09be\u09b0!', '\u09e7\u09e8/\u09e6\u09e9/\u09e8\u09e6\u09e8\u09e8 \u09a4\u09be\u09b0\u09bf\u0996\u09c7 \u09b8\u09c7 \u09ea/\u0995 \u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df \u0997\u09bf\u09df\u09c7 \u09e7\u09e8,\u09e9\u09ea\u09eb \u099f\u09be\u0995\u09be \u09a6\u09bf\u09df\u09c7\u099b\u09bf\u09b2\u0964']\n```\n\n### Named Entity Recognition (NER)<a id=\"named-entity-recognition-ner\"></a>\n\nPredicts the tags of the Named Entities of a given text.\n\n```python\nimport bkit\n\ntext = '\u09a4\u09c1\u09ae\u09bf \u0995\u09cb\u09a5\u09be\u09df \u09a5\u09be\u0995? \u09a2\u09be\u0995\u09be \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0 \u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0\u0964 \u0995\u09bf \u0985\u09ac\u09b8\u09cd\u09a5\u09be \u09a4\u09be\u09b0! \u09e7\u09e8/\u09e6\u09e9/\u09e8\u09e6\u09e8\u09e8 \u09a4\u09be\u09b0\u09bf\u0996\u09c7 \u09b8\u09c7 \u09ea/\u0995 \u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df \u0997\u09bf\u09df\u09c7 \u09e7\u09e8,\u09e9\u09ea\u09eb.\u09e8\u09e9 \u099f\u09be\u0995\u09be \u09a6\u09bf\u09df\u09c7\u099b\u09bf\u09b2\u0964'\n\nner = bkit.ner.Infer('ner-noisy-label')\npredictions = ner(text)\n\nprint(predictions)\n# >>> [('\u09a4\u09c1\u09ae\u09bf', 'O', 0.9998692), ('\u0995\u09cb\u09a5\u09be\u09df', 'O', 0.99988306), ('\u09a5\u09be\u0995?', 'O', 0.99983954), ('\u09a2\u09be\u0995\u09be', 'B-GPE', 0.99891424), ('\u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0', 'B-GPE', 0.99710876), ('\u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0\u0964', 'O', 0.9995414), ('\u0995\u09bf', 'O', 0.99989176), ('\u0985\u09ac\u09b8\u09cd\u09a5\u09be', 'O', 0.99980336), ('\u09a4\u09be\u09b0!', 'O', 0.99983263), ('\u09e7\u09e8/\u09e6\u09e9/\u09e8\u09e6\u09e8\u09e8', 'B-D&T', 0.97921854), ('\u09a4\u09be\u09b0\u09bf\u0996\u09c7', 'O', 0.9271435), ('\u09b8\u09c7', 'O', 0.99934834), ('\u09ea/\u0995', 'B-NUM', 0.8297553), ('\u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df', 'O', 0.99728775), ('\u0997\u09bf\u09df\u09c7', 'O', 0.9994825), ('\u09e7\u09e8,\u09e9\u09ea\u09eb.\u09e8\u09e9', 'B-NUM', 0.99740463), ('\u099f\u09be\u0995\u09be', 'B-UNIT', 0.99914896), ('\u09a6\u09bf\u09df\u09c7\u099b\u09bf\u09b2\u0964', 'O', 0.9998908)]\n```\n\n#### Named Entity Recognition (NER) Visualization<a id=\"named-entity-recognition-ner-visualization\"></a>\nIt takes the model's output and visualizes the NER tag for every word in the text.\n\n```python\nimport bkit\n\ntext = '\u09a4\u09c1\u09ae\u09bf \u0995\u09cb\u09a5\u09be\u09df \u09a5\u09be\u0995? \u09a2\u09be\u0995\u09be \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0 \u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0\u0964 \u0995\u09bf \u0985\u09ac\u09b8\u09cd\u09a5\u09be \u09a4\u09be\u09b0! \u09e7\u09e8/\u09e6\u09e9/\u09e8\u09e6\u09e8\u09e8 \u09a4\u09be\u09b0\u09bf\u0996\u09c7 \u09b8\u09c7 \u09ea/\u0995 \u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df \u0997\u09bf\u09df\u09c7 \u09e7\u09e8,\u09e9\u09ea\u09eb.\u09e8\u09e9 \u099f\u09be\u0995\u09be \u09a6\u09bf\u09df\u09c7\u099b\u09bf\u09b2\u0964'\nner = bkit.ner.Infer('ner-noisy-label')\npredictions = ner(text)\nbkit.ner.visualize(predictions)\n```\n[![NER.png](https://i.postimg.cc/Zq2RrF0z/NER.png)](https://postimg.cc/SXLk49Wg)\n\n\n### Parts of Speech (PoS) tagging<a id=\"parts-of-speech-pos-tagging\"></a>\n\nPredicts the tags of the parts of speech of a given text.\n\n```python\nimport bkit\n\ntext = '\u0997\u09a4 \u0995\u09bf\u099b\u09c1\u09a6\u09bf\u09a8 \u09a7\u09b0\u09c7\u0987 \u099c\u09cd\u09ac\u09be\u09b2\u09be\u09a8\u09bf\u09b9\u09c0\u09a8 \u0985\u09ac\u09b8\u09cd\u09a5\u09be\u09df \u098f\u0995\u099f\u09bf \u099b\u09cb\u099f \u09ae\u09be\u099b \u09a7\u09b0\u09be\u09b0 \u09a8\u09cc\u0995\u09be\u09df \u09e7\u09eb\u09e6 \u099c\u09a8 \u09b0\u09cb\u09b9\u09bf\u0999\u09cd\u0997\u09be \u0986\u09a8\u09cd\u09a6\u09be\u09ae\u09be\u09a8 \u09b8\u09be\u0997\u09b0\u09c7 \u09ad\u09be\u09b8\u09ae\u09be\u09a8 \u0985\u09ac\u09b8\u09cd\u09a5\u09be\u09df \u09b0\u09df\u09c7\u099b\u09c7 \u0964'\npos = bkit.pos.Infer('pos-noisy-label')\npredictions = pos(text)\n\nprint(predictions)\n# >>> [('\u0997\u09a4', 'ADJ', 0.98674506), ('\u0995\u09bf\u099b\u09c1\u09a6\u09bf\u09a8', 'NNC', 0.97954935), ('\u09a7\u09b0\u09c7\u0987', 'PP', 0.96124), ('\u099c\u09cd\u09ac\u09be\u09b2\u09be\u09a8\u09bf\u09b9\u09c0\u09a8', 'ADJ', 0.93195957), ('\u0985\u09ac\u09b8\u09cd\u09a5\u09be\u09df', 'NNC', 0.9960413), ('\u098f\u0995\u099f\u09bf', 'QF', 0.9912915), ('\u099b\u09cb\u099f', 'ADJ', 0.9810739), ('\u09ae\u09be\u099b', 'NNC', 0.97365385), ('\u09a7\u09b0\u09be\u09b0', 'NNC', 0.96641904), ('\u09a8\u09cc\u0995\u09be\u09df', 'NNC', 0.99680626), ('\u09e7\u09eb\u09e6', 'QF', 0.996005), ('\u099c\u09a8', 'NNC', 0.99434316), ('\u09b0\u09cb\u09b9\u09bf\u0999\u09cd\u0997\u09be', 'NNP', 0.9141038), ('\u0986\u09a8\u09cd\u09a6\u09be\u09ae\u09be\u09a8', 'NNP', 0.9856694), ('\u09b8\u09be\u0997\u09b0\u09c7', 'NNP', 0.7122378), ('\u09ad\u09be\u09b8\u09ae\u09be\u09a8', 'ADJ', 0.93841994), ('\u0985\u09ac\u09b8\u09cd\u09a5\u09be\u09df', 'NNC', 0.9965629), ('\u09b0\u09df\u09c7\u099b\u09c7', 'VF', 0.99680847), ('\u0964', 'PUNCT', 0.9963098)]\n\n```\n#### Parts of Speech (PoS) Visualization<a id=\"parts-of-speech-pos-visualization\"></a>\n\"It takes the model's output and visualizes the Part-of-Speech tag for every word in the text.\n\n```python\nimport bkit\n\ntext = '\u0997\u09a4 \u0995\u09bf\u099b\u09c1\u09a6\u09bf\u09a8 \u09a7\u09b0\u09c7\u0987 \u099c\u09cd\u09ac\u09be\u09b2\u09be\u09a8\u09bf\u09b9\u09c0\u09a8 \u0985\u09ac\u09b8\u09cd\u09a5\u09be\u09df \u098f\u0995\u099f\u09bf \u099b\u09cb\u099f \u09ae\u09be\u099b \u09a7\u09b0\u09be\u09b0 \u09a8\u09cc\u0995\u09be\u09df \u09e7\u09eb\u09e6 \u099c\u09a8 \u09b0\u09cb\u09b9\u09bf\u0999\u09cd\u0997\u09be \u0986\u09a8\u09cd\u09a6\u09be\u09ae\u09be\u09a8 \u09b8\u09be\u0997\u09b0\u09c7 \u09ad\u09be\u09b8\u09ae\u09be\u09a8 \u0985\u09ac\u09b8\u09cd\u09a5\u09be\u09df \u09b0\u09df\u09c7\u099b\u09c7 \u0964'\npos = bkit.pos.Infer('pos-noisy-label')\npredictions = pos(text)\nbkit.pos.visualize(predictions)\n```\n[![pos.png](https://i.postimg.cc/8CjcJnpb/pos.png)](https://postimg.cc/3yQYz1Dy)\n\n\n### Shallow Parsing (Constituency Parsing)<a id=\"shallow-parsing-constituency-parsing\"></a>\n\nPredicts the shallow parsing tags of a given text.\n\n```python\nimport bkit\n\ntext = '\u09a4\u09c1\u09ae\u09bf \u0995\u09cb\u09a5\u09be\u09df \u09a5\u09be\u0995? \u09a2\u09be\u0995\u09be \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0 \u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0\u0964 \u0995\u09bf \u0985\u09ac\u09b8\u09cd\u09a5\u09be \u09a4\u09be\u09b0! \u09e7\u09e8/\u09e6\u09e9/\u09e8\u09e6\u09e8\u09e8 \u09a4\u09be\u09b0\u09bf\u0996\u09c7 \u09b8\u09c7 \u09ea/\u0995 \u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df \u0997\u09bf\u09df\u09c7 \u09e7\u09e8,\u09e9\u09ea\u09eb.\u09e8\u09e9 \u099f\u09be\u0995\u09be \u09a6\u09bf\u09df\u09c7\u099b\u09bf\u09b2\u0964'\nshallow = bkit.shallow.Infer(pos_model='pos-noisy-label')\npredictions = shallow(text)\nprint(predictions)\n# >>> (S (VP (NP (PRO \u09a4\u09c1\u09ae\u09bf)) (VP (ADVP (ADV \u0995\u09cb\u09a5\u09be\u09df)) (VF \u09a5\u09be\u0995))) (NP (NNP ?) (NNP \u09a2\u09be\u0995\u09be) (NNC \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u09c7\u09b0)) (ADVP (ADV \u09b0\u09be\u099c\u09a7\u09be\u09a8\u09c0)) (NP (NP (NP (NNC \u0964)) (NP (PRO \u0995\u09bf))) (NP (QF \u0985\u09ac\u09b8\u09cd\u09a5\u09be) (NNC \u09a4\u09be\u09b0)) (NP (PRO !))) (NP (NP (QF \u09e7\u09e8/\u09e6\u09e9/\u09e8\u09e6\u09e8\u09e8) (NNC \u09a4\u09be\u09b0\u09bf\u0996\u09c7)) (VNF \u09b8\u09c7) (NP (QF \u09ea/\u0995) (NNC \u09a0\u09bf\u0995\u09be\u09a8\u09be\u09df))) (VF \u0997\u09bf\u09df\u09c7))\n```\n\n#### Shallow Parsing Visualization<a id=\"shallow-parsing-visualization\"></a>\nIt converts model predictions into an interactive shallow parsing Tree for clear and intuitive analysis\n\n```python\nfrom bkit import shallow\ntext = \"\u0995\u09be\u09a4\u09be\u09b0 \u09ac\u09bf\u09b6\u09cd\u09ac\u0995\u09be\u09aa\u09c7 \u0986\u09b0\u09cd\u099c\u09c7\u09a8\u09cd\u099f\u09bf\u09a8\u09be\u09b0 \u09ac\u09bf\u09b6\u09cd\u09ac\u0995\u09be\u09aa \u099c\u09df\u09c7 \u09ae\u09be\u09b0\u09cd\u09a4\u09bf\u09a8\u09c7\u099c\u09c7\u09b0 \u0985\u09ac\u09a6\u09be\u09a8 \u0985\u09a8\u09c7\u0995\u0964\"\nshallow = shallow.Infer(pos_model='pos-noisy-label')\npredictions = shallow(text)\nshallow.visualize(predictions)\n```\n\n[![shallow.png](https://i.postimg.cc/sgfvygbq/shallow.png)](https://postimg.cc/vcjQtbYt)\n\n\n### Dependency Parsing<a id=\"dependency-parsing\"></a>\n\nPredicts the dependency parsing tags of a given text.\n\n```python\nfrom bkit import dependency\n\ntext = \"\u0995\u09be\u09a4\u09be\u09b0 \u09ac\u09bf\u09b6\u09cd\u09ac\u0995\u09be\u09aa\u09c7 \u0986\u09b0\u09cd\u099c\u09c7\u09a8\u09cd\u099f\u09bf\u09a8\u09be\u09b0 \u09ac\u09bf\u09b6\u09cd\u09ac\u0995\u09be\u09aa \u099c\u09df\u09c7 \u09ae\u09be\u09b0\u09cd\u09a4\u09bf\u09a8\u09c7\u099c\u09c7\u09b0 \u0985\u09ac\u09a6\u09be\u09a8 \u0985\u09a8\u09c7\u0995\u0964\"\ndep =dependency.Infer('dependency-parsing')\npredictions = dep(text)\nprint(predictions)\n# >>>[{'text': '\u0995\u09be\u09a4\u09be\u09b0 \u09ac\u09bf\u09b6\u09cd\u09ac\u0995\u09be\u09aa\u09c7 \u0986\u09b0\u09cd\u099c\u09c7\u09a8\u09cd\u099f\u09bf\u09a8\u09be\u09b0 \u09ac\u09bf\u09b6\u09cd\u09ac\u0995\u09be\u09aa \u099c\u09df\u09c7 \u09ae\u09be\u09b0\u09cd\u09a4\u09bf\u09a8\u09c7\u099c\u09c7\u09b0 \u0985\u09ac\u09a6\u09be\u09a8 \u0985\u09a8\u09c7\u0995 \u0964', 'predictions': [{'token_start': 1, 'token_end': 0, 'label': 'compound'}, {'token_start': 7, 'token_end': 1, 'label': 'obl'}, {'token_start': 4, 'token_end': 2, 'label': 'nmod'}, {'token_start': 4, 'token_end': 3, 'label': 'nmod'}, {'token_start': 7, 'token_end': 4, 'label': 'obl'}, {'token_start': 6, 'token_end': 5, 'label': 'nmod'}, {'token_start': 7, 'token_end': 6, 'label': 'nsubj'}, {'token_start': 7, 'token_end': 7, 'label': 'root'}, {'token_start': 7, 'token_end': 8, 'label': 'punct'}]}]\n```\n\n#### Dependency Parsing Visualization<a id=\"dependency-parsing-visualization\"></a>\nIt converts model predictions into an interactive dependency graph for clear and intuitive analysis\n\n```python\nfrom bkit import dependency\ntext = \"\u0995\u09be\u09a4\u09be\u09b0 \u09ac\u09bf\u09b6\u09cd\u09ac\u0995\u09be\u09aa\u09c7 \u0986\u09b0\u09cd\u099c\u09c7\u09a8\u09cd\u099f\u09bf\u09a8\u09be\u09b0 \u09ac\u09bf\u09b6\u09cd\u09ac\u0995\u09be\u09aa \u099c\u09df\u09c7 \u09ae\u09be\u09b0\u09cd\u09a4\u09bf\u09a8\u09c7\u099c\u09c7\u09b0 \u0985\u09ac\u09a6\u09be\u09a8 \u0985\u09a8\u09c7\u0995\u0964\"\ndep = dependency.Infer('dependency-parsing')\npredictions = dep(text)\ndependency.visualize(predictions)\n```\n\n[![dependency-visu.png](https://i.postimg.cc/MpsXG8ST/dependency-visu.png)](https://postimg.cc/K1Mm9w4S)\n\n### Coreference Resolution<a id=\"coref-resolution\"></a>\n\nPredicts the coreferent clusters of a given text.\n\n```python\nimport bkit\n\ntext = \"\u09a4\u09be\u09b0\u09be\u09b8\u09c1\u09a8\u09cd\u09a6\u09b0\u09c0 ( \u09e7\u09ee\u09ed\u09ee - \u09e7\u09ef\u09ea\u09ee ) \u0985\u09ad\u09bf\u09a8\u09c7\u09a4\u09cd\u09b0\u09c0 \u0964 \u09e7\u09ee\u09ee\u09ea \u09b8\u09be\u09b2\u09c7 \u09ac\u09bf\u09a8\u09cb\u09a6\u09bf\u09a8\u09c0\u09b0 \u09b8\u09b9\u09be\u09af\u09bc\u09a4\u09be\u09af\u09bc \u09b8\u09cd\u099f\u09be\u09b0 \u09a5\u09bf\u09af\u09bc\u09c7\u099f\u09be\u09b0\u09c7 \u09af\u09cb\u0997\u09a6\u09be\u09a8\u09c7\u09b0 \u09ae\u09be\u09a7\u09cd\u09af\u09ae\u09c7 \u09a4\u09bf\u09a8\u09bf \u0985\u09ad\u09bf\u09a8\u09af\u09bc \u09b6\u09c1\u09b0\u09c1 \u0995\u09b0\u09c7\u09a8 \u0964 \u09aa\u09cd\u09b0\u09a5\u09ae\u09c7 \u09a4\u09bf\u09a8\u09bf \u0997\u09bf\u09b0\u09bf\u09b6\u099a\u09a8\u09cd\u09a6\u09cd\u09b0 \u0998\u09cb\u09b7\u09c7\u09b0 \u099a\u09c8\u09a4\u09a8\u09cd\u09af\u09b2\u09c0\u09b2\u09be \u09a8\u09be\u099f\u0995\u09c7 \u098f\u0995 \u09ac\u09be\u09b2\u0995 \u0993 \u09b8\u09b0\u09b2\u09be \u09a8\u09be\u099f\u0995\u09c7 \u0997\u09cb\u09aa\u09be\u09b2 \u099a\u09b0\u09bf\u09a4\u09cd\u09b0\u09c7 \u0985\u09ad\u09bf\u09a8\u09af\u09bc \u0995\u09b0\u09c7\u09a8 \u0964\"\ncoref = bkit.coref.Infer('coref')\npredictions = coref(text)\nprint(predictions)\n# >>> {'text': ['\u09a4\u09be\u09b0\u09be\u09b8\u09c1\u09a8\u09cd\u09a6\u09b0\u09c0', '(', '\u09e7\u09ee\u09ed\u09ee', '-', '\u09e7\u09ef\u09ea\u09ee', ')', '\u0985\u09ad\u09bf\u09a8\u09c7\u09a4\u09cd\u09b0\u09c0', '\u0964', '\u09e7\u09ee\u09ee\u09ea', '\u09b8\u09be\u09b2\u09c7', '\u09ac\u09bf\u09a8\u09cb\u09a6\u09bf\u09a8\u09c0\u09b0', '\u09b8\u09b9\u09be\u09af\u09bc\u09a4\u09be\u09af\u09bc', '\u09b8\u09cd\u099f\u09be\u09b0', '\u09a5\u09bf\u09af\u09bc\u09c7\u099f\u09be\u09b0\u09c7', '\u09af\u09cb\u0997\u09a6\u09be\u09a8\u09c7\u09b0', '\u09ae\u09be\u09a7\u09cd\u09af\u09ae\u09c7', '\u09a4\u09bf\u09a8\u09bf', '\u0985\u09ad\u09bf\u09a8\u09af\u09bc', '\u09b6\u09c1\u09b0\u09c1', '\u0995\u09b0\u09c7\u09a8', '\u0964', '\u09aa\u09cd\u09b0\u09a5\u09ae\u09c7', '\u09a4\u09bf\u09a8\u09bf', '\u0997\u09bf\u09b0\u09bf\u09b6\u099a\u09a8\u09cd\u09a6\u09cd\u09b0', '\u0998\u09cb\u09b7\u09c7\u09b0', '\u099a\u09c8\u09a4\u09a8\u09cd\u09af\u09b2\u09c0\u09b2\u09be', '\u09a8\u09be\u099f\u0995\u09c7', '\u098f\u0995', '\u09ac\u09be\u09b2\u0995', '\u0993', '\u09b8\u09b0\u09b2\u09be', '\u09a8\u09be\u099f\u0995\u09c7', '\u0997\u09cb\u09aa\u09be\u09b2', '\u099a\u09b0\u09bf\u09a4\u09cd\u09b0\u09c7', '\u0985\u09ad\u09bf\u09a8\u09af\u09bc', '\u0995\u09b0\u09c7\u09a8', '\u0964'], 'mention_indices': {0: [{'start_token': 0, 'end_token': 0}, {'start_token': 6, 'end_token': 6}, {'start_token': 10, 'end_token': 10}, {'start_token': 16, 'end_token': 16}, {'start_token': 22, 'end_token': 22}]}}\n```\n\n#### Coreference Resolution Visualization<a id=\"coreference-resolution-visualization\"></a>\nIt takes the model's output and creates an interactive visualization to clearly depict coreference resolution, highlighting the relationships between entities in the text\n\n```python\nfrom bkit import coref\n\ntext = \"\u09a4\u09be\u09b0\u09be\u09b8\u09c1\u09a8\u09cd\u09a6\u09b0\u09c0 ( \u09e7\u09ee\u09ed\u09ee - \u09e7\u09ef\u09ea\u09ee ) \u0985\u09ad\u09bf\u09a8\u09c7\u09a4\u09cd\u09b0\u09c0 \u0964 \u09e7\u09ee\u09ee\u09ea \u09b8\u09be\u09b2\u09c7 \u09ac\u09bf\u09a8\u09cb\u09a6\u09bf\u09a8\u09c0\u09b0 \u09b8\u09b9\u09be\u09af\u09bc\u09a4\u09be\u09af\u09bc \u09b8\u09cd\u099f\u09be\u09b0 \u09a5\u09bf\u09af\u09bc\u09c7\u099f\u09be\u09b0\u09c7 \u09af\u09cb\u0997\u09a6\u09be\u09a8\u09c7\u09b0 \u09ae\u09be\u09a7\u09cd\u09af\u09ae\u09c7 \u09a4\u09bf\u09a8\u09bf \u0985\u09ad\u09bf\u09a8\u09af\u09bc \u09b6\u09c1\u09b0\u09c1 \u0995\u09b0\u09c7\u09a8 \u0964 \u09aa\u09cd\u09b0\u09a5\u09ae\u09c7 \u09a4\u09bf\u09a8\u09bf \u0997\u09bf\u09b0\u09bf\u09b6\u099a\u09a8\u09cd\u09a6\u09cd\u09b0 \u0998\u09cb\u09b7\u09c7\u09b0 \u099a\u09c8\u09a4\u09a8\u09cd\u09af\u09b2\u09c0\u09b2\u09be \u09a8\u09be\u099f\u0995\u09c7 \u098f\u0995 \u09ac\u09be\u09b2\u0995 \u0993 \u09b8\u09b0\u09b2\u09be \u09a8\u09be\u099f\u0995\u09c7 \u0997\u09cb\u09aa\u09be\u09b2 \u099a\u09b0\u09bf\u09a4\u09cd\u09b0\u09c7 \u0985\u09ad\u09bf\u09a8\u09af\u09bc \u0995\u09b0\u09c7\u09a8 \u0964\"\ncoref = coref.Infer('coref')\npredictions = coref(text)\ncoref.visualize(predictions)\n```\n\n[![coref.png](https://i.postimg.cc/26W5TbBb/coref.png)](https://postimg.cc/cgsZLJh0)\n\n\n\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "A python tool for Bangla text processing",
    "version": "0.0.8",
    "project_urls": {
        "Homepage": "https://github.com/giga-tech/bangla-text-processing-kit"
    },
    "split_keywords": [
        "bangla kit",
        " bangla text processing",
        " banpipeline",
        " ner",
        " pos",
        " shallow parsing",
        " dependency parsing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0e501b27f96158417e99d6656d9d9ee60749255dc2a4827aab982a5d973e3d2b",
                "md5": "13bf2f4ba57ecd95c87a23e986b2d152",
                "sha256": "9e4ab3811e596369d8a688020e46ebc1da6dcda34ae055cc8922633c493934a3"
            },
            "downloads": -1,
            "filename": "bkit-0.0.8-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "13bf2f4ba57ecd95c87a23e986b2d152",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 1933952,
            "upload_time": "2025-01-22T04:59:28",
            "upload_time_iso_8601": "2025-01-22T04:59:28.975491Z",
            "url": "https://files.pythonhosted.org/packages/0e/50/1b27f96158417e99d6656d9d9ee60749255dc2a4827aab982a5d973e3d2b/bkit-0.0.8-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "00c9fcc1a9372b6aa3cb2ff348b0c0d15ed6864d75feac2899561e79eace7cfd",
                "md5": "af8cd0b1bb458fe65d5a26585a01ac7d",
                "sha256": "113526f748858f7efddc4c969efc1392935dce749d062f9defcbaceb6ea517a1"
            },
            "downloads": -1,
            "filename": "bkit-0.0.8.tar.gz",
            "has_sig": false,
            "md5_digest": "af8cd0b1bb458fe65d5a26585a01ac7d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 1936553,
            "upload_time": "2025-01-22T04:59:34",
            "upload_time_iso_8601": "2025-01-22T04:59:34.165142Z",
            "url": "https://files.pythonhosted.org/packages/00/c9/fcc1a9372b6aa3cb2ff348b0c0d15ed6864d75feac2899561e79eace7cfd/bkit-0.0.8.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-22 04:59:34",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "giga-tech",
    "github_project": "bangla-text-processing-kit",
    "github_not_found": true,
    "lcname": "bkit"
}

None