# SHIFTLAB OCR
SHIFT OCR is a library for handwritten text segmentation and character recognition.
# Get Started
```
pip install shiftlab_ocr
```
## Doc2Text
`Reader` from `doc2text` performs text detection followed by recognition.
![](https://github.com/constantin50/shiftlab_ocr/blob/main/demo_image.png)
```
import urllib.request

from shiftlab_ocr.doc2text.reader import Reader

# Download the demo image
urllib.request.urlretrieve(
    'https://raw.githubusercontent.com/konverner/shiftlab_ocr/main/demo_image.png',
    'test.png')

reader = Reader()
# result[0] is the recognized text, result[1] the segmented crops
result = reader.doc2text("test.png")
```
Display recognized text:
```
print(result[0])
Действительно ли добро сильнее зла?
Именно над этим вопросом аставля заставляет
читателей задуматься В. Тендряков.
Автор рассматривает данную пробле-
му на конкретном примере, рассказывая
историю 00 заблудившемся немце русских
солдатах, которые пожалели врала и
позволи ему остаться землянке.
```
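The recognized text keeps the line breaks of the page, including words hyphenated across lines (`пробле-` / `му` in the output above). If continuous text is needed, a small post-processing step can rejoin them. The helper below is a sketch and not part of `shiftlab_ocr`: the name `join_hyphenated_lines` is hypothetical, and it assumes `result[0]` is a plain string with newline-separated lines, as the printed output suggests.

```python
import re


def join_hyphenated_lines(text: str) -> str:
    """Rejoin words split by a hyphen at a line end, then turn line breaks into spaces."""
    # "пробле-\nму" -> "проблему": drop the hyphen and the line break when a
    # word character precedes the hyphen and another one follows the break.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Any remaining line break becomes a single space.
    return re.sub(r"\s*\n\s*", " ", text).strip()


sample = "Автор рассматривает данную пробле-\nму на конкретном примере"
print(join_hyphenated_lines(sample))
```

Note that a genuinely hyphenated compound that happens to break at the end of a line would lose its hyphen with this rule, so the result may need a manual check.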
Display segmented crops:
```
import matplotlib.pyplot as plt

def show_img_grid(images, N):
    # Show the first n*n crops in an n-by-n grid, where n = floor(sqrt(N))
    n = int(N**(0.5))
    k = 0
    f, axarr = plt.subplots(n,n,figsize=(10,10))
    for i in range(n):
        for j in range(n):
            axarr[i,j].imshow(images[k].img)
            k += 1
    f.show()

show_img_grid(result[1], 48)
```
![](https://github.com/konverner/shiftlab_ocr/blob/main/crops_image.png?raw=true)
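`show_img_grid` draws a square grid with `int(sqrt(N))` cells per side, so with `N = 48` only the first 36 crops are shown. A variant that sizes the grid to hold every crop could look like the sketch below. `grid_shape` and `show_all_crops` are hypothetical helpers, not part of `shiftlab_ocr`; the only assumption carried over from the example above is that each crop exposes its pixels as `.img`.

```python
import math
from typing import Sequence, Tuple


def grid_shape(count: int) -> Tuple[int, int]:
    """Return (rows, cols) of a near-square grid with at least `count` cells."""
    if count <= 0:
        return (0, 0)
    cols = math.ceil(math.sqrt(count))
    rows = math.ceil(count / cols)
    return (rows, cols)


def show_all_crops(crops: Sequence, figsize=(10, 10)) -> None:
    """Plot every crop; assumes each item has an `.img` attribute."""
    import matplotlib.pyplot as plt  # imported lazily so grid_shape works without it

    rows, cols = grid_shape(len(crops))
    if rows == 0:
        return
    # squeeze=False keeps `axes` two-dimensional even for a single row or column.
    fig, axes = plt.subplots(rows, cols, figsize=figsize, squeeze=False)
    flat_axes = axes.ravel()
    for ax, crop in zip(flat_axes, crops):
        ax.imshow(crop.img)
    for ax in flat_axes:
        ax.axis("off")  # hide ticks everywhere, including unused cells
    plt.show()


print(grid_shape(48))  # (7, 7)
```

With the `result` from above, `show_all_crops(result[1])` would then display all crops rather than a fixed square subset.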
## Generator of handwriting
`Generator` renders images of handwritten-style text with random backgrounds and handwriting fonts, either from a given string or from a list of strings stored in `source.txt`.
Generating a random sample from a string:
```
from shiftlab_ocr.generator.generator import Generator
g = Generator(lang='ru')
s = g.generate_from_string('Москва',min_length=4,max_length=24) # get from a string
s
```
![](https://sun9-51.userapi.com/impg/CSeyZPb4rDmP4aCYIDoMDx5VQMXcWO6CwtpGUA/vH_cghX1JtA.jpg?size=344x88&quality=96&sign=c61344d4c7f5576ffe03e750ca31f94c&type=album)
Generating a batch of random samples from `source.txt`:
```
import matplotlib.pyplot as plt
import numpy as np

# upload source.txt with one word per line
g.upload_source('source.txt')
b = g.generate_batch(12,4,13) # get batch of random samples from source.txt
fig = plt.figure(figsize=(10, 10))
rows = int(len(b)/4) + 2
columns = int(len(b)/8) + 2
for i in range(len(b)):
    fig.add_subplot(rows, columns, i+1)
    plt.imshow(np.asarray(b[i][0]))
```
![](https://sun9-80.userapi.com/impg/ay9o11D8ItN65kDqYnZBahiZFk1zZ2wo5BYoMA/I_nNhdMQeLs.jpg?size=600x409&quality=96&sign=9d6a3ee935fcdc7112aec557eeed74f1&type=album)
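The batch example expects a `source.txt` file with one word per line. One way to prepare such a file from a Python list is sketched below. `write_source` is a hypothetical helper, not part of `shiftlab_ocr`, and the word list is only illustrative. UTF-8 is used on the assumption that the generator reads Cyrillic text in that encoding.

```python
from pathlib import Path
from typing import Iterable


def write_source(words: Iterable[str], path: str = "source.txt") -> int:
    """Write one non-empty word per line and return how many were written."""
    # Strip surrounding whitespace and skip entries that are blank.
    cleaned = [w.strip() for w in words if w.strip()]
    # Assumption: the generator expects UTF-8 encoded text.
    Path(path).write_text("".join(w + "\n" for w in cleaned), encoding="utf-8")
    return len(cleaned)


count = write_source(["Москва", "добро", " ", "солдат"])
print(count)  # 3
```

The resulting file can then be passed to `g.upload_source('source.txt')` as in the example above.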
Also, see [Google Colab Demo](https://colab.research.google.com/drive/1FPfQY9HvjEPEdzfFEZsgSCk5P1TBUAse?usp=sharing)
Raw data
{
"_id": null,
"home_page": "https://github.com/konverner/shiftlab_ocr",
"name": "shiftlab-ocr",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "data,computer vision,handwriting,doc2text",
"author": "Konstantin Verner",
"author_email": "konst.verner@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/02/72/046035814cd471be9534bd7b0cb831b376cc9cac178edf58d7bca68cab00/shiftlab-ocr-0.3.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "SHIFT OCR is a library for handwriting text segmentation and character recognition.",
"version": "0.3.2",
"project_urls": {
"Homepage": "https://github.com/konverner/shiftlab_ocr"
},
"split_keywords": [
"data",
"computer vision",
"handwriting",
"doc2text"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "cd30b3be68e7b970c7b3140607367dede17227e11209aa19ab9069ec08cbee52",
"md5": "fed67162fd9052e23eda0312a52190e9",
"sha256": "4f04a0a2292fda20d6d95ec652f24342a5a85a64984af63bac0c25e8656bc565"
},
"downloads": -1,
"filename": "shiftlab_ocr-0.3.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "fed67162fd9052e23eda0312a52190e9",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 1656279,
"upload_time": "2023-07-04T13:49:21",
"upload_time_iso_8601": "2023-07-04T13:49:21.206909Z",
"url": "https://files.pythonhosted.org/packages/cd/30/b3be68e7b970c7b3140607367dede17227e11209aa19ab9069ec08cbee52/shiftlab_ocr-0.3.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "0272046035814cd471be9534bd7b0cb831b376cc9cac178edf58d7bca68cab00",
"md5": "3328b64d7446d8db3bbc36477922b1b9",
"sha256": "db95692c5af6c7a2a317ba36e6df5ce1fbbb15814a16743ade21098c5bdab162"
},
"downloads": -1,
"filename": "shiftlab-ocr-0.3.2.tar.gz",
"has_sig": false,
"md5_digest": "3328b64d7446d8db3bbc36477922b1b9",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 1604455,
"upload_time": "2023-07-04T13:49:23",
"upload_time_iso_8601": "2023-07-04T13:49:23.207685Z",
"url": "https://files.pythonhosted.org/packages/02/72/046035814cd471be9534bd7b0cb831b376cc9cac178edf58d7bca68cab00/shiftlab-ocr-0.3.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-04 13:49:23",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "konverner",
"github_project": "shiftlab_ocr",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "shiftlab-ocr"
}