tahweel


Name: tahweel
Version: 0.0.13
Home page: https://tahweel.ieasybooks.com
Summary: Convert PDF files to Word and TXT
Upload time: 2024-07-28 13:56:27
Author: EasyBooks
Requires Python: <4.0,>=3.10
License: MIT
Keywords: tahweel, ocr, pdf, word, txt, google-drive-api
Requirements: No requirements were recorded.
            <div align="center">
  <a href="https://pypi.org/project/tahweel" target="_blank"><img src="https://img.shields.io/pypi/v/tahweel?label=PyPI%20Version&color=limegreen" /></a>
  <a href="https://pypi.org/project/tahweel" target="_blank"><img src="https://img.shields.io/pypi/pyversions/tahweel?color=limegreen" /></a>
  <a href="https://github.com/ieasybooks/tahweel/blob/main/LICENSE" target="_blank"><img src="https://img.shields.io/pypi/l/tahweel?color=limegreen" /></a>
  <a href="https://pepy.tech/project/tahweel" target="_blank"><img src="https://static.pepy.tech/badge/tahweel" /></a>

  <a href="https://github.com/ieasybooks/tahweel/actions/workflows/pre-commit.yml" target="_blank"><img src="https://github.com/ieasybooks/tahweel/actions/workflows/pre-commit.yml/badge.svg" /></a>
  <a href="https://github.com/ieasybooks/tahweel/actions/workflows/tests.yml" target="_blank"><img src="https://github.com/ieasybooks/tahweel/actions/workflows/tests.yml/badge.svg" /></a>
  <a href="https://sonarcloud.io/summary/new_code?id=ieasybooks_tahweel" target="_blank"><img src="https://sonarcloud.io/api/project_badges/measure?project=ieasybooks_tahweel&metric=code_smells" /></a>
  <a href="https://tahweel.ieasybooks.com" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
</div>

<h1 dir="rtl">Tahweel</h1>

<p dir="rtl">Convert PDF files to DOCX and TXT.</p>

<h2 dir="rtl">Features</h2>

<ul dir="rtl">
  <li>Converts PDF files to DOCX and TXT using Google's optical character recognition (OCR)</li>
  <li>Converts a single file or an entire directory of files</li>
  <li>Produces output with the same number of pages as the source PDF</li>
</ul>

<h2 dir="rtl">Requirements</h2>

<ul dir="rtl">
  <li>A fast internet connection, because the files are uploaded to Google's servers for processing</li>
  <li>Service Account Credentials created in Google Cloud Platform, as described <a href="https://developers.google.com/workspace/guides/create-credentials">here</a></li>
  <li>Python 3.10 or later installed on your machine</li>
  <li>The <code>poppler-utils</code> package installed on your operating system</li>
  <li>To convert files that contain pages larger than <code dir="ltr">5MB</code>, the <code>bc</code> and <code>imagemagick</code> packages must also be installed on your operating system</li>
</ul>
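
<p dir="ltr">The requirements above can be checked before a long conversion run. The following is a minimal sketch, not part of Tahweel: the function names are hypothetical, and the executable names (<code>pdftoppm</code> for <code>poppler-utils</code>, <code>convert</code> or <code>magick</code> for <code>imagemagick</code>) are assumptions that may differ on your platform.</p>

```python
import shutil
import sys

# Executables assumed to ship with each system package. This mapping is an
# assumption for illustration; names can differ between platforms.
REQUIRED_TOOLS = {'poppler-utils': ('pdftoppm',)}
LARGE_PAGE_TOOLS = {'bc': ('bc',), 'imagemagick': ('convert', 'magick')}


def missing_packages(large_pages=False, which=shutil.which):
  """Return the packages for which none of the known executables is on PATH."""
  wanted = dict(REQUIRED_TOOLS)
  if large_pages:
    # Only needed when some pages are larger than 5MB.
    wanted.update(LARGE_PAGE_TOOLS)
  return [name for name, exes in wanted.items() if not any(which(exe) for exe in exes)]


def python_is_supported(version_info=sys.version_info):
  """Check the interpreter against the package's requirement (>=3.10,<4.0)."""
  return (3, 10) <= tuple(version_info[:2]) < (4, 0)


if __name__ == '__main__':
  print('Python supported:', python_is_supported())
  print('Missing packages:', missing_packages(large_pages=True))
```

<p dir="ltr">Passing <code>which</code> as a parameter keeps the check easy to test without touching the real <code>PATH</code>.</p>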

<h2 dir="rtl">Installing Tahweel</h2>

<h3 dir="rtl">Using <code>pip</code></h3>

<p dir="rtl">You can install Tahweel with <code>pip</code> by running: <code dir="ltr">pip install tahweel</code></p>

<h3 dir="rtl">From source</h3>

<ul dir="rtl">
  <li>Download this repository by clicking Code and then Download ZIP, or by running the following command: <code>git clone git@github.com:ieasybooks/tahweel.git</code></li>
  <li>If you downloaded the ZIP archive, extract it, then change into the project directory</li>
  <li>Run the following command to install Tahweel: <code dir="ltr">poetry install</code></li>
</ul>
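
<p dir="ltr">After either installation route, one way to confirm that the package is visible to the current interpreter is to ask the standard library for its installed version. This is a small sketch; the helper name <code>installed_version</code> is hypothetical.</p>

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution='tahweel'):
  """Return the installed version string, or None if the package is absent."""
  try:
    return version(distribution)
  except PackageNotFoundError:
    return None


if __name__ == '__main__':
  print('tahweel version:', installed_version())
```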

<h2 dir="rtl">Using Tahweel</h2>

<h3 dir="rtl">Available options</h3>

<ul dir="rtl">
  <li>Paths of PDF files, or of directories containing PDF files: pass the file or directory paths directly after the tool name. For example: <code dir="ltr">tahweel "./pdfs"</code></li>
  <li>Service Account Credentials file: pass the path of your Google Cloud Platform <code>JSON</code> file to the <code dir="ltr">--service-account-credentials</code> option</li>
  <li>Number of PDF-to-image conversion threads: set it with the <code dir="ltr">--pdf2image-thread-count</code> option. Lower or raise this value depending on how powerful your machine is. The default value is <code dir="ltr">8</code></li>
  <li>Number of image-to-text conversion workers: set it with the <code dir="ltr">--processor-max-workers</code> option. Lower or raise this value depending on the quality of your internet connection. The default value is <code dir="ltr">8</code></li>
  <li>Output layout when processing a directory: when processing an entire directory of PDF files, you can choose the output layout by passing either <code>tree_to_tree</code> or <code>side_by_side</code> to the <code dir="ltr">--dir-output-type</code> option. <code>tree_to_tree</code> creates, for each output type (TXT and DOCX), a new directory with the same structure as the original directory. <code>side_by_side</code> writes the TXT and DOCX files next to the PDF files inside the original directory. The default value is <code dir="ltr">tree_to_tree</code></li>
  <li>Page separator in TXT files: set the text that separates pages in TXT files with the <code dir="ltr">--txt-page-separator</code> option. The default value is <code dir="ltr">PAGE_SEPARATOR</code></li>
  <li>Removing newlines from DOCX files: newlines can be removed from the content before it is written to the DOCX file with the <code dir="ltr">--docx-remove-newlines</code> option. This is useful if you want the DOCX file to have the same number of pages as the PDF file. The default value is <code dir="ltr">False</code></li>
  <li>
    Output formats: choose the output formats with the <code dir="ltr">--output-formats</code> option. Available formats:
    <ul dir="rtl">
      <li><code dir="ltr">txt</code></li>
      <li><code dir="ltr">docx</code></li>
    </ul>
  </li>
  <li>Output directory: set the output directory with the <code dir="ltr">--output-dir</code> option. If you do not specify one, the output location is derived from the file and directory paths you passed to Tahweel</li>
</ul>

```
➜ tahweel --help
usage: tahweel --service-account-credentials SERVICE_ACCOUNT_CREDENTIALS [--pdf2image-thread-count PDF2IMAGE_THREAD_COUNT] [--processor-max-workers PROCESSOR_MAX_WORKERS]
               [--dir-output-type {tree_to_tree,side_by_side}] [--txt-page-separator TXT_PAGE_SEPARATOR] [--docx-remove-newlines] [--output-dir OUTPUT_DIR] [--skip-output-check] [-h] [--version]
               files_or_dirs_paths [files_or_dirs_paths ...]

positional arguments:
  files_or_dirs_paths   Path to the file or directory to be processed.

options:
  --service-account-credentials SERVICE_ACCOUNT_CREDENTIALS
                        (Path, required) Path to the service account credentials JSON file.
  --pdf2image-thread-count PDF2IMAGE_THREAD_COUNT
                        (int, default=8) Number of threads to use for PDF to image conversion using `pdf2image` package.
  --processor-max-workers PROCESSOR_MAX_WORKERS
                        (int, default=8) Number of threads to use while performing OCR on PDF pages.
  --dir-output-type {tree_to_tree,side_by_side}
                        Use this argument when processing a directory. `tree_to_tree` means the output will be in a new directory beside the input directory with the same structure, while `side_by_side`
                        means the output will be in the same input directory beside each file.
  --txt-page-separator TXT_PAGE_SEPARATOR
                        (str, default=PAGE_SEPARATOR) Separator to use between pages in the output TXT file.
  --docx-remove-newlines
                        (bool, default=False) Remove newlines from the output DOCX file. Useful if you want DOCX and PDF to have the same page count.
  --output-dir OUTPUT_DIR
                        (pathlib.Path | None, default=None) Path to the output directory. This overrides the default output directory behavior.
  --skip-output-check   (bool, default=False) Use this flag in development only to skip the output check.
  -h, --help            show this help message and exit
  --version             show program's version number and exit
```
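
<p dir="ltr">When Tahweel is driven from another program, the options above map naturally onto an argument list. The sketch below assembles such a list using only flags that appear in the help output; the function <code>build_tahweel_command</code> and its parameter names are hypothetical and not part of the package.</p>

```python
def build_tahweel_command(
  paths,
  credentials,
  pdf2image_thread_count=8,
  processor_max_workers=8,
  dir_output_type=None,
  txt_page_separator='PAGE_SEPARATOR',
  docx_remove_newlines=False,
  output_dir=None,
):
  """Build an argv list for the `tahweel` CLI (hypothetical helper)."""
  paths = [str(path) for path in paths]
  if not paths:
    raise ValueError('at least one file or directory path is required')
  if dir_output_type not in (None, 'tree_to_tree', 'side_by_side'):
    raise ValueError(f'unsupported dir output type: {dir_output_type!r}')

  # Positional paths come first, followed by the required credentials option.
  command = ['tahweel', *paths]
  command += ['--service-account-credentials', str(credentials)]
  command += ['--pdf2image-thread-count', str(pdf2image_thread_count)]
  command += ['--processor-max-workers', str(processor_max_workers)]
  if dir_output_type is not None:
    command += ['--dir-output-type', dir_output_type]
  command += ['--txt-page-separator', txt_page_separator]
  if docx_remove_newlines:
    # A boolean flag: present means True, absent means the default (False).
    command.append('--docx-remove-newlines')
  if output_dir is not None:
    command += ['--output-dir', str(output_dir)]
  return command


if __name__ == '__main__':
  print(build_tahweel_command(['./pdfs'], './service_account_credentials.json'))
```

<p dir="ltr">The resulting list can be passed to <code>subprocess.run(command, check=True)</code>; building a list rather than a single string avoids shell quoting problems with paths that contain spaces.</p>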

<h3 dir="rtl">Converting from the command line</h3>

<h4 dir="rtl">Converting a single PDF file</h4>

```bash
tahweel "./pdfs/1.pdf" \
  --service-account-credentials "./service_account_credentials.json" \
  --pdf2image-thread-count 8 \
  --processor-max-workers 8 \
  --txt-page-separator PAGE_SEPARATOR
```
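
<p dir="ltr">Because the TXT output separates pages with the chosen separator text, the file can be split back into per-page strings afterwards. A minimal sketch, assuming the separator appears between pages and that whitespace around it is not significant (the exact layout Tahweel writes is not specified here); <code>split_pages</code> is a hypothetical helper.</p>

```python
def split_pages(text, separator='PAGE_SEPARATOR'):
  """Split TXT output into pages on the separator, trimming whitespace."""
  if not separator:
    raise ValueError('separator must be a non-empty string')
  return [page.strip() for page in text.split(separator)]


if __name__ == '__main__':
  sample = 'first page\nPAGE_SEPARATOR\nsecond page\nPAGE_SEPARATOR\nthird page'
  for number, page in enumerate(split_pages(sample), start=1):
    print(number, page)
```

<p dir="ltr">Choosing a separator that cannot occur in the document text keeps this split unambiguous.</p>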

<h4 dir="rtl">Converting multiple PDF files and a directory</h4>

```bash
tahweel "./pdfs/1.pdf" "./pdfs/2.pdf" "./other_pdfs" \
  --service-account-credentials "./service_account_credentials.json" \
  --pdf2image-thread-count 8 \
  --processor-max-workers 8 \
  --txt-page-separator PAGE_SEPARATOR
```

<h4 dir="rtl">Converting an entire directory of files</h4>

```bash
tahweel "./pdfs" \
  --service-account-credentials "./service_account_credentials.json" \
  --pdf2image-thread-count 8 \
  --processor-max-workers 8 \
  --dir-output-type tree_to_tree \
  --txt-page-separator PAGE_SEPARATOR \
  --docx-remove-newlines
```
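
<p dir="ltr">The difference between <code>tree_to_tree</code> and <code>side_by_side</code> comes down to where each output path is placed relative to its PDF. The sketch below illustrates the idea only: <code>planned_output_path</code> is a hypothetical function, and the <code>tree_root</code> directory is supplied by the caller because the directory name Tahweel itself chooses is not stated here.</p>

```python
from pathlib import Path


def planned_output_path(pdf_path, input_dir, dir_output_type, extension, tree_root):
  """Illustrate where an output file would sit for each layout (hypothetical)."""
  pdf_path, input_dir = Path(pdf_path), Path(input_dir)
  # Position of the PDF inside the input directory; raises ValueError if the
  # PDF is not located under input_dir.
  relative = pdf_path.relative_to(input_dir)

  if dir_output_type == 'side_by_side':
    # The output sits next to the PDF, differing only in its extension.
    return pdf_path.with_suffix('.' + extension)
  if dir_output_type == 'tree_to_tree':
    # The output mirrors the input structure under a separate root directory.
    return (Path(tree_root) / relative).with_suffix('.' + extension)
  raise ValueError(f'unsupported dir output type: {dir_output_type!r}')


if __name__ == '__main__':
  for layout in ('tree_to_tree', 'side_by_side'):
    print(layout, planned_output_path('pdfs/books/1.pdf', 'pdfs', layout, 'txt', 'pdfs_out'))
```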

<h3 dir="rtl">Converting from code</h3>

<p dir="rtl">You can use Tahweel from code as follows:</p>

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from tahweel.enums import TahweelType
from tahweel.managers import PdfFileManager
from tahweel.processors import GoogleDriveOcrProcessor
from tahweel.writers import DocxWriter, TxtWriter
from tqdm import tqdm


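def main():
  # Create the OCR processor from the Service Account Credentials JSON file.
  processor = GoogleDriveOcrProcessor('./service_account_credentials.json')

  # Convert the PDF pages to images; 8 is the same value as the CLI default
  # for --pdf2image-thread-count.
  pdf_file_manager = PdfFileManager(Path('./pdfs/1.pdf'), 8)
  pdf_file_manager.to_images()

  # OCR the page images in parallel. executor.map yields results in input
  # order, so `content` stays in page order.
  with ThreadPoolExecutor(max_workers=8) as executor:
    content = list(
      tqdm(executor.map(processor.process, pdf_file_manager.images_paths), total=pdf_file_manager.pages_count()),
    )

  # Write both outputs: TXT with the page separator, and DOCX with False,
  # the same value as the --docx-remove-newlines default.
  TxtWriter(pdf_file_manager.txt_file_path(TahweelType.FILE)).write(content, 'PAGE_SEPARATOR')
  DocxWriter(pdf_file_manager.docx_file_path(TahweelType.FILE)).write(content, False)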


if __name__ == '__main__':
  main()
```
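
<p dir="ltr">The example above relies on <code>executor.map</code> returning results in the order of its inputs, even when later pages finish first. The self-contained sketch below demonstrates that property with a stand-in for the OCR call; <code>fake_ocr</code> and <code>ocr_in_order</code> are hypothetical names, and no Google service is contacted.</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_ocr(image_path):
  """Stand-in for an OCR call: later pages deliberately finish sooner."""
  page = int(image_path.split('_')[1].split('.')[0])
  time.sleep(0.01 * (5 - page))
  return f'text of page {page}'


def ocr_in_order(image_paths, process, max_workers=8):
  """Run `process` on every path in parallel, keeping results in input order."""
  with ThreadPoolExecutor(max_workers=max_workers) as executor:
    return list(executor.map(process, image_paths))


if __name__ == '__main__':
  paths = [f'page_{number}.png' for number in range(1, 5)]
  for text in ocr_in_order(paths, fake_ocr):
    print(text)
```

<p dir="ltr">Since the work is dominated by network waits, threads are a reasonable fit; the worker count plays the same role as <code>--processor-max-workers</code> on the command line.</p>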

<h3 dir="rtl">Converting with Docker</h3>

<p dir="rtl">If you have Docker on your machine, it is the easiest way to use Tahweel. The following command pulls the Tahweel Docker image, converts a PDF file using Google Drive OCR, and writes the results to the current directory:</p>

```bash
docker run -it --rm -v "$PWD:/tahweel" ghcr.io/ieasybooks/tahweel \
  "./pdfs/1.pdf" \
  --service-account-credentials "./service_account_credentials.json" \
  --pdf2image-thread-count 8 \
  --processor-max-workers 8 \
  --dir-output-type tree_to_tree \
  --txt-page-separator PAGE_SEPARATOR \
  --docx-remove-newlines
```

<p dir="rtl">You can pass any of the Tahweel options described above, but the command must be run from inside the directory that contains the PDF files to convert and your Service Account Credentials file.</p>

<hr>

<p dir="rtl">Tahweel relied heavily on the <a href="https://github.com/ocrarian/ocrarian.py">ocrarian.py</a> repository, which allowed it to be built faster; may God reward those who worked on it.</p>

            

Raw data

            {
    "_id": null,
    "home_page": "https://tahweel.ieasybooks.com",
    "name": "tahweel",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": null,
    "keywords": "tahweel, ocr, pdf, word, txt, google-drive-api",
    "author": "EasyBooks",
    "author_email": "easybooksdev@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/5a/ca/db4e8f39e2a1fd489717336df5a7150244e7091589a48db5817c0a5ef61a/tahweel-0.0.13.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "\u062a\u062d\u0648\u064a\u0644 \u0645\u0644\u0641\u0627\u062a PDF \u0625\u0644\u0649 Word \u0648 TXT",
    "version": "0.0.13",
    "project_urls": {
        "Homepage": "https://tahweel.ieasybooks.com",
        "Repository": "https://github.com/ieasybooks/tahweel"
    },
    "split_keywords": [
        "tahweel",
        " ocr",
        " pdf",
        " word",
        " txt",
        " google-drive-api"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ec8293a91c5cb706147dcb4d5436c782123181907b6edd913d935165ba9cea3f",
                "md5": "f886e53d7eef8eb29eb9908638e7729f",
                "sha256": "1565b51ec041c0db75f90706706dd043e4ceaf402984a449de0397a30ad26d21"
            },
            "downloads": -1,
            "filename": "tahweel-0.0.13-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f886e53d7eef8eb29eb9908638e7729f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.10",
            "size": 19933,
            "upload_time": "2024-07-28T13:56:25",
            "upload_time_iso_8601": "2024-07-28T13:56:25.721928Z",
            "url": "https://files.pythonhosted.org/packages/ec/82/93a91c5cb706147dcb4d5436c782123181907b6edd913d935165ba9cea3f/tahweel-0.0.13-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5acadb4e8f39e2a1fd489717336df5a7150244e7091589a48db5817c0a5ef61a",
                "md5": "475e2aed0aa56dfa13220262c4d8cb5d",
                "sha256": "5e5e424f6ee6ad00efcee7dcc324229ecb85f4cc9c4b34f39fbba2265b106721"
            },
            "downloads": -1,
            "filename": "tahweel-0.0.13.tar.gz",
            "has_sig": false,
            "md5_digest": "475e2aed0aa56dfa13220262c4d8cb5d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 15804,
            "upload_time": "2024-07-28T13:56:27",
            "upload_time_iso_8601": "2024-07-28T13:56:27.201396Z",
            "url": "https://files.pythonhosted.org/packages/5a/ca/db4e8f39e2a1fd489717336df5a7150244e7091589a48db5817c0a5ef61a/tahweel-0.0.13.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-28 13:56:27",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ieasybooks",
    "github_project": "tahweel",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "tahweel"
}
        