pdftranscript

Name	pdftranscript
Version	1.0.5
home_page	None
Summary	PDF to semantic HTML conversion.
upload_time	2024-11-18 17:35:17
maintainer	None
docs_url	None
author	None
requires_python	>=3.9
license	None
keywords	html pdf conversion
requirements	No requirements were recorded.

# PDF to semantic HTML conversion

Transcript contains Python programs that transcribe PDFs into
semantic HTML.

[pdftranscript](https://github.com/fmalina/PDFtranscript/blob/main/pdftranscript/transcript.py) -
Get semantic HTML from PDFs converted by pdf2htmlEX.

[pdfttf](https://github.com/fmalina/PDFtranscript/blob/main/pdftranscript/ttf.py) -
Recover lost text from PDFs where TrueType font characters are nothing more than
images of themselves.

[pdf2html](https://github.com/fmalina/PDFtranscript/blob/main/pdftranscript/pdf2html.py) -
Batch process a folder full of PDFs, ready for pdftranscript.

Read the docstrings for more information.

## Example

[PDF before](https://fmalina.github.io/PDFtranscript/tests/PDF/report-1967329.pdf)
and [semantic HTML after](https://fmalina.github.io/PDFtranscript/tests/HTM/report-1967329.htm)

## Installation

    pip install pdftranscript

Install Python along with the latest pdf2htmlEX.

On OS X with Homebrew:

    brew install python3 pdf2htmlEX

or on Ubuntu/Debian:

    sudo apt update && sudo apt install -y libfontconfig1 libcairo2 libjpeg-turbo8 ttfautohint
    wget -O pdf2htmlEX.deb https://github.com/pdf2htmlEX/pdf2htmlEX/releases/download/v0.18.8.rc1/pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64.deb

Check that `sha256sum pdf2htmlEX.deb` matches `4ef2698cbeb6995189ac...`, then install the package and confirm the version:

    sudo apt install ./pdf2htmlEX.deb
    pdf2htmlEX -v
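
Where `sha256sum` is not available, the same check can be done from Python. This is a minimal sketch using only the standard library; the helper names `sha256_of` and `matches_prefix` are illustrative and not part of pdftranscript. Because the digest quoted above is truncated, the sketch compares against a prefix, which is weaker than comparing the full digest published with the release.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path: Path, expected_prefix: str) -> bool:
    """True if the file's digest starts with the (possibly truncated) expected value."""
    return sha256_of(path).startswith(expected_prefix.lower())
```

For example, `matches_prefix(Path("pdf2htmlEX.deb"), "4ef2698cbeb6995189ac")` checks the downloaded package against the prefix shown above.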

A Docker install of pdf2htmlEX is also supported (the Homebrew one has started
failing recently). This particular image is tested and used in the default
config via `DOCKER_IMG_TAG`.

    docker pull pdf2htmlex/pdf2htmlex:0.18.8.rc2-master-20200820-ubuntu-20.04-x86_64
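
To make the Docker route more concrete, here is a rough sketch of how a `docker run` command for this image could be assembled. It is not taken from `pdf2html.py`: the function name `docker_command` and the `/pdf` mount point are hypothetical, and it assumes the image's entrypoint is `pdf2htmlEX`, so that arguments after the image tag are passed to it. The options pdftranscript actually passes may differ.

```python
from pathlib import Path

# Image tag quoted from the README above.
DOCKER_IMG_TAG = "pdf2htmlex/pdf2htmlex:0.18.8.rc2-master-20200820-ubuntu-20.04-x86_64"


def docker_command(data_dir: Path, pdf_name: str, mount_point: str = "/pdf") -> list[str]:
    """Build (but do not run) a docker command that mounts data_dir into the container.

    Assumption: the image's entrypoint is pdf2htmlEX, so pdf_name is handed to it.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir.resolve()}:{mount_point}",  # bind-mount the data dir
        "-w", mount_point,                            # work inside the mount
        DOCKER_IMG_TAG,
        pdf_name,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` once the image has been pulled.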

`pip install pdftranscript` should install `lxml` and `freetype-py` too.

## Configure

Configure your project path in your `.env` file and `config.py`, **most
importantly the DATA_DIR**. This can be any folder, let's say
`DATA_DIR=/path/to/pdf-transcript/tests`. If you use a Docker install
of pdf2htmlEX, you'll need to set `DOCKER_INSTALL=1`. This will mount
your data dir to the Docker path. `DOCKER_IMG_TAG` is also
[configurable](pdftranscript/config.py). Go ahead and create your `.env` file and add
`DATA_DIR=...`

With the default configuration, your DATA_DIR should end up containing 3 folders:
PDF, HTML and HTM. Create a 'PDF' folder
inside and drop your PDFs there.

- PDF is the folder where your PDFs are.
- HTML is where pdf2htmlEX output (non-semantic HTML) ends up after
  running `./pdf2html.py`, which just runs pdf2htmlEX with suitable
  options.
- HTM is the final destination where semantic HTML is written after
  running `./transcript.py`.
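
The folder layout described above can be prepared with a few lines of Python. This is a sketch under assumptions: `ensure_layout` is an illustrative helper, not part of pdftranscript, and it creates all three folders up front even though only `PDF` needs to exist before the first run.

```python
from pathlib import Path

# Default folder names from the README; other configurations may use different ones.
FOLDERS = ("PDF", "HTML", "HTM")


def ensure_layout(data_dir: Path) -> dict[str, Path]:
    """Create the PDF, HTML and HTM folders under data_dir if they are missing."""
    paths = {name: data_dir / name for name in FOLDERS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `ensure_layout(Path("/path/to/pdf-transcript/tests"))` would match the example `DATA_DIR` value; source PDFs then go into the returned `PDF` path.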

## Run

Run `pdf2html`, or `./pdftranscript/pdf2html.py` in a cloned repo.

Then run `pdftranscript`, or `./pdftranscript/transcript.py`.

When you change configuration within `transcript.py` or tweak some
code, you only need to rerun `./pdftranscript/transcript.py`.
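
The two steps can also be chained from a small script. The sketch below assumes both console commands are on `PATH` and take no arguments, reading their settings from `.env` as described under Configure; `pipeline_commands` and `run_pipeline` are illustrative names, not part of the package.

```python
import shutil
import subprocess


def pipeline_commands(transcript_only: bool = False) -> list[list[str]]:
    """Return the commands to run, skipping pdf2html when only the transcript step changed."""
    commands = [] if transcript_only else [["pdf2html"]]
    commands.append(["pdftranscript"])
    return commands


def run_pipeline(transcript_only: bool = False) -> None:
    """Run each step in order and stop at the first failure."""
    for command in pipeline_commands(transcript_only):
        if shutil.which(command[0]) is None:
            raise FileNotFoundError(f"{command[0]} is not on PATH; is pdftranscript installed?")
        subprocess.run(command, check=True)
```

After tweaking `transcript.py`, `run_pipeline(transcript_only=True)` reruns only the second step.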

## Development process

Set the expected (hand-adjusted) output to aim for, then improve the codebase to
get the transcript output closer to that ideal semantic output. Make sure
your changes don't make the output worse for other tests. Lint with
`ruff check`.
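
One way to track whether a change moves output closer to the hand-adjusted target is a line-based similarity score. The sketch below uses the standard library's `difflib`; it is not the project's own test harness, and `similarity` and `regressions` are hypothetical helper names.

```python
import difflib


def similarity(expected_html: str, actual_html: str) -> float:
    """Line-based similarity between expected and actual output, from 0.0 to 1.0."""
    expected_lines = expected_html.splitlines()
    actual_lines = actual_html.splitlines()
    return difflib.SequenceMatcher(None, expected_lines, actual_lines).ratio()


def regressions(before: dict[str, float], after: dict[str, float]) -> list[str]:
    """Names of tests whose score dropped compared with the previous run."""
    return sorted(name for name, score in after.items() if score < before.get(name, 0.0))
```

Scoring every test document before and after a change, then passing both score dictionaries to `regressions`, lists the documents whose output got worse.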

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "pdftranscript",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "HTML, PDF, conversion",
    "author": null,
    "author_email": "Franti\u0161ek Malina <fmalina@pm.me>",
    "download_url": "https://files.pythonhosted.org/packages/61/c9/e80841a18d89ec3a594b702d05f0709d5f85a3e623b6c1f632a19ab6e4aa/pdftranscript-1.0.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "PDF to semantic HTML conversion.",
    "version": "1.0.5",
    "project_urls": null,
    "split_keywords": [
        "html",
        " pdf",
        " conversion"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2ddfd8f3be631ea444576de9399319fa13baf83be002e63dce00b383de583607",
                "md5": "f4782b5d0b1e66d4ec37ba9448d99320",
                "sha256": "02466cf87bf06885e34c1259c869c18bdd177e13f8918c1680d369e9e582f4f0"
            },
            "downloads": -1,
            "filename": "pdftranscript-1.0.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f4782b5d0b1e66d4ec37ba9448d99320",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 16730,
            "upload_time": "2024-11-18T17:35:16",
            "upload_time_iso_8601": "2024-11-18T17:35:16.224105Z",
            "url": "https://files.pythonhosted.org/packages/2d/df/d8f3be631ea444576de9399319fa13baf83be002e63dce00b383de583607/pdftranscript-1.0.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "61c9e80841a18d89ec3a594b702d05f0709d5f85a3e623b6c1f632a19ab6e4aa",
                "md5": "278410f802b881cfd694e18e11c6e756",
                "sha256": "8b16f0edf8af69acfd73f9d5954b03003e469d3dfb6a438f60047d091df462a8"
            },
            "downloads": -1,
            "filename": "pdftranscript-1.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "278410f802b881cfd694e18e11c6e756",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 16729,
            "upload_time": "2024-11-18T17:35:17",
            "upload_time_iso_8601": "2024-11-18T17:35:17.964125Z",
            "url": "https://files.pythonhosted.org/packages/61/c9/e80841a18d89ec3a594b702d05f0709d5f85a3e623b6c1f632a19ab6e4aa/pdftranscript-1.0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-18 17:35:17",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "pdftranscript"
}
