Name | pdftranscript |
Version | 1.0.5 |
home_page | None |
Summary | PDF to semantic HTML conversion. |
upload_time | 2024-11-18 17:35:17 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.9 |
license | None |
keywords | html, pdf, conversion |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
# PDF to semantic HTML conversion
Transcript contains Python programs that transcribe PDFs into
semantic HTML.
[pdftranscript](https://github.com/fmalina/PDFtranscript/blob/main/pdftranscript/transcript.py) -
Get semantic HTML from PDFs converted by pdf2htmlEX.
[pdfttf](https://github.com/fmalina/PDFtranscript/blob/main/pdftranscript/ttf.py) -
Recover lost text from PDFs where TrueType font characters are nothing more than
images of themselves.
[pdf2html](https://github.com/fmalina/PDFtranscript/blob/main/pdftranscript/pdf2html.py) -
Batch process a folder full of PDFs, ready for pdftranscript.
Read the docstrings for more information.
## Example
[PDF before](https://fmalina.github.io/PDFtranscript/tests/PDF/report-1967329.pdf)
and [semantic HTML after](https://fmalina.github.io/PDFtranscript/tests/HTM/report-1967329.htm)
## Installation
    pip install pdftranscript

Install Python along with the latest pdf2htmlEX.

On OS X with Homebrew:

    brew install python3 pdf2htmlEX
or on Ubuntu/Debian:

    sudo apt update && sudo apt install -y libfontconfig1 libcairo2 libjpeg-turbo8 ttfautohint
    wget -O pdf2htmlEX.deb https://github.com/pdf2htmlEX/pdf2htmlEX/releases/download/v0.18.8.rc1/pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64.deb

Check that `sha256sum pdf2htmlEX.deb` matches `4ef2698cbeb6995189ac...`, then install the package and confirm it runs:

    sudo apt install ./pdf2htmlEX.deb
    pdf2htmlEX -v
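
The checksum comparison can also be scripted. The following is a minimal sketch, not part of pdftranscript: the helper names `sha256_of` and `matches` are made up for illustration, and the expected digest must be the full value you trust for the release, since the one quoted in this README is truncated.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's digest equals the full expected digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Call `matches("pdf2htmlEX.deb", full_digest)` before running `apt install`, and stop if it returns `False`.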
A Docker install of pdf2htmlEX is also supported (the Homebrew one has
started failing recently). The image below is tested and used in the
default config via `DOCKER_IMG_TAG`.

    docker pull pdf2htmlex/pdf2htmlex:0.18.8.rc2-master-20200820-ubuntu-20.04-x86_64

`pip install pdftranscript` should install `lxml` and `freetype-py` too.
## Configure
Configure your project path in your `.env` file and `config.py`, **most
importantly the DATA_DIR**. This can be any folder, for example
`DATA_DIR=/path/to/pdf-transcript/tests`. If you use a Docker install
of pdf2htmlEX, you'll also need to set `DOCKER_INSTALL=1`. This will mount
your data dir into the Docker container. `DOCKER_IMG_TAG` is also
[configurable](pdftranscript/config.py). Go ahead and create your `.env` file and add
`DATA_DIR=...`
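
To catch a missing or mistyped `DATA_DIR` before running anything, a small check like the one below may help. It is a sketch only: `read_env` and `check_settings` are hypothetical helpers, and this is not necessarily how `config.py` itself loads the `.env` file.

```python
from pathlib import Path


def read_env(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    settings = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def check_settings(settings):
    """Return a list of problems found in the parsed settings."""
    problems = []
    if not settings.get("DATA_DIR"):
        problems.append("DATA_DIR is not set")
    elif not Path(settings["DATA_DIR"]).is_dir():
        problems.append("DATA_DIR does not exist")
    return problems
```

An empty list from `check_settings(read_env(".env"))` means the data directory is set and exists.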
With the default configuration, your DATA_DIR should end up containing
three folders: PDF, HTML and HTM. Create a `PDF` folder inside it and
drop your PDFs there.
- PDF is the folder that holds your source PDFs.
- HTML is where pdf2htmlEX output (non-semantic HTML) ends up after
  running `./pdf2html.py`, which just runs pdf2htmlEX with suitable
  options.
- HTM is the final destination where the semantic HTML is written after
  running `./transcript.py`.
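
The folder layout described above can be prepared and inspected with a few lines of Python. This is an illustrative sketch with made-up helper names (`ensure_layout`, `pending_pdfs`). It assumes the default folder names and that each `name.pdf` ends up as `name.htm`, as in the example report linked earlier.

```python
from pathlib import Path

# Default folder names from the README; adjust if your config differs.
FOLDERS = ("PDF", "HTML", "HTM")


def ensure_layout(data_dir):
    """Create the PDF, HTML and HTM folders under data_dir if missing."""
    root = Path(data_dir)
    for name in FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def pending_pdfs(data_dir):
    """List PDFs that have no matching .htm file in the HTM folder yet."""
    root = Path(data_dir)
    done = {p.stem for p in (root / "HTM").glob("*.htm")}
    return sorted(
        p.name for p in (root / "PDF").glob("*.pdf") if p.stem not in done
    )
```

Running `pending_pdfs` after a batch shows which source files have not yet produced semantic output.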
## Run
Run `pdf2html`, or `./pdftranscript/pdf2html.py` in a cloned repo.
Then run `pdftranscript`, or `./pdftranscript/transcript.py`.
When you change configuration within `transcript.py` or tweak some
code, you only need to rerun `./pdftranscript/transcript.py`.
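
The two steps can be chained from a small driver script. The sketch below assumes the installed `pdf2html` and `pdftranscript` commands take no arguments and pick up their settings from your configuration. The `plan` and `run` helpers are hypothetical and not part of the package.

```python
import subprocess

CONVERT = ["pdf2html"]          # PDF -> non-semantic HTML (runs pdf2htmlEX)
TRANSCRIBE = ["pdftranscript"]  # non-semantic HTML -> semantic HTM


def plan(pdfs_changed):
    """Pick the steps to run; conversion is only needed for new PDFs."""
    return [CONVERT, TRANSCRIBE] if pdfs_changed else [TRANSCRIBE]


def run(pdfs_changed, runner=subprocess.run):
    """Run the planned steps in order, stopping at the first failure."""
    for cmd in plan(pdfs_changed):
        runner(cmd, check=True)
```

Use `run(True)` after adding PDFs and `run(False)` after changing only `transcript.py` settings or code.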
## Development process
Set an expected (hand-adjusted) output to aim for, then improve the
codebase to bring the transcript output closer to that ideal semantic
output. Make sure your changes don't make output worse for other tests.
Lint with `ruff check`.
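
One way to keep an eye on regressions is to score each test's output against its hand-adjusted target and compare scores before and after a change. This sketch uses `difflib` from the standard library. The `similarity` and `regressions` helpers are illustrative and are not the project's own test harness.

```python
from difflib import SequenceMatcher


def similarity(expected, actual):
    """Score in [0, 1]; 1.0 means the output equals the expected HTML."""
    return SequenceMatcher(None, expected, actual, autojunk=False).ratio()


def regressions(before, after, tolerance=0.0):
    """Names of tests whose score dropped by more than the tolerance."""
    return sorted(
        name for name, old in before.items()
        if after.get(name, 0.0) < old - tolerance
    )
```

Here `before` and `after` are dicts mapping a test name to its `similarity` score, collected before and after your change.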
Raw data
{
"_id": null,
"home_page": null,
"name": "pdftranscript",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "HTML, PDF, conversion",
"author": null,
"author_email": "Franti\u0161ek Malina <fmalina@pm.me>",
"download_url": "https://files.pythonhosted.org/packages/61/c9/e80841a18d89ec3a594b702d05f0709d5f85a3e623b6c1f632a19ab6e4aa/pdftranscript-1.0.5.tar.gz",
"platform": null,
"description": "# PDF to semantic HTML conversion\n\nTranscript contains Python programs whose job is to transcribe PDF into\nsematic HTML.\n\n[pdftranscript](https://github.com/fmalina/PDFtranscript/blob/main/pdftranscript/transcript.py) - \nGet semantic HTML from PDFs converted by pdf2htmlEX.\n\n[pdfttf](https://github.com/fmalina/PDFtranscript/blob/main/pdftranscript/ttf.py) - \nRecover lost text from PDFs where true type font characters are nothing more than\nimages of themselves.\n\n[pdf2html](https://github.com/fmalina/PDFtranscript/blob/main/pdftranscript/pdf2html.py) - \nBatch process a folder full of PDFs ready for pdftranscript\n\nRead the docstrings for more information.\n\n## Example\n\n[PDF before](https://fmalina.github.io/PDFtranscript/tests/PDF/report-1967329.pdf)\nand [semantic HTML after](https://fmalina.github.io/PDFtranscript/tests/HTM/report-1967329.htm)\n\n## Installation\n \n pip install pdftranscript\n\nGet Python installed along with latest pdf2htmlEX. \n\nOn OS X with Homebrew:\n\n brew install python3 pdf2htmlEX\n\nor on Ubuntu/Debian\n\n sudo apt update && sudo apt install -y libfontconfig1 libcairo2 libjpeg-turbo8 ttfautohint\n wget -o pdf2htmlEX.deb https://github.com/pdf2htmlEX/pdf2htmlEX/releases/download/v0.18.8.rc1/pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64.deb\n\nCheck `sha256sum pdf2htmlEX.deb` matches `4ef2698cbeb6995189ac...`\n\n sudo apt install ./pdf2htmlEX.deb\n pdf2htmlEX -v\n\nDocker install of pdf2htmlEX is also supported (brew one started failing\nas of late). This particular image is tested and used in the default\nconfig via `DOCKER_IMG_TAG`.\n\n docker pull\n pdf2htmlex/pdf2htmlex:0.18.8.rc2-master-20200820-ubuntu-20.04-x86_64\n\n`pip install pdftranscript` should install `lxml` and `freetype-py` too.\n\n## Configure\n\nConfigure your project path in your `.env` file and `config.py` **most\nimportantly the DATA_DIR**. This can be any folder let\\'s say\n`DATA_DIR=/path/to/pdf-transcript/tests`. 
If you use a docker install\nof pdf2htmlEX, you\\'ll need to set `DOCKER_INSTALL=1` This will mount\nyour data dir to Docker path. `DOCKER_IMG_TAG` is also\n[configurable](pdftranscript/config.py). Go ahead create your `.env` file and add\n`DATA_DIR=...`\n\nYour DATA_DIR should end up containing 3 folders: PDF, HTML and HTM if\nyou otherwise stick with default configuration. Create a 'PDF' folder\ninside and drop your PDFs there.\n\n- PDF is a folder where your PDFs are.\n- HTML is where pdf2htmlEX output (non-semantic HTML) ends up after\n running `./pdf2html.py`, which just runs pdf2htmlEX with suitable\n options.\n- HTM is the final destination where semantic HTML gets born after\n running `./transcript.py`.\n\n## Run\n\n`pdf2html` or `./pdftranscript/pdf2html.py` in a cloned repo.\n\n`pdftranscript` or `./pdftranscript/transcript.py`\n\nWhen you change configuration within `transcript.py` or tweak some\ncode. You only need to run `./pdftranscript/transcript.py`\n\n## Development process\n\nSet expected (hand-adjusted) output to aim for and improve codebase to\nget transcript output closer to the ideal semantic output. Make sure\nyour changes don't make output worse for other tests. Use\n`ruff check`.\n",
"bugtrack_url": null,
"license": null,
"summary": "PDF to semantic HTML conversion.",
"version": "1.0.5",
"project_urls": null,
"split_keywords": [
"html",
" pdf",
" conversion"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "2ddfd8f3be631ea444576de9399319fa13baf83be002e63dce00b383de583607",
"md5": "f4782b5d0b1e66d4ec37ba9448d99320",
"sha256": "02466cf87bf06885e34c1259c869c18bdd177e13f8918c1680d369e9e582f4f0"
},
"downloads": -1,
"filename": "pdftranscript-1.0.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f4782b5d0b1e66d4ec37ba9448d99320",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 16730,
"upload_time": "2024-11-18T17:35:16",
"upload_time_iso_8601": "2024-11-18T17:35:16.224105Z",
"url": "https://files.pythonhosted.org/packages/2d/df/d8f3be631ea444576de9399319fa13baf83be002e63dce00b383de583607/pdftranscript-1.0.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "61c9e80841a18d89ec3a594b702d05f0709d5f85a3e623b6c1f632a19ab6e4aa",
"md5": "278410f802b881cfd694e18e11c6e756",
"sha256": "8b16f0edf8af69acfd73f9d5954b03003e469d3dfb6a438f60047d091df462a8"
},
"downloads": -1,
"filename": "pdftranscript-1.0.5.tar.gz",
"has_sig": false,
"md5_digest": "278410f802b881cfd694e18e11c6e756",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 16729,
"upload_time": "2024-11-18T17:35:17",
"upload_time_iso_8601": "2024-11-18T17:35:17.964125Z",
"url": "https://files.pythonhosted.org/packages/61/c9/e80841a18d89ec3a594b702d05f0709d5f85a3e623b6c1f632a19ab6e4aa/pdftranscript-1.0.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-18 17:35:17",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "pdftranscript"
}