pdferli


Name: pdferli
Version: 0.11
Home page: https://github.com/hansalemaos/pdferli
Summary: Convert PDFs into pandas DataFrames, remove restrictions, put/crack PDF passwords
Upload time: 2023-08-24 02:41:14
Author: Johannes Fischer
License: MIT
Keywords: pdf, parsing, passwords
            
# Convert PDFs into pandas DataFrames, remove restrictions, put/crack PDF passwords

## pip install pdferli 

#### Tested against Windows 10 / Python 3.10 / Anaconda 

```python

crack_password(file, chars, processes=4, minlen=None, maxlen=None, verbose=True)
	Attempt to crack a PDF password using a brute-force approach.
	
	Args:
		file (str): Path to the encrypted PDF file.
		chars (iterable): List of characters to generate passwords from.
		processes (int, optional): Number of parallel processes for password cracking. Defaults to 4.
		minlen (int, optional): Minimum length of generated passwords. Defaults to 1.
		maxlen (int, optional): Maximum length of generated passwords. Defaults to length of chars + 1.
		verbose (bool, optional): Whether to display progress information. Defaults to True.
	
	Returns:
		str or None: The cracked password, or None if no candidate password matched.


get_pdfdf(path, normalize_content=False, **kwargs)
	Extract structured data from a PDF document and return it as a pandas DataFrame.
	
	Args:
		path (str): Path to the PDF file.
		normalize_content (bool, optional): Whether to normalize content extraction. Defaults to False.
		**kwargs: Additional keyword arguments for pikepdf.open and extract_pages methods.
	
	Returns:
		pandas.DataFrame: DataFrame containing extracted structured data from the PDF.

put_password_encryption(inputfile, outputfile, password)
	Encrypt a PDF file using a specified password.
	
	Args:
		inputfile (str): Path to the input PDF file.
		outputfile (str): Path to the output encrypted PDF file.
		password (str): Password for encryption.


remove_restrictions(inputfile, outputfile, **kwargs)
	Remove encryption and restrictions from a PDF file.
	
	Args:
		inputfile (str): Path to the input encrypted PDF file.
		outputfile (str): Path to the output decrypted PDF file.
		**kwargs: Additional keyword arguments for pikepdf.save method.


# Examples:

from time import perf_counter

from pdferli import (
    crack_password,
    put_password_encryption,
    remove_restrictions,
    get_pdfdf,
)


# Everything runs inside the main guard: crack_password uses multiprocessing,
# and on Windows each worker process re-imports this module on start-up.
if __name__ == "__main__":
    # Encrypt a copy of the sample PDF with the password "1234"
    put_password_encryption(
        r"C:\sample.pdf",
        r"C:\sample4.pdf",
        password="1234",
    )

    # Write an unrestricted copy, then parse the original into a DataFrame
    path = r"C:\Arquivo.pdf"
    remove_restrictions(path, "c:\\norestrictions.pdf")
    df = get_pdfdf(path, normalize_content=False)

    # Brute-force the password of the encrypted copy and time the search
    start = perf_counter()
    x = crack_password(
        file=r"C:\sample4.pdf",
        chars=list("0123456789"),
        processes=4,
        minlen=0,
        maxlen=None,
        verbose=True,
    )
    print(perf_counter() - start)
    print(x)



# output df
   aa_adv  aa_bits aa_colorspace  aa_element_index aa_element_type  aa_evenodd  aa_fill aa_fontname  aa_height aa_imagemask  aa_linewidth aa_name    aa_size aa_srcsize aa_stream  aa_stroke aa_text      aa_text_element aa_text_line  aa_upright   aa_width       aa_x0       aa_x1       aa_y0       aa_y1 bb_hierachy_element bb_hierachy_page
0  31.968     <NA>          <NA>                 0          LTChar        <NA>     <NA>     ArialMT  56.546172         <NA>          <NA>    <NA>  56.546172       <NA>      <NA>       <NA>       A  APENAS VISUALIZAÇÃO            A        True  11.336388  126.431281  137.767669  242.012331  298.558504           (0, 0, 0)           (0, 0)
1    <NA>     <NA>          <NA>                 1          LTAnno        <NA>     <NA>        <NA>       <NA>         <NA>          <NA>    <NA>       <NA>       <NA>      <NA>       <NA>                           \n         <NA>       False       <NA>        <NA>        <NA>        <NA>        <NA>           (0, 0, 0)           (0, 0)
2  31.968     <NA>          <NA>                 2          LTChar        <NA>     <NA>     ArialMT  56.546172         <NA>          <NA>    <NA>  56.546172       <NA>      <NA>       <NA>       P  APENAS VISUALIZAÇÃO            P        True  11.336388  149.036174  160.372561  264.617224  321.163396           (0, 0, 0)           (0, 0)
3    <NA>     <NA>          <NA>                 3          LTAnno        <NA>     <NA>        <NA>       <NA>         <NA>          <NA>    <NA>       <NA>       <NA>      <NA>       <NA>                           \n         <NA>       False       <NA>        <NA>        <NA>        <NA>        <NA>           (0, 0, 0)           (0, 0)
4  31.968     <NA>          <NA>                 4          LTChar        <NA>     <NA>     ArialMT  56.546172         <NA>          <NA>    <NA>  56.546172       <NA>      <NA>       <NA>       E  APENAS VISUALIZAÇÃO            E        True  11.336388  171.641066  182.977454  287.222116  343.768289           (0, 0, 0)           (0, 0)
```
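
To give a feel for what a brute-force search over `chars` involves, here is a minimal sketch of candidate generation using only the standard library. It is not pdferli's implementation: the helper names `candidate_passwords` and `search_space_size` are hypothetical, and treating `maxlen` as inclusive is an assumption.

```python
from itertools import product


def candidate_passwords(chars, minlen=1, maxlen=None):
    """Yield every string over `chars` with length minlen..maxlen, shortest first."""
    if maxlen is None:
        # Assumption: mirrors the documented default "length of chars + 1"
        maxlen = len(chars) + 1
    for length in range(minlen, maxlen + 1):
        for combo in product(chars, repeat=length):
            yield "".join(combo)


def search_space_size(n_chars, minlen, maxlen):
    """Count the candidates: the sum of n_chars**k for k in minlen..maxlen."""
    return sum(n_chars**k for k in range(minlen, maxlen + 1))


digits = list("0123456789")
four_digit = list(candidate_passwords(digits, minlen=4, maxlen=4))
print(len(four_digit))               # 10000
print(four_digit.index("1234"))      # 1234
print(search_space_size(10, 1, 4))   # 11110
```

The search space grows exponentially with the password length, so passing a narrow `minlen`/`maxlen` range and a small `chars` list keeps the run time manageable.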

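Once `get_pdfdf` has returned a DataFrame, ordinary pandas operations apply. The sketch below uses a hand-built stand-in frame that copies a few columns and values from the sample output above, so it runs without a PDF. The `None` values in the `LTAnno` rows are placeholders, not what pdferli necessarily returns.

```python
import pandas as pd

# Hand-built stand-in for the frame returned by get_pdfdf; only a few of the
# columns from the sample output are reproduced here.
df = pd.DataFrame(
    {
        "aa_element_index": [0, 1, 2, 3, 4],
        "aa_element_type": ["LTChar", "LTAnno", "LTChar", "LTAnno", "LTChar"],
        "aa_text": ["A", None, "P", None, "E"],  # None is a placeholder for LTAnno rows
        "aa_fontname": ["ArialMT", None, "ArialMT", None, "ArialMT"],
        "aa_x0": [126.431281, None, 149.036174, None, 171.641066],
    }
)

# Keep only the character rows, in document order
chars = df.loc[df["aa_element_type"] == "LTChar"].sort_values("aa_element_index")

# Rebuild the text from the individual characters
text = "".join(chars["aa_text"])
print(text)  # APE

# Count how many characters use each font
fonts = chars["aa_fontname"].value_counts().to_dict()
print(fonts)  # {'ArialMT': 3}
```

The same filter-then-aggregate pattern works on the coordinate columns (`aa_x0`, `aa_y0`, and so on), for example to select text inside a region of the page.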
            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/hansalemaos/pdferli",
    "name": "pdferli",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "pdf,parsing,passwords",
    "author": "Johannes Fischer",
    "author_email": "aulasparticularesdealemaosp@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/47/65/19a603bfd5a7e44abd380da3b3df8f2a002f15f5417fdd996241ed82311e/pdferli-0.11.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Convert PDFs into pandas DataFrames, remove restrictions, put/crack PDF passwords",
    "version": "0.11",
    "project_urls": {
        "Homepage": "https://github.com/hansalemaos/pdferli"
    },
    "split_keywords": [
        "pdf",
        "parsing",
        "passwords"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2dc4aef642801ea4103f343a054a17ad9114e993b9bf5ef1585385690d94d02b",
                "md5": "3b9bbfd9f4c54711271f86cd4fc34680",
                "sha256": "ef12ce3e7b1d1288f7f5382e41abc20936c9f30d49f324aab6b3055e0b039bf2"
            },
            "downloads": -1,
            "filename": "pdferli-0.11-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3b9bbfd9f4c54711271f86cd4fc34680",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 15041,
            "upload_time": "2023-08-24T02:41:12",
            "upload_time_iso_8601": "2023-08-24T02:41:12.729373Z",
            "url": "https://files.pythonhosted.org/packages/2d/c4/aef642801ea4103f343a054a17ad9114e993b9bf5ef1585385690d94d02b/pdferli-0.11-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "476519a603bfd5a7e44abd380da3b3df8f2a002f15f5417fdd996241ed82311e",
                "md5": "9fd1d1fb264eaac6d962b042fbac48a4",
                "sha256": "929dd3c8ed8d8c7083f448f4193b94274b49d8c252fc6578c90e8a3b55638f4e"
            },
            "downloads": -1,
            "filename": "pdferli-0.11.tar.gz",
            "has_sig": false,
            "md5_digest": "9fd1d1fb264eaac6d962b042fbac48a4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 14568,
            "upload_time": "2023-08-24T02:41:14",
            "upload_time_iso_8601": "2023-08-24T02:41:14.710681Z",
            "url": "https://files.pythonhosted.org/packages/47/65/19a603bfd5a7e44abd380da3b3df8f2a002f15f5417fdd996241ed82311e/pdferli-0.11.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-24 02:41:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hansalemaos",
    "github_project": "pdferli",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "pdferli"
}
        