pyamihtml


| Field | Value |
|---|---|
| Name | pyamihtml |
| Version | 0.0.6 |
| home_page | https://github.com/petermr/pyamihtml |
| Summary | pdf2html converter |
| upload_time | 2023-07-07 09:26:11 |
| docs_url | None |
| author | Peter Murray-Rust |
| requires_python | >=3.7 |
| license | Apache2 |
| keywords | text and data mining |
| requirements | beautifulsoup4, braceexpand, lxml, matplotlib, nltk, pdfminer3, Pillow, psutil, PyPDF2, python-rake, setuptools, SPARQLWrapper, tkinterhtml, tkinterweb, future, pdfplumber, configparser, zlib, wheel |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# pyami
Semantic Reader of the Scientific Literature.

A scientific article is not a single dumb file; it can be transformed into many semantic, useful sub-documents. `pyami` is a Python framework for doing this; it reads the scientific literature in bulk, then transforms, searches and analyses the contents.

### status
This is in very active alpha development and early documentation will appear on the Wiki.

# overview
* `pyami` is a personal (client-side) tool that users can customise for their own needs and visions. It does not rely on remote service providers, though it can make its own use of Open sites.

# scope
* `pyami` can collect documents (of any sort) into a corpus or `CProject`. 
* `pyami` has tools for filtering, cleaning, normalizing a corpus
* `pyami` can search this corpus by filenames, filetypes, document structure and content 
* content can include metadata, text, tables, diagrams, math, chemistry, references and (in some cases) citations
* searching uses words, phrases, dictionaries, and image content (vectors and pixels)
* search results can be transformed, filtered, aggregated, and used for iterative enhancement ("snowballing")

# architecture
* downloaded documents (a corpus) are stored on your disk in a folder/directory (a `CProject`)
* each document (`CTree`) in a `CProject` is a living subtree of many labelled text subsections (text, paragraphs, sentences, phrases, words) and images (`png`). This is flexible: new types and subdirectories can be added by users.
* the `CTree` is held on disk and can be further processed by other programs (e.g. pandas, tesseract (image2text), matplotlib)
* a commandline supports many operations for searching and transforming.
* a GUI (`ami-gui`) is layered on the commandline to help navigation, query building and visualisation.
* the commandline can be used by workflow tools such as Jupyter Notebooks
* The `pyami` code is packaged as a Python library for use by other tools
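
Because a `CProject` is just a directory of `CTree` subdirectories, it can be inspected with the standard library alone. The sketch below is illustrative only: it assumes (hypothetically) that every immediate subdirectory of the project is one `CTree`, and the function name `list_ctrees` is not part of `pyami`.

```python
from pathlib import Path
from typing import Dict, List


def list_ctrees(cproject: Path) -> Dict[str, List[str]]:
    """Map each CTree name to the files it holds.

    Assumption: each immediate subdirectory of the CProject is one CTree.
    Loose files directly under the CProject are ignored.
    """
    trees: Dict[str, List[str]] = {}
    for ctree in sorted(p for p in cproject.iterdir() if p.is_dir()):
        # Record every file below the CTree, relative to the CTree root.
        trees[ctree.name] = sorted(
            f.relative_to(ctree).as_posix()
            for f in ctree.rglob("*")
            if f.is_file()
        )
    return trees
```

Calling `list_ctrees(Path("~/projects/open-battery").expanduser())` would then give one entry per document, which is a convenient starting point for handing files on to pandas or matplotlib.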

# components
There are several independent components. Many of these are customised for beginners. They can be referenced by symbols to avoid having to remember filenames. Users customise this with environment variables (often preset).
* project. The CProject holding the corpus. Users can have as many projects as they like.
* dictionary. Many searches use dictionaries and often several are used at once. There are currently over 50 dictionaries in a network but it's easy to create your own.
* code. (in Python3)
 
# commands
This is a subset of current commands (NYI=not yet implemented):
````
optional arguments:
  -h, --help            show this help message and exit
  --apply {pdf2txt,txt2sent,xml2txt} [{pdf2txt,txt2sent,xml2txt} ...]
                        list of sequential transformations (1:1 map) to apply to pipeline ({self.TXT2SENT} NYI)
  --assert ASSERT [ASSERT ...]
                        assertions; failure gives error message (prototype)
  --combine COMBINE     operation to combine files into final object (e.g. concat text or CSV file
  --config [CONFIG [CONFIG ...]], -c [CONFIG [CONFIG ...]]
                        file (e.g. ~/pyami/config.ini) with list of config file(s) or config vars
  --debug DEBUG [DEBUG ...]
                        debugging commands , numbers, (not formalised)
  --demo [DEMO [DEMO ...]]
                        simple demos (NYI). empty gives list. May need downloading corpora
  --dict DICT [DICT ...], -d DICT [DICT ...]
                        dictionaries to ami-search with, _help gives list
  --filter FILTER [FILTER ...]
                        expr to filter with
  --glob GLOB [GLOB ...], -g GLOB [GLOB ...]
                        glob files; python syntax (* and ** wildcards supported); include alternatives in {...,...}.
  --languages LANGUAGES [LANGUAGES ...]
                        languages (NYI)
  --loglevel LOGLEVEL, -l LOGLEVEL
                        log level (NYI)
  --maxbars [MAXBARS]   max bars on plot (NYI)
  --nosearch            search (NYI)
  --outfile OUTFILE     output file, normally 1. but (NYI) may track multiple input dirs (NYI)
  --patt PATT [PATT ...]
                        patterns to search with (NYI); regex may need quoting
  --plot                plot params (NYI)
  --proj PROJ [PROJ ...], -p PROJ [PROJ ...]
                        projects to search; _help will give list
  --sect SECT [SECT ...], -s SECT [SECT ...]
                        sections to search; _help gives all(?)
  --split {txt2para,xml2sect} [{txt2para,xml2sect} ...]
                        split fulltext.* into paras, sections
  --test [{file_lib,pdf_lib,text_lib} [{file_lib,pdf_lib,text_lib} ...]]
                        run tests for modules; no selection runs all
````
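
Several of the options above take one or more values (`PROJ [PROJ ...]`, `SECT [SECT ...]`). For readers unfamiliar with that notation, here is a minimal `argparse` sketch that mirrors a few of the listed options. It is not the `pyami` parser itself; the program name `pyami-sketch` and the section name `introduction` are placeholders.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Build a small parser that mimics a subset of the options listed above."""
    parser = argparse.ArgumentParser(prog="pyami-sketch")  # hypothetical name
    # nargs="+" is what produces the "X [X ...]" form in the help text.
    parser.add_argument("--apply", nargs="+", default=[],
                        choices=["pdf2txt", "txt2sent", "xml2txt"])
    parser.add_argument("--glob", "-g", nargs="+", default=[])
    parser.add_argument("--proj", "-p", nargs="+", default=[])
    parser.add_argument("--sect", "-s", nargs="+", default=[])
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is reproducible.
args = make_parser().parse_args(
    ["--proj", "open_battery", "--sect", "introduction", "--apply", "xml2txt"]
)
print(args.proj, args.sect, args.apply)
```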

# getting started
There is an alpha series of commandline examples which show the operation of the system. Currently:
````
 examples args: ['examples.py']
choose from:
gl => globbing files
pd => convert pdf to text
pa => split pdf text into paragraphs
sc => split xml into sections
sl => split oil26 project into sections
se => split text to sentences
fi => simple filter (not complete)
sp => extract species with italics and regex (not finalised)

all => all examples
````

## config file
**everyone needs**

### a 'pyami' directory. 

This can be anywhere, but is normally where you put program files and their
settings. It will be easiest if it's a direct subdirectory of your HOME directory.
It MUST include a `pyami.ini` file. By default you can use the one in the `py4ami` distribution.

### an environment variable `PYAMI_HOME`

This will point to the `pyami.ini` file. 
See [./CONFIG.md](CONFIG.md)
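
As a rough illustration of the lookup, the sketch below resolves the location of `pyami.ini` from `PYAMI_HOME`. It makes two assumptions that the text above does not settle: that the variable may name either the directory or the file itself, and that `~/pyami` is a sensible fallback when the variable is unset. The function name `find_pyami_ini` is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_pyami_ini(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the expected path of pyami.ini (the file is not checked for existence)."""
    env = os.environ if env is None else env
    raw = env.get("PYAMI_HOME")
    # Assumption: fall back to ~/pyami when PYAMI_HOME is not set.
    base = Path(raw).expanduser() if raw else Path.home() / "pyami"
    # Assumption: accept either the directory or the .ini file itself.
    return base if base.suffix == ".ini" else base / "pyami.ini"
```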


````
(base) pm286macbook:pyami pm286$ more ~/pyami/config.ini
; NOTE. All files use forward slash even on Windows
; use slash (/) to separate filename components, we will convert to file-separator automatically
; variables can be substituted using {}

[DIRS]
home           = ~
dictionary_url = https://github.com/petermr/dictionary
project_dir =    ${home}/projects
cev_open =       ${DIRS:project_dir}/CEVOpen
code_dir =       ${DIRS:project_dir}/openDiagram/physchem/python
; # wikidata taxon name property
; taxon_name.w = P225777
; # italic content
; all_italics.x = xpath(//p//italic/text())
; # species, e.g. Zea mays, T. rex, An. gambiae
; species.r = [A-Z][a-z]?(\.|[a-z]{2,})\s+[a-z]{3,})

[URLS]
petermr_url = https://github.com/petermr
petermr_raw_url = https://raw.githubusercontent.com/petermr
tigr2ess.u =        https://github.com/petermr/tigr2ess/tree/master

[AMISEARCH]
oil3.p = ${DIRS:code_dir}/tst/proj
# wikidata taxon name property
taxon_name = P225
# italic content
all_italics.x = //p//italic/text()
# species, e.g. Zea mays, T. rex, An. gambiae
species.r = [A-Z][a-z]?(\.|[a-z]{2,})\s+[a-z]{3,}

[DICTIONARIES]
dict_dir     = ${DIRS:home}/dictionary

ov_ini       = ${dict_dir}/openvirus20210120/amidict.ini
cev_ini      = ${DIRS:cev_open}/dictionary/amidict.ini

# docanal_ini  = ${dict_dir}/docanal/docanal.ini # not yet added


[PROJECTS]
open_battery =      ${DIRS:project_dir}/open-battery
pr_liion =          ${open_battery}/liion
tigr2ess =          ${DIRS:project_dir}/tigr2ess
open_diagram =      ${DIRS:project_dir}/openDiagram
open_virus =        ${DIRS:project_dir}/openVirus

minicorpora_ini =   ${DIRS:cev_open}/minicorpora/config.ini
cev_searches_ini =  ${DIRS:cev_open}/searches/config.ini
open_diag_ini =     ${DIRS:project_dir}/openDiagram/physchem/resources/config.ini


(base) pm286macbook:pyami pm286$ 
````
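
The `${option}` and `${SECTION:option}` references in this file have the same form as the standard library's `configparser.ExtendedInterpolation`. Whether `pyami` reads the file this way is an assumption; the sketch below only shows that a cut-down copy of the file resolves as expected with the standard library.

```python
import configparser
import os

# A cut-down copy of the config file shown above.
SAMPLE = """
[DIRS]
home = ~
project_dir = ${home}/projects
cev_open = ${DIRS:project_dir}/CEVOpen

[PROJECTS]
open_battery = ${DIRS:project_dir}/open-battery
pr_liion = ${open_battery}/liion
"""

config = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation()
)
config.read_string(SAMPLE)

# References are resolved recursively, within and across sections.
pr_liion = config["PROJECTS"]["pr_liion"]
print(pr_liion)  # ~/projects/open-battery/liion

# "~" is left as-is by configparser; expand it separately when a real path is needed.
print(os.path.expanduser(pr_liion))
```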

## UPDATE `py4ami.ami_dict` and `py4ami.ami_pdf`

These now support `argparse` in their own right (2022-07-09).
Each one reports its own list of argparse commands, for example with `--help`.

### `ami_pdf`

(2022-07-09)

```
python -m py4ami.ami_pdf --help
running PDFArgs main
usage: ami_pdf.py [-h] [--maxpage MAXPAGE] [--indir INDIR] [--inpath INPATH] [--outdir OUTDIR]
                  [--outform OUTFORM] [--flow FLOW] [--imagedir IMAGEDIR] [--resolution RESOLUTION]
                  [--template TEMPLATE] [--debug {words,lines,rects,curves,images,tables,hyperlinks,annots}]

PDF parsing

optional arguments:
  -h, --help            show this help message and exit
  --maxpage MAXPAGE     maximum number of pages
  --indir INDIR         input directory
  --inpath INPATH       input file
  --outdir OUTDIR       output directory
  --outform OUTFORM     output format
  --flow FLOW           create flowing HTML (heuristics)
  --imagedir IMAGEDIR   output images to imagedir
  --resolution RESOLUTION
                        resolution of output images (if imagedir)
  --template TEMPLATE   file to parse specific type of document (NYI)
  --debug {words,lines,rects,curves,images,tables,hyperlinks,annots}
                        debug these during parsing (NYI)

```
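
To drive `ami_pdf` from a script or notebook, the command line can be assembled as a list and passed to `subprocess`. This is a sketch that uses only flags shown in the help output above; the helper name `ami_pdf_command` and the file names `paper.pdf` and `out` are placeholders, and the call itself is left commented out because it needs `py4ami` installed.

```python
import sys
from typing import List, Optional


def ami_pdf_command(inpath: str, outdir: str,
                    maxpage: Optional[int] = None) -> List[str]:
    """Assemble a `python -m py4ami.ami_pdf` command line as an argument list."""
    cmd = [sys.executable, "-m", "py4ami.ami_pdf",
           "--inpath", inpath, "--outdir", outdir]
    if maxpage is not None:
        cmd += ["--maxpage", str(maxpage)]
    return cmd


cmd = ami_pdf_command("paper.pdf", "out", maxpage=5)
print(" ".join(cmd))

# To actually run it (requires py4ami to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```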


## Notes for PMR?
project organization: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
use: pkg_resources

# WARNING
...
AmiUtil exists in both py4ami and pyamiimage. There should be a separate library



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/petermr/pyamihtml",
    "name": "pyamihtml",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "text and data mining",
    "author": "Peter Murray-Rust",
    "author_email": "petermurrayrust@googlemail.com",
    "download_url": "https://files.pythonhosted.org/packages/70/20/287ce0d16fb45c09c846adad353121cd91489c3bac5155ae374ffbcd73b2/pyamihtml-0.0.6.tar.gz",
    "platform": null,
    "description": "# pyami\nSemantic Reader of the Scientific Literature.\n\nA scientific article is not a single dumb file; it can be transformed into many semantic, useful sub-documents. `pyami` is a Python framework for doing this; it reads the scientific literature in bulk, transforms, searches and and analyses the contents.\n\n### status\nThis is in very active alpha development and early documentation will appear on the Wiki.\n\n# overview\n* `pyami` is a personal (client-side) tool that users can customise for their own needs and visions. It does not reply on remote service providers, though it can make its own use of Open sites. \n\n# scope\n* `pyami` can collect documents (of any sort) into a corpus or `CProject`. \n* `pyami` has tools for filtering, cleaning, normalizing a corpus\n* `pyami` can search this corpus by filenames, filetypes, document structure and content \n* content can include metadata, text, tables, diagrams, math, chemistry, references and (in some cases) citations\n* searching uses words, phrases, dictionaries, and image content (vectors and pixels)\n* search results can be transformed, filtered, aggregated, and used for iterative enhancement (\"snowballing\")\n\n# architecture\n* downloaded documents (a corpus) are stored on your disk in a folder/directory (a CProject`)\n* each document (`CTree`) in a `CProject` is a living subtree of many labelled text subsections (text, paragraphs, sentences, phrases, words) and images (`png`) . This is flexible (new types and subdirectories) can be added by users).\n* the CTree is held on disk and can be further processed by other programs (e.g. 
pandas, tesseract (image2text), matplotlib\n* a commandline supports many operations for searching, and transforming.\n* a GUI (`ami-gui`) is layered on the commandline to help navigation, query building and visualisation.\n* the commandline can be used by workflow tools such as Jupyter Notebooks\n* The `pyami` code is packaged as a Python library for use by other tools\n\n# components\nThere are several independent components. Many of these are customised for beginners. They can be referenced by symbols to avoid having to remember filenames. Users customise this with environment variables (often preset).\n* project. The CProject holding the corpus. Users can have as many projects as they like.\n* dictionary. Many searches use dictionaries and often several are used at once. There are currently over 50 dictionaries in a network but it's easy to create your own.\n* code. (in Python3)\n \n# commands\nThis is a subset of current commands (NYI=not yet implemented):\n````\noptional arguments:\n  -h, --help            show this help message and exit\n  --apply {pdf2txt,txt2sent,xml2txt} [{pdf2txt,txt2sent,xml2txt} ...]\n                        list of sequential transformations (1:1 map) to apply to pipeline ({self.TXT2SENT} NYI)\n  --assert ASSERT [ASSERT ...]\n                        assertions; failure gives error message (prototype)\n  --combine COMBINE     operation to combine files into final object (e.g. concat text or CSV file\n  --config [CONFIG [CONFIG ...]], -c [CONFIG [CONFIG ...]]\n                        file (e.g. ~/pyami/config.ini) with list of config file(s) or config vars\n  --debug DEBUG [DEBUG ...]\n                        debugging commands , numbers, (not formalised)\n  --demo [DEMO [DEMO ...]]\n                        simple demos (NYI). empty gives list. 
May need downloading corpora\n  --dict DICT [DICT ...], -d DICT [DICT ...]\n                        dictionaries to ami-search with, _help gives list\n  --filter FILTER [FILTER ...]\n                        expr to filter with\n  --glob GLOB [GLOB ...], -g GLOB [GLOB ...]\n                        glob files; python syntax (* and ** wildcards supported); include alternatives in {...,...}.\n  --languages LANGUAGES [LANGUAGES ...]\n                        languages (NYI)\n  --loglevel LOGLEVEL, -l LOGLEVEL\n                        log level (NYI)\n  --maxbars [MAXBARS]   max bars on plot (NYI)\n  --nosearch            search (NYI)\n  --outfile OUTFILE     output file, normally 1. but (NYI) may track multiple input dirs (NYI)\n  --patt PATT [PATT ...]\n                        patterns to search with (NYI); regex may need quoting\n  --plot                plot params (NYI)\n  --proj PROJ [PROJ ...], -p PROJ [PROJ ...]\n                        projects to search; _help will give list\n  --sect SECT [SECT ...], -s SECT [SECT ...]\n                        sections to search; _help gives all(?)\n  --split {txt2para,xml2sect} [{txt2para,xml2sect} ...]\n                        split fulltext.* into paras, sections\n  --test [{file_lib,pdf_lib,text_lib} [{file_lib,pdf_lib,text_lib} ...]]\n                        run tests for modules; no selection runs all\n````\n\n# getting started\nThere are an alpha series of commandline examples which show the operation of the system. Currently:\n````\n examples args: ['examples.py']\nchoose from:\ngl => globbing files\npd => convert pdf to text\npa => split pdf text into paragraphs\nsc => split xml into sections\nsl => split oil26 project into sections\nse => split text to sentences\nfi => simple filter (not complete)\nsp => extract species with italics and regex (not finalised)\n\nall => all examples\n````\n\n## config file\n**everyone needs**\n\n### a 'pyami' directory. 
\n\nThis can be naywhere but normally where you put program files and their\nsetting. It will be easiest if it's a direct subdirectory of your HOME directory. \nIt MUST include a `pyami.ini` file. By default you can use the one in the `py4ami` distribution\n\n### an environmental variable `PYAMI_HOME` \n\nThis will point to the `pyami.ini` file. \nSee [./CONFIG.md](CONFIG.md)\n\n\n`\n````\n(base) pm286macbook:pyami pm286$ more ~/pyami/config.ini\n; NOTE. All files use forward slash even on Windows\n; use slash (/) to separate filename components, we will convert to file-separator automatically\n; variables can be substituted using {}\n\n[DIRS]\nhome           = ~\ndictionary_url = https://github.com/petermr/dictionary\nproject_dir =    ${home}/projects\ncev_open =       ${DIRS:project_dir}/CEVOpen\ncode_dir =       ${DIRS:project_dir}/openDiagram/physchem/python\n; # wikidata taxon name property\n; taxon_name.w = P225777\n; # italic content\n; all_italics.x = xpath(//p//italic/text())\n; # species, e.g. Zea mays, T. rex, An. gambiae\n; species.r = [A-Z][a-z]?(\\.|[a-z]{2,})\\s+[a-z]{3,})\n\n[URLS]\npetermr_url = https://github.com/petermr\npetermr_raw_url = https://raw.githubusercontent.com/petermr\ntigr2ess.u =        https://github.com/petermr/tigr2ess/tree/master\n\n[AMISEARCH]\noil3.p = ${DIRS:code_dir}/tst/proj\n# wikidata taxon name property\ntaxon_name = P225\n# italic content\nall_italics.x = //p//italic/text()\n# species, e.g. Zea mays, T. rex, An. 
gambiae\nspecies.r = [A-Z][a-z]?(\\.|[a-z]{2,})\\s+[a-z]{3,}\n\n[DICTIONARIES]\ndict_dir     = ${DIRS:home}/dictionary\n\nov_ini       = ${dict_dir}/openvirus20210120/amidict.ini\ncev_ini      = ${DIRS:cev_open}/dictionary/amidict.ini\n\n#\u00a0docanal_ini  = ${dict_dir}/docanal/docanal.ini # not yet added\n\n\n[PROJECTS]\nopen_battery =      ${DIRS:project_dir}/open-battery\npr_liion =          ${open_battery}/liion\ntigr2ess =          ${DIRS:project_dir}/tigr2ess\nopen_diagram =      ${DIRS:project_dir}/openDiagram\nopen_virus =        ${DIRS:project_dir}/openVirus\n\nminicorpora_ini =   ${DIRS:cev_open}/minicorpora/config.ini\ncev_searches_ini =  ${DIRS:cev_open}/searches/config.ini\nopen_diag_ini =     ${DIRS:project_dir}/openDiagram/physchem/resources/config.ini\n\n\n(base) pm286macbook:pyami pm286$ \n````\n\n##UPDATE `py4ami.ami_dict` and `py4ami.ami_pdf`\n\nThese now support `argparse` in their own right (2022-07-09)\nThey will each give an argparse of commands \n\n### `ami_pdf`\n\n(2022-07-09)\n\n```\npython -m py4ami.ami_pdf --help\nrunning PDFArgs main\nusage: ami_pdf.py [-h] [--maxpage MAXPAGE] [--indir INDIR] [--inpath INPATH] [--outdir OUTDIR]\n                  [--outform OUTFORM] [--flow FLOW] [--imagedir IMAGEDIR] [--resolution RESOLUTION]\n                  [--template TEMPLATE] [--debug {words,lines,rects,curves,images,tables,hyperlinks,annots}]\n\nPDF parsing\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --maxpage MAXPAGE     maximum number of pages\n  --indir INDIR         input directory\n  --inpath INPATH       input file\n  --outdir OUTDIR       output directory\n  --outform OUTFORM     output format\n  --flow FLOW           create flowing HTML (heuristics)\n  --imagedir IMAGEDIR   output images to imagedir\n  --resolution RESOLUTION\n                        resolution of output images (if imagedir)\n  --template TEMPLATE   file to parse specific type of document (NYI)\n  --debug 
{words,lines,rects,curves,images,tables,hyperlinks,annots}\n                        debug these during parsing (NYI)\n\n```\n\n\n## Notes for PMR?\nproject organization: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6\nuse: pkg_resources\n\n# WARNING\n...\nAmiUtil exists in both py4ami and pyamiimage. There should be a separate library\n\n\n",
    "bugtrack_url": null,
    "license": "Apache2",
    "summary": "pdf2html converter",
    "version": "0.0.6",
    "project_urls": {
        "Homepage": "https://github.com/petermr/pyamihtml"
    },
    "split_keywords": [
        "text",
        "and",
        "data",
        "mining"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7020287ce0d16fb45c09c846adad353121cd91489c3bac5155ae374ffbcd73b2",
                "md5": "e40b2cbd5e87f7982afb8b03885c9a46",
                "sha256": "ec94b047b7dab88e40770b5676201ef49a4eb79002e8191be4051d6587ca732d"
            },
            "downloads": -1,
            "filename": "pyamihtml-0.0.6.tar.gz",
            "has_sig": false,
            "md5_digest": "e40b2cbd5e87f7982afb8b03885c9a46",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 210327,
            "upload_time": "2023-07-07T09:26:11",
            "upload_time_iso_8601": "2023-07-07T09:26:11.422108Z",
            "url": "https://files.pythonhosted.org/packages/70/20/287ce0d16fb45c09c846adad353121cd91489c3bac5155ae374ffbcd73b2/pyamihtml-0.0.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-07 09:26:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "petermr",
    "github_project": "pyamihtml",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "beautifulsoup4",
            "specs": [
                [
                    "~=",
                    "4.10.0"
                ]
            ]
        },
        {
            "name": "braceexpand",
            "specs": [
                [
                    "==",
                    "0.1.7"
                ]
            ]
        },
        {
            "name": "lxml",
            "specs": [
                [
                    "~=",
                    "4.7.1"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    "~=",
                    "3.5.1"
                ]
            ]
        },
        {
            "name": "nltk",
            "specs": [
                [
                    "~=",
                    "3.6.7"
                ]
            ]
        },
        {
            "name": "pdfminer3",
            "specs": [
                [
                    "==",
                    "2018.12.3.0"
                ]
            ]
        },
        {
            "name": "Pillow",
            "specs": [
                [
                    "~=",
                    "9.1.1"
                ]
            ]
        },
        {
            "name": "psutil",
            "specs": [
                [
                    "~=",
                    "5.9.0"
                ]
            ]
        },
        {
            "name": "PyPDF2",
            "specs": [
                [
                    "==",
                    "1.26.0"
                ]
            ]
        },
        {
            "name": "python-rake",
            "specs": []
        },
        {
            "name": "setuptools",
            "specs": [
                [
                    "~=",
                    "60.3.1"
                ]
            ]
        },
        {
            "name": "SPARQLWrapper",
            "specs": [
                [
                    "==",
                    "1.8.5"
                ]
            ]
        },
        {
            "name": "tkinterhtml",
            "specs": [
                [
                    "==",
                    "0.7"
                ]
            ]
        },
        {
            "name": "tkinterweb",
            "specs": [
                [
                    "==",
                    "3.10.7"
                ]
            ]
        },
        {
            "name": "future",
            "specs": [
                [
                    "~=",
                    "0.18.2"
                ]
            ]
        },
        {
            "name": "pdfplumber",
            "specs": [
                [
                    "~=",
                    "0.7.1"
                ]
            ]
        },
        {
            "name": "configparser",
            "specs": [
                [
                    "~=",
                    "5.0.2"
                ]
            ]
        },
        {
            "name": "zlib",
            "specs": [
                [
                    "~=",
                    "1.2.11"
                ]
            ]
        },
        {
            "name": "wheel",
            "specs": [
                [
                    "~=",
                    "0.35.1"
                ]
            ]
        }
    ],
    "lcname": "pyamihtml"
}
        