# pyami
Semantic Reader of the Scientific Literature.
A scientific article is not a single dumb file; it can be transformed into many semantic, useful sub-documents. `pyami` is a Python framework for doing this: it reads the scientific literature in bulk, then transforms, searches and analyses the contents.
### status
This is in very active alpha development and early documentation will appear on the Wiki.
# overview
* `pyami` is a personal (client-side) tool that users can customise for their own needs and visions. It does not rely on remote service providers, though it can make its own use of Open sites.
# scope
* `pyami` can collect documents (of any sort) into a corpus or `CProject`.
* `pyami` has tools for filtering, cleaning, normalizing a corpus
* `pyami` can search this corpus by filenames, filetypes, document structure and content
* content can include metadata, text, tables, diagrams, math, chemistry, references and (in some cases) citations
* searching uses words, phrases, dictionaries, and image content (vectors and pixels)
* search results can be transformed, filtered, aggregated, and used for iterative enhancement ("snowballing")
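
As a rough illustration of dictionary-based searching, the sketch below counts how often each term of a small dictionary occurs in a few documents. It is plain Python, not the `pyami` API: `count_dictionary_hits`, the sample dictionary and the sample texts are all hypothetical.

```python
import re
from collections import Counter


def count_dictionary_hits(text, terms):
    """Count case-insensitive whole-word/phrase occurrences of each term in text."""
    counts = Counter()
    for term in terms:
        # \b anchors stop "virus" from matching inside "Antivirus"
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[term] = len(hits)
    return counts


# Hypothetical mini-corpus: document id -> text
corpus = {
    "doc1": "The virus spreads quickly. Antivirus software is unrelated to a virus.",
    "doc2": "Essential oil from Zea mays was analysed.",
}
# Hypothetical dictionary of search terms (words and phrases)
dictionary = ["virus", "essential oil"]

results = {doc_id: count_dictionary_hits(text, dictionary) for doc_id, text in corpus.items()}
for doc_id, counts in results.items():
    print(doc_id, dict(counts))
```

Terms that occur often in the hits could then be added to the dictionary for the next round, which is one simple reading of "snowballing".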
# architecture
* downloaded documents (a corpus) are stored on your disk in a folder/directory (a `CProject`)
* each document (`CTree`) in a `CProject` is a living subtree of many labelled text subsections (text, paragraphs, sentences, phrases, words) and images (`png`). This is flexible: users can add new types and subdirectories.
* the `CTree` is held on disk and can be further processed by other programs (e.g. pandas, tesseract (image2text), matplotlib)
* a commandline supports many operations for searching and transforming.
* a GUI (`ami-gui`) is layered on the commandline to help navigation, query building and visualisation.
* the commandline can be used by workflow tools such as Jupyter Notebooks
* The `pyami` code is packaged as a Python library for use by other tools
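
Because a `CProject` is just a directory of `CTree` subdirectories, it can be explored with standard tools. The sketch below builds a throwaway project in a temporary directory and lists its contents with `pathlib`. The names `article_001`, `sections` and `abstract.txt` are hypothetical; only the `fulltext.*` naming is taken from the command help in this README.

```python
import tempfile
from pathlib import Path


def make_demo_cproject(root):
    """Create a tiny CProject-like folder: one subdirectory (CTree) per document."""
    for ctree_name in ("article_001", "article_002"):
        ctree = root / ctree_name
        (ctree / "sections").mkdir(parents=True)
        (ctree / "fulltext.txt").write_text("Full text of " + ctree_name, encoding="utf-8")
        (ctree / "sections" / "abstract.txt").write_text("Abstract", encoding="utf-8")


def list_ctrees(cproject):
    """Return the CTree directories (immediate subdirectories) of a CProject."""
    return sorted(p for p in cproject.iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    cproject = Path(tmp) / "demo_project"
    cproject.mkdir()
    make_demo_cproject(cproject)

    ctree_names = [p.name for p in list_ctrees(cproject)]
    # "*" matches one directory level, "**" matches any depth
    fulltexts = sorted(p.relative_to(cproject).as_posix() for p in cproject.glob("*/fulltext.*"))
    all_txt = sorted(p.relative_to(cproject).as_posix() for p in cproject.glob("**/*.txt"))

print(ctree_names)
print(fulltexts)
print(all_txt)
```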
# components
There are several independent components. Many of these are customised for beginners. They can be referenced by symbols to avoid having to remember filenames. Users customise this with environment variables (often preset).
* project. The CProject holding the corpus. Users can have as many projects as they like.
* dictionary. Many searches use dictionaries and often several are used at once. There are currently over 50 dictionaries in a network but it's easy to create your own.
* code. (in Python3)
# commands
This is a subset of current commands (NYI=not yet implemented):
````
optional arguments:
-h, --help show this help message and exit
--apply {pdf2txt,txt2sent,xml2txt} [{pdf2txt,txt2sent,xml2txt} ...]
list of sequential transformations (1:1 map) to apply to pipeline ({self.TXT2SENT} NYI)
--assert ASSERT [ASSERT ...]
assertions; failure gives error message (prototype)
--combine COMBINE operation to combine files into final object (e.g. concat text or CSV file
--config [CONFIG [CONFIG ...]], -c [CONFIG [CONFIG ...]]
file (e.g. ~/pyami/config.ini) with list of config file(s) or config vars
--debug DEBUG [DEBUG ...]
debugging commands , numbers, (not formalised)
--demo [DEMO [DEMO ...]]
simple demos (NYI). empty gives list. May need downloading corpora
--dict DICT [DICT ...], -d DICT [DICT ...]
dictionaries to ami-search with, _help gives list
--filter FILTER [FILTER ...]
expr to filter with
--glob GLOB [GLOB ...], -g GLOB [GLOB ...]
glob files; python syntax (* and ** wildcards supported); include alternatives in {...,...}.
--languages LANGUAGES [LANGUAGES ...]
languages (NYI)
--loglevel LOGLEVEL, -l LOGLEVEL
log level (NYI)
--maxbars [MAXBARS] max bars on plot (NYI)
--nosearch search (NYI)
--outfile OUTFILE output file, normally 1. but (NYI) may track multiple input dirs (NYI)
--patt PATT [PATT ...]
patterns to search with (NYI); regex may need quoting
--plot plot params (NYI)
--proj PROJ [PROJ ...], -p PROJ [PROJ ...]
projects to search; _help will give list
--sect SECT [SECT ...], -s SECT [SECT ...]
sections to search; _help gives all(?)
--split {txt2para,xml2sect} [{txt2para,xml2sect} ...]
split fulltext.* into paras, sections
--test [{file_lib,pdf_lib,text_lib} [{file_lib,pdf_lib,text_lib} ...]]
run tests for modules; no selection runs all
````
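
The help text above has the shape produced by Python's `argparse`. As a minimal sketch of how a few of these options could be declared, assuming `argparse` is used for the main command line as well, the parser below mirrors `--apply`, `--glob`, `--proj` and `--split`. The program name `pyami-sketch` and the project name `my_project` are hypothetical, and this is not the `pyami` source.

```python
import argparse


def build_parser():
    """Build a parser for a few of the options listed above (illustrative subset)."""
    parser = argparse.ArgumentParser(prog="pyami-sketch", description="illustrative subset")
    parser.add_argument("--apply", nargs="+", choices=["pdf2txt", "txt2sent", "xml2txt"],
                        help="sequential transformations to apply")
    parser.add_argument("--glob", "-g", nargs="+", help="glob patterns for input files")
    parser.add_argument("--proj", "-p", nargs="+", help="projects to search")
    parser.add_argument("--split", nargs="+", choices=["txt2para", "xml2sect"],
                        help="split fulltext.* into paragraphs or sections")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args(
    ["--proj", "my_project", "--glob", "**/*.pdf", "--apply", "pdf2txt", "txt2sent"]
)
print(args.proj, args.glob, args.apply, args.split)
```

Options declared with `nargs="+"` collect one or more values into a list, which is why the usage lines read `--apply {...} [{...} ...]`.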
# getting started
There is an alpha series of commandline examples which show the operation of the system. Currently:
````
examples args: ['examples.py']
choose from:
gl => globbing files
pd => convert pdf to text
pa => split pdf text into paragraphs
sc => split xml into sections
sl => split oil26 project into sections
se => split text to sentences
fi => simple filter (not complete)
sp => extract species with italics and regex (not finalised)
all => all examples
````
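
To give a feel for what the `pa` and `se` examples do, here is a deliberately naive sketch of splitting text into paragraphs and sentences with regular expressions. It is not the `pyami` implementation; `split_paragraphs` and `split_sentences` are hypothetical helpers, and the sentence rule will mis-split after abbreviations such as "e.g." when a capitalised word follows.

```python
import re


def split_paragraphs(text):
    """Split on one or more blank lines; drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive split after '.', '!' or '?' followed by whitespace and a capital letter."""
    flat = " ".join(paragraph.split())  # join wrapped lines into one line
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", flat)


text = """Oils were extracted from leaves.
Yields were recorded daily.

A second batch was analysed. Results agreed!"""

paragraphs = split_paragraphs(text)
sentences = [split_sentences(p) for p in paragraphs]
for para_sentences in sentences:
    print(para_sentences)
```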
## config file
**everyone needs**
### a 'pyami' directory.
This can be anywhere, but normally it goes where you put program files and their
settings. It will be easiest if it's a direct subdirectory of your HOME directory.
It MUST include a `pyami.ini` file. By default you can use the one in the `py4ami` distribution.
### an environment variable `PYAMI_HOME`
This will point to the `pyami.ini` file.
See [./CONFIG.md](CONFIG.md)
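
A minimal sketch of how a script might locate the config file from `PYAMI_HOME` follows. The helper `find_config_file` is hypothetical. Accepting either the file itself or the directory that holds it, and falling back to `~/pyami/pyami.ini` when the variable is unset, are assumptions made for illustration, not documented `pyami` behaviour.

```python
import os
import tempfile
from pathlib import Path


def find_config_file(environ=None, filename="pyami.ini"):
    """Resolve the config file from PYAMI_HOME, else fall back to ~/pyami/<filename>."""
    environ = os.environ if environ is None else environ
    value = environ.get("PYAMI_HOME")
    if value:
        path = Path(value).expanduser()
        # Assumption: accept either the file itself or the directory holding it
        return path / filename if path.is_dir() else path
    return Path.home() / "pyami" / filename


with tempfile.TemporaryDirectory() as tmp:
    ini = Path(tmp) / "pyami.ini"
    ini.write_text("[DIRS]\nhome = ~\n", encoding="utf-8")
    from_file = find_config_file({"PYAMI_HOME": str(ini)})
    from_dir = find_config_file({"PYAMI_HOME": tmp})
    same = from_file == from_dir == ini

default_path = find_config_file({})
print(same, default_path.name)
```

Passing the environment as a dictionary keeps the function easy to test without touching the real `os.environ`.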
````
(base) pm286macbook:pyami pm286$ more ~/pyami/config.ini
; NOTE. All files use forward slash even on Windows
; use slash (/) to separate filename components, we will convert to file-separator automatically
; variables can be substituted using {}
[DIRS]
home = ~
dictionary_url = https://github.com/petermr/dictionary
project_dir = ${home}/projects
cev_open = ${DIRS:project_dir}/CEVOpen
code_dir = ${DIRS:project_dir}/openDiagram/physchem/python
; # wikidata taxon name property
; taxon_name.w = P225777
; # italic content
; all_italics.x = xpath(//p//italic/text())
; # species, e.g. Zea mays, T. rex, An. gambiae
; species.r = [A-Z][a-z]?(\.|[a-z]{2,})\s+[a-z]{3,})
[URLS]
petermr_url = https://github.com/petermr
petermr_raw_url = https://raw.githubusercontent.com/petermr
tigr2ess.u = https://github.com/petermr/tigr2ess/tree/master
[AMISEARCH]
oil3.p = ${DIRS:code_dir}/tst/proj
# wikidata taxon name property
taxon_name = P225
# italic content
all_italics.x = //p//italic/text()
# species, e.g. Zea mays, T. rex, An. gambiae
species.r = [A-Z][a-z]?(\.|[a-z]{2,})\s+[a-z]{3,}
[DICTIONARIES]
dict_dir = ${DIRS:home}/dictionary
ov_ini = ${dict_dir}/openvirus20210120/amidict.ini
cev_ini = ${DIRS:cev_open}/dictionary/amidict.ini
# docanal_ini = ${dict_dir}/docanal/docanal.ini # not yet added
[PROJECTS]
open_battery = ${DIRS:project_dir}/open-battery
pr_liion = ${open_battery}/liion
tigr2ess = ${DIRS:project_dir}/tigr2ess
open_diagram = ${DIRS:project_dir}/openDiagram
open_virus = ${DIRS:project_dir}/openVirus
minicorpora_ini = ${DIRS:cev_open}/minicorpora/config.ini
cev_searches_ini = ${DIRS:cev_open}/searches/config.ini
open_diag_ini = ${DIRS:project_dir}/openDiagram/physchem/resources/config.ini
(base) pm286macbook:pyami pm286$
````
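
The `${key}` and `${SECTION:key}` references in this file use the same syntax as `configparser.ExtendedInterpolation` in the Python standard library. Whether `pyami` reads the file this way is an assumption; the sketch below only shows that a fragment of the file resolves as expected with the standard library, and that the `species.r` regex finds the species names given in its comment. The sample sentence is made up.

```python
import configparser
import os
import re

# Fragment of the file above (raw string keeps the regex backslashes intact)
SAMPLE_INI = r"""
[DIRS]
home = ~
project_dir = ${home}/projects

[PROJECTS]
open_battery = ${DIRS:project_dir}/open-battery
pr_liion = ${open_battery}/liion

[AMISEARCH]
# species, e.g. Zea mays, T. rex, An. gambiae
species.r = [A-Z][a-z]?(\.|[a-z]{2,})\s+[a-z]{3,}
"""

config = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
config.read_string(SAMPLE_INI)

# References are resolved recursively, across sections where a section is named
pr_liion = config["PROJECTS"]["pr_liion"]
# configparser does not expand "~"; that needs a separate step
pr_liion_path = os.path.expanduser(pr_liion)

species_re = re.compile(config["AMISEARCH"]["species.r"])
text = "We compared Zea mays, T. rex and An. gambiae."
species = [m.group(0) for m in species_re.finditer(text)]

print(pr_liion)
print(species)
```

Note that the regex accepts any capitalised word followed by a lowercase word of three or more letters, so on real text it will also return non-species phrases and needs further filtering.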
## UPDATE `py4ami.ami_dict` and `py4ami.ami_pdf`
These modules now support `argparse` in their own right (2022-07-09).
Each one prints its own list of commands when run with `--help`.
### `ami_pdf`
(2022-07-09)
```
python -m py4ami.ami_pdf --help
running PDFArgs main
usage: ami_pdf.py [-h] [--maxpage MAXPAGE] [--indir INDIR] [--inpath INPATH] [--outdir OUTDIR]
[--outform OUTFORM] [--flow FLOW] [--imagedir IMAGEDIR] [--resolution RESOLUTION]
[--template TEMPLATE] [--debug {words,lines,rects,curves,images,tables,hyperlinks,annots}]
PDF parsing
optional arguments:
-h, --help show this help message and exit
--maxpage MAXPAGE maximum number of pages
--indir INDIR input directory
--inpath INPATH input file
--outdir OUTDIR output directory
--outform OUTFORM output format
--flow FLOW create flowing HTML (heuristics)
--imagedir IMAGEDIR output images to imagedir
--resolution RESOLUTION
resolution of output images (if imagedir)
--template TEMPLATE file to parse specific type of document (NYI)
--debug {words,lines,rects,curves,images,tables,hyperlinks,annots}
debug these during parsing (NYI)
```
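
When calling `ami_pdf` from a script or notebook, it can help to assemble the argument list in one place. The sketch below does this for a few of the flags shown in the help output (`--inpath`, `--outdir`, `--maxpage`, `--debug`). `build_ami_pdf_command` is a hypothetical helper, the file names are placeholders, and the command is only printed here, not executed.

```python
import sys


def build_ami_pdf_command(inpath, outdir, maxpage=None, debug=None):
    """Assemble an argument list for `python -m py4ami.ami_pdf` from flags in the help text."""
    cmd = [sys.executable, "-m", "py4ami.ami_pdf",
           "--inpath", str(inpath), "--outdir", str(outdir)]
    if maxpage is not None:
        cmd += ["--maxpage", str(maxpage)]
    if debug is not None:
        cmd += ["--debug", debug]
    return cmd


# Hypothetical file names, for illustration only
cmd = build_ami_pdf_command("paper.pdf", "out", maxpage=5)
print(" ".join(cmd[1:]))

# To run it (requires py4ami to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```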
## Notes for PMR?
project organization: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
use: pkg_resources
# WARNING
...
AmiUtil exists in both py4ami and pyamiimage. There should be a separate library
Raw data
{
"_id": null,
"home_page": "https://github.com/petermr/pyamihtml",
"name": "pyamihtmlx",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "text and data mining",
"author": "Peter Murray-Rust",
"author_email": "petermurrayrust@googlemail.com",
"download_url": "https://files.pythonhosted.org/packages/6a/31/da24d40a01cb101081ccb4b17fa56eb7b1e20fda5c1154fd88d826b32d71/pyamihtmlx-0.1.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache2",
"summary": "pdf2html converter and enhancer",
"version": "0.1.5",
"project_urls": {
"Homepage": "https://github.com/petermr/pyamihtml"
},
"split_keywords": [
"text",
"and",
"data",
"mining"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6a31da24d40a01cb101081ccb4b17fa56eb7b1e20fda5c1154fd88d826b32d71",
"md5": "6bcee874f9872054f4403b195dc9069b",
"sha256": "3f2ea9e2c9586c21dbcfc7a4e859777ab11d656b78b5a82245d0c8a677f7d292"
},
"downloads": -1,
"filename": "pyamihtmlx-0.1.5.tar.gz",
"has_sig": false,
"md5_digest": "6bcee874f9872054f4403b195dc9069b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 298366,
"upload_time": "2024-03-06T23:19:47",
"upload_time_iso_8601": "2024-03-06T23:19:47.991319Z",
"url": "https://files.pythonhosted.org/packages/6a/31/da24d40a01cb101081ccb4b17fa56eb7b1e20fda5c1154fd88d826b32d71/pyamihtmlx-0.1.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-03-06 23:19:47",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "petermr",
"github_project": "pyamihtml",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "pyamihtmlx"
}