# USFM-Grammar
A Python library that facilitates:
* Parsing and validation of USFM files using `tree-sitter-usfm3`
* Conversion of USFM files to other formats (USX, dict, list, etc.)
* Extraction of specific contents from USFM files, such as scripture alone (clean verses) or notes (footnotes, cross-references)

Built on Python 3.10.
## Installation
`pip install usfm-grammar`
This requires a C compiler. On Windows, Microsoft Visual C++ 14.0 or above is required.
It is recommended that you update `pip`, `setuptools` and `wheel`.
## Usage
### By importing library in Python code
```python
from usfm_grammar import USFMParser, Filter
# input_usfm_str = open("sample.usfm","r", encoding='utf8').read()
input_usfm_str = '''
\\id GEN
\\c 1
\\p
\\v 1 test verse
'''
my_parser = USFMParser(input_usfm_str)
errors = my_parser.errors
print(errors)
```
##### To convert to USX
```python
from lxml import etree
usx_elem = my_parser.to_usx() # default filter=ALL
print(etree.tostring(usx_elem, encoding="unicode", pretty_print=True))
```
##### To convert to Dict/USJ
```python
output = my_parser.to_usj() # default all markers
# filters out specified markers from output
# output = my_parser.to_usj(exclude_markers=['s1','h', 'toc1','toc2','mt'])
# retains only specified contents from output
# output = my_parser.to_usj(include_markers=['id', 'c', 'v'])
# use predefined marker groups instead of listing them one by one
# output = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)
# for a flattened JSON removing nesting brought in by paragraphs, lists, quotes, tables and character level markups
# output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS)
# To NOT concatenate text extracted from different markers
# output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS, combine_texts=False)
print(output)
```
To understand more about how `exclude_markers`, `include_markers`, `combine_texts` and `Filter` work, refer to the section on [filtering on USJ](#filtering-on-usj).
##### To save as json
```python
import json
dict_output = my_parser.to_usj()
with open("file_path.json", "w", encoding='utf-8') as fp:
    json.dump(dict_output, fp)
```
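Once saved, the JSON can be loaded back and inspected with plain Python. The sketch below is only illustrative: it assumes a USJ layout of nested objects carrying `type`, `marker` and `content` keys, with text stored as bare strings, and it uses a hand-written `sample_usj` dict rather than real parser output. `collect_markers` is a hypothetical helper, not part of `usfm_grammar`.

```python
import json

# Hand-written stand-in for the output of my_parser.to_usj() (assumed layout).
sample_usj = {
    "type": "USJ",
    "content": [
        {"type": "book", "marker": "id", "code": "GEN", "content": []},
        {"type": "chapter", "marker": "c", "number": "1"},
        {
            "type": "para",
            "marker": "p",
            "content": [
                {"type": "verse", "marker": "v", "number": "1"},
                "test verse",
            ],
        },
    ],
}


def collect_markers(node):
    """Return the markers found in a USJ-like node, in document order."""
    markers = []
    if isinstance(node, dict):
        if "marker" in node:
            markers.append(node["marker"])
        for child in node.get("content", []):
            markers.extend(collect_markers(child))
    return markers


# Round-trip through a JSON string, as if the file had been written and re-read.
reloaded = json.loads(json.dumps(sample_usj))
markers = collect_markers(reloaded)
print(markers)
```

A listing like this is a quick way to check which markers survived a given `include_markers` or `exclude_markers` setting.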
##### To convert to List or table like format
```python
list_output = my_parser.to_list()
# list_output = my_parser.to_list(include_markers=Filter.BCV+Filter.TEXT)
table_output = "\n".join(["\t".join(row) for row in list_output])
print(table_output)
```
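For anything beyond quick inspection, the standard `csv` module takes care of quoting tabs and newlines inside cell values, which a plain `"\t".join` does not. The rows below are made-up stand-ins for `list_output`; the real column layout produced by `to_list()` may differ.

```python
import csv
import io

# Made-up rows standing in for my_parser.to_list(); the real columns may differ.
sample_rows = [
    ["Book", "Chapter", "Verse", "Text"],
    ["GEN", "1", "1", "test verse"],
    ["GEN", "1", "2", "text with a\ttab inside"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(sample_rows)
tsv_text = buffer.getvalue()
print(tsv_text)

# Reading it back yields the same rows, including the cell with the embedded tab.
parsed_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```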
##### To convert to BibleNLP format
Bible NLP format consists of two `txt` files: the first, with verse texts, one per line and the second, with corresponding references.
```python
dict_output = my_parser.to_biblenlp_format()
#dict_output = my_parser.to_biblenlp_format(ignore_errors=True)
with open("bibleNLP.txt", "w", encoding='utf-8') as out_file1:
    out_file1.writelines(f"{verse}\n" for verse in dict_output['text'])
with open("vref.txt", "w", encoding='utf-8') as out_file2:
    out_file2.writelines(f"{ref}\n" for ref in dict_output['vref'])
```
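Because the two files are linked only by line number, it can be worth checking that they stay aligned when they are read back. The following is a minimal sketch using a made-up dictionary in place of real `to_biblenlp_format()` output; only the `text` and `vref` keys come from the example above, and the `GEN 1:1` reference style is an assumption.

```python
import tempfile
from pathlib import Path

# Made-up stand-in for my_parser.to_biblenlp_format(); the reference style is assumed.
sample_output = {
    "text": ["test verse", "another verse"],
    "vref": ["GEN 1:1", "GEN 1:2"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = Path(tmp_dir) / "bibleNLP.txt"
    vref_path = Path(tmp_dir) / "vref.txt"
    text_path.write_text(
        "".join(f"{verse}\n" for verse in sample_output["text"]), encoding="utf-8"
    )
    vref_path.write_text(
        "".join(f"{ref}\n" for ref in sample_output["vref"]), encoding="utf-8"
    )

    verses = text_path.read_text(encoding="utf-8").splitlines()
    refs = vref_path.read_text(encoding="utf-8").splitlines()

# The two files are only useful together if they have the same number of lines.
assert len(verses) == len(refs)
verse_by_ref = dict(zip(refs, verses))
print(verse_by_ref["GEN 1:1"])
```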
##### To round trip with USJ
```python
from usfm_grammar import USFMParser, Filter
my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj()
my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.usfm)
```
:warning: The generated USFM may differ from the original in three ways: (1) spaces and line breaks, (2) default attributes are written out with their names, and (3) closing markers may be newly added.
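Since whitespace is one of the things that changes in a round trip, a direct string comparison of the two USFM texts will usually fail. One possible workaround, sketched below with a hypothetical `normalize_whitespace` helper and a hand-written stand-in for `my_parser2.usfm`, is to collapse whitespace before comparing. It does not account for the other two kinds of difference (named attributes and added closing markers).

```python
import re


def normalize_whitespace(usfm_text):
    """Collapse every run of spaces and newlines into a single space."""
    return re.sub(r"\s+", " ", usfm_text).strip()


original_usfm = "\\id GEN\n\\c 1\n\\p\n\\v 1 test verse\n"
# Hand-written stand-in for my_parser2.usfm, with different line breaks and spacing.
regenerated_usfm = "\\id GEN\n\\c 1\n\\p \\v 1 test   verse"

same_after_normalizing = (
    normalize_whitespace(original_usfm) == normalize_whitespace(regenerated_usfm)
)
print(same_after_normalizing)
```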
##### To remove unwanted markers from USFM
```python
from usfm_grammar import USFMParser, Filter
my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)
my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.usfm)
```
##### USJ to USX or Table
```python
from usfm_grammar import USFMParser, Filter
my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj()
my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.to_usx())
# print(my_parser2.to_list())
```
##### USX to USFM, USJ or Table
```python
from usfm_grammar import USFMParser, Filter
from lxml import etree
test_xml_file = "sample_usx.xml"
with open(test_xml_file, 'r', encoding='utf-8') as usx_file:
    usx_str = usx_file.read()
    usx_obj = etree.fromstring(usx_str)
    my_parser = USFMParser(from_usx=usx_obj)
    print(my_parser.usfm)
    # print(my_parser.to_usj())
    # print(my_parser.to_list())
```
#### Experimental Validation and Autofix
For USFM:
```python
from usfm_grammar import Validator
wrong_USFM = "\\id GEN\n\\c 1\n\\v 1 test verse"
checker = Validator()
resp = checker.is_valid_usfm(wrong_USFM)  # True or False
print(checker.message)  # List of errors, if present
edited_USFM = checker.auto_fix_usfm(wrong_USFM)
print(checker.message)  # Report on the autofix attempt
```
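The same two calls can be wrapped to check several USFM strings in one pass. The sketch below relies only on `is_valid_usfm()` and `message`, as used above. So that it runs without the package installed, it uses `StubChecker`, a hypothetical stand-in whose validation rule and message format are invented for illustration; in real use an instance of `Validator` would be passed instead.

```python
def validate_all(usfm_by_name, checker):
    """Return {name: message} for every input the checker rejects."""
    failures = {}
    for name, usfm_text in usfm_by_name.items():
        if not checker.is_valid_usfm(usfm_text):
            failures[name] = checker.message
    return failures


class StubChecker:
    """Hypothetical stand-in for Validator; its rule is made up for this sketch."""

    def __init__(self):
        self.message = []

    def is_valid_usfm(self, usfm_text):
        ok = "\\p" in usfm_text
        self.message = [] if ok else ["no paragraph marker found"]
        return ok


inputs = {
    "good": "\\id GEN\n\\c 1\n\\p\n\\v 1 test verse",
    "bad": "\\id GEN\n\\c 1\n\\v 1 test verse",
}
failures = validate_all(inputs, StubChecker())
print(failures)
```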
### From CLI
```
usage: usfm-grammar [-h] [--in_format {usfm,usj,usx}] [--out_format {usj,table,syntax-tree,usx,markdown,usfm,bible-nlp}]
[--include_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}]
[--exclude_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}]
[--csv_col_sep CSV_COL_SEP] [--csv_row_sep CSV_ROW_SEP] [--ignore_errors] [--combine_text]
infile
Uses the tree-sitter-usfm grammar to parse and convert USFM to Syntax-tree, JSON, CSV, USX etc.
positional arguments:
infile input usfm or usj file
options:
-h, --help show this help message and exit
--in_format {usfm,usj,usx}
input file format
--out_format {usj,table,syntax-tree,usx,markdown,usfm,bible-nlp}
output format
--include_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}
the list of of contents to be included
--exclude_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}
the list of of contents to be included
--csv_col_sep CSV_COL_SEP
column separator or delimiter. Only useful with format=table.
--csv_row_sep CSV_ROW_SEP
row separator or delimiter. Only useful with format=table.
--ignore_errors to get some output from successfully parsed portions
--combine_text to be used along with exclude_markers or include_markers, to concatinate the consecutive text snippets, from different components, or not
```
Example
```bash
>>> python3 -m usfm_grammar sample.usfm --out_format usx
>>> usfm-grammar sample.usfm
>>> usfm-grammar sample.usfm --out_format usx
>>> usfm-grammar sample.usfm --include_markers bcv --include_markers text --include_markers s
>>> usfm-grammar sample-usj.json --out_format usfm
```
For the `bible-nlp` output format, two files will be generated: `<name>_biblenlp.txt` and `<name>_biblenlp_vref.txt`. For all other `out_format` options, the output is written to the console (standard output). If needed, it can be redirected to a file as follows:
```bash
>>> usfm-grammar sample.usfm --out_format usx > converted_usx.xml
```
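When the CLI is driven from a script, it can help to assemble the argument list programmatically. The sketch below uses only flags that appear in the usage text above and repeats `--include_markers` once per marker, as in the example. `build_cli_args` is a hypothetical helper, not something shipped with the package.

```python
import shlex


def build_cli_args(infile, out_format=None, include_markers=(), exclude_markers=()):
    """Build an argument list for the usfm-grammar command."""
    args = ["usfm-grammar", infile]
    if out_format is not None:
        args += ["--out_format", out_format]
    for marker in include_markers:
        args += ["--include_markers", marker]
    for marker in exclude_markers:
        args += ["--exclude_markers", marker]
    return args


args = build_cli_args("sample.usfm", out_format="usj", include_markers=["bcv", "text"])
print(shlex.join(args))
# With the package installed, the list could be passed to
# subprocess.run(args, capture_output=True, text=True, check=True).
```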
### Filtering on USJ
Filtering on USJ, the JSON output, is a feature that allows data extraction, markup cleaning etc. The arguments `exclude_markers` and `include_markers` of the method `USFMParser.to_usj()` make this possible. `USFMParser.to_list()` also accepts these inputs and performs similar operations. There are CLI versions of these arguments as well, to replicate the filtering feature there.
- *include_markers*
Optional input parameter to `to_usj()` and `to_list()` in the Python library, and also in the CLI when the output format is `usj` or `table`. Defaults to `None`. When provided, only the listed markers are included in the output. `include_markers` is applied before `exclude_markers`.
- *exclude_markers*
Optional input parameter to `to_usj()` and `to_list()` in the Python library, and also in the CLI when the output format is `usj` or `table`. Defaults to `None`. When provided, all markers except those listed are included in the output.
- *combine_texts*
Optional input parameter to `to_usj()` and `to_list()` in the Python library, and also in the CLI (as `--combine_text`) when the output format is `usj` or `table`. Defaults to `True`. After filtering out markers such as paragraphs and characters, we are left with the texts from within them, provided 'text-in-excluded-parent' is not also excluded. These text snippets may appear as separate components in the contents list. When this option is `True`, consecutive text snippets are concatenated. The concatenation is punctuation and space aware. If users need more control over space handling, or for any other reason prefer the text snippets as separate components in the output, this can be set to `False`.
- *usfm_grammar.Filter*
This class provides a set of enums for use in the `exclude_markers` and `include_markers` inputs, so that users need not list individual markers. The class has the following options:
```
BOOK_HEADERS : identification and introduction markers
TITLES : section headings and associated markers
COMMENTS : comment markers like \rem
PARAGRAPHS : paragraph markers like \p, poetry markers, list and table markers
CHARACTERS : all character level markups like \em, \w, \wj etc and their nested versions with +
NOTES : footnote, cross-reference and their content markers
STUDY_BIBLE : \esb and \cat
BCV : \id, \c and \v
TEXT : 'text-in-excluded-parent'
```
To inspect which markers each of these options contains, simply print it, for example `print(Filter.TITLES)`. The options can be used individually or concatenated to get the desired filtering of markers and data:
```python
output = my_parser.to_usj(include_markers=Filter.BCV)
output = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)
output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS)
```
- Inner contents of excluded markers
For markers like `\p`, `\q` etc., excluding them only removes them from the hierarchy; the inner contents, such as `\v` and text, are retained. But for certain other markers like `\f`, `\x`, `\esb` etc., excluding them also excludes their inner contents. The following is the set of marker groups whose inner contents are discarded if they are mentioned in `exclude_markers` or not included in `include_markers`:
```
BOOK_HEADERS, TITLES, COMMENTS, NOTES, STUDY_BIBLE
```
:warning: Generally, it is recommended NOT to use `exclude_markers` and `include_markers` together, as this could lead to unexpected behaviour and data loss. For instance, if `include_markers` has `\fk` and `exclude_markers` has `\f`, the output will not contain `\fk`, as all inner contents of `\f` will be discarded.
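The difference between unwrapping a marker and discarding it with its contents can be illustrated with a toy filter over a USJ-like list. This is not the library's implementation: the marker set, the `exclude` and `combine_texts` functions and the sample data are all hypothetical, and the text joining here is a naive single-space join rather than the punctuation-aware logic described above.

```python
# Illustrative set only; the real groups are defined by usfm_grammar.Filter.
DISCARD_WITH_CONTENTS = {"f", "x", "esb"}


def exclude(node_list, excluded):
    """Drop excluded markers from a USJ-like content list.

    Excluded containers such as p are unwrapped (children kept); excluded
    markers in DISCARD_WITH_CONTENTS are removed together with their children.
    """
    result = []
    for node in node_list:
        if isinstance(node, str):
            result.append(node)
            continue
        children = exclude(node.get("content", []), excluded)
        if node["marker"] not in excluded:
            kept = dict(node)
            if "content" in node:
                kept["content"] = children
            result.append(kept)
        elif node["marker"] not in DISCARD_WITH_CONTENTS:
            result.extend(children)  # unwrap: keep the inner contents
    return result


def combine_texts(node_list):
    """Join consecutive strings with a single space (naive join)."""
    combined = []
    for node in node_list:
        if isinstance(node, str) and combined and isinstance(combined[-1], str):
            combined[-1] = f"{combined[-1].rstrip()} {node.lstrip()}"
        else:
            combined.append(node)
    return combined


content = [
    {
        "marker": "p",
        "content": [
            {"marker": "v", "number": "1"},
            "In the beginning",
            {"marker": "f", "content": ["a footnote"]},
            {"marker": "wj", "content": ["quoted words"]},
            "and more text",
        ],
    }
]

flat = exclude(content, excluded={"p", "wj", "f"})
print(flat)
print(combine_texts(flat))
```

In this toy run the `p` and `wj` wrappers disappear while their text stays, the footnote vanishes entirely, and the three remaining strings are merged into one.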
Raw data
{
"_id": null,
"home_page": null,
"name": "usfm-grammar",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": "Kavitha Raju <kavitha.raju@bridgeconn.com>, Joel Mathew <joel@bridgeconn.com>",
"keywords": "usfm, parser, grammar, tree-sitter",
"author": null,
"author_email": "BCS Team <joel@bridgeconn.com>",
"download_url": "https://files.pythonhosted.org/packages/85/f2/23892928015c33998c7c7b166ac32e75319b32ee9afaff65773a56953e37/usfm_grammar-3.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License",
"summary": "Python parser for USFM files, based on tree-sitter-usfm3",
"version": "3.0.0",
"project_urls": {
"Homepage": "https://github.com/Bridgeconn/usfm-grammar/py-usfm-grammar#readme"
},
"split_keywords": [
"usfm",
" parser",
" grammar",
" tree-sitter"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "85f223892928015c33998c7c7b166ac32e75319b32ee9afaff65773a56953e37",
"md5": "b38d4a0457aea2839ba8e4eecb8c5387",
"sha256": "9248ffb288d8248d0700d7846c8c5650c187f48b45805902b4209bad17123752"
},
"downloads": -1,
"filename": "usfm_grammar-3.0.0.tar.gz",
"has_sig": false,
"md5_digest": "b38d4a0457aea2839ba8e4eecb8c5387",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 31544,
"upload_time": "2024-12-13T09:22:17",
"upload_time_iso_8601": "2024-12-13T09:22:17.330047Z",
"url": "https://files.pythonhosted.org/packages/85/f2/23892928015c33998c7c7b166ac32e75319b32ee9afaff65773a56953e37/usfm_grammar-3.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-13 09:22:17",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Bridgeconn",
"github_project": "usfm-grammar",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "usfm-grammar"
}