Pygmars
========
https://github.com/aboutcode-org/pygmars
pygmars is a simple lexing and parsing library designed to craft lightweight
lexers and parsers using regular expressions.
pygmars allows you to craft simple lexers that recognize words based on
regular expressions and identify sequences of words using lightweight grammars
to obtain a parse tree.
The lexing task transforms a sequence of words or strings (e.g. text already
split into words) into a sequence of Token objects, assigning a label to each
word and tracking its position and line number.
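
As a conceptual sketch of this lexing step, here is a toy illustration using
plain ``re`` rather than the pygmars API (the matcher patterns and the
``WORD`` fallback label are made up for the example)::

```python
import re
from typing import NamedTuple

# A minimal Token mirroring the fields described above: a value, a label,
# and the line number and position where the word was found.
class Token(NamedTuple):
    value: str
    label: str
    start_line: int
    pos: int

# Hypothetical (label, regex) matchers; a real lexer defines its own.
MATCHERS = [
    ("KEYWORD", re.compile(r"^(for|if|while)$")),
    ("INT", re.compile(r"^\d+$")),
]

def lex(lines):
    """Assign the first matching label to each whitespace-split word."""
    tokens = []
    for line_num, line in enumerate(lines, start=1):
        for pos, word in enumerate(line.split()):
            label = next(
                (lbl for lbl, pat in MATCHERS if pat.match(word)), "WORD"
            )
            tokens.append(Token(word, label, line_num, pos))
    return tokens

print(lex(["for 42 apples"]))
```
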
In particular, the lexing output is designed to be compatible with the output
of Pygments lexers. It becomes possible to build simple grammars on top of
existing Pygments lexers to perform lightweight parsing of the many (130+)
programming languages supported by Pygments.
The parsing task transforms a sequence of Tokens into a parse Tree where each
node in the tree is recognized and assigned a label. Parsing uses regular
expression-based grammar rules to recognize Token sequences.
These rules are evaluated sequentially rather than recursively: this keeps
things simple and works very well in practice. This approach and the rule
syntax have been battle-tested in NLTK, from which pygmars is derived.

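
The sequential, non-recursive rule evaluation can be sketched as follows (a
toy illustration over hypothetical labels, not the pygmars API)::

```python
import re

# Each rule rewrites a matched run of labels into a single new label.
# Rules are tried in order, once each, over the flat label sequence:
# no recursion, no backtracking across rules.
RULES = [
    ("NUMBER_LIST", re.compile(r"(INT )+INT")),
    ("ASSIGNMENT", re.compile(r"NAME EQUAL (INT|NUMBER_LIST)")),
]

def apply_rules(labels):
    """Apply each rule once, left to right, replacing matched label runs."""
    text = " ".join(labels)
    for new_label, pattern in RULES:
        text = pattern.sub(new_label, text)
    return text.split()

print(apply_rules(["NAME", "EQUAL", "INT", "INT", "INT"]))
```

Because rule order matters, ``NUMBER_LIST`` must be recognized before the
``ASSIGNMENT`` rule that consumes it; that is the trade-off of sequential
evaluation.
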
What about the name?
-----------------------
"pygmars" is a portmanteau of Pyg-ments and Gram-mars.
Origins
--------
This library is based on heavily modified, simplified and remixed original
code from the NLTK regex POS tagger (renamed lexer) and regex chunker (renamed
parser). The original usage of NLTK was designed by @savinos to parse
copyright statements in ScanCode Toolkit.

Users
-------
pygmars is used by ScanCode Toolkit for copyright detection and for
lightweight programming language parsing.
Why pygmars?
--------------
Why create this seemingly redundant library? Why not use NLTK directly?
- NLTK has a specific focus on NLP, and lexing/tagging and parsing using
  regexes is a tiny part of its overall feature set. These are part of a rich
  set of taggers and parsers that implement a common API. We do not need these
  richer APIs, and they make evolving the API and refactoring the code
  difficult.
- In particular, NLTK POS tagging and chunking have been the engine used in
  ScanCode Toolkit copyright and author detection, and there are improvements,
  simplifications and optimizations that would be difficult to implement in
  NLTK directly and unlikely to be accepted upstream. For instance,
  simplifying the code subset used for copyright detection enabled a big boost
  in performance. Improvements to track Token lines and positions may not have
  been possible within the NLTK API.
- Newer versions of NLTK have several extra required dependencies that we do
  not need. This in turn makes every tool heavier and more complex when it
  only uses this limited NLTK subset. By stripping unused NLTK code, we get a
  small and focused library with no dependencies.
- ScanCode Toolkit also needs lightweight parsing of several programming
  languages to extract metadata (such as dependencies) from package manifests.
  Some parsers have been built by hand (such as gemfileparser), or use the
  Python ast module (for Python setup.py), or use existing Pygments lexers as
  a base. A goal of this library is to enable building lightweight parsers
  that reuse a Pygments lexer's output as input to a grammar. This is fairly
  different from NLP in terms of goals.

Theory of operations
---------------------
A ``pygmars.lex.Lexer`` creates a sequence of ``pygmars.Token`` objects
such as::

    Token(value="for", label="KEYWORD", start_line=12, pos=4)

where the label is a symbol name assigned to this token.
A Token is a terminal symbol, and the grammar is composed of rules where the
left-hand side is a label (a non-terminal symbol) and the right-hand side is a
regular expression-like pattern over labels.
See https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols
A ``pygmars.parse.Parser`` is built from a ``pygmars.parse.Grammar``, and
calling its ``parse`` function transforms a sequence of Tokens into a
``pygmars.tree.Tree`` parse tree.
The grammar is composed of Rules and loaded from a text with one rule per line
such as::

    ASSIGNMENT: {<VARNAME> <EQUAL> <STRING|INT|FLOAT>} # variable assignment

Here the left-hand side "ASSIGNMENT" label is produced when the right-hand
side sequence of Token labels "<VARNAME> <EQUAL> <STRING|INT|FLOAT>" is
matched. "# variable assignment" is kept as the description for this rule.

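
To make the matching concrete, here is a minimal sketch of how such a rule
can group tokens into a labeled subtree. This uses plain ``re`` over a label
string and tuples for tokens; it is an illustration of the idea, not the
actual pygmars implementation::

```python
import re

# Toy tokens as (value, label) pairs, labeled as in the rule above.
tokens = [("x", "VARNAME"), ("=", "EQUAL"), ("42", "INT"), (";", "SEMI")]

# The right-hand side "<VARNAME> <EQUAL> <STRING|INT|FLOAT>" maps naturally
# to a regex over the concatenated label string.
rule_label = "ASSIGNMENT"
rule_re = re.compile(r"<VARNAME> <EQUAL> <(?:STRING|INT|FLOAT)>")

def parse(tokens):
    """Group tokens whose labels match the rule into one labeled subtree."""
    label_string = " ".join(f"<{label}>" for _, label in tokens)
    match = rule_re.search(label_string)
    if not match:
        return list(tokens)
    # Map the character span of the match back to token indices by counting
    # how many "<" markers precede and fall inside the matched span.
    start = label_string[: match.start()].count("<")
    end = start + match.group().count("<")
    subtree = (rule_label, tokens[start:end])
    return list(tokens[:start]) + [subtree] + list(tokens[end:])

print(parse(tokens))
```

The result is a flat sequence where the matched run of tokens has been
replaced by a single "ASSIGNMENT" node holding its children, which is the
essence of building a parse tree from label patterns.
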
License
--------
- SPDX-License-Identifier: Apache-2.0
Based on a substantially modified subset of the Natural Language Toolkit (NLTK)
http://nltk.org/
Copyright (c) nexB Inc. and others.
Copyright (C) NLTK Project