============
Parsimonious
============
Parsimonious aims to be the fastest arbitrary-lookahead parser written in pure
Python—and the most usable. It's based on parsing expression grammars (PEGs),
which means you feed it a simplified sort of EBNF notation. Parsimonious was
designed to undergird a MediaWiki parser that wouldn't take 5 seconds or a GB
of RAM to do one page, but it's applicable to all sorts of languages.
:Code: https://github.com/erikrose/parsimonious/
:Issues: https://github.com/erikrose/parsimonious/issues
:License: MIT License (MIT)
:Package: https://pypi.org/project/parsimonious/
Goals
=====
* Speed
* Frugal RAM use
* Minimalistic, understandable, idiomatic Python code
* Readable grammars
* Extensible grammars
* Complete test coverage
* Separation of concerns. Some Python parsing kits mix recognition with
instructions about how to turn the resulting tree into some kind of other
representation. This is limiting when you want to do several different things
with a tree: for example, render wiki markup to HTML *or* to text.
* Good error reporting. I want the parser to work *with* me as I develop a
grammar.
Install
=======
To install Parsimonious, run::
$ pip install parsimonious
Example Usage
=============
Here's how to build a simple grammar:
.. code:: python
>>> from parsimonious.grammar import Grammar
>>> grammar = Grammar(
... """
... bold_text = bold_open text bold_close
... text = ~"[A-Z 0-9]*"i
... bold_open = "(("
... bold_close = "))"
... """)
You can have forward references and even right recursion; it's all taken care
of by the grammar compiler. The first rule is taken to be the default start
symbol, but you can override that.
Next, let's parse something and get an abstract syntax tree:
.. code:: python
>>> print(grammar.parse('((bold stuff))'))
<Node called "bold_text" matching "((bold stuff))">
<Node called "bold_open" matching "((">
<RegexNode called "text" matching "bold stuff">
<Node called "bold_close" matching "))">
You'd typically then use a ``nodes.NodeVisitor`` subclass (see below) to walk
the tree and do something useful with it.
Another example would be to implement a parser for ``.ini``-files. Consider the following:
.. code:: python
grammar = Grammar(
r"""
expr = (entry / emptyline)*
entry = section pair*
section = lpar word rpar ws
pair = key equal value ws?
key = word+
value = (word / quoted)+
word = ~r"[-\w]+"
quoted = ~'"[^\"]+"'
equal = ws? "=" ws?
lpar = "["
rpar = "]"
ws = ~"\s*"
emptyline = ws+
"""
)
We could now implement a subclass of ``NodeVisitor`` like so:
.. code:: python
class IniVisitor(NodeVisitor):
def visit_expr(self, node, visited_children):
""" Returns the overall output. """
output = {}
for child in visited_children:
output.update(child[0])
return output
def visit_entry(self, node, visited_children):
""" Makes a dict of the section (as key) and the key/value pairs. """
key, values = visited_children
return {key: dict(values)}
def visit_section(self, node, visited_children):
""" Gets the section name. """
_, section, *_ = visited_children
return section.text
def visit_pair(self, node, visited_children):
""" Gets each key/value pair, returns a tuple. """
key, _, value, *_ = node.children
return key.text, value.text
def generic_visit(self, node, visited_children):
""" The generic visit method. """
return visited_children or node
And call it like that:
.. code:: python
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
data = """[section]
somekey = somevalue
someotherkey=someothervalue
[anothersection]
key123 = "what the heck?"
key456="yet another one here"
"""
tree = grammar.parse(data)
iv = IniVisitor()
output = iv.visit(tree)
print(output)
This would yield
.. code:: python
{'section': {'somekey': 'somevalue', 'someotherkey': 'someothervalue'}, 'anothersection': {'key123': '"what the heck?"', 'key456': '"yet another one here"'}}
Status
======
* Everything that exists works. Test coverage is good.
* I don't plan on making any backward-incompatible changes to the rule syntax
in the future, so you can write grammars with confidence.
* It may be slow and use a lot of RAM; I haven't measured either yet. However,
I have yet to begin optimizing in earnest.
* Error reporting is now in place. ``repr`` methods of expressions, grammars,
and nodes are clear and helpful as well. The ``Grammar`` ones are
even round-trippable!
* The grammar extensibility story is underdeveloped at the moment. You should
be able to extend a grammar by simply concatenating more rules onto the
existing ones; later rules of the same name should override previous ones.
However, this is untested and may not be the final story.
* Sphinx docs are coming, but the docstrings are quite useful now.
* Note that there may be API changes until we get to 1.0, so be sure to pin to
the version you're using.
Coming Soon
-----------
* Optimizations to make Parsimonious worthy of its name
* Tighter RAM use
* Better-thought-out grammar extensibility story
* Amazing grammar debugging
A Little About PEG Parsers
==========================
PEG parsers don't draw a distinction between lexing and parsing; everything is
done at once. As a result, there is no lookahead limit, as there is with, for
instance, Yacc. And, due to both of these properties, PEG grammars are easier
to write: they're basically just a more practical dialect of EBNF. With
caching, they take O(grammar size * text length) memory (though I plan to do
better), but they run in O(text length) time.
More Technically
----------------
PEGs can describe a superset of *LL(k)* languages, any deterministic *LR(k)*
language, and many others—including some that aren't context-free
(http://www.brynosaurus.com/pub/lang/peg.pdf). They can also deal with what
would be ambiguous languages if described in canonical EBNF. They do this by
trading the ``|`` alternation operator for the ``/`` operator, which works the
same except that it makes priority explicit: ``a / b / c`` first tries matching
``a``. If that fails, it tries ``b``, and, failing that, moves on to ``c``.
Thus, ambiguity is resolved by always yielding the first successful recognition.
Writing Grammars
================
Grammars are defined by a series of rules. The syntax should be familiar to
anyone who uses regexes or reads programming language manuals. An example will
serve best:
.. code:: python
my_grammar = Grammar(r"""
styled_text = bold_text / italic_text
bold_text = "((" text "))"
italic_text = "''" text "''"
text = ~"[A-Z 0-9]*"i
""")
You can wrap a rule across multiple lines if you like; the syntax is very
forgiving.
Syntax Reference
----------------
==================== ========================================================
``"some literal"`` Used to quote literals. Backslash escaping and Python
conventions for "raw" and Unicode strings help support
fiddly characters.
``b"some literal"`` A bytes literal. Using bytes literals and regular
expressions allows your grammar to parse binary files.
Note that all literals and regular expressions must be
of the same type within a grammar. In grammars that
process bytestrings, you should make the grammar string
an ``r"""string"""`` so that byte literals like ``\xff``
work correctly.
[space] Sequences are made out of space- or tab-delimited
things. ``a b c`` matches spots where those 3
terms appear in that order.
``a / b / c`` Alternatives. The first to succeed of ``a / b / c``
wins.
``thing?`` An optional expression. This is greedy, always consuming
``thing`` if it exists.
``&thing`` A lookahead assertion. Ensures ``thing`` matches at the
current position but does not consume it.
``!thing`` A negative lookahead assertion. Matches if ``thing``
isn't found here. Doesn't consume any text.
``things*`` Zero or more things. This is greedy, always consuming as
many repetitions as it can.
``things+`` One or more things. This is greedy, always consuming as
many repetitions as it can.
``~r"regex"ilmsuxa`` Regexes have ``~`` in front and are quoted like
literals. Any flags_ (``asilmx``) follow the end quotes
as single chars. Regexes are good for representing
character classes (``[a-z0-9]``) and optimizing for
speed. The downside is that they won't be able to take
advantage of our fancy debugging, once we get that
working. Ultimately, I'd like to deprecate explicit
regexes and instead have Parsimonious dynamically build
them out of simpler primitives. Parsimonious uses the
regex_ library instead of the built-in re module.
``~br"regex"`` A bytes regex; required if your grammar parses
bytestrings.
``(things)`` Parentheses are used for grouping, like in every other
language.
``thing{n}`` Exactly ``n`` repetitions of ``thing``.
``thing{n,m}`` Between ``n`` and ``m`` repititions (inclusive.)
``thing{,m}`` At most ``m`` repetitions of ``thing``.
``thing{n,}`` At least ``n`` repetitions of ``thing``.
==================== ========================================================
.. _flags: https://docs.python.org/3/howto/regex.html#compilation
.. _regex: https://github.com/mrabarnett/mrab-regex
Optimizing Grammars
===================
Don't Repeat Expressions
------------------------
If you need a ``~"[a-z0-9]"i`` at two points in your grammar, don't type it
twice. Make it a rule of its own, and reference it from wherever you need it.
You'll get the most out of the caching this way, since cache lookups are by
expression object identity (for speed).
Even if you have an expression that's very simple, not repeating it will
save RAM, as there can, at worst, be a cached int for every char in the text
you're parsing. In the future, we may identify repeated subexpressions
automatically and factor them up while building the grammar.
How much should you shove into one regex, versus how much should you break them
up to not repeat yourself? That's a fine balance and worthy of benchmarking.
More stuff jammed into a regex will execute faster, because it doesn't have to
run any Python between pieces, but a broken-up one will give better cache
performance if the individual pieces are re-used elsewhere. If the pieces of a
regex aren't used anywhere else, by all means keep the whole thing together.
Quantifiers
-----------
Bring your ``?`` and ``*`` quantifiers up to the highest level you
can. Otherwise, lower-level patterns could succeed but be empty and put a bunch
of useless nodes in your tree that didn't really match anything.
Processing Parse Trees
======================
A parse tree has a node for each expression matched, even if it matched a
zero-length string, like ``"thing"?`` might.
The ``NodeVisitor`` class provides an inversion-of-control framework for
walking a tree and returning a new construct (tree, string, or whatever) based
on it. For now, have a look at its docstrings for more detail. There's also a
good example in ``grammar.RuleVisitor``. Notice how we take advantage of nodes'
iterability by using tuple unpacks in the formal parameter lists:
.. code:: python
def visit_or_term(self, or_term, (slash, _, term)):
...
For reference, here is the production the above unpacks::
or_term = "/" _ term
When something goes wrong in your visitor, you get a nice error like this::
[normal traceback here...]
VisitationException: 'Node' object has no attribute 'foo'
Parse tree:
<Node called "rules" matching "number = ~"[0-9]+""> <-- *** We were here. ***
<Node matching "number = ~"[0-9]+"">
<Node called "rule" matching "number = ~"[0-9]+"">
<Node matching "">
<Node called "label" matching "number">
<Node matching " ">
<Node called "_" matching " ">
<Node matching "=">
<Node matching " ">
<Node called "_" matching " ">
<Node called "rhs" matching "~"[0-9]+"">
<Node called "term" matching "~"[0-9]+"">
<Node called "atom" matching "~"[0-9]+"">
<Node called "regex" matching "~"[0-9]+"">
<Node matching "~">
<Node called "literal" matching ""[0-9]+"">
<Node matching "">
<Node matching "">
<Node called "eol" matching "
">
<Node matching "">
The parse tree is tacked onto the exception, and the node whose visitor method
raised the error is pointed out.
Why No Streaming Tree Processing?
---------------------------------
Some have asked why we don't process the tree as we go, SAX-style. There are
two main reasons:
1. It wouldn't work. With a PEG parser, no parsing decision is final until the
whole text is parsed. If we had to change a decision, we'd have to backtrack
and redo the SAX-style interpretation as well, which would involve
reconstituting part of the AST and quite possibly scuttling whatever you
were doing with the streaming output. (Note that some bursty SAX-style
processing may be possible in the future if we use cuts.)
2. It interferes with the ability to derive multiple representations from the
AST: for example, turning wiki markup into first HTML and then text.
Future Directions
=================
Rule Syntax Changes
-------------------
* Maybe support left-recursive rules like PyMeta, if anybody cares.
* Ultimately, I'd like to get rid of explicit regexes and break them into more
atomic things like character classes. Then we can dynamically compile bits
of the grammar into regexes as necessary to boost speed.
Optimizations
-------------
* Make RAM use almost constant by automatically inserting "cuts", as described
in
http://ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste513-mizushima.pdf.
This would also improve error reporting, as we wouldn't backtrack out of
everything informative before finally failing.
* Find all the distinct subexpressions, and unify duplicates for a better cache
hit ratio.
* Think about having the user (optionally) provide some representative input
along with a grammar. We can then profile against it, see which expressions
are worth caching, and annotate the grammar. Perhaps there will even be
positions at which a given expression is more worth caching. Or we could keep
a count of how many times each cache entry has been used and evict the most
useless ones as RAM use grows.
* We could possibly compile the grammar into VM instructions, like in "A
parsing machine for PEGs" by Medeiros.
* If the recursion gets too deep in practice, use trampolining to dodge it.
Niceties
--------
* Pijnu has a raft of tree manipulators. I don't think I want all of them, but
a judicious subset might be nice. Don't get into mixing formatting with tree
manipulation.
https://github.com/erikrose/pijnu/blob/master/library/node.py#L333. PyPy's
parsing lib exposes a sane subset:
http://doc.pypy.org/en/latest/rlib.html#tree-transformations.
Version History
===============
(Next release)
* ...
0.10.0
* Fix infinite recursion in __eq__ in some cases. (FelisNivalis)
* Improve error message in left-recursive rules. (lucaswiman)
* Add support for range ``{min,max}`` repetition expressions (righthandabacus)
* Fix bug in ``*`` and ``+`` for token grammars (lucaswiman)
* Add support for grammars on bytestrings (lucaswiman)
* Fix LazyReference resolution bug #134 (righthandabacus)
* ~15% speedup on benchmarks with a faster node cache (ethframe)
.. warning::
This release makes backward-incompatible changes:
* Fix precedence of string literal modifiers ``u/r/b``.
This will break grammars with no spaces between a
reference and a string literal. (lucaswiman)
0.9.0
* Add support for Python 3.7, 3.8, 3.9, 3.10 (righthandabacus, Lonnen)
* Drop support for Python 2.x, 3.3, 3.4 (righthandabacus, Lonnen)
* Remove six and go all in on Python 3 idioms (Lonnen)
* Replace re with regex for improved handling of unicode characters
in regexes (Oderjunkie)
* Dropped nose for unittest (swayson)
* `Grammar.__repr__()` now correctly escapes backslashes (ingolemo)
* Custom rules can now be class methods in addition to
functions (James Addison)
* Make the ascii flag available in the regex syntax (Roman Inflianskas)
0.8.1
* Switch to a function-style ``print`` in the benchmark tests so we work
cleanly as a dependency on Python 3. (Edward Betts)
0.8.0
* Make Grammar iteration ordered, making the ``__repr__`` more like the
original input. (Lucas Wiman)
* Improve text representation and error messages for anonymous
subexpressions. (Lucas Wiman)
* Expose BadGrammar and VisitationError as top-level imports.
* No longer crash when you try to compare a Node to an instance of a
different class. (Esben Sonne)
* Pin ``six`` at 1.9.0 to ensure we have ``python_2_unicode_compatible``.
(Sam Raker)
* Drop Python 2.6 support.
0.7.0
* Add experimental token-based parsing, via TokenGrammar class, for those
operating on pre-lexed streams of tokens. This can, for example, help parse
indentation-sensitive languages that use the "off-side rule", like Python.
(Erik Rose)
* Common codebase for Python 2 and 3: no more 2to3 translation step (Mattias
Urlichs, Lucas Wiman)
* Drop Python 3.1 and 3.2 support.
* Fix a bug in ``Grammar.__repr__`` which fails to work on Python 3 since the
string_escape codec is gone in Python 3. (Lucas Wiman)
* Don't lose parentheses when printing representations of expressions.
(Michael Kelly)
* Make Grammar an immutable mapping (until we add automatic recompilation).
(Michael Kelly)
0.6.2
* Make grammar compilation 100x faster. Thanks to dmoisset for the initial
patch.
0.6.1
* Fix bug which made the default rule of a grammar invalid when it
contained a forward reference.
0.6
.. warning::
This release makes backward-incompatible changes:
* The ``default_rule`` arg to Grammar's constructor has been replaced
with a method, ``some_grammar.default('rule_name')``, which returns a
new grammar just like the old except with its default rule changed.
This is to free up the constructor kwargs for custom rules.
* ``UndefinedLabel`` is no longer a subclass of ``VisitationError``. This
matters only in the unlikely case that you were catching
``VisitationError`` exceptions and expecting to thus also catch
``UndefinedLabel``.
* Add support for "custom rules" in Grammars. These provide a hook for simple
custom parsing hooks spelled as Python lambdas. For heavy-duty needs,
you can put in Compound Expressions with LazyReferences as subexpressions,
and the Grammar will hook them up for optimal efficiency--no calling
``__getitem__`` on Grammar at parse time.
* Allow grammars without a default rule (in cases where there are no string
rules), which leads to also allowing empty grammars. Perhaps someone
building up grammars dynamically will find that useful.
* Add ``@rule`` decorator, allowing grammars to be constructed out of
notations on ``NodeVisitor`` methods. This saves looking back and forth
between the visitor and the grammar when there is only one visitor per
grammar.
* Add ``parse()`` and ``match()`` convenience methods to ``NodeVisitor``.
This makes the common case of parsing a string and applying exactly one
visitor to the AST shorter and simpler.
* Improve exception message when you forget to declare a visitor method.
* Add ``unwrapped_exceptions`` attribute to ``NodeVisitor``, letting you
name certain exceptions which propagate out of visitors without being
wrapped by ``VisitationError`` exceptions.
* Expose much more of the library in ``__init__``, making your imports
shorter.
* Drastically simplify reference resolution machinery. (Vladimir Keleshev)
0.5
.. warning::
This release makes some backward-incompatible changes. See below.
* Add alpha-quality error reporting. Now, rather than returning ``None``,
``parse()`` and ``match()`` raise ``ParseError`` if they don't succeed.
This makes more sense, since you'd rarely attempt to parse something and
not care if it succeeds. It was too easy before to forget to check for a
``None`` result. ``ParseError`` gives you a human-readable unicode
representation as well as some attributes that let you construct your own
custom presentation.
* Grammar construction now raises ``ParseError`` rather than ``BadGrammar``
if it can't parse your rules.
* ``parse()`` now takes an optional ``pos`` argument, like ``match()``.
* Make the ``_str__()`` method of ``UndefinedLabel`` return the right type.
* Support splitting rules across multiple lines, interleaving comments,
putting multiple rules on one line (but don't do that) and all sorts of
other horrific behavior.
* Tolerate whitespace after opening parens.
* Add support for single-quoted literals.
0.4
* Support Python 3.
* Fix ``import *`` for ``parsimonious.expressions``.
* Rewrite grammar compiler so right-recursive rules can be compiled and
parsing no longer fails in some cases with forward rule references.
0.3
* Support comments, the ``!`` ("not") operator, and parentheses in grammar
definition syntax.
* Change the ``&`` operator to a prefix operator to conform to the original
PEG syntax. The version in Parsing Techniques was infix, and that's what I
used as a reference. However, the unary version is more convenient, as it
lets you spell ``AB & A`` as simply ``A &B``.
* Take the ``print`` statements out of the benchmark tests.
* Give Node an evaluate-able ``__repr__``.
0.2
* Support matching of prefixes and other not-to-the-end slices of strings by
making ``match()`` public and able to initialize a new cache. Add
``match()`` callthrough method to ``Grammar``.
* Report a ``BadGrammar`` exception (rather than crashing) when there are
mistakes in a grammar definition.
* Simplify grammar compilation internals: get rid of superfluous visitor
methods and factor up repetitive ones. Simplify rule grammar as well.
* Add ``NodeVisitor.lift_child`` convenience method.
* Rename ``VisitationException`` to ``VisitationError`` for consistency with
the standard Python exception hierarchy.
* Rework ``repr`` and ``str`` values for grammars and expressions. Now they
both look like rule syntax. Grammars are even round-trippable! This fixes a
unicode encoding error when printing nodes that had parsed unicode text.
* Add tox for testing. Stop advertising Python 2.5 support, which never
worked (and won't unless somebody cares a lot, since it makes Python 3
support harder).
* Settle (hopefully) on the term "rule" to mean "the string representation of
a production". Get rid of the vague, mysterious "DSL".
0.1
* A rough but useable preview release
Thanks to Wiki Loves Monuments Panama for showing their support with a generous
gift.
Raw data
{
"_id": null,
"home_page": "https://github.com/erikrose/parsimonious",
"name": "parsimonious",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "parse,parser,parsing,peg,packrat,grammar,language",
"author": "Erik Rose",
"author_email": "erikrose@grinchcentral.com",
"download_url": "https://files.pythonhosted.org/packages/7b/91/abdc50c4ef06fdf8d047f60ee777ca9b2a7885e1a9cea81343fbecda52d7/parsimonious-0.10.0.tar.gz",
"platform": null,
"description": "============\nParsimonious\n============\n\nParsimonious aims to be the fastest arbitrary-lookahead parser written in pure\nPython\u2014and the most usable. It's based on parsing expression grammars (PEGs),\nwhich means you feed it a simplified sort of EBNF notation. Parsimonious was\ndesigned to undergird a MediaWiki parser that wouldn't take 5 seconds or a GB\nof RAM to do one page, but it's applicable to all sorts of languages.\n\n:Code: https://github.com/erikrose/parsimonious/\n:Issues: https://github.com/erikrose/parsimonious/issues\n:License: MIT License (MIT)\n:Package: https://pypi.org/project/parsimonious/\n\n\nGoals\n=====\n\n* Speed\n* Frugal RAM use\n* Minimalistic, understandable, idiomatic Python code\n* Readable grammars\n* Extensible grammars\n* Complete test coverage\n* Separation of concerns. Some Python parsing kits mix recognition with\n instructions about how to turn the resulting tree into some kind of other\n representation. This is limiting when you want to do several different things\n with a tree: for example, render wiki markup to HTML *or* to text.\n* Good error reporting. I want the parser to work *with* me as I develop a\n grammar.\n\n\nInstall\n=======\n\nTo install Parsimonious, run::\n\n $ pip install parsimonious\n\n\nExample Usage\n=============\n\nHere's how to build a simple grammar:\n\n.. code:: python\n\n >>> from parsimonious.grammar import Grammar\n >>> grammar = Grammar(\n ... \"\"\"\n ... bold_text = bold_open text bold_close\n ... text = ~\"[A-Z 0-9]*\"i\n ... bold_open = \"((\"\n ... bold_close = \"))\"\n ... \"\"\")\n\nYou can have forward references and even right recursion; it's all taken care\nof by the grammar compiler. The first rule is taken to be the default start\nsymbol, but you can override that.\n\nNext, let's parse something and get an abstract syntax tree:\n\n.. code:: python\n\n >>> print(grammar.parse('((bold stuff))'))\n <Node called \"bold_text\" matching \"((bold stuff))\">\n <Node called \"bold_open\" matching \"((\">\n <RegexNode called \"text\" matching \"bold stuff\">\n <Node called \"bold_close\" matching \"))\">\n\nYou'd typically then use a ``nodes.NodeVisitor`` subclass (see below) to walk\nthe tree and do something useful with it.\n\nAnother example would be to implement a parser for ``.ini``-files. Consider the following:\n\n.. code:: python\n\n grammar = Grammar(\n r\"\"\"\n expr = (entry / emptyline)*\n entry = section pair*\n\n section = lpar word rpar ws\n pair = key equal value ws?\n\n key = word+\n value = (word / quoted)+\n word = ~r\"[-\\w]+\"\n quoted = ~'\"[^\\\"]+\"'\n equal = ws? \"=\" ws?\n lpar = \"[\"\n rpar = \"]\"\n ws = ~\"\\s*\"\n emptyline = ws+\n \"\"\"\n )\n\n\nWe could now implement a subclass of ``NodeVisitor`` like so:\n\n.. code:: python\n\n class IniVisitor(NodeVisitor):\n def visit_expr(self, node, visited_children):\n \"\"\" Returns the overall output. \"\"\"\n output = {}\n for child in visited_children:\n output.update(child[0])\n return output\n\n def visit_entry(self, node, visited_children):\n \"\"\" Makes a dict of the section (as key) and the key/value pairs. \"\"\"\n key, values = visited_children\n return {key: dict(values)}\n\n def visit_section(self, node, visited_children):\n \"\"\" Gets the section name. \"\"\"\n _, section, *_ = visited_children\n return section.text\n\n def visit_pair(self, node, visited_children):\n \"\"\" Gets each key/value pair, returns a tuple. \"\"\"\n key, _, value, *_ = node.children\n return key.text, value.text\n\n def generic_visit(self, node, visited_children):\n \"\"\" The generic visit method. \"\"\"\n return visited_children or node\n\nAnd call it like that:\n\n.. code:: python\n\n from parsimonious.grammar import Grammar\n from parsimonious.nodes import NodeVisitor\n\n data = \"\"\"[section]\n somekey = somevalue\n someotherkey=someothervalue\n\n [anothersection]\n key123 = \"what the heck?\"\n key456=\"yet another one here\"\n\n \"\"\"\n\n tree = grammar.parse(data)\n\n iv = IniVisitor()\n output = iv.visit(tree)\n print(output)\n\nThis would yield\n\n.. code:: python\n\n {'section': {'somekey': 'somevalue', 'someotherkey': 'someothervalue'}, 'anothersection': {'key123': '\"what the heck?\"', 'key456': '\"yet another one here\"'}}\n\nStatus\n======\n\n* Everything that exists works. Test coverage is good.\n* I don't plan on making any backward-incompatible changes to the rule syntax\n in the future, so you can write grammars with confidence.\n* It may be slow and use a lot of RAM; I haven't measured either yet. However,\n I have yet to begin optimizing in earnest.\n* Error reporting is now in place. ``repr`` methods of expressions, grammars,\n and nodes are clear and helpful as well. The ``Grammar`` ones are\n even round-trippable!\n* The grammar extensibility story is underdeveloped at the moment. You should\n be able to extend a grammar by simply concatenating more rules onto the\n existing ones; later rules of the same name should override previous ones.\n However, this is untested and may not be the final story.\n* Sphinx docs are coming, but the docstrings are quite useful now.\n* Note that there may be API changes until we get to 1.0, so be sure to pin to\n the version you're using.\n\nComing Soon\n-----------\n\n* Optimizations to make Parsimonious worthy of its name\n* Tighter RAM use\n* Better-thought-out grammar extensibility story\n* Amazing grammar debugging\n\n\nA Little About PEG Parsers\n==========================\n\nPEG parsers don't draw a distinction between lexing and parsing; everything is\ndone at once. As a result, there is no lookahead limit, as there is with, for\ninstance, Yacc. And, due to both of these properties, PEG grammars are easier\nto write: they're basically just a more practical dialect of EBNF. With\ncaching, they take O(grammar size * text length) memory (though I plan to do\nbetter), but they run in O(text length) time.\n\nMore Technically\n----------------\n\nPEGs can describe a superset of *LL(k)* languages, any deterministic *LR(k)*\nlanguage, and many others\u2014including some that aren't context-free\n(http://www.brynosaurus.com/pub/lang/peg.pdf). They can also deal with what\nwould be ambiguous languages if described in canonical EBNF. They do this by\ntrading the ``|`` alternation operator for the ``/`` operator, which works the\nsame except that it makes priority explicit: ``a / b / c`` first tries matching\n``a``. If that fails, it tries ``b``, and, failing that, moves on to ``c``.\nThus, ambiguity is resolved by always yielding the first successful recognition.\n\n\nWriting Grammars\n================\n\nGrammars are defined by a series of rules. The syntax should be familiar to\nanyone who uses regexes or reads programming language manuals. An example will\nserve best:\n\n.. code:: python\n\n my_grammar = Grammar(r\"\"\"\n styled_text = bold_text / italic_text\n bold_text = \"((\" text \"))\"\n italic_text = \"''\" text \"''\"\n text = ~\"[A-Z 0-9]*\"i\n \"\"\")\n\nYou can wrap a rule across multiple lines if you like; the syntax is very\nforgiving.\n\n\nSyntax Reference\n----------------\n\n==================== ========================================================\n``\"some literal\"`` Used to quote literals. Backslash escaping and Python\n conventions for \"raw\" and Unicode strings help support\n fiddly characters.\n\n``b\"some literal\"`` A bytes literal. Using bytes literals and regular\n expressions allows your grammar to parse binary files.\n Note that all literals and regular expressions must be\n of the same type within a grammar. In grammars that\n process bytestrings, you should make the grammar string\n an ``r\"\"\"string\"\"\"`` so that byte literals like ``\\xff``\n work correctly.\n\n[space] Sequences are made out of space- or tab-delimited\n things. ``a b c`` matches spots where those 3\n terms appear in that order.\n\n``a / b / c`` Alternatives. The first to succeed of ``a / b / c``\n wins.\n\n``thing?`` An optional expression. This is greedy, always consuming\n ``thing`` if it exists.\n\n``&thing`` A lookahead assertion. Ensures ``thing`` matches at the\n current position but does not consume it.\n\n``!thing`` A negative lookahead assertion. Matches if ``thing``\n isn't found here. Doesn't consume any text.\n\n``things*`` Zero or more things. This is greedy, always consuming as\n many repetitions as it can.\n\n``things+`` One or more things. This is greedy, always consuming as\n many repetitions as it can.\n\n``~r\"regex\"ilmsuxa`` Regexes have ``~`` in front and are quoted like\n literals. Any flags_ (``asilmx``) follow the end quotes\n as single chars. Regexes are good for representing\n character classes (``[a-z0-9]``) and optimizing for\n speed. The downside is that they won't be able to take\n advantage of our fancy debugging, once we get that\n working. Ultimately, I'd like to deprecate explicit\n regexes and instead have Parsimonious dynamically build\n them out of simpler primitives. Parsimonious uses the\n regex_ library instead of the built-in re module.\n\n``~br\"regex\"`` A bytes regex; required if your grammar parses\n bytestrings.\n\n``(things)`` Parentheses are used for grouping, like in every other\n language.\n\n``thing{n}`` Exactly ``n`` repetitions of ``thing``.\n\n``thing{n,m}`` Between ``n`` and ``m`` repititions (inclusive.)\n\n``thing{,m}`` At most ``m`` repetitions of ``thing``.\n\n``thing{n,}`` At least ``n`` repetitions of ``thing``.\n\n==================== ========================================================\n\n.. _flags: https://docs.python.org/3/howto/regex.html#compilation\n.. _regex: https://github.com/mrabarnett/mrab-regex\n\nOptimizing Grammars\n===================\n\nDon't Repeat Expressions\n------------------------\n\nIf you need a ``~\"[a-z0-9]\"i`` at two points in your grammar, don't type it\ntwice. Make it a rule of its own, and reference it from wherever you need it.\nYou'll get the most out of the caching this way, since cache lookups are by\nexpression object identity (for speed).\n\nEven if you have an expression that's very simple, not repeating it will\nsave RAM, as there can, at worst, be a cached int for every char in the text\nyou're parsing. In the future, we may identify repeated subexpressions\nautomatically and factor them up while building the grammar.\n\nHow much should you shove into one regex, versus how much should you break them\nup to not repeat yourself? That's a fine balance and worthy of benchmarking.\nMore stuff jammed into a regex will execute faster, because it doesn't have to\nrun any Python between pieces, but a broken-up one will give better cache\nperformance if the individual pieces are re-used elsewhere. If the pieces of a\nregex aren't used anywhere else, by all means keep the whole thing together.\n\n\nQuantifiers\n-----------\n\nBring your ``?`` and ``*`` quantifiers up to the highest level you\ncan. Otherwise, lower-level patterns could succeed but be empty and put a bunch\nof useless nodes in your tree that didn't really match anything.\n\n\nProcessing Parse Trees\n======================\n\nA parse tree has a node for each expression matched, even if it matched a\nzero-length string, like ``\"thing\"?`` might.\n\nThe ``NodeVisitor`` class provides an inversion-of-control framework for\nwalking a tree and returning a new construct (tree, string, or whatever) based\non it. For now, have a look at its docstrings for more detail. There's also a\ngood example in ``grammar.RuleVisitor``. Notice how we take advantage of nodes'\niterability by using tuple unpacks in the formal parameter lists:\n\n.. code:: python\n\n def visit_or_term(self, or_term, (slash, _, term)):\n ...\n\nFor reference, here is the production the above unpacks::\n\n or_term = \"/\" _ term\n\nWhen something goes wrong in your visitor, you get a nice error like this::\n\n [normal traceback here...]\n VisitationException: 'Node' object has no attribute 'foo'\n\n Parse tree:\n <Node called \"rules\" matching \"number = ~\"[0-9]+\"\"> <-- *** We were here. ***\n <Node matching \"number = ~\"[0-9]+\"\">\n <Node called \"rule\" matching \"number = ~\"[0-9]+\"\">\n <Node matching \"\">\n <Node called \"label\" matching \"number\">\n <Node matching \" \">\n <Node called \"_\" matching \" \">\n <Node matching \"=\">\n <Node matching \" \">\n <Node called \"_\" matching \" \">\n <Node called \"rhs\" matching \"~\"[0-9]+\"\">\n <Node called \"term\" matching \"~\"[0-9]+\"\">\n <Node called \"atom\" matching \"~\"[0-9]+\"\">\n <Node called \"regex\" matching \"~\"[0-9]+\"\">\n <Node matching \"~\">\n <Node called \"literal\" matching \"\"[0-9]+\"\">\n <Node matching \"\">\n <Node matching \"\">\n <Node called \"eol\" matching \"\n \">\n <Node matching \"\">\n\nThe parse tree is tacked onto the exception, and the node whose visitor method\nraised the error is pointed out.\n\nWhy No Streaming Tree Processing?\n---------------------------------\n\nSome have asked why we don't process the tree as we go, SAX-style. There are\ntwo main reasons:\n\n1. It wouldn't work. With a PEG parser, no parsing decision is final until the\n whole text is parsed. If we had to change a decision, we'd have to backtrack\n and redo the SAX-style interpretation as well, which would involve\n reconstituting part of the AST and quite possibly scuttling whatever you\n were doing with the streaming output. (Note that some bursty SAX-style\n processing may be possible in the future if we use cuts.)\n\n2. It interferes with the ability to derive multiple representations from the\n AST: for example, turning wiki markup into first HTML and then text.\n\n\nFuture Directions\n=================\n\nRule Syntax Changes\n-------------------\n\n* Maybe support left-recursive rules like PyMeta, if anybody cares.\n* Ultimately, I'd like to get rid of explicit regexes and break them into more\n atomic things like character classes. Then we can dynamically compile bits\n of the grammar into regexes as necessary to boost speed.\n\nOptimizations\n-------------\n\n* Make RAM use almost constant by automatically inserting \"cuts\", as described\n in\n http://ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste513-mizushima.pdf.\n This would also improve error reporting, as we wouldn't backtrack out of\n everything informative before finally failing.\n* Find all the distinct subexpressions, and unify duplicates for a better cache\n hit ratio.\n* Think about having the user (optionally) provide some representative input\n along with a grammar. We can then profile against it, see which expressions\n are worth caching, and annotate the grammar. Perhaps there will even be\n positions at which a given expression is more worth caching. Or we could keep\n a count of how many times each cache entry has been used and evict the most\n useless ones as RAM use grows.\n* We could possibly compile the grammar into VM instructions, like in \"A\n parsing machine for PEGs\" by Medeiros.\n* If the recursion gets too deep in practice, use trampolining to dodge it.\n\nNiceties\n--------\n\n* Pijnu has a raft of tree manipulators. I don't think I want all of them, but\n a judicious subset might be nice. Don't get into mixing formatting with tree\n manipulation.\n https://github.com/erikrose/pijnu/blob/master/library/node.py#L333. PyPy's\n parsing lib exposes a sane subset:\n http://doc.pypy.org/en/latest/rlib.html#tree-transformations.\n\n\nVersion History\n===============\n(Next release)\n * ...\n\n0.10.0\n * Fix infinite recursion in __eq__ in some cases. (FelisNivalis)\n * Improve error message in left-recursive rules. (lucaswiman)\n * Add support for range ``{min,max}`` repetition expressions (righthandabacus)\n * Fix bug in ``*`` and ``+`` for token grammars (lucaswiman)\n * Add support for grammars on bytestrings (lucaswiman)\n * Fix LazyReference resolution bug #134 (righthandabacus)\n * ~15% speedup on benchmarks with a faster node cache (ethframe)\n\n .. warning::\n\n This release makes backward-incompatible changes:\n\n * Fix precedence of string literal modifiers ``u/r/b``.\n This will break grammars with no spaces between a\n reference and a string literal. (lucaswiman)\n\n\n0.9.0\n * Add support for Python 3.7, 3.8, 3.9, 3.10 (righthandabacus, Lonnen)\n * Drop support for Python 2.x, 3.3, 3.4 (righthandabacus, Lonnen)\n * Remove six and go all in on Python 3 idioms (Lonnen)\n * Replace re with regex for improved handling of unicode characters\n in regexes (Oderjunkie)\n * Dropped nose for unittest (swayson)\n * `Grammar.__repr__()` now correctly escapes backslashes (ingolemo)\n * Custom rules can now be class methods in addition to\n functions (James Addison)\n * Make the ascii flag available in the regex syntax (Roman Inflianskas)\n\n0.8.1\n * Switch to a function-style ``print`` in the benchmark tests so we work\n cleanly as a dependency on Python 3. (Edward Betts)\n\n0.8.0\n * Make Grammar iteration ordered, making the ``__repr__`` more like the\n original input. (Lucas Wiman)\n * Improve text representation and error messages for anonymous\n subexpressions. (Lucas Wiman)\n * Expose BadGrammar and VisitationError as top-level imports.\n * No longer crash when you try to compare a Node to an instance of a\n different class. (Esben Sonne)\n * Pin ``six`` at 1.9.0 to ensure we have ``python_2_unicode_compatible``.\n (Sam Raker)\n * Drop Python 2.6 support.\n\n0.7.0\n * Add experimental token-based parsing, via TokenGrammar class, for those\n operating on pre-lexed streams of tokens. This can, for example, help parse\n indentation-sensitive languages that use the \"off-side rule\", like Python.\n (Erik Rose)\n * Common codebase for Python 2 and 3: no more 2to3 translation step (Mattias\n Urlichs, Lucas Wiman)\n * Drop Python 3.1 and 3.2 support.\n * Fix a bug in ``Grammar.__repr__`` which fails to work on Python 3 since the\n string_escape codec is gone in Python 3. (Lucas Wiman)\n * Don't lose parentheses when printing representations of expressions.\n (Michael Kelly)\n * Make Grammar an immutable mapping (until we add automatic recompilation).\n (Michael Kelly)\n\n0.6.2\n * Make grammar compilation 100x faster. Thanks to dmoisset for the initial\n patch.\n\n0.6.1\n * Fix bug which made the default rule of a grammar invalid when it\n contained a forward reference.\n\n0.6\n .. warning::\n\n This release makes backward-incompatible changes:\n\n * The ``default_rule`` arg to Grammar's constructor has been replaced\n with a method, ``some_grammar.default('rule_name')``, which returns a\n new grammar just like the old except with its default rule changed.\n This is to free up the constructor kwargs for custom rules.\n * ``UndefinedLabel`` is no longer a subclass of ``VisitationError``. This\n matters only in the unlikely case that you were catching\n ``VisitationError`` exceptions and expecting to thus also catch\n ``UndefinedLabel``.\n\n * Add support for \"custom rules\" in Grammars. These provide a hook for simple\n custom parsing hooks spelled as Python lambdas. For heavy-duty needs,\n you can put in Compound Expressions with LazyReferences as subexpressions,\n and the Grammar will hook them up for optimal efficiency--no calling\n ``__getitem__`` on Grammar at parse time.\n * Allow grammars without a default rule (in cases where there are no string\n rules), which leads to also allowing empty grammars. Perhaps someone\n building up grammars dynamically will find that useful.\n * Add ``@rule`` decorator, allowing grammars to be constructed out of\n notations on ``NodeVisitor`` methods. This saves looking back and forth\n between the visitor and the grammar when there is only one visitor per\n grammar.\n * Add ``parse()`` and ``match()`` convenience methods to ``NodeVisitor``.\n This makes the common case of parsing a string and applying exactly one\n visitor to the AST shorter and simpler.\n * Improve exception message when you forget to declare a visitor method.\n * Add ``unwrapped_exceptions`` attribute to ``NodeVisitor``, letting you\n name certain exceptions which propagate out of visitors without being\n wrapped by ``VisitationError`` exceptions.\n * Expose much more of the library in ``__init__``, making your imports\n shorter.\n * Drastically simplify reference resolution machinery. (Vladimir Keleshev)\n\n0.5\n .. warning::\n\n This release makes some backward-incompatible changes. See below.\n\n * Add alpha-quality error reporting. Now, rather than returning ``None``,\n ``parse()`` and ``match()`` raise ``ParseError`` if they don't succeed.\n This makes more sense, since you'd rarely attempt to parse something and\n not care if it succeeds. It was too easy before to forget to check for a\n ``None`` result. ``ParseError`` gives you a human-readable unicode\n representation as well as some attributes that let you construct your own\n custom presentation.\n * Grammar construction now raises ``ParseError`` rather than ``BadGrammar``\n if it can't parse your rules.\n * ``parse()`` now takes an optional ``pos`` argument, like ``match()``.\n * Make the ``_str__()`` method of ``UndefinedLabel`` return the right type.\n * Support splitting rules across multiple lines, interleaving comments,\n putting multiple rules on one line (but don't do that) and all sorts of\n other horrific behavior.\n * Tolerate whitespace after opening parens.\n * Add support for single-quoted literals.\n\n0.4\n * Support Python 3.\n * Fix ``import *`` for ``parsimonious.expressions``.\n * Rewrite grammar compiler so right-recursive rules can be compiled and\n parsing no longer fails in some cases with forward rule references.\n\n0.3\n * Support comments, the ``!`` (\"not\") operator, and parentheses in grammar\n definition syntax.\n * Change the ``&`` operator to a prefix operator to conform to the original\n PEG syntax. The version in Parsing Techniques was infix, and that's what I\n used as a reference. However, the unary version is more convenient, as it\n lets you spell ``AB & A`` as simply ``A &B``.\n * Take the ``print`` statements out of the benchmark tests.\n * Give Node an evaluate-able ``__repr__``.\n\n0.2\n * Support matching of prefixes and other not-to-the-end slices of strings by\n making ``match()`` public and able to initialize a new cache. Add\n ``match()`` callthrough method to ``Grammar``.\n * Report a ``BadGrammar`` exception (rather than crashing) when there are\n mistakes in a grammar definition.\n * Simplify grammar compilation internals: get rid of superfluous visitor\n methods and factor up repetitive ones. Simplify rule grammar as well.\n * Add ``NodeVisitor.lift_child`` convenience method.\n * Rename ``VisitationException`` to ``VisitationError`` for consistency with\n the standard Python exception hierarchy.\n * Rework ``repr`` and ``str`` values for grammars and expressions. Now they\n both look like rule syntax. Grammars are even round-trippable! This fixes a\n unicode encoding error when printing nodes that had parsed unicode text.\n * Add tox for testing. Stop advertising Python 2.5 support, which never\n worked (and won't unless somebody cares a lot, since it makes Python 3\n support harder).\n * Settle (hopefully) on the term \"rule\" to mean \"the string representation of\n a production\". Get rid of the vague, mysterious \"DSL\".\n\n0.1\n * A rough but useable preview release\n\nThanks to Wiki Loves Monuments Panama for showing their support with a generous\ngift.\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "(Soon to be) the fastest pure-Python PEG parser I could muster",
"version": "0.10.0",
"split_keywords": [
"parse",
"parser",
"parsing",
"peg",
"packrat",
"grammar",
"language"
],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "b7fbb37c95a16bb5093463fe87534d66",
"sha256": "982ab435fabe86519b57f6b35610aa4e4e977e9f02a14353edf4bbc75369fc0f"
},
"downloads": -1,
"filename": "parsimonious-0.10.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b7fbb37c95a16bb5093463fe87534d66",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 48427,
"upload_time": "2022-09-03T17:01:13",
"upload_time_iso_8601": "2022-09-03T17:01:13.814548Z",
"url": "https://files.pythonhosted.org/packages/aa/0f/c8b64d9b54ea631fcad4e9e3c8dbe8c11bb32a623be94f22974c88e71eaf/parsimonious-0.10.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"md5": "60fc04ad2d643b27ddfad7bde0412f2c",
"sha256": "8281600da180ec8ae35427a4ab4f7b82bfec1e3d1e52f80cb60ea82b9512501c"
},
"downloads": -1,
"filename": "parsimonious-0.10.0.tar.gz",
"has_sig": false,
"md5_digest": "60fc04ad2d643b27ddfad7bde0412f2c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 52172,
"upload_time": "2022-09-03T17:01:17",
"upload_time_iso_8601": "2022-09-03T17:01:17.004923Z",
"url": "https://files.pythonhosted.org/packages/7b/91/abdc50c4ef06fdf8d047f60ee777ca9b2a7885e1a9cea81343fbecda52d7/parsimonious-0.10.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-09-03 17:01:17",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "erikrose",
"github_project": "parsimonious",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"tox": true,
"lcname": "parsimonious"
}