lug

A C++ embedded domain specific language for expressing parsers as extended parsing expression grammars (PEGs)

Features

Natural syntax more akin to external parser generator languages
Separation of syntatic and lexical rules, with customizable implicit whitespace skipping
Direct and indirect left recursion with precedence levels to disambiguate subexpressions with mixed left/right recursion
Traditional PEG syntax has been extended to support attribute grammars and semantic actions
Support for most context-sensitive grammars with symbol tables, conditions and syntactic predicates
Cut operator to commit to currently matched parse prefix and prune all backtrack entries
Deferred evaluation of semantic actions, ensuring actions do not execute on failed branches or invalid input
Generated parsers are compiled to special-purpose bytecode and executed in a virtual parsing machine
UTF-8 text parsing with complete Level 1 and partial Level 2 support of the UTS #18 Unicode Regular Expressions technical standard
Automatic line and column tracking with customizable tab width and alignment
Uses expression template functors to implement the rules of the domain specific language
Header only library using C++17 language and library features
Relatively small with the intent of parser core to remain under 2000 lines of terse code

It is based on research introduced in the following papers:

Bryan Ford, Parsing expression grammars: a recognition-based syntactic foundation, Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of Programming Languages, p.111-122, January 2004

Sérgio Medeiros et. al, A parsing machine for PEGs, Proceedings of the 2008 symposium on Dynamic Languages, p.1-12, July 2008

Kota Mizushima et. al, Packrat parsers can handle practical grammars in mostly constant space, Proceedings of the 9th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, p.29-36, June 2010

Sérgio Medeiros et. al, Left recursion in Parsing Expression Grammars, Science of Computer Programming, v.96 n.P2, p.177-190, December 2014

Leonardo Reis et. al, The formalization and implementation of Adaptable Parsing Expression Grammars, Science of Computer Programming, v.96 n.P2, p.191-210, December 2014

Tetsuro Matsumura, Kimio Kuramitsu, A Declarative Extension of Parsing Expression Grammars for Recognizing Most Programming Languages, Journal of Information Processing, v.24 i.2 p.256-264, November 2015

Sérgio Medeiros et. al, A parsing machine for parsing expression grammars with labeled failures, Proceedings of the 31st Annual ACM symposium on Applied Computing, p.1960-1967, April 2016

Building

As a header only library, lug itself does not need to be built. Simply ensure the lug header directory is in your include path and you're good to go.

As a baseline, the following compiler versions are known to work with lug.

Compiler	Language Mode
Clang 14.0.0 (March 2022)	-std=c++17 or -std=gnu++17
GCC 7.1.0 (May 2017)	-std=c++17 or -std=gnu++17
Microsoft Visual C++ 2017 15.5 (December 2017)	Platform Toolset: Visual Studio 2017 Toolset (v141), Language Standard: ISO C++17 Standard (/std:c++17)

To build the sample programs and unit tests, a makefile is provided for Linux and BSD platforms and a Visual Studio solution is available for use on Windows.

Syntax Reference

Operator	Syntax	Description
Sequence	`e1 > e2`	Matches both expressions e1 followed by e2 in sequence
Ordered Choice	`e1 \| e2`	Attempts to first match expression e1, and if that fails backtracks then attempts to match e2
Zero-or-More	`*e`	Reptition matching of expression e zero, one or more times
One-or-More	`+e`	Reptition matching of expression e one or more times
Optional	`~e`	Matching of expression e zero or one times
Positive Lookahead	`&e`	Matches without consuming input if expression e succeeds to match the input
Negative Lookahead	`!e`	Matches without consuming input if expression e fails to match the input
Cut Before	`--e`	Issues a cut instruction before the expression e
Cut After	`e--`	Issues a cut instruction after the expression e
Action Scheduling	`e < a`	Schedules a semantic action a to be evaluated if expression e successfully matches the input
Attribute Binding	`v % e`	Binds the return value of the last evaluated semantic action within the expression e to the attribute v

Control	Description
`cased[e]`	Case sensitive matching for the subexpression e (the default)
`caseless[e]`	Case insensitive matching for subexpression e
`skip[e]`	Turns on all whitespace skipping for subexpression e (the default)
`noskip[e]`	Turns off all whitespace skipping for subexpression e, including preceeding whitespace
`lexeme[e]`	Treats subexpression e as a lexical token with no internal whitespace skipping
`capture(V)[e]`	Syntactic capture of the matching subexpression e into variable V
`on(C)[e]`	Sets the condition C to true for the scope of subexpression e
`off(C)[e]`	Sets the condition C to false for the scope of subexpression e (the default)
`symbol(S)[e]`	Pushes a symbol definition for symbol S with value equal to the captured input matching subexpression e
`block[e]`	Creates a scope block for subexpression e where all new symbols defined in e are local to it and all external symbols defined outside of the block are also available for reference within e
`local[e]`	Creates a local scope block for subexpression e where all new symbols defined in e are local to it and there are no external symbol definitions available for reference
`local(S)[e]`	Creates a local scope block for subexpression e where all new symbols defined in e are local to it and all external symbols defined outside of the block are also available for reference within e, except for the symbol named S

Terminal	Description
`nop`	No operation, does not emit any instructions
`eps`	Matches the empty string
`eoi`	Matches the end of the input sequence
`eol`	Matches a Unicode line-ending
`cut`	Emits a cut operation into the stream of semantic actions
`chr(c)`	Matches the UTF-8, UTF-16, or UTF-32 character c
`chr(c1, c2)`	Matches characters in the UTF-8, UTF-16, or UTF-32 interval [c1-c2]
`str(s)`	Matches the sequence of characters in the string s
`bre(s)`	POSIX Basic Regular Expression (BRE)
`any`	Matches any single character
`any(flags)`	Matches a character exhibiting any of the character properties
`all(flags)`	Matches a character with all of the character properties
`none(flags)`	Matches a character with none of the character properties
`alpha`	Matches any alphabetical character
`alnum`	Matches any alphabetical character or numerical digit
`blank`	Matches any space or tab character
`digit`	Matches any decimal digit
`graph`	Matches any graphical character
`lower`	Matches any lowercase alphabetical character
`print`	Matches any printable character
`punct`	Matches any punctuation character
`space`	Matches any whitespace character
`upper`	Matches any uppercase alphabetical character
`xdigit`	Matches any hexadecimal digit
`when(C)`	Matches if the condition named C is true, without consuming input
`unless(C)`	Matches if the condition named C is false, without consuming input
`exists(S)`	Matches if there is a definition for symbol S in the current scope
`missing(S)`	Matches if there is no definition for symbol S in the current scope
`match(S)`	Matches the last definition for symbol named S
`match_any(S)`	Matches against any prior definition for symbol named S
`match_all(S)`	Matches against all prior definitions for symbol named S, in sequence from least to most recent
`match_front(S,N=0)`	Matches against the N-th least recent definition for symbol named S
`match_back(S,N=0)`	Matches against the N-th most recent definition for symbol named S

Literal	Name	Description
`_cx`	Character Expression	Matches the UTF-8, UTF-16, or UTF-32 character literal
`_sx`	String Expression	Matches the sequence of characters in a string literal
`_rx`	Regular Expression	POSIX Basic Regular Expression (BRE)
`_icx`	Case Insensitive Character Expression	Same as `_cx` but case insensitive
`_isx`	Case Insensitive String Expression	Same as `_sx` but case insensitive
`_irx`	Case Insensitive Regular Expression	Same as `_rx` but case insensitive
`_scx`	Case Sensitive Character Expression	Same as `_cx` but case sensitive
`_ssx`	Case Sensitive String Expression	Same as `_sx` but case sensitive
`_srx`	Case Sensitive Regular Expression	Same as `_rx` but case sensitive

TODO

parser error recovery
add an interactive processing mode flag to input sources?
handle exceptions thrown from semantic actions in semantics::accept?
feature: Adams-Nestra grammars and whitespace alignment
feature: syntax to specify number range of allowed iteration
optimization: tail recursion
optimization: reduce number of false-positive left-recursive calls even further by lazily evaluating rule mandate
optimization: additional instructions (test_char, test_any, test_range, test_class)
more samples, testing, and bug fixing
increase compiler warning level and fix any issues
documentation

Name		Name	Last commit message	Last commit date
Latest commit History 270 Commits
.github/workflows		.github/workflows
doc		doc
lug		lug
msvs		msvs
samples		samples
tests		tests
tools		tools
.clang-tidy		.clang-tidy
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE.md		LICENSE.md
Makefile		Makefile
README.md		README.md
lug.sln		lug.sln
runtests.sh		runtests.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

lug

Features

Building

Syntax Reference

TODO

About

Releases 2

Contributors 2

Languages

License

jwtowner/lug

Folders and files

Latest commit

History

Repository files navigation

lug

Features

Building

Syntax Reference

TODO

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 2

Contributors 2

Languages