We have moved to https://codeberg.org/UniGrammar/UniGrammar.py, grab new versions there.
Under the guise of "better security", Micro$oft-owned GitHub has discriminated against users of 1FA passwords while having a commercial interest in the success and wide adoption of the FIDO 1FA specifications and the Windows Hello implementation, which it promotes as a replacement for passwords. This will result in dire consequences and is completely unacceptable, read why.
If you don't want to participate in harming yourself, it is recommended to follow the lead and migrate away from GitHub and Micro$oft. Here is a list of alternatives and rationales for doing so. If they delete the discussion, there are certain well-known places where you can get a copy of it. Read why you should also leave GitHub.
UniGrammar is a tool providing a unified DSL for writing grammars that can be transpiled into the grammar DSLs of other tools.
When you create a grammar, you want to make it compatible with different parser generators because:
- it allows the grammar to be reused;
- it allows you to utilize debugging tools available only in some of them.

And it is possible since most grammar DSLs implement EBNF.
The general workflow is as follows (but feel free to do whatever you find convenient):
- Collect or craft samples of texts in the language you want to parse. They should be convenient for testing. You usually need texts testing each language feature separately, and then the interactions between them. You either need a dir of them, if each text occupies multiple lines, or a file of them, if each text occupies a single line.
- Choose a parser generator CONVENIENT for you for implementing that grammar. The parser generator must have debugging tools sufficient for your task. It usually should be of the most generic class, i.e. GLR; you can downgrade the class later. For now your goal is just to develop the grammar, get familiar with it and make it work. I used `parglare`.
- Make sure the needed tools are installed:
  - `UniGrammar` itself;
  - `UniGrammarRuntime`;
  - the parser generator you want to support;
  - `git`;
  - a GUI diff and merge tool supporting `git` repos, such as `TortoiseGitMerge`, `WinMerge` (for Windows only) or `meld`.
- Set up your working dir:
  - Clone https://codeberg.org/UniGrammar/grammars and read its `ReadMe`.
  - Find a dir in the repo matching the purpose of the language you want to parse. Create a subdir there for your language. `cd` into it.
- Develop and debug a grammar for the selected parser generator. Make it work. Use debug tools, such as tracers and AST visualizers, to make sure it works as intended. Commit it.
- Make an initial port of your grammar to `UniGrammar`:
  - Translate it to `grammar.yug`. For now, just copy and then manually translate; in the future, automatic assistance can be developed.
  - Use `UniGrammar transpile <yug file> <backend name>` to transpile it into a grammar for the backend of your choice.
  - Compare the generated spec to the one you have originally crafted. Make minor insignificant changes to both specs to make them byte-by-byte identical, keeping the original spec working.
- Set up testing:
  - Register the tests in your `yug` file.
  - Run `UniGrammar test <yug file> <backend name>` and make sure all the tests pass. These tests only mean that a source is parsed without an issue. If they don't pass, fix the grammar.
- Make the grammar compatible with the rest of the backends, downgrading the grammar class step by step. Modify the `yug` file and test until it works for a backend. Bring compatibility to all the backends.
- You now have a universal grammar suitable for more than one backend. Now it's time for deployment and behavioral tests:
  - Generate a bundle using `UniGrammar gen-bundle <yug file> <backend name>`.
  - Import the runtime and parse:

```python
from pathlib import Path

from UniGrammarRuntime.ParserBundle import ParserBundle

b = ParserBundle(Path("path/to/bundle"))
w = b["your_grammar_name"].getWrapper()
parseTree = w("text to parse")
```
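To make the steps above concrete, here is a minimal hypothetical sketch of a first `grammar.yug` together with the corresponding CLI calls. The grammar name, rule ids and the exact nesting of the records are illustrative assumptions; the keywords themselves (`name`, `id`, `range`, `ref`, `cap`, `min`) and the commands are the ones documented in this ReadMe.

```yaml
# grammar.yug - hypothetical minimal sketch, not a real grammar from the repo
name: digits
tokens:
  - id: digit
    range: ['0', '9']
productions:
  - id: number
    cap: value      # at least one `cap` is required in `productions`
    min: 1          # one or more digits
    ref: digit

# Then, per the workflow above (the backend name is just an example):
#   UniGrammar transpile grammar.yug parglare
#   UniGrammar test grammar.yug parglare
#   UniGrammar gen-bundle grammar.yug parglare
```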
- An `*.*ug` file is a machine-readable and writeable universal grammar file. It is a tree of serialized objects like the ones that can be serialized into JSON. `ug` stands for UniGrammar. It is prepended by a letter:
  - `y` stands for YAML;
  - `j` stands for JSON;
  - `p` stands for PON - "Python Object Notation" that can be parsed securely using `ast.literal_eval`;
  - `*b` stands for `binary`, prepended by a letter identifying a binary format:
    - `c` - cbor;
    - `m` - msgpack;
    - `o` - own format.
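Since JSON is a subset of YAML, the textual flavours differ only in notation. A hypothetical single-record sketch (the record uses only the `id` and `range` keywords documented below):

```yaml
# In a *.yug file (YAML notation):
- id: digit
  range: ['0', '9']
# The same record in a *.jug file (JSON notation), which
# incidentally is also valid YAML:
#   [{"id": "digit", "range": ["0", "9"]}]
```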
- An `*.*ug` file consists of 5 sections, each of them a `list` of records:
  - `characters` - for definitions of character classes. Needed because of CoCo/R.
  - `keywords` - put there whole words that are reserved. Anything identical to these words will be recognized as tokens of these words.
  - `tokens` - consist of groups of `characters`. Cannot group other tokens and productions.
  - `fragmented` and `productions` - productions resolved via a state machine. They are mostly the same, but there is a big semantic difference between them, related to the wrapper generated from them:
    - `fragmented` records are considered to be simple text strings. They should never `cap`. It is an artificial class to support scannerful LL parsers. Scannerful LL parsers work from character classes: they split text into tokens and assign a type to each token based on the character classes used in it, then apply productions; they never backtrack, and the tokenizer doesn't know the context. This means token character classes in general should never overlap, otherwise the tokens may be wrong. So to support "tokens" with overlapping char ranges, one splits them into tokens of non-overlapping char ranges, and these "tokens" are not tokens anymore, but productions. Still, they keep the meaning of tokens, and this section is for such "tokens". The postprocessor (which can be automatically generated) should join them back into strings. Also their internal structure may be optimized out for the backends where it makes sense.
    - `productions` - usual productions, which must always contain at least 1 `cap` (otherwise they belong to `fragmented`; if you get invalid Python code, you probably have put something that must be in `fragmented` into `productions`), defining named refs to parse tree children subnodes.
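A minimal skeleton showing all five sections might look like the following sketch. Only the section names and keywords come from this document; the particular records, their ids and their nesting are hypothetical:

```yaml
# Hypothetical skeleton; the individual records are illustrative only
name: example            # grammar name, set in the root
characters:              # character classes (a separate section because of CoCo/R)
  - id: letter
    range: ['a', 'z']
  - id: digit
    range: ['0', '9']
keywords:                # reserved whole words
  - id: kwIf
    lit: "if"
tokens:                  # built from groups of `characters` only
  - id: word
    min: 1
    ref: letter
fragmented:              # split "tokens": string-like, must never `cap`
  - id: identifier
    seq:
      - ref: word        # a word optionally followed by digits, e.g. "abc12";
      - min: 0           # the postprocessor joins the pieces back into one string
        ref: digit
productions:             # real productions: at least one `cap` each
  - id: pair
    seq:
      - cap: key
        ref: identifier
      - lit: "="
      - cap: value
        ref: identifier
```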
- use `id: <id>` to assign an id to each rule. It must be done for rules in sections.
- use `ref: <assigned id>` to refer to an already created rule.
- use `alt: […]` to specify alternatives. Works for all the sections. For `chars` it allows enumerating characters.
- use `range: ['<start>', '<stop>']` to create a character range: `[<start>-<stop>]` in regexp syntax.
- use `wellknown: <name>` to specify a group of characters with a well-known name.
- use `neg: true` if the chars are to be excluded.
- use `lit: ...` to add a literal or a single character.
- use `min` to mark iteration. `min: 0` is transpiled to `…*` (`{…}`), `min: 1` is transpiled to `…+` (`… {…}`) in parglare (EBNF) syntaxes.
- use `opt` to mark optionality. It is transpiled to `…?` (`[…]`).
- use `seq: […]` to create a sequence.
- use `cap: <name>` to put the contents of this rule into the parse tree, if it is constructed.
- use `prefer: shift | reduce` to set a preference for `parglare`.
- use `spacer: <n>` to add `n` empty lines.
- use `name` in the root to specify a grammar name.
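Combining several of these keywords, a single hypothetical record and its rough regexp/EBNF equivalents (following the transpilation rules above) could look like this; the exact nesting of the keywords is an assumption:

```yaml
- id: integer           # roughly -?[0-9]+ in regexp syntax, captured as `value`
  cap: value
  seq:
    - opt: true         # transpiled to …? ([…])
      lit: "-"
    - min: 1            # transpiled to …+ (… {…})
      range: ['0', '9'] # [0-9] in regexp syntax
```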
Here is a real-world example: https://codeberg.org/KOLANICH-libs/AptSourcesList.py/blob/master/grammar.yug
Supported backends, in the order of decreasing performance:
- parsimonious
- waxeye (PEG)
- ANTLR 4 (LL(*))
- parglare (LR, GLR)
- TatSu
Dependencies:
- `rangeslicetools` - for computations with char ranges;
- `plumbum` - for CLI.