
UniGrammar.py


We have moved to https://codeberg.org/UniGrammar/UniGrammar.py; grab new versions there.

Under the guise of "better security", Micro$oft-owned GitHub has discriminated against users of 1FA passwords, while having a commercial interest in the success and wide adoption of the FIDO 1FA specifications and its Windows Hello implementation, which it promotes as a replacement for passwords. This will result in dire consequences and is completely unacceptable, read why.

If you don't want to participate in harming yourself, it is recommended to follow the lead and migrate away from GitHub and Micro$oft. Here is a list of alternatives and rationales for doing so. If they delete the discussion, there are certain well-known places where you can get a copy of it. Read why you should also leave GitHub.


UniGrammar is a tool providing a unified DSL for writing grammars, which it transpiles into the grammar DSLs specific to other tools.

Why?

When you create a grammar, you want to make it compatible with different parser generators because:

  • it allows it to be reused;

  • it allows you to utilize debugging tools available only in some of them.

And this is possible, since most grammar DSLs implement EBNF.

How?

The general workflow is as follows (but feel free to do whatever you find convenient):

  • Collect or craft samples of texts in the language you want to parse. They should be convenient for testing. You usually need texts testing each language feature separately, and then the interactions between them. You need either a dir of them, if each text occupies multiple lines, or a file of them, if each text occupies a single line.
  • Choose a parser generator CONVENIENT for you for implementing that grammar. The parser generator must have debugging tools sufficient for your task. It should usually be of the most generic class, i.e. GLR. You can downgrade the class later; for now your goal is just to develop the grammar, get familiar with it and make it work. I used parglare.
  • Make sure the needed tools are installed:
    • UniGrammar itself
    • UniGrammarRuntime
    • the parser generator you want to support
    • git
    • a GUI diff and merge tool supporting git repos, such as TortoiseGitMerge, WinMerge (Windows only) or Meld.
  • Set up your working dir:
    • Clone https://codeberg.org/UniGrammar/grammars and read its ReadMe.
    • Find a dir in the repo matching the purpose of the language you want to parse. Create a subdir there for your language. cd into it.
  • Develop and debug a grammar for the selected parser generator. Make it work. Use debugging tools, such as tracers and AST visualizers, to make sure it works as intended. Commit it.
  • Make an initial port of your grammar to UniGrammar:
    • Translate it to grammar.yug. For now, just copy and then manually translate; in the future, automated assistance may be developed.
    • Use UniGrammar transpile <yug file> <backend name> to transpile it into a grammar for the backend of your choice.
    • Compare the generated spec to the one you have originally crafted. Make minor insignificant changes to both specs to make them byte-for-byte identical, keeping the original spec working.
  • Set up testing:
    • register the tests in your yug file
    • run UniGrammar test <yug file> <backend name> and make sure all the tests pass. These tests mean only that a source is parsed without an issue. If they don't pass, fix the grammar.
  • Add compatibility with the rest of the backends, downgrading the grammar class step by step. Modify the yug file and test until it works for a backend. Then bring compatibility to all the backends.
  • You now have a universal grammar suitable for more than one backend. It's time for deployment and behavioral tests (see the consolidated sketch after this list):
    • generate a bundle using UniGrammar gen-bundle <yug file> <backend name>
    • import the runtime: from UniGrammarRuntime.ParserBundle import ParserBundle
    • b = ParserBundle(Path("path/to/bundle"))
    • w = b["your_grammar_name"].getWrapper()
    • parseTree = w("text to parse")
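
Put together, the deployment steps above look roughly like this. This is a minimal sketch assembled from the steps above; "path/to/bundle" and "your_grammar_name" are placeholders to adapt to your project.

```python
# Minimal sketch of consuming a generated bundle.
# The bundle itself is produced beforehand on the command line:
#   UniGrammar gen-bundle grammar.yug <backend name>
# "path/to/bundle" and "your_grammar_name" below are placeholders.
from pathlib import Path

from UniGrammarRuntime.ParserBundle import ParserBundle

b = ParserBundle(Path("path/to/bundle"))  # load the generated bundle
w = b["your_grammar_name"].getWrapper()   # get a parser wrapper for your grammar
parseTree = w("text to parse")            # parse a string into a parse tree
```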

Guidelines

  • An *.*ug file is a machine-readable and machine-writable universal grammar file. It is a tree of objects like the ones that can be serialized into JSON. ug stands for UniGrammar and is prefixed with a letter:

    • y stands for YAML
    • j stands for JSON
    • p stands for PON - "Python Object Notation" that can be parsed securely using ast.literal_eval
    • *b stands for binary, prefixed with a letter identifying the binary format:
      • c - cbor
      • m - msgpack
      • o - own format
  • An *.*ug file consists of 4 sections, each of them is a list of records:

    • characters - definitions of character classes. Needed because of CoCo/R.
    • keywords - put whole reserved words there. Anything identical to these words will be recognized as tokens of these words.
    • tokens - consist of groups of characters. They cannot group other tokens or productions.
    • fragmented and productions - productions resolved via a state machine. They are mostly the same, but there is a big semantic difference between them, related to the wrapper generated from them:
      • fragmented are considered to be simple text strings. They should never contain caps. It is an artificial class to support scannerful LL parsers. Scannerful LL parsers work from character classes: they split the text into tokens, assign a type to each token based on the character classes used in it, and then apply productions; they never backtrack, and the tokenizer doesn't know the context. This means token character classes should in general never overlap, otherwise the tokens may be recognized wrongly. So, to support "tokens" with overlapping char ranges, one splits them into tokens of non-overlapping char ranges, and these "tokens" are not tokens anymore, but productions. Still, they keep the meaning of tokens, and this section is for such "tokens". The postprocessor (which can be automatically generated) should join them back into strings. Also, their internal structure may be optimized out for the backends where it makes sense.
      • productions - usual productions, which must always contain at least 1 cap defining a named ref to a parse tree child subnode (otherwise they belong to fragmented; if you get invalid Python code, you have probably put into productions something that must be in fragmented).
  • use id: <id> to assign an id to each rule. It must be done for rules in sections.

  • use ref: <assigned id> to refer to an already created rule.

  • use alt: […] to specify alternatives. Works for all the sections. For chars, it allows enumerating individual characters.

  • use range: ['<start>', '<stop>'] to create a character range. [<start>-<stop>] in regexp syntax.

  • use wellknown: <name> to specify a group of characters with a well-known name.

  • use neg: true if the chars are to be excluded.

  • use lit: ... to add a literal or a single character.

  • use min to mark iteration. min: 0 is transpiled to …* in parglare syntax ({…} in EBNF), min: 1 to …+ (… {…}).

  • use opt to mark optionality. It is transpiled to …? ([…]).

  • use seq: […] to create a sequence.

  • use cap: <name> to put the contents of this rule into the parse tree, if it is constructed.

  • use prefer: shift | reduce to set a preference for parglare.

  • use spacer: <n> to add n empty lines.

  • use name in the root to specify a grammar name.
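
To make the keys above concrete, here is a sketch of what a yug (YAML) grammar might look like. It is assembled only from the keys described above; the exact nesting of records (e.g. how min, cap and ref combine within one record) is an assumption here, so treat the real grammar linked below as the authoritative reference.

```yaml
# An illustrative, NON-authoritative sketch composed from the keys above.
name: kvPairs                # grammar name, set in the root
characters:
  - id: identifierChar       # a character class (needed for CoCo/R)
    alt:                     # alternatives; for chars, enumerates characters
      - range: ["a", "z"]    # [a-z] in regexp syntax
      - range: ["0", "9"]    # [0-9]
tokens:
  - id: identifier           # tokens group only characters
    min: 1                   # one or more, i.e. …+
    ref: identifierChar      # refer to the rule created above
productions:
  - id: pair                 # contains caps, so it belongs to productions
    seq:                     # a sequence
      - cap: key             # named ref to a parse tree child subnode
        ref: identifier
      - lit: "="             # a literal
      - cap: value
        ref: identifier
```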

Here is an example: https://codeberg.org/KOLANICH-libs/AptSourcesList.py/blob/master/grammar.yug

Implemented backends

In the order of decreasing performance:

Not fully implemented backends

Dependencies