Statistics and info in README are tricking users #1385

sylee957 · 2024-01-26T11:14:51Z


When I read the capabilities section of the README, it says something like:

What can it do?
- Parse all context-free grammars, and handle any ambiguity gracefully
- Build an annotated parse-tree automagically, no construction code required.
- Provide first-rate performance in terms of both Big-O complexity and measured run-time (considering that this is Python ;)
- Run on every Python interpreter (it's pure-python)
- Generate a stand-alone parser (for LALR(1) grammars)

However, I would like to ask a few questions:

- Can LALR parse all context-free grammars and handle ambiguity gracefully, or does that apply only to Earley?
- Is Earley performant? LALR is evidently the fastest, but in the graph Earley is, unfortunately, the slowest.

If my concerns are valid, then I was deceived by the README, because it sounded like:

I get the advantages of both 'earley' and 'lalr', with the disadvantages of neither. So we are the best.

Especially regarding performance, Earley is not just somewhat slow
but by far the slowest, which is evidently a problem (and one you could be more transparent about).

https://github.com/lark-parser/lark?tab=readme-ov-file#performance-comparison

I think the summary should be more transparent about these evident problems,
because it affects users' decisions (evidently everyone is a very technical user),
and I know of, and have faced, such problems myself, performance among them.

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

If 'earley' is one of your flagship features, I would suggest removing 'performance' from that sentence, because the evidence contradicts it and it confuses users.

I have been following some issues in the SymPy community where users raise the performance of Earley
(sympy/sympy#26098).
I wanted to share what I found because people are misunderstanding a lot from Lark's documentation,
and to ask whether the Lark community should correct such information.

(I'm open to other suggestions, and I will close this question if you can provide evidence that Lark has solved such problems or that the statistics are outdated.)

MegaIng · 2024-01-26T14:02:39Z

Collaborator

If you read the rest of the README, you can find quite a few places where a clear distinction is made between LALR and Earley:

Experts: Lark implements both Earley(SPPF) and LALR(1), and several different lexers, so you can trade-off power and speed, according to your requirements. It also provides a variety of sophisticated features and utilities.

(this quote is directly above the section you quoted)

Earley parser
- Can parse all context-free grammars
- Full support for ambiguous grammars

LALR(1) parser
- Fast and light, competitive with PLY
- Can generate a stand-alone parser (read more)

While some wording could be improved, people only reading subsections of the README and getting confused is not something we are going to be able to prevent in general. If you have a concrete suggestion that doesn't involve removing the word "performance" (which is correct for most grammars people want to use it for), we can probably add it.

To answer the questions:

- No, LALR can't parse every CFG. This isn't claimed anywhere, and it is clear from it being an LALR parser.
- No, Earley isn't performant: it's O(n^3) in the worst case, which is clear from it being an Earley parser.

Neither of them is the flagship; they are different parts, each useful in different situations.


erezsh · 2024-01-26T16:51:27Z

Maintainer

I also don't understand why you find our documentation confusing. The parser parameter for lark accepts either "lalr" or "earley", which makes it clear they are different algorithms. That fact is also mentioned several times in the documentation. It is further emphasized in the graph you linked yourself, showing the two algorithms separately, even as if they are different libraries. It also shows there clearly, on the main page, that Earley is "the slowest".

Earley is the default algorithm because it is the easiest for beginners, and it is capable of parsing ambiguous and complicated languages that most parsers cannot. Anyone who cares about performance, and doesn't need Earley's features, can choose to use LALR, as perhaps you should try to do.

If we are the best, it is because our LALR implementation is better than all the others, and because we are the only ones who provide Earley as an option; the only ones who can parse every CFG, although it may come at the cost of performance. And perhaps because of many other features that only Lark has, and no other library does. And yet I don't recall the adjective "best" being used anywhere in our pages. Did we ever claim it out loud?

Yet you say you "get deceived", that we're "tricking users", and accuse us of a lack of transparency. So please, explain yourself better, so that I may understand why you would say that.

edit: And just one unrelated comment. I've looked at the LaTeX grammar they are using, and it's written in a very inefficient way, which is probably contributing to the slowness. They are also running it with the debug flag set to True, which also incurs a runtime cost.

3 replies

sylee957 Jan 26, 2024
Author

They are also running it with the debug flag set to True, which also incurs a runtime cost.

I have a question specifically about this. Does debug really have a runtime cost?
I thought debug only performs static analysis of the grammar (finding unreachable tokens, finding problems in the LALR parsing table), which would affect performance only during construction or the initial run.
Turning it off is obviously the easiest way to tweak performance (other than solving the puzzles in the grammar).
I'd also like to know what additional work debug does at runtime (during parse), in case I have missed something in the implementation.

MegaIng Jan 26, 2024
Collaborator

Its biggest runtime cost is that the sppf.png image gets generated, which always means a file access and a call to an external program, or a warning when pydot is not installed. I don't think it does anything extra for Earley, but for LALR it always makes it use a slower and less memory-efficient parse table.

erezsh Jan 27, 2024
Maintainer

finding unreachable tokens, finding problems in the LALR parsing table

This always happens anyway. The debug flag is for intensive debugging, not everyday use.

sylee957 · 2024-01-26T18:02:06Z

Author

Please describe the library precisely and humbly, for example:
- Lark is a flexible, easy to use parsing library
- We allow users to pick either LALR or Earley algorithm

I do know that LALR is much harder to use but fast, and Earley is much easier but slow.
Please just list the capabilities of LALR and Earley individually, one by one.
Please describe it as 'we just let you pick one', that's all.
I also know that there is no way to combine the best parts of these two algorithms (LALR and Earley), so don't mix the two together.

In https://github.com/lark-parser/lark?tab=readme-ov-file#performance-comparison,
- Please write a brief summary of the graph
- Please say 'we are sorry about the performance of Earley but we are improving it', or 'we are sorry about it but there is a way around it'

Please also invest some effort in writing guides such as the following; they would be very helpful:
- How to avoid very inefficient grammars for Earley, and how to tune its performance
- How to start with Earley but transition to LALR for performance (if that is easily possible)

1 reply

MegaIng Jan 26, 2024
Collaborator

We are not sorry about performance of earley. If it isn't good enough for you, try to use lalr or you can invest time and effort into improving the earley implementation (I don't know if there is a lot that can be done there). Or you can use a different library, although there is AFAIK no other pure python parsing library with similar abilities with regards to ambiguities.

I am not sure what you want us to add to "interpret the graph" that isn't just describing it. What would you interpret into it?

Better guides are something we should potentially have. But I think both of your suggestions are highly complex research questions, and I don't think you can give an easy-to-understand explanation to people who haven't read up on parser theory.

sylee957 · 2024-01-30T13:04:04Z

Author

Thanks, I'm closing this issue.

2 replies

erezsh Jan 30, 2024
Maintainer

So, you come to our project page to accuse us with strong words like deception and tricking, but when we ask you to explain yourself, you just back out and say nothing. I know I can't expect everyone to live up to my own standards, but if I were you, I would feel the need to either justify myself or apologize.

Anyway, I hope SymPy manages to improve the performance of their grammar. Here are a few hints: turn off the debug flag, refactor the grammar to use fewer rules (I saw a lot of needless repetition), use left-recursion instead of right-recursion whenever possible, and definitely don't have a rule with both.
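
Editor's sketch: to illustrate why the left-recursion hint matters for Earley, here is a minimal textbook Earley recognizer in pure Python that counts chart items. It is not Lark's implementation, and whether Lark's internals behave the same way is an assumption; the function name `earley_chart_size` and the two toy grammars are hypothetical.

```python
from collections import namedtuple

# An Earley item: rule head, rule body, dot position, origin column.
Item = namedtuple("Item", "head body dot origin")


def earley_chart_size(grammar, start, tokens):
    """Return (accepted, total chart items) for a textbook Earley recognizer.

    `grammar` maps a nonterminal to a list of bodies (tuples of symbols).
    Symbols that are not keys of `grammar` are terminals.
    Empty (epsilon) rules are not supported, which keeps the completer simple.
    """
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    chart[0] = {Item(start, body, 0, 0) for body in grammar[start]}
    for i in range(n + 1):
        worklist = list(chart[i])
        while worklist:
            item = worklist.pop()
            new_items = []
            if item.dot == len(item.body):
                # Completer: advance items in the origin column waiting on item.head.
                new_items = [
                    Item(p.head, p.body, p.dot + 1, p.origin)
                    for p in chart[item.origin]
                    if p.dot < len(p.body) and p.body[p.dot] == item.head
                ]
            elif item.body[item.dot] in grammar:
                # Predictor: start every rule of the expected nonterminal here.
                expected = item.body[item.dot]
                new_items = [Item(expected, body, 0, i) for body in grammar[expected]]
            elif i < n and tokens[i] == item.body[item.dot]:
                # Scanner: the terminal matches, so move the dot into the next column.
                chart[i + 1].add(Item(item.head, item.body, item.dot + 1, item.origin))
            for new in new_items:
                if new not in chart[i]:
                    chart[i].add(new)
                    worklist.append(new)
    accepted = any(
        it.head == start and it.dot == len(it.body) and it.origin == 0
        for it in chart[n]
    )
    return accepted, sum(len(column) for column in chart)


# Two toy grammars for the same language: one or more "a" tokens.
LEFT = {"list": [("list", "a"), ("a",)]}   # left-recursive
RIGHT = {"list": [("a", "list"), ("a",)]}  # right-recursive

for n in (20, 40):
    tokens = ["a"] * n
    _, left_items = earley_chart_size(LEFT, "list", tokens)
    _, right_items = earley_chart_size(RIGHT, "list", tokens)
    print(f"n={n}: left-recursive items={left_items}, right-recursive items={right_items}")
```

In this sketch the left-recursive grammar adds a constant number of items per input token, while the right-recursive one adds a number that grows with the position in the input, so the total grows roughly quadratically. Production Earley parsers may apply optimizations that change this picture.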

sylee957 Jan 30, 2024
Author

I'm sorry for the inappropriate title and for opening a discussion that read as 'accusing' you.
I do appreciate your time and effort. I decided to back off because I don't think we will reach an understanding,
and because I am not good at choosing the right words or phrases.
