[Feature Request]: Improving search #325

Open
3 tasks done
mm12 opened this issue Jul 20, 2024 · 12 comments
Labels
TagStudio: Search (The TagStudio search engine) · Type: Duplicate (This issue or pull request already exists) · Type: Enhancement (New feature or request)

Comments

@mm12

mm12 commented Jul 20, 2024

Checklist

  • I am using an up-to-date version.
  • I have read the documentation.
  • I have searched existing issues.

Description

I know this is somewhat on the roadmap, but I thought I would share some specifics of how search should be improved. It is very important that work on search functionality starts early, so that the system is developed in a way that supports future features (i.e., we don't want to end up in a situation where implementing a standard search feature requires major changes).

Solution

End goal: A fully featured search system. This could make use of Elastic and/or OpenSearch. Desirable qualities:

  • A Standard search syntax
    • should be similar to Boorus and search engines - no one wants to learn a new syntax
  • Grouping of items
    • Many implementations of grouping treat each group as its own search, combining results as it goes up.
  • Boolean operators (AND, OR, NOT) of items and groups
  • Reserved characters and names:
    • there should be reserved characters, such as quotation marks and colons, that mean something in search syntax. To search the character itself, it will need to be escaped.
    • Any assignable field should be searchable. For example, with filename:<query> - this should be able to be used with any given type of field. Of course, by default, a term is assumed to be a tag.
  • Searching text fields:
    • RegEx seems like the easiest way to do this. Not everyone knows it, but it could be a mode.
  • Handling entries with spaces ([BUG] Tags with Spaces in Searchbar #112) - there are 2 ways this is usually dealt with:
    • Make all spaces actually underscores: if your tag is a happy new year it becomes a_happy_new_year, but the underscores do not get shown to the user. This means that spaces and underscores are effectively the same.
    • grouping using quotations: "a happy new year" - however, this means that items with a quote in them will need to be escaped: a tag named "wow thats cool" would need to be searched with something like "\"wow thats cool\"" instead.
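
The two ways of handling spaces described above can be sketched roughly as follows. This is only an illustration, not TagStudio code: `normalize_tag`, `display_tag`, and `split_terms` are hypothetical names, and the quoting behaviour is borrowed from Python's standard `shlex` module restricted to double quotes.

```python
import shlex


def normalize_tag(name: str) -> str:
    """Option 1: fold spaces into underscores so both spellings compare equal."""
    return "_".join(name.split())


def display_tag(stored: str) -> str:
    """Option 1: hide the underscores again when showing the tag to the user."""
    return stored.replace("_", " ")


def split_terms(query: str) -> list[str]:
    """Option 2: split on whitespace but keep "double quoted" phrases together.

    Inside quotes, \\" stands for a literal quotation mark.
    """
    lexer = shlex.shlex(query, posix=True)
    lexer.whitespace_split = True
    lexer.quotes = '"'      # assumption: only double quotes group terms
    lexer.commenters = ""   # '#' is an ordinary character in a search
    return list(lexer)


print(normalize_tag("a happy new year"))       # a_happy_new_year
print(split_terms('"a happy new year" cat'))   # ['a happy new year', 'cat']
print(split_terms(r'"\"wow thats cool\"" dog'))  # ['"wow thats cool"', 'dog']
```

Restricting `quotes` to `"` keeps apostrophes in names such as `girls'_frontline` from being read as an opening quote.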

Alternatives

No response

@mm12 mm12 added the Type: Enhancement New feature or request label Jul 20, 2024
@CyanVoxel
Member

I agree with your suggestions on this, however there's already several open issues and PRs pertaining to these features.
Boolean search is currently being tracked via #225 + #314 and is being implemented in #284/#310 with metadata search being tracked via #272 and being implemented in #284. I believe that grouping, escaping characters, and addressing tags with spaces are all coming with one or more of these PRs as well.

@CyanVoxel CyanVoxel added Type: Duplicate This issue or pull request already exists TagStudio: Search The TagStudio search engine labels Jul 20, 2024
@KillyMXI

Better written issue comes late though, as it often happens...

There were some discussions in Discord as well. Maybe some of that needs to be summarized in GitHub somehow.

I don't like the accidental nature of PRs for such a fundamental feature, without settling on the design first. (And I'm not speaking about UI.)
Both PRs went with very different considerations in mind.

I'd probably go as far as suggesting to make a (E)BNF description of the search syntax before attempting to implement it. And there are other considerations as well, such as familiarity to random user, syntax extensibility and compatibility with possible UI helpers.

@mm12
Author

mm12 commented Jul 20, 2024

I don't like the accidental nature of PRs for such a fundamental feature, without settling on the design first. (And I'm not speaking about UI.) Both PRs went with very different considerations in mind.

yes

I'd probably go as far as suggesting to make a (E)BNF description of the search syntax before attempting to implement it. And there are other considerations as well, such as familiarity to random user, syntax extensibility and compatibility with possible UI helpers.

I was actually thinking about this - in all honesty, I might make a PR of a markdown file detailing search syntax. Though, I still have not used TagStudio enough to be confident in it. If we want to do this, one thing I would like is a general consensus on reserved characters and such, though we could just update it before merging it.

@mm12
Author

mm12 commented Jul 20, 2024

perhaps something like this :

Search syntax

This section describes the (planned) search syntax used in TagStudio

General structure

Searches are parsed from the innermost group outward, then left to right. E.g., in a (b (c) d), c is parsed first, then b, then d, then a. Terms are space-separated, using " to escape spaces where needed. Note that the entire query can be treated as one large group.
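
As a rough illustration of the parse order described above, the sketch below turns a query into nested lists and then lists the terms innermost group first, left to right. The function names are hypothetical, and the sketch ignores quoting and treats every parenthesis as a grouping symbol, which a real parser could not do for tags that contain parentheses.

```python
def parse_groups(query: str) -> list:
    """Turn 'a (b (c) d)' into ['a', ['b', ['c'], 'd']]."""
    tokens = query.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]  # stack[-1] is the group currently being filled
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unmatched ')'")
            group = stack.pop()
            stack[-1].append(group)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unmatched '('")
    return stack[0]


def evaluation_order(group: list) -> list[str]:
    """List terms with nested groups first, then the group's own terms."""
    nested = [t for g in group if isinstance(g, list) for t in evaluation_order(g)]
    own = [t for t in group if isinstance(t, str)]
    return nested + own


print(parse_groups("a (b (c) d)"))                     # ['a', ['b', ['c'], 'd']]
print(evaluation_order(parse_groups("a (b (c) d)")))   # ['c', 'b', 'd', 'a']
```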

Boolean operators

By default, search terms are in the AND mode. For example, cat fox would find entries with both cat and fox.

  • NOTing terms: This is almost always done with the - symbol, though occasionally done with NOT or !. Example: cat -fox.
  • ORing terms:
    Common implementations:
    • or (English/Python style, cat OR fox) - advantage of being readable, at the cost of visibility (it looks like normal text). This also means that you cannot build the search linearly: when you find OR, you need to go back and change which category the previous term is in.
    • | or || (programming style, cat | fox) - advantage of being visible and easy, but you still need to do more parsing.
    • ~ (Booru style, ~cat ~fox) - this has the advantage of making it obvious which items are included in the OR and which are not, which is good for users who may not be familiar with order of operations (dog ~cat ~fox vs dog cat | fox). Disadvantage of not being able to start a tag/text with ~ without escaping it.
  • Grouping - This is almost always implemented with () - e.g., ~(cat -fox) ~(fox -cat) would act as an exclusive OR - it finds items that have only one of cat and fox. Without grouping, this would take multiple searches.
  • Special operators: [] is reserved for possible future functionality with nested tags.
    • An entry with big fox and little cat would match any combination of (~big ~little) (~fox ~cat). This term might be used for something like fox[big] cat[little] to specify that a nested tag must belong to a specific parent.
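
To make the operator semantics above concrete, here is a minimal sketch of an evaluator for the implicit AND, the - prefix, the Booru-style ~ prefix, and () grouping. It is an assumption-laden illustration rather than a proposed implementation: `matches` is a hypothetical name, the query is assumed to be well formed, and quoting, escaping, NOT/!, and the reserved [] syntax are left out.

```python
def matches(query: str, tags: set[str]) -> bool:
    """Evaluate a query using implicit AND, '-' for NOT, '~' for OR and () groups."""
    tokens = query.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def group() -> bool:
        nonlocal pos
        all_required = True   # implicit AND over plain and negated terms
        any_of = []           # results of the '~' terms in this group
        while pos < len(tokens) and tokens[pos] != ")":
            tok = tokens[pos]
            pos += 1
            prefix = ""
            if tok[0] in "-~":
                prefix, tok = tok[0], tok[1:]
            if tok == "":                 # the prefix stood directly before "("
                tok = tokens[pos]
                pos += 1
            if tok == "(":
                value = group()
                pos += 1                  # skip the closing ")"
            else:
                value = tok in tags
            if prefix == "~":
                any_of.append(value)
            elif prefix == "-":
                all_required = all_required and not value
            else:
                all_required = all_required and value
        return all_required and (any(any_of) if any_of else True)

    return group()


xor = "~(cat -fox) ~(fox -cat)"
print(matches(xor, {"cat"}))          # True
print(matches(xor, {"cat", "fox"}))   # False
```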

Searching fields

By default, search terms apply to tags. There are a few special exceptions:

  • empty and no fields find entries that have not had any information attached.
  • untagged and no tags find entries that may or may not have had information attached, but none of them are tags.

Common field attributes:

Fields have attributes too - searching them is commonly done in these ways. Note that <field> and [options] exist to demonstrate the syntax, and are not literals.

  • Boolean:
    • has_<field>:[True|False] - search whether an entry has a field on it. E.g., has_date:False finds entries that do not have a date attached.
    • <field>_count:[value] - search based on how many items are in a field (eg, how many tags there are)
    • in_<field>:[True|False] - the field is a collation. in_<field>:False matches all entries that are not in ANY collation
  • Matches: <field>:[value] - find entries where the field has a given value.
    • Default: <field>:"[value]" finds entries where the field CONTAINS the value for text entries. Using * matches all here (eg, a_*_cat matches anything where * is)
    • RegEx: Using RegEx syntax as the value (/regex/) searches with regex. Limited to text entries.
  • Comparisons and Ranges: (numerical values):
    • BETWEEN: <field>:[min]..[max] searches for entries where field's numerical value is between min and max, inclusive.
    • Inequalities: field: followed by >,<,>=, or <= search for their respective inequalities. Eg, date:<2024 finds where the date is before 2024.
  • order:[term|<supported field>] - This special operator specifies how the search should be sorted. In the case where it is specified multiple times, the first takes precedence (i.e., subsequent items are sub-sorts, applied where the value returns the same after the first pass). order is processed last, and does not support grouping (due to negations and ORs being applied to groups).
    • terms might include random, tagCount, aspect_ratio, duration, etc., which can be used to order results as desired.
    • supported fields can also impact the order: dateAdded, title are examples.
    • If multiple order terms are present (eg, order:tagCount order:title), they are processed in the order of the operations. The items are sorted using the first order in the group (by tagCount) - then any items that have the same tagCount are sorted by title.
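
The value forms listed above (substring with * wildcard, /regex/, min..max ranges, and inequalities) could be told apart roughly as in the sketch below. `make_predicate` is a hypothetical helper, numbers are parsed with `float` for simplicity, and the wildcard match is delegated to the standard `fnmatch` module, which also gives `?` and `[` special meaning that the draft spec does not mention.

```python
import fnmatch
import re


def make_predicate(value: str):
    """Build a test function for the [value] part of <field>:[value]."""
    # RegEx mode: /pattern/
    if len(value) >= 2 and value.startswith("/") and value.endswith("/"):
        pattern = re.compile(value[1:-1])
        return lambda v: isinstance(v, str) and pattern.search(v) is not None
    # Inclusive range: min..max
    if ".." in value:
        low, high = (float(x) for x in value.split("..", 1))
        return lambda v: low <= v <= high
    # Inequalities: the two-character operators are checked first
    comparisons = (
        (">=", lambda a, b: a >= b),
        ("<=", lambda a, b: a <= b),
        (">", lambda a, b: a > b),
        ("<", lambda a, b: a < b),
    )
    for op, test in comparisons:
        if value.startswith(op):
            bound = float(value[len(op):])
            return lambda v, t=test, b=bound: t(v, b)
    # Default: "contains" match, with * matching anything
    return lambda v: fnmatch.fnmatchcase(str(v), f"*{value}*")


print(make_predicate("<2024")(2023))                   # True
print(make_predicate("1..5")(6))                       # False
print(make_predicate("a_*_cat")("a_big_cat_photo"))    # True
```

A text value that itself contains `..` or starts with `<` would be misread by this sketch, which is one reason the escaping rules below matter.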

Tag/field rules

Generally, tags with characters that are used for search (spaces, ", :, ~, -, (), /, etc.) should be avoided if possible. Specifically, in the following situations, they will need to be escaped:

  • " - Will always need to be escaped.
  • Tags starting with boolean operators (~, ! and/or -, etc.) will always need to be escaped
  • In tags that contain :, the : must be followed by either _, a space, the colon must be escaped, or the tag must be escaped in quotes.
    • This excludes the first character of the tag.
  • Any tag containing unmatched grouping symbols ( should be escaped.
  • Text fields starting/ending in / will need to be escaped.

@samuellieberman
Contributor

samuellieberman commented Jul 20, 2024

Very nicely written @mm12. I have just one big problem with your suggestion.

Tag rules

Generally, tags with characters that are used for search (spaces, ", :, ~, -, (), /, etc.) should be avoided if possible. When searching items with these characters, most will need to be escaped with \. Spaces can also be escaped by putting the tag name in quotes ("tag name here")

A lot of the suggested forbidden characters are super useful and popular in tag names and shouldn't have to be escaped. Some notable tag examples from Danbooru: :3, \(^o^)/, fate_(series), fate/grand_order, girls'_frontline, two-tone_hair...

The only restrictions I support are the following:

  • All whitespace in tags must be escaped.
  • Tags that start with ~, -, ! or an unmatched ( must be escaped.
  • Tags that are exactly equal to an existing operator must be escaped.
  • Tags that exactly overlap with a field search must be escaped.
    • This should only apply for currently existing fields, so honkai:_star_rail should not need to be escaped unless there is an entry with a honkai field.

And if we want to allow wildcards in tag searches:

  • All asterisk * characters must be escaped.
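
Read literally, the restrictions listed above amount to a small check like the one sketched here. It is only an illustration of the proposal: `needs_escaping` and the `OPERATORS` set are assumptions (the thread has not settled on textual operators), and "unmatched (" is approximated by comparing the counts of opening and closing parentheses.

```python
OPERATORS = {"AND", "OR", "NOT"}  # assumption: whichever textual operators exist


def needs_escaping(tag: str, fields: set[str], wildcards: bool = False) -> bool:
    """Return True if the tag must be escaped under the proposed restrictions."""
    if any(ch.isspace() for ch in tag):
        return True                                   # whitespace
    if tag[:1] in {"~", "-", "!"}:
        return True                                   # leading operator character
    if tag.startswith("(") and tag.count("(") > tag.count(")"):
        return True                                   # leading unmatched "("
    if tag in OPERATORS:
        return True                                   # exactly equal to an operator
    prefix, sep, _ = tag.partition(":")
    if sep and prefix in fields:
        return True                                   # overlaps an existing field search
    if wildcards and "*" in tag:
        return True                                   # only if wildcards are enabled
    return False


print(needs_escaping("fate_(series)", set()))              # False
print(needs_escaping("honkai:_star_rail", set()))          # False
print(needs_escaping("honkai:_star_rail", {"honkai"}))     # True
```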

Also, can you please clarify what you meant when you said this?

  • Special operators: [] is reserved for functionality with nested tags. Currently, all functionality can be achieved with grouping and basic boolean operators.

And I don't understand what you meant when you said this:

  • order:[value|<supported field>] - This special operator specifies how the search should be sorted. In the case where it is specified multiple times, the first takes precedence (i.e., subsequent items are sub-sorts, applied where the value returns the same after the first pass).

@mm12
Author

mm12 commented Jul 20, 2024

A lot of the suggested forbidden characters are super useful and popular in tag names and shouldn't have to be escaped. Some notable tag examples from Danbooru: :3

Good point. To clarify, I didn't mean they should always be escaped, just that they often will need to be, in some contexts. Emote tags (:3, \(^o^)/, etc.) will be fine. Needing to escape them is contextual. It matches the restrictions you propose, with the exception that in the case of things like abc:efg, it does not matter if abc is a valid field or not. Unless the colon is followed by an _ or an escaped space, it must be escaped.
o\:3 o:_3 "o: 3" "o:3" - OK
o:3 - not ok
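
The colon rule illustrated by these examples could be checked along the lines of the sketch below. `colon_ok` is a hypothetical name; the function treats a term wrapped in double quotes as already escaped, skips a colon in the first position, and accepts a colon only if it is preceded by a backslash or followed by `_` or a space.

```python
def colon_ok(term: str) -> bool:
    """Return True if every ':' in the term is acceptable without more escaping."""
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        return True                      # the whole term is quoted
    for i, ch in enumerate(term):
        if ch != ":" or i == 0:
            continue                     # the first character is exempt
        if term[i - 1] == "\\":
            continue                     # escaped colon
        if term[i + 1:i + 2] in ("_", " "):
            continue                     # colon followed by _ or a space
        return False
    return True


for term in ["o\\:3", "o:_3", '"o: 3"', '"o:3"', ":3", "o:3"]:
    print(term, colon_ok(term))
```

Under this reading only the last term, o:3, is rejected.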

Also, can you please clarify what you meant when you said this?

All the functionality that the current tag system supports can be described by the spec (from my understanding). In the case where the system is upgraded, [] are there to aid in tag parent/child relationships.

And I don't understand what you meant when you said this
the order metatag allows you to specify the order of results. Potential values might be terms such as random, tagCount, etc., or might be fields that support sorting, such as dateAdded or title
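
A minimal sketch of how several order terms could combine, assuming entries are plain dictionaries and that every order term names a sortable key (both assumptions made only for illustration): the first term is the primary sort key and later terms break ties.

```python
def apply_order(entries: list[dict], order_terms: list[str]) -> list[dict]:
    """Sort by the first order term, using later terms only to break ties."""
    return sorted(entries, key=lambda entry: tuple(entry[term] for term in order_terms))


entries = [
    {"title": "zebra", "tagCount": 1},
    {"title": "apple", "tagCount": 2},
    {"title": "mango", "tagCount": 1},
]

# Equivalent to the query fragment: order:tagCount order:title
ordered = apply_order(entries, ["tagCount", "title"])
print([e["title"] for e in ordered])   # ['mango', 'zebra', 'apple']
```

Terms such as random would need their own handling and are not covered here.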

I will edit the comment to clarify these.

samuellieberman added a commit to samuellieberman/TagStudio that referenced this issue Jul 22, 2024
…StudioDev#272 and TagStudioDev#325)

Adds ability to check the existence of fields of any type using the following syntax:
```has_<field>```
```has_<field>:<True|False>```
Adds the ability to search the content of text_line and text_box fields using the following syntax:
```<field>:<text>```
(Replace whitespace with underscore _ in `text`)
Updated test_search.py for new behavior.
@samuellieberman
Contributor

Hey @mm12, I just got done implementing some field search syntax in my PR #310, and I was hoping to get your input on it.
Specifically, I implemented the following:
<field>:[value]
has_<field>
has_<field>:[True|False]

I also have boolean operators implemented pretty much identically to your suggestions from previous commits, except all tags and parentheses need trailing whitespace in my syntax. I haven't implemented quotation marks, wildcards, or regular expressions in my syntax.

I notice that you still say that unmatched parentheses and colons without spaces should be escaped in your current spec. In order to try to understand your reasoning, my current PR ignores these restrictions. If you clone my repository, try searching for >:) or NieR:Automata to see what I mean. I am very interested to hear if you would point out any issues with my syntax due to allowing these sorts of tags without escaping. Thanks!

@mm12
Author

mm12 commented Jul 23, 2024

Hey @mm12, I just got done implementing some field search syntax in my PR #310, and I was hoping to get your input on it.
It looks like it is what we need right now - adding functionality that is needed but lacking.

Though, what I am looking to do is make this application scalable in terms of entry count and search complexity. To address this, I have 2 suggestions that can be a starting point here:

  1. Storing stuff in a giant JSON file (and parsing it) costs a lot. We should really be using SQL or something. The DB file can be stored in the same place.
     • This takes some effort. If we do this, I would hope we would have better import utilities than we have now.
  2. Writing our own search sucks. We should use something like Elastic or OpenSearch. Though, my experience with them is in Rails, not Python, so take that for what you will.
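
To give the first suggestion some shape, here is a small sketch using Python's built-in `sqlite3` module. The schema, table names, and sample rows are invented for illustration and are not TagStudio's actual or planned data model; the query shows how a search like cat -fox could be pushed down to the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real library would use a file next to the library
conn.executescript(
    """
    CREATE TABLE entries (id INTEGER PRIMARY KEY, filename TEXT NOT NULL);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE entry_tags (
        entry_id INTEGER NOT NULL REFERENCES entries(id),
        tag_id INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (entry_id, tag_id)
    );
    """
)
conn.executemany("INSERT INTO entries VALUES (?, ?)", [(1, "a.png"), (2, "b.png"), (3, "c.png")])
conn.executemany("INSERT INTO tags VALUES (?, ?)", [(1, "cat"), (2, "fox")])
conn.executemany("INSERT INTO entry_tags VALUES (?, ?)", [(1, 1), (1, 2), (2, 1), (3, 2)])


def search(conn: sqlite3.Connection, include: str, exclude: str) -> list[str]:
    """Return filenames tagged with `include` but not with `exclude`."""
    has_tag = (
        "EXISTS (SELECT 1 FROM entry_tags et JOIN tags t ON t.id = et.tag_id "
        "WHERE et.entry_id = e.id AND t.name = ?)"
    )
    sql = f"SELECT e.filename FROM entries e WHERE {has_tag} AND NOT {has_tag} ORDER BY e.filename"
    return [row[0] for row in conn.execute(sql, (include, exclude))]


print(search(conn, "cat", "fox"))   # ['b.png']
```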

@KillyMXI

Storing stuff in a giant JSON file (and parsing it) costs a lot. We should really be using SQL or something. The DB file can be stored in the same place.

Planned Features > Database Migration

use something like Elastic or OpenSearch

Good point. I've no experience with them. If they come with own query language or the query language needs to be adapted somehow - it will be good to think about it now.
But how do they work? Are they applicable to a diverse set of small metadata (tags, fields), compared to data consisting mostly of long text?
I'd imagine full text search might be only a sub-feature...
Excuse my ignorance if I'm saying something dumb here.

@KillyMXI

Preserving relevant discussion from Discord.
Happened between July 9th and 11th.

Start of the discussion: https://discord.com/channels/1229183630228848661/1229309667528806420/1260037269297827931

THEHWIZ — 07/09/2024 4:58 AM

I’m super excited about giving tagstudio a go, and probably contributing/modifying my own version down the road. I was looking through the repo and couldnt quite tell so I figured I would ask, does the search currently support regex or are there plans to add regex search in the future? Or any sort of broader fuzzy search I guess?

Sam L. — 07/09/2024 5:17 AM (reply to THEHWIZ)

The search is really limited right now. It doesn't even support excluding tags, just "And" and "Or" mode for lists of tags. I also haven't seen any sort of GitHub issue or guidance from CyanVoxel proposing any specific syntax. Right now we have two pull requests that add some sort of boolean syntax, but they are very different and neither has been merged with the main project.
Lunaro-4's PR: #284
My PR: #310

I have some ideas for adding regex to my search syntax, but I'm hoping for more input from CyanVoxel on my pull request before I start adding more things to it. I'd love to get some feedback from you too if you are interested in checking out my pull request.

CyanVoxel — 07/09/2024 5:49 AM

My bad for being a bit slow with guidance on these, I think there was some very old discussions on the discord but it’s something I need to be more clear about with these newer PRs in the works.

From my understanding, #284 adds the pipe operator (|) as the OR syntax, although I feel having (||) in addition or instead of this would also make sense to a lot of people coming from that specific syntax. Likewise having the string “OR” in addition to the pipes would be intuitive, however this can make things tricky since that’s likely to be text inside of tag names or metadata. The AND is implicit (at least in the current default AND mode) but if for some reason this assumption changes, I would expect the syntax to mirror the OR syntax with “&” and/or “&&” as well as “AND” but yet again the “AND” string makes things tricky. #310 seems to be pretty in line with how I’d expect this to work, including the “NOT”/“-“/“!” Operators, parentheses, plus the “~” operator and more. Also before I forget, with the addition of this syntax I don’t see the need for having the “AND/OR” mode dropdown any longer.

I know tags with spaces was brought up again too, that one I had a longer answer for somewhere here in the discord but to sum it up it came down to being unnecessary in the long term since the UI is planned to support tag autocompletion that will show the actual tag graphic in the search bar on the frontend, while internally the syntax will reference the tag by ID. I don’t think there will be a remaining case where multiple tags will need to be searched by plaintext only after this, but if for some reason there is then having the tag name be in quotes while searching seems like the right direction to me.

Sorry for the lack of feedback on some of this stuff recently, I’m focusing on some other things at the moment before bringing most of my attention back to TagStudio. I just wanted to take a moment to throw these thoughts out there while I’m here
I haven’t entirely looked over #310 yet but I know there’s some discussion centering around some certain syntax and concepts. I’ll get to giving some more specific feedback at some point, plus I’ll help coordinate #284 and #310 since there’s some overlap but valuable additions in both. I may pull #284 into a new branch and have #310 work off of that before bringing it into 9.4/9.5

Killy — 07/09/2024 1:44 PM

Agree on unnecessary "AND/OR" dropdown. Agree on caution about which one is implicit. In case of fuzzy search it is less obvious thing...

But textual "AND", "OR" operators was something I thought to support in my GitHub comments.
In my understanding, TagStudio audience can include more artists than tech people, and so I'd focus on something that is clear for anyone without certain prior background.
Imagine using ∧ and ∨ symbols (conjunction and disjunction) for Boolean operations. For the life of me, I can't remember which one is what, even being a mathematician by education. & and | are not that bad and are pretty clear, partial appearance in plain language does help, but I wouldn't discard "AND", "OR", "NOT" operators.
Especially considering tags with spaces issue will be handled through autocompletion and so ambiguity can be avoided with least user effort.

I also suggested to have things namespaced for clarity - that will reduce possible ambiguity.
Even tags may have a namespace, but since they are the main thing, it is reasonable to assume they can have it optional. (But are there to help in case of ambiguity)
PR #310 showcases filename:, PR #284 showcases description: namespace. All can be a part of common pattern in the query language design.

For tags with spaces syntax, quotes seems a more commonplace search syntax and doesn't require much explanation. snake_case is pretty convenient in boorus, in combination with namespaces. If I were trying to support both - there might be some combinations to decide on syntax for...

I agree on confusion having both "&" and "&&" in context of search queries. Other operators also seem to have some unconventional use (in PR #310) that is currently not documented. I'd prefer to have mandatory text operator first, to avoid any ambiguity, then think whether there is a universally understandable symbol as an alternative.
Being obvious good, being clever bad.

Killy — 07/09/2024 1:55 PM

In general, the first thing to care about is a good foundation that is very simple on its own and allows for extensibility.
Adding all the clever things from the start without consideration of their interplay will make it impossible to evolve in the future for new requirements.
(Less of an issue for search query language, but you will still have to re-teach users.)

Killy — 07/09/2024 2:08 PM

Regexes and wildcards typically have a toggle near a search bar in apps where they are supported.

This brings me to another point about search query language: the less escape syntax you need - the better.

In CSS selectors fashion, toggles for different local syntax can be brought into the language in form of function-like wrappers.

Yoylo — 07/09/2024 2:56 PM

As the person that made that "and/or" dropdown I wanted to note that that has always been planned to be removed as soon as syntax based logic got added

Killy — 07/09/2024 4:00 PM

Curiously, looks like ~ is legacy in danbooru.
I'd assume they tried and found out it wasn't so much of a good idea

gawi. — 07/09/2024 5:14 PM

Maybe I'm but a simple man, but I can't quite see myself getting much mileage out of conditional expressions that I wouldn't also be able to achieve with just "tags" and "-tags"
I'm guessing it would be the type of thing that becomes more valuable once you are working with a massive library that is actively being maintained
Don't let me be a demotivator, theory crafting on this must be worth doing if it's something so many different people have thought would be a good idea. I ultimately imagine it would be a lot easier to find the best solution for this if some other aspects of the software were settled on I guess

Killy — 07/09/2024 6:41 PM

Fun fact nobody asked for:
it is possible to construct any logical expression with just AND and NOT operators. It just becomes impractical instantly.
A or B is the same as -(-A -B)
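
The identity can be checked directly on tag sets. In this sketch `has` is a hypothetical helper and the sample entries are made up; it compares a plain OR against the rewritten form -(-A -B) for every entry.

```python
def has(entry_tags: set[str], tag: str) -> bool:
    return tag in entry_tags


def a_or_b(entry_tags: set[str]) -> bool:
    return has(entry_tags, "A") or has(entry_tags, "B")


def rewritten(entry_tags: set[str]) -> bool:
    # -(-A -B): NOT (NOT A AND NOT B), using only AND and NOT
    return not ((not has(entry_tags, "A")) and (not has(entry_tags, "B")))


entries = [set(), {"A"}, {"B"}, {"A", "B"}]
print(all(a_or_b(e) == rewritten(e) for e in entries))   # True
```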

gawi. — 07/09/2024 11:04 PM

oh, i wasnt even thinking that far lol

Sam L. — 07/10/2024 2:24 AM (reply to Killy)

Even more fun fact, technically all I really needed to make universal boolean syntax would be NAND. I think it would be funny to remove the "And"/"Or" modes and replace them with a "NAND" mode, removing all syntax besides parentheses. This is obviously terrible, but I think it illustrates what the real goal is here. Also, I'm not a huge fan of escape characters either, I'm already thinking about how I could replace the one that I have now. I like the idea of supporting tag:"blah blah" or tag:'blah blah' type syntax, as that would allow me to support whitespace and do away with using slashes as escape characters to indicate when something is a tag. But then what if the tag wants to include a quotation mark in it? Then you need to escape quote-whitespace with like quote-quote-whitespace or something. Fun stuff.
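
For the curious, the NAND remark can be verified in a few lines. The function names below are illustrative only; each operator is built from `nand` alone and checked against Python's own operators over the full truth table.

```python
from itertools import product


def nand(a: bool, b: bool) -> bool:
    return not (a and b)


def not_(a: bool) -> bool:
    return nand(a, a)


def and_(a: bool, b: bool) -> bool:
    return not_(nand(a, b))


def or_(a: bool, b: bool) -> bool:
    return nand(not_(a), not_(b))


for a, b in product([False, True], repeat=2):
    assert not_(a) == (not a)
    assert and_(a, b) == (a and b)
    assert or_(a, b) == (a or b)
print("NAND is enough")
```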

Sam L. — 07/10/2024 2:28 AM (reply to gawi)

Don't worry about demotivation, the only reason I started working on all this was because I genuinely tend to use more advanced search for my own needs. It sucks to have a search that you really want to do, but then to butt up against the semantic limits of the search bar you are relying on.

Killy — 07/10/2024 2:35 AM (reply to Sam L)

Yeah, quotes are like minimal unavoidable escape syntax. Except...
approach 1: ' \' ' is ' (escaped by backslash)
approach 2: ' '' ' is ' (escaped by doubling)
approach 3: " ' " is ', ' " ' is " (other quotes, don't work when you need both inside)
approach 4: '' ' '' is ' (repeat outer quotes as many times as needed, anything less is a part of content)
The last one avoids escaping in the same definition. Although so far I only seen it in larger languages
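
A rough sketch of approach 4, under the assumption that the closing delimiter is the first run of quotes as long as the opening run: `read_raw_quoted` is a hypothetical helper, and it deliberately does no trimming of padding spaces, so the example from the list above yields the quote surrounded by its two spaces.

```python
def read_raw_quoted(text: str, quote: str = "'") -> tuple[str, str]:
    """Read one approach-4 string from the start of text; return (content, rest)."""
    n = len(text) - len(text.lstrip(quote))   # length of the opening run of quotes
    if n == 0:
        raise ValueError("text does not start with a quote")
    end = text.find(quote * n, n)             # first run of n quotes after the opening
    if end == -1:
        raise ValueError(f"no closing run of {n} quotes")
    return text[n:end], text[end + n:]


print(read_raw_quoted("'blah blah' cat"))   # ('blah blah', ' cat')
print(read_raw_quoted("''it's'' cat"))      # ("it's", ' cat')
print(read_raw_quoted("'' ' ''"))           # (" ' ", '')
```

Content that begins or ends with the quote character, and the empty string, are not representable in this simple form.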

Gherkin — 07/10/2024 2:39 AM

to add my two cents, as a user I would be perfectly fine with offloading some logic to tag creation time, in order to make searching syntax and logic easier. like only allowing tag names with a certain set of characters.

Killy — 07/10/2024 2:40 AM

Yeah, forgot to mention - another approach - limit the set of characters

Gherkin — 07/10/2024 2:40 AM

users are used to file names, discord channels, urls, etc to have a limited syntax, and probably wouldn't expect unlimited text support in tag names

Sam L. — 07/10/2024 2:41 AM (reply to Killy)

I have never seen or heard of approach 4, that's fascinating to me. I was mostly considering a mix of approach 2 and 3. And also, since outside the quotes would still require whitespace separation, I figured that I could allow un-escaped quotes that appear before more text.

Killy — 07/10/2024 2:41 AM

especially handy if snake_case will remain

Sam L. — 07/10/2024 2:42 AM (reply to Killy)

I don't like the idea of forcing a limited character set on all users. I like the idea of letting users limit their own characters if they want to avoid escapes, and then letting other users spam emoticons if they really feel like it.

Killy — 07/10/2024 2:43 AM (reply to Sam L)

Markdown party trick: `

Gherkin — 07/10/2024 2:43 AM

emoji in tag names is so cursed 😂😭
my kebab-case-ass would never

NiX — 07/10/2024 2:44 AM (reply to Gherkin)

Maybe we can use them as tag icons?🤔

Sam L. — 07/10/2024 2:46 AM (reply to Gherkin)

Emoji is already supported I'm pretty sure. I haven't tried it. I'm talking emoticons. Like this :^) Danbooru has a bunch of emoticon tags already, though it's not possible to search for emoticons that start with parenthesis, minus, tilde...
By the way, I just tried it. Emoji tag and search is supported in TagStudio.

NiX — 07/10/2024 3:10 AM (reply to Sam L)

They're just \u0000-FFFF character/s, some are a combination, some are standalone

Killy — 07/10/2024 3:12 AM

In C#, raw strings use the approach 4: https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/tokens/raw-string
Of other languages, Rust, Swift and D have somewhat related variations.
C# devs probably regret not coming up with this earlier, among many other things...

Gherkin — 07/10/2024 3:29 AM

python strings are all utf-8, so fancy characters come for free!
still don't think you should be putting weird characters in your tag names tho 😂

NiX — 07/10/2024 3:33 AM (reply to Gherkin)

[[Free cheese!]] Is only available in [[A giant Mousetrap!]]

Sam L. — 07/10/2024 3:33 AM (reply to Killy)

The only problem I can think of with the multi-quote thing is what do you do if the tag is entirely quotes? How is the code supposed to decide whether 11 quotes is 1 quote with 5 quotes on each side, or 3 quotes with 4 quotes on each side?

NiX — 07/10/2024 3:34 AM (reply to NiX)

||yes, I'm a spamton fan||

NiX — 07/10/2024 3:36 AM (reply to Sam L)

Display as is?

Gherkin — 07/10/2024 3:37 AM (reply to Sam L)

don't let users make a tag of all quotes lol ''''''''''' is stupid

Killy — 07/10/2024 3:39 AM (reply to Sam L)

'''' ''' '''' with additional rule to trim one space when the next character is the same quote

Sam L. — 07/10/2024 3:39 AM (reply to Gherkin)

Shoot, and how do I parse -tag:""""" monkey tag:"""""? Is the monkey a separate tag in middle? Or was the true tag just meant to be "" monkey tag:""?

NiX — 07/10/2024 3:42 AM (reply to Sam L)

Or was it monkey tag: all along?🤔

Sam L. — 07/10/2024 3:42 AM (reply to Killy)

How does the user search for a tag that's just space? if the user types tag:" " monkey tag:" " then were they leaving the gap because they wanted the quotes to be part of the tag? Or because they just wanted a space in the tag?
Whoops, that last one doesn't make sense.
Never mind? Not sure if there is a problem where I thought there was.

NiX — 07/10/2024 3:45 AM (reply to Sam L)

In any case, I think this edge case is supposed to be just 1 ' with 5 ' at either side, the approach of least feasible amount is just too complicated, just use the max continuous amount of ' from either side

Killy — 07/10/2024 3:47 AM (reply to NiX)

This might not generalize well. Makes it require to look ahead and know where the other end is

Sam L. — 07/10/2024 3:48 AM (reply to Killy)

I'm pretty confident in my regex at this point, personally I'm just worried about how ugly it is, and how to minimize all the dumb edge cases that need to be explained to the user.
Not to mention what to do if the user miss-types something.

NiX — 07/10/2024 3:48 AM (reply to Killy)

We already know the length of the string, we can treat it as an array and use [-x] to navigate from the end

Sam L. — 07/10/2024 3:49 AM

*mis-types 😉

NiX — 07/10/2024 3:52 AM

And, to generalize, we can just recu-
Recursion is bad, rethinking...

  • Find the longest continuous amount of ' or " at the front
  • Find the same from the end
  • Check which has less characters
  • Profit
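
The steps above can be sketched as follows. `strip_outer_quotes` is a hypothetical name, and the cap at half the token length is an added assumption so that a token made only of quotes keeps a middle part (for 11 quotes, 5 on each side and 1 left over) instead of being consumed entirely.

```python
def strip_outer_quotes(token: str, quotes: str = "'\"") -> str:
    """Remove the same number of quote characters from both ends of a token."""
    if not token or token[0] not in quotes:
        return token
    q = token[0]
    front = len(token) - len(token.lstrip(q))   # run of quotes at the front
    back = len(token) - len(token.rstrip(q))    # run of the same quote at the end
    n = min(front, back, len(token) // 2)       # assumption: never overlap the two runs
    return token[n:len(token) - n]


print(strip_outer_quotes("''it's''"))   # it's
print(strip_outer_quotes('"a b"'))      # a b
print(strip_outer_quotes("'" * 11))     # '
```

Because it inspects both ends of the token, this only works once the token boundaries are already known.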

Sam L. — 07/10/2024 3:53 AM (reply to NiX)

I'm pretty sure Killy was talking about regular expressions, or regex. This is the code that I use to tokenize searches right now for example

Killy — 07/10/2024 3:57 AM (reply to NiX)

Sounds complicated.
When you design a language, if you can make it context free, parseable from left to right - it significantly simplifies parsing.
A parser that considers options - is possible, but the more possibilities you want to keep open - the more you will hate yourself later

NiX — 07/10/2024 3:58 AM (reply to Killy)

I'm a JS backend dev, I'm throwing out ideas 😭

Killy — 07/10/2024 4:00 AM

I have published a parser combinators toolkit package for JS/TS btw
(Albeit it desperately needs an update to Node 18 due to some node changes... Should be fine with Deno)

NiX — 07/10/2024 4:02 AM

I just hope that I might never have to use something like that

Killy — 07/10/2024 4:03 AM

Once you understand how parser combinators work - it becomes a pleasure to make small language grammars

Killy — 07/10/2024 4:49 AM

Let's say we are trying to parse -tag:""""" monkey tag:""""""""
And we have an algorithm that tries to interpret quotes in a greedy way - as many as possible that produces valid result.
What should it do?
Is it monkey tag:""" or am I trying to filter by tags " and ""? (Or actually " and empty string, because greedy parser won't allow "" as well)
So, I'd say, the main issue here is actually ambiguity (wrong word, it makes some intents not representable due to more eager alternate parsing)
Greedy algorithm is not suitable, by definition it has too much reach - you will eventually need to constrain, escape it in some way...

Rust and Swift are somewhat better in this edge case - by using a directional combination of characters: ##" and "##, where number of # can grow but quote is single - leaves less for guesses.
If quotes came in pairs like «» - that would've also helped.
But we aren't really used to put strings into anything but quotes, and we only have certain types of quotes readily available on all keyboards.

Extra rule with spaces allows to solve the literal edge case (quote and the edge of content string) in most clean way I can think of, while secretly conceptually similar to those that use combinations of symbols for quotes. Unlike grammars with mandatory extra symbols, being able to use single quote symbol most of the times is also convenient.

Killy — 07/10/2024 5:38 AM

Ok, I think I realized one issue.
Should empty string be a valid content?
Let's say we have an algorithm that is not greedy and has extra rule with spaces.
"" will be interpreted as an opening quote of size 2.
" " or "" "" will be most natural and safe to consider as a string containing a single space. Only trim space if it's not the only one between opening and closing quotes. Otherwise it will be a single space that is not representable, that is more weird.
Empty string might be representable in different ways though, like tag: , although that is coincidental

Sam L. — 07/10/2024 6:56 AM (reply to Killy)

I still like the doubling-as-escape idea. I made this regex here that works pretty well for me: (?P<tag>\S+:("|')(?:\2\2|(?!\2\s).)*\2|\S+)(?:\s|$) The idea is that if the same quote character is used, then all pairs of quotes are treated as one, and the first unpaired quote followed by whitespace closes out the token.
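To make the behaviour concrete, here is a small sketch that runs the regex above through Python's `re` module. The `tokenize` wrapper is a hypothetical helper for illustration only:

```python
import re

# The regex quoted in the message above, unchanged.
TOKEN_RE = re.compile(r"""(?P<tag>\S+:("|')(?:\2\2|(?!\2\s).)*\2|\S+)(?:\s|$)""")


def tokenize(query: str) -> list[str]:
    """Return the text of every token the regex finds, left to right."""
    return [m.group("tag") for m in TOKEN_RE.finditer(query)]


# A namespaced, quoted value with doubled quotes stays in one token.
print(tokenize('artist:"Sam ""L"" Jones" cat'))
# A quoted string without a namespace is split at the space.
print(tokenize('"hello world"'))
```

The tokens still contain the doubled quotes; collapsing each pair to a single quote would be a separate step after tokenizing.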

Killy — 07/10/2024 7:08 AM

Yeah, doubling and backslash are simpler in implementation.
I'm not a fan of regexes though. Had some fun in the past with them, then seen not so great things made with them, then learned how to make more clean and scalable parsers differently

Sam L. — 07/10/2024 7:53 AM (reply to Killy)

You don't use regular expressions for tokenizing? What did you start doing instead?

K — 07/10/2024 2:47 PM (reply to Killy)

I can never get my head around regex. What's the hype nowadays?

Killy — 07/10/2024 2:49 PM (reply to Sam L)

It was too far into the morning, I should've phrased it differently.

I do try to avoid using regexes (or at least writing them manually) when possible, when the grammar allows. Quoted strings in the grammar are a case where a regex can simplify things.
But I also try to do the least amount of work during tokenization, making it as dumb as possible. Tokens don't have to pack as much as possible. The smaller they are, the easier it is to compose them later. Parser combinators make it easy to reconstruct meaningful logical units.

The regex (?P<tag>\S+:("|')(?:\2\2|(?!\2\s).)*\2|\S+)(?:\s|$) tries to capture too much for my taste. You may have hard time later, trying to expand it for different features. I'd rather have a string token, : token and quoted string token. It is parser's responsibility to see what they mean together.
The most complex regex fragment you'll have to deal with - will be ("|')(?:\2\2|(?!\2\s).)*\2 for quoted strings.
One thing that seems missing in your regex is a quoted string without a namespace. Another is an unquoted string with a namespace. You may further bloat your regex, or introduce separate regexes, or not bother about it in the tokenizer at all: just handle the grammar primitives and leave the composition to the parser.
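A rough sketch of what such a "dumb" tokenizer could look like in Python. The token kinds (`QUOTED`, `COLON`, `WORD`) and the `tokenize_small` name are assumptions made for illustration; only the quoted-string fragment is taken from the regex discussed above:

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# One small pattern per primitive; whitespace is matched but not emitted.
SMALL_TOKEN_RE = re.compile(
    r"""
    (?P<QUOTED>(?P<q>["'])(?:(?P=q)(?P=q)|(?!(?P=q)\s).)*(?P=q))
  | (?P<COLON>:)
  | (?P<WORD>[^\s:]+)
  | (?P<SPACE>\s+)
    """,
    re.VERBOSE,
)


def tokenize_small(query: str) -> list[Token]:
    """Split a query into primitive tokens; the parser decides what they mean."""
    tokens = []
    for m in SMALL_TOKEN_RE.finditer(query):
        for kind in ("QUOTED", "COLON", "WORD"):
            if m.group(kind) is not None:
                tokens.append(Token(kind, m.group(kind)))
                break
    return tokens
```

Under these assumptions `artist:"Sam L" cat` becomes a WORD, a COLON, a QUOTED and another WORD, and a bare `"hello world"` is a single QUOTED token. Whether WORD followed by COLON forms a namespace is then a parser decision.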

Killy — 07/10/2024 3:19 PM

: is a part of the grammar but can also be captured by \S+ if occurs multiple times.
This leads to accidental behavior (not necessarily broken, depending on luck) instead of explicitly deciding what that should mean.
Unquoted strings have implicit limitations on what they can contain, depending on the rest of the grammar. This can also be made explicit.

gawi. — 07/10/2024 6:15 PM

man
you ever think like, we could just create our own custom characters, added by a custom commands, for the sake of the search

Killy — 07/10/2024 7:49 PM

(might as well have a Boolean expression builder, like a Scratch program, except I haven't seen actually good examples)

Gherkin — 07/10/2024 10:55 PM

[tag-name]<TAB>[tag name with spaces]<TAB>[g@rb""""age__name]
TAB can autocomplete to closest matching tag, or give feedback as typing that a tag exists or not

mister — 07/11/2024 4:01 PM

Should the program instantly and explicitly give feedback whilst typing a query, assisting you in properly formatting your search input - and if the user escapes this assistant menu, we just treat the query completely raw? This is similar to Discord mobile (see video)
https://cdn.discordapp.com/attachments/1229309667528806420/1260928986486280225/RPReplay_Final1720699019.mov?ex=66a04461&is=669ef2e1&hm=b2fafe7124af91139660574e2d511818fe371db7cd336436ac7b6a15d7d17280&

Killy — 07/11/2024 4:58 PM

In the perfect world...
It takes a lot of effort to make custom controls.
There might be premade controls for tags autocompletion, but with limited functionality compared to possible scope. Using 3rd party control might stand in the way of expanding its capabilities. Typically, existing tag input controls just make a list of tag names. No Boolean logic - "AND" combinator is all you've got.
Also, Discord search box works poorly. It fails half of the times for me to autocomplete after selecting a "namespace".

Search query language that is raw text can be considered a "nerdy" solution in 2024, due to the demand for intellectual capabilities of users, but it is least amount of UI grinding and can be reused in different contexts. For example, in API, if that happens some day.
Some devs choose to avoid "nerdy" things at all costs. They then proceed to make UI monstrosities with inherent limitations and more problems.

Ideal scenario: do a pen and paper planning of what features can be delivered quickly and which can be added later, design a query language that is not only expandable, but also easy to reflect in UI helpers, then, after it works, add UI features that naturally reflect the language...

Killy — 07/11/2024 5:25 PM

How a UI query builder may look like, if query is a combination of terms (combinations of namespace and value) and Boolean operations.
It may have some reminiscence of a reduced, 1D version of https://en.wikipedia.org/wiki/Scratch_(programming_language).

Query building is an iterative process.
User is likely to start with some term, most likely a tag.
As user types, autocompletion shows possible matching terms (or Boolean operations).
On input confirmation (Enter, Space, whatever), input is transformed into a term "chip" (sometimes also called "badge").
(Extra cool requirement would be to be able to keep all text editable)
Each "chip" can have so called "adorners" to modify it - wrap into a new Boolean operation, remove, or append a value if it is already a Boolean operation.

Speaking of similarity between text and UI:
Infix operators are inherently binary - that can become inconvenient or challenging in certain scenarios, while keeping text and UI from diverging.

  • A or (B or C) - making it binary in UI (exactly two operands in a block) would convolute the UI by a lot of borders;
  • [A][or B][or C] - stacking binary operations in UI may offer additional challenges on group operations, like inverting whole expression;
  • A or B or C - making it a list-like in UI - still have to keep "or" as separators, proper autocompletion might be a challenge, due to how operators priority can affect the whole expression;
  • or(A B C) - function-like prefix form - offers a visual grouping and list-like nature in text - will be closer match between text and UI. Also most friendly for autocompletion. Operators priority issues avoided by necessary parentheses/borders, but there won't be as many as in infix+mandatory parentheses;
  • (or A B C) - lisp-like prefix form - probably no advantages nowadays.
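As an illustration of the function-like prefix form, here is a small Python sketch that parses queries such as `or(A and(B C))` into nested tuples and checks them against a set of tags. The operator names, the tuple layout and both function names are assumptions; nothing like this exists in the thread:

```python
import re

OPERATORS = ("and", "or", "not")


def parse_prefix(query: str):
    """Parse 'or(A and(B C))' into ('or', 'A', ('and', 'B', 'C'))."""
    tokens = re.findall(r"[()]|[^\s()]+", query)
    pos = 0

    def parse_expr():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok in ("(", ")"):
            raise ValueError(f"unexpected {tok!r}")
        is_call = (
            tok.lower() in OPERATORS and pos < len(tokens) and tokens[pos] == "("
        )
        if not is_call:
            return tok  # a plain term
        pos += 1  # skip "("
        args = []
        while pos < len(tokens) and tokens[pos] != ")":
            args.append(parse_expr())
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # skip ")"
        return (tok.lower(), *args)

    tree = parse_expr()
    if pos != len(tokens):
        raise ValueError("trailing input after expression")
    return tree


def matches(tree, tags: set) -> bool:
    """Evaluate a parsed query against the tags of one entry."""
    if isinstance(tree, str):
        return tree in tags
    op, *args = tree
    if op == "and":
        return all(matches(a, tags) for a in args)
    if op == "or":
        return any(matches(a, tags) for a in args)
    return not matches(args[0], tags)  # "not" uses its first operand only
```

Because every operator brings its own parentheses, the parser needs no precedence rules, and each tuple maps directly onto one bordered block in a UI builder.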

@KillyMXI

In #314 (comment) I'm contemplating about support for Set Theory operations in the search query.
It adds another point to the question about implicit, operator-less AND/OR:

Mathematically, a set containing A and B is written as {A, B}, equivalent to {A} ∪ {B}, but in Boolean terms that means "a set of elements that are in {A} or in {B}". The AND/OR correspondence is kind of murky.

As such, operator-less {A B} doesn't really contain or need the same assumption as A B, but any assumption on the latter will require different mental gymnastics to interpret it and can be perceived inconsistent in some other way.

@samuellieberman
Contributor

I have been thinking a lot about how tags, fields, and field contents are identified. Currently, @mm12's suggestion has quotation marks used to facilitate a more literal representation of tag and field identifiers:

General structure

[...] Terms are space-separated, using " to escape spaces where needed. Note that the entire query can be treated as a large group.

And @mm12's suggestion has a regular expression option for matching field contents:

Common field attributes:

  • [...]
  • RegEx: Using RegEx syntax as the value (/regex/) searches with regex. Limited to text entries.

My suggestion for string matching is this:

Use Cases

Tag identifiers, field identifiers, and field content would all use the exact same text matching system, except that field content would match possible substrings rather than needing to match the whole string like in the other two cases.

Delimiters

Surrounding an expression with a delimiter would allow users to include whitespace while giving an indicator for the syntax. Multiple delimiter options give users a way to avoid unnecessary escaping.

  • Single quotes '
  • double quotes "
  • back quotes `
  • forward slashes /

Different Syntaxes

  1. A wildcard syntax as the default without delimiters, or when using the prefix WC or wildcard with delimiters.
    • This would allow users to omit whitespace, or to replace it with underscores _ or dashes -.
    • This should also be case insensitive and allow users to omit punctuation, so that McDonald's can be searched as mcdonalds. This is inspired by the current (albeit not working) system in Tag Studio:
      def strip_punctuation(string: str) -> str:
          """Returns a given string stripped of all punctuation characters."""
          return (
              string.replace("(", "")
              .replace(")", "")
              .replace("[", "")
              .replace("]", "")
              .replace("{", "")
              .replace("}", "")
              .replace("'", "")
              .replace("`", "")
              .replace("’", "")
              .replace("‘", "")
              .replace('"', "")
              .replace("“", "")
              .replace("”", "")
              .replace("_", "")
              .replace("-", "")
              .replace(" ", "")
              .replace(" ", "")
          )
    • Additionally, this would introduce an asterisk * as a wildcard to match zero or more of any characters. This is inspired by the wildcard features that some boorus have, and doesn't require any escaping by default.
    • This is similar to glob patterns, but I see implementing the other glob-like wildcards, such as the question mark ? as a single-character wildcard or square brackets [...] as character classes, as unnecessary and counterproductive.
  2. A literal syntax as the default when surrounded by single quotes ', double quotes ", or back quotes `, or when using the prefix L or literal with forward slash / delimiters.
    • This would be case sensitive, and require every character to match, including whitespace and punctuation.
  3. A regular expression syntax as the default when surrounded by forward slashes /, or when using the prefix RE or regex with other delimiters.
    • This would essentially just pass the user's text to Python's (or SQLite's) regex engine with minimal reformatting.
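The wildcard syntax (item 1) could be sketched roughly as follows. This is only an illustration under assumptions: the stripped character set mirrors the `strip_punctuation` snippet above, `normalize` and `wildcard_match` are hypothetical names, and the match is against the whole string, as it would be for tag and field identifiers:

```python
import re

# Characters ignored by the wildcard syntax (taken from strip_punctuation above).
_IGNORED = str.maketrans("", "", "()[]{}'`’‘\"“”_- ")


def normalize(text: str) -> str:
    """Lower-case the text and drop whitespace and punctuation."""
    return text.casefold().translate(_IGNORED)


def wildcard_match(pattern: str, candidate: str) -> bool:
    """Whole-string match where * stands for zero or more characters."""
    # Normalize each literal piece separately so that * itself survives.
    pieces = [re.escape(normalize(piece)) for piece in pattern.split("*")]
    regex = ".*".join(pieces)
    return re.fullmatch(regex, normalize(candidate), flags=re.DOTALL) is not None
```

With this sketch, `mcdonalds`, `mc*` and `Mc Donald's` all match the tag `McDonald's`, while `donald` on its own does not, because identifiers have to match in full.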

Escaping

Escaping using backslash is actually kind of horrifying in this context, because there are three scenarios we would need to simultaneously accommodate with our system:

  1. Some number of backslashes appearing at the end of the string
  2. Some number of backslashes followed by the delimiter appearing in the middle of the string
  3. A significant number of backslashes being used to escape portions of a regular expression string

If we wanted to escape every backslash purely for the sake of simplifying the first two cases, then we are creating backslash hell for anyone who wants to use backslashes in regex syntax. And if we wanted to "pass through" backslashes to our syntax, we would have to do so selectively in order to accommodate the first two cases. Then we would suddenly have very different rules for strings of backslashes followed by delimiters compared to strings of backslashes standing on their own.

For these reasons, I prefer a "padding" approach. The only rules the user needs to know are that all pairs of delimiter characters are reduced to a single character, and that if they don't escape a delimiter character, then it may end up being interpreted as the end of the string. (Specifically if it is followed by whitespace, the end of the search query, or, when not matching field content, a colon : character.)
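A minimal Python sketch of how a reader for the "padding" approach might work. The name `read_padded`, its signature, and the choice to keep a lone mid-string delimiter as a literal character are assumptions, not an agreed design:

```python
def read_padded(query: str, start: int, field_content: bool = False) -> tuple[str, int]:
    """Read a delimited string that opens at query[start].

    Returns the decoded content and the index just past the closing delimiter.
    """
    delim = query[start]
    out = []
    i = start + 1
    n = len(query)
    while i < n:
        ch = query[i]
        if ch != delim:
            out.append(ch)  # ordinary character, backslashes included
            i += 1
        elif i + 1 < n and query[i + 1] == delim:
            out.append(delim)  # a doubled delimiter is one literal delimiter
            i += 2
        else:
            nxt = query[i + 1] if i + 1 < n else ""
            closes = (
                nxt == ""
                or nxt.isspace()
                or (nxt == ":" and not field_content)
            )
            if closes:
                return "".join(out), i + 1
            out.append(ch)  # assumption: a lone delimiter mid-string stays literal
            i += 1
    raise ValueError("unterminated delimited string")
```

Backslashes are never interpreted here, so a regex such as `/\d+/` reaches the regex engine untouched, and `"say ""hi"" now"` decodes to `say "hi" now`.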

Has Field

I'm doing away with the has_<field>:[true|false] syntax and replacing it with a has:<field> syntax. has_<field>:false can be replicated with NOT has:<field>
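For illustration, older queries could be translated mechanically. This sketch assumes field names consist of word characters only, and `rewrite_has` is a hypothetical helper:

```python
import re

_LEGACY_HAS = re.compile(r"\bhas_(\w+):(true|false)\b", re.IGNORECASE)


def rewrite_has(query: str) -> str:
    """Rewrite has_<field>:true|false into has:<field> or NOT has:<field>."""

    def replace(match: re.Match) -> str:
        field = match.group(1)
        wanted = match.group(2).lower() == "true"
        return f"has:{field}" if wanted else f"NOT has:{field}"

    return _LEGACY_HAS.sub(replace, query)
```

For example, `rewrite_has("cat has_author:false")` returns `cat NOT has:author`.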
