Skip to content

JSON Query Schema

Jens Alfke edited this page Aug 9, 2024 · 75 revisions

Table Of Contents

STATUS: This is a living document that reflects the latest version of the query syntax, as implemented on the master branch of LiteCore. It may describe features not yet available in a release of Couchbase Lite. Pay attention to "STATUS:" blocks like this one, which point out new features not yet in a release.

1. Introduction

There are two ways to create queries in Couchbase Lite:

  1. The legacy QueryBuilder API
  2. The newer API that takes a query string in N1QL / SQL++ syntax.

Both of these produce an intermediate representation of the query, a sort of Abstract Syntax Tree (AST) expressed in JSON. That's what this document describes. LiteCore then translates this into SQL that can be executed by SQLite.

Each node of the parse tree describes an operation and a list of operands (children). The operations can be arithmetic, comparison, logical, etc. The number of operands depends on the operation; for example, NOT has exactly one, - has one or two (negation or subtraction), AND has two or more.

A node is represented in JSON as an array, where the first element is a string naming the operation, and the other elements represent the operands (often nested arrays); for example ["=", ["+", 2, 2], 5]. (If you know LISP or any functional languages, this should look pretty familiar!)

A few operations, like SELECT and COLLATE, take a set of named operands. These are represented as a single operand that's a JSON object/dictionary whose items are the real operands.

Most of the operation names are SQL / N1QL keywords or math symbols, but there are also operations to represent document property paths ("."), query parameters ("$"), etc.

NOTE: This schema is case-insensitive, like SQL and N1QL. All operation names, function names, and SELECT keys can be upper- or lower-case or any mixture. However, document property names are case-sensitive.

2. Example

SELECT name.first, name.last FROM students WHERE grade = 12 AND gpa >= $GPA

As a JSON tree this looks like:

["SELECT", {
    "WHAT": [
        [".", "name", "first"],
        [".", "name", "last"] ],
    "WHERE":
        ["AND",
            ["=",
                [".", "grade"],
                12],
            [">=",
                [".", "gpa"],
                ["$", "GPA"] ] } ]

3. Leaf Types (Literals, Properties, Parameters, Variables)

Literals

As an operand, a JSON string, number, boolean or null represents itself.

Examples: true, false, null, 17, "foo"

The special value MISSING (equivalent to SQL's NULL) is represented as ["MISSING"]; i.e. it's a zero-argument operation that returns a constant.

An array literal is created using the "[]" operation. All of the operands are evaluated and concatenated to make the array.

Example: ["[]", 10, true, "foo"]

A dictionary literal is created simply by using a JSON dictionary/object. All of the values are evaluated.

Example: {"name": [".name"], "age": [".age"]}

NOTE: Array and dictionary literals ignore any values that evaluate to MISSING (similar to JavaScript's undefined.) So in the dictionary example above, if the document had no age property, the resulting dictionary would have only a name key.

Properties

Properties are references to document properties. The property operation name is "."; its operands are a path from the document root to the property being named.

Example: [".", "name", "first"]

As shorthand, a property expression can be collapsed into a one-element array, like [".name.first"] ... as long as none of the path components contain a ".", of course.

A property expression with zero operands, ["."], represents the root of the document. (This is commonly used in a WHAT list, where it is the equivalent of the SQL * specifier.)

In a query with a FROM clause, where multiple documents are being queried, a property expression's path MUST be prefixed with the alias of the document as its first operand. For example, if the alias were db, then [".", "name"] would become [".", "db", "name"]; ["."] would become [".", "db"]; and [".", "_id"] would become [".", "db", "_id"]. Of course these can be abbreviated as [".db.name"], etc.

If the query's WHAT clause uses the AS operator to declare a result alias, that alias can be used as a top-level property name (potentially hiding a document property with the same name!)

Magic Metadata Properties

There are some special top-level property names for accessing document metadata:

Name Type Value
_id string The document ID
_sequence integer The sequence number
_deleted boolean True if the document is deleted
_expiration integer or missing Expiration time (ms since Unix epoch)

Parameters

Parameters are placeholders whose values are substituted when the query is run. The parameter name is the single (required) operand of the "$" expression.

Example: ["$", "MIN_AGE"]

As shorthand, a parameter expressions can be collapsed into a one-element array, like ["$MIN_AGE"].

Variables

Variables are placeholders used in a ANY and EVERY expression to represent the array/dictionary item being iterated over. The (required) first operand of the "?" expression is the variable's name, and the (optional) extra operands are a property path relative to the variable's value.

Example: ["?", "address", "zip"]

As shorthand, a variable expression can be collapsed into a one-element array, like ["?address"] or ["?address.zip"].

4. Operations

The operations are named after their N1QL/SQL equivalents.

Category Name Operand Count
Constants MISSING 0
Arithmetic + 2+
- 1 or 2
* 2+
/ 2
% 2
String || 2+
Relational = 2
!= 2
< 2
<= 2
> 2
>= 2
BETWEEN 3: (value, min, max)
IS 2
IS NOT 2
IS VALUED 1: is neither NULL nor MISSING
LIKE 2
IN 2: (value, array)
NOT IN 2: (value, array)
EXISTS 1
COLLATE 2: (options, expr) [see Collation below]
Logical NOT 1
AND 2+
OR 2+
Functions name() Depends on function
Conditional CASE 3+: (expr, when1, then1 ...) [see below]
ANY 3: (variable name, array, satisfies)
EVERY 3: (variable name, array, satisfies)
ANY AND EVERY 3: (variable name, array, satisfies)
Properties . 0+: (path components) [see above]
_. 2: (expr, property-path) Evaluates a property (or path) of a dictionary value
Parameters $ 1 (name or position) [see above]
Variables ? 1+ (name, optional path components) [see above]
Blobs BLOB 1: (property path)
Queries SELECT 1 [see below]

4.1. CASE

The CASE operator needs a bit of explanation.

  • The first operand is the expression to test (which directly follows the CASE keyword in N1QL/SQL).
  • The second operand is the first expression to compare with (directly following WHEN.)
  • The third operand is the result to use if the first expression matches (directly following THEN.)
  • After that can come zero or more pairs of extra 'when' and 'then' expressions.
  • If there's one operand left over (i.e. the operation has an even number of operands) it's interpreted as the ELSE result.

Alternatively, the expression to test may be null; in that case, each of the odd-numbered "when" operands is checked for truthiness instead of being compared against the test expression.

5. Collation

The COLLATE operator does nothing itself, merely returns the value of its second operand, but it alters the string collation (comparison/sorting) used when evaluating that expression and nested expressions.

NOTE: The regular-expression functions did not obey collations prior to version 2.7.

The first operand is a dictionary that specifies the collation; its keys are:

Key Value Default Value
"UNICODE": Unicode-aware? false
"CASE": Case-sensitive? true
"DIAC": Diacritic (accent) -sensitive? true
"LOCALE": ISO locale* string or null null

* A locale is an ISO-639 language code plus, optionally, an underscore and an ISO-3166 country code: "en", "en_US", "fr_CA", etc.

Some details on combining keys:

  • If UNICODE is not true, only the CASE value is significant: a case-sensitive collation is a purely binary string comparison; a case-insensitive one also treats ASCII uppercase and lowercase letters as equivalent.
  • If UNICODE is true, but LOCALE is missing or null, the collation is Unicode-aware but not localized; for example, accented Roman letters sort right after the base letter. (This is implemented by using the "en_US" locale.)
  • Any keys not specified are inherited from the enclosing context.
  • There's implicitly a top-level context with the default values for the keys, i.e. {UNICODE: false, CASE: true, DIAC: true, LOCALE: null}.

About Unicode Collation

The details of Unicode collation are quite complex, though for the most part it just Does The Right Thing according to a human of that locale. But it doesn't behave like the simple strcmp and strcasecmp functions that programmers are used to!

  • The collation algorithm first compares the strings ignoring case and diacritics, just looking at the base letters. If the letters are not equal, it stops and returns the relative ordering based on the mismatched letters. This is usually, but not always, the ordering English speakers are used to; for example, in Lithuanian "Y" comes after "I", and in traditional Spanish the sequence "CH" sorts as a single letter that comes between "C" and "D".
  • Otherwise, if diacritic-sensitive, it compares the strings again, this time considering also diacritics (accents). If they differ, it returns the relative ordering. By default an accented letter sorts just after the base letter, but many locales have special rules: "Å" in Danish is treated as a separate letter that sorts just after "Z".
  • Otherwise, if case-sensitive, it compares the strings again, now considering case; if they differ in case, it returns the relative ordering, with lowercase coming before uppercase (contrary to the intuition of programmers used to ASCII ordering!)

The above applies to the Roman alphabet. For many non-Roman scripts, especially Chinese and Japanese, the rules get much more complicated.

Some unintuitive results: in a case-sensitive collation, "abc" comes before "ABC" (lowercase first!), but "abd" comes after "ABC" because the letter mismatch takes priority over the case mismatch. Likewise, "ápple" comes after "Apple" (in most locales) because the diacritic is higher priority than uppercase.

6. MATCH and Full-Text Search

The MATCH operator queries a full-text-search (FTS) index.

  • Its first parameter is the name of the index.
  • Its second parameter is a search string.

SQLite imposes limitations on how MATCH expressions can be used:

  • They have to appear at the top level of a SELECT.
  • Each index can only be matched once within any SELECT.

FTS Search-String Syntax

The search string in a MATCH expression is (currently) passed directly to SQLite’s FTS4 search engine, so its syntax is specified by FTS4. A summary:

  • Search terms (words) are separated by whitespace.
  • If there are multiple terms, by default the text must contain all of them.
  • A term ending in * will match any word beginning with that prefix.
  • A term starting with “^” only matches at the start of the text.
  • A series of terms enclosed in double-quotes is a phrase; the terms must appear adjacent to each other in that order in the text.
  • Parentheses can be used to group terms/phrases into a larger term.
  • The special words OR, AND, and NOT (capitalized!) can be used between terms/phrases as boolean operators.
  • The special word NEAR can be used between two terms/phrases to specify that they must appear near each other: within 10 words by default, but you can customize this by appending a “/” and a number, e.g. NEAR/5.

Word Matching

Full-text search is always case-insensitive. Its diacritic sensitivity (whether it ignores accent marks) is configured when the FTS index is created. Thus, a MATCH expression is not affected by a COLLATE clause.

Matching is affected by stemming and stop-words, if those are available in the selected language and enabled in the index.

  • Stemming causes different forms of the same word to match, so (in English) “bigger” matches “big” and “biggest”.
  • Stop-words are common but low-significance words, like English “the” and “are”, that are ignored completely in order to keep down the size of the index.

STATUS: (August 2024) Stemming is currently available for Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish. Stop-words are used in English and French.

STATUS: (August 2024) The FTS indexer considers words to be sequences of Unicode alphabetic characters separated by non-alphabetic characters. This is true of most languages, but many Asian languages like Japanese, Chinese and Thai do not normally use whitespace to separate words; FTS will not work with such text. (Finding word breaks in these languages is difficult and will require 3rd party libraries like Mecab or Apple’s NSLinguisticTagger.)

7. Top-Level Query, and SELECT

The SELECT statement has so many parameters, all of which are optional, that it makes a lot more sense to encode them as a dictionary with the following keys:

Key Value Default Value
"WHAT": Array of column expressions to return, generally properties document ID and sequence
"FROM": Array of database/join identifiers Database being queried
"WHERE": Boolean-valued expression true (all documents)
"HAVING": Expression true
"DISTINCT": Boolean false
"GROUP_BY": Array of expressions or property names [] (no grouping)
"ORDER_BY": Array of expressions or property names [] (unsorted)
"LIMIT": Number Infinite
"OFFSET": Number 0

The WHAT Clause: Result Columns

The WHAT array defines the columns of a query row, just like the column expressions after SELECT in N1QL/SQL syntax. Each item of the array may be:

  • A string literal, interpreted as a property path.
  • An expression (array or dictionary). Aggregate functions are allowed here.
  • An array of the form ["AS", <expression>, "<string>"], which is interpreted exactly the same as <expression>, but has the side effects of defining <string> as an result alias for <expression>, and setting the column's title to <string>.

Result Aliases

A result alias can be used in the WHERE clause as a shortcut for its expression, as though it were a document property. (This happens even if there is a document property with the same name; the alias shadows it.)

Example: {"WHAT":[["AS", ["+", [".x"], [".y"]], "sum"]], "WHERE": [">", [".sum"], 10]}

Column Titles

Column titles have no effect on the query but are returned via the c4query_getColumnTitle() C function. A column declared without AS has a default title based on the property path if any, or otherwise a made-up name like $1. Higher-level bindings may support accessing a query row as a dictionary using the column titles as keys.

The FROM Clause: Database/Join/Unnest Identifiers

The items in the FROM array are dictionaries with the following keys:

Key Value Default Value
"AS": Alphanumeric string: an alias to refer to this database or join by Derive from COLLECTION
"COLLECTION": String: Collection name Default collection
"SCOPE": String: Scope name Default scope
"JOIN": String: Type of join "INNER" (if ON is given)
"ON": Boolean-valued expression: the join constraint no join
"UNNEST": Array-valued expression no unnest

Some requirements:

  • Every item must have an effective unique AS property value. If the AS property is missing in the FROM clause, it will be derived from COLLECTION.
  • The first item in the array serves only to alias the default database; it can't have a JOIN, ON or UNNEST property.
  • The subsequent items must either be joins or unnests.
  • In a join:
    • There must be an ON property.
    • Legal values for JOIN are "INNER" (the default), "OUTER", "LEFT OUTER", and "CROSS". (Case-insensitive)
  • In an unnest:
    • There must be an UNNEST property.
    • There cannot be a JOIN or ON property.

Adding a FROM clause affects the interpretation of properties. Since there are usually multiple databases or join sources, property names (paths) in the entire query need to be disambiguated by prefixing the appropriate alias. So in the example below, a document's abbreviation property has to be named as .state.abbreviation, not just .abbreviation.

Scope and Collections in FROM

If COLLECTION is given, the special value "_" (just an underscore) may be used to refer to the default collection. The name of the database (i.e. the directory name without the .cblite2 extension) may also be used to refer to the default collection.

If SCOPE is given, COLLECTION must also be present and must name a non-default collection. Alternatively, the scope name can be given as a prefix of the COLLECTION value, followed by a ".": e.g. "COLLECTION":"fooscope.barcoll".

Example of FROM

    "FROM": [{"as": "person"},
             {"as": "state", "on": ["=", [".state.abbreviation"],
                                         [".person.contact.address.state"]]},
             {"as": "interest", "unnest": [".person.interests"]}],

Note how document property names are prefixed with aliases, like .state.abbreviation instead of just .abbreviation. This is required once aliases are introduced in a FROM clause, to avoid ambiguity.

8. N1QL Functions

For detailed information about parameters and results, please consult the N1QL documentation.

NOTE: There are some differences from SQL, or at least from SQLite; for example, SQLite has non-aggregate versions of min and max, but in N1QL (and LiteCore) these are called least and greatest.

Category Name Operand Count Notes
Aggregate array_agg() 1 Collects values into an array
avg() 1
count() 1
max() 1
min() 1
sum() 1
Arrays array_avg() 1 Average value of numbers in an array
array_contains() 2 (array, item)
array_count() 1 Number of non-null items in array
array_ifnull() 1 The first non-null item in an array
array_length() 1 Full length of array
array_max() 1 Maximum number in an array
array_min() 1 Minimum number in an array
array_sum() 1
Comparisons greatest() 2+ Maximum numeric argument (like SQL max)
least() 2+ Minimum numeric argument (like SQL min)
Conditionals ifmissing() 2+ Returns 1st non-missing arg
ifmissingornull() 1+ Returns 1st arg not missing or null
ifnull() 1+ Returns 1st non-null arg
missingif() 2 returns missing if arg 1 == arg 2, else returns arg 1
nullif() 2 returns null if arg 1 == arg 2, else returns arg 1
Dates millis_to_str() 1 Converts Unix timestamp in milliseconds to ISO-8601 date string in local time zone
millis_to_utc() 1 Converts Unix timestamp in milliseconds to ISO-8601 date string in UTC
str_to_millis() 1 Parses ISO-8601 date string to Unix timestamp in milliseconds
str_to_utc() 1 Normalizes ISO-8601 date string to UTC timezone
Math abs() 1
acos() 1 Note: All trig functions use radians
asin() 1
atan() 1
atan2() 2 Parameter order is (y, x) in version 3+; was (x, y) in version 2
ceil() 1
cos() 1
degrees() 1 Converts radians to degrees
div() 2 Real number division. div(x, y) = (double)x/(double)y
e() 0
exp() 1
idiv() 2 Integer number division. idiv(x, y) = (int)x/(int)y
ln() 1
log() 1
meta() 0-1 Returns object that contains all the meta properties. The optional argument is the database alias.
floor() 1
pi() 0
power() 2
radians() 1 Converts degrees to radians
round() 1–2 Optional 2nd argument gives number of decimal places to round to (default 0)
sign() 1 Returns -1, 0 or 1, reflecting sign of argument
sin() 1
sqrt() 1
tan() 1
trunc() 1–2 Optional 2nd argument gives number of decimal places to truncate to (default 0)
Patterns regexp_contains() 2 Args are (string, pattern)
regexp_like() 2 Synonym for regexp_contains
regexp_position() 2 Returns byte offset of 1st match, else -1
regexp_replace() 3-4 Args are (string, pattern, replacement) and optional limit
match() 2 FTS match
rank() 1 Returns ranking of FTS matches
Strings contains() 2
concat() 2+ like `
length() 1
lower() 1
ltrim() 1 Removes leading whitespace
rtrim() 1 Removes trailing whitespace
trim() 1–2 Removes leading & trailing whitespace
upper() 1
Types isarray()/is_array() 1
isatom()/is_atom() 1 "Atom" means boolean, number, or string
isboolean()/is_boolean() 1
isnumber()/is_number() 1
isobject()/is_object() 1
isstring()/is_string() 1
type()/typename() 1 Returns one of 'missing', 'null', 'boolean', 'number', 'string', 'binary', 'array', 'object'
toarray()/to_array() 1
toatom()/to_atom() 1 See N1QL docs
toboolean()/to_boolean() 1 See N1QL docs
tonumber()/to_number() 1 See N1QL docs
toobject()/to_object()
tostring()/to_string() 1 See N1QL docs
Predictive prediction() [q.v.] 2-3
euclidean_distance() 2-3
cosine_distance() 2
Vector approx_vector_distance() [q.v.] 2-5

prediction()

prediction() is not standard N1QL. It calls a predictive function, usually based on a machine-learning model, which must be registered with LiteCore at runtime before the query is compiled. Its parameters are:

  1. The name the predictivefunction was registered with
  2. The function's named inputs; this must be a dictionary and is usually given as a dictionary literal
  3. The name of the function output to use [optional]

If parameter 3 is not given, the function's output will be returned as a dictionary; otherwise only the named output value is returned.

If the model is unable to process the input because a required parameter is missing, or a parameter has the wrong type, the result of the function call is MISSING.

Example:

["prediction()", "mobilenet", {"image": ["BLOB", ".picture"]}, "classLabel"]

Assuming a trained MobileNet image classifier has been registered as "mobilenet", this will read an attached blob from the document's picture property, run it through MobileNet to classify it, and return the label most likely to apply to the image contents, e.g. "siamese cat" or "banana".

euclidean_distance()

Returns the Euclidean distance between two vectors, which is used as a distance metric in predictive queries.

Both parameters must be arrays of numbers, and must be the same length. The result is a non-negative floating-point number.

An optional third parameter is a power to raise the result to. Using 2 provides the common "squared Euclidean" distance.

cosine_distance()

Returns the cosine distance (one minus the cosine similarity) between two vectors, which is used as a distance metric in predictive queries.

Both parameters must be arrays of numbers, must be the same length, and must be non-empty. The result is a floating-point number in the range [-1 … +1].

approx_vector_distance()

This triggers a search of a pre-existing vector index. Its parameters are the same as on Server:

approx_vector_distance(indexed_property, target_vector, [metric], [num_probes], [accurate])

  • indexed_property is an expression evaluating to a vector value. There MUST exist a vector index on this exact property.
  • target_vector must also evaluate to a vector value; this is typically a query parameter.
  • metric is the name of the distance metric. It default to euclidean2.
  • num_probes is the number of centroids to probe. It defaults to the value given when the index was created.
  • accurate is only for Server compatibility. It defaults to false, and if given must have a value of false.

Note: The prediction and vector functions are only available in the Enterprise Edition (EE) of Couchbase Lite.

9. Indexes

Indexes aren't, strictly speaking, part of queries, but they use similar syntax. An index specifier is a JSON dictionary with the following keys:

Key Required? Value
WHAT Yes Array of one or more expressions to index, generally document properties.
WHERE No Expression that returns a "truthy" value for the documents to be indexed.

For backward compatibility an index specifier may also be an array, which is interpreted as though it were the value of a WHAT clause.

STATUS: (August 2024) The WHERE clause, and the dictionary form of the specifier, are experimental. In all current releases the specifier must be an array.

The effective use of indexes to optimize queries is sort of a black art. Fortunately there is a lot of information in books and online, and most of that advice applies here too.

The WHAT Clause

The WHAT clause of course determines what will be indexed. The index is sorted by these expression(s), in the same way as a query is sorted by its ORDER BY clause; so if there are multiple expressions, the first is the primary, the second the secondary, etc.

The DESC operator can be wrapped around an expression to specify a descending sort; this can accelerate queries that use descending order or that use the max() function.

In a full-text index, the expression determines the text that will be indexed for each document, so it should evaluate to a string. (If it doesn't, that document is ignored.) Full-text indexes only support a single expression.

The WHERE Clause

The optional WHERE clause creates a partial index that includes only some of the documents in the database. A partial index is smaller, and faster to create/update, but it can only be used by a query whose own WHERE clause contains an equivalent condition.

Since many real-world queries look for only a particular type of document, indexes used by such queries can take advantage of a WHERE clause that tests the document type. For example, an index of flight arrival times might look like {"WHAT": [[".arrival_time"]], "WHERE": ["=", [".type"], "flight"]}.

STATUS: (August 2024) The WHERE clause is still experimental. It is not yet supported in full-text indexes.