CQL Version 1.1 13th February 2004
CQL is a formal language for representing queries to information retrieval systems such as web indexes, bibliographic catalogs and museum collection information. The design objective is that queries be human readable and writable, and that the language be intuitive while maintaining the expressiveness of more complex languages.
Traditionally, query languages have fallen into two camps: powerful, expressive languages that are not easily read or written by non-experts (e.g. SQL, PQF, and XQuery); or simple and intuitive languages that are not powerful enough to express complex concepts (e.g. CCL and Google). CQL tries to combine simplicity and intuitiveness of expression for simple, everyday queries with the richness of more expressive languages to accommodate complex concepts when necessary.
Following are examples of simple CQL queries. These are all self-explanatory:
dinosaur
"complete dinosaur"
title = "complete dinosaur"
title exact "the complete dinosaur"
dinosaur or bird
dinosaur and "ice age"
dinosaur not reptile
dinosaur and bird or dinobird
(bird or dinosaur) and (feathers or scales)
"feathered dinosaur" and (yixian or jehol)
publicationYear < 1980
lengthOfFemur > 2.4
bioMass >= 100
The following are a bit more complicated:
Example | Explanation |
---|---|
title all "complete dinosaur" | Title contains all of the words "complete" and "dinosaur" |
title any "dinosaur bird reptile" | Title contains any of the words "dinosaur", "bird", or "reptile" |
(caudal or dorsal) prox vertebra | A proximity query: either "caudal" or "dorsal" near "vertebra" |
ribs prox/distance<=5 chevrons | A more specific proximity query: "ribs" within 5 words of "chevrons" |
ribs prox/unit=sentence chevrons | "ribs" in the same sentence as "chevrons" |
ribs prox/distance>0/unit=paragraph chevrons | "ribs" and "chevrons" occurring in the same document in different paragraphs |
subject any/relevant "fish frog" | Find documents that would seem relevant either to "fish" or "frog" |
subject any/rel.lr "fish frog" | Same as previous, but use a specific relevance algorithm (linear regression) |
Following is the Backus Naur Form (BNF) definition for CQL. ["::=" represents "is defined as"]
cqlQuery ::= prefixAssignment cqlQuery | scopedClause
prefixAssignment ::= '>' prefix '=' uri | '>' uri
scopedClause ::= scopedClause booleanGroup searchClause | searchClause
booleanGroup ::= boolean [modifierList]
boolean ::= 'and' | 'or' | 'not' | 'prox'
searchClause ::= '(' cqlQuery ')' | index relation searchTerm | searchTerm
relation ::= comparitor [modifierList]
comparitor ::= comparitorSymbol | namedComparitor
comparitorSymbol ::= '=' | '>' | '<' | '>=' | '<=' | '<>'
namedComparitor ::= identifier
modifierList ::= modifierList modifier | modifier
modifier ::= '/' modifierName [comparitorSymbol modifierValue]
prefix, uri, modifierName, modifierValue, searchTerm, index ::= term
term ::= identifier | 'and' | 'or' | 'not' | 'prox'
identifier ::= charString1 | charString2
charString1 := Any sequence of characters that does not include any of the following: whitespace, '(' (open parenthesis), ')' (close parenthesis), '=', '<', '>', '"' (double quote), '/'. If the final sequence is a reserved word, that token is returned instead. Note that '.' (period) may be included, and a sequence of digits is also permitted. Reserved words are 'and', 'or', 'not', and 'prox' (case insensitive). When a reserved word is used in a search term, case is preserved.
charString2 := Double quotes enclosing a sequence of any characters except double quote (unless preceded by backslash (\)). Backslash escapes the character following it. The resultant value includes all backslash characters except those releasing a double quote (this allows other systems to interpret the backslash character). The surrounding double quotes are not included.
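The two character-string rules can be illustrated with a small tokenizer. The following is a minimal sketch in Python, not a reference implementation: the names tokenize, unescape and TOKEN_RE, and the (kind, text) tuple shape, are assumptions made for this illustration. The delimiter characters are taken from the quoting rule for search terms (whitespace, parentheses, '=', '<', '>', '/') plus the double quote.

```python
import re

RESERVED = {"and", "or", "not", "prox"}

# Alternatives, in order: whitespace, a double-quoted string (charString2),
# a comparison or structural symbol, and an unquoted run (charString1).
# Two-character symbols are listed before one-character ones.
TOKEN_RE = re.compile(r'''
    (?P<space>\s+)
  | "(?P<quoted>(?:\\.|[^"\\])*)"
  | (?P<symbol><=|>=|<>|[()=<>/])
  | (?P<bare>[^\s()=<>/"]+)
''', re.VERBOSE | re.DOTALL)


def unescape(raw):
    """Drop only the backslashes that release a double quote; keep the rest."""
    return re.sub(r'\\(.)',
                  lambda m: '"' if m.group(1) == '"' else m.group(0),
                  raw, flags=re.DOTALL)


def tokenize(query):
    """Split a CQL query into (kind, text) pairs."""
    tokens, pos = [], 0
    while pos < len(query):
        m = TOKEN_RE.match(query, pos)
        if m is None:  # e.g. an unterminated double quote
            raise ValueError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.group("space"):
            continue
        if m.group("quoted") is not None:
            tokens.append(("term", unescape(m.group("quoted"))))
        elif m.group("symbol"):
            tokens.append(("symbol", m.group("symbol")))
        else:
            word = m.group("bare")
            # Reserved words are recognised case-insensitively; case is kept.
            kind = "reserved" if word.lower() in RESERVED else "term"
            tokens.append((kind, word))
    return tokens
```

A quoted string always yields a term, even when its content spells a reserved word, which is why the reserved-word check is applied only to unquoted runs.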
CQL Query
A CQL query is essentially a search clause, or multiple search
clauses connected by boolean operators. (In addition it may include
prefix assignments which assign short names to known contexts. See context
sets.)
Search Clause
A search clause consists of an index, relation, and search term, or a search
term alone. Thus every search clause has a search term, but both the
index and relation may be omitted - the clause must include either
both or neither of the index and relation. (Note that the use of the "index" concept
in CQL is not intended to have any implementation implications; it does
not imply the presence of a physical index.)
Examples:
Index/relation/search term: title = cat
Search term only: cat
Search Term
Search terms may be enclosed in double quotes. Search terms must be
enclosed in double quotes if they contain any of the following characters: < > =
/ ( ) and whitespace. The search term may be empty, but must be present
in a search clause. An empty search term is expressed as "" and has no
defined semantics.
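As an illustration of the quoting rule, a client might prepare a user-supplied string as follows. This is a sketch only; the function name quote_term is hypothetical. Quoting a term that contains a double quote, and escaping that quote with a backslash, is an assumption based on the escaping convention for quoted strings rather than on the character list above.

```python
import re

# Characters that force quoting, per the rule above: < > = / ( ) and whitespace.
_NEEDS_QUOTES = re.compile(r'[<>=/()\s]')


def quote_term(term):
    """Return `term` in a form usable as a CQL search term."""
    if term == "" or _NEEDS_QUOTES.search(term) or '"' in term:
        # Embedded double quotes are escaped with a backslash (assumption);
        # every other character, including backslash, is passed through as-is.
        return '"' + term.replace('"', '\\"') + '"'
    return term
```

For example, quote_term("cat") leaves the term unchanged, while quote_term("complete dinosaur") wraps it in double quotes.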
Index Name
An index name always includes a base name and may also include a prefix,
which provides a context for the index name, the name of the context
set of which the index is a part. If the context is not supplied,
it is determined by the server. If the index is not supplied it
is determined by the server. (Note that the index may be omitted only
when the relation is also omitted. Either both must be supplied, or both
omitted.)
Examples:
title = cat (context determined by the server)
dc.title = cat (index context is dc)
cat (context and index determined by the server)
Relation
The relation in a search clause specifies the relationship
between the index and search term. It also always includes a
base name and may also include a prefix providing a context for the
relation. If a relation is supplied with no accompanying context, the
context is 'cql' (the cql context
set). If no relation is supplied, then cql.scr (server choice)
is assumed, which means that the relation is determined by the server.
(Note that the relation may be omitted only when the index is also
omitted. Either both must be supplied, or both
omitted.)
Examples:
title = cat (context for relation is 'cql'; fully qualified relation is cql.=)
title cql.any cat (relation is 'any'; relation context is 'cql'. Equivalent to: title any cat)
cat (index and relation are determined by the server; formally the relation is 'cql.scr')
Examples of relations with modifiers:
dc.title any/relevant/rel.CORI "cat fish" (the relation 'any' is modified by (1) 'relevant', whose context is 'cql', and (2) 'CORI', whose context is 'rel')
dc.author exact/stem "smith, j." (the relation 'exact' is modified by 'stem', whose context is 'cql')
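The way a relation and its modifiers pick up their contexts can be sketched as below. The function names split_qualified and parse_relation are hypothetical, lower-casing is simply one way to honour case insensitivity, and modifiers that carry a comparison value (as in prox/distance<=5) are deliberately not handled.

```python
def split_qualified(name, default_prefix):
    """Split 'prefix.base' into (prefix, base), applying a default prefix."""
    prefix, dot, base = name.partition(".")
    if not dot:
        return (default_prefix, name.lower())
    return (prefix.lower(), base.lower())


def parse_relation(text):
    """Split a relation such as 'any/relevant/rel.CORI' into its parts."""
    relation, *modifiers = text.split("/")
    return {
        # An unqualified relation or relation modifier belongs to 'cql'.
        "relation": split_qualified(relation, "cql"),
        "modifiers": [split_qualified(m, "cql") for m in modifiers],
    }
```

Applied to any/relevant/rel.CORI, this yields the relation ('cql', 'any') with modifiers ('cql', 'relevant') and ('rel', 'cori').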
Boolean Operators
Search clauses may be linked by boolean operators. These are: and, or, not and prox.
(Note that not is really and-not; that is, it may not be used as a unary operator.)
Boolean operators all have the same precedence; they are evaluated left-to-right.
Parentheses may be used to override left-to-right evaluation.
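The effect of equal precedence and left-to-right evaluation can be seen in a small parser for term-only clauses. This is a sketch under simplifying assumptions: indexes, relations and boolean modifiers are not handled, quoted terms keep their surrounding quotes, stray characters such as an unmatched double quote are silently skipped, and the name parse and the nested-tuple output are choices made for this illustration.

```python
import re

BOOLEANS = {"and", "or", "not", "prox"}


def parse(query):
    """Parse term-only clauses joined by booleans into nested tuples."""
    # Tokens: parentheses, double-quoted strings, or runs of other characters.
    tokens = re.findall(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+', query)
    pos = 0

    def clause():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            inner = expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return inner
        if tok == ")":
            raise ValueError("unexpected ')'")
        return tok

    def expression():
        nonlocal pos
        left = clause()
        # Every operator has the same precedence: fold strictly left to right.
        while pos < len(tokens) and tokens[pos].lower() in BOOLEANS:
            op = tokens[pos].lower()
            pos += 1
            left = (op, left, clause())
        return left

    tree = expression()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree
```

With this sketch, dinosaur and bird or dinobird groups as (dinosaur and bird) or dinobird, whereas dinosaur and (bird or dinobird) keeps the parenthesised clause together. A leading not is rejected, reflecting that not is not a unary operator.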
Boolean Modifiers
Just as a relation may have modifiers, a boolean operator may have
modifiers, each introduced by a '/' character. Boolean modifiers may come
from any context set. If no context is supplied, the context is the CQL
context set. (Note that the boolean operators themselves are limited to
the built-in set of four.)
Example: dc.title=cat and/rel.sum dc.title=dog
Case Insensitive
All parts of CQL are case insensitive apart from user supplied search terms,
which may or may not be case sensitive. 'OR', 'or', 'Or' and 'oR' are
all the same boolean operator, just as 'dc.title', 'DC.Title' and 'dC.TiTLe'
are all the same context set plus index name.
The following are all formally defined by the CQL context set but described here for convenience.
For ordered (e.g. numeric) terms:
<, >, <=, >=,
and <> mean "less than", "greater than", "less
or equal", "greater or equal", and "not equal".
When the term is a list of words:
'=' is used for word adjacency -- the words appear in that order with no others intervening. (Note the dual use of '=': it is also used for numeric equality.)
'any' means "any of these words"
'all' means "all of these words"
When the term is a character string:
'exact' is used for exact string
matching.
When the term has multiple dimensions:
'within' may be used to search for values that
fall within the range, area or volume described by the search
term.
When the index's data has multiple dimensions:
'encloses' may be used to search for values where
the database's term fully encloses the search term.
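The word-list, string and range relations can be sketched as plain predicates. This is an illustration only: the function names are hypothetical, words are taken to be whitespace-separated, matching is made case-insensitive, and range bounds are treated as inclusive integers; none of these choices is mandated here. The checks it was written against mirror the sample matches tabulated after this sketch.

```python
def words(text):
    """Lower-cased, whitespace-separated words of a field or term."""
    return text.lower().split()


def adjacent(field, term):
    """'=' on a word list: the term's words occur in order, none intervening."""
    f, t = words(field), words(term)
    return any(f[i:i + len(t)] == t for i in range(len(f) - len(t) + 1))


def match_all(field, term):
    """'all': every word of the term occurs somewhere in the field."""
    return set(words(term)) <= set(words(field))


def match_any(field, term):
    """'any': at least one word of the term occurs in the field."""
    return bool(set(words(term)) & set(words(field)))


def exact(field, term):
    """'exact': the whole field equals the whole term."""
    return field.lower() == term.lower()


def within(value, term):
    """'within': the field's value lies inside the range given by the term."""
    low, high = (int(x) for x in term.split())
    return low <= value <= high


def encloses(field_range, value):
    """'encloses': the field's range contains the value given by the term."""
    low, high = (int(x) for x in field_range.split())
    return low <= value <= high
```

Note that adjacent succeeds when the phrase appears anywhere in a longer field, while exact requires the field and term to coincide.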
This query | Would match this | But not this |
---|---|---|
title = "cat in the hat" | "a day in the life of the cat in the hat" | "hat in the cat" or "cat in the green hat" |
title all "cat hat" | "hat in the cat" | "cat in the grass" |
title any "cat hat" | "cat in the grass" | "dog in the grass" |
title exact "cat in the hat" | "cat in the hat" | "a day in the life of the cat in the hat" |
date within "2002 2005" | 2004 | 2006 |
dateRange encloses 2003 | "2002 2005" | "2004 2005" |
These relation modifiers request that the server perform some algorithm on the term before processing.
stem
The server should apply a stemming algorithm to the words within the term.
For example, walked, walking, walker etc. would all be represented
by the stem word walk. This allows a search like title =/stem "these
completed dinosaurs" to match The Complete Dinosaur.
relevant
The server should use a relevancy algorithm for determining matches and
the order of the result set.
Example: subject any/relevant "fish frog"
would find records relevant to "fish" or "frog" and order
the result set by relevance to fish or frog.
These modifiers qualify the relation to more precisely determine its semantics.
word
The term consists of words (rather than being an opaque string).
string
The term is a single item, and should not be broken up.
isoDate
Each item within the term conforms to the ISO 8601 specification for expressing
dates.
number
Each item within the term is a number.
uri
Each item within the term is a URI.
masked
This means that the masking rules (see next) apply. Masking is assumed
even if not specified, unless 'unmasked' is specified (so there is never
any reason to include 'masked').
A single asterisk (*) is used to mask zero or more characters.
A single question mark (?) is used to mask a single character, thus N consecutive question-marks means mask N characters.
Caret/hat (^) is used as an anchor character for terms that are word lists, that is, where the relation is 'all' or 'any', or '=' when used for word adjacency. It may not be used to anchor a string, that is, when the relation is 'exact' (string matches are, by definition, anchored). It may occur at the beginning or end of a word (with no intervening space) to mean left or right anchored, respectively. "^" has no special meaning when it occurs within a word (not at the beginning or end) or string, but must be escaped nevertheless.
Backslash (\) is used to escape '*', '?', quote (") and '^', as well as itself. The use of a backslash not followed immediately by one of these characters is reserved for future definition.
Masking examples:
dc.title = c*t (matches cat and coast etc.)
dc.title = c?t (matches cat and cot, not coast or ct)
" ?" (matches any single character)
dc.title = "^cat in the hat" (matches 'cat in the hat' where it is at the beginning of the field)
dc.title any "^cat eats rat" (matches 'cat eats rat', 'cat eats dog', 'cat', but not 'rat eats cat')
dc.title any "^cat ^dog eats rat" (matches 'cat eats rat', 'dog eats cat', 'cat loves bat', but not 'bat loves cat')
dc.title = "\"Of Course\" she said"
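One way to apply the '*' and '?' masking rules to a single word is to translate the mask into a regular expression. The sketch below makes several assumptions: the names mask_to_regex and word_matches are hypothetical, matching is case-insensitive, anchoring with '^' is not implemented (an unescaped '^' is simply treated as a literal), and a trailing lone backslash, whose meaning is reserved, is also treated as a literal.

```python
import re


def mask_to_regex(word):
    """Translate one masked CQL word into a compiled regular expression."""
    out, i = [], 0
    while i < len(word):
        ch = word[i]
        if ch == "\\" and i + 1 < len(word):
            # An escaped character stands for itself.
            out.append(re.escape(word[i + 1]))
            i += 2
            continue
        if ch == "*":
            out.append(".*")   # zero or more characters
        elif ch == "?":
            out.append(".")    # exactly one character
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out), re.IGNORECASE | re.DOTALL)


def word_matches(mask, candidate):
    """True if the whole candidate word matches the mask."""
    return mask_to_regex(mask).fullmatch(candidate) is not None
```

Using fullmatch rather than search keeps the mask tied to the whole word, so c?t accepts cat and cot but neither coast nor ct.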
A search clause may be a result set name. This is a special case, employing the context set 'cql'. The index and relation are expressed as "cql.resultSetId =" and the term is a result set name that has been returned by the server in the 'resultSetName' parameter of the response. It may be used by itself in a query to refer to an existing result set from which records are desired. It may also be used in conjunction with other resultSetName clauses or other indexes, combined by boolean operators. The semantics of resultSetId with relations other than "=" is undefined.
Example: cql.resultSetId = "resultA" and cql.resultSetId = "resultB"
The proximity boolean operator is expressed in terms of distance, unit, and ordering.
Examples:
distance takes the form:
distance [relation] [value]
where relation is one of "<", ">", "<=", ">=", "=", "<>" (default "<=")
and value is a non-negative integer (default: 1 for word, zero otherwise).
unit takes the form:
unit=[value]
where value is one of "word", "sentence", "paragraph", or "element" (default "word").
ordering is "ordered" or "unordered" (default "unordered").
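For the unit "word", a proximity test might be sketched as follows. The function name prox_words and its parameters are hypothetical; words are taken to be whitespace-separated, and distance is assumed to be the difference between word positions, so that adjacent words are at distance 1. Other units would need sentence, paragraph or element boundaries, which are outside this sketch.

```python
import operator

RELATIONS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
             ">=": operator.ge, "=": operator.eq, "<>": operator.ne}


def prox_words(text, left, right, relation="<=", distance=1, ordered=False):
    """True if some pair of occurrences of the two words passes the test."""
    tokens = text.lower().split()
    test = RELATIONS[relation]
    for i, a in enumerate(tokens):
        if a != left.lower():
            continue
        for j, b in enumerate(tokens):
            if b != right.lower() or i == j:
                continue
            gap = j - i
            if ordered and gap < 0:
                continue  # 'ordered' requires left to precede right
            if test(abs(gap), distance):
                return True
    return False
```

The defaults follow the text above: relation "<=", distance 1 for words, and unordered matching.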
Context sets permit CQL users to create their own indexes, relations, relation modifiers and boolean modifiers without fear of choosing the same name as someone else and thereby having an ambiguous query. All four of these aspects of CQL must come from a context set; however, there are rules for determining the prevailing default if one is not supplied. Context sets allow CQL to be used by communities in ways which the designers could not have foreseen, while still maintaining the same rules for parsing which allow interoperability.
When defining a new context set, it is necessary to provide a description of the semantics of each item within it. While context sets may contain indexes, relations, relation modifiers and boolean modifiers, there is no requirement that all should be present; in fact it is expected that most context sets will only define indexes.
Each context set has a unique identifier, a URI. When sending the context set in a query, a short form is used. These short names may be sent as a mapping within the query itself (see next), or be published by the recipient of the query in some protocol dependent fashion. The prefix 'cql' is reserved for the CQL context set, but authors may wish to recommend a short name for use with their set.
An index, relation, or modifier qualified by a context is represented in the form prefix.value, where prefix is a short name for a unique context set identifier.
Binding Short Name to URI
The binding of short name to
URI is defined either within the query or by the server. A prefix map
may occur at any place in the query and applies to anything which follows.
Example:
>dc="http://www.dublincore.org/" dc.title = "cat"
In the following query:
>a="http:/x.com/y" a.title=cat and (>a="http:/f.com/g" a.title=hat) and a.title=rat
the "a" in both "a.title=cat" and "a.title=rat" refers to http:/x.com/y, while the "a" in "a.title=hat" refers to http:/f.com/g.
Default Context
When no context is attached to a relation, relation modifier,
or boolean modifier, the context is the cql context set. When no
context is attached to an index the context is determined by the server.
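The default-context rules can be summarised in a short resolver. This is a sketch only: the function name resolve, the kind labels and the dictionary-based prefix map are assumptions made for illustration, and the server's default context set is passed in as a plain argument. The identifier used for the cql context set is the one listed for it under Known Context Sets.

```python
CQL_URI = "info:srw/cql-context-set/1/cql-v1.1"


def resolve(name, kind, prefix_map, server_default_uri):
    """Return (context set identifier, base name) for a possibly qualified name.

    kind is "index", "relation", "relation modifier" or "boolean modifier".
    prefix_map maps lower-case short names to context set identifiers.
    """
    # The short name 'cql' is reserved, so it cannot be rebound.
    bindings = {**prefix_map, "cql": CQL_URI}
    prefix, dot, base = name.partition(".")
    if dot:
        try:
            return (bindings[prefix.lower()], base.lower())
        except KeyError:
            raise ValueError(f"unknown context set prefix: {prefix}") from None
    if kind == "index":
        # An unqualified index takes whatever context the server chooses.
        return (server_default_uri, name.lower())
    # Unqualified relations and modifiers belong to the cql context set.
    return (CQL_URI, name.lower())
```

A caller would build prefix_map from the prefix assignments in scope at that point in the query, together with any short names published by the server.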
Known Context Sets
The following context sets have been defined and made
public. If you wish to have your context set listed here, please provide
the information to the SRU Maintenance Agency.
Short Name | Identifier | Reference |
---|---|---|
cql | info:srw/cql-context-set/1/cql-v1.1 | |
dc | info:srw/cql-context-set/1/dc-v1.1 | |
bath | http://zing.z3950.org/cql/bath/2.0/ | |
zthes | http://zthes.z3950.org/cql/1.0/ | ZThes thesaurus context set, v1.0. |
ccg | http://srw.cheshire3.org/contextSets/ccg/1.1/ | |
rec | info:srw/cql-context-set/2/rec-1.1 | |
net | info:srw/cql-context-set/2/net-1.0 | |
music | info:srw/cql-context-set/3/music-1.0 | |
rel | info:srw/cql-context-set/2/relevance-1.0 | |
zeerex | info:srw/cql-context-set/2/zeerex-1.1 | |
marc | info:srw/cql-context-set/1/marc-1.0 | |
In order to claim conformance to CQL a server must support one of the following three levels:
Level 0
Must be able to process a term-only query.
(The term is either a single word or, if it consists of multiple words
separated by spaces, the entire search term is quoted. If the term
includes quote marks, they must be escaped by preceding them with a
backslash, e.g. "raising the \"titanic\"".)
If an unsupported query is supplied, must be able to respond with a diagnostic to say that the query is not supported.
Level 1
Support for Level 0.
Ability to parse both:
(a) search clauses consisting of 'index relation searchTerm'; and
(b) queries where search terms are combined with booleans, e.g. "term1
AND term2"
Support for at least one of (a) and (b).
Note that (b) does not necessarily include queries such as:
index relation term1 AND index relation term2
but rather queries where the search clauses are terms-only (do not include index or relation).
Level 2
Support for Level 1.
Ability to parse all of CQL and respond with appropriate diagnostics.
Note that Level 2 does not require support for all of CQL; it requires that the server be able to parse all of CQL (and respond with proper diagnostics for the parts not supported).