Package it.unimi.di.big.mg4j.search

Classes that compose iterators over documents. Such iterators are returned, for instance, by IndexReader.documents(long).

Minimal-interval semantics

MG4J provides minimal-interval semantics. That is, if the index is full-text, a DocumentIterator will provide a list of documents and, for each document, a list of minimal intervals. These intervals denote ranges of positions in the document that satisfy the iterator: for instance, if you compose two document iterators using an AndDocumentIterator, you will get as a result the intersection of the document lists of the underlying iterators. Moreover, for each document you will get the minimal set of intervals that contain one interval from the first iterator and one from the second.

This information is of course very useful if you're going to assign a score to the document, as smaller intervals mean a more precise match. At the basic level (e.g., iterators returned by an index), the intervals returned upon a document are intervals of length one containing the term that was used to generate the iterator. Intervals for compound iterators are built in a natural way, preserving minimality. More details can be found in Charles L. A. Clarke, Gordon V. Cormack, and Forbes J. Burkowski, “An algebra for structured text search and a framework for its implementation”, Comput. J., 38:1 (1995), pages 43−56. Scorers for documents may be found in the it.unimi.di.big.mg4j.search.score package.

The algorithms used by classes in this package to compute minimal-interval operators are new: details can be found here.

Note that MG4J provides minimal-interval semantics for a set of indices. This extension is a significant improvement over single-index semantics. However, defining the exact meaning of a query is a nontrivial problem that will be fully dealt with in a forthcoming paper.

Searching with minimal-interval semantics

The aim of this section is to provide some insight into how minimal-interval semantics works, and to explain the basic syntax used by the Query command-line tool. In this section we shall discuss this issue only through examples; we shall later explain how you can actually perform searches of this kind using MG4J.

Note that you do not have to understand the details of minimal-interval semantics to fruitfully use MG4J. Several natural operators (ordered conjunction, proximity limitation, etc.) are computed by MG4J very efficiently using minimal-interval semantics just under the hood.

MG4J solves queries on multiple indices; by saying so, we mean that you may have many indices concerning the same document collection, and you want to perform a query that may search on some of them. Think, for example, of a collection of emails: you might have generated a number of indices, to index their subjects, sender, recipient(s), body etc. All these indices can be thought of as individual entities, their only relation being that they index collections with the same number of documents, and that same-numbered documents conceptually "come" from the same source. The notion of multiple indices should not be new to the reader that is familiar with the it.unimi.di.big.mg4j.document package.

In our examples, we will assume that we have three indices (say, subject, from and body), and that subject is used as the default. Be warned that the actual syntax of queries in this section is immaterial (even though we shall stick to the syntax of SimpleParser).

Two different aspects should be taken into consideration when trying to determine which documents actually match (i.e., satisfy) the query:

  • first, one can consider this in a purely Boolean (true/false) setting: thus a document may either satisfy the query or not; this is actually the only information you can get for indices that do not contain positions;
  • second, one can consider, for a document that matches the query in the above sense, which intervals (i.e., minimal sequences of consecutive words within the document) actually witness the match; this information will be available if the index contains positions.

In the following subsections, we shall give information about both kinds of satisfiability. A final section contains pathological cases that behave counterintuitively.

Queries available on all indices

Simple queries

The simplest possible query consists of a single search term. The documents matching such a query are exactly those that contain the given term, with respect to the default index. In our example, the query

meeting
will be matched by the documents (emails) that contain the term "meeting" in their subject. If you want, you can perform the query on another index (different from the default one); thus, for example, the query
body: meeting
will be matched by the documents that contain the term "meeting" in their body. In both cases, the intervals witnessing the match will be the single occurrences of the term "meeting" in the subject and in the body field, respectively. You can also (mostly for debugging purposes) use a sharp sign followed by a natural number to denote a term by its lexicographical rank. Thus,
#0
returns the documents containing the lexicographically smallest term. Note, however, that some scorers might really require a term expressed as a string to work properly.

Escaping

In the last example, we have explained that an index can be specified using a colon; this is one of many special symbols that correspond to operators, and that we shall talk about in the following. If any of your query terms happens to contain one of these symbols, you can escape it using a backslash (\); a backslash is escaped by itself (\\). So, for example, if your collection indexes the colon as a word, you can query for it as follows:

\:

Conjunctive queries

You can specify that more than one condition should be met in conjunction by using the AND operator. For example:

meeting AND schedule
will be matched by those documents whose subject contains both the term "meeting" and the term "schedule" (not necessarily in this order). The witnesses will be minimal intervals in the subjects that contain both terms. For example, if the subject was
schedule the meeting (should we schedule this meeting or not)?
then the above query will have three witnesses: "schedule the meeting", "meeting (should we schedule" and "schedule this meeting".
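
The three witnesses above can be reproduced with a small, self-contained sketch. It does not use the MG4J API: the class and method names are hypothetical, words are numbered from 0 (so "schedule" is at positions 0 and 5 and "meeting" at positions 2 and 7), and the simple merge it uses is just one way to obtain the minimal intervals for two distinct terms, not necessarily the algorithm implemented by AndDocumentIterator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it is not part of MG4J.
final class AndWitnessSketch {
    /**
     * Returns the minimal intervals containing at least one position from each
     * of two sorted position lists (assumed to belong to two distinct terms).
     */
    public static List<int[]> minimalIntervals(int[] a, int[] b) {
        List<int[]> result = new ArrayList<>();
        int i = 0, j = 0;
        int prev = -1;             // last position seen in the merged sequence
        boolean prevFromA = false; // which list that position came from
        while (i < a.length || j < b.length) {
            boolean fromA = j == b.length || (i < a.length && a[i] < b[j]);
            int pos = fromA ? a[i++] : b[j++];
            // Two adjacent occurrences of different terms delimit a minimal interval.
            if (prev >= 0 && fromA != prevFromA) result.add(new int[] { prev, pos });
            prev = pos;
            prevFromA = fromA;
        }
        return result;
    }

    public static String format(List<int[]> intervals) {
        StringBuilder sb = new StringBuilder();
        for (int[] iv : intervals) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('[').append(iv[0]).append("..").append(iv[1]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Positions in "schedule the meeting (should we schedule this meeting or not)?"
        int[] meeting = { 2, 7 };
        int[] schedule = { 0, 5 };
        System.out.println(format(minimalIntervals(meeting, schedule))); // prints [0..2] [2..5] [5..7]
    }
}
```

The intervals [0..2], [2..5] and [5..7] correspond to "schedule the meeting", "meeting (should we schedule" and "schedule this meeting", respectively.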

The keyword AND can be substituted with the symbol &, with the symbol ∧ (Unicode 0x2227), or can even be omitted. So the above query is equivalent to:

meeting & schedule
and to
meeting schedule
or
meeting ∧ schedule
Also in this case, you can select a different index for the query to be matched. For example:
body: meeting schedule
(or, equivalently, body: meeting AND schedule or body: meeting & schedule) will be matched by documents that contain the term "meeting" in their body and the term "schedule" in their subject. In this case, witnesses come from different sources: a witness will be any single occurrence of the word "meeting" in the body (there should be at least one to make the document match the query) and any single occurrence of the word "schedule" in the subject (again, there should be at least one to make the document match the query). If you want both terms to be searched for in the body index, you can use:
body: meeting body: schedule
or, simply,
body: (meeting schedule)

Disjunctive queries

You can also introduce a disjunctive (OR) query, like in

meeting OR schedule
that will be matched by the documents that contain the term "meeting" or the term "schedule" (or both) in their subject. A witness will then be every single occurrence of either word in the subject. The keyword OR can be substituted with |, or with the symbol ∨ (Unicode 0x2228); hence, the previous query is equivalent to
meeting | schedule
and to
meeting ∨ schedule

Conjunctive and disjunctive operators can appear in the same query, with the rule that AND has higher priority than OR. So, for example:

meeting AND schedule OR time
will be matched by documents whose subject contains both "meeting" and "schedule", and by documents whose subject contains "time". In this case, a witness will either be a (possibly long) interval containing both the words "meeting" and "schedule", or a one-word interval containing the word "time". If you want to change this behaviour, you should use parentheses, like:
meeting AND (schedule OR time)

Again, you can use index selectors, like in:

body:meeting AND (schedule OR time)
that will be matched by documents containing "meeting" in their body (a witness being every single occurrence of the word in the body), and "schedule" or "time" in their subject (a witness being every single occurrence of either word in the subject). Similarly:
body:(meeting AND (schedule OR time))
will match documents that contain "meeting" and either "schedule" or "time" (or both) in their body.

Negative (NOT) queries

You can specify that you want to exclude documents containing a certain term, or, more in general, satisfying a certain query, by using the (unary, prefix) operator NOT. For example:

body:(meeting AND NOT tomorrow) AND subject:schedule
will be satisfied by the emails that contain the term "schedule" in their subject, and the term "meeting" but not the term "tomorrow" in their body. The operator NOT can be substituted with !, like in:
body:(meeting !tomorrow) subject:schedule

Negative queries are easily understood in a Boolean context, but may be more difficult as far as witnesses are concerned. Basically, the implementation of NOT works in such a way that NOT is actually used only for the Boolean match, but does not influence witnesses. In more detail, the only witness associated with a true NOT query is an empty interval.

Prefix and multiterm queries

A prefix query is a simple query that is matched by all terms starting with the same nonempty prefix. For example:

govern*
is matched by all documents containing any word starting with "govern". For the prefix operator * to work, you have to endow your index with a PrefixMap. What really happens in this case is that the query is essentially expanded into a disjunction that contains all the words in the dictionary that start with "govern".

To tell the truth, the expansion of a prefix query does not really lead to an OR, but rather to what we call a multiterm query: a multiterm query is like an OR, but it can only contain terms as subqueries, and behaves in many respects like a single term. It is unusual to specify a multiterm query manually (some query expansion mechanism, like prefixes, should be used instead), but if you want to try, a multiterm query can be obtained using the + operator. For example:

house + houses + housing
is a correct multiterm query, and it is loosely equivalent to
house OR houses OR housing

Note, however, that trying to use + instead of OR does not work if the subqueries are not simple queries, or if they concern different indices. For example:

house + title:meeting
would produce an error.

You may wonder why multiterm queries are needed, if they are essentially the same as OR queries. The first answer is efficiency: a multiterm query should be more efficient than an OR query.

The second answer is more subtle, and has to do with scorers. A scorer is a way to assign a score to a document that satisfies a query. Many scorers actually work by summing up suitable partial scores that depend on the document and on one of the terms in the query. Such partial scores are often functions of the count (number of times the term appears in the document) and of the frequency (number of documents where the term appears), and they are often really high when the term has a low frequency. The idea behind this is that if I write:

computer OR methacrylic
a document that satisfies the query because it contains "methacrylic" is more valuable than one that contains the word "computer", the former being much rarer.

Nonetheless, trying to use these scorers on automatically expanded queries may lead to many problems. For example, suppose you expanded

govern*
as
government OR governance OR governor OR governing 
(we are here assuming that the four terms above are the only ones that appear in the dictionary and start with "govern"). Now, since "governance" is presumably much rarer than "government", we expect all documents containing only "governance" to be given a high score. Using
government + governance + governor + governing 
the scorer acts on this bunch of words as a whole, and the frequency is assumed to be the maximum frequency (hence, it is the same for all words), avoiding the "governance"-prevalence problem.
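
To make the effect concrete, here is a minimal numeric sketch. Everything in it is an assumption made for illustration: the collection size, the four frequencies, and the textbook inverse-document-frequency formula log(N/f), which stands in for whatever partial score an actual scorer computes. It only shows how replacing per-term frequencies with the maximum frequency flattens the weights.

```java
// Hypothetical illustration class; it is not part of MG4J.
final class MultitermFrequencySketch {
    /** A textbook inverse-document-frequency weight, used here only as a stand-in for a real scorer. */
    public static double idf(long documents, long frequency) {
        return Math.log((double) documents / frequency);
    }

    /** The frequency a multiterm query is assumed to expose: the maximum over its terms. */
    public static long maxFrequency(long... frequencies) {
        long max = 0;
        for (long f : frequencies) max = Math.max(max, f);
        return max;
    }

    public static void main(String[] args) {
        long documents = 1000; // hypothetical collection size
        String[] terms = { "government", "governance", "governor", "governing" };
        long[] frequencies = { 400, 5, 50, 20 }; // hypothetical frequencies

        // As operands of OR, each term is weighted by its own frequency.
        for (int i = 0; i < terms.length; i++)
            System.out.printf("%s as an OR operand: %.2f%n", terms[i], idf(documents, frequencies[i]));

        // As part of a multiterm query, all terms share the weight of the maximum frequency.
        System.out.printf("any term of the multiterm query: %.2f%n", idf(documents, maxFrequency(frequencies)));
    }
}
```

With these made-up numbers, "governance" gets a much larger weight than "government" in the OR case, whereas in the multiterm case all four terms are weighted alike.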

Applying weights

MG4J allows one to assign to every subquery a nonnegative number, called weight. The actual meaning of the weight is immaterial, but it is postulated that weights only have impact on the scores of documents, not on satisfiability. In other words, some scorers may exploit weight information to decide how to rank documents that satisfy the query, but changing weights should never change the set of satisfying documents.

Weights are specified using a postfix syntax, like in:

government{1.3} + governance + governor{.2} + governing{.9} 
and their default value is 1 (in the example above, 1 is the weight that is assigned to the term "governance" as well as to the query as a whole). A more complex example is:
 
((foo | foogna{.3}) & (bar{.5} | bargna)){.7} | reallyimportant{.9}

True and false

MG4J provides true and false document iterators. A true iterator returns all documents of the collection, and for each document returns IntervalIterators.TRUE. A false iterator returns no document. They are represented, respectively, by the tokens #TRUE/#FALSE or by the symbols ⊤/⊥ (Unicode 0x22A4/0x22A5). They obey the standard logical laws, so

foo ⊤
is exactly equivalent to
foo

The main usage of true and false iterators is to enforce the syntactic structure of a composite document iterator. For instance, suppose you want to handle queries that are conjunctions of disjunctions, as in

(foo | bar) (gna | gne)

The problem is that if one of the disjunctions has exactly one element, the standard MG4J DocumentIteratorBuilderVisitor will eliminate the pleonastic OrDocumentIterator, obtaining a document iterator that does not reflect exactly the composite structure of the query. To avoid this problem, you can add true or false iterators. For instance, whereas

(foo | bar) (gna)
will have just one OrDocumentIterator,
(⊥ | foo | bar) (⊥ | gna)
is guaranteed to have two, with no change in semantics.

Queries available on indices with positions

Ordered conjunctive queries

The operator of ordered conjunction < works like AND, but requires the subqueries to be satisfied in the exact order in which they are specified, even though not necessarily consecutively. For example:

meeting < schedule
will only be matched by documents that contain in their subject at least one occurrence of the word "meeting" followed (maybe not immediately) by at least one occurrence of the word "schedule". Again, for example, if the subject was
schedule the meeting (should we schedule this meeting or not)?
then the above query will have only one witness: "meeting (should we schedule"; the other two minimal intervals that contain both words ("schedule the meeting" and "schedule this meeting") are not witnesses because words appear in the wrong order.

Note that the ordering between witnesses is strict: for instance, the query

meeting < meeting
has as its only witness "meeting (should we schedule this meeting". The single word "meeting" alone is not a witness for the query.
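
Both examples of this section can be checked with a small sketch that works on plain position lists. It is not MG4J code: the class and method names are hypothetical, words are numbered from 0, and the scan below is only one simple way of finding the minimal intervals for an ordered conjunction of two terms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it is not part of MG4J.
final class OrderedAndSketch {
    /**
     * Returns the minimal intervals [p..q] with p taken from first, q from second
     * and p strictly smaller than q. Both arrays must be sorted.
     */
    public static List<int[]> orderedIntervals(int[] first, int[] second) {
        List<int[]> result = new ArrayList<>();
        int j = 0;
        for (int i = 0; i < first.length; i++) {
            // Find the closest occurrence of the second term strictly after first[i].
            while (j < second.length && second[j] <= first[i]) j++;
            if (j == second.length) break;
            // If a later occurrence of the first term still precedes second[j],
            // the interval starting at first[i] is not minimal.
            if (i + 1 < first.length && first[i + 1] < second[j]) continue;
            result.add(new int[] { first[i], second[j] });
        }
        return result;
    }

    public static String format(List<int[]> intervals) {
        StringBuilder sb = new StringBuilder();
        for (int[] iv : intervals) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('[').append(iv[0]).append("..").append(iv[1]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Positions in "schedule the meeting (should we schedule this meeting or not)?"
        int[] meeting = { 2, 7 };
        int[] schedule = { 0, 5 };
        System.out.println(format(orderedIntervals(meeting, schedule))); // prints [2..5]
        System.out.println(format(orderedIntervals(meeting, meeting)));  // prints [2..7]
    }
}
```

The strict comparison in the inner loop is what prevents a single occurrence of "meeting" from witnessing meeting < meeting.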

In this case, it makes no sense (and it is indeed forbidden) to select a different index for the subqueries to be matched.

Consecutivity (phrasal queries)

You can specify that you want that some terms appear consecutively by using " (quotes). For example:

"meeting schedule"
will be matched if the terms "meeting" and "schedule" appear in this order, and consecutively, in the subject. Inside quotes, you can also use subqueries, surrounding them with parentheses, like in:
"meeting (schedule OR time)"
that is matched by documents whose subject contains the term "meeting" followed by either "schedule" or "time". A witness will this time be necessarily an interval of exactly two words (the first being "meeting" and the second being either "schedule" or "time").

More precisely, the quote operator is satisfied if there is a sequence of consecutive witnesses, one from each subquery, in the same order in which the subqueries appear.

Note that

"meeting schedule OR time"
would be invalid: if you want to use operators within quotes, you should do so within parentheses. Moreover, within quotes you cannot change index. So you can say
body:"meeting schedule"
but you cannot use
"body:meeting subject:schedule"

The symbol $ (dollar) can be used to specify an arbitrary word in a consecutive query. For instance,

"meeting $ schedule"
will match "meeting our schedule" as well as "meeting my schedule". You can add dollars also at the start of a phrase, but not at the end (in the latter case, they will be ignored).
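
As a rough illustration of phrases with placeholders, the following sketch matches a pattern directly against an array of words. It is a toy model, not the MG4J implementation: the class name is hypothetical, a null entry in the pattern plays the role of $, and subqueries inside the phrase are not modelled. How MG4J reports the extent of a witness for a phrase that starts with $ is not modelled here either.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it is not part of MG4J.
final class PhraseSketch {
    /**
     * Returns the intervals [start..end] of word positions at which the pattern
     * occurs consecutively. A null pattern entry matches an arbitrary word.
     */
    public static List<int[]> match(String[] words, String[] pattern) {
        // Placeholders at the end of the phrase are ignored, as described above.
        int length = pattern.length;
        while (length > 0 && pattern[length - 1] == null) length--;

        List<int[]> result = new ArrayList<>();
        if (length == 0) return result;
        for (int start = 0; start + length <= words.length; start++) {
            boolean ok = true;
            for (int k = 0; k < length && ok; k++)
                ok = pattern[k] == null || pattern[k].equals(words[start + k]);
            if (ok) result.add(new int[] { start, start + length - 1 });
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = "we are meeting our schedule and meeting my schedule".split(" ");
        String[] pattern = { "meeting", null, "schedule" };
        for (int[] w : match(words, pattern))
            System.out.println(w[0] + ".." + w[1]); // prints 2..4 and then 6..8
    }
}
```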

Proximity limit

As we have discussed, when a document matches a given query, there will be one or more witnesses within the document. Each such witness is a consecutive sequence of positions in the document that witness the matching. For example, consider the query

body:((meeting schedule) OR "John Smith") OR subject:alarm

This query will be matched by documents that contain the term "alarm" in their subject, and by documents that contain either the terms "meeting" and "schedule" or the (exact) sentence "John Smith" in their body. For every document that matches the query, there will be two sets of matching intervals, one about the body and the other about the subject; at least one of these two sets will be nonempty (because of the OR keyword). Intervals concerning the subject will simply be intervals of length one that correspond to the positions where the term "alarm" appears in the subject. Intervals concerning the body will be either intervals of length two corresponding to the positions where the sentence "John Smith" appears in the body, or intervals of length two or more where both "meeting" and "schedule" appear.

You might want to accept only matching intervals up to a certain length; for example, suppose you don't want to take into consideration intervals that contain "meeting" and "schedule" too far apart, say at a distance greater than 10 words. You can do this by using the proximity limit operator ~. Just rewrite the previous query as

body:((meeting schedule)~10 OR "John Smith") OR subject:alarm

This way, you are simply discarding the matching intervals that contain the terms "meeting" and "schedule" if their length (number of words) is greater than 10 (i.e., if "meeting" and "schedule" are separated by more than 8 words).

The proximity limit operator can be used at any point, and limits the length of all matching intervals of the query it is applied to. Note, however, that it may only be used on full-text indices.
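
The length rule can be sketched in a few lines. The code below is not MG4J code (the class name and the sample intervals are made up); it simply keeps the intervals whose length, counted in words, does not exceed the limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; it is not part of MG4J.
final class ProximitySketch {
    /** Keeps only the intervals [left..right] containing at most maxLength words. */
    public static List<int[]> lowPass(List<int[]> intervals, int maxLength) {
        List<int[]> result = new ArrayList<>();
        for (int[] iv : intervals)
            if (iv[1] - iv[0] + 1 <= maxLength) result.add(iv);
        return result;
    }

    public static String format(List<int[]> intervals) {
        StringBuilder sb = new StringBuilder();
        for (int[] iv : intervals) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('[').append(iv[0]).append("..").append(iv[1]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up witnesses of (meeting schedule): lengths 3, 10 and 12.
        List<int[]> witnesses = Arrays.asList(new int[] { 4, 6 }, new int[] { 10, 19 }, new int[] { 30, 41 });
        System.out.println(format(lowPass(witnesses, 10))); // prints [4..6] [10..19]
    }
}
```

The interval [10..19] survives because it contains exactly 10 words, that is, its two extremes are separated by 8 words.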

Difference

The Brouwerian difference operator is specified using - (minus). It is a rather esoteric operator that is rarely met by the end user, and that, given two subqueries, kills the witnesses of the first query (the minuend) that contain one or more witnesses of the second query (the subtrahend). By definition, for documents that satisfy the minuend, but not the subtrahend, the witnesses are unchanged. For instance, the following query

schedule < meeting - this
will be matched only if the term "schedule" and the term "meeting" appear in this order without the term "this" in between. If the subject is
schedule the meeting (should we schedule this meeting or not)?
the only valid witness is "schedule the meeting", and indeed, the following query
schedule < meeting - (this | the)
will not match at all the subject above, as all witnesses of the minuend are killed by witnesses of the subtrahend.

As an additional feature, you can specify a left and a right margin that will be used to enlarge the intervals of the subtrahend. For instance,

"schedule < meeting - [[1,2]] this"

will kill intervals of the minuend that contain the whole fragment "schedule this meeting or" (so no interval will be killed at all).
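
The definition of the difference operator, including margins, can be tried out on the example subject with the following sketch. It is a toy model on explicit interval lists, not the code of DifferenceDocumentIterator: the names are hypothetical, words are numbered from 0, and margins that would extend past the beginning of the document are not given any special treatment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; it is not part of MG4J.
final class DifferenceSketch {
    /**
     * Keeps the minuend intervals that do not contain any subtrahend interval,
     * after enlarging each subtrahend interval by the given margins.
     */
    public static List<int[]> difference(List<int[]> minuend, List<int[]> subtrahend, int leftMargin, int rightMargin) {
        List<int[]> result = new ArrayList<>();
        for (int[] m : minuend) {
            boolean killed = false;
            for (int[] s : subtrahend) {
                int left = s[0] - leftMargin, right = s[1] + rightMargin;
                if (m[0] <= left && right <= m[1]) { killed = true; break; }
            }
            if (!killed) result.add(m);
        }
        return result;
    }

    public static String format(List<int[]> intervals) {
        StringBuilder sb = new StringBuilder();
        for (int[] iv : intervals) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('[').append(iv[0]).append("..").append(iv[1]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "schedule the meeting (should we schedule this meeting or not)?"
        // Witnesses of schedule < meeting: [0..2] and [5..7]; "the" is at 1, "this" at 6.
        List<int[]> minuend = Arrays.asList(new int[] { 0, 2 }, new int[] { 5, 7 });
        List<int[]> thisOnly = Collections.singletonList(new int[] { 6, 6 });
        List<int[]> thisOrThe = Arrays.asList(new int[] { 1, 1 }, new int[] { 6, 6 });

        System.out.println(format(difference(minuend, thisOnly, 0, 0)));  // prints [0..2]
        System.out.println(format(difference(minuend, thisOrThe, 0, 0))); // prints an empty line
        System.out.println(format(difference(minuend, thisOnly, 1, 2)));  // prints [0..2] [5..7]
    }
}
```

In the last call the occurrence of "this" is enlarged to [5..8], which is not contained in any witness of the minuend, so nothing is killed.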

Alignment

The alignment operator is specified using ^ (circumflex). It is very different in nature from all other operators as it works across indices: it will return the intersection of the intervals returned by each component iterator. Clearly this is meaningful only when working with aligned indices, like those generated by parallel texts containing semantic tagging (see, e.g., WikipediaDocumentCollection). For instance, assuming that text is an index of some text and sem contains entity tags like PERSON, etc., at the same positions of the corresponding words in text, you can use

John ^ sem:PERSON
to look for instances of "John" tagged as a person. Note that all queries involved must be on a single index, and that this is the only restriction. For instance,
( Washington | Francisco ) ^ sem:( PERSON | PLACE )
will search for "Washington" or "Francisco" both as persons and as places.
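
Reading "intersection of the intervals" literally, alignment keeps the intervals that are returned, with identical extremes, by both components. The sketch below illustrates just that reading on made-up positions; it is not the code of AlignDocumentIterator, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; it is not part of MG4J.
final class AlignSketch {
    /** Returns the intervals of first that also appear, with the same extremes, in second. */
    public static List<int[]> align(List<int[]> first, List<int[]> second) {
        List<int[]> result = new ArrayList<>();
        for (int[] f : first)
            for (int[] s : second)
                if (f[0] == s[0] && f[1] == s[1]) { result.add(f); break; }
        return result;
    }

    public static void main(String[] args) {
        // Made-up positions: "John" occurs at 3 and 10 in text; PERSON tags sit at 3, 4 and 17 in sem.
        List<int[]> john = Arrays.asList(new int[] { 3, 3 }, new int[] { 10, 10 });
        List<int[]> person = Arrays.asList(new int[] { 3, 3 }, new int[] { 4, 4 }, new int[] { 17, 17 });
        for (int[] iv : align(john, person))
            System.out.println(iv[0] + ".." + iv[1]); // prints 3..3
    }
}
```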

Remapping

The remapping operator turns results related to an index to results related to another index. For instance, using the same example as for the alignment operator,

sem:PERSON {{ sem -> text }}
would return the same documents and positions of the query sem:PERSON, but those documents and positions would be viewed as coming from the field text.

There are two main useful consequences. First, snippets will be displayed using content from the text field, so MG4J will display snippets containing readable text, and not tags. Second, the document iterator associated with the query now returns results on text, so it can be freely mixed with positional operators. For instance,

"(sem:PERSON {{ sem -> text }}) finds"
searches for a person's name immediately followed by the term finds.

Many remappings can be specified in a single appearance of the operator, separated by a semicolon.

Queries available on payload-based indices

The atomic queries discussed above (term, prefix, etc.) can be used with standard indices, that is, indices of fields containing text. For payload-based indices, which represent document metadata such as dates, the standard query available in MG4J is a range query in which the first and last valid values are specified by the user. The resulting query is satisfied by all documents whose field is in the range. Both the first and the last value can be omitted. For instance, the following query

date:[ 20/2/2007 .. 23/2/2007 ]
will search for documents between 20 February and 23 February 2007, inclusive, whereas the query
date:[ .. 23/2/2007 ]
will search for documents up to 23 February 2007. Note that in the built-in parser spaces are necessary. They make it possible to separate the different tokens composing the query.

Range queries must not be used as a generic query mechanism, but rather to refine the result of a query over document content: a ranked query composed uniquely by a range query will have to scan the whole payload-based index just to return a few results.
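
The meaning of an inclusive range with optional extremes can be sketched as a plain predicate. The code below is only an illustration: it does not use the MG4J payload classes, the class and method names are hypothetical, and the day/month/year pattern is an assumption taken from the way dates are written in the examples above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration class; it is not part of MG4J.
final class DateRangeSketch {
    // Assumed format, mirroring dates such as 20/2/2007 in the examples.
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("d/M/yyyy");

    /** Parses a date; null stands for an omitted extreme. */
    public static LocalDate parse(String s) {
        return s == null ? null : LocalDate.parse(s, FORMAT);
    }

    /** Tests whether value lies in [first .. last], both inclusive; a null extreme is unbounded. */
    public static boolean inRange(LocalDate value, LocalDate first, LocalDate last) {
        return (first == null || !value.isBefore(first)) && (last == null || !value.isAfter(last));
    }

    public static void main(String[] args) {
        LocalDate first = parse("20/2/2007"), last = parse("23/2/2007");
        System.out.println(inRange(parse("23/2/2007"), first, last)); // prints true
        System.out.println(inRange(parse("1/1/2000"), null, last));   // prints true
        System.out.println(inRange(parse("24/2/2007"), null, last));  // prints false
    }
}
```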

Building and composing document iterators

The it.unimi.di.big.mg4j.search package contains all the classes needed to build a query and to match it against a certain collection of indices. This is actually only the semantic counterpart to a query; for the syntactic aspects, please refer to the it.unimi.di.big.mg4j.query.nodes package.

Basic classes

An Interval represents a consecutive set of natural numbers, that is, a witness within a document (in this case, numbers represent the positions within a document: 0 is the position of the first word, 1 is the position of the second and so on). An IntervalIterator is an iterator that returns intervals: typically, an interval iterator will return all intervals witnessing a certain query for a certain document (and a certain index).

For example, the query

body:((meeting schedule)~10 OR "John Smith") OR subject:alarm
will give rise to an interval iterator for the body and an interval iterator for the subject: the former will return intervals within the body witnessing the first part of the query, and the latter will return the intervals witnessing the second part of the query. Note that even upon a matching document either iterator may actually return no interval (because the overall query is disjunctive); nonetheless, the two iterators cannot both be empty.

It is always understood that intervals are returned in increasing order (of their left, or equivalently right, extreme).

A DocumentIterator is used to scan a whole collection of indices for a query. At every given moment, the iterator will be able to return the next document matching the query, and, for full-text indices, you will also be able to get the interval iterators of the witnesses for that document and for a specific index.

Obtaining and composing document iterators

The simplest kind of DocumentIterator you can build is an IndexIterator: it is a document iterator that scans a specific index for a specific term. You don't actually build an index iterator directly, but you rather obtain one by calling the Index.documents(CharSequence) (or, equivalently, IndexReader.documents(CharSequence)) method, that returns the set of documents containing a given term (and witnesses will be the single occurrences of such term).

Hence, for example, the following snippet opens a full-text index whose basename is mail-subject, and prints out all documents containing the word "meeting", each with the sequence of positions where the word appears (all intervals will be actually singletons).

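        // Open the full-text index whose basename is "mail-subject".
        Index subjectIndex = Index.getInstance( "mail-subject" );
        DocumentIterator it = subjectIndex.documents( "meeting" );
        // nextDocument() returns DocumentIterator.END_OF_LIST when no documents are left.
        for( long d; ( d = it.nextDocument() ) != DocumentIterator.END_OF_LIST; ) {
                System.out.println( "Document #: " + d );
                System.out.print( "\tPositions:" );
                // Each interval is a singleton containing one occurrence of the term.
                IntervalIterator intervalIterator = it.intervalIterator();
                for ( Interval interval; ( interval = intervalIterator.nextInterval() ) != null; )
                        System.out.print( " " + interval );
                System.out.println();
        }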

A number of classes in this package can be used to compose iterators; more precisely, for each query operator discussed above there is a corresponding class in this package. Each such class has a factory method that allows one to build new document iterators by composing existing iterators.

For example, the following snippet shows how to search for mails containing the words "meeting", "schedule" and "monday".

        Index subjectIndex = Index.getInstance( "mail-subject" );
        // Compose three index iterators into a conjunction.
        DocumentIterator it = AndDocumentIterator.getInstance(
                subjectIndex.documents( "meeting" ),
                subjectIndex.documents( "schedule" ),
                subjectIndex.documents( "monday" )
        );
        // nextDocument() returns DocumentIterator.END_OF_LIST when no documents are left.
        for( long d; ( d = it.nextDocument() ) != DocumentIterator.END_OF_LIST; ) {
                System.out.println( "Document #: " + d );
                System.out.print( "\tPositions:" );
                // Each interval is a minimal interval containing all three terms.
                IntervalIterator intervalIterator = it.intervalIterator();
                for ( Interval interval; ( interval = intervalIterator.nextInterval() ) != null; )
                        System.out.print( " " + interval );
                System.out.println();
        }

As for weights, there is no composition class for them: rather, every document iterator has a weight (a double) that one can set or get, and whose meaning and usage are left to the implementors of scorers.

The following table shows the correspondence between query operators and composition classes:

Operator                        Class
AND & ∧ (conjunction)           AndDocumentIterator
OR | ∨ (disjunction)            OrDocumentIterator
NOT ! (negation)                NotDocumentIterator
+ (multiterm)                   MultiTermIndexIterator
#TRUE ⊤                         TrueDocumentIterator
#FALSE ⊥                        FalseDocumentIterator
"..." (phrase)                  ConsecutiveDocumentIterator
< (ordered conjunction)         OrderedAndDocumentIterator
~ (proximity)                   LowPassDocumentIterator
- (difference)                  DifferenceDocumentIterator
^ (alignment)                   AlignDocumentIterator
{{ .. }} (remap)                RemappingDocumentIterator
[ .. ] (range)                  PayloadPredicateDocumentIterator

Note, however, that PayloadPredicateDocumentIterator is actually a completely generic predicate-based class that just returns documents whose payload satisfies a predicate.

Queries and document iterators

Even though it is perfectly legal to build document iterators by using these classes directly, this is not the natural way to do that. One should rather build a syntactic object corresponding to a query, and then make it into a document iterator that is, in some sense, the semantic counterpart of the query itself. For more information about how this works exactly, please consult the overview of the it.unimi.di.big.mg4j.query.nodes package.

Pathological cases

Due to minimality of intervals, sometimes the results of a query might be unexpected. This happens in particular when some cancellation of non-minimal intervals happens. For instance, consider the query

"is ( really | "really really" ) good"

We could expect that this query is satisfied by is really really good. But this does not happen, because the semantics of really | "really really" is an antichain of minimal intervals. Since whenever really really appears, also really appears, the intervals generated by the positions of really will cancel, when the disjunction is computed, the intervals generated by the positions of "really really". So the minimal-interval semantics of really and really | "really really" is exactly the same. We can of course get what we want using

"is really good" | "is really really good",
but the example shows that some care must be exercised.
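
The cancellation can be reproduced with a small sketch that takes the union of the witnesses of the two disjuncts and keeps only the minimal ones. It is a quadratic toy model, not the algorithm used by OrDocumentIterator; the class name is hypothetical, words are numbered from 0 in "is really really good", and the input is assumed to contain no duplicate intervals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; it is not part of MG4J.
final class AntichainSketch {
    /** Drops every interval that properly contains another interval of the list. */
    public static List<int[]> minimize(List<int[]> intervals) {
        List<int[]> result = new ArrayList<>();
        for (int[] candidate : intervals) {
            boolean minimal = true;
            for (int[] other : intervals) {
                boolean contained = candidate[0] <= other[0] && other[1] <= candidate[1];
                boolean different = other[0] != candidate[0] || other[1] != candidate[1];
                if (contained && different) { minimal = false; break; }
            }
            if (minimal) result.add(candidate);
        }
        // In an antichain, sorting by left extreme also sorts by right extreme.
        result.sort((x, y) -> Integer.compare(x[0], y[0]));
        return result;
    }

    public static String format(List<int[]> intervals) {
        StringBuilder sb = new StringBuilder();
        for (int[] iv : intervals) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('[').append(iv[0]).append("..").append(iv[1]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "is really really good": really is at 1 and 2, "really really" is [1..2].
        List<int[]> union = Arrays.asList(new int[] { 1, 1 }, new int[] { 2, 2 }, new int[] { 1, 2 });
        System.out.println(format(minimize(union))); // prints [1..1] [2..2]
    }
}
```

Neither surviving interval is at once immediately preceded by "is" (position 0) and immediately followed by "good" (position 3), which is why the phrase is not matched.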