Package it.unimi.di.big.mg4j.query.nodes

Composite representation for queries

This package contains the classes that represent queries as an abstract syntax tree, or, in design-pattern jargon, as a composite.

Warning: Before reading this overview, the reader is encouraged to consult the first part of the it.unimi.di.big.mg4j.search package documentation, where the notion of query is briefly presented.

The basic idea is to perform a complete decoupling of the query construction and resolution into three levels:

  • String level: the query is a string, written in a suitable language where, for example, every operator has a certain syntax, uses a certain keyword etc.
  • Tree level: the query is an abstract tree, that contains one internal node for each operator, but that has nothing to do with the actual way the query was written; for example, one may want to give the user the possibility to write the query by using a visual editor, in which case there will be some way to obtain the tree representation of the query without having any string-level representation for it.
  • Iterator level: the document iterator that lazily solves the query over a certain index or set of indices; this is the semantic counterpart of a query, and it is discussed at large in the overview of the it.unimi.di.big.mg4j.search package.

This package overview shows how to pass from the string to the tree level using the built-in QueryParser, and explains what is needed to implement the tree level, as well as the basic instruments needed to convert the tree level into the iterator level.

In MG4J, a query parser is an implementation of QueryParser: essentially, an object with a method that transforms strings in queries. You can use any parser you like, or build your queries programmatically.

The parser provided with MG4J and whose syntax is described in the manual is currently generated using JavaCC: unfortunately, this tool produces some public classes that somehow clutter the package documentation. All the parsing logic is contained in the SimpleParser class, which is generated by the JavaCC source SimpleParser.jj.

Building a query tree out of a query string

MG4J allows one to perform queries on multiple indices; by saying so, we mean that you may have many indices concerning the same document collection, and you want to perform a query that may search on some of them. Think, for example, of a collection of emails. You might have generated a number of indices, to index their subjects, sender, recipient(s), body etc. All these indices can be thought of as individual entities, their only relation being that they index collections with the same number of documents, and that same-numbered documents conceptually "come" from the same source.

To get a query tree from a string representation, you must first build a QueryParser object. For instance, if you use the built-in SimpleParser you have to provide a map whose keys are index names (called index aliases) and with each index alias mapped to the corresponding Index. Moreover, one of the aliases is taken to be the default index alias used for the queries.

In our example, we will assume that we have three index aliases (say, subject, from and body), and that subject is the default index alias. The parser would then be created as follows (here, we are stipulating that index with alias x has basename mail-x):

                Object2ReferenceMap<String,Index> indexAlias2Index = new Object2ReferenceOpenHashMap<String,Index>();
                Index defaultIndex;
                indexAlias2Index.put( "subject", defaultIndex = Index.getInstance( "mail-subject" ) );
                indexAlias2Index.put( "from", Index.getInstance( "mail-from" ) );
                indexAlias2Index.put( "body", Index.getInstance( "mail-body" ) );
                
                QueryParser parser = new QueryParser( indexAlias2Index.keySet(), "subject" );
                
after which you can use it for example as follows:
                        Query query = parser.parse( "meeting AND body:(schedule OR urgent)" );
                        DocumentIteratorBuilderVisitor visitor = new DocumentIteratorBuilderVisitor( indexAlias2Index, defaultIndex, 0 );
                        DocumentIterator it = query.accept( visitor );
                        while ( it.hasNext() ) {
                                System.out.println( "Document #: " + it.nextDocument() );
                                System.out.print( "\tPositions:" );
                                for ( Interval interval: it )
                                        System.out.print( " " + interval );
                                System.out.println();
                        }
                

Structure of a Query

As explained above, in this package a Query is simply an abstract tree: its leaves will correspond to ground queries, whereas internal nodes correspond to query operators.

Ground queries can be term queries, prefix queries, multiterm queries (the latter usually generated by some query-expansion mechanism), True and False. Any other query is either a composite query (e.g., a conjunctive query, a disjunctive query etc.), that is, a query composed by other subqueries, or it is obtained by some other operator (e.g., a low-pass query etc.).

The Weight class allows one to assign a weight to a query (it wraps a query and assigns a weight with it; queries that are not assigned a weight explicitly have a default weight of 1). A special treatment is reserved to weighted terms (that is, to nodes that are made by applying a Weight to a Term: they are considered essentially as variants of terms, and so they can be used to build a multiterm query.

Every query has a method that accepts a QueryBuilderVisitor object, through which you can visit the query tree. More precisely, when the accept method is invoked on a query, a recursive visit of the query tree is performed: at every node n of type Q (Q being any class implementing Query), the following steps are performed:

  • the visitor's visitPre(Q n) method is called; if the method returns false, the visit is interrupted; otherwise…
  • the subtrees are recursively visited, obtaining an array of results T[] x; if at a certain point during this visit some call decides that the visit should be interrupted, we stop; otherwise…
  • the visitor's visitPost(Q n, T[] x) method is called, and the result (of type T) is returned.

To make the idea of a visitor easier to understand, consider the following simple example of a visitor:

                static class PrinterVisitor extends AbstractQueryBuilderVisitor<String> {
                private static void appendArray( MutableString res, String t[], char sep ) {
                        for ( int i = 0; i < t.length - 1; i++ ) res.append( t[ i ] + sep );
                        res.append( t[ t.length - 1 ] );
                }
                public String[] newArray( int len ) { return new String[ len ]; }
                public String visitPost( OrderedAnd node, String[] t ) {
                        MutableString res = new MutableString();
                        res.append( "OAND(" );
                        appendArray( res, t, ',' );
                        res.append( ")" );
                        return res.toString();
                }
                public String visitPost( Or node, String[] t ) {
                        MutableString res = new MutableString();
                        res.append( "OR(" );
                        appendArray( res, t, ',' );
                        res.append( ")" );
                        return res.toString();
                }
                public String visitPost( And node, String[] t ) {
                        MutableString res = new MutableString();
                        res.append( "AND(" );
                        appendArray( res, t, ',' );
                        res.append( ")" );
                        return res.toString();
                }
                public String visitPost( Consecutive node, String[] t ) {
                        MutableString res = new MutableString();
                        res.append( "\"" );
                        appendArray( res, t, ' ' );
                        res.append( "\"" );
                        return res.toString();
                }
                public String visitPost( Not node, String t ) {
                        return "NOT(" + t + ")";
                }
                public String visitPost( LowPass node, String t ) {
                        return t + "~" + node.k;
                }
                public String visitPost( Select node, String t ) {
                        return node.index + ":" + t;
                }
                public String visitPost( Weight node, String t ) {
                                return t + "{" + node.weight + "}";
                }
                public String visit( Term node ) { return node.term.toString(); }
                public String visit( Prefix node ) { return node.prefix + "*"; }
                public String visit( MultiTerm node ) { return node.term.toString(); }
        }
        

Now, you can pass an instance of this visitor to a query, and as a result get a string (linear) representation of the query itself. (Of course, there is a simple way to get the same result, that is, calling the toString() method directly, but for this example is easy enough for illustrative purposes.)

Here is an example of how the above visitor can be used:

          public static void main( String arg[] ) throws Exception {
                Query qa = new Term( "a" );
                Query qb = new Prefix( "b" );
                Query qc = new Term( "c" );
                Query qd = new Term( "d" );

                Query q = new LowPass( new And( qa, new Select( "another_index", qb ), new Or( qc, qd ) ), 20 );

                System.out.println( q.accept( new PrinterVisitor() ) );
        }
                
that produces AND(a,another_index:b*,OR(c,d))~20.

Building a document iterator out of a query

The interface Query and its implementations provide the basic classes for the tree-level representation of a query. Building an iterator out of it can be thought of as a process of recursive instantiation of a query tree into an iterator tree; this is actually a sort of a copy visit, and it is indeed implemented as a special kind of visitor: the DocumentIteratorBuilderVisitor. You can create one such visitor and use it to visit a query tree, obtaining as a result a document iterator that you can use to get the results.

To be more precise, let us recall the example presented in the last part of the overview of the search package:

                        Index subjectIndex = Index.getInstance( "mail-subject" );
                        DocumentIterator it = AndDocumentIterator.getInstance( 
                                subjectIndex.documents( "meeting" ), 
                                subjectIndex.documents( "schedule" ), 
                                subjectIndex.documents( "monday" ) 
                        );
                        while ( it.hasNext() ) {
                                        System.out.println( "Document #: " + it.nextDocument() );
                                        System.out.print( "\tPositions:" );
                                        for ( Interval interval: it )
                                                System.out.print( " " + interval );
                                        System.out.println();
                        }
                

Here is an equivalent piece of code that uses the idea of decoupling the tree level from the iterator level.

                        Index subjectIndex = Index.getInstance( "mail-subject" );
                        Query q = new And( new Term( "meeting" ), new Term( "schedule" ), new Term( "monday" ) );
                        DocumentIteratorBuilderVisitor visitor = new DocumentIteratorBuilderVisitor( null, subjectIndex, 0 );
                        DocumentIterator it = q.accept( visitor );
                        while ( it.hasNext() ) {
                                System.out.println( "Document #: " + it.nextDocument() );
                                System.out.print( "\tPositions:" );
                                for ( Interval interval: it )
                                        System.out.print( " " + interval );
                                System.out.println();
                        }