Chapter 5. Accessing MG4J indices programmatically

Constructing an index and querying it using the Query class is fine for experimentation, but usually MG4J must be integrated into a larger application. In this chapter we describe how to access an index programmatically using MG4J. A (small but growing) list of heavily commented examples is available in the it.unimi.di.big.mg4j.example package.

In general, the first thing you need to do is load an index. To do that, you must use the Index.getInstance() method, which takes care of a number of things for you, such as finding the right index class, possibly loading a term map, and so on. Usually you will have more than one index (e.g., title and main text).

The second piece of information needed in the following phases is an index map: a data structure mapping a set of symbolic field names, which will be used to denote the various indices, to the actual indices (i.e., to the actual instances of the Index class). There are simple ways to build such maps on the fly using fastutil classes (see the RunQuery example). Another important map is the map of term processors, which maps each field name to the respective term processor. Usually the term processor is the one used to build the index, which can be recovered from the Index instance, but different choices are possible.

At this point there are several ways to access MG4J. In all of them the underlying process is the same: a textual query is parsed and turned into an internal composite representation (essentially, a tree). Then, a builder visitor visits the tree and builds a corresponding document iterator, which will return the documents that satisfy the query.

At the base of query resolution, index iterators provide results from the index (i.e., the documents in which a term appears, and other information): in other words, they are the iterators corresponding to the leaves of the query tree. These can be combined in various ways (conjunction, disjunction, etc.) to form document iterators. Document iterators return the documents satisfying the query and, for each document, a list of minimal intervals representing the regions of text satisfying the query. At that point, scorers are used to rank the documents returned by the document iterator.
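To make the combination of leaf iterators concrete, here is a minimal, self-contained sketch of conjunction and disjunction over sorted lists of document identifiers. It is a toy model only: the class PostingLists and its methods are hypothetical names introduced for illustration, not MG4J classes, and it ignores intervals and scoring entirely.

```java
import java.util.Arrays;

/** Toy model only: this is not an MG4J class. */
final class PostingLists {
    private PostingLists() {}

    /** Conjunction: documents present in both sorted, duplicate-free lists. */
    static int[] and(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;            // a is behind: advance it
            else if (a[i] > b[j]) j++;       // b is behind: advance it
            else { out[n++] = a[i]; i++; j++; } // same document in both lists
        }
        return Arrays.copyOf(out, n);
    }

    /** Disjunction: documents present in at least one of the two lists. */
    static int[] or(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length || j < b.length) {
            if (j == b.length || (i < a.length && a[i] < b[j])) out[n++] = a[i++];
            else if (i == a.length || b[j] < a[i]) out[n++] = b[j++];
            else { out[n++] = a[i]; i++; j++; } // emit a shared document once
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] cat = {1, 3, 5, 8};   // documents containing "cat"
        int[] dog = {3, 4, 8, 9};   // documents containing "dog"
        System.out.println(Arrays.toString(and(cat, dog))); // prints [3, 8]
        System.out.println(Arrays.toString(or(cat, dog)));  // prints [1, 3, 4, 5, 8, 9]
    }
}
```

Real document iterators work lazily, one document at a time, rather than materializing whole arrays, but the merging logic follows the same idea.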

You can handle this chain of events at many different levels. You can, for instance, build your own document iterators using the various implementations of DocumentIterator. Or you can create queries (i.e., composites built using the implementations of it.unimi.di.big.mg4j.query.node.Query), and turn them into document iterators. You can even start from a textual query, parse it to obtain a composite internal representation, and then go on from there.

Nonetheless, the simplest way is to use a façade class called QueryEngine that tries to do all the dirty work for you. A query engine just wants to know which parser you want to use (SimpleParser is the default parser provided with MG4J), which builder visitor you want to use, and which index map. The builder visitor is a visitor class that is used to traverse the internal representation of a query and compute the corresponding document iterator. The default visitor, DocumentIteratorBuilderVisitor, is very simple but fits its purpose. You might want to change it, for instance, to reduce object creation.
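The role of a builder visitor can be illustrated with a small, self-contained sketch: a query tree made of terms, conjunctions and disjunctions, and a visitor that traverses it bottom-up and computes the set of matching documents. Everything here (QueryTreeSketch, Node, Visitor, MatchBuilder, the in-memory index) is a hypothetical toy written for this illustration; it does not reproduce the actual MG4J query node or visitor interfaces, and it builds plain sets rather than document iterators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy model only: none of these types are MG4J classes. */
final class QueryTreeSketch {
    private QueryTreeSketch() {}

    interface Node { <T> T accept(Visitor<T> visitor); }

    interface Visitor<T> {
        T visitTerm(String term);
        T visitAnd(List<T> children);
        T visitOr(List<T> children);
    }

    static final class Term implements Node {
        final String term;
        Term(String term) { this.term = term; }
        @Override public <T> T accept(Visitor<T> v) { return v.visitTerm(term); }
    }

    static final class And implements Node {
        final List<Node> children;
        And(Node... children) { this.children = Arrays.asList(children); }
        @Override public <T> T accept(Visitor<T> v) { return v.visitAnd(visitAll(children, v)); }
    }

    static final class Or implements Node {
        final List<Node> children;
        Or(Node... children) { this.children = Arrays.asList(children); }
        @Override public <T> T accept(Visitor<T> v) { return v.visitOr(visitAll(children, v)); }
    }

    /** Visits the children first, so that results are built from the leaves up. */
    private static <T> List<T> visitAll(List<Node> nodes, Visitor<T> visitor) {
        List<T> results = new ArrayList<>();
        for (Node node : nodes) results.add(node.accept(visitor));
        return results;
    }

    /** Builds the sorted set of matching documents from a term-to-documents map. */
    static final class MatchBuilder implements Visitor<TreeSet<Integer>> {
        private final Map<String, TreeSet<Integer>> index;
        MatchBuilder(Map<String, TreeSet<Integer>> index) { this.index = index; }

        @Override public TreeSet<Integer> visitTerm(String term) {
            TreeSet<Integer> postings = index.get(term);
            return postings == null ? new TreeSet<Integer>() : new TreeSet<Integer>(postings);
        }
        @Override public TreeSet<Integer> visitAnd(List<TreeSet<Integer>> children) {
            if (children.isEmpty()) return new TreeSet<Integer>();
            TreeSet<Integer> result = new TreeSet<Integer>(children.get(0));
            for (TreeSet<Integer> child : children) result.retainAll(child);
            return result;
        }
        @Override public TreeSet<Integer> visitOr(List<TreeSet<Integer>> children) {
            TreeSet<Integer> result = new TreeSet<Integer>();
            for (TreeSet<Integer> child : children) result.addAll(child);
            return result;
        }
    }

    /** A tiny hand-made index used only by this example. */
    static Map<String, TreeSet<Integer>> sampleIndex() {
        Map<String, TreeSet<Integer>> index = new HashMap<>();
        index.put("cat", new TreeSet<>(Arrays.asList(1, 3, 5, 8)));
        index.put("dog", new TreeSet<>(Arrays.asList(3, 4, 8, 9)));
        index.put("bird", new TreeSet<>(Arrays.asList(5)));
        return index;
    }

    public static void main(String[] args) {
        // cat AND (dog OR bird)
        Node query = new And(new Term("cat"), new Or(new Term("dog"), new Term("bird")));
        System.out.println(query.accept(new MatchBuilder(sampleIndex()))); // prints [3, 5, 8]
    }
}
```

Because the tree knows nothing about how results are built, swapping in a different visitor (for instance, one that reuses objects) changes the evaluation strategy without touching the parser or the query representation.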

A query engine has many tweakable parameters, which are described in the Javadoc documentation. However, its main advantage is that its process() method takes a textual query, a range of ranked results, and a list in which to deposit them, and does everything for you. You can easily get results from MG4J in this way.
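The calling convention described above (a range of ranked results plus a caller-supplied list that receives them) can be sketched in isolation as follows. This is a toy model under stated assumptions: RankedWindowSketch, Hit and this process() signature are hypothetical and are not the QueryEngine API, for which the Javadoc remains the reference. The sketch assumes non-negative offset and length and ranks by descending score, breaking ties by document number.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: this is not the MG4J QueryEngine. */
final class RankedWindowSketch {
    private RankedWindowSketch() {}

    static final class Hit {
        final int document;
        final double score;
        Hit(int document, double score) { this.document = document; this.score = score; }
    }

    /**
     * Ranks the scored documents and appends those in positions
     * offset .. offset + length - 1 to results. Returns how many were appended.
     */
    static int process(Map<Integer, Double> scores, int offset, int length, List<Hit> results) {
        List<Hit> ranked = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : scores.entrySet()) {
            ranked.add(new Hit(e.getKey(), e.getValue()));
        }
        // Highest score first; ties broken by ascending document number.
        ranked.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparingInt(h -> h.document));
        int from = Math.min(offset, ranked.size());
        int to = (int) Math.min((long) from + length, ranked.size());
        results.addAll(ranked.subList(from, to));
        return to - from;
    }

    public static void main(String[] args) {
        Map<Integer, Double> scores = new LinkedHashMap<>();
        scores.put(10, 0.5);
        scores.put(11, 2.0);
        scores.put(12, 1.0);
        scores.put(13, 2.0);
        List<Hit> page = new ArrayList<>();
        int added = process(scores, 1, 2, page); // second and third ranked results
        for (Hit h : page) System.out.println(h.document + " " + h.score);
        System.out.println(added + " results");
    }
}
```

Passing the destination list in, rather than returning a new one, lets the caller reuse the same list across queries, which is the kind of object-creation saving this chapter mentions elsewhere.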

A different route is to customize the QueryServlet class that MG4J uses for its HTTP/HTML display. This might simply involve changing the Velocity script that displays the results (which is set by a system variable; see the class HttpQueryServer), or actually modifying the class code.