Constructing an index and querying it using the Query class is fine, but
usually MG4J must be integrated into some kind of environment. In this
chapter we describe how to access an index programmatically using MG4J. A
(small but growing) list of heavily commented examples is available in
the it.unimi.di.big.mg4j.example package.
In general, the first thing you need to do is load an index. To do that,
you must use the Index.getInstance() method, which takes care of a number
of things for you, such as finding the right index class, possibly loading
a term map, and so on. Usually you will have more than one index (e.g.,
title and main text).
The second piece of information needed in the following phases is an
index map: a data structure mapping a set of symbolic field names, which
will be used to denote the various indices, to the actual indices (i.e.,
to the actual instances of the Index class). There are simple ways to
build such maps on the fly using fastutil classes (see the RunQuery
example). Another important map is the map of term processors, which maps
each field name to the respective term processor. Usually the term
processor is the one used to build the index, which can be recovered from
the Index instance, but different choices are possible.
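The shape of these two maps can be sketched with the standard library alone. In the sketch below, FieldIndex and TermProcessor are hypothetical stand-ins for MG4J's Index and term processor types, and java.util.LinkedHashMap takes the place of the fastutil maps used in the RunQuery example; none of this is MG4J's actual API.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class IndexMapSketch {
    /** Hypothetical stand-in for a term processor: normalizes a raw term. */
    interface TermProcessor {
        String process(String term);
    }

    /** Hypothetical stand-in for an index; it remembers the processor used to build it. */
    static final class FieldIndex {
        final String field;
        final TermProcessor termProcessor;

        FieldIndex(String field, TermProcessor termProcessor) {
            this.field = field;
            this.termProcessor = termProcessor;
        }
    }

    /** Maps symbolic field names to the (stand-in) indices, in insertion order. */
    static Map<String, FieldIndex> buildIndexMap(FieldIndex... indices) {
        Map<String, FieldIndex> indexMap = new LinkedHashMap<>();
        for (FieldIndex index : indices) indexMap.put(index.field, index);
        return indexMap;
    }

    /** Maps each field name to the term processor recovered from its index. */
    static Map<String, TermProcessor> buildTermProcessors(Map<String, FieldIndex> indexMap) {
        Map<String, TermProcessor> termProcessors = new LinkedHashMap<>();
        for (Map.Entry<String, FieldIndex> entry : indexMap.entrySet())
            termProcessors.put(entry.getKey(), entry.getValue().termProcessor);
        return termProcessors;
    }

    public static void main(String[] args) {
        TermProcessor downcase = term -> term.toLowerCase(Locale.ROOT);
        Map<String, FieldIndex> indexMap = buildIndexMap(
            new FieldIndex("title", downcase), new FieldIndex("text", downcase));
        Map<String, TermProcessor> termProcessors = buildTermProcessors(indexMap);
        System.out.println(indexMap.keySet());
        System.out.println(termProcessors.get("title").process("MG4J"));
    }
}
```

Recovering the processor from the index is only the default; a different processor could be put into the second map for any field.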
There are now several ways to access MG4J. Given a textual query, the query is parsed and turned into an internal composite representation (essentially, a tree). Then, a builder visitor visits the tree and builds a corresponding document iterator, which will return the documents that satisfy the query.
At the base of query resolution, index iterators provide results from the index (i.e., the documents in which a term appears, along with other information); in other words, they are the iterators corresponding to the leaves of the query tree. These can be combined in various ways (conjunction, disjunction, etc.) to form document iterators. Document iterators return the documents satisfying the query and, for each document, a list of minimal intervals representing the regions of text satisfying the query. At that point, scorers are used to rank the documents returned by the document iterator.
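To make the combination step concrete, here is a minimal sketch of conjunction and disjunction over two leaf results, assuming each leaf yields its documents as a strictly increasing array of document numbers. It is not MG4J code: the class and method names are hypothetical, and the minimal intervals that real document iterators also return are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class PostingMergeSketch {
    /** Conjunction: documents that appear in both strictly increasing lists. */
    static List<Integer> and(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;        // a is behind: advance it
            else if (a[i] > b[j]) j++;   // b is behind: advance it
            else {                       // same document in both lists
                result.add(a[i]);
                i++;
                j++;
            }
        }
        return result;
    }

    /** Disjunction: documents that appear in at least one list, each reported once. */
    static List<Integer> or(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length || j < b.length) {
            int next;
            if (j == b.length || (i < a.length && a[i] < b[j])) next = a[i++];
            else if (i == a.length || b[j] < a[i]) next = b[j++];
            else {                       // same document in both lists
                next = a[i];
                i++;
                j++;
            }
            result.add(next);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] first = {1, 3, 5, 8};
        int[] second = {3, 4, 8, 9};
        System.out.println(and(first, second));
        System.out.println(or(first, second));
    }
}
```

Both methods make a single pass over their inputs, which is why keeping the leaf results sorted by document number matters.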
You can handle this chain of events at many different levels. You can,
for instance, build your own document iterators using the various
implementations of DocumentIterator. Or you can create queries (i.e.,
composites built using the implementations of
it.unimi.di.big.mg4j.query.node.Query) and turn them into document
iterators. You can even start from a textual query, parse it to obtain a
composite internal representation, and then go on from there.
Nonetheless, the simplest way is to use a façade class called QueryEngine
that tries to do all the dirty work for you. A query engine just needs to
know which parser you want to use (SimpleParser is the default parser
provided with MG4J), which builder visitor you want to use, and which
index map. The builder visitor is a visitor class that is used to traverse
the internal representation of a query and compute the corresponding
document iterator. The default visitor, DocumentIteratorBuilderVisitor, is
very simple but fits its purpose. You might want to change it, for
instance, to reduce object creation.
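The role of a builder visitor can be illustrated with a toy composite and a visitor that computes results from it. Everything in this sketch is hypothetical (Node, Visitor, DocumentSetBuilder and the in-memory postings map are not MG4J types), and it eagerly builds sorted sets of document numbers where MG4J builds lazy document iterators.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class BuilderVisitorSketch {
    /** Hypothetical visitor over the query tree, generic in what it builds. */
    interface Visitor<T> {
        T visitTerm(String term);
        T visitAnd(T left, T right);
        T visitOr(T left, T right);
    }

    /** Hypothetical query node; accept() performs a depth-first traversal. */
    interface Node {
        <T> T accept(Visitor<T> visitor);
    }

    static final class Term implements Node {
        final String term;
        Term(String term) { this.term = term; }
        public <T> T accept(Visitor<T> v) { return v.visitTerm(term); }
    }

    static final class And implements Node {
        final Node left, right;
        And(Node left, Node right) { this.left = left; this.right = right; }
        public <T> T accept(Visitor<T> v) { return v.visitAnd(left.accept(v), right.accept(v)); }
    }

    static final class Or implements Node {
        final Node left, right;
        Or(Node left, Node right) { this.left = left; this.right = right; }
        public <T> T accept(Visitor<T> v) { return v.visitOr(left.accept(v), right.accept(v)); }
    }

    /** Builds the set of matching documents from a toy in-memory index. */
    static final class DocumentSetBuilder implements Visitor<SortedSet<Integer>> {
        private final Map<String, SortedSet<Integer>> postings;
        DocumentSetBuilder(Map<String, SortedSet<Integer>> postings) { this.postings = postings; }

        public SortedSet<Integer> visitTerm(String term) {
            SortedSet<Integer> found = postings.get(term);
            // Copy, so that combining results never modifies the index itself.
            return found == null ? new TreeSet<Integer>() : new TreeSet<Integer>(found);
        }

        public SortedSet<Integer> visitAnd(SortedSet<Integer> left, SortedSet<Integer> right) {
            SortedSet<Integer> result = new TreeSet<Integer>(left);
            result.retainAll(right);
            return result;
        }

        public SortedSet<Integer> visitOr(SortedSet<Integer> left, SortedSet<Integer> right) {
            SortedSet<Integer> result = new TreeSet<Integer>(left);
            result.addAll(right);
            return result;
        }
    }

    /** Convenience helper for building toy posting lists. */
    static SortedSet<Integer> docs(Integer... ids) {
        return new TreeSet<Integer>(Arrays.asList(ids));
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> postings = new HashMap<>();
        postings.put("index", docs(1, 2, 4));
        postings.put("query", docs(2, 4, 7));
        postings.put("servlet", docs(9));
        // (index AND query) OR servlet
        Node query = new Or(new And(new Term("index"), new Term("query")), new Term("servlet"));
        System.out.println(query.accept(new DocumentSetBuilder(postings)));
    }
}
```

Because the tree knows nothing about what is built from it, a different visitor can be swapped in without touching the parser or the query classes.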
A query engine has many tweakable parameters, which are described in the
Javadoc documentation. However, its main advantage is that its process()
method takes a textual query, a range of ranked results, and a list in
which to deposit them, and does everything for you. This is the easiest
way to get results out of MG4J.
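The idea of asking for "a range of ranked results" and having them deposited into a caller-supplied list can be sketched as follows. This is only an illustration of the contract under stated assumptions: ScoredDocument and window() are hypothetical names, the scores are given rather than computed by scorers, and this is not how QueryEngine is implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RankedWindowSketch {
    /** Hypothetical pairing of a document number with its score. */
    static final class ScoredDocument {
        final int document;
        final double score;

        ScoredDocument(int document, double score) {
            this.document = document;
            this.score = score;
        }
    }

    /**
     * Ranks the matches by decreasing score and appends to results those whose
     * rank lies in [offset, offset + length). Assumes offset and length are
     * nonnegative. Returns the number of results appended.
     */
    static int window(List<ScoredDocument> matches, int offset, int length,
                      List<ScoredDocument> results) {
        List<ScoredDocument> ranked = new ArrayList<>(matches);
        // List.sort is stable, so documents with equal scores keep their input order.
        ranked.sort(Comparator.comparingDouble((ScoredDocument d) -> d.score).reversed());
        int from = Math.min(offset, ranked.size());
        int to = (int) Math.min((long) from + length, ranked.size());
        results.addAll(ranked.subList(from, to));
        return to - from;
    }

    public static void main(String[] args) {
        List<ScoredDocument> matches = new ArrayList<>();
        matches.add(new ScoredDocument(10, 0.2));
        matches.add(new ScoredDocument(11, 0.9));
        matches.add(new ScoredDocument(12, 0.5));
        matches.add(new ScoredDocument(13, 0.7));

        List<ScoredDocument> page = new ArrayList<>();
        int deposited = window(matches, 1, 2, page);  // second and third best results
        for (ScoredDocument d : page) System.out.println(d.document + " " + d.score);
        System.out.println(deposited + " results deposited");
    }
}
```

Passing the list in, rather than returning a new one, lets the caller reuse the same list across queries.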
A different route is to customize the QueryServlet class that MG4J uses
for its HTTP/HTML display. This might simply involve changing the Velocity
script that displays the results (which is set by a system variable; see
the class HttpQueryServer), or actually modifying the class code.