Overview (MG4J (big) 5.4.4)

MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections, written in Java. The big version is a fork of the original MG4J that can handle more than 2³¹ terms and documents.

MG4J is distributed under the GNU Lesser General Public License.

Warning

MG4J 5.0 brings several new features, but also source and binary incompatibilities with previous releases.

MG4J is no longer based on gap-based indices. Classical interleaved indices are used for incremental index construction and high-performance indices are still supported for historical reasons, but all new indices are by default built using the new quasi-succinct format, which brings unprecedented performance and improves compression.
The package prefix of MG4J is now it.unimi.di.*, following the change of name of our department, so as to ease the transition and make coexistence with previous versions possible.
DocumentIterator.nextDocument() now returns DocumentIterator.END_OF_LIST instead of -1 to denote list exhaustion.
The plethora of methods that accessed the positions of a term in an IndexIterator have been replaced by the single lazy IndexIterator.nextPosition() call, which returns IndexIterator.END_OF_POSITIONS when the positions are exhausted. Some static methods in IndexIterators should help with the transition.
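
The sentinel-based iteration style described in the last two items can be sketched as follows. This is a self-contained toy, not MG4J code: ToyPostingIterator is a hypothetical stand-in that only mimics the nextDocument() and nextPosition() conventions named above, and the numeric values of its two sentinels are assumptions made for the example. With the real library you would compare against DocumentIterator.END_OF_LIST and IndexIterator.END_OF_POSITIONS in the same way.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy posting list; a hypothetical stand-in, NOT an MG4J class. */
final class ToyPostingIterator {
    /** Stand-in sentinels; the values are assumptions made for this sketch. */
    static final long END_OF_LIST = Long.MAX_VALUE;
    static final int END_OF_POSITIONS = Integer.MAX_VALUE;

    private final long[] documents;   // document pointers, in increasing order
    private final int[][] positions;  // positions[i] belongs to documents[i]
    private int documentIndex = -1;
    private int nextPositionIndex = 0;

    ToyPostingIterator(long[] documents, int[][] positions) {
        this.documents = documents;
        this.positions = positions;
    }

    /** Advances to the next document, or returns END_OF_LIST when exhausted. */
    long nextDocument() {
        nextPositionIndex = 0;
        if (documentIndex < documents.length) documentIndex++;
        return documentIndex < documents.length ? documents[documentIndex] : END_OF_LIST;
    }

    /** Lazily returns the next position in the current document, or END_OF_POSITIONS. */
    int nextPosition() {
        if (documentIndex < 0 || documentIndex >= documents.length)
            throw new IllegalStateException("Not positioned on a document");
        int[] current = positions[documentIndex];
        return nextPositionIndex < current.length ? current[nextPositionIndex++] : END_OF_POSITIONS;
    }
}

final class SentinelIterationSketch {
    /** Walks every document and every position, stopping on the sentinels rather than on -1. */
    static List<String> collect(ToyPostingIterator iterator) {
        List<String> result = new ArrayList<>();
        long document;
        while ((document = iterator.nextDocument()) != ToyPostingIterator.END_OF_LIST) {
            int position;
            while ((position = iterator.nextPosition()) != ToyPostingIterator.END_OF_POSITIONS) {
                result.add(document + ":" + position);
            }
        }
        return result;
    }

    /** Two documents: document 2 with positions 0 and 7, document 5 with position 3. */
    static List<String> demo() {
        return collect(new ToyPostingIterator(new long[] { 2, 5 }, new int[][] { { 0, 7 }, { 3 } }));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Code written against older releases that tested for -1 needs its loop conditions changed to compare against the sentinel constants, as the loops in collect() do.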

Roadmap

MG4J is vast. Some of its components are the result of long-term research efforts and are not easy to describe in full detail. Here we give a roadmap to the documentation, so that you do not have to wander aimlessly through dozens of package descriptions.

First of all, MG4J comes with a manual that describes how to build indices, and how to access them from the command line or from the web. It is a good idea to start from the manual, build and play with a few indices, and then come back to package documentation, as the latter often refers to artifacts created by index construction.

If you want to interface MG4J with your own data, you must read the package documentation of it.unimi.di.big.mg4j.document, which describes document sequences, collections and factories.

If you want to load and query an index, you must read the package documentation of it.unimi.di.big.mg4j.index, which describes indices and index readers. The package also contains the documentation for term processors, which transform terms before they are actually indexed; they are fundamental to customising the indexing process.
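
To give an idea of what transforming a term before indexing means, here is a minimal, self-contained sketch. It does not use the MG4J term-processor interface: the method names, the use of plain strings, and the idea that a processor may reject a term altogether (signalled here by null) are assumptions made for illustration. See the package documentation for the actual contract.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TermProcessingSketch {
    /**
     * Hypothetical processor: lowercases the term and returns null
     * when the term should not be indexed at all (here, a stopword).
     */
    static String process(String term, Set<String> stopwords) {
        String lower = term.toLowerCase(Locale.ROOT);
        return stopwords.contains(lower) ? null : lower;
    }

    /** Applies the processor to each raw term, keeping only the accepted ones. */
    static List<String> indexableTerms(List<String> rawTerms, Set<String> stopwords) {
        List<String> result = new ArrayList<>();
        for (String raw : rawTerms) {
            String processed = process(raw, stopwords);
            if (processed != null) result.add(processed);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(indexableTerms(List.of("The", "Quick", "FOX"), Set.of("the")));
    }
}
```

Whatever transformation is applied at indexing time must also be applied to query terms, otherwise a query for "Fox" would never match a term indexed as "fox".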

If you want to have a look at your index, the package it.unimi.di.big.mg4j.query contains many useful classes that can help. In particular, a simple command-line tool lets you query an index using a standard syntax. The tool also makes it possible to query the index using a browser (if you plan on using the command line frequently, we suggest a utility such as rlwrap to provide command-line history and editing).

In a real application, you might want to customise the index querying process. First of all, you must decide which syntax you want to use. A good starting point is described in the package it.unimi.di.big.mg4j.query.parser, which contains a simple parser generated with JavaCC. The parser generates an abstract query described by a composite object whose description is given in it.unimi.di.big.mg4j.query.nodes. The query can then be turned into a DocumentIterator, which will return the documents matching the query and also the document intervals satisfying the query: the minimal-interval semantics used by MG4J is described in detail in it.unimi.di.big.mg4j.search, which also contains a description of the syntax used by the command-line tool.

Once a document iterator returning the matching documents is available, it is usually necessary to rank the documents. MG4J provides an abstract notion of Scorer and provides several examples. Scoring is a very sophisticated issue, and a lot of research has been devoted to this subject. MG4J provides implementations of some state-of-the-art scorers such as BM25, and also new scorers based on minimal-interval semantics such as VignaScorer.
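
For readers unfamiliar with BM25, the following self-contained sketch computes the textbook formula for one document. It is not the MG4J scorer and does not use the Scorer abstraction: the class and method names are hypothetical, the parameter values k1 = 1.2 and b = 0.75 are common defaults chosen for the example, and the inverse document frequency uses the variant with 1 added inside the logarithm so that it is never negative.

```java
final class Bm25Sketch {
    /** Common default parameters; assumptions for this sketch. */
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Non-negative IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long documentCount, long documentFrequency) {
        return Math.log(1.0 + (documentCount - documentFrequency + 0.5) / (documentFrequency + 0.5));
    }

    /** Contribution of a single query term to the score of a single document. */
    static double termScore(int termFrequency, long documentLength, double averageLength,
            long documentCount, long documentFrequency) {
        // Length normalisation: longer-than-average documents are penalised.
        double normalisation = K1 * (1.0 - B + B * documentLength / averageLength);
        return idf(documentCount, documentFrequency)
                * termFrequency * (K1 + 1.0) / (termFrequency + normalisation);
    }

    /** Score of a document for a multi-term query: the sum of the per-term contributions. */
    static double score(int[] termFrequencies, long[] documentFrequencies,
            long documentLength, double averageLength, long documentCount) {
        double sum = 0.0;
        for (int i = 0; i < termFrequencies.length; i++) {
            sum += termScore(termFrequencies[i], documentLength, averageLength,
                    documentCount, documentFrequencies[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two query terms in a 120-word document, from a 1000-document collection.
        System.out.println(score(new int[] { 3, 1 }, new long[] { 50, 400 }, 120, 100.0, 1000));
    }
}
```

The score grows with term frequency but saturates, and a term that occurs in few documents contributes more than a common one.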

All these pieces come together in the QueryEngine, which takes one or more queries, scores their results using one or more scorers, and returns only a certain part of the results themselves, decorated with suitably selected intervals that can be used to generate snippets. The query engine has several tunable parameters, so you can adapt it to your application. We suggest that you play with the command-line tool and the associated web interface to become familiar with the query-engine inner workings.

Packages
Package	Description
it.unimi.di.big.mg4j.document	This package contains all the logic related to and useful for managing documents, document collections and such.
it.unimi.di.big.mg4j.document.tika	This package contains classes that expose Tika parsers as MG4J factories.
it.unimi.di.big.mg4j.examples	Example classes.
it.unimi.di.big.mg4j.index	Index generation and access.
it.unimi.di.big.mg4j.index.cluster	Index partitioning and clustering.
it.unimi.di.big.mg4j.index.payload
it.unimi.di.big.mg4j.index.snowball	Snowball-based term processors.
it.unimi.di.big.mg4j.index.wired
it.unimi.di.big.mg4j.io	Bit-level support classes.
it.unimi.di.big.mg4j.query	User interfaces for querying indices.
it.unimi.di.big.mg4j.query.nodes	Composite representation for queries.
it.unimi.di.big.mg4j.query.parser	A simple JavaCC-generated parser used by the Query class.
it.unimi.di.big.mg4j.search	Classes that compose iterators over documents.
it.unimi.di.big.mg4j.search.score	Classes for assigning scores to documents.
it.unimi.di.big.mg4j.search.visitor	Visitors for composite document iterators.
it.unimi.di.big.mg4j.test
it.unimi.di.big.mg4j.tool	Command-line tools for index construction.
it.unimi.di.big.mg4j.util	Utility classes.
it.unimi.di.big.mg4j.util.parser.callback