Big news

Release 5.0 has several source and binary incompatibilities, and introduces quasi-succinct indices. Benchmarks on the performance of quasi-succinct indices can be found here; for instance, this table shows the number of seconds to answer 1000 multi-term queries on a document collection of 130 million web pages:

             MG4J    MG4J*   Lucene 3.6.2
Terms        70.9    132.1   130.6
And          27.5     36.7   108.8
Phrase       78.2        -   127.2
Proximity   106.5        -   347.6

Both engines were set to just enumerate the results without scoring. The column labelled MG4J* gives the timings of an artificially modified version in which counts for each retrieved document have been read (MG4J now stores document pointers and counts in separate files, but Lucene interleaves them, so it has to read counts compulsorily). Proximity queries are conjunctive queries that must be satisfied within a window of 16 words. The row labelled “Terms” gives the timings for enumerating the posting lists of all terms appearing in the queries.
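To make the figures easier to compare, the following sketch turns the batch timings of the "And" row into an average latency per query and a speedup over Lucene. It is only an illustration: the class QueryTimings and its methods are hypothetical helpers that are not part of MG4J, and the only inputs are the numbers reported in the table and the batch size of 1000 queries stated above.

```java
import java.util.Locale;

// Hypothetical helper (not part of MG4J): derives per-query latency and
// relative speedup from the batch timings reported in the table.
final class QueryTimings {
    /** Number of queries in each benchmark batch, as stated in the text. */
    static final int QUERIES = 1000;

    /** Average latency in milliseconds, given the total seconds for the batch. */
    static double millisPerQuery(double totalSeconds) {
        return totalSeconds * 1000.0 / QUERIES;
    }

    /** How many times faster the candidate is than the baseline. */
    static double speedup(double baselineSeconds, double candidateSeconds) {
        return baselineSeconds / candidateSeconds;
    }

    public static void main(String[] args) {
        // Timings (seconds per 1000 queries) from the "And" row.
        double mg4j = 27.5, mg4jStar = 36.7, lucene = 108.8;

        System.out.printf(Locale.ROOT, "MG4J:  %.1f ms/query, %.2fx vs. Lucene%n",
                millisPerQuery(mg4j), speedup(lucene, mg4j));
        System.out.printf(Locale.ROOT, "MG4J*: %.1f ms/query, %.2fx vs. Lucene%n",
                millisPerQuery(mg4jStar), speedup(lucene, mg4jStar));
    }
}
```

On the "And" row this gives a speedup slightly below 4 for MG4J and slightly below 3 for MG4J*, which suggests that reading counts accounts for only part of the difference from Lucene.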

As for index size, the following table compares the new quasi-succinct indices, the previous high-performance indices using default γ/δ compression, and Lucene indices on a number of different collections:

Collection                            MG4J (new)   MG4J (γ/δ)   Lucene 3.6.2
TREC GOV2 (text, 25 M documents)      36.9 GB      40.3 GB      42.1 GB
TREC GOV2 (title, 25 M documents)     264 MB       308 MB       396 MB
Web .uk (text, 130 M documents)       108 GB       117 GB       126 GB
Web .uk (title, 130 M documents)      1.38 GB      1.59 GB      2.15 GB
Mímir token index (1 M documents)     0.96 GB      1.01 GB      1.34 GB
Tweets (13 M documents)               302 MB       341 MB       423 MB
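For readers unfamiliar with the γ code mentioned above, here is a minimal sketch of the textbook Elias γ code for positive integers. It is not MG4J's implementation: the class EliasGamma is a hypothetical illustration that builds bit strings instead of writing to a bit stream, and it assumes the usual convention that codes start at 1 (an actual index format may shift values or use other conventions).

```java
// Hypothetical illustration (not MG4J code): textbook Elias γ coding
// of positive integers, using strings of '0'/'1' for readability.
final class EliasGamma {
    /** Encodes x >= 1 as (len - 1) zeros followed by the len-bit binary form of x. */
    static String encode(int x) {
        if (x < 1) throw new IllegalArgumentException("x must be positive");
        String binary = Integer.toBinaryString(x);
        StringBuilder bits = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) bits.append('0');
        return bits.append(binary).toString();
    }

    /** Decodes a single codeword produced by encode(). */
    static int decode(String bits) {
        int zeros = 0;
        // The count of leading zeros tells how many bits follow the first 1.
        while (bits.charAt(zeros) == '0') zeros++;
        return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
    }

    public static void main(String[] args) {
        for (int x : new int[] { 1, 2, 5, 9 }) {
            String code = encode(x);
            System.out.println(x + " -> " + code + " -> " + decode(code));
        }
    }
}
```

Small values get short codewords, which is why codes of this family are commonly applied to the gaps between consecutive document pointers rather than to the pointers themselves.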

Call for collaboration

The new quasi-succinct indices are very fast, but suggestions on low-level Java optimizations are welcome. In particular, the C++ version would benefit from a review by people acquainted with optimization for superscalar processors.

Introduction

MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java. MG4J is a highly customisable, high-performance, full-fledged search engine providing state-of-the-art features (such as BM25/BM25F scoring) and new research algorithms.

The main points of MG4J are:

The starting point for understanding MG4J is a look at the tutorial, which explains how to index a sample collection and query the newly constructed index from the command line or using a browser. Then, the Javadoc class documentation can provide more insights.

MG4J is free software distributed under the GNU Lesser General Public License. If you find MG4J useful, we kindly ask you to cite the following reference:

@INPROCEEDINGS{BoVTREC2005,
        title = "{M}{G}4{J} at {T}{R}{E}{C} 2005",
        author="Paolo Boldi and Sebastiano Vigna",
        year = 2005,
        booktitle = "The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings",
        editor = "Ellen M. Voorhees and Lori P. Buckland",
        publisher = "NIST",
        series = "Special Publications",
        number = "SP 500-266",
        note = "\texttt{\small http://mg4j.di.unimi.it/}",
}

Installation

You can get MG4J from Maven Central. Otherwise, you just have to install the .jar file that comes with the distribution, together with its dependencies, which are gathered for your convenience in a tarball.

Citations

Here you can find (in no particular order) research papers that have been written using MG4J. The list is not exhaustive, and we will be happy to include works that are missing.