Scan: Building batches

In this step, MG4J scans the whole document collection producing the so-called batches. Batches are subindices limited to a subset of documents, and they are created each time the number of indexed documents reaches a user-provided threshold, or when the available memory is too little.

An occurrence is a group of three numbers, say (t,d,p), meaning that term with index t appears in document d at position p. Here, the document is represented by an integer, called the document pointer, which is in most cases the position of the document in the document collection (0 for the first document, 1 for the second document and so on). Position is also an integer that represents where the term occurs in the document.

To understand what the scanning phase really does, suppose you have three documents:

Document pointerDocument
0I love you
1God is love
2Love is blind
3Blind justice

Here is the dictionary produced initially by the scanning phase:

Term indexTerm

Now, at least conceptually, this is the list of occurrences:

Occurrences (in the same order as they are found when scanning the documents)
(2,0,0) (5,0,1) (6,0,2) (1,1,0) (3,1,1) (5,1,2) (5,2,0) (3,2,1) (0,2,2) (0,3,0) (4,3,1)

This simply means that:

and so on. Inverted lists can now be obtained by re-sorting the occurrences in increasing term order, so that occurrences relative to the same term appear consecutively:

0 (blind)(0,2,2) (0,3,0)
1 (god)(1,1,0)
2 (i)(2,0,0)
3 (is)(3,1,1) (3,2,1)
4 (justice)(4,3,1)
5 (love)(5,0,1) (5,1,2) (5,2,0)
6 (you)(6,0,2)

Now, the indexer must:

The last point needs further explanation. Since occurrences are a lot it is not reasonable to think that they can be all kept in memory. What the indexer does is to keep track of a current batch where occurrences are stored as they are found. MG4J uses a in-memory inversion method: as documents are parsed, an in-memory representation of an index is built in memory. Since indices are heavily compressed, this makes it possible to build rather large batches. When memory is exhausted, the batch is dumped on disk under the form of a subindex (this operation is very fast, as the in-memory representation is essentially identical to the on-disk representation of a FileIndex). Every batch will be in term order, but different batches may (and usually, will) contain occurrences of the same terms.

Getting back to the example given in Chapter 1, where we indexed the collection javadoc.collection, the basename of the resulting index is going to be javadoc (as usual, completed with the field name). After running IndexBuilder, we get the following files:

-rw-r--r-- 1 vigna users  63K 2012-04-04 19:52 javadoc-text@0.frequencies
-rw-r--r-- 1 vigna users  27M 2012-04-04 19:52 javadoc-text@0.index
-rw-r--r-- 1 vigna users  85K 2012-04-04 19:52 javadoc-text@0.occurrencies
-rw-r--r-- 1 vigna users 177K 2012-04-04 19:52 javadoc-text@0.offsets
-rw-r--r-- 1 vigna users 162K 2012-04-04 19:52 javadoc-text@0.posnumbits
-rw-r--r-- 1 vigna users  348 2012-04-04 19:52
-rw-r--r-- 1 vigna users  45K 2012-04-04 19:52 javadoc-text@0.sizes
-rw-r--r-- 1 vigna users 227K 2012-04-04 19:52 javadoc-text@0.summaxpos
-rw-r--r-- 1 vigna users 1.2M 2012-04-04 19:52 javadoc-text@0.terms
-rw-r--r-- 1 vigna users  28K 2012-04-04 19:53 javadoc-text@1.frequencies
-rw-r--r-- 1 vigna users  37K 2012-04-04 19:53 javadoc-text@1.occurrencies
-rw-r--r-- 1 vigna users 9.9M 2012-04-04 19:53 javadoc-text@1.index
-rw-r--r-- 1 vigna users  81K 2012-04-04 19:53 javadoc-text@1.offsets
-rw-r--r-- 1 vigna users  75K 2012-04-04 19:53 javadoc-text@1.posnumbits
-rw-r--r-- 1 vigna users  343 2012-04-04 19:53
-rw-r--r-- 1 vigna users  19K 2012-04-04 19:53 javadoc-text@1.sizes
-rw-r--r-- 1 vigna users 108K 2012-04-04 19:53 javadoc-text@1.sumsmaxpos
-rw-r--r-- 1 vigna users 522K 2012-04-04 19:53 javadoc-text@1.terms
-rw-r--r-- 1 vigna users  390 2012-04-04 19:53
-rw-r--r-- 1 vigna users  140 2012-04-04 19:53 javadoc-text.cluster.strategy
-rw-r--r-- 1 vigna users 2.1M 2012-04-04 19:53 javadoc-text.counts
-rw-r--r-- 1 vigna users 100K 2012-04-04 19:53 javadoc-text.countsoffsets
-rw-r--r-- 1 vigna users  76K 2012-04-04 19:53 javadoc-text.frequencies
-rw-r--r-- 1 vigna users 104K 2012-04-04 19:53 javadoc-text.occurrencies
-rw-r--r-- 1 vigna users 4.2M 2012-04-04 19:53 javadoc-text.pointers
-rw-r--r-- 1 vigna users 173K 2012-04-04 19:53 javadoc-text.pointersoffsets
-rw-r--r-- 1 vigna users  28M 2012-04-04 19:53 javadoc-text.positions
-rw-r--r-- 1 vigna users 194K 2012-04-04 19:53 javadoc-text.positionsoffsets
-rw-r--r-- 1 vigna users  316 2012-04-04 19:53
-rw-r--r-- 1 vigna users  64K 2012-04-04 19:53 javadoc-text.sizes
-rw-r--r-- 1 vigna users  793 2012-04-04 19:53 javadoc-text.stats
-rw-r--r-- 1 vigna users 270K 2012-04-04 19:53 javadoc-text.sumsmaxpos
-rw-r--r-- 1 vigna users 597K 2012-04-04 19:53 javadoc-text.termmap
-rw-r--r-- 1 vigna users 1.5M 2012-04-04 19:53 javadoc-text.terms
-rw-r--r-- 1 vigna users 5.1K 2012-04-04 19:52 javadoc-title@0.frequencies
-rw-r--r-- 1 vigna users 121K 2012-04-04 19:52 javadoc-title@0.index
-rw-r--r-- 1 vigna users 5.1K 2012-04-04 19:52 javadoc-title@0.occurrencies
-rw-r--r-- 1 vigna users  16K 2012-04-04 19:52 javadoc-title@0.offsets
-rw-r--r-- 1 vigna users 5.6K 2012-04-04 19:52 javadoc-title@0.posnumbits
-rw-r--r-- 1 vigna users  335 2012-04-04 19:52
-rw-r--r-- 1 vigna users  13K 2012-04-04 19:52 javadoc-title@0.sizes
-rw-r--r-- 1 vigna users 2.4K 2012-04-04 19:52 javadoc-title@0.sumsmaxpos
-rw-r--r-- 1 vigna users 205K 2012-04-04 19:52 javadoc-title@0.terms
-rw-r--r-- 1 vigna users 2.5K 2012-04-04 19:53 javadoc-title@1.frequencies
-rw-r--r-- 1 vigna users  54K 2012-04-04 19:53 javadoc-title@1.index
-rw-r--r-- 1 vigna users 2.5K 2012-04-04 19:53 javadoc-title@1.occurrencies
-rw-r--r-- 1 vigna users 7.1K 2012-04-04 19:53 javadoc-title@1.offsets
-rw-r--r-- 1 vigna users 2.7K 2012-04-04 19:53 javadoc-title@1.posnumbits
-rw-r--r-- 1 vigna users  331 2012-04-04 19:53
-rw-r--r-- 1 vigna users 5.4K 2012-04-04 19:53 javadoc-title@1.sizes
-rw-r--r-- 1 vigna users 1.1K 2012-04-04 19:53 javadoc-title@1.sumsmaxpos
-rw-r--r-- 1 vigna users 106K 2012-04-04 19:53 javadoc-title@1.terms
-rw-r--r-- 1 vigna users  387 2012-04-04 19:53
-rw-r--r-- 1 vigna users  140 2012-04-04 19:53 javadoc-title.cluster.strategy
-rw-r--r-- 1 vigna users  20K 2012-04-04 19:53 javadoc-title.counts
-rw-r--r-- 1 vigna users 7.3K 2012-04-04 19:53 javadoc-title.countsoffsets
-rw-r--r-- 1 vigna users 7.3K 2012-04-04 19:53 javadoc-title.frequencies
-rw-r--r-- 1 vigna users 7.3K 2012-04-04 19:53 javadoc-title.occurrencies
-rw-r--r-- 1 vigna users 128K 2012-04-04 19:53 javadoc-title.pointers
-rw-r--r-- 1 vigna users  22K 2012-04-04 19:53 javadoc-title.pointersoffsets
-rw-r--r-- 1 vigna users  59K 2012-04-04 19:53 javadoc-title.positions
-rw-r--r-- 1 vigna users 8.9K 2012-04-04 19:53 javadoc-title.positionsoffsets
-rw-r--r-- 1 vigna users  303 2012-04-04 19:53
-rw-r--r-- 1 vigna users  19K 2012-04-04 19:53 javadoc-title.sizes
-rw-r--r-- 1 vigna users  746 2012-04-04 19:53 javadoc-title.stats
-rw-r--r-- 1 vigna users 3.3K 2012-04-04 19:53 javadoc-title.sumsmaxpos
-rw-r--r-- 1 vigna users 133K 2012-04-04 19:53 javadoc-title.termmap
-rw-r--r-- 1 vigna users 299K 2012-04-04 19:53 javadoc-title.terms

As you can see, there are several new files (they could be more or less, depending on the number of documents stored on your system): each file whose names starts with javadoc-text@. belongs to a certain subindex, that was generated using a batch of occurrences, where files starting with javadoc-text. belong to the final overall index. Other files, such as the .sizes files, contain the list of the document sizes (the number of words contained in each document). The latter is useful for statistical purposes, but it might also be used by the indices, to establish better compression methods for the inverted lists. The .terms files, instead, contains the terms indexed in each batch. Note that each subindex can be queried separately, albeit you will need to generate manually a term map if you want to write query by term and not by term number (IndexBuilder creates such a map just for the whole index). Note, however, that batches are created only for the purpose of being merged into the final overall index: as indices they are quite slow and unoptimized. The final index, instead, is by default of quasi-succinct type—a new, more efficient type of index developed for MG4J.

Now, if you look into the file, you will find some information:


You can see some the overall number of occurrences (25098453), the number of batches (2) and the maximum size (number of words) of a document (109296). Similar information is available on a per-batch basis looking at the remaining .properties files. More detailed information about each field can be found in the Javadoc documentation of the Index.PropertyKeys class and of the index class specified in the file (in this case, QuasiSuccinctIndex).

The files starting with javadoc-text.cluster present a cluster view of the set of batches just built. Essentially, they provide dynamic access to the entire set of batches as a single index. More information can be found in the documentation of the package it.unimi.di.mg4j.index.cluster.

Time/space requirements

The scanning phase is, by far, the most time/space consuming. MG4J will work with little memory, but more memory will make it possible to build larger batches, which can then be merged more quickly and without opening too many files. You shoiuld set the JVM memory as high as you can go, and a number of documens per batch that does not cause too many compactions (or most of the time will be spent in the garbage collector), always keeping in mind that larger batches are better. If you experience out-of-memory errors (but it shouldn't happen!), just lower the number of documents per batch. Note that the memory compaction performed by MG4J seems to make the JVM erroneously think that there is too much garbage collection, sometimes resulting in an OutOfMemoryError due to excessive garbage-collector overhead. Please use the option -XX:-UseGCOverheadLimit to overcome the problem.

The kind of document sequence is going to influence heavily the indexing time. The best way of providing data to MG4J is to stream documents to the standard input, separating them with a character (usually, newline or NUL). This is the default choice if you do not specify explicitly a collection. Other kind of collections (e.g., database-based collections) might be reasonably efficient, but, for instance, do not expect great results from document sequences retreving documents directly from the file system one at a time.

Remember that the indexer will produce a number of subindices, and this number will depend on the overall number of occurrences (which is, essentially, proportional to the total document size). Combining these subindices (or accessing them using an on-the-fly index combiner) has a cost in time that increases logarithmically with the number of subindices. Moreover, for each subindex an on-the-fly combiner needs to allocate buffers, so the memory cost for batch or on-the-fly combination increases linearly with the number of subindices. The rule of thumb is that you should try to make batches as large as possible, but you should also check the logs because working with an almost full heap can slow down Java significantly, and exaggeratly large batches can cause slowdown because of the large number of cache misses.