In this step, MG4J scans the whole document collection producing the so-called batches. Batches are subindices limited to a subset of documents, and they are created each time the number of indexed documents reaches a user-provided threshold, or when the available memory is too little.
An occurrence is a group of three numbers, say
(t
,d
,p
),
meaning that term with index t
appears in
document d
at position
p
. Here, the document is represented by an
integer, called the document pointer, which is in
most cases the position of the document in the document collection (0
for the first document, 1 for the second document and so on). Position
is also an integer that represents where the term occurs in the
document.
To understand what the scanning phase really does, suppose you have three documents:
Document pointer | Document |
---|---|
0 | I love you |
1 | God is love |
2 | Love is blind |
3 | Blind justice |
Here is the dictionary produced initially by the scanning phase:
Term index | Term |
---|---|
0 | blind |
1 | god |
2 | i |
3 | is |
4 | justice |
5 | love |
6 | you |
Now, at least conceptually, this is the list of occurrences:
Occurrences (in the same order as they are found when scanning the documents) |
---|
(2,0,0) (5,0,1) (6,0,2) (1,1,0) (3,1,1) (5,1,2) (5,2,0) (3,2,1) (0,2,2) (0,3,0) (4,3,1) |
This simply means that:
term 2 (I
) appears in document 0 at
position 0;
term 5 (love
) appears in document 0 at
position 1;
term 6 (you
) appears in document 0 at
position 2;
and so on. Inverted lists can now be obtained by re-sorting the occurrences in increasing term order, so that occurrences relative to the same term appear consecutively:
Term | Occurrences |
---|---|
0 (blind) | (0,2,2) (0,3,0) |
1 (god) | (1,1,0) |
2 (i) | (2,0,0) |
3 (is) | (3,1,1) (3,2,1) |
4 (justice) | (4,3,1) |
5 (love) | (5,0,1) (5,1,2) (5,2,0) |
6 (you) | (6,0,2) |
Now, the indexer must:
scan all documents and extract occurrences;
if the list of terms have not yet been obtained, gather new terms as they are found;
sort the terms in alphabetical order, renumbering all occurrences correspondingly;
(if required) renumber the documents and sort them in increasing order,
sort, at least partially, the occurrences found in increasing term order;
when the number of accumulated documents reaches a given threshold, create a subindex containing the current batch of occurrences.
The last point needs further explanation. Since occurrences are
a lot it is not reasonable to think that they can
be all kept in memory. What the indexer does is to keep track of a
current batch where occurrences are stored as they are found. MG4J uses
a in-memory inversion method: as documents are
parsed, an in-memory representation of an index is built in memory.
Since indices are heavily compressed, this makes it possible to build
rather large batches. When memory is exhausted, the batch is dumped on
disk under the form of a subindex (this operation is very fast, as the
in-memory representation is essentially identical to the on-disk
representation of a FileIndex
). Every batch will
be in term order, but different batches may (and usually, will) contain
occurrences of the same terms.
Getting back to the example given in Chapter 1, where we indexed
the collection javadoc.collection
, the basename of
the resulting index is going to be javadoc
(as
usual, completed with the field name). After running
IndexBuilder
, we get the following files:
-rw-r--r-- 1 vigna users 63K 2012-04-04 19:52 javadoc-text@0.frequencies -rw-r--r-- 1 vigna users 27M 2012-04-04 19:52 javadoc-text@0.index -rw-r--r-- 1 vigna users 85K 2012-04-04 19:52 javadoc-text@0.occurrencies -rw-r--r-- 1 vigna users 177K 2012-04-04 19:52 javadoc-text@0.offsets -rw-r--r-- 1 vigna users 162K 2012-04-04 19:52 javadoc-text@0.posnumbits -rw-r--r-- 1 vigna users 348 2012-04-04 19:52 javadoc-text@0.properties -rw-r--r-- 1 vigna users 45K 2012-04-04 19:52 javadoc-text@0.sizes -rw-r--r-- 1 vigna users 227K 2012-04-04 19:52 javadoc-text@0.summaxpos -rw-r--r-- 1 vigna users 1.2M 2012-04-04 19:52 javadoc-text@0.terms -rw-r--r-- 1 vigna users 28K 2012-04-04 19:53 javadoc-text@1.frequencies -rw-r--r-- 1 vigna users 37K 2012-04-04 19:53 javadoc-text@1.occurrencies -rw-r--r-- 1 vigna users 9.9M 2012-04-04 19:53 javadoc-text@1.index -rw-r--r-- 1 vigna users 81K 2012-04-04 19:53 javadoc-text@1.offsets -rw-r--r-- 1 vigna users 75K 2012-04-04 19:53 javadoc-text@1.posnumbits -rw-r--r-- 1 vigna users 343 2012-04-04 19:53 javadoc-text@1.properties -rw-r--r-- 1 vigna users 19K 2012-04-04 19:53 javadoc-text@1.sizes -rw-r--r-- 1 vigna users 108K 2012-04-04 19:53 javadoc-text@1.sumsmaxpos -rw-r--r-- 1 vigna users 522K 2012-04-04 19:53 javadoc-text@1.terms -rw-r--r-- 1 vigna users 390 2012-04-04 19:53 javadoc-text.cluster.properties -rw-r--r-- 1 vigna users 140 2012-04-04 19:53 javadoc-text.cluster.strategy -rw-r--r-- 1 vigna users 2.1M 2012-04-04 19:53 javadoc-text.counts -rw-r--r-- 1 vigna users 100K 2012-04-04 19:53 javadoc-text.countsoffsets -rw-r--r-- 1 vigna users 76K 2012-04-04 19:53 javadoc-text.frequencies -rw-r--r-- 1 vigna users 104K 2012-04-04 19:53 javadoc-text.occurrencies -rw-r--r-- 1 vigna users 4.2M 2012-04-04 19:53 javadoc-text.pointers -rw-r--r-- 1 vigna users 173K 2012-04-04 19:53 javadoc-text.pointersoffsets -rw-r--r-- 1 vigna users 28M 2012-04-04 19:53 javadoc-text.positions -rw-r--r-- 1 vigna users 194K 2012-04-04 19:53 javadoc-text.positionsoffsets -rw-r--r-- 1 vigna users 316 2012-04-04 19:53 javadoc-text.properties -rw-r--r-- 1 vigna users 64K 2012-04-04 19:53 javadoc-text.sizes -rw-r--r-- 1 vigna users 793 2012-04-04 19:53 javadoc-text.stats -rw-r--r-- 1 vigna users 270K 2012-04-04 19:53 javadoc-text.sumsmaxpos -rw-r--r-- 1 vigna users 597K 2012-04-04 19:53 javadoc-text.termmap -rw-r--r-- 1 vigna users 1.5M 2012-04-04 19:53 javadoc-text.terms -rw-r--r-- 1 vigna users 5.1K 2012-04-04 19:52 javadoc-title@0.frequencies -rw-r--r-- 1 vigna users 121K 2012-04-04 19:52 javadoc-title@0.index -rw-r--r-- 1 vigna users 5.1K 2012-04-04 19:52 javadoc-title@0.occurrencies -rw-r--r-- 1 vigna users 16K 2012-04-04 19:52 javadoc-title@0.offsets -rw-r--r-- 1 vigna users 5.6K 2012-04-04 19:52 javadoc-title@0.posnumbits -rw-r--r-- 1 vigna users 335 2012-04-04 19:52 javadoc-title@0.properties -rw-r--r-- 1 vigna users 13K 2012-04-04 19:52 javadoc-title@0.sizes -rw-r--r-- 1 vigna users 2.4K 2012-04-04 19:52 javadoc-title@0.sumsmaxpos -rw-r--r-- 1 vigna users 205K 2012-04-04 19:52 javadoc-title@0.terms -rw-r--r-- 1 vigna users 2.5K 2012-04-04 19:53 javadoc-title@1.frequencies -rw-r--r-- 1 vigna users 54K 2012-04-04 19:53 javadoc-title@1.index -rw-r--r-- 1 vigna users 2.5K 2012-04-04 19:53 javadoc-title@1.occurrencies -rw-r--r-- 1 vigna users 7.1K 2012-04-04 19:53 javadoc-title@1.offsets -rw-r--r-- 1 vigna users 2.7K 2012-04-04 19:53 javadoc-title@1.posnumbits -rw-r--r-- 1 vigna users 331 2012-04-04 19:53 javadoc-title@1.properties -rw-r--r-- 1 vigna users 5.4K 2012-04-04 19:53 javadoc-title@1.sizes -rw-r--r-- 1 vigna users 1.1K 2012-04-04 19:53 javadoc-title@1.sumsmaxpos -rw-r--r-- 1 vigna users 106K 2012-04-04 19:53 javadoc-title@1.terms -rw-r--r-- 1 vigna users 387 2012-04-04 19:53 javadoc-title.cluster.properties -rw-r--r-- 1 vigna users 140 2012-04-04 19:53 javadoc-title.cluster.strategy -rw-r--r-- 1 vigna users 20K 2012-04-04 19:53 javadoc-title.counts -rw-r--r-- 1 vigna users 7.3K 2012-04-04 19:53 javadoc-title.countsoffsets -rw-r--r-- 1 vigna users 7.3K 2012-04-04 19:53 javadoc-title.frequencies -rw-r--r-- 1 vigna users 7.3K 2012-04-04 19:53 javadoc-title.occurrencies -rw-r--r-- 1 vigna users 128K 2012-04-04 19:53 javadoc-title.pointers -rw-r--r-- 1 vigna users 22K 2012-04-04 19:53 javadoc-title.pointersoffsets -rw-r--r-- 1 vigna users 59K 2012-04-04 19:53 javadoc-title.positions -rw-r--r-- 1 vigna users 8.9K 2012-04-04 19:53 javadoc-title.positionsoffsets -rw-r--r-- 1 vigna users 303 2012-04-04 19:53 javadoc-title.properties -rw-r--r-- 1 vigna users 19K 2012-04-04 19:53 javadoc-title.sizes -rw-r--r-- 1 vigna users 746 2012-04-04 19:53 javadoc-title.stats -rw-r--r-- 1 vigna users 3.3K 2012-04-04 19:53 javadoc-title.sumsmaxpos -rw-r--r-- 1 vigna users 133K 2012-04-04 19:53 javadoc-title.termmap -rw-r--r-- 1 vigna users 299K 2012-04-04 19:53 javadoc-title.terms
As
you can see, there are several new files (they could be more or less,
depending on the number of documents stored on your system): each file
whose names starts with javadoc-text@.
belongs to a
certain subindex, that was generated using a batch of occurrences, where
files starting with javadoc-text.
belong to the
final overall index. Other files, such as the
.sizes
files, contain the list of the document
sizes (the number of words contained in each document). The latter is
useful for statistical purposes, but it might also be used by the
indices, to establish better compression methods for the inverted lists.
The .terms
files, instead, contains the terms
indexed in each batch. Note that each subindex can be queried
separately, albeit you will need to generate manually a term map if you
want to write query by term and not by term number
(IndexBuilder
creates such a map just for the
whole index). Note, however, that batches are created only for the
purpose of being merged into the final overall index: as indices they
are quite slow and unoptimized. The final index, instead, is by default
of quasi-succinct type—a new, more efficient type
of index developed for MG4J.
Now, if you look into the
javadoc-text.properties
file, you will find some
information:
documents=28523 terms=114994 postings=5676615 maxcount=12898 indexclass=it.unimi.di.mg4j.index.QuasiSuccinctIndex skipquantum=256 byteorder=LITTLE_ENDIAN termprocessor=it.unimi.di.mg4j.index.DowncaseTermProcessor batches=2 field=text size=284965070 maxdocsize=109296 occurrences=25098453
You can see some the overall number of
occurrences (25098453), the number of batches (2) and the maximum size
(number of words) of a document (109296). Similar information is
available on a per-batch basis looking at the remaining
.properties
files. More detailed information about
each field can be found in the Javadoc documentation of the
Index.PropertyKeys
class and of the index class
specified in the file (in this case,
QuasiSuccinctIndex
).
The files starting with javadoc-text.cluster
present a cluster view of the set of batches just
built. Essentially, they provide dynamic access to the entire set of
batches as a single index. More information can be found in the
documentation of the package
it.unimi.di.mg4j.index.cluster
.
The scanning phase is, by far, the most time/space consuming.
MG4J will work with little memory, but more memory will make it
possible to build larger batches, which can then be merged more
quickly and without opening too many files. You shoiuld set the JVM
memory as high as you can go, and a number of documens per batch that
does not cause too many compactions (or most of the time will be spent
in the garbage collector), always keeping in mind that larger batches
are better. If you experience out-of-memory errors (but it shouldn't
happen!), just lower the number of documents per batch. Note that the
memory compaction performed by MG4J seems to make the JVM erroneously
think that there is too much garbage collection, sometimes resulting
in an OutOfMemoryError
due to excessive
garbage-collector overhead. Please use the option
-XX:-UseGCOverheadLimit
to overcome the
problem.
The kind of document sequence is going to influence heavily the indexing time. The best way of providing data to MG4J is to stream documents to the standard input, separating them with a character (usually, newline or NUL). This is the default choice if you do not specify explicitly a collection. Other kind of collections (e.g., database-based collections) might be reasonably efficient, but, for instance, do not expect great results from document sequences retreving documents directly from the file system one at a time.
Remember that the indexer will produce a number of subindices, and this number will depend on the overall number of occurrences (which is, essentially, proportional to the total document size). Combining these subindices (or accessing them using an on-the-fly index combiner) has a cost in time that increases logarithmically with the number of subindices. Moreover, for each subindex an on-the-fly combiner needs to allocate buffers, so the memory cost for batch or on-the-fly combination increases linearly with the number of subindices. The rule of thumb is that you should try to make batches as large as possible, but you should also check the logs because working with an almost full heap can slow down Java significantly, and exaggeratly large batches can cause slowdown because of the large number of cache misses.