In this step, MG4J scans the whole document collection producing the so-called batches. Batches are subindices limited to a subset of documents, and they are created each time the number of indexed documents reaches a user-provided threshold, or when the available memory is too little.
An occurrence is a group of three numbers, say
(t
,d
,p
),
meaning that term with index t
appears in
document d
at position
p
. Here, both the term and document are
represented by a long integer, called, in the second case, the
document pointer, which is in most cases the
position of the document in the document collection (0 for the first
document, 1 for the second document and so on). Position is an integer
that represents where the term occurs in the document.
To understand what the scanning phase really does, suppose you have three documents:
Document pointer | Document |
---|---|
0 | I love you |
1 | God is love |
2 | Love is blind |
3 | Blind justice |
Here is the dictionary produced initially by the scanning phase:
Term index | Term |
---|---|
0 | blind |
1 | god |
2 | i |
3 | is |
4 | justice |
5 | love |
6 | you |
Now, at least conceptually, this is the list of occurrences:
Occurrences (in the same order as they are found when scanning the documents) |
---|
(2,0,0) (5,0,1) (6,0,2) (1,1,0) (3,1,1) (5,1,2) (5,2,0) (3,2,1) (0,2,2) (0,3,0) (4,3,1) |
This simply means that:
term 2 (I
) appears in document 0 at
position 0;
term 5 (love
) appears in document 0 at
position 1;
term 6 (you
) appears in document 0 at
position 2;
and so on. Inverted lists can now be obtained by re-sorting the occurrences in increasing term order, so that occurrences relative to the same term appear consecutively:
Term | Occurrences |
---|---|
0 (blind) | (0,2,2) (0,3,0) |
1 (god) | (1,1,0) |
2 (i) | (2,0,0) |
3 (is) | (3,1,1) (3,2,1) |
4 (justice) | (4,3,1) |
5 (love) | (5,0,1) (5,1,2) (5,2,0) |
6 (you) | (6,0,2) |
Now, the indexer must:
scan all documents and extract occurrences;
if the list of terms have not yet been obtained, gather new terms as they are found;
sort the terms in alphabetical order, renumbering all occurrences correspondingly;
(if required) renumber the documents and sort them in increasing order,
sort, at least partially, the occurrences found in increasing term order;
when the number of accumulated documents reaches a given threshold, create a subindex containing the current batch of occurrences.
The last point needs further explanation. Since occurrences are a lot it is not reasonable to think that they can be all kept in memory. What the indexing pass does is keeping an internal batch where occurrences are stored as they are found; when the batch is full, it is ordered by term, and flushed out on disk under the form of a subindex. Every batch will be in term order, but different batches may (and usually, will) contain occurrences of the same termG.
Getting back to the example given in Chapter 1, where we indexed
the collection javadoc.collection
, the basename of
the resulting index is going to be javadoc
(as
usual, completed with the field name). After running
IndexBuilder
, we get the following files:
-rw-r--r-- 1 vigna vigna 430 May 13 12:02 javadoc-text.cluster.properties -rw-r--r-- 1 vigna vigna 144 May 13 12:02 javadoc-text.cluster.strategy -rw-r--r-- 1 vigna vigna 20k May 13 12:02 javadoc-text.frequencies -rw-r--r-- 1 vigna vigna 29k May 13 12:02 javadoc-text.globcounts -rw-r--r-- 1 vigna vigna 1.1M May 13 12:02 javadoc-text.index -rw-r--r-- 1 vigna vigna 52k May 13 12:02 javadoc-text.offsets -rw-r--r-- 1 vigna vigna 5.8M May 13 12:02 javadoc-text.positions -rw-r--r-- 1 vigna vigna 54k May 13 12:02 javadoc-text.posnumbits -rw-r--r-- 1 vigna vigna 387 May 13 12:02 javadoc-text.properties -rw-r--r-- 1 vigna vigna 9.4k May 13 12:02 javadoc-text.sizes -rw-r--r-- 1 vigna vigna 1.2k May 13 12:02 javadoc-text.stats -rw-r--r-- 1 vigna vigna 175k May 13 12:02 javadoc-text.termmap -rw-r--r-- 1 vigna vigna 412k May 13 12:02 javadoc-text.terms -rw-r--r-- 1 vigna vigna 14k May 13 12:01 javadoc-text@0.frequencies -rw-r--r-- 1 vigna vigna 19k May 13 12:01 javadoc-text@0.globcounts -rw-r--r-- 1 vigna vigna 3.8M May 13 12:01 javadoc-text@0.index -rw-r--r-- 1 vigna vigna 40k May 13 12:01 javadoc-text@0.offsets -rw-r--r-- 1 vigna vigna 37k May 13 12:01 javadoc-text@0.posnumbits -rw-r--r-- 1 vigna vigna 342 May 13 12:01 javadoc-text@0.properties -rw-r--r-- 1 vigna vigna 4.7k May 13 12:01 javadoc-text@0.sizes -rw-r--r-- 1 vigna vigna 264k May 13 12:01 javadoc-text@0.terms -rw-r--r-- 1 vigna vigna 11k May 13 12:02 javadoc-text@1.frequencies -rw-r--r-- 1 vigna vigna 15k May 13 12:02 javadoc-text@1.globcounts -rw-r--r-- 1 vigna vigna 2.7M May 13 12:02 javadoc-text@1.index -rw-r--r-- 1 vigna vigna 31k May 13 12:02 javadoc-text@1.offsets -rw-r--r-- 1 vigna vigna 29k May 13 12:02 javadoc-text@1.posnumbits -rw-r--r-- 1 vigna vigna 342 May 13 12:02 javadoc-text@1.properties -rw-r--r-- 1 vigna vigna 4.5k May 13 12:02 javadoc-text@1.sizes -rw-r--r-- 1 vigna vigna 213k May 13 12:02 javadoc-text@1.terms -rw-r--r-- 1 vigna vigna 3.1k May 13 12:02 javadoc-text@2.frequencies -rw-r--r-- 1 vigna vigna 4.2k May 13 12:02 javadoc-text@2.globcounts -rw-r--r-- 1 vigna vigna 234k May 13 12:02 javadoc-text@2.index -rw-r--r-- 1 vigna vigna 10k May 13 12:02 javadoc-text@2.offsets -rw-r--r-- 1 vigna vigna 9.3k May 13 12:02 javadoc-text@2.posnumbits -rw-r--r-- 1 vigna vigna 336 May 13 12:02 javadoc-text@2.properties -rw-r--r-- 1 vigna vigna 221 May 13 12:02 javadoc-text@2.sizes -rw-r--r-- 1 vigna vigna 80k May 13 12:02 javadoc-text@2.terms -rw-r--r-- 1 vigna vigna 429 May 13 12:02 javadoc-title.cluster.properties -rw-r--r-- 1 vigna vigna 144 May 13 12:02 javadoc-title.cluster.strategy -rw-r--r-- 1 vigna vigna 1.5k May 13 12:02 javadoc-title.frequencies -rw-r--r-- 1 vigna vigna 1.5k May 13 12:02 javadoc-title.globcounts -rw-r--r-- 1 vigna vigna 25k May 13 12:02 javadoc-title.index -rw-r--r-- 1 vigna vigna 5.2k May 13 12:02 javadoc-title.offsets -rw-r--r-- 1 vigna vigna 9.9k May 13 12:02 javadoc-title.positions -rw-r--r-- 1 vigna vigna 1.6k May 13 12:02 javadoc-title.posnumbits -rw-r--r-- 1 vigna vigna 376 May 13 12:02 javadoc-title.properties -rw-r--r-- 1 vigna vigna 2.5k May 13 12:02 javadoc-title.sizes -rw-r--r-- 1 vigna vigna 1.1k May 13 12:02 javadoc-title.stats -rw-r--r-- 1 vigna vigna 31k May 13 12:02 javadoc-title.termmap -rw-r--r-- 1 vigna vigna 61k May 13 12:02 javadoc-title.terms -rw-r--r-- 1 vigna vigna 752 May 13 12:01 javadoc-title@0.frequencies -rw-r--r-- 1 vigna vigna 752 May 13 12:01 javadoc-title@0.globcounts -rw-r--r-- 1 vigna vigna 11k May 13 12:01 javadoc-title@0.index -rw-r--r-- 1 vigna vigna 2.1k May 13 12:01 javadoc-title@0.offsets -rw-r--r-- 1 vigna vigna 791 May 13 12:01 javadoc-title@0.posnumbits -rw-r--r-- 1 vigna vigna 329 May 13 12:01 javadoc-title@0.properties -rw-r--r-- 1 vigna vigna 1.2k May 13 12:01 javadoc-title@0.sizes -rw-r--r-- 1 vigna vigna 32k May 13 12:01 javadoc-title@0.terms -rw-r--r-- 1 vigna vigna 761 May 13 12:02 javadoc-title@1.frequencies -rw-r--r-- 1 vigna vigna 761 May 13 12:02 javadoc-title@1.globcounts -rw-r--r-- 1 vigna vigna 11k May 13 12:02 javadoc-title@1.index -rw-r--r-- 1 vigna vigna 2.1k May 13 12:02 javadoc-title@1.offsets -rw-r--r-- 1 vigna vigna 883 May 13 12:02 javadoc-title@1.posnumbits -rw-r--r-- 1 vigna vigna 330 May 13 12:02 javadoc-title@1.properties -rw-r--r-- 1 vigna vigna 1.2k May 13 12:02 javadoc-title@1.sizes -rw-r--r-- 1 vigna vigna 29k May 13 12:02 javadoc-title@1.terms -rw-r--r-- 1 vigna vigna 41 May 13 12:02 javadoc-title@2.frequencies -rw-r--r-- 1 vigna vigna 41 May 13 12:02 javadoc-title@2.globcounts -rw-r--r-- 1 vigna vigna 400 May 13 12:02 javadoc-title@2.index -rw-r--r-- 1 vigna vigna 91 May 13 12:02 javadoc-title@2.offsets -rw-r--r-- 1 vigna vigna 44 May 13 12:02 javadoc-title@2.posnumbits -rw-r--r-- 1 vigna vigna 320 May 13 12:02 javadoc-title@2.properties -rw-r--r-- 1 vigna vigna 57 May 13 12:02 javadoc-title@2.sizes -rw-r--r-- 1 vigna vigna 1.2k May 13 12:02 javadoc-title@2.terms
As
you can see, there are several new files (they could be more or less,
depending on the number of documents stored on your system): each file
whose names starts with javadoc-text@
belongs to a
certain subindex, that was generated using a batch of occurrences. The
javadoc-text.properties
file contains global
information pertaining all subindices. Other files, such as the
.sizes
files, contain the list of the document
sizes (the number of words contained in each document). The latter is
useful for statistical purposes, but it might also be used by the
indices, to establish better compression methods for the inverted lists.
The .terms
files, instead, contains the terms
indexed in each batch. Note that each subindex can be queried
separately, albeit you will need to generate manually a term map if you
want to write query by term and not by term number
(IndexBuilder
creates such a map just for the
whole index). The class
it.unimi.dsi.util.ImmutableExternalPrefixMap
, for
instance, can be used to this purpose.
Now, if you look into the
javadoc-text.properties
file, you will find some
information:
documents = 4090 terms = 32634 postings = 976736 maxcount = 3425 indexclass = it.unimi.di.big.mg4j.index.FileHPIndex skipquantum = 0 skipheight = 8 coding = FREQUENCIES:GAMMA coding = POINTERS:DELTA coding = COUNTS:GAMMA coding = POSITIONS:DELTA termprocessor = it.unimi.di.big.mg4j.index.DowncaseTermProcessor batches = 3 field = text size = 57884884 maxdocsize = 53057 occurrences = 4599358
You can see some the overall number of
occurrences (4599358), the number of batches (3) and the maximum size
(number of words) of a document (53057). Similar information is
available on a per-batch basis looking at the remaining
.properties
files.
The files starting with javadoc-text.cluster
present a cluster view of the set of batches just
built. Essentially, they provide dynamic access to the entire set of
batches as a single index. More information can be found in the
documentation of the package
it.unimi.di.big.mg4j.index.cluster
.
The scanning phase is, by far, the most time/space consuming.
MG4J will work with little memory, but more memory will make it
possible to build larger batches, which can then be merged more
quickly and without opening too many files. You shoiuld set the JVM
memory as high as you can go, and a number of documens per batch that
does not cause too many compactions (or most of the time will be spent
in the garbage collector), always keeping in mind that larger batches
are better. If you experience out-of-memory errors (but it shouldn't
happen!), just lower the number of documents per batch. Note that the
memory compaction performed by MG4J seems to make the JVM erroneously
think that there is too much garbage collection, sometimes resulting
in an OutOfMemoryError
due to excessive
garbage-collector overhead. Please use the option
-XX:-UseGCOverheadLimit
to overcome the
problem.
The kind of document sequence is going to influence heavily the indexing time. The best way of providing data to MG4J is to stream documents to the standard input, separating them with a character (usually, newline or NUL). This is the default choice if you do not specify explicitly a collection. Other kind of collections (e.g., database-based collections) might be reasonably efficient, but, for instance, do not expect great results from document sequences retreving documents directly from the file system one at a time.
Remember that the indexer will produce a number of subindices, and this number will depend on the overall number of occurrences (which is, essentially, proportional to the total document size). Combining these subindices (or accessing them using an on-the-fly index combiner) has a cost in time that increases logarithmically with the number of subindices. Moreover, for each subindex an on-the-fly combiner needs to allocate buffers, so the memory cost for batch or on-the-fly combination increases linearly with the number of subindices. The rule of thumb is that you should try to make batches as large as possible, but you should also check the logs because working with an almost full heap can slow down Java significantly, and exaggeratly large batches can cause slowdown because of the large number of cache misses.