Before starting our description of the indexing process, a brief introduction is necessary to present some basic concepts. MG4J deals with documents (e.g., HTML files, mail messages, etc.), and every document is composed of fields (e.g., the fields of a mail message will probably be its subject, sender, recipient, body, etc.). Although, as we shall see, MG4J provides support for non-textual fields, its bread and butter is textual fields, and for the time being we shall assume that we are dealing with documents composed of just one textual field.
A textual field (in our simplified view: a document) is a sequence
of words: it is up to the factory producing the document to decide how
to choose words (e.g., one may want to discard digits or not), and how
to segment the document. For instance, the
typical letter-to-nonletter transition used to split Western languages
does not work very well with, say, Chinese. However, once segmentation
has produced suitable words, they must be turned into indexable
terms: for instance, you may want to downcase
your words, but at the same time you may want to keep "ph" (as in "this
soap's ph") separated from "Ph" (as in "Ph.D. degree"). You may also
want to make more dramatic transformations, such as
stemming, or avoid indexing a term altogether.
All these operations are performed by a term processor, which can be specified on the command line. The option --downcase, for instance, selects the class it.unimi.di.big.mg4j.index.DowncaseTermProcessor. The chosen processor is recorded in the index structure: this is essential for interpreting queries correctly.
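Term processors implement the it.unimi.di.big.mg4j.index.TermProcessor interface. The following is a minimal sketch of a custom processor, assuming (check the MG4J Javadoc for the exact contract) that processTerm() may modify its argument in place and returns false when a term should be discarded:

    import it.unimi.di.big.mg4j.index.TermProcessor;
    import it.unimi.dsi.lang.MutableString;

    // A toy processor: downcases every term, except that it leaves
    // short capitalised terms such as "Ph" untouched, so that they
    // stay distinct from "ph". The rule is purely illustrative.
    public class GentleDowncaseProcessor implements TermProcessor {
        private static final long serialVersionUID = 1L;

        public boolean processTerm(final MutableString term) {
            if (term == null || term.length() == 0) return false; // discard empty terms
            if (term.length() == 2 && Character.isUpperCase(term.charAt(0))) return true;
            term.toLowerCase();
            return true;
        }

        public boolean processPrefix(final MutableString prefix) {
            return processTerm(prefix); // same treatment for query prefixes
        }

        public TermProcessor copy() { return this; } // stateless: sharing is safe
    }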
Note that in the design of other search engines segmentation and processing are somehow mixed into a generic tokenisation phase. We prefer to distinguish clearly between linguistic and algorithmic term processing: linguistic processing depends only on the writing customs of a language, whereas algorithmic processing is usually language neutral (we do not exclude, however, that it might be language dependent, too).
If you scan the whole document collection, you can collect all terms that appear in it; the set of all such terms is called the term dictionary. Note that every term in the dictionary appears in some (at least one) document, and probably it will appear in many documents, possibly even many times in some documents. (By the way: terms that appear in just one document are called hapax legomena, and they are far more frequent than one might expect in many collections, especially due to typos).
MG4J, like any other indexing tool, does not internally handle terms as character sequences: it uses numbers instead. This means that each term in the dictionary is assigned an index (a number between 0 and the dictionary size minus 1), and that this index is used whenever the application needs to refer to a term. Usually, indices are assigned in lexicographical order: index 0 is assigned to the first term in lexicographic order, index 1 to the next one, and so on. The assignment between terms and indices is stored in a suitable data structure that compactly represents both the dictionary and the map.
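For instance (plain Java, just to fix ideas, not MG4J code):

    import java.util.*;

    public class ToyDictionary {
        public static void main(String[] args) {
            // A three-term dictionary; indices follow lexicographic order.
            String[] terms = { "apple", "banana", "cherry" };
            Map<String, Integer> termToIndex = new HashMap<>();
            for (int i = 0; i < terms.length; i++) termToIndex.put(terms[i], i);
            System.out.println(termToIndex.get("banana")); // prints 1
        }
    }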
There are many possible representations of this assignment, each with its own memory requirements and each allowing different kinds of access to the data.
The simplest representation of a dictionary is the term list: a text file containing the whole dictionary, one term per line, in index order (the first line contains the term with index 0, the second line the term with index 1, and so on). This representation is not especially efficient, and access time is prohibitive for most applications. Usually, a file containing the term list takes the extension .terms; if the terms are not sorted lexicographically, the file takes the extension .terms.unsorted.
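Loading a term list requires nothing beyond standard Java; in the sketch below the file name comes from the command line, and we assume (as is usually the case) UTF-8 encoding:

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.*;

    public class LoadTermList {
        public static void main(String[] args) throws IOException {
            // Line i of a .terms file holds the term with index i.
            List<String> terms = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(args[0]), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) terms.add(line);
            }
            System.out.println(terms.size() + " terms; term 0 is " + terms.get(0));
        }
    }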
A much more efficient representation is by means of a monotone minimal perfect hash function: a very compact data structure that can correctly answer the question “What is the index of this term?” (more precisely, “What is the lexicographical rank of this term in the term list?”). You can build such a function from a sorted term list using the main method of one of the implementations available in Sux4J.
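Conceptually, such a function answers the same question as a binary search over the sorted term list, as in the toy sketch below; the point of a monotone minimal perfect hash function is that it returns the same answers for dictionary terms using just a few bits per term, without storing the terms at all:

    import java.util.*;

    public class RankSketch {
        public static void main(String[] args) {
            // In a sorted term list, the rank of a term is its index.
            List<String> terms = Arrays.asList("apple", "banana", "cherry");
            System.out.println(Collections.binarySearch(terms, "cherry")); // 2
            // For a term outside the dictionary the answer is meaningless
            // (here, a negative insertion point); see the discussion of
            // signed functions below.
            System.out.println(Collections.binarySearch(terms, "durian"));
        }
    }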
Monotone minimal perfect hash functions are very efficient and compact, but they have a serious limit. As we said before, they can correctly answer the question “What is the index of this term?”, but only for terms that appear in the dictionary. In other words, if the above question is posed for a term that does not appear anywhere, the answer you get is completely useless. This is not going to cause any harm if you are sure that you will never try to access the function with a term that does not belong to the dictionary, but it will become a nuisance in all other cases. To solve this problem, you can sign the function. A signed function will answer with a special value (-1) that means “the word is not in the dictionary”. You can sign any function using the signing classes in dsiutils (e.g., ShiftAddXorSignedFunction).
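The idea behind signing can be illustrated as follows (a toy sketch, not the dsiutils implementation): store, for each term, a small hash of the term (its signature), indexed by the output of the function itself; at lookup time, recompute the signature of the query term and return -1 on mismatch. False positives remain possible, but their probability decreases with the width of the signature.

    import java.util.*;

    // Toy signed function. unsignedIndex() stands in for a monotone
    // minimal perfect hash function: correct ranks for dictionary
    // terms, meaningless (but valid-looking) indices for anything else.
    // The signature check turns meaningless answers into -1.
    public class SignedRankSketch {
        private final List<String> terms; // used only by the stand-in
        private final int[] signature;

        public SignedRankSketch(final List<String> sortedTerms) {
            terms = sortedTerms;
            signature = new int[sortedTerms.size()];
            for (int i = 0; i < sortedTerms.size(); i++)
                signature[i] = sortedTerms.get(i).hashCode(); // toy signature
        }

        private int unsignedIndex(final String term) {
            final int i = Collections.binarySearch(terms, term);
            return i >= 0 ? i : Math.abs(i + 1) % terms.size(); // garbage if absent
        }

        public int getIndex(final String term) {
            final int i = unsignedIndex(term);
            return signature[i] == term.hashCode() ? i : -1;
        }

        public static void main(String[] args) {
            SignedRankSketch f = new SignedRankSketch(Arrays.asList("apple", "banana", "cherry"));
            System.out.println(f.getIndex("banana")); // 1
            System.out.println(f.getIndex("durian")); // -1
        }
    }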
Signed and unsigned monotone minimal perfect hash functions are fine as long as you don't need to access the index with wildcards. Wildcard searches require the use of a prefix map. A prefix map can correctly answer questions like “What are the indices of the terms starting with these characters?”. This is meaningful only if the terms are lexicographically sorted: in this case, the indices of the terms starting with a given prefix are consecutive, so the above question can be answered by giving just two integers (the first and the last index of the terms satisfying the property). You can build a prefix map using the main method of one of the implementations of the PrefixMap interface, e.g., ImmutableExternalPrefixMap from dsiutils; in fact, this is exactly what happens when you use IndexBuilder, although you can specify a different class for the term map using an option.
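The fact that the answer is an interval is easy to verify with two binary searches over the sorted term list, as in the conceptual sketch below (the actual dsiutils data structure computes the same interval without keeping the full list in memory):

    import java.util.*;

    public class PrefixIntervalSketch {
        // Returns {first, last} indices of terms starting with prefix,
        // or null if no term does.
        static int[] interval(final List<String> sorted, final String prefix) {
            final int from = lowerBound(sorted, prefix);
            // prefix + Character.MAX_VALUE is greater than every string
            // starting with prefix (assuming terms without U+FFFF).
            final int to = lowerBound(sorted, prefix + Character.MAX_VALUE) - 1;
            return from > to ? null : new int[] { from, to };
        }

        // Smallest index whose term is >= key (classic lower bound).
        static int lowerBound(final List<String> sorted, final String key) {
            final int i = Collections.binarySearch(sorted, key);
            return i >= 0 ? i : -i - 1;
        }

        public static void main(String[] args) {
            final List<String> terms = Arrays.asList("ph", "phase", "phd", "pi");
            System.out.println(Arrays.toString(interval(terms, "ph"))); // [0, 2]
        }
    }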