Before starting our description of the indexing process, a brief introduction is necessary to present some basic concepts. MG4J deals with documents (e.g., HTML files, mail messages, etc.), and every document is composed of fields (e.g., the fields of a mail message will probably be its subject, sender, recipient, body, etc.). Although, as we shall see, MG4J provides support for non-textual fields, its bread and butter is textual fields, and for the time being we shall assume that we are dealing with documents composed of just one textual field.
A textual field (in our simplified view: a document) is a sequence
of words: it is up to the factory producing the document to decide how
to choose words (e.g., one may want to discard digits or not), and how
to segment the document. For instance, the
typical letter-to-nonletter transition used to split Western languages
does not work very well with, say, Chinese. However, once segmentation
has produced suitable words, they must be turned into indexable
terms: for instance, you may want to downcase
your words, but at the same time you may want to keep "ph" (as in "this
soap's ph") separated from "Ph" (as in "Ph.D. degree"). You may also
want to make more dramatic transformations, such as
stemming, or avoid indexing a term altogether.
All these operations are performed by a term processor, which can be specified on the command line. The option --downcase, for instance, selects the class it.unimi.di.big.mg4j.index.DowncaseTermProcessor. The chosen processor is recorded in the index structure: this is essential for interpreting queries correctly.
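Term processors implement the it.unimi.di.big.mg4j.index.TermProcessor interface. The following is a minimal sketch of a custom processor, assuming (check the MG4J Javadoc for the exact contract) that processTerm() may modify its argument in place and returns false when a term should be discarded:

    import it.unimi.di.big.mg4j.index.TermProcessor;
    import it.unimi.dsi.lang.MutableString;

    // A toy processor: downcases every term, except that it leaves
    // short capitalised terms such as "Ph" untouched, so that they
    // stay distinct from "ph". The rule is purely illustrative.
    public class GentleDowncaseProcessor implements TermProcessor {
        private static final long serialVersionUID = 1L;

        public boolean processTerm(final MutableString term) {
            if (term == null || term.length() == 0) return false; // discard empty terms
            if (term.length() == 2 && Character.isUpperCase(term.charAt(0))) return true;
            term.toLowerCase();
            return true;
        }

        public boolean processPrefix(final MutableString prefix) {
            return processTerm(prefix); // same treatment for query prefixes
        }

        public TermProcessor copy() { return this; } // stateless: sharing is safe
    }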
Note that in the design of other search engines segmentation and processing are somehow mixed into a generic tokenisation phase. We prefer to distinguish clearly between linguistic and algorithmic term processing: linguistic processing depends only on the writing customs of a language, whereas algorithmic processing is usually language neutral (we do not exclude, however, that it might be language dependent, too).
If you scan the whole document collection, you can collect all terms that appear in it; the set of all such terms is called the term dictionary. Note that every term in the dictionary appears in some (at least one) document, and probably it will appear in many documents, possibly even many times in some documents. (By the way: terms that appear in just one document are called hapax legomena, and they are far more frequent than one might expect in many collections, especially due to typos).
MG4J, like any other indexing tool, does not internally handle terms as character sequences: it uses numbers instead. This means that each term in the dictionary is assigned an index (a number between 0 and the dictionary size minus 1), and that this index is used whenever the application needs to refer to a term. Usually, indices are assigned in lexicographical order: index 0 is assigned to the first term in lexicographic order, index 1 to the next one, and so on. The assignment between terms and indices is stored in a suitable data structure that compactly represents both the dictionary and the map.
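For instance (plain Java, just to fix ideas, not MG4J code):

    import java.util.*;

    public class ToyDictionary {
        public static void main(String[] args) {
            // A three-term dictionary; indices follow lexicographic order.
            String[] terms = { "apple", "banana", "cherry" };
            Map<String, Integer> termToIndex = new HashMap<>();
            for (int i = 0; i < terms.length; i++) termToIndex.put(terms[i], i);
            System.out.println(termToIndex.get("banana")); // prints 1
        }
    }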
There are many possible representations of this assignment, each with its own memory requirements and each allowing different kinds of access to the data.
The simplest representation of a dictionary is the term list: a text file containing the whole dictionary, one term per line, in index order (the first line contains the term with index 0, the second line the term with index 1, and so on). This representation is not especially efficient, and access time is prohibitive for most applications. Usually, a file containing the term list takes the extension .terms; if the terms are not sorted lexicographically, the file takes the extension .terms.unsorted.
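Loading a term list requires nothing beyond standard Java; in the sketch below the file name comes from the command line, and we assume (as is usually the case) UTF-8 encoding:

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.*;

    public class LoadTermList {
        public static void main(String[] args) throws IOException {
            // Line i of a .terms file holds the term with index i.
            List<String> terms = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(args[0]), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) terms.add(line);
            }
            System.out.println(terms.size() + " terms; term 0 is " + terms.get(0));
        }
    }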
A much more efficient representation is by means of a monotone minimal perfect hash function: a very compact data structure that can correctly answer the question “What is the index of this term?” (more precisely, “What is the lexicographical rank of this term in the term list?”). You can build such a function from a sorted term list using the main method of one of the implementations available in Sux4J.
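Conceptually, such a function answers the same question as a binary search over the sorted term list, as in the toy sketch below; the point of a monotone minimal perfect hash function is that it returns the same answers for dictionary terms using just a few bits per term, without storing the terms at all:

    import java.util.*;

    public class RankSketch {
        public static void main(String[] args) {
            // In a sorted term list, the rank of a term is its index.
            List<String> terms = Arrays.asList("apple", "banana", "cherry");
            System.out.println(Collections.binarySearch(terms, "cherry")); // 2
            // For a term outside the dictionary the answer is meaningless
            // (here, a negative insertion point); see the discussion of
            // signed functions below.
            System.out.println(Collections.binarySearch(terms, "durian"));
        }
    }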
Monotone minimal perfect hash functions are very efficient and compact, but they have a serious limit. As we said before, they can correctly answer the question “What is the index of this term?”, but only for terms that appear in the dictionary. In other words, if the above question is posed for a term that does not appear anywhere, the answer you get is completely useless. This is not going to cause any harm if you are sure that you will never try to access the function with a term that does not belong to the dictionary, but it will become a nuisance in all other cases. To solve this problem, you can sign the function. A signed function will answer with a special value (-1) that means “the word is not in the dictionary”. You can sign any function using the signing classes in dsiutils (e.g., ShiftAddXorSignedFunction).
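The idea behind signing can be illustrated as follows (a toy sketch, not the dsiutils implementation): store, for each term, a small hash of the term (its signature), indexed by the output of the function itself; at lookup time, recompute the signature of the query term and return -1 on mismatch. False positives remain possible, but their probability decreases with the width of the signature.

    import java.util.*;

    // Toy signed function. unsignedIndex() stands in for a monotone
    // minimal perfect hash function: correct ranks for dictionary
    // terms, meaningless (but valid-looking) indices for anything else.
    // The signature check turns meaningless answers into -1.
    public class SignedRankSketch {
        private final List<String> terms; // used only by the stand-in
        private final int[] signature;

        public SignedRankSketch(final List<String> sortedTerms) {
            terms = sortedTerms;
            signature = new int[sortedTerms.size()];
            for (int i = 0; i < sortedTerms.size(); i++)
                signature[i] = sortedTerms.get(i).hashCode(); // toy signature
        }

        private int unsignedIndex(final String term) {
            final int i = Collections.binarySearch(terms, term);
            return i >= 0 ? i : Math.abs(i + 1) % terms.size(); // garbage if absent
        }

        public int getIndex(final String term) {
            final int i = unsignedIndex(term);
            return signature[i] == term.hashCode() ? i : -1;
        }

        public static void main(String[] args) {
            SignedRankSketch f = new SignedRankSketch(Arrays.asList("apple", "banana", "cherry"));
            System.out.println(f.getIndex("banana")); // 1
            System.out.println(f.getIndex("durian")); // -1
        }
    }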
Signed and unsigned monotone minimal perfect hash functions are fine as long as you don't need to access the index with wildcards. Wildcard searches require the use of a prefix map. A prefix map can correctly answer questions like “What are the indices of the terms starting with these characters?”. This is meaningful only if the terms are lexicographically sorted: in this case, the indices of the terms starting with a given prefix are consecutive, so the above question can be answered by giving just two integers (the first and the last index of the terms satisfying the property). You can build a prefix map using the main method of one of the implementations of the PrefixMap interface, e.g., ImmutableExternalPrefixMap from dsiutils; in fact, this is exactly what happens when you use IndexBuilder, although you can specify a different class for the term map using an option.
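The fact that the answer is an interval is easy to verify with two binary searches over the sorted term list, as in the conceptual sketch below (the actual dsiutils data structure computes the same interval without keeping the full list in memory):

    import java.util.*;

    public class PrefixIntervalSketch {
        // Returns {first, last} indices of terms starting with prefix,
        // or null if no term does.
        static int[] interval(final List<String> sorted, final String prefix) {
            final int from = lowerBound(sorted, prefix);
            // prefix + Character.MAX_VALUE is greater than every string
            // starting with prefix (assuming terms without U+FFFF).
            final int to = lowerBound(sorted, prefix + Character.MAX_VALUE) - 1;
            return from > to ? null : new int[] { from, to };
        }

        // Smallest index whose term is >= key (classic lower bound).
        static int lowerBound(final List<String> sorted, final String key) {
            final int i = Collections.binarySearch(sorted, key);
            return i >= 0 ? i : -i - 1;
        }

        public static void main(String[] args) {
            final List<String> terms = Arrays.asList("ph", "phase", "phd", "pi");
            System.out.println(Arrays.toString(interval(terms, "ph"))); // [0, 2]
        }
    }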