Chapter 2. Behind the scenes: The indexing process

Table of Contents

Introduction
Preamble: terms, dictionaries and term-related maps
Scan: Building batches
Time/space requirements
Combining batches
Virtual fields in MG4J
Virtual fields and virtual fragments
Document resolvers
What is a document resolver actually doing: virtual texts and gaps
Payload-based indices

Introduction

The main purpose of MG4J is the construction of inverted indices. An inverted index, much like the index you find at the end of a book, is a list of the occurrences in the text of every term. Building an inverted index is a complex process that MG4J performs essentially in two phases: scanning the documents to build batches, and then combining the batches. A further step, term map construction, is optional; whether you need it depends on the functionality you require of your index.
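
To make the book-index analogy concrete, here is a toy sketch in plain Java. It is not MG4J code and does not use the MG4J API: the class name ToyInvertedIndex, the method build and the naive tokenization are all assumptions made for illustration. It simply maps every term to the documents that contain it and, for each document, to the positions at which the term occurs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration only: an in-memory inverted index, unrelated to MG4J's classes.
public class ToyInvertedIndex {

    /** Maps each term to the documents containing it, and each document to the term's positions. */
    public static SortedMap<String, SortedMap<Integer, List<Integer>>> build(List<String> documents) {
        SortedMap<String, SortedMap<Integer, List<Integer>>> index = new TreeMap<>();
        for (int doc = 0; doc < documents.size(); doc++) {
            // Naive tokenization (an assumption): lowercase, split on runs of non-alphanumeric characters.
            String[] terms = documents.get(doc).toLowerCase().split("[^\\p{L}\\p{N}]+");
            int position = 0;
            for (String term : terms) {
                if (term.isEmpty()) continue; // split() may yield a leading empty string
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .computeIfAbsent(doc, d -> new ArrayList<>())
                     .add(position++);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> documents = List.of("the quick brown fox", "the lazy dog and the fox");
        // Prints one line per term, in alphabetical order, with its occurrences.
        build(documents).forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}
```

With the two sample documents, the entry for "fox" records position 3 in document 0 and position 5 in document 1. A structure like this lives entirely in memory and only works for tiny collections, which is why a real index has to be built in phases, as the rest of this chapter describes.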

Besides traditional indices, MG4J provides payload-based indices, which are used to store metadata associated with documents, such as dates, integers, and so on.

In this chapter we dissect the whole process to give you an idea of what happens when you run the Index class.