Chapter 1. A Quick Tour of MG4J

Table of Contents

More sophisticated queries
A semantic index
A TREC index

Building your first index

Indexing in MG4J is centered around documents, which are exposed either as sequences or as collections. For the time being, let us concentrate on collections, which are randomly addressable lists of documents.

Each document in a collection is associated with a title and a URI. Typical titles are filenames, or titles from HTML documents. A URI can be the actual URL of a page. To build our first document collection, we use the main method of the class FileSetDocumentCollection, which builds and serializes a collection of documents specified by their filenames. As a typical case, we will build a collection out of your Javadoc documentation directory. Supposing your Javadocs are located in /usr/share/javadoc, you may try the following:

find /usr/share/javadoc/ -iname \*.html -type f | \
    egrep -v "(package-|-tree|class-use|index-.*.html|allclasses)" | \
    java it.unimi.di.mg4j.document.FileSetDocumentCollection \
        -f HtmlDocumentFactory -p encoding=UTF-8 javadoc.collection

Let us try to understand what is happening. We are providing as input to the main method of the class a list of files, one per line. Moreover, we are specifying (using the -f option) a factory, that is, something that will turn a pure stream of bytes (provided, in this case, by a file) into a document made of several fields (for instance, title and main text). The factory needs to know the encoding of the files, and we are specifying UTF-8 as a property. All this information is serialized and stored in a file named javadoc.collection. Note that since we are using a standard MG4J factory, we can omit the full factory class name (it.unimi.di.mg4j.document.HtmlDocumentFactory).
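Before piping the file list into FileSetDocumentCollection, it can help to look at what the filter actually lets through. The following is a minimal sketch, assuming a throwaway directory with made-up file names in place of a real Javadoc tree; it uses grep -E, which is equivalent to egrep, and prints the list instead of passing it to Java.

```shell
# Build a throwaway directory that mimics a small Javadoc tree
# (all file names here are hypothetical examples).
DOCS=$(mktemp -d)
mkdir -p "$DOCS/java/lang"
touch "$DOCS/java/lang/String.html" \
      "$DOCS/java/lang/package-summary.html" \
      "$DOCS/java/lang/package-tree.html" \
      "$DOCS/index-all.html" \
      "$DOCS/allclasses-frame.html"

# Same find/egrep filter as in the command above, but the resulting
# list is captured and printed rather than piped into the collection builder.
FILES=$(find "$DOCS" -iname \*.html -type f | \
    grep -E -v "(package-|-tree|class-use|index-.*.html|allclasses)")
echo "$FILES"
```

In this sample tree only String.html survives the filter: the package, tree, index and allclasses pages are navigation aids rather than class documentation, so they are left out of the collection.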

If you look into the file javadoc.collection, you will discover that it is indeed a typical serialized version of a Java object; note that the file does not contain the files that are part of the collection, but only their names. This means, in particular, that the very existence of the collection depends on the existence of the files it refers to; in other words, deleting or modifying any of the indexed files may cause inconsistencies in the collection (and, more importantly, in the index produced in the following steps). This is true of almost every collection: document collections may base their existence on some external data (files, web pages, mailbox files, etc.), and they usually become inconsistent as soon as such data are modified or deleted.
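One quick way to check the claim about serialization is to inspect the first bytes of the file. The sketch below assumes the collection is written with standard Java serialization, whose streams start with the bytes AC ED 00 05; the helper name is_java_serialized is made up for this example, and a stand-in file is used so that the snippet runs without a real collection.

```shell
# Succeed if the file starts with the Java serialization stream header
# (magic number 0xACED followed by version 0x0005).
is_java_serialized() {
    [ "$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')" = "aced0005" ]
}

# Stand-in file for demonstration only; with a real collection,
# pass javadoc.collection to the function instead.
SAMPLE=$(mktemp)
printf '\254\355\000\005' > "$SAMPLE"

if is_java_serialized "$SAMPLE"; then
    echo "looks like a serialized Java object"
fi
```

The check only looks at the header: it says nothing about whether the files named inside the collection still exist.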

It is now time to index our collection. To do so, we simply pass the collection to the main method of the class IndexBuilder, which scans all documents in the collection and produces a number of indices, one for each field of the collection. The number of fields depends on the factory used to produce documents: in our case, we will get indices for the title (the content of the HTML title element, if present; the filename is used instead if the title element is absent) and the main text (the textual content of the entire HTML page). Additionally, FileSetDocumentCollection sets the URI of each document to a URI pointing to the absolute location of the file in the file system; the document title is, once more, going to be the title appearing in the HTML content.

java -server it.unimi.di.mg4j.tool.IndexBuilder \
    --keep-batches --downcase -S javadoc.collection javadoc

The class IndexBuilder has a large number of options, as it runs in sequence the two phases of the indexing process. These phases are also available separately, mainly for very large collections (hundreds of millions of documents), for which memory limits are rather tight. You might need to specify a JVM memory option (e.g., -Xmx1G) to allow for more memory in such a case.

In this example, we have used the --downcase option, which forces all terms to be downcased: this means that the index will collapse words that differ only in letter case. For example, the terms String and string will not be distinguished. More generally, you could specify a different term processor for custom term modification (in this case, the DowncaseTermProcessor class has been implicitly chosen). The -S option specifies that we are producing an index for the specified collection (javadoc.collection): if the option were omitted, IndexBuilder would expect to index a document sequence read from standard input (more about this below). The --keep-batches option is not normally used, but we specify it here so that we can have a look at the temporary files generated during the indexing process. The last, unflagged argument, javadoc, is the only mandatory argument of IndexBuilder: it is the index basename, the prefix from which the names of all index files are derived.
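The effect of downcasing on the set of terms can be illustrated without MG4J at all. The following sketch is only an analogy built from standard shell tools, not the way DowncaseTermProcessor is implemented: it lowercases a few sample terms and keeps the distinct ones.

```shell
# Three spellings of the same word, differing only in letter case.
TERMS=$(printf 'String\nstring\nSTRING\n')

# Lowercase every term, then keep only the distinct terms that remain.
DISTINCT=$(printf '%s\n' "$TERMS" | tr '[:upper:]' '[:lower:]' | sort -u)
echo "$DISTINCT"
```

All three spellings end up as the single term string, which is why a query for String and a query for string return the same results on a downcased index.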

Since our collection has documents containing two fields, named title and text (actually, there is also a third virtual field named anchor, but we will not index it for the time being), there will be two sets of index files: each will be named, by convention, with the index basename followed by the field name (separated with a dash). Hence, there will be index files named javadoc-title.something and files named javadoc-text.something.
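The naming convention can be spelled out as a small loop. This is a sketch that assumes the basename javadoc and the two field names used in this chapter; it lists whatever index files are present in the current directory and prints nothing for a field that has not been indexed yet.

```shell
BASENAME=javadoc
INDEX_NAMES=""
for FIELD in title text; do
    # Index files for a field share the prefix <basename>-<field>.
    INDEX_NAMES="$INDEX_NAMES ${BASENAME}-${FIELD}"
    # List the files for this field, if any; ignore the error when none exist.
    ls "${BASENAME}-${FIELD}".* 2>/dev/null || true
done
echo "Index names:$INDEX_NAMES"
```

The two names collected in INDEX_NAMES are the ones that are later passed to the query tool.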

We have now built indices, and we are ready to query them using a web server. This is very easy in MG4J: we just run the main method of the Query class specifying the -h option and passing as argument the indices and (for showing snippets) the collection:

java it.unimi.di.mg4j.query.Query -h -i FileSystemItem \
     -c javadoc.collection javadoc-text javadoc-title

We can now either use the command line (if you have rlwrap installed, you can put it to good use), or open the search page by pointing our browser to http://localhost:4242/Query and start querying the collection. Note the -i option, which specifies what each result item should link to: the specified class links to the file in the file system using a local HTTP server (the observation about class names made for factories applies here, too).

Note that the names we specified for the indices (e.g., javadoc-text) are actually URIs, so you can add options much like in a web query. For instance, javadoc-text?inMemory=1 would load the index into main memory, whereas javadoc-text?mapped=1 would try to use low-level memory-mapping features of the operating system to cache the most frequently used part of the index in main memory.
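When index URIs carry options, they should be quoted on the command line, because an unquoted ? is a shell wildcard. The sketch below reuses the Query invocation shown earlier with the two options just mentioned; the variable names are made up for the example, and the command is only printed (a dry run), so it can be tried without a built index.

```shell
# Quote each index URI: an unquoted ? is a shell wildcard and could be
# expanded against file names in the current directory.
TEXT_INDEX='javadoc-text?inMemory=1'
TITLE_INDEX='javadoc-title?mapped=1'

# Dry run: build and print the command instead of executing it.
# To actually start the query server, run the printed command.
CMD=$(echo java it.unimi.di.mg4j.query.Query -h -i FileSystemItem \
     -c javadoc.collection "$TEXT_INDEX" "$TITLE_INDEX")
echo "$CMD"
```

Each index gets its own options here: the text index would be loaded into main memory, while the title index would be memory-mapped.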
