Indexing in MG4J is centered around documents, either exposed by means of sequences or of collections. For the time being, let us concentrate on collections, which are randomly addressable lists of documents.
Each document in a collection is associated with a title and a URI.
Typical titles are filenames, or titles from HTML documents. URIs can be
the actual URL of a page. To build our first document collection, we use
the main method of the class FileSetDocumentCollection, which builds and
serializes a set of documents specified by their filenames. As a typical
case, we will build a collection out of your Javadoc documentation
directory. Supposing your Javadocs are located in /usr/share/javadoc,
you may try the following:
find /usr/share/javadoc/ -iname \*.html -type f | \
    egrep -v "(package-|-tree|class-use|index-.*.html|allclasses)" | \
    java it.unimi.di.big.mg4j.document.FileSetDocumentCollection \
        -f HtmlDocumentFactory -p encoding=UTF-8 javadoc.collection
Let us try to understand what is happening. We are providing as input to
the main method of the class a list of files, one per line. Moreover, we
are specifying (using the -f option) a factory, that is, something that
will turn a pure stream of bytes (provided, in this case, by a file)
into a document made of several fields (for instance, title and main
text). The factory needs to know the encoding of the files, and we are
specifying UTF-8 as a property. All this information is serialized and
stored in a file named javadoc.collection. Note that since we are using
a standard MG4J factory, we can omit the full factory class name
(it.unimi.di.big.mg4j.document.HtmlDocumentFactory).
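To see which files the egrep -v filter in the pipeline above lets through, the following minimal Java sketch applies the same regular expression to a list of paths. The class name JavadocFileFilter and the sample paths are hypothetical and not part of MG4J; the sketch assumes that java.util.regex interprets this particular expression the same way egrep does, which should hold for the simple constructs used here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of MG4J: mimics the egrep -v step of the pipeline.
final class JavadocFileFilter {
    // The same expression passed to egrep in the command above.
    static final Pattern EXCLUDED =
        Pattern.compile("(package-|-tree|class-use|index-.*.html|allclasses)");

    // Keeps only the paths in which the expression occurs nowhere (like egrep -v).
    static List<String> keep(List<String> paths) {
        return paths.stream()
            .filter(path -> !EXCLUDED.matcher(path).find())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample paths, chosen to exercise the filter.
        List<String> sample = Arrays.asList(
            "/usr/share/javadoc/java/lang/String.html",
            "/usr/share/javadoc/java/lang/package-summary.html",
            "/usr/share/javadoc/allclasses-frame.html",
            "/usr/share/javadoc/index-files/index-1.html");
        keep(sample).forEach(System.out::println);
    }
}
```

Only the first sample path survives: the others are navigation or summary pages, which would add little to a full-text index of the class documentation.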
If you look into the file javadoc.collection, you will discover that it
is indeed a typical serialized Java object; note that the file does not
contain the files that are part of the collection, but only their names.
This means, in particular, that the very existence of the collection
depends on the existence of the files it spans; in other words, deleting
or modifying any of the indexed files may make the collection
inconsistent (and, more importantly, the index produced in the following
steps). This is true of almost every collection: document collections
may base their existence on some external data (files, web pages,
mailbox files, etc.), and they usually become inconsistent as soon as
such data are modified or deleted.
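The dependence on external files can be illustrated with a small Java sketch. The class NameOnlyCollection below is hypothetical and is not how FileSetDocumentCollection is implemented; it only shows, under that assumption, a serialized object that stores file names rather than file contents, together with a check for names that no longer point to an existing file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration only; not an MG4J class.
final class NameOnlyCollection implements Serializable {
    private static final long serialVersionUID = 1L;

    // Only the names are stored; the file contents stay on disk.
    final String[] filenames;

    NameOnlyCollection(String... filenames) {
        this.filenames = filenames.clone();
    }

    // Serializes this object (that is, the list of names) to a byte array.
    byte[] serialize() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds a collection from its serialized form.
    static NameOnlyCollection deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (NameOnlyCollection) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Lists the names that no longer correspond to an existing file.
    List<String> missingFiles() {
        List<String> missing = new ArrayList<>();
        for (String name : filenames) {
            if (!Files.exists(Paths.get(name))) missing.add(name);
        }
        return missing;
    }

    public static void main(String[] args) {
        NameOnlyCollection copy = deserialize(new NameOnlyCollection(args).serialize());
        System.out.println("Files no longer present: " + copy.missingFiles());
    }
}
```

A check of this kind detects deleted files only; a file whose content changed after indexing would go unnoticed, which is why modifying indexed files is just as harmful.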
It is now time to index our collection. To do so, we simply pass the
collection to the main method of the class IndexBuilder, which scans all
documents in the collection and produces a number of indices, one for
each field of the collection. The number of fields depends on the
factory used to produce documents: in our case, we will get indices for
the title (the content of the HTML title element, if present; the
filename is used instead if the title element is absent) and the body
(the textual content of the entire HTML page). Additionally,
FileSetDocumentCollection sets the URI of each document to a URI
pointing to the absolute location of the file in the file system; the
document title is, once more, going to be the title appearing in the
HTML content.
java it.unimi.di.big.mg4j.tool.IndexBuilder \
    --keep-batches --downcase -S javadoc.collection javadoc
The class IndexBuilder has a large number of options, as it runs in
sequence the two phases of the indexing process. These phases are also
available separately, mainly for very large collections (hundreds of
millions of documents) for which the memory limits are rather tight.
Note that we did not specify a memory option such as -Xmx256M, as it is
not necessary (and might even be harmful) on newer Java virtual
machines, which allocate memory dynamically; if you run into memory
problems, please allow for more memory.
In this example, we have used the --downcase option, which forces all
terms to be downcased: this means that the index will collapse words
that differ only in letter case. For example, the terms String and
string will not be distinguished. More generally, you could specify a
different term processor for custom term modification (in this case, the
DowncaseTermProcessor class has been implicitly chosen). The -S option
specifies that we are producing an index for the specified collection
(javadoc.collection): if the option were omitted, IndexBuilder would
expect to index a document sequence read from standard input (more about
this below). The --keep-batches option is not normally used, but we
specify it here so that we can have a look at the temporary files
generated during the indexing process. The last, unflagged argument,
javadoc, is the only mandatory argument of IndexBuilder: it is the index
basename, that is, the basename from which the names of all index files
are derived.
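The effect of downcasing on the set of indexed terms can be sketched in a few lines of Java. The class DowncaseSketch is a hypothetical stand-in and does not reproduce the actual DowncaseTermProcessor; in particular, the use of String.toLowerCase with Locale.ROOT is an assumption made for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a term processor; not the MG4J implementation.
final class DowncaseSketch {
    // Maps a term to its lowercase form (assumed behaviour, locale-independent).
    static String process(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    // Returns the distinct processed terms, in order of first appearance.
    static Set<String> indexedTerms(List<String> rawTerms) {
        Set<String> terms = new LinkedHashSet<>();
        for (String raw : rawTerms) terms.add(process(raw));
        return terms;
    }

    public static void main(String[] args) {
        // prints [string, object]
        System.out.println(indexedTerms(Arrays.asList("String", "string", "STRING", "Object")));
    }
}
```

Four raw terms collapse to two index entries, so a query for string also finds occurrences of String. The price is that case-sensitive searches are no longer possible on that index.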
Since our collection has documents containing two fields, named title
and text, there will be two sets of index files: each will be named, by
convention, with the index basename followed by the field name
(separated with a dash). Hence, there will be index files named
javadoc-title.something and files named javadoc-text.something.
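The naming convention just described is simple enough to state as code. The helper below is a hypothetical illustration, not an MG4J method: it derives the per-field basenames from the index basename and the field names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the naming convention; not part of MG4J.
final class IndexNames {
    // Joins the index basename and each field name with a dash.
    static List<String> perFieldBasenames(String basename, List<String> fields) {
        List<String> names = new ArrayList<>();
        for (String field : fields) names.add(basename + "-" + field);
        return names;
    }

    public static void main(String[] args) {
        // prints [javadoc-title, javadoc-text]
        System.out.println(perFieldBasenames("javadoc", Arrays.asList("title", "text")));
    }
}
```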
We have now built indices, and we are ready to query them using a web
server. This is very easy in MG4J: we just run the main method of the
Query class, specifying the -h option and passing as arguments the
indices and (for showing snippets) the collection:
java it.unimi.di.big.mg4j.query.Query -h -i FileSystemItem \
    -c javadoc.collection javadoc-text javadoc-title
We can now either use the command line (if you have rlwrap installed,
you can put it to good use), or open the search page by pointing our
browser to http://localhost:4242/Query and start querying the
collection. Note the -i option, which specifies what each result item
links to: the specified class links to a file in the file system through
a local HTTP server (the observation about class names made for
factories applies here, too).
Note that the names we specified for the indices (e.g., javadoc-text)
are actually URIs, so you can add options much like in a web query. For
instance, javadoc-text?inMemory=1 would load the index into main memory,
whereas javadoc-text?mapped=1 would try to use low-level memory-mapping
features of the operating system to cache the most frequently used part
of the index in main memory.
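To make the structure of such an index URI concrete, here is a minimal Java sketch that splits it into a basename and a single key=value option, as in the two examples above. The class IndexUriSketch is hypothetical and is not how MG4J parses index URIs; how several options are combined is not covered here, so the sketch deliberately handles one option only.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for illustration; MG4J performs its own URI handling.
final class IndexUriSketch {
    final String basename;
    final Map<String, String> options;

    private IndexUriSketch(String basename, Map<String, String> options) {
        this.basename = basename;
        this.options = options;
    }

    // Splits "basename?key=value" at the first '?' and the first '=' after it.
    static IndexUriSketch parse(String uri) {
        int question = uri.indexOf('?');
        if (question < 0) return new IndexUriSketch(uri, Collections.emptyMap());
        String query = uri.substring(question + 1);
        int equals = query.indexOf('=');
        if (equals < 0) throw new IllegalArgumentException("Expected key=value after '?': " + uri);
        Map<String, String> options = new LinkedHashMap<>();
        options.put(query.substring(0, equals), query.substring(equals + 1));
        return new IndexUriSketch(uri.substring(0, question), options);
    }

    public static void main(String[] args) {
        IndexUriSketch parsed = parse("javadoc-text?inMemory=1");
        // prints javadoc-text {inMemory=1}
        System.out.println(parsed.basename + " " + parsed.options);
    }
}
```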