Virtual fields in MG4J

As we explained, documents usually originate from some stream in the form of byte sequences; every such sequence representing a document is then interpreted by some document factory that maps the byte sequence into a set of fields. For example, the it.unimi.di.big.mg4j.document.HtmlDocumentFactory translates a sequence of bytes into a set of fields, such as the title of the HTML document and its body. The factory deals with all the problems of translating bytes into characters, of establishing which parts of the document should be retained (e.g., in the case of HTML, discarding tags), of determining word boundaries, and so on.

There are cases, though, in which the content of a document actually refers to another document in the collection: for example, an HTML document may contain anchors, which are pieces of text that link to (and, at least conceptually, refer to) another document, specified via a URI.

As an example, consider the following document, with URI http://foo.bar/one.html:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <title>This is document one</title>
  <body>
    <p>Here you can find a <a href="http://foo.bar/two.html">document
       containing a lot of information about Mongolia</a>.
  </body>
</html>

The piece of text that reads:

document containing a lot of information about Mongolia

is actually an anchor that refers to another document (with URI http://foo.bar/two.html), and this fact should somehow be made explicit when indexing the collection. For example, in some sense, the word Mongolia should be taken as appearing in the document http://foo.bar/two.html, even if it is never mentioned in the text of that page.

MG4J deals with this situation through the special notion of a virtual field. Understanding how virtual fields actually work requires some patience, and some knowledge of the internal organization of document collections and factories; the reader may want to skip this section, reserving it for later.

Virtual fields and virtual fragments

As we briefly said, every document factory is responsible for turning raw byte sequences into documents. In particular, every factory transforms a sequence of bytes into a number of fields. Every field has a name and a type: for example, a document factory for mail documents might produce fields such as subject, from, to, body, date, and so on. The type of a field determines which values the field may contain. The two most important types of fields (and currently the only ones that MG4J is able to index) are textual fields and virtual fields.

A textual field, as the reader may guess, is just a piece of text that is treated as a sequence of words: the words of a textual field are the atoms of MG4J's indexing system for textual fields. How words are actually singled out from the stream of characters is a subtle problem, dealt with by what MG4J jargon calls a word reader; we reserve a more comprehensive explanation of how this works for later.

Let us consider, instead, virtual fields. To make our explanations more concrete, let us consider the HtmlDocumentFactory: as we said above, this factory produces fields out of an HTML document. More precisely, the factory produces three fields: two of them (text and title) are textual, and one (anchor) is virtual.

A virtual field produces pieces of text that are to be attributed to other documents, possibly belonging to the collection. To establish a precise terminology, let us call referrer the document that we are considering, and referee the document to which a certain piece of the referrer refers. Now, the referrer produces in a virtual field a number of fragments of text, each referring to a certain referee. Hence, the content of a virtual field is conceptually a list of pairs, each made of a piece of text (called a virtual fragment) and a string that represents the referee (called the document spec, because it should somehow specify which document we are referring to).

In the case of the HtmlDocumentFactory, the anchor field is the list of all anchors contained in the document; the document spec is a URL (as specified in the href attribute), whereas the virtual fragment is the content of the anchor element. To be more precise, the actual implementation of the factory in MG4J considers not only the content of the anchor, but also some surrounding text, called the anchor context. This is only incidental, though: the important point is that a certain piece of text is associated with the document spec.
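To make the idea of a list of pairs more tangible, here is a minimal sketch that pulls (document spec, virtual fragment) pairs out of a piece of HTML. It is only an illustration: the class AnchorFieldSketch, its Fragment type and the regular expression are hypothetical names invented for this example, they are not part of MG4J, and the sketch ignores the anchor context mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; this is NOT how MG4J's factory is implemented. */
final class AnchorFieldSketch {

    /** One entry of a virtual field: a document spec plus the text referring to it. */
    static final class Fragment {
        final String documentSpec; // here, the URL found in the href attribute
        final String text;         // here, the content of the anchor element

        Fragment(String documentSpec, String text) {
            this.documentSpec = documentSpec;
            this.text = text;
        }
    }

    // Simplifying assumption: href is the first attribute and is double-quoted.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the anchors of an HTML string as a list of (spec, fragment) pairs. */
    static List<Fragment> anchors(String html) {
        List<Fragment> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            // Collapse line breaks and runs of spaces inside the anchor text.
            String text = m.group(2).replaceAll("\\s+", " ").trim();
            result.add(new Fragment(m.group(1), text));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>Here you can find a <a href=\"http://foo.bar/two.html\">document\n"
                + "   containing a lot of information about Mongolia</a>.";
        for (Fragment f : anchors(html)) {
            System.out.println(f.documentSpec + " <- " + f.text);
        }
    }
}
```

Applied to the example document above, the sketch yields a single pair whose document spec is http://foo.bar/two.html and whose fragment is the anchor text.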

Note that, as far as document factories are concerned, there is no fixed way to map document specs into actual references to documents in the collection. In MG4J, this mapping is handled by the notion of document resolver.

Document resolvers

A document resolver is an object that is able to map the document specs produced by some document factory into actual references to documents in the collection: more precisely, given a document spec, the resolver decides whether the spec really refers to a document in the collection and, if so, determines which document the spec refers to.
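Conceptually, then, a resolver is little more than a lookup from document specs to document numbers. The following minimal sketch shows the idea for URL specs. The class UrlResolverSketch and its method resolve are hypothetical names chosen for this illustration, not MG4J's API; the sketch uses a plain HashMap in place of whatever data structure a real resolver relies on, and it assumes that the i-th URL of the list belongs to the i-th document of the collection and that -1 signals "not in the collection".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical resolver sketch; names and signatures are NOT MG4J's API. */
final class UrlResolverSketch {
    private final Map<String, Integer> indexOfUrl = new HashMap<>();

    /** Assumption: urls.get(i) is the URL of the i-th document of the collection. */
    UrlResolverSketch(List<String> urls) {
        for (int i = 0; i < urls.size(); i++) {
            indexOfUrl.put(urls.get(i), i);
        }
    }

    /** Returns the number of the document the spec refers to, or -1 if there is none. */
    int resolve(String documentSpec) {
        Integer index = indexOfUrl.get(documentSpec);
        return index == null ? -1 : index;
    }

    public static void main(String[] args) {
        UrlResolverSketch resolver = new UrlResolverSketch(
                Arrays.asList("http://foo.bar/one.html", "http://foo.bar/two.html"));
        // A spec that belongs to the collection resolves to its document number.
        System.out.println(resolver.resolve("http://foo.bar/two.html"));
        // A spec pointing outside the collection cannot be resolved.
        System.out.println(resolver.resolve("http://elsewhere.example/"));
    }
}
```

A fragment whose spec cannot be resolved has no referee in the collection, so in this model it would simply contribute nothing.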

You don't need to deal with document resolvers until you try to index virtual fields. MG4J indexes virtual fields only on demand, which is why we could ignore the problem in the example of the previous section. Indeed, when we issued the command:

java -Xmx256M it.unimi.di.big.mg4j.tool.IndexBuilder \
    --downcase -S javadoc.collection javadoc

we asked MG4J to index only the textual fields of the collection (whose documents were, as you remember, HTML documents). This means that only titles and texts were indexed, but no anchors (some of you may have noticed that MG4J emitted a brief warning about this fact, logging that "Virtual field anchor is not being indexed; use -a or explicitly add field among the indexed ones").

Now, if you also want to index anchors, you can ask for the field explicitly, or you can use the -a option:

java -Xmx256M it.unimi.di.big.mg4j.tool.IndexBuilder \
    -a --downcase -S javadoc.collection javadoc

If you try to do so, you will get an exception saying that No resolver was associated with virtual field anchor. To get past it, we need to build a document resolver that is able to translate the document specs produced for the field anchor by the HtmlDocumentFactory into references to documents of the collection. Note that each kind of document spec needs its own kind of document resolver, and you need to know which document resolver fits the needs of a given virtual field.

In the case of anchors, the job is done by the URLMPHVirtualDocumentResolver class, which turns URLs into document pointers (i.e., references to documents). To build a URL document resolver, you first need to find the URLs of the documents within your collection; you can list them as follows:

java -Xmx256M it.unimi.di.big.mg4j.tool.ScanMetadata \
    -S javadoc.collection -u javadoc.urls

This command scans the whole collection and produces a (text) file called javadoc.urls that contains the URLs of the documents, in the order in which the documents appear in the collection (of course, the collection URIs must actually be URLs). Note that in the case of our collection, URLs will actually be just file names.

By the way, you can use ScanMetadata also to extract other information (e.g., the document titles) from your collection.

Now that you have a list of URLs, one per document, you can build the document resolver you need by calling:

java -Xmx256M it.unimi.di.big.mg4j.tool.URLMPHVirtualDocumentResolver \
    -o javadoc.urls javadoc-anchor.resolver 

This command produces the resolver you need to index your anchor fields. Now, you can try again to index the whole collection, running:

java -Xmx512M it.unimi.di.big.mg4j.tool.IndexBuilder \
    -a -v anchor:javadoc-anchor.resolver --downcase \
    -S javadoc.collection javadoc

What is a document resolver actually doing: virtual texts and gaps

To understand what we just did, it is useful to think of all the virtual fragments that refer to a given document of the collection as producing a single text, called the virtual text. So, for example, the text of all anchors referring to file:/usr/share/javadoc/java/java/lang/String.html should be concatenated and thought of as a single virtual text that will be indexed as a part of file:/usr/share/javadoc/java/java/lang/String.html.
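The grouping just described can be sketched in a few lines. This is a conceptual model only, not MG4J's indexing code: the class VirtualTextSketch and the representation of each pair as a two-element array {referee, fragment} are assumptions made for the example, and the referee names used in it are invented.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Conceptual model of virtual texts; NOT MG4J's indexing code. */
final class VirtualTextSketch {

    /**
     * Each element of pairs is {referee, fragment}. Fragments referring to
     * the same referee are concatenated in order of appearance.
     */
    static Map<String, String> virtualTexts(List<String[]> pairs) {
        Map<String, String> texts = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            // First fragment for a referee is stored as is; later ones are appended.
            texts.merge(pair[0], pair[1], (soFar, added) -> soFar + " " + added);
        }
        return texts;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"String.html", "immutable string"},
                new String[] {"Buffer.html", "byte buffer"},
                new String[] {"String.html", "character sequence"});
        for (Map.Entry<String, String> e : virtualTexts(pairs).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

In this model the two fragments that refer to String.html end up in one virtual text, even though they may come from different referrers.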

Indeed, if you start the query engine again:

java it.unimi.di.big.mg4j.query.Query -h -i FileSystemItem \
     -c javadoc.collection javadoc-text \
     javadoc-title javadoc-anchor

you will be able to input queries such as text:implementation AND anchor:buffer, which are matched by all documents that contain the word implementation in their text and the word buffer in (some of) the anchors that refer to them.

Some caution should be exercised here. When indexing, the virtual text is actually (somehow) built by concatenating the anchor text. This means that virtual fragments coming from different anchors are actually concatenated, which might produce false positives. For example, queries like anchor:(buffer AND long) are matched by documents whose anchors contain both the word buffer and the word long, but not necessarily in the same anchor.

To avoid such kinds of false positives, you can play with virtual gaps: the virtual gap is a positive integer, and it is the virtual space left between different virtual fragments. For example, if the virtual gap is 64 (the default), anchors are concatenated by leaving 64 "empty words" between subsequent fragments.

Hence, for example, if you input a query like anchor:(buffer AND long)~64 you will be sure that only documents that contain both words in the same anchor will be found. Of course, this time you might have false negatives, if some anchor is longer than 64 words. If you want, while indexing you can specify a different virtual gap; for example:

java -Xmx512M it.unimi.di.big.mg4j.tool.IndexBuilder \
    -a -g anchor:100 -v anchor:javadoc-anchor.resolver \
    --downcase -S javadoc.collection javadoc

runs exactly as before, but leaves a virtual gap of 100 words between successive fragments.
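The interplay between the virtual gap and the window of a proximity query can be checked with a small model. The sketch below is not MG4J code, and the class VirtualGapSketch and its methods are hypothetical. It assumes that words are numbered consecutively inside a fragment, that exactly gap positions are skipped between fragments, and that a window of size w accepts two occurrences whose positions differ by less than w; the real query semantics may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Conceptual model of virtual gaps; NOT MG4J code. */
final class VirtualGapSketch {

    /** Positions of word in the virtual text obtained by laying out the fragments. */
    static List<Integer> positions(List<String> fragments, int gap, String word) {
        List<Integer> result = new ArrayList<>();
        int next = 0; // position assigned to the next word
        for (String fragment : fragments) {
            for (String w : fragment.trim().split("\\s+")) {
                if (w.isEmpty()) continue; // an empty fragment contributes no words
                if (w.equals(word)) result.add(next);
                next++;
            }
            next += gap; // leave gap "empty words" before the following fragment
        }
        return result;
    }

    /** True if some occurrences of a and b lie less than window positions apart. */
    static boolean near(List<String> fragments, int gap, String a, String b, int window) {
        for (int p : positions(fragments, gap, a)) {
            for (int q : positions(fragments, gap, b)) {
                if (Math.abs(p - q) < window) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> twoAnchors = Arrays.asList("the buffer class", "a long value");
        List<String> oneAnchor = Arrays.asList("a long buffer");
        System.out.println(near(twoAnchors, 64, "buffer", "long", 64)); // prints false
        System.out.println(near(oneAnchor, 64, "buffer", "long", 64));  // prints true
        System.out.println(near(twoAnchors, 0, "buffer", "long", 64));  // prints true
    }
}
```

The last line of main shows the false positive that appears in this model when no gap is left: with a gap of 0 the two words are only three positions apart, although they come from different anchors.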