As we explained, documents usually originate from some stream in the form of byte sequences; every such sequence representing a document is then interpreted by some document factory that actually maps the byte sequence into a set of fields. For example, the it.unimi.di.big.mg4j.document.HtmlDocumentFactory translates a sequence of bytes into a set of fields, such as the title of the HTML document and its body. The factory deals with all the problems of translating bytes into characters, of establishing which parts of the document should be retained (e.g., in the case of HTML, discarding tags), of determining word boundaries, etc.
There are cases, though, when the content of a document actually refers to another document in the collection: for example, it is well known that an HTML document may contain anchors, which are pieces of text that link to (and, at least conceptually, refer to) another document, specified via a URI.
As an example, consider the following document, with URI http://foo.bar/one.html:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<title>This is document one</title>
<body>
<p>Here you can find a <a href="http://foo.bar/two.html">document
containing a lot of information about Mongolia</a>.
</body>
</html>
The piece of text that reads:
document containing a lot of information about Mongolia
is actually an anchor that refers to another document (with URI http://foo.bar/two.html), and this fact should somehow be made explicit when indexing the collection. For example, in some sense, the word Mongolia should be taken as appearing in the document http://foo.bar/two.html, even if it may not even be mentioned in the text of that page.
MG4J deals with this situation through the special notion of a virtual field. Understanding how virtual fields actually work requires some patience, and some knowledge of the internal organization of document collections and factories; the reader may want to skip this section, reserving it for later.
As we briefly said, every document factory is responsible for turning raw byte sequences into documents. In particular, every factory transforms a sequence of bytes into a number of fields. Every field has a name and a type: for example, a document factory for mail documents might produce fields such as subject, from, to, body, date, etc. The type of a field determines which values the field might contain. The two most important types of fields (and currently the only ones that MG4J is able to index) are textual fields and virtual fields.
A textual field, as the reader may guess, is just a piece of text that is recognized as being composed of words: the words of a textual field are the atoms of the MG4J indexing system for textual fields. How words are actually singled out from the stream of characters is a subtle problem that is dealt with by what MG4J jargon calls a word reader, but we reserve a more comprehensive explanation of how this actually works for later.
Let us consider, instead, virtual fields. To make our explanation more concrete, let us consider the HtmlDocumentFactory: as we said above, this factory produces fields out of an HTML document. Actually, the factory has three fields: two of them (text and title) are textual, and one (anchor) is virtual.
A virtual field produces pieces of text that refer to other documents, possibly belonging to the collection. To establish a precise terminology, let us call referrer the document that we are considering, and referee the document to which a certain piece of the referrer refers. Now, the referrer produces in a virtual field a number of fragments of text, each referring to a certain referee. Hence, the content of a virtual field is conceptually a list of pairs, each made of a piece of text (called the virtual fragment) and a string that represents the referee (called the document spec, because it should somehow specify which document we are referring to).
In the case of the HtmlDocumentFactory, the anchor field is the list of all anchors contained in the document; the document spec is a URL (as specified in the href attribute) whereas the virtual fragment is the content of the anchor element. To be more precise, the actual implementation of the factory in MG4J considers not only the content of the anchor, but also some surrounding text, called the anchor context. This is only incidental, though: the important point is that a certain piece of text is associated with the document spec.
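The list-of-pairs view of a virtual field can be sketched in a few lines of Java. This is only an illustration, not MG4J code: the class AnchorFragments, its nested Fragment type and the extract method are hypothetical names, the anchor context is ignored, and a naive regular expression stands in for the real HTML parsing done by the factory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: these types are hypothetical and are not part of MG4J. */
final class AnchorFragments {

    /** One entry of a virtual field: a document spec plus the text that refers to it. */
    static final class Fragment {
        final String documentSpec; // here, the value of the href attribute
        final String text;         // here, the content of the anchor element

        Fragment(String documentSpec, String text) {
            this.documentSpec = documentSpec;
            this.text = text;
        }
    }

    // Deliberately naive: it only recognizes anchors whose first attribute is href.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\s+href=\"([^\"]*)\"[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the (document spec, virtual fragment) pairs found in the referrer. */
    static List<Fragment> extract(String html) {
        List<Fragment> fragments = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            // Collapse runs of whitespace (including line breaks) into single spaces.
            String text = m.group(2).replaceAll("\\s+", " ").trim();
            fragments.add(new Fragment(m.group(1), text));
        }
        return fragments;
    }

    public static void main(String[] args) {
        String referrer = "<html><title>This is document one</title><body>"
            + "<p>Here you can find a <a href=\"http://foo.bar/two.html\">document\n"
            + "containing a lot of information about Mongolia</a>.</body></html>";
        for (Fragment f : extract(referrer)) {
            System.out.println(f.documentSpec + " <- " + f.text);
        }
    }
}
```

Run on the example document, the sketch yields a single pair whose document spec is http://foo.bar/two.html and whose virtual fragment is the anchor text.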
Note that, as far as document factories are concerned, there is no fixed way to map document specs into actual references to documents in the collection. In MG4J, this mapping is handled by the notion of a document resolver.
A document resolver is an object that is able to map the document specs produced by some document factory into actual references to documents in the collection: more precisely, given a document spec, the resolver decides whether the spec really refers to a document in the collection and, if so, finds out which document it refers to.
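To fix ideas, here is a minimal sketch of what a resolver for URL specs has to do. It is not the MG4J implementation: the class ToyUrlResolver, the resolve method and the UNRESOLVED constant are hypothetical names chosen for this example, document pointers are assumed to be positions in the collection starting from zero, and a plain hash map is used as the lookup structure.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a hypothetical class, not MG4J's resolver implementation. */
final class ToyUrlResolver {
    /** Returned when a spec does not refer to any document of the collection. */
    static final long UNRESOLVED = -1;

    private final Map<String, Long> pointerOfUrl = new HashMap<>();

    /** The i-th URL of the list is assumed to be the URL of the i-th document. */
    ToyUrlResolver(List<String> urlsInCollectionOrder) {
        for (int i = 0; i < urlsInCollectionOrder.size(); i++) {
            pointerOfUrl.put(urlsInCollectionOrder.get(i), (long) i);
        }
    }

    /** Maps a document spec to a document pointer, or to UNRESOLVED. */
    long resolve(String documentSpec) {
        Long pointer = pointerOfUrl.get(documentSpec);
        return pointer == null ? UNRESOLVED : pointer;
    }

    public static void main(String[] args) {
        ToyUrlResolver resolver = new ToyUrlResolver(Arrays.asList(
            "http://foo.bar/one.html", "http://foo.bar/two.html"));
        System.out.println(resolver.resolve("http://foo.bar/two.html"));
        System.out.println(resolver.resolve("http://elsewhere.example/"));
    }
}
```

The sketch makes the two outcomes explicit: a spec either resolves to a pointer inside the collection, or it is reported as not belonging to the collection.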
You don't need to deal with document resolvers until you try to index virtual fields. MG4J indexes virtual fields only on demand, which is why we ignored the problem in the example of the previous section. Indeed, when we issued the command:
java -Xmx256M it.unimi.di.big.mg4j.tool.IndexBuilder \
    --downcase -S javadoc.collection javadoc
we asked MG4J to index only the textual fields of the collection (whose documents were, as you remember, HTML documents). This means that only titles and texts were indexed, but no anchors (some of you may have noticed that MG4J emitted a brief warning about this fact, logging that "Virtual field anchor is not being indexed; use -a or explicitly add field among the indexed ones").
Now, if you also want to index anchors, you can either ask for it explicitly or use the -a option:
java -Xmx256M it.unimi.di.big.mg4j.tool.IndexBuilder \
    -a --downcase -S javadoc.collection javadoc
If you try to do so, you will get an exception saying that "No resolver was associated with virtual field anchor". To get past this exception we need to build a document resolver that is able to translate the document specs produced for the field anchor by the HtmlDocumentFactory into references to documents of the collection. Note that every kind of document spec needs a different kind of document resolver, and you need to know which document resolver fits the needs of a certain virtual field.
In the case of anchors, the job is done by the URLMPHVirtualDocumentResolver class, which turns URLs into document pointers (i.e., references to documents). To build a URL document resolver, you first need to find the URLs of the documents within your collection; you can list them as follows:
java -Xmx256M it.unimi.di.big.mg4j.tool.ScanMetadata \
    -S javadoc.collection -u javadoc.urls
This command scans the whole collection and produces a (text) file called javadoc.urls that contains the URLs of the documents, in collection order (of course, the collection URIs must actually be URLs). Note that in the case of our collection, URLs will actually be just file names.
By the way, you can also use ScanMetadata to extract other information (e.g., the document titles) from your collection.
Now that you have a list of URLs, one per document, you can build the document resolver you need by calling:
java -Xmx256M it.unimi.di.big.mg4j.tool.URLMPHVirtualDocumentResolver \
    -o javadoc.urls javadoc-anchor.resolver
This command produces the resolver you need to index your anchor fields. Now, you can try again to index the whole collection, running:
java -Xmx512M it.unimi.di.big.mg4j.tool.IndexBuilder \
    -a -v anchor:javadoc-anchor.resolver --downcase \
    -S javadoc.collection javadoc
To understand what we just did, it is useful to think that, conceptually, all the virtual fragments that refer to a given document of the collection produce a single text, called the virtual text. So, for example, the text of all anchors referring to file:/usr/share/javadoc/java/java/lang/String.html should be concatenated and thought of as a single virtual text that will be indexed as a part of file:/usr/share/javadoc/java/java/lang/String.html.
Indeed, if you start the query engine again
java it.unimi.di.big.mg4j.query.Query -h -i FileSystemItem \
    -c javadoc.collection javadoc-text \
    javadoc-title javadoc-anchor
you will be able to input queries such as text:implementation AND anchor:buffer, which are matched by all documents that contain the word implementation in their text and the word buffer in (some of) the anchors that refer to them.
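The way virtual fragments are gathered into one virtual text per referee can be sketched as follows. This is a simplified model, not what MG4J does internally: the class VirtualTextSketch and the method virtualTexts are hypothetical, fragments are assumed to be already resolved to document pointers (with a negative pointer meaning that the referee is outside the collection), and fragments are joined with a single space.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: a hypothetical class modelling the notion of virtual text. */
final class VirtualTextSketch {

    /**
     * Concatenates, for each referee, the fragments that refer to it.
     * referees[i] is the resolved pointer of the i-th fragment (negative if unresolved),
     * texts[i] is its text.
     */
    static Map<Long, String> virtualTexts(long[] referees, String[] texts) {
        Map<Long, StringBuilder> accumulators = new TreeMap<>();
        for (int i = 0; i < referees.length; i++) {
            if (referees[i] < 0) continue; // the spec refers outside the collection
            StringBuilder sb = accumulators.computeIfAbsent(referees[i], k -> new StringBuilder());
            if (sb.length() > 0) sb.append(' ');
            sb.append(texts[i]);
        }
        Map<Long, String> result = new TreeMap<>();
        for (Map.Entry<Long, StringBuilder> e : accumulators.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        long[] referees = { 1, -1, 1, 0 };
        String[] texts = { "byte buffer", "external page", "long value", "root object" };
        // Document 1 receives the text of both anchors that point to it.
        System.out.println(virtualTexts(referees, texts));
    }
}
```

Note that in this model the words of the two anchors pointing to document 1 end up side by side in the same text.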
Some caution should be exercised here. When indexing, the virtual text is actually built by concatenating the anchor text: virtual fragments coming from different anchors end up in the same text. This might produce false positives. For example, queries like anchor:(buffer AND long) are matched by documents whose anchors contain both the word buffer and the word long, but not necessarily in the same anchor.
To avoid such kinds of false positives, you can play with virtual gaps: the virtual gap is a positive integer, and it is the virtual space left between different virtual fragments. For example, if the virtual gap is 64 (the default), anchors are concatenated by leaving 64 "empty words" between subsequent fragments.
Hence, for example, if you input a query like anchor:(buffer AND long)~64 you will be sure that only documents that contain both words in the same anchor will be found. Of course, this time you might have false negatives, if some anchor is longer than 64 words. If you want, while indexing you can specify a different virtual gap; for example:
java -Xmx512M it.unimi.di.big.mg4j.tool.IndexBuilder \
    -a -g anchor:100 -v anchor:javadoc-anchor.resolver \
    --downcase -S javadoc.collection javadoc
runs exactly as before, but leaving a virtual gap of 100 words between successive fragments.
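The effect of the virtual gap on word positions can be checked with a small model. The sketch below is an assumption-laden illustration, not MG4J code: the class VirtualGapSketch and its methods are hypothetical, positions are assumed to start at zero, fragments are assumed to be non-empty, and a query ending in ~64 is assumed to require that the matching words fall within 64 consecutive positions.

```java
/** Illustrative only: a hypothetical model of how a virtual gap separates fragments. */
final class VirtualGapSketch {

    /** Assigns a position to every word, leaving gap empty positions between fragments. */
    static int[][] positions(String[] fragments, int gap) {
        int[][] result = new int[fragments.length][];
        int next = 0; // position of the next word to be emitted
        for (int f = 0; f < fragments.length; f++) {
            String[] words = fragments[f].trim().split("\\s+");
            result[f] = new int[words.length];
            for (int w = 0; w < words.length; w++) result[f][w] = next++;
            next += gap; // skip the "empty words" before the following fragment
        }
        return result;
    }

    /** Number of consecutive positions needed to cover both a and b. */
    static int window(int a, int b) {
        return Math.abs(a - b) + 1;
    }

    public static void main(String[] args) {
        int[][] p = positions(new String[] { "byte buffer", "long value" }, 64);
        int buffer = p[0][1]; // last word of the first anchor
        int firstOfSecond = p[1][0]; // the word "long", first word of the second anchor
        System.out.println("buffer at " + buffer + ", long at " + firstOfSecond
            + ", window " + window(buffer, firstOfSecond));
    }
}
```

In this model, even the two closest words of adjacent fragments need a window wider than the gap, which is why a proximity restriction equal to the gap cannot be satisfied across two different anchors.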