Package it.unimi.di.big.mg4j.document
Warning: We are still working on the document infrastructure. It should be pretty stable, but
changes should not be unexpected. Suggestions are welcome. Note also that most of the classes in this
package should be considered examples and suggestions: while a casual user will find them
invaluable in indexing data, a custom, large-scale application will usually require writing your own DocumentCollection.
Basic interfaces
The Document interface
MG4J aims at indexing the content of entities called documents. The main classes that describe documents and sets of documents are included in the it.unimi.di.big.mg4j.document package. In particular, documents are instances of the Document interface. A document is characterized abstractly by the following data:
- a title, a character sequence that represents the document; the document title is returned by the Document.title() method;
- a URI, that somehow characterizes the document uniquely; the document URI is returned by the Document.uri() method;
- a number of fields; every field is abstractly represented by a number, the field index; fields are numbered from 0, but the user should know how many fields a document exhibits, because this information is not available in the document itself. For every field, the document exhibits the following data:
  - the field content, that is an Object returned by the Document.content(int) method; the type of object that this method returns must be known by the calling class in advance; in particular, for textual fields (see below), the content will be a Reader;
  - for textual fields only, a word reader: an object that is able to split the field content (in this case: a sequence of characters) into a sequence of words; the word reader is returned by the Document.wordReader(int) method (which must be called only for textual fields).
Users should always close a document after use by calling the Document.close() method, which is responsible for releasing all the resources that the document acquired during its lifetime.
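The access pattern described above (know the field type in advance, read the content, always close) can be sketched as follows. This is a minimal illustration that uses a hypothetical, simplified stand-in interface called SketchDocument rather than the real Document type; whitespace splitting stands in for a word reader, and all names are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for the document abstraction (not the MG4J types). */
final class DocumentUsageSketch {

    interface SketchDocument {
        CharSequence title();
        CharSequence uri();
        Object content(int field) throws IOException;
        void close() throws IOException;
    }

    /** Builds a toy document whose only field (index 0) is textual. */
    static SketchDocument textDocument(String title, String uri, String text) {
        return new SketchDocument() {
            public CharSequence title() { return title; }
            public CharSequence uri() { return uri; }
            public Object content(int field) {
                if (field != 0) throw new IndexOutOfBoundsException("No field " + field);
                return new StringReader(text);
            }
            public void close() { /* nothing to release in this toy document */ }
        };
    }

    /** Reads field 0, which the caller must know in advance to be textual. */
    static List<String> wordsOfFirstField(SketchDocument document) throws IOException {
        List<String> words = new ArrayList<>();
        try {
            Reader reader = (Reader) document.content(0); // the cast relies on prior knowledge
            StringBuilder text = new StringBuilder();
            for (int c; (c = reader.read()) != -1; ) text.append((char) c);
            // Whitespace splitting stands in for a real word reader.
            for (String word : text.toString().trim().split("\\s+"))
                if (!word.isEmpty()) words.add(word);
        } finally {
            document.close(); // always release the document's resources
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        SketchDocument d = textDocument("Greeting", "urn:example:1", "hello big world");
        System.out.println(d.title() + " -> " + wordsOfFirstField(d));
    }
}
```

The try/finally block is the important part: the document is closed even if reading the field fails.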
The DocumentFactory interface
Documents usually do not come alone, but they are grouped into collections: documents within a collection are of the same type, and this fact explains why the document structure (number and type of fields) is not contained in the document itself. Indeed, documents are produced by document factories.
A document factory is an instance of the DocumentFactory interface, that in particular is able to produce a document. All documents produced by the same factory are of the same kind, and exhibit the same number and type of fields. A factory gives information about the documents it produces through the following methods:
- DocumentFactory.numberOfFields(): returns the number of fields contained in each document produced by this factory; recall that fields are indexed starting from 0;
- DocumentFactory.fieldName(int): returns a mnemonic explanatory name for the given field;
- DocumentFactory.fieldIndex(String): returns the index of a field, given its mnemonic name;
- DocumentFactory.fieldType(int): returns the type of the given field; possible types are the constants of the DocumentFactory.FieldType enum; one of the possible types is DocumentFactory.FieldType.TEXT, used for textual fields; note that the type of the objects returned by the Document.content(int) method of the Document interface depends on the type of the field.
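As an illustration of how a caller might discover the structure of the documents a factory produces, the sketch below walks over the fields using methods named after the four listed above. TwoFieldFactory and its FieldType enum are hypothetical stand-ins written for this example, not MG4J classes; returning -1 for an unknown field name is a choice of the stand-in.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in mirroring the four introspection methods of a factory. */
final class FactoryFieldsSketch {

    /** Only the textual type is modelled here. */
    enum FieldType { TEXT }

    static final class TwoFieldFactory {
        private final List<String> names = Arrays.asList("text", "title");

        int numberOfFields() { return names.size(); }

        String fieldName(int field) { return names.get(field); }

        /** Returns -1 for an unknown name (a choice made by this stand-in). */
        int fieldIndex(String name) { return names.indexOf(name); }

        FieldType fieldType(int field) {
            names.get(field); // range check only: every field is textual here
            return FieldType.TEXT;
        }
    }

    /** Lists index, name and type of every field, one per line. */
    static String describe(TwoFieldFactory factory) {
        StringBuilder description = new StringBuilder();
        for (int i = 0; i < factory.numberOfFields(); i++)
            description.append(i).append(':').append(factory.fieldName(i))
                       .append(':').append(factory.fieldType(i)).append('\n');
        return description.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe(new TwoFieldFactory()));
    }
}
```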
The above-mentioned methods provide information about documents produced by the factory. The actual documents are produced by the getDocument(rawContent,metadata) method.
This method returns a new document from the factory. The rawContent parameter is the most important one: it is a stream of bytes that the factory uses to produce the document. The factory knows how the sequence of bytes should be interpreted to produce a document of the desired kind.
Note that even though the interpretation of the sequence of bytes representing the raw document content is entirely left to implementors, you might often prefer to think of the input byte sequence as a list of consecutive self-delimiting byte subsequences, one for each field: in this case, the InputStream.reset() method of the InputStream class is used to divide the subsequences from one another.
The metadata parameter is a map providing some basic data about the document as derived by the collection. The map is a reference map with suitable Enum keys, and as such must be queried using the keys in PropertyBasedDocumentFactory.MetadataKeys, or other similar factory-specific keys. For instance, the key PropertyBasedDocumentFactory.MetadataKeys.TITLE gives a suggested title to the document (but the factory may ignore it, if it has a better way to determine a title for the document), whereas PropertyBasedDocumentFactory.MetadataKeys.URI specifies a suggested URI for the document (which, once more, may be ignored by the factory).
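The idea of a metadata map keyed by enum constants, whose entries are only suggestions, can be sketched with JDK classes alone. In the example below, Keys is a hypothetical enum standing in for factory-specific metadata keys, and java.util.IdentityHashMap plays the role of a reference map (it compares keys by reference); neither is the type actually used by MG4J.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Sketch of a reference-keyed metadata map whose values are mere suggestions. */
final class MetadataSketch {

    /** Hypothetical keys playing the role of factory-specific metadata keys. */
    enum Keys { TITLE, URI }

    /** A title found in the content wins over the suggested one. */
    static String chooseTitle(String titleFoundInContent, Map<Enum<?>, Object> metadata) {
        if (titleFoundInContent != null) return titleFoundInContent;
        Object suggested = metadata.get(Keys.TITLE);
        return suggested == null ? "" : suggested.toString();
    }

    public static void main(String[] args) {
        // IdentityHashMap compares keys by reference, which is safe for enum constants.
        Map<Enum<?>, Object> metadata = new IdentityHashMap<>();
        metadata.put(Keys.TITLE, "suggested.html");
        metadata.put(Keys.URI, "file:/tmp/suggested.html");
        System.out.println(chooseTitle(null, metadata));
        System.out.println(chooseTitle("Real title", metadata));
    }
}
```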
Usually, a factory is built using a list of properties that define default values for metadata such as charset encoding or MIME types. These properties can be passed in several ways, and usually the main method of a collection provides an option (typically, -p) that lets the user specify default metadata for the factory. The property resolution algorithm is explained in the documentation for PropertyBasedDocumentFactory.
The DocumentSequence interface
Up to this point, we have described documents and document factories in a very abstract manner, but we have said nothing about the way the byte sequences representing the raw data are produced.
Basically, a source of documents is a DocumentSequence. More precisely, instances of this interface represent sources that are able to generate documents. Typically, a document sequence is able to produce a stream of documents, one after another, through a special kind of iterator, called DocumentIterator, returned by the DocumentSequence.iterator() method.
A document iterator is not really a Java Iterator: it is simply a class that exposes a method DocumentIterator.nextDocument() that returns the next document, if any, or null when there are no more documents. Thus, document iterators can be lazy, which is preferable in several circumstances (e.g., documents coming from an input stream).
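The null-terminated iteration protocol lends itself to lazy sources. The following sketch uses a hypothetical SketchIterator interface (not the real DocumentIterator) in which each line of a reader is treated as one "document"; nothing is read until the next document is requested.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Sketch of a lazy, null-terminated iteration protocol. */
final class LazyIterationSketch {

    /** Hypothetical stand-in for a document iterator: null signals exhaustion. */
    interface SketchIterator {
        String nextDocument() throws IOException;
    }

    /** One "document" per line, read only on demand. */
    static SketchIterator overLines(BufferedReader reader) {
        return reader::readLine; // readLine() returns null at end of stream
    }

    /** The typical consumption loop: stop at the first null. */
    static int countDocuments(SketchIterator iterator) throws IOException {
        int count = 0;
        while (iterator.nextDocument() != null) count++;
        return count;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader("first\nsecond\nthird"));
        System.out.println(countDocuments(overLines(reader)));
    }
}
```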
The DocumentCollection interface
In some cases, the documents only appear as an uninterrupted stream and applications do not have direct access to single documents; in particular, it might be the case that documents just disappear after being enumerated (as happens when the document source is standard input). In all such situations, a DocumentSequence provides the only way to get the documents, because it only guarantees sequential access.
Nonetheless, there are other cases where documents can be easily accessed in a direct fashion, and can be read many times (for example, when documents are files in the file system). In such cases, DocumentCollection (an extension of DocumentSequence) can be used.
Apart from the methods of a document sequence, a document collection provides the following additional access methods to the documents:
- DocumentCollection.size(), that returns the collection size, i.e., the number of documents in the collection;
- DocumentCollection.document(long), that returns the document with given pointer; the document pointer is an integer representing uniquely a document within the collection: the i-th document produced by the collection's document iterator has pointer i−1 (so, document pointers range from 0, inclusive, to size(), exclusive);
- DocumentCollection.stream(long), that returns the InputStream of raw data that this collection would use to produce the document with given pointer.
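The relation between iteration order and document pointers can be made concrete with a small sketch. ListCollection below is a hypothetical in-memory stand-in (strings play the role of documents); it only illustrates that the i-th document in sequential order is the one with pointer i−1, and that valid pointers lie between 0 (inclusive) and size() (exclusive).

```java
import java.util.Arrays;
import java.util.List;

/** Sketch relating sequential order to pointer-based (random) access. */
final class PointerSketch {

    /** Hypothetical in-memory collection; strings stand in for documents. */
    static final class ListCollection {
        private final List<String> documents;

        ListCollection(String... documents) { this.documents = Arrays.asList(documents); }

        long size() { return documents.size(); }

        String document(long pointer) {
            if (pointer < 0 || pointer >= size())
                throw new IndexOutOfBoundsException("Pointer " + pointer);
            return documents.get((int) pointer);
        }
    }

    /** Checks that the i-th document in iteration order has pointer i - 1. */
    static boolean iterationMatchesPointers(ListCollection collection) {
        long position = 1; // 1-based position in iteration order
        for (String document : collection.documents) {
            if (!document.equals(collection.document(position - 1))) return false;
            position++;
        }
        return position - 1 == collection.size();
    }

    public static void main(String[] args) {
        ListCollection collection = new ListCollection("alpha", "beta", "gamma");
        System.out.println(iterationMatchesPointers(collection));
    }
}
```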
After a document collection has been created, for example, starting from a set of files in the file system, it can usually be saved (serialized) on a file: the extension used for the filename is, by default, DocumentCollection.DEFAULT_EXTENSION.
Implementors of this interface should always specify explicitly which assumptions on the existence of external data are made for the consistency of a collection to be preserved. For example, a collection produced from a set of files will remain consistent as long as no file is changed or deleted; if that happens, the collection usually becomes inconsistent, and in any case you can expect that the indices built from it will no longer match the content of the collection.
Note that a document collection has very weak requirements (and thus very weak obligations) on the concurrent creation of several objects (documents, iterators, etc.). Please read carefully the class description.
Relations between document sequences/collections and document factories
As we explained above, document sequences/collections extract raw data (byte sequences) from some source and use a specified document factory to turn such data into documents. Hence, there exists a tight connection between each document sequence and the document factory it uses.
Typically, the document factory is provided to the document sequence at construction time, and this fact provides a form of flexibility, because different sources (e.g., the file system and the standard input) may be coupled with the same document factory (e.g., a document factory parsing HTML documents into text), or, conversely, the same document source may be used to produce documents with different formats.
Users should always be careful, however. Often, document sequences make assumptions about the factory they use, which reduces the number of possible combinations the user may adopt. Implementors of the DocumentSequence interface should always clarify all the assumptions they make about the factories that can be used for the sequence.
Warning: a common mistake is forgetting to specify some property (typically, the character encoding) for the factory when creating a document sequence (or collection). This problem cannot be detected at construction time because, in principle, the document sequence could guess the property and pass it to the factory. The problem usually arises in the indexing phase, when the first document is retrieved. We realise that this is confusing and counterintuitive (the collection has already been created and serialized, so what's the problem?), but there is no way to check what the sequence will do with a particular document.
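To see why the character encoding matters so much, consider what happens when bytes written in one encoding are decoded with another. The sketch below uses only JDK classes and is unrelated to any specific MG4J factory: it encodes a string as UTF-8 and decodes the same bytes as ISO-8859-1, so every non-ASCII character turns into two unrelated characters.

```java
import java.nio.charset.StandardCharsets;

/** Shows the effect of decoding bytes with an encoding other than the one used to write them. */
final class WrongEncodingSketch {

    /** Encodes as UTF-8, then (wrongly) decodes the same bytes as ISO-8859-1. */
    static String decodeUtf8BytesAsLatin1(String original) {
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caff\u00e8"; // ends with U+00E8, LATIN SMALL LETTER E WITH GRAVE
        String garbled = decodeUtf8BytesAsLatin1(original);
        // U+00E8 occupies two bytes in UTF-8, so the garbled string is one character longer.
        System.out.println(original.length() + " characters became " + garbled.length());
    }
}
```

An index built from text decoded this way would contain the garbled terms, which is why the encoding property should be set explicitly when the sequence or collection is created.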
Document factories
Recall that a document factory is an object that is capable of producing homogeneous documents (documents with the same number/type of fields). Every document is produced starting from a raw bytestream.
The IdentityDocumentFactory
The simplest possible document factory is the IdentityDocumentFactory: this factory produces documents with a single textual field, called text, that is actually obtained by transforming the byte sequence into a sequence of characters, using some default encoding. Actually, a document factory must also provide a way to break the text into words. With this aim, the identity factory may be provided with a Locale that is used to determine how words are best broken in the given locale's language.
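The two steps just described (decode bytes into characters, then break the characters into words according to a locale) can be illustrated with plain JDK facilities. The sketch below uses java.io.InputStreamReader and java.text.BreakIterator; it is an assumption-laden illustration of the idea, not a description of how IdentityDocumentFactory is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Decodes a byte stream and breaks the resulting text into words for a given locale. */
final class DecodeAndBreakSketch {

    static List<String> words(InputStream raw, Charset encoding, Locale locale) throws IOException {
        // Step 1: bytes to characters, using the given encoding.
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(raw, encoding)) {
            for (int c; (c = reader.read()) != -1; ) text.append((char) c);
        }
        // Step 2: locale-aware word boundaries.
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text.toString());
        List<String> words = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String piece = text.substring(start, end);
            // Keep only pieces that start with a letter or digit (skip spaces and punctuation).
            if (Character.isLetterOrDigit(piece.codePointAt(0))) words.add(piece);
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "Hello, world".getBytes(StandardCharsets.UTF_8);
        System.out.println(words(new ByteArrayInputStream(raw), StandardCharsets.UTF_8, Locale.ENGLISH));
    }
}
```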
Other examples of document factories
Other implementations of document factories that are provided with MG4J are:
- HtmlDocumentFactory: a factory used to parse HTML documents; this factory produces documents by parsing HTML streams. Bytes are converted into characters using a specified encoding; the resulting HTML character sequence is parsed to extract text (that is returned as a textual field named text) and title (the HTML title element, returned as a textual field named title). The title, if present, is also used as document title (otherwise, the suggested title is used). Note that a document collection might have information about the charset encoding (e.g., by means of HTTP headers): in this case the metadata parameter of DocumentFactory.getDocument(java.io.InputStream,it.unimi.dsi.fastutil.objects.Reference2ObjectMap) should pass this information.
- The package it.unimi.di.big.mg4j.document.tika provides factories based on Tika.
Composing document factories
As we said, many document factories interpret the raw content data (an InputStream, i.e., a sequence of bytes) as if it were really a concatenation of many InputStreams, where each stream is typically parsed into a field; to pass from one stream to the next, the InputStream.reset() method is called.
Suppose you have n document factories D1, …, Dn, with f1, …, fn fields, respectively. One may want to build a new factory with f1+…+fn fields, where each document is produced by composing the document factories D1, …, Dn sequentially: in other words, the raw data are first passed to the first factory (that extracts f1 fields, typically resetting the stream as many times), then they are passed to the second factory (that extracts f2 fields), and so on.
The CompositeDocumentFactory does the job, and also allows one to change the field names (that are otherwise named as they were in the subfactories).
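The field arithmetic of sequential composition can be sketched as follows. The code assumes, as the sequential description suggests, that the fields of D1 come first, followed by those of D2, and so on; it maps a field index of the composed factory to a pair (factory, local field index). The class and method names are hypothetical and are not part of CompositeDocumentFactory.

```java
/** Sketch of how field indices of a composition map back to the composed factories. */
final class CompositeFieldsSketch {

    /** Total number of fields: f1 + ... + fn. */
    static int totalFields(int[] fieldsPerFactory) {
        int total = 0;
        for (int fields : fieldsPerFactory) total += fields;
        return total;
    }

    /** Returns {factory index, local field index} for a field index of the composition. */
    static int[] locate(int[] fieldsPerFactory, int globalField) {
        if (globalField < 0) throw new IndexOutOfBoundsException("No field " + globalField);
        int remaining = globalField;
        for (int k = 0; k < fieldsPerFactory.length; k++) {
            if (remaining < fieldsPerFactory[k]) return new int[] { k, remaining };
            remaining -= fieldsPerFactory[k]; // skip all fields of factory k
        }
        throw new IndexOutOfBoundsException("No field " + globalField);
    }

    public static void main(String[] args) {
        int[] fields = { 2, 1, 3 }; // three factories with 2, 1 and 3 fields (0-based indices)
        int[] where = locate(fields, 3);
        System.out.println(totalFields(fields) + " fields; field 3 is field "
                + where[1] + " of factory " + where[0]);
    }
}
```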
The class MultipleInputStream is a useful tool to produce raw data for composite factories: it allows one to convert an array of input streams into a single input stream: each time the resulting stream is reset, the multiple input stream will offer you the next stream in the array.
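The reset-to-advance convention can be sketched with a small stream wrapper. SegmentedStream below is a hypothetical, simplified stand-in written for this illustration (it is not MultipleInputStream, and it overrides only the single-byte read method); throwing an exception when reset is called on the last segment is an assumption of the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Sketch of a stream over several segments in which reset() moves to the next segment. */
final class ResetAdvancesSketch {

    static final class SegmentedStream extends InputStream {
        private final InputStream[] segments;
        private int current;

        SegmentedStream(InputStream... segments) { this.segments = segments; }

        /** Reads from the current segment only; -1 marks the end of that segment. */
        @Override public int read() throws IOException {
            return current < segments.length ? segments[current].read() : -1;
        }

        /** Unlike the usual mark/reset contract, moves on to the next segment. */
        @Override public void reset() throws IOException {
            if (current >= segments.length - 1) throw new IOException("No more segments");
            current++;
        }
    }

    /** Reads the current segment to its end (ASCII only, for simplicity). */
    static String readSegment(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        for (int b; (b = in.read()) != -1; ) text.append((char) b);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        SegmentedStream raw = new SegmentedStream(
                new ByteArrayInputStream("A title".getBytes(StandardCharsets.US_ASCII)),
                new ByteArrayInputStream("Some body text".getBytes(StandardCharsets.US_ASCII)));
        String title = readSegment(raw); // first field
        raw.reset();                     // move to the raw data of the next field
        String body = readSegment(raw);  // second field
        System.out.println(title + " | " + body);
    }
}
```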
A special form of composite document factory is obtained using ReplicatedDocumentFactory, that allows one to compose sequentially the same document factory with itself a certain number of times.
Document collections and sequences
The InputStreamDocumentSequence
This is the simplest kind of document sequence: it just breaks a single InputStream on the basis of a given separator character; each piece of the stream is interpreted as the raw data corresponding to a document, and it is passed to a factory (specified at construction time) for converting it into a Document.
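Breaking a stream at a separator can be sketched in a few lines. The code below is an independent illustration, not the implementation of InputStreamDocumentSequence: it works on single bytes rather than characters, returns null when the stream is exhausted, and does not produce an extra empty piece after a trailing separator. All of these are choices made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sketch of breaking a byte stream into per-document pieces at a separator byte. */
final class SeparatorSplitSketch {

    /** Returns the bytes up to the next separator, or null at end of stream. */
    static byte[] nextChunk(InputStream in, int separator) throws IOException {
        int b = in.read();
        if (b == -1) return null; // nothing left: no more documents
        ByteArrayOutputStream chunk = new ByteArrayOutputStream();
        while (b != -1 && b != separator) {
            chunk.write(b);
            b = in.read();
        }
        return chunk.toByteArray();
    }

    /** Collects all pieces as ASCII strings; a real consumer would hand each piece to a factory. */
    static List<String> splitAll(InputStream in, int separator) throws IOException {
        List<String> pieces = new ArrayList<>();
        for (byte[] chunk; (chunk = nextChunk(in, separator)) != null; )
            pieces.add(new String(chunk, StandardCharsets.US_ASCII));
        return pieces;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "first doc\nsecond doc\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(splitAll(new ByteArrayInputStream(raw), '\n'));
    }
}
```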
The FileSetDocumentCollection
This kind of collection is built starting from a set of files in the file system. Each file is interpreted as a document, and passed to a factory (specified at construction time). The suggested title for a document is the corresponding filename, and the suggested URI is the URI of the file.
The ZipDocumentCollection facility
There are cases in which one would like to turn a document sequence into a document collection. This may happen for one of the following reasons:
- the sequence is, by its very nature, volatile (e.g., it is coming from standard input, and cannot be re-produced), but we would like to make it into a resident non-volatile collection;
- the sequence is not amenable to random access;
- the documents in the sequence are difficult to parse, and it is not advisable to repeat the parsing process every time they are accessed.
In all such cases, it may be advisable to produce a compact copy of the sequence that is easily and efficiently accessible at random.
To do this, one may use the ZipDocumentCollectionBuilder, that takes a document sequence and produces a "zipped clone" of the documents in the sequence: there are some mild limitations to the sequences that can be used in this context, and the resulting collection is only a partial copy of the original one, but in most cases this is sufficient for all indexing purposes. The builder will save two files: one contains the essential data concerning the zipped collection, and the other contains the zipped version of the documents.
After this, the produced ZipDocumentCollection may be used as any other collection.
Interface Summary
- DispatchingDocumentFactory.DispatchingStrategy: A strategy that decides which factory is appropriate using the document metadata.
- Document: An indexable document.
- DocumentCollection: A collection of documents.
- DocumentCollectionBuilder: An interface for classes that can build collections during the indexing process.
- DocumentFactory: A factory parsing and building documents of the same type.
- DocumentIterator: An iterator over documents.
- DocumentSequence: A sequence of documents.
Class Summary
- AbstractDocument: An abstract, safely closeable implementation of a document.
- AbstractDocumentCollection: An abstract, safely closeable implementation of a document collection.
- AbstractDocumentFactory: An abstract implementation of a factory, providing a protected method to check for field indices.
- AbstractDocumentIterator: An abstract, safely closeable implementation of a document iterator.
- AbstractDocumentSequence: An abstract, safely closeable implementation of a document sequence.
- CompositeDocumentFactory: A composite factory that passes the input stream to a sequence of factories in turn.
- CompositeDocumentSequence: A document sequence composing a list of underlying sequences.
- ConcatenatedDocumentCollection: A document collection exhibiting a list of underlying document collections, called segments, as a single collection.
- ConcatenatedDocumentSequence: A document sequence exhibiting a list of underlying document sequences, called segments, as a single sequence.
- CSVDocumentCollection: A DocumentCollection corresponding to a given set of records in a comma separated file.
- DispatchingDocumentFactory: A document factory that actually dispatches the task of building documents to various factories according to some strategy.
- DispatchingDocumentFactory.StringBasedDispatchingStrategy: A strategy that is based on trying to match the value of the metadata with a given key with respect to a certain set of values.
- FileSetDocumentCollection: A DocumentCollection corresponding to a given set of files.
- HtmlDocumentFactory: A factory that provides fields for body and title of HTML documents.
- IdentityDocumentFactory: A factory that provides a single field containing just the raw input stream; the encoding is set using the property PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
- InputStreamDocumentSequence: A document sequence obtained by breaking an input stream at a specified separator.
- JavamailDocumentCollection: A DocumentCollection corresponding to a Javamail Store.
- JdbcDocumentCollection: A DocumentCollection corresponding to the result of a query in a relational database.
- PropertyBasedDocumentFactory: A document factory initialised by default properties.
- ReplicatedDocumentFactory: A factory that replicates a given factory several times.
- SimpleCompressedDocumentCollection: A basic, compressed document collection that can be easily built at indexing time.
- SimpleCompressedDocumentCollection.FrequencyCodec: A simple codec for integers that remaps frequent numbers to smaller numbers.
- SimpleCompressedDocumentCollectionBuilder: A builder for simple compressed document collections.
- SubDocumentCollection: A collection that exhibits a contiguous subset of documents from a given collection.
- SubDocumentFactory: A factory that exposes a subset of the fields of a given factory.
- SubsetDocumentSequence: A collection that exhibits a subset of documents (possibly not contiguous) from a given sequence.
- TRECDocumentCollection: A collection for the TREC GOV2 data set.
- TRECDocumentCollection.TRECDocumentDescriptor: A compact description of the location and of the internal segmentation of a TREC document inside a file.
- TRECHeaderDocumentFactory: A factory without fields that is used to interpret the header of a TREC GOV2 document.
- WarcDocumentSequence: A document sequence over a set of (possibly compressed) Warc files.
- WikipediaDocumentCollection: A DocumentCollection corresponding to a given set of files in the Yahoo! Wikipedia format.
- WikipediaDocumentCollection.WhitespaceWordReader
- WikipediaDocumentSequence: A class exhibiting a standard Wikipedia XML dump as a DocumentSequence.
- WikipediaDocumentSequence.SignedRedirectedStringMap: A wrapper around a signed function that remaps entries exceeding a provided threshold using a specified target array.
- WikipediaDocumentSequence.WikipediaHeaderFactory: A factory responsible for special Wikipedia fields (see the class documentation).
- ZipDocumentCollection: A document collection stored in a zip file.
- ZipDocumentCollection.ZipFactory: A factory tightly coupled to a ZipDocumentCollection.
- ZipDocumentCollectionBuilder: A builder for zipped document collections.
Enum Summary
- AbstractDocumentCollection.PropertyKeys: Symbolic names for common properties of a DocumentCollection.
- DispatchingDocumentFactory.MetadataKeys: Case-insensitive keys for metadata.
- DocumentFactory.FieldType: A field type.
- HtmlDocumentFactory.MetadataKeys
- IdentityDocumentFactory.MetadataKeys: Case-insensitive keys for metadata.
- PropertyBasedDocumentFactory.MetadataKeys: Case-insensitive keys for metadata passed to DocumentFactory.getDocument(java.io.InputStream,it.unimi.dsi.fastutil.objects.Reference2ObjectMap).
- WikipediaDocumentSequence.MetadataKeys
- ZipDocumentCollection.PropertyKeys: Symbolic names for common properties of a DocumentCollection.