it.unimi.di.mg4j.document
Class ZipDocumentCollection

java.lang.Object
  extended by it.unimi.di.mg4j.document.AbstractDocumentSequence
      extended by it.unimi.di.mg4j.document.AbstractDocumentCollection
          extended by it.unimi.di.mg4j.document.ZipDocumentCollection
All Implemented Interfaces:
DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable

public class ZipDocumentCollection
extends AbstractDocumentCollection
implements Serializable

A document collection stored in a zip file.

Each instance of this class has an associated zip file. Each zip entry corresponds to a document: the title is recorded in the comment field, whereas the URI is written with MutableString.writeSelfDelimUTF8(java.io.OutputStream) directly to the zipped output stream. When building an exact ZipDocumentCollection, subsequent word/nonword pairs are written in the same way, and are delimited by two empty strings. If the collection is not exact, only words are written, delimited by an empty string. Non-text fields are written directly to the zipped output stream as serialised objects.
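
The entry layout described above can be illustrated with a minimal sketch that uses only java.util.zip. It is not the MG4J implementation: the class ZipLayoutSketch, its method names and the numeric entry names are hypothetical, and DataOutputStream.writeUTF (a length-prefixed encoding) stands in for MutableString.writeSelfDelimUTF8, whose byte format is different. Only the non-exact case is shown: one entry per document, the title in the entry comment, then the URI, then the words, closed by an empty string.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: names and byte format are assumptions, not MG4J's.
final class ZipLayoutSketch {

    /** Writes one document as one zip entry (non-exact layout). */
    static void writeDocument(ZipOutputStream zip, int index, String title,
                              String uri, List<String> words) throws IOException {
        ZipEntry entry = new ZipEntry(Integer.toString(index)); // assumed entry name
        entry.setComment(title);                 // the title lives in the comment field
        zip.putNextEntry(entry);
        DataOutputStream out = new DataOutputStream(zip);
        out.writeUTF(uri);                       // self-delimiting: length prefix + bytes
        for (String word : words) out.writeUTF(word);
        out.writeUTF("");                        // an empty string closes the word list
        out.flush();                             // do not close: that would close the zip
        zip.closeEntry();
    }

    /** Reads a document back and renders it as "title | uri | words". */
    static String readDocument(ZipFile zip, int index) throws IOException {
        ZipEntry entry = zip.getEntry(Integer.toString(index));
        List<String> words = new ArrayList<>();
        String uri;
        try (DataInputStream in = new DataInputStream(zip.getInputStream(entry))) {
            uri = in.readUTF();
            for (String word = in.readUTF(); !word.isEmpty(); word = in.readUTF()) {
                words.add(word);
            }
        }
        return entry.getComment() + " | " + uri + " | " + String.join(" ", words);
    }

    /** Writes a single document to a temporary zip file and reads it back. */
    static String roundTrip(String title, String uri, List<String> words) {
        try {
            File file = File.createTempFile("zip-layout-sketch", ".zip");
            file.deleteOnExit();
            try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(file))) {
                writeDocument(zip, 0, title, uri, words);
            }
            try (ZipFile zip = new ZipFile(file)) {
                return readDocument(zip, 0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("A title", "http://example.com/0",
                Arrays.asList("hello", "zip", "world")));
    }
}
```

Because the empty string serves as the terminator, this sketch cannot store an empty word. The comment is read through ZipFile rather than ZipInputStream, since entry comments are kept in the central directory of the archive.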

The collection will produce the same documents as the original sequence whence it was produced, in the following sense:

Like any other collection, this collection is serialized to a file, but it refers to a separate zip file that contains the documents themselves. Please use AbstractDocumentSequence.load(CharSequence) to load instances of this collection.
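
The split between a small serialized object and a separate zip file can be sketched as follows. This is an illustration of the general pattern only, not the actual field layout of ZipDocumentCollection: the class ZipBackedSketch and all of its members are hypothetical. The serialized form carries just the zip filename, while the open zip handle is transient and is reopened lazily after deserialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.ZipFile;

// Hypothetical sketch of a serializable object that points at an external zip file.
final class ZipBackedSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String zipFilename;   // serialized: where the documents live
    private transient ZipFile zipFile;  // not serialized: reopened on demand

    ZipBackedSketch(String zipFilename) {
        this.zipFilename = zipFilename;
    }

    String zipFilename() {
        return zipFilename;
    }

    boolean isOpen() {
        return zipFile != null;
    }

    /** Opens the zip file on first use; fails if the file is missing. */
    ZipFile zipFile() throws IOException {
        if (zipFile == null) zipFile = new ZipFile(zipFilename);
        return zipFile;
    }

    /** Serializes and deserializes the object in memory. */
    static ZipBackedSketch roundTrip(ZipBackedSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ZipBackedSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ZipBackedSketch copy = roundTrip(new ZipBackedSketch("documents.zip"));
        System.out.println(copy.zipFilename() + " open=" + copy.isOpen());
    }
}
```

One consequence of this pattern is that the serialized object is only usable while the zip file it names is still reachable at the recorded path.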

Note that the zip format is not designed for a large number of files. This class is mainly a useful example, and a handy way to quickly build a collection containing all fields at indexing time. For a more efficient kind of collection, see SimpleCompressedDocumentCollection.

Warning: the Reader returned by Document.content(int) for documents produced by this factory is obtained simply by concatenating the words and nonwords returned by the word reader for that field. If the collection is not exact, each nonword is replaced by a space.
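
The difference between the two cases can be shown with a small sketch. The class ContentReconstructionSketch and its methods are hypothetical and return a String rather than a Reader for brevity; the sketch assumes that words and nonwords come in pairs, as described for exact collections.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of how content is rebuilt from stored tokens.
final class ContentReconstructionSketch {

    /** Exact collection: each word is followed by its stored nonword. */
    static String exactContent(List<String> words, List<String> nonWords) {
        StringBuilder content = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            content.append(words.get(i)).append(nonWords.get(i));
        }
        return content.toString();
    }

    /** Non-exact collection: nonwords were not stored, so each becomes one space. */
    static String inexactContent(List<String> words) {
        StringBuilder content = new StringBuilder();
        for (String word : words) {
            content.append(word).append(' ');
        }
        return content.toString();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Hello", "world");
        List<String> nonWords = Arrays.asList(", ", "!");
        System.out.println("[" + exactContent(words, nonWords) + "]");  // [Hello, world!]
        System.out.println("[" + inexactContent(words) + "]");          // [Hello world ]
    }
}
```

In the non-exact case punctuation and original whitespace are lost, so the reconstructed content is suitable for re-tokenisation but not for displaying the original text.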

See Also:
Serialized Form

Nested Class Summary
static class ZipDocumentCollection.PropertyKeys
          Symbolic names for common properties of a DocumentCollection.
protected static class ZipDocumentCollection.ZipFactory
          A factory tightly coupled to a ZipDocumentCollection.
 
Field Summary
static String ZIP_EXTENSION
           
 
Fields inherited from interface it.unimi.di.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
 
Constructor Summary
ZipDocumentCollection(String zipFilename, DocumentFactory underlyingFactory, int numberOfDocuments, boolean exact)
          Constructs a document collection (for reading) corresponding to a given zip collection file.
 
Method Summary
 void close()
          Closes this document sequence, releasing all resources.
 ZipDocumentCollection copy()
           
 Document document(int index)
          Returns the document given its index.
 DocumentFactory factory()
          Returns the factory used by this sequence.
 void filename(CharSequence filename)
          Does nothing.
 DocumentIterator iterator()
          Returns an iterator over the sequence of documents.
 Reference2ObjectMap<Enum<?>,Object> metadata(int index)
          Returns the metadata map for a document.
 int size()
          Returns the number of documents in this collection.
 InputStream stream(int index)
          Returns an input stream for the raw content of a document.
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, main, printAllDocuments, toString
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentSequence
finalize, load
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

ZIP_EXTENSION

public static final String ZIP_EXTENSION
See Also:
Constant Field Values
Constructor Detail

ZipDocumentCollection

public ZipDocumentCollection(String zipFilename,
                             DocumentFactory underlyingFactory,
                             int numberOfDocuments,
                             boolean exact)
Constructs a document collection (for reading) corresponding to a given zip collection file.

Parameters:
zipFilename - the filename of the zip collection.
underlyingFactory - the underlying document factory.
numberOfDocuments - the number of documents.
exact - true iff this is an exact reproduction of the original sequence.
Method Detail

filename

public void filename(CharSequence filename)
              throws IOException
Description copied from class: AbstractDocumentSequence
Does nothing.

Specified by:
filename in interface DocumentSequence
Overrides:
filename in class AbstractDocumentSequence
Parameters:
filename - the filename of this document sequence.
Throws:
IOException

copy

public ZipDocumentCollection copy()
Specified by:
copy in interface DocumentCollection
Specified by:
copy in interface FlyweightPrototype<DocumentCollection>

factory

public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.

Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.

Specified by:
factory in interface DocumentSequence
Returns:
the factory used by this sequence.

size

public int size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.

Specified by:
size in interface DocumentCollection
Returns:
the number of documents in this collection.

document

public Document document(int index)
                  throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.

Specified by:
document in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the index-th document.
Throws:
IOException

metadata

public Reference2ObjectMap<Enum<?>,Object> metadata(int index)
Description copied from interface: DocumentCollection
Returns the metadata map for a document.

Specified by:
metadata in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the metadata map for the document.

stream

public InputStream stream(int index)
                   throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.

Specified by:
stream in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the raw content of the document as an input stream.
Throws:
IOException

iterator

public DocumentIterator iterator()
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.

Warning: this method can be safely called only once. For instance, implementations based on standard input will usually throw an exception if this method is called twice.

Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.

Specified by:
iterator in interface DocumentSequence
Overrides:
iterator in class AbstractDocumentCollection
Returns:
an iterator over the sequence of documents.
See Also:
DocumentCollection

close

public void close()
           throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.

You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.
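
Since this class implements Closeable, one way to follow this advice is a try-with-resources block. The sketch below uses a hypothetical stand-in class, FakeSequence, instead of a real document sequence, so that it runs without MG4J on the classpath; it only demonstrates that close() is invoked when the block exits.

```java
import java.io.Closeable;

// Hypothetical sketch: FakeSequence stands in for a real document sequence.
final class CloseSketch {

    /** Records whether close() has been called. */
    static final class FakeSequence implements Closeable {
        boolean closed;

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Uses the sequence inside try-with-resources and reports whether it was closed. */
    static boolean useAndClose() {
        FakeSequence sequence = new FakeSequence();
        try (FakeSequence resource = sequence) {
            // ... read documents from the resource here ...
        }
        return sequence.closed;
    }

    public static void main(String[] args) {
        System.out.println(useAndClose()); // prints "true"
    }
}
```

With a real collection the body of the try block would typically iterate over the documents; close() is then called even if the body throws.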

Specified by:
close in interface DocumentSequence
Specified by:
close in interface Closeable
Overrides:
close in class AbstractDocumentSequence
Throws:
IOException