Class ZipDocumentCollection
- java.lang.Object
- it.unimi.di.big.mg4j.document.AbstractDocumentSequence
- it.unimi.di.big.mg4j.document.AbstractDocumentCollection
- it.unimi.di.big.mg4j.document.ZipDocumentCollection
- All Implemented Interfaces: DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable, AutoCloseable
public class ZipDocumentCollection extends AbstractDocumentCollection implements Serializable
A document collection stored in a zip file.
Each instance of this class has an associated zip file. Each zip entry corresponds to a document: the title is recorded in the comment field, whereas the URI is written with MutableString.writeSelfDelimUTF8(java.io.OutputStream) directly to the zipped output stream. When building an exact ZipDocumentCollection, subsequent word/nonword pairs are written in the same way and delimited by two empty strings. If the collection is not exact, just words are written, delimited by an empty string. Non-text fields are written directly to the zipped output stream as serialised objects.
The collection will produce the same documents as the original sequence from which it was produced, in the following sense:
- the resulting collection has as many documents as the original sequence, in the same order, with the same titles and URIs;
- every document has the same number of fields, with the same names and types;
- non-textual non-virtual fields will be written out as objects, so they need to be serializable;
- virtual fields will be written as a sequence of self-delimiting UTF-8 mutable strings starting with the number of fragments (converted into a string with String.valueOf(int)), followed by a pair of strings for each fragment (the first string being the document specifier, and the second being the associated text);
- textual fields will be written out in such a way that, when reading them, the same sequence of words and non-words will be produced; alternatively, one may produce a collection that only copies words (non-words are not copied).
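The entry layout described above can be sketched with the standard library alone. This is a minimal illustration, not MG4J code: DataOutputStream.writeUTF stands in for MutableString.writeSelfDelimUTF8 (a different self-delimiting encoding), the entry name "0" and the class and method names are hypothetical, and the sketch assumes that tokens alternate word, non-word and that words are non-empty in the non-exact case.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

/** Illustrative stand-in for the entry layout; none of this is MG4J API. */
public class ZipEntryLayoutSketch {

    /** Writes one document as one zip entry; tokens alternates word, non-word, word, ... */
    static void writeDocument(ZipOutputStream zip, String entryName, String title, String uri,
            List<String> tokens, boolean exact) throws IOException {
        ZipEntry entry = new ZipEntry(entryName);   // entry naming is an assumption
        entry.setComment(title);                    // the title travels in the comment field
        zip.putNextEntry(entry);
        DataOutputStream out = new DataOutputStream(zip);
        out.writeUTF(uri);                          // stand-in for writeSelfDelimUTF8
        for (int i = 0; i < tokens.size(); i += 2) {
            out.writeUTF(tokens.get(i));                    // word
            if (exact) out.writeUTF(tokens.get(i + 1));     // non-word, exact mode only
        }
        out.writeUTF("");                           // one empty string ends a word-only field
        if (exact) out.writeUTF("");                // exact mode ends with two empty strings
        out.flush();                                // do not close: that would close the zip
        zip.closeEntry();
    }

    /** Reads tokens back until the delimiter: an empty pair if exact, an empty word otherwise. */
    static List<String> readTokens(DataInputStream in, boolean exact) throws IOException {
        List<String> tokens = new ArrayList<>();
        while (true) {
            String word = in.readUTF();
            if (!exact) {
                if (word.isEmpty()) return tokens;
                tokens.add(word);
            } else {
                String nonWord = in.readUTF();
                if (word.isEmpty() && nonWord.isEmpty()) return tokens;
                tokens.add(word);
                tokens.add(nonWord);
            }
        }
    }

    /** Round-trips one document through a temporary zip file; returns title, URI, then tokens. */
    static List<String> roundTrip(String title, String uri, List<String> tokens, boolean exact)
            throws IOException {
        Path file = Files.createTempFile("zip-layout-sketch", ".zip");
        try {
            try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(file))) {
                writeDocument(zip, "0", title, uri, tokens, exact);
            }
            try (ZipFile zip = new ZipFile(file.toFile())) {
                ZipEntry entry = zip.getEntry("0");
                try (DataInputStream in = new DataInputStream(zip.getInputStream(entry))) {
                    List<String> result = new ArrayList<>();
                    result.add(entry.getComment()); // title from the comment field
                    result.add(in.readUTF());       // URI from the entry data
                    result.addAll(readTokens(in, exact));
                    return result;
                }
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> tokens = Arrays.asList("Hello", ", ", "world", "!");
        System.out.println(roundTrip("Greeting", "http://example.com/0", tokens, true));
        System.out.println(roundTrip("Greeting", "http://example.com/0", tokens, false));
    }
}
```

In exact mode the round trip returns every word and non-word; in non-exact mode only the words survive, which is why readers of such a collection cannot recover the original separators.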
The collection will be, as any other collection, serialized to a file, but it will refer to another zip file that is going to contain the documents themselves. Please use AbstractDocumentSequence.load(CharSequence) to load instances of this collection.
Note that the zip format is not designed for a large number of files. This class is mainly a useful example, and a handy way to quickly build a collection containing all fields at indexing time. For a more efficient kind of collection, see SimpleCompressedDocumentCollection.
Warning: the Reader returned by Document.content(int) for documents produced by this factory is just obtained as the concatenation of words and non-words returned by the word reader for that field. In case the collection is not exact, nonwords are substituted by a space.
- See Also:
- Serialized Form
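The warning about Document.content(int) can be illustrated with a small standard-library sketch. The class and method names below are hypothetical, and whether a space also follows the last word in a non-exact collection is an assumption of this sketch, not something the documentation states.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: rebuilds field content from stored tokens; not MG4J code. */
public class ContentReaderSketch {

    /**
     * Concatenates words and non-words. Pass nonWords == null to model a non-exact
     * collection, where every non-word is replaced by a single space (assumed here
     * to include the position after the last word).
     */
    static String contentText(List<String> words, List<String> nonWords) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            text.append(words.get(i));
            text.append(nonWords == null ? " " : nonWords.get(i));
        }
        return text.toString();
    }

    /** Wraps the concatenation in a Reader, mirroring the shape of content(int). */
    static Reader content(List<String> words, List<String> nonWords) {
        return new StringReader(contentText(words, nonWords));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Hello", "world");
        List<String> nonWords = Arrays.asList(", ", "!");
        System.out.println(contentText(words, nonWords)); // exact: original separators kept
        System.out.println(contentText(words, null));     // non-exact: separators become spaces
    }
}
```

The practical consequence is that text read from a non-exact collection should not be compared character by character with the original source.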
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | ZipDocumentCollection.PropertyKeys | Symbolic names for common properties of a DocumentCollection.
protected static class | ZipDocumentCollection.ZipFactory | A factory tightly coupled to a ZipDocumentCollection.
-
Field Summary
Fields
Modifier and Type | Field | Description
static String | ZIP_EXTENSION |
-
Fields inherited from interface it.unimi.di.big.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
-
-
Constructor Summary
Constructors
Constructor | Description
ZipDocumentCollection(String zipFilename, DocumentFactory underlyingFactory, long numberOfDocuments2, boolean exact) | Constructs a document collection (for reading) corresponding to a given zip collection file.
-
Method Summary
Modifier and Type | Method | Description
void | close() | Closes this document sequence, releasing all resources.
ZipDocumentCollection | copy() |
Document | document(long index) | Returns the document given its index.
DocumentFactory | factory() | Returns the factory used by this sequence.
void | filename(CharSequence filename) | Does nothing.
DocumentIterator | iterator() | Returns an iterator over the sequence of documents.
Reference2ObjectMap<Enum<?>,Object> | metadata(long index) | Returns the metadata map for a document.
long | size() | Returns the number of documents in this collection.
InputStream | stream(long index) | Returns an input stream for the raw content of a document.
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, main, printAllDocuments, toString
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentSequence
finalize, load
-
Field Detail
-
ZIP_EXTENSION
public static final String ZIP_EXTENSION
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
ZipDocumentCollection
public ZipDocumentCollection(String zipFilename, DocumentFactory underlyingFactory, long numberOfDocuments2, boolean exact)
Constructs a document collection (for reading) corresponding to a given zip collection file.
- Parameters:
zipFilename - the filename of the zip collection.
underlyingFactory - the underlying document factory.
numberOfDocuments2 - the number of documents.
exact - true iff this is an exact reproduction of the original sequence.
-
-
Method Detail
-
filename
public void filename(CharSequence filename) throws IOException
Description copied from class: AbstractDocumentSequence
Does nothing.
- Specified by: filename in interface DocumentSequence
- Overrides: filename in class AbstractDocumentSequence
- Parameters: filename - the filename of this document sequence.
- Throws: IOException
-
copy
public ZipDocumentCollection copy()
- Specified by: copy in interface DocumentCollection
- Specified by: copy in interface FlyweightPrototype<DocumentCollection>
-
factory
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
- Specified by: factory in interface DocumentSequence
- Returns: the factory used by this sequence.
-
size
public long size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.
- Specified by: size in interface DocumentCollection
- Returns: the number of documents in this collection.
-
document
public Document document(long index) throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.
- Specified by: document in interface DocumentCollection
- Parameters: index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns: the index-th document.
- Throws: IOException
-
metadata
public Reference2ObjectMap<Enum<?>,Object> metadata(long index)
Description copied from interface: DocumentCollection
Returns the metadata map for a document.
- Specified by: metadata in interface DocumentCollection
- Parameters: index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns: the metadata map for the document.
-
stream
public InputStream stream(long index) throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.
- Specified by: stream in interface DocumentCollection
- Parameters: index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns: the raw content of the document as an input stream.
- Throws: IOException
-
iterator
public DocumentIterator iterator()
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.
Warning: this method can be safely called just one time. For instance, implementations based on standard input will usually throw an exception if this method is called twice.
Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.
- Specified by: iterator in interface DocumentSequence
- Overrides: iterator in class AbstractDocumentCollection
- Returns: an iterator over the sequence of documents.
- See Also: DocumentCollection
-
close
public void close() throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.
You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.
- Specified by: close in interface AutoCloseable
- Specified by: close in interface Closeable
- Specified by: close in interface DocumentSequence
- Overrides: close in class AbstractDocumentSequence
- Throws: IOException
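Because the class implements AutoCloseable, the advice to always call close() is most easily followed with try-with-resources. The sketch below uses a hypothetical stand-in class (FakeSequence) rather than a real collection, so that it runs without MG4J on the classpath; the same pattern would apply to any Closeable document sequence.

```java
import java.io.Closeable;
import java.io.IOException;

/** Illustrative only: FakeSequence is a hypothetical stand-in, not an MG4J class. */
public class CloseSketch {

    static final class FakeSequence implements Closeable {
        boolean closed;

        /** Pretends to report a size; refuses to work once closed. */
        int size() {
            if (closed) throw new IllegalStateException("sequence already closed");
            return 3;
        }

        @Override
        public void close() throws IOException {
            closed = true; // a real implementation would release files here
        }
    }

    /** Uses the sequence inside try-with-resources and reports whether it ended up closed. */
    static boolean useAndClose() throws IOException {
        FakeSequence seen;
        try (FakeSequence sequence = new FakeSequence()) {
            seen = sequence;
            sequence.size(); // work with the sequence while it is open
        } // close() is invoked here, even if the body throws
        return seen.closed;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(useAndClose());
    }
}
```

Closing explicitly this way avoids relying on a finaliser, whose timing is not guaranteed.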
-
-