Class SimpleCompressedDocumentCollection
java.lang.Object
  it.unimi.di.big.mg4j.document.AbstractDocumentSequence
    it.unimi.di.big.mg4j.document.AbstractDocumentCollection
      it.unimi.di.big.mg4j.document.SimpleCompressedDocumentCollection
All Implemented Interfaces:
DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable, AutoCloseable
public class SimpleCompressedDocumentCollection extends AbstractDocumentCollection implements Serializable
A basic, compressed document collection that can be easily built at indexing time.
Instances of this class record virtual and non-text fields just like ZipDocumentCollection, that is, in a zip file. However, text fields are recorded in a simple but highly efficient format. Terms (and nonterms) are numbered globally in increasing order as they are met. While we scan each document, we keep track of frequencies for a limited number of terms: terms are encoded with their frequency rank if we know their statistics, or by a special code derived from their global number if we have no statistics about them. Every number involved is written in delta code.
A collection can be exact or approximated: in the latter case, nonwords will not be recorded, and will be turned into spaces when decompressing.
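As an illustration of the delta code mentioned above, the following is a minimal sketch of Elias δ coding for positive integers, using strings of '0' and '1' characters instead of a real bit stream. The class name EliasDeltaSketch and its methods are hypothetical and are not part of MG4J; the actual collection writes bits through its own stream classes and may shift values (for example, to represent zero), which this sketch does not attempt to reproduce.

```java
// Illustrative sketch only: Elias delta coding on strings of '0'/'1' characters.
// EliasDeltaSketch is a hypothetical name, not an MG4J class.
class EliasDeltaSketch {

    /** Returns the Elias delta codeword of n (n >= 1) as a string of bits. */
    static String encode(long n) {
        if (n < 1) throw new IllegalArgumentException("n must be positive");
        int width = 64 - Long.numberOfLeadingZeros(n);            // bits needed for n
        int widthBits = 32 - Integer.numberOfLeadingZeros(width); // bits needed for width
        StringBuilder bits = new StringBuilder();
        for (int i = 1; i < widthBits; i++) bits.append('0');     // prefix of widthBits - 1 zeros
        bits.append(Integer.toBinaryString(width));               // width, in exactly widthBits bits
        for (int i = width - 2; i >= 0; i--)                      // n without its leading 1
            bits.append((n >>> i & 1) == 1 ? '1' : '0');
        return bits.toString();
    }

    /** Decodes a single codeword produced by encode(). */
    static long decode(String bits) {
        int pos = 0;
        while (bits.charAt(pos) == '0') pos++;                    // count the zeros of the prefix
        int widthBits = pos + 1;
        int width = Integer.parseInt(bits.substring(pos, pos + widthBits), 2);
        pos += widthBits;
        long n = 1;                                               // implicit leading 1
        for (int i = 0; i < width - 1; i++) n = n << 1 | (bits.charAt(pos++) - '0');
        return n;
    }

    public static void main(String[] args) {
        for (long n : new long[] { 1, 2, 3, 4, 10, 1000 })
            System.out.println(n + " -> " + encode(n) + " -> " + decode(encode(n)));
    }
}
```

Small numbers get short codewords, which is why encoding frequent terms by a small frequency rank pays off.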
An instance of this collection will be, like any other collection, serialised to a file, but it will refer to several other files whose names are derived from the instance basename. Please use AbstractDocumentSequence.load(CharSequence) to load instances of this collection.
This class suffers from the same scalability problem as ZipDocumentCollection if you compress non-text or virtual fields. Text compression, on the other hand, is extremely efficient and scalable.
- Author:
- Sebastiano Vigna
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes
protected static class SimpleCompressedDocumentCollection.FrequencyCodec
A simple codec for integers that remaps frequent numbers to smaller numbers.
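The idea of remapping frequent numbers to smaller numbers can be sketched as follows. This is not the implementation of FrequencyCodec, whose details are not documented on this page: the class FrequencyRankSketch, its constructor parameters, and the escape rule for untracked symbols (number of tracked symbols plus the symbol itself) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: FrequencyRankSketch is a hypothetical class, not MG4J code.
class FrequencyRankSketch {
    private final Map<Long, Integer> rankOf = new HashMap<>(); // tracked symbol -> rank
    private final List<Long> symbolOf = new ArrayList<>();     // rank -> tracked symbol

    /** Builds the remapping from a sample, tracking at most maxTracked symbols. */
    FrequencyRankSketch(long[] sample, int maxTracked) {
        Map<Long, Integer> count = new HashMap<>();
        for (long s : sample) count.merge(s, 1, Integer::sum);
        List<Long> byFrequency = new ArrayList<>(count.keySet());
        // Most frequent first; ties are broken by symbol value to keep the order deterministic.
        byFrequency.sort((a, b) -> {
            int c = Integer.compare(count.get(b), count.get(a));
            return c != 0 ? c : Long.compare(a, b);
        });
        for (Long s : byFrequency.subList(0, Math.min(maxTracked, byFrequency.size()))) {
            rankOf.put(s, symbolOf.size());
            symbolOf.add(s);
        }
    }

    /** Tracked symbols map to their rank; others are escaped past the tracked range (assumed rule). */
    long encode(long symbol) {
        Integer rank = rankOf.get(symbol);
        return rank != null ? rank : symbolOf.size() + symbol;
    }

    /** Inverts encode(). */
    long decode(long code) {
        return code < symbolOf.size() ? symbolOf.get((int) code) : code - symbolOf.size();
    }
}
```

With the sample {7, 7, 7, 3, 3, 9} and two tracked symbols, 7 is encoded as 0, 3 as 1, and the untracked 9 as 11. Combined with a code that favours small integers, frequent symbols end up with the shortest codewords.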
Nested classes/interfaces inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
AbstractDocumentCollection.PropertyKeys
-
-
Field Summary
Fields
protected static boolean ASSERTS
static String DOCUMENT_OFFSETS_EXTENSION
Standard extension for the file containing document offsets stored as δ-encoded gaps.
static String DOCUMENTS_EXTENSION
Standard extension for the file containing encoded documents.
static String NONTERM_OFFSETS_EXTENSION
Standard extension for the file containing nonterm offsets stored as δ-encoded gaps.
static String NONTERMS_EXTENSION
Standard extension for the file containing nonterms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format.
static String STATS_EXTENSION
Standard extension for the stats file.
static String TERM_OFFSETS_EXTENSION
Standard extension for the file containing term offsets stored as δ-encoded gaps.
static String TERMS_EXTENSION
Standard extension for the file containing terms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format.
Fields inherited from interface it.unimi.di.big.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
-
-
Constructor Summary
Constructors
protected SimpleCompressedDocumentCollection(String basename, long documents, long terms, long nonTerms, boolean exact, DocumentFactory factory)
-
Method Summary
void close()
Closes this document sequence, releasing all resources.
DocumentCollection copy()
Document document(long index)
Returns the document given its index.
DocumentFactory factory()
Returns the factory used by this sequence.
void filename(CharSequence filename)
Does nothing.
static void main(String[] arg)
Reference2ObjectMap<Enum<?>,Object> metadata(long index)
Returns the metadata map for a document.
static void optimize(CharSequence basename)
long size()
Returns the number of documents in this collection.
InputStream stream(long index)
Returns an input stream for the raw content of a document.
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, iterator, printAllDocuments, toString
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentSequence
finalize, load
-
-
-
-
Field Detail
-
ASSERTS
protected static final boolean ASSERTS
- See Also:
- Constant Field Values
-
DOCUMENTS_EXTENSION
public static final String DOCUMENTS_EXTENSION
Standard extension for the file containing encoded documents.
- See Also:
- Constant Field Values
-
DOCUMENT_OFFSETS_EXTENSION
public static final String DOCUMENT_OFFSETS_EXTENSION
Standard extension for the file containing document offsets stored as δ-encoded gaps.
- See Also:
- Constant Field Values
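Storing offsets as gaps means recording the difference between consecutive offsets rather than the offsets themselves, so that the numbers to be δ-encoded stay small. A minimal sketch of the conversion follows; the class OffsetGapsSketch and its method names are hypothetical, and the exact layout of the offsets file (for instance, how the first offset is stored) is an assumption not confirmed by this page.

```java
// Illustrative sketch only: OffsetGapsSketch is a hypothetical class, not MG4J code.
class OffsetGapsSketch {

    /** Turns nondecreasing offsets into gaps; the first gap is taken relative to 0 (assumed). */
    static long[] toGaps(long[] offsets) {
        long[] gaps = new long[offsets.length];
        long previous = 0;
        for (int i = 0; i < offsets.length; i++) {
            gaps[i] = offsets[i] - previous;
            previous = offsets[i];
        }
        return gaps;
    }

    /** Rebuilds the offsets from the gaps with a running sum. */
    static long[] toOffsets(long[] gaps) {
        long[] offsets = new long[gaps.length];
        long current = 0;
        for (int i = 0; i < gaps.length; i++) {
            current += gaps[i];
            offsets[i] = current;
        }
        return offsets;
    }
}
```

For example, the offsets 0, 120, 450, 451 become the gaps 0, 120, 330, 1.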
-
TERMS_EXTENSION
public static final String TERMS_EXTENSION
Standard extension for the file containing terms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format.
- See Also:
- Constant Field Values
-
TERM_OFFSETS_EXTENSION
public static final String TERM_OFFSETS_EXTENSION
Standard extension for the file containing term offsets stored as δ-encoded gaps.
- See Also:
- Constant Field Values
-
NONTERMS_EXTENSION
public static final String NONTERMS_EXTENSION
Standard extension for the file containing nonterms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format.
- See Also:
- Constant Field Values
-
NONTERM_OFFSETS_EXTENSION
public static final String NONTERM_OFFSETS_EXTENSION
Standard extension for the file containing nonterm offsets stored as δ-encoded gaps.
- See Also:
- Constant Field Values
-
STATS_EXTENSION
public static final String STATS_EXTENSION
Standard extension for the stats file.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
SimpleCompressedDocumentCollection
protected SimpleCompressedDocumentCollection(String basename, long documents, long terms, long nonTerms, boolean exact, DocumentFactory factory)
-
-
Method Detail
-
filename
public void filename(CharSequence filename) throws IOException
Description copied from class: AbstractDocumentSequence
Does nothing.
- Specified by:
- filename in interface DocumentSequence
- Overrides:
- filename in class AbstractDocumentSequence
- Parameters:
- filename - the filename of this document sequence.
- Throws:
- IOException
-
copy
public DocumentCollection copy()
- Specified by:
- copy in interface DocumentCollection
- Specified by:
- copy in interface FlyweightPrototype<DocumentCollection>
-
document
public Document document(long index) throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.
- Specified by:
- document in interface DocumentCollection
- Parameters:
- index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the index-th document.
- Throws:
- IOException
-
metadata
public Reference2ObjectMap<Enum<?>,Object> metadata(long index) throws IOException
Description copied from interface: DocumentCollection
Returns the metadata map for a document.
- Specified by:
- metadata in interface DocumentCollection
- Parameters:
- index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the metadata map for the document.
- Throws:
- IOException
-
size
public long size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.
- Specified by:
- size in interface DocumentCollection
- Returns:
- the number of documents in this collection.
-
stream
public InputStream stream(long index) throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.
- Specified by:
- stream in interface DocumentCollection
- Parameters:
- index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the raw content of the document as an input stream.
- Throws:
- IOException
-
close
public void close() throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.
You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.
- Specified by:
- close in interface AutoCloseable
- Specified by:
- close in interface Closeable
- Specified by:
- close in interface DocumentSequence
- Overrides:
- close in class AbstractDocumentSequence
- Throws:
- IOException
-
factory
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
- Specified by:
- factory in interface DocumentSequence
- Returns:
- the factory used by this sequence.
-
optimize
public static void optimize(CharSequence basename) throws IOException, ClassNotFoundException
- Throws:
- IOException
- ClassNotFoundException
-
main
public static void main(String[] arg) throws IOException, com.martiansoftware.jsap.JSAPException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException, org.apache.commons.configuration.ConfigurationException, ClassNotFoundException
- Throws:
- IOException
- com.martiansoftware.jsap.JSAPException
- InstantiationException
- IllegalAccessException
- InvocationTargetException
- NoSuchMethodException
- org.apache.commons.configuration.ConfigurationException
- ClassNotFoundException
-
-