it.unimi.di.mg4j.document
Class SimpleCompressedDocumentCollection

java.lang.Object
  extended by it.unimi.di.mg4j.document.AbstractDocumentSequence
      extended by it.unimi.di.mg4j.document.AbstractDocumentCollection
          extended by it.unimi.di.mg4j.document.SimpleCompressedDocumentCollection
All Implemented Interfaces:
DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable

public class SimpleCompressedDocumentCollection
extends AbstractDocumentCollection
implements Serializable

A basic, compressed document collection that can be easily built at indexing time.

Instances of this class record virtual and non-text fields just like ZipDocumentCollection, that is, in a zip file. Text fields, however, are recorded in a simple but highly efficient format. Terms (and nonterms) are numbered globally, in increasing order, as they are encountered. While scanning each document, we keep track of frequencies for a limited number of terms: a term is encoded by its frequency rank if we have statistics for it, or by a special code derived from its global number if we do not. Every number involved is written in delta code.
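The frequency-rank idea described above can be illustrated with a small sketch. This is not the code of SimpleCompressedDocumentCollection.FrequencyCodec: the class name FrequencyRankSketch, the fixed capacity, the escape rule (capacity plus global number) and the update policy are all assumptions made for illustration. In a real encoder the returned numbers would then be written in delta code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: remaps frequently seen numbers to small ranks.
// Encoder and decoder apply the same updates, so they stay in sync.
final class FrequencyRankSketch {
    private final int capacity;                              // maximum number of tracked numbers (assumed)
    private final List<Long> byRank = new ArrayList<>();     // tracked numbers, most frequent first
    private final List<Integer> counts = new ArrayList<>();  // counts parallel to byRank

    FrequencyRankSketch(int capacity) {
        this.capacity = capacity;
    }

    // Returns the rank of a tracked number, or capacity + number as an escape code.
    long encode(long number) {
        int rank = byRank.indexOf(number);
        long code = rank >= 0 ? rank : capacity + number;
        update(number, rank);
        return code;
    }

    // Inverts encode(), assuming the same sequence of calls on a fresh instance.
    long decode(long code) {
        int rank = code < capacity ? (int) code : -1;
        long number = rank >= 0 ? byRank.get(rank) : code - capacity;
        update(number, rank);
        return number;
    }

    private void update(long number, int rank) {
        if (rank < 0) {
            if (byRank.size() == capacity) return;  // table full: the number stays untracked
            byRank.add(number);
            counts.add(0);
            rank = byRank.size() - 1;
        }
        int c = counts.get(rank) + 1;
        counts.set(rank, c);
        // Move the entry towards the front while it is strictly more frequent.
        while (rank > 0 && counts.get(rank - 1) < c) {
            Collections.swap(byRank, rank - 1, rank);
            Collections.swap(counts, rank - 1, rank);
            rank--;
        }
    }
}
```

With a capacity of 2, the first occurrence of term number 7 is escaped as 9, while later occurrences are coded as a small rank, which delta code represents in very few bits.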

A collection can be exact or approximated: in the latter case, nonwords are not recorded, and they are turned into spaces when decompressing.
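The difference between the two modes can be pictured with a minimal sketch. It is an illustration only: the class ApproximationSketch is hypothetical, and treating letters and digits as word characters is an assumption (the actual word/nonword split depends on the factory in use).

```java
// Hypothetical sketch of what a reader gets back from an exact
// versus an approximated collection.
final class ApproximationSketch {
    // Splits text into alternating word/nonword runs and rebuilds it.
    // In exact mode nonwords are kept; otherwise each nonword run becomes one space.
    static String roundTrip(String text, boolean exact) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int start = i;
            boolean word = Character.isLetterOrDigit(text.charAt(i));
            // Extend the run while the character class stays the same.
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i)) == word) i++;
            String token = text.substring(start, i);
            out.append(word || exact ? token : " ");
        }
        return out.toString();
    }
}
```

Under these assumptions, "Hello, world!" is returned unchanged in exact mode and as "Hello world " in approximated mode: the words survive, the punctuation does not.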

An instance of this collection will be, like any other collection, serialised to a file, but it will refer to several other files whose names are derived from the instance basename. Please use AbstractDocumentSequence.load(CharSequence) to load instances of this collection.

This class suffers from the same scalability problems as ZipDocumentCollection if you compress non-text or virtual fields. Text compression, on the other hand, is extremely efficient and scalable.

Author:
Sebastiano Vigna
See Also:
Serialized Form

Nested Class Summary
protected static class SimpleCompressedDocumentCollection.FrequencyCodec
          A simple codec for integers that remaps frequent numbers to smaller numbers.
 
Nested classes/interfaces inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection
AbstractDocumentCollection.PropertyKeys
 
Field Summary
protected static boolean ASSERTS
           
static String DOCUMENT_OFFSETS_EXTENSION
          Standard extension for the file containing document offsets stored as δ-encoded gaps.
static String DOCUMENTS_EXTENSION
          Standard extension for the file containing encoded documents.
static String NONTERM_OFFSETS_EXTENSION
          Standard extension for the file containing nonterm offsets stored as δ-encoded gaps.
static String NONTERMS_EXTENSION
          Standard extension for the file containing nonterms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format.
static String STATS_EXTENSION
          Standard extension for the stats file.
static String TERM_OFFSETS_EXTENSION
          Standard extension for the file containing term offsets stored as δ-encoded gaps.
static String TERMS_EXTENSION
          Standard extension for the file containing terms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format.
 
Fields inherited from interface it.unimi.di.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
 
Constructor Summary
protected SimpleCompressedDocumentCollection(String basename, long documents, long terms, long nonTerms, boolean exact, DocumentFactory factory)
           
 
Method Summary
 void close()
          Closes this document sequence, releasing all resources.
 DocumentCollection copy()
           
 Document document(int index)
          Returns the document given its index.
 DocumentFactory factory()
          Returns the factory used by this sequence.
 void filename(CharSequence filename)
          Does nothing.
static void main(String[] arg)
           
 Reference2ObjectMap<Enum<?>,Object> metadata(int index)
          Returns the metadata map for a document.
static void optimize(CharSequence basename)
           
 int size()
          Returns the number of documents in this collection.
 InputStream stream(int index)
          Returns an input stream for the raw content of a document.
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, iterator, printAllDocuments, toString
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentSequence
finalize, load
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

ASSERTS

protected static final boolean ASSERTS
See Also:
Constant Field Values

DOCUMENTS_EXTENSION

public static final String DOCUMENTS_EXTENSION
Standard extension for the file containing encoded documents.

See Also:
Constant Field Values

DOCUMENT_OFFSETS_EXTENSION

public static final String DOCUMENT_OFFSETS_EXTENSION
Standard extension for the file containing document offsets stored as δ-encoded gaps.

See Also:
Constant Field Values
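
To illustrate what "offsets stored as δ-encoded gaps" means, here is a minimal, self-contained sketch. It does not reproduce the on-disk format of this class: the class OffsetGapSketch is hypothetical, bits are modelled as a string of '0' and '1' for readability, the first gap is taken relative to 0, and each gap is shifted by one so that a zero gap can be coded (Elias δ code is defined for integers starting at 1). All of these are assumptions.

```java
// Hypothetical sketch: increasing offsets are turned into gaps,
// and each gap is written in Elias delta code.
final class OffsetGapSketch {
    // Elias delta code of x (x >= 1) as a string of bits.
    static String delta(long x) {
        if (x < 1) throw new IllegalArgumentException("x must be >= 1");
        int n = 64 - Long.numberOfLeadingZeros(x);         // number of significant bits of x
        int l = 31 - Integer.numberOfLeadingZeros(n);      // floor(log2(n))
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < l; i++) sb.append('0');        // l zeros announce the length of n
        sb.append(Integer.toBinaryString(n));              // n in l + 1 bits
        sb.append(Long.toBinaryString(x).substring(1));    // x without its leading 1
        return sb.toString();
    }

    // Encodes nondecreasing offsets as delta-coded gaps (gap + 1 is coded).
    static String encode(long[] offsets) {
        StringBuilder sb = new StringBuilder();
        long prev = 0;
        for (long offset : offsets) {
            sb.append(delta(offset - prev + 1));
            prev = offset;
        }
        return sb.toString();
    }

    // Decodes count offsets from a bit string produced by encode().
    static long[] decode(String bits, int count) {
        long[] offsets = new long[count];
        int pos = 0;
        long prev = 0;
        for (int i = 0; i < count; i++) {
            int l = 0;
            while (bits.charAt(pos) == '0') { l++; pos++; }
            int n = Integer.parseInt(bits.substring(pos, pos + l + 1), 2);
            pos += l + 1;
            long value = Long.parseLong("1" + bits.substring(pos, pos + n - 1), 2);
            pos += n - 1;
            prev += value - 1;                             // undo the shift, then accumulate
            offsets[i] = prev;
        }
        return offsets;
    }
}
```

Because consecutive offsets tend to be close to each other, the gaps are small numbers, and small numbers have short δ codes; that is the motivation for storing gaps rather than absolute offsets.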

TERMS_EXTENSION

public static final String TERMS_EXTENSION
Standard extension for the file containing terms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format.

See Also:
Constant Field Values

TERM_OFFSETS_EXTENSION

public static final String TERM_OFFSETS_EXTENSION
Standard extension for the file containing term offsets stored as δ-encoded gaps.

See Also:
Constant Field Values

NONTERMS_EXTENSION

public static final String NONTERMS_EXTENSION
Standard extension for the file containing nonterms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format.

See Also:
Constant Field Values

NONTERM_OFFSETS_EXTENSION

public static final String NONTERM_OFFSETS_EXTENSION
Standard extension for the file containing nonterm offsets stored as δ-encoded gaps.

See Also:
Constant Field Values

STATS_EXTENSION

public static final String STATS_EXTENSION
Standard extension for the stats file.

See Also:
Constant Field Values
Constructor Detail

SimpleCompressedDocumentCollection

protected SimpleCompressedDocumentCollection(String basename,
                                             long documents,
                                             long terms,
                                             long nonTerms,
                                             boolean exact,
                                             DocumentFactory factory)
Method Detail

filename

public void filename(CharSequence filename)
              throws IOException
Description copied from class: AbstractDocumentSequence
Does nothing.

Specified by:
filename in interface DocumentSequence
Overrides:
filename in class AbstractDocumentSequence
Parameters:
filename - the filename of this document sequence.
Throws:
IOException

copy

public DocumentCollection copy()
Specified by:
copy in interface DocumentCollection
Specified by:
copy in interface FlyweightPrototype<DocumentCollection>

document

public Document document(int index)
                  throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.

Specified by:
document in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the index-th document.
Throws:
IOException

metadata

public Reference2ObjectMap<Enum<?>,Object> metadata(int index)
                                             throws IOException
Description copied from interface: DocumentCollection
Returns the metadata map for a document.

Specified by:
metadata in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the metadata map for the document.
Throws:
IOException

size

public int size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.

Specified by:
size in interface DocumentCollection
Returns:
the number of documents in this collection.

stream

public InputStream stream(int index)
                   throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.

Specified by:
stream in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the raw content of the document as an input stream.
Throws:
IOException

close

public void close()
           throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.

You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.

Specified by:
close in interface DocumentSequence
Specified by:
close in interface Closeable
Overrides:
close in class AbstractDocumentSequence
Throws:
IOException
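
Since this class implements Closeable, one way to follow the advice above is a try-with-resources statement, which calls close() even when the body throws. The sketch below uses a hypothetical stand-in (FakeCollection) instead of a real collection so that it is self-contained; with MG4J you would place the loaded collection in the resource position instead.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of deterministic closing with try-with-resources.
final class CloseSketch {
    // Stand-in for a document collection; only records whether it was closed.
    static final class FakeCollection implements Closeable {
        boolean closed;

        int size() {
            return 3;
        }

        @Override
        public void close() throws IOException {
            closed = true;
        }
    }

    // Uses the resource and reports whether it was closed afterwards.
    static boolean useAndClose() {
        FakeCollection collection = new FakeCollection();
        try (FakeCollection c = collection) {
            c.size();  // work with the collection here
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collection.closed;
    }
}
```

This avoids relying on finalisers, whose timing is not guaranteed.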

factory

public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.

Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.

Specified by:
factory in interface DocumentSequence
Returns:
the factory used by this sequence.

optimize

public static void optimize(CharSequence basename)
                     throws IOException,
                            ClassNotFoundException
Throws:
IOException
ClassNotFoundException

main

public static void main(String[] arg)
                 throws IOException,
                        com.martiansoftware.jsap.JSAPException,
                        InstantiationException,
                        IllegalAccessException,
                        InvocationTargetException,
                        NoSuchMethodException,
                        ConfigurationException,
                        ClassNotFoundException
Throws:
IOException
com.martiansoftware.jsap.JSAPException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
ConfigurationException
ClassNotFoundException