java.lang.Object
  it.unimi.di.mg4j.document.AbstractDocumentSequence
    it.unimi.di.mg4j.document.AbstractDocumentCollection
      it.unimi.di.mg4j.document.SimpleCompressedDocumentCollection
public class SimpleCompressedDocumentCollection
A basic, compressed document collection that can be easily built at indexing time.
Instances of this class record virtual and non-text fields just like ZipDocumentCollection, that is,
in a zip file. However, text fields are recorded in a simple but highly efficient format. Terms (and nonterms) are numbered globally
in increasing order as they are encountered. While we scan each document, we keep track of frequencies for a limited number of terms:
a term is encoded with its frequency rank if we know its statistics, or with a special code derived from its
global number if we have no statistics about it. Every number involved is written in delta code.
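The rank-or-escape idea described above can be illustrated with a minimal sketch. It is not the code of SimpleCompressedDocumentCollection.FrequencyCodec: the class name `FrequencyRankSketch`, the fixed list of tracked terms, and the escape rule (number of tracked terms plus the global number) are all assumptions made for illustration, and the sketch leaves out how statistics are updated while a document is scanned.

```java
import java.util.List;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of rank-or-escape coding; not the library's codec. */
final class FrequencyRankSketch {
    private final Map<Integer, Integer> rankOf = new HashMap<>();
    private final int[] byRank;

    /** @param byDescendingFrequency global numbers of the tracked terms, most frequent first. */
    FrequencyRankSketch(List<Integer> byDescendingFrequency) {
        byRank = new int[byDescendingFrequency.size()];
        for (int r = 0; r < byRank.length; r++) {
            byRank[r] = byDescendingFrequency.get(r);
            rankOf.put(byRank[r], r);
        }
    }

    /** Tracked terms map to their rank; all others map to an escape code (assumed rule). */
    int encode(int globalNumber) {
        Integer rank = rankOf.get(globalNumber);
        return rank != null ? rank : byRank.length + globalNumber;
    }

    /** Inverts encode(): small codes are ranks, larger codes carry the global number. */
    int decode(int code) {
        return code < byRank.length ? byRank[code] : code - byRank.length;
    }
}
```

Because frequent terms receive small codes, a universal code such as delta code spends fewer bits on them, which is presumably why the two techniques are combined.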
A collection can be exact or approximated: in the latter case, nonwords will not be recorded, and will be turned into spaces when decompressing.
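The difference between the two modes can be sketched as follows. This is only an illustration under assumptions: the class `ApproximateSketch`, the parallel arrays of words and nonwords, and the choice of a single space per dropped nonword are hypothetical and do not come from the library.

```java
/** Hypothetical illustration of exact versus approximated reconstruction. */
final class ApproximateSketch {
    /**
     * Rebuilds a text from words, each followed by a nonword.
     * In exact mode the recorded nonwords are restored; otherwise
     * each one is replaced by a single space (assumed convention).
     */
    static String rebuild(String[] words, String[] nonwords, boolean exact) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            sb.append(words[i]);
            sb.append(exact ? nonwords[i] : " ");
        }
        return sb.toString();
    }
}
```

With the words `Hello` and `world` and the nonwords `, ` and `!`, exact mode gives back the original punctuation, while approximated mode keeps only the word sequence.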
An instance of this collection will be, like any other collection, serialised to a file, but it will refer to several other files
that are derived from the instance basename. Please use AbstractDocumentSequence.load(CharSequence)
to load instances of this collection.
This class suffers from the same scalability problem as ZipDocumentCollection
if you compress non-text or virtual fields. Text
compression, on the other hand, is extremely efficient and scalable.
Nested Class Summary | |
---|---|
protected static class | SimpleCompressedDocumentCollection.FrequencyCodec: A simple codec for integers that remaps frequent numbers to smaller numbers. |
Nested classes/interfaces inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection |
---|
AbstractDocumentCollection.PropertyKeys |
Field Summary | |
---|---|
protected static boolean | ASSERTS |
static String | DOCUMENT_OFFSETS_EXTENSION: Standard extension for the file containing document offsets stored as δ-encoded gaps. |
static String | DOCUMENTS_EXTENSION: Standard extension for the file containing encoded documents. |
static String | NONTERM_OFFSETS_EXTENSION: Standard extension for the file containing nonterm offsets stored as δ-encoded gaps. |
static String | NONTERMS_EXTENSION: Standard extension for the file containing nonterms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format. |
static String | STATS_EXTENSION: Standard extension for the stats file. |
static String | TERM_OFFSETS_EXTENSION: Standard extension for the file containing term offsets stored as δ-encoded gaps. |
static String | TERMS_EXTENSION: Standard extension for the file containing terms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format. |
Fields inherited from interface it.unimi.di.mg4j.document.DocumentCollection |
---|
DEFAULT_EXTENSION |
Constructor Summary | |
---|---|
protected | SimpleCompressedDocumentCollection(String basename, long documents, long terms, long nonTerms, boolean exact, DocumentFactory factory) |
Method Summary | |
---|---|
void | close(): Closes this document sequence, releasing all resources. |
DocumentCollection | copy() |
Document | document(int index): Returns the document given its index. |
DocumentFactory | factory(): Returns the factory used by this sequence. |
void | filename(CharSequence filename): Does nothing. |
static void | main(String[] arg) |
Reference2ObjectMap<Enum<?>,Object> | metadata(int index): Returns the metadata map for a document. |
static void | optimize(CharSequence basename) |
int | size(): Returns the number of documents in this collection. |
InputStream | stream(int index): Returns an input stream for the raw content of a document. |
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection |
---|
ensureDocumentIndex, iterator, printAllDocuments, toString |
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentSequence |
---|
finalize, load |
Methods inherited from class java.lang.Object |
---|
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
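protected static final boolean ASSERTS
public static final String DOCUMENTS_EXTENSION
Standard extension for the file containing encoded documents.
public static final String DOCUMENT_OFFSETS_EXTENSION
Standard extension for the file containing document offsets stored as δ-encoded gaps.
public static final String TERMS_EXTENSION
Standard extension for the file containing terms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format.
public static final String TERM_OFFSETS_EXTENSION
Standard extension for the file containing term offsets stored as δ-encoded gaps.
public static final String NONTERMS_EXTENSION
Standard extension for the file containing nonterms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format.
public static final String NONTERM_OFFSETS_EXTENSION
Standard extension for the file containing nonterm offsets stored as δ-encoded gaps.
public static final String STATS_EXTENSION
Standard extension for the stats file.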
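The field summary describes the offsets files as holding δ-encoded gaps. The following sketch shows what that could look like, using the textbook Elias δ code written as a string of `0` and `1` characters for readability. The class `DeltaGapSketch` is hypothetical, and storing each gap plus one (so that a zero gap is representable) is an assumption of the sketch; the library's bit-level conventions may differ.

```java
/** Hypothetical illustration of gap encoding with Elias delta codes. */
final class DeltaGapSketch {
    /** Appends the Elias delta code of x (x >= 1) as '0'/'1' characters. */
    static void writeDelta(StringBuilder out, long x) {
        if (x < 1) throw new IllegalArgumentException("x must be positive");
        int n = 63 - Long.numberOfLeadingZeros(x);       // floor(log2 x)
        int len = n + 1;                                 // bit length of x
        int l = 31 - Integer.numberOfLeadingZeros(len);  // floor(log2 len)
        for (int i = 0; i < l; i++) out.append('0');     // unary prefix
        for (int i = l; i >= 0; i--) out.append(((len >>> i) & 1) == 0 ? '0' : '1');
        for (int i = n - 1; i >= 0; i--) out.append(((x >>> i) & 1L) == 0 ? '0' : '1');
    }

    /** Reads one delta code starting at pos[0] and advances pos[0] past it. */
    static long readDelta(String bits, int[] pos) {
        int l = 0;
        while (bits.charAt(pos[0]) == '0') { l++; pos[0]++; }
        int len = 0;
        for (int i = 0; i <= l; i++) len = (len << 1) | (bits.charAt(pos[0]++) - '0');
        long x = 1;                                      // implicit leading one bit
        for (int i = 1; i < len; i++) x = (x << 1) | (bits.charAt(pos[0]++) - '0');
        return x;
    }

    /** Encodes nondecreasing offsets as delta codes of (gap + 1). */
    static String encodeOffsets(long[] offsets) {
        StringBuilder out = new StringBuilder();
        long prev = 0;
        for (long o : offsets) { writeDelta(out, o - prev + 1); prev = o; }
        return out.toString();
    }

    /** Decodes count offsets by summing the decoded gaps. */
    static long[] decodeOffsets(String bits, int count) {
        long[] offsets = new long[count];
        int[] pos = {0};
        long prev = 0;
        for (int i = 0; i < count; i++) { prev += readDelta(bits, pos) - 1; offsets[i] = prev; }
        return offsets;
    }
}
```

Storing gaps rather than absolute offsets keeps the numbers small, and δ code assigns short codewords to small numbers, so long offset lists stay compact.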
Constructor Detail |
---|
protected SimpleCompressedDocumentCollection(String basename, long documents, long terms, long nonTerms, boolean exact, DocumentFactory factory)
Method Detail |
---|
public void filename(CharSequence filename) throws IOException
Description copied from class: AbstractDocumentSequence
Does nothing.
Specified by: filename in interface DocumentSequence
Overrides: filename in class AbstractDocumentSequence
Parameters: filename - the filename of this document sequence.
Throws: IOException
public DocumentCollection copy()
Specified by: copy in interface DocumentCollection
Specified by: copy in interface FlyweightPrototype<DocumentCollection>
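public Document document(int index) throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.
Specified by: document in interface DocumentCollection
Parameters: index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns: the index-th document.
Throws: IOException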
public Reference2ObjectMap<Enum<?>,Object> metadata(int index) throws IOException
Description copied from interface: DocumentCollection
Returns the metadata map for a document.
Specified by: metadata in interface DocumentCollection
Parameters: index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Throws: IOException
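public int size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.
Specified by: size in interface DocumentCollection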
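public InputStream stream(int index) throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.
Specified by: stream in interface DocumentCollection
Parameters: index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Throws: IOException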
public void close() throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.
You should always call this method after having finished with this document sequence.
Implementations are invited to call this method in a finaliser as a safety net (even better,
implement SafelyCloseable), but since there
is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.
Specified by: close in interface DocumentSequence
Specified by: close in interface Closeable
Overrides: close in class AbstractDocumentSequence
Throws: IOException
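Since close() is specified by Closeable, one way to make sure it is always called is a try-with-resources statement. The sketch below demonstrates the pattern with a hypothetical stand-in class, `FakeSequence`, so that it runs without the library; with a real collection the resource would be the loaded document sequence instead.

```java
import java.io.Closeable;

/** Hypothetical illustration of closing a sequence with try-with-resources. */
final class CloseSketch {
    /** Stand-in for a document sequence; not part of the library. */
    static final class FakeSequence implements Closeable {
        boolean closed;

        String read() {
            if (closed) throw new IllegalStateException("already closed");
            return "doc";
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Uses the sequence and reports whether it was closed on exit from the try block. */
    static boolean useAndClose() {
        FakeSequence seq = new FakeSequence();
        try (FakeSequence s = seq) {
            s.read(); // work with the sequence while it is open
        }
        return seq.closed;
    }
}
```

The resource is closed when the block exits, whether normally or through an exception, so there is no need to rely on finalisers.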
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
Specified by: factory in interface DocumentSequence
public static void optimize(CharSequence basename) throws IOException, ClassNotFoundException
Throws: IOException, ClassNotFoundException
public static void main(String[] arg) throws IOException, com.martiansoftware.jsap.JSAPException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException, ConfigurationException, ClassNotFoundException
Throws: IOException, com.martiansoftware.jsap.JSAPException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException, ConfigurationException, ClassNotFoundException