java.lang.Object
  it.unimi.di.mg4j.document.AbstractDocumentSequence
    it.unimi.di.mg4j.document.AbstractDocumentCollection
      it.unimi.di.mg4j.document.TRECDocumentCollection
public class TRECDocumentCollection
A collection for the TREC GOV2 data set.
The documents are stored as a set of descriptors, representing the (possibly gzipped) file
they are contained in and the start and stop position in that file. To manage
descriptors later we rely on SegmentedInputStream.
To interpret a file, we read up to <DOC> and place a start marker there; we then advance to the header and store the URI. An intermediate marker is placed at the end of the doc header tag, and a stop marker just before </DOC>.
The resulting SegmentedInputStream has two segments per document. By using a
CompositeDocumentFactory, the first segment is parsed by a TRECHeaderDocumentFactory,
whereas the second segment is parsed by a user-provided factory (usually, an
HtmlDocumentFactory).
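The marker placement described above can be sketched as follows. This is a minimal, self-contained illustration and not the MG4J implementation: the class name TrecMarkerSketch, the helper indexOf, and the use of an in-memory byte array are all assumptions, as is the choice to put the start marker on the first byte of <DOC>.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the three markers per document; not MG4J code. */
class TrecMarkerSketch {
    static final byte[] DOC_OPEN = "<DOC>".getBytes(StandardCharsets.US_ASCII);
    static final byte[] DOCHDR_CLOSE = "</DOCHDR>".getBytes(StandardCharsets.US_ASCII);
    static final byte[] DOC_CLOSE = "</DOC>".getBytes(StandardCharsets.US_ASCII);

    /** Returns the first index >= from at which pattern occurs in data, or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    /** Returns one {start, intermediate, stop} triple for each complete document in data. */
    static List<long[]> scan(byte[] data) {
        List<long[]> markers = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, DOC_OPEN, pos);           // start marker (assumed: at <DOC>)
            if (start < 0) break;
            int headerClose = indexOf(data, DOCHDR_CLOSE, start);
            if (headerClose < 0) break;
            int intermediate = headerClose + DOCHDR_CLOSE.length; // end of the header closing tag
            int stop = indexOf(data, DOC_CLOSE, intermediate);   // stop marker just before </DOC>
            if (stop < 0) break;
            markers.add(new long[] { start, intermediate, stop });
            pos = stop + DOC_CLOSE.length;
        }
        return markers;
    }
}
```

Under these assumptions, bytes from start to intermediate form the header segment and bytes from intermediate to stop form the content segment, which matches the two-segments-per-document layout described above.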
The collection provides both sequential access to all documents via the
iterator and random access to a given document. However, the two operations
are implemented very differently: sequential access is much more
efficient than calling document(int) repeatedly.
Nested Class Summary |
---|
Nested classes/interfaces inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection |
---|
AbstractDocumentCollection.PropertyKeys |
Field Summary | |
---|---|
static String |
DEFAULT_BUFFER_SIZE
Default buffer size, set up after some experiments. |
protected ObjectArrayList<it.unimi.di.mg4j.document.TRECDocumentCollection.TRECDocumentDescriptor> |
descriptors
The list of document descriptors. |
protected static byte[] |
DOC_CLOSE
|
protected static byte[] |
DOC_OPEN
|
protected static byte[] |
DOCHDR_CLOSE
|
protected static byte[] |
DOCHDR_OPEN
|
protected static byte[] |
DOCNO_CLOSE
|
protected static byte[] |
DOCNO_OPEN
|
protected DocumentFactory |
factory
The document factory. |
Fields inherited from interface it.unimi.di.mg4j.document.DocumentCollection |
---|
DEFAULT_EXTENSION |
Constructor Summary | |
---|---|
|
TRECDocumentCollection(String[] file,
DocumentFactory factory,
int bufferSize,
boolean useGzip)
Creates a new TREC collection by parsing the given files. |
protected |
TRECDocumentCollection(String[] file,
DocumentFactory factory,
ObjectArrayList<it.unimi.di.mg4j.document.TRECDocumentCollection.TRECDocumentDescriptor> descriptors,
int bufferSize,
boolean useGzip)
Copy constructor (that is, the one used by copy()). |
Method Summary | |
---|---|
void |
close()
Closes this document sequence, releasing all resources. |
TRECDocumentCollection |
copy()
|
Document |
document(int n)
Returns the document given its index. |
protected static boolean |
equals(byte[] a,
int len,
byte[] b)
|
DocumentFactory |
factory()
Returns the factory used by this sequence. |
DocumentIterator |
iterator()
Returns an iterator over the sequence of documents. |
static void |
main(String[] arg)
|
void |
merge(TRECDocumentCollection other)
Merges another collection into this one by rebuilding the gzFile array with the other collection's files appended, and by concatenating and rebuilding the descriptors. |
Reference2ObjectMap<Enum<?>,Object> |
metadata(int index)
Returns the metadata map for a document. |
int |
size()
Returns the number of documents in this collection. |
InputStream |
stream(int n)
Returns an input stream for the raw content of a document. |
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection |
---|
ensureDocumentIndex, printAllDocuments, toString |
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentSequence |
---|
filename, finalize, load |
Methods inherited from class java.lang.Object |
---|
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Methods inherited from interface it.unimi.di.mg4j.document.DocumentSequence |
---|
filename |
Field Detail |
---|
public static final String DEFAULT_BUFFER_SIZE
protected DocumentFactory factory
protected transient ObjectArrayList<it.unimi.di.mg4j.document.TRECDocumentCollection.TRECDocumentDescriptor> descriptors
protected static final byte[] DOC_OPEN
protected static final byte[] DOC_CLOSE
protected static final byte[] DOCNO_OPEN
protected static final byte[] DOCNO_CLOSE
protected static final byte[] DOCHDR_OPEN
protected static final byte[] DOCHDR_CLOSE
Constructor Detail |
---|
protected TRECDocumentCollection(String[] file, DocumentFactory factory, ObjectArrayList<it.unimi.di.mg4j.document.TRECDocumentCollection.TRECDocumentDescriptor> descriptors, int bufferSize, boolean useGzip)
Copy constructor (that is, the one used by copy()). Just initializes final fields.
public TRECDocumentCollection(String[] file, DocumentFactory factory, int bufferSize, boolean useGzip) throws IOException
Parameters:
file - an array of file names containing documents in TREC GOV2 format.
factory - the document factory (usually, a composite one).
bufferSize - the buffer size.
useGzip - true iff the files are gzipped.
Throws:
IOException
Method Detail |
---|
protected static boolean equals(byte[] a, int len, byte[] b)
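The page gives no description for this helper. One plausible reading of the signature, stated here only as an assumption, is that it checks whether the first len bytes of a buffer a coincide with a tag such as DOC_OPEN passed as b. The class name PrefixEquals below is hypothetical.

```java
/** Hypothetical sketch of a prefix comparison; the actual MG4J semantics are not documented here. */
class PrefixEquals {
    /** Returns true iff len equals b.length and the first len bytes of a are the bytes of b. */
    static boolean equals(byte[] a, int len, byte[] b) {
        if (len != b.length) return false;
        for (int i = 0; i < len; i++) {
            if (a[i] != b[i]) return false;
        }
        return true;
    }
}
```

In this sketch the caller is responsible for passing a len no larger than a.length.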
public TRECDocumentCollection copy()
copy in interface DocumentCollection
copy in interface FlyweightPrototype<DocumentCollection>
public int size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.
size in interface DocumentCollection
public Document document(int n) throws IOException
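Description copied from interface: DocumentCollection
Returns the document given its index.
document in interface DocumentCollection
Parameters:
n - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the n-th document.
Throws:
IOException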
public InputStream stream(int n) throws IOException
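Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.
stream in interface DocumentCollection
Parameters:
n - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Throws:
IOException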
public Reference2ObjectMap<Enum<?>,Object> metadata(int index)
Description copied from interface: DocumentCollection
Returns the metadata map for a document.
metadata in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
factory in interface DocumentSequence
public void close() throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.
You should always call this method after having finished with this document sequence.
Implementations are invited to call this method in a finaliser as a safety net (even better,
implement SafelyCloseable), but since there is no guarantee as to when finalisers are
invoked, you should not depend on this behaviour.
close in interface DocumentSequence
close in interface Closeable
close in class AbstractDocumentSequence
Throws:
IOException
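Because close() is specified by Closeable, one way to make sure it is always called is a try-with-resources statement. The sketch below uses a hypothetical stand-in class (FakeSequence) rather than a real collection, so that it stays self-contained; with an actual TRECDocumentCollection the same pattern should apply, with IOException handled by the caller.

```java
import java.io.Closeable;

/** Hypothetical illustration of closing a Closeable sequence deterministically. */
class CloseSketch {
    /** Stand-in for a document sequence; it only records whether close() was called. */
    static final class FakeSequence implements Closeable {
        boolean closed;

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Opens a sequence, uses it, and returns it after try-with-resources has closed it. */
    static FakeSequence useAndClose() {
        FakeSequence used;
        try (FakeSequence sequence = new FakeSequence()) {
            used = sequence;
            // ... read documents from the sequence here ...
        }
        return used;
    }
}
```

This avoids relying on a finaliser, which the description above warns against.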
public void merge(TRECDocumentCollection other)
It is assumed that the passed collection contains no documents that duplicate those already in this collection.
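A minimal sketch of what such a merge could look like is given below. It is not the MG4J implementation: the Descriptor layout (a file index plus start and stop offsets), the class name MergeSketch, and the index-shifting strategy are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of merging two file-backed collections; not MG4J code. */
class MergeSketch {
    /** Assumed descriptor layout: index into the file array plus byte offsets. */
    static final class Descriptor {
        final int fileIndex;
        final long start;
        final long stop;

        Descriptor(int fileIndex, long start, long stop) {
            this.fileIndex = fileIndex;
            this.start = start;
            this.stop = stop;
        }
    }

    String[] files;
    final List<Descriptor> descriptors = new ArrayList<>();

    MergeSketch(String... files) {
        this.files = files.clone();
    }

    /** Appends the other file array and re-bases the other descriptors onto the merged array. */
    void merge(MergeSketch other) {
        int offset = files.length;
        String[] merged = Arrays.copyOf(files, offset + other.files.length);
        System.arraycopy(other.files, 0, merged, offset, other.files.length);
        files = merged;
        for (Descriptor d : other.descriptors) {
            descriptors.add(new Descriptor(d.fileIndex + offset, d.start, d.stop));
        }
    }
}
```

Like the method documented above, this sketch performs no duplicate detection; it relies on the caller passing a collection with no documents in common with the receiver.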
public DocumentIterator iterator() throws IOException
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.
Warning: this method can be safely called just one time. For instance, implementations based on standard input will usually throw an exception if this method is called twice.
Implementations may decide to override this restriction (in particular, if they implement
DocumentCollection). Usually, however, it is not possible to obtain two iterators at the
same time on a collection.
iterator in interface DocumentSequence
iterator in class AbstractDocumentCollection
Throws:
IOException
See Also:
DocumentCollection
public static void main(String[] arg) throws IOException, com.martiansoftware.jsap.JSAPException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Throws:
IOException
com.martiansoftware.jsap.JSAPException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException