it.unimi.di.mg4j.document
Class TRECDocumentCollection

java.lang.Object
  extended by it.unimi.di.mg4j.document.AbstractDocumentSequence
      extended by it.unimi.di.mg4j.document.AbstractDocumentCollection
          extended by it.unimi.di.mg4j.document.TRECDocumentCollection
All Implemented Interfaces:
DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable

public class TRECDocumentCollection
extends AbstractDocumentCollection
implements Serializable

A collection for the TREC GOV2 data set.

Documents are stored as a list of descriptors, each recording the (possibly gzipped) file that contains the document and the start and stop positions within that file. We rely on SegmentedInputStream to read the regions the descriptors delimit.

To interpret a file, we read up to <DOC> and place a start marker there; we then advance to the header and store the URI. An intermediate marker is placed at the end of the document header tag, and a stop marker just before </DOC>.

The resulting SegmentedInputStream has two segments per document. By using a CompositeDocumentFactory, the first segment is parsed by a TRECHeaderDocumentFactory, whereas the second segment is parsed by a user-provided factory—usually, an HtmlDocumentFactory.
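
The marker placement described above can be sketched with plain byte searches. This is a minimal illustration, not the code of this class: TrecMarkerSketch, indexOf and nextMarkers are hypothetical names, and we assume that the start marker sits on the first byte of <DOC> and that the intermediate marker sits immediately after </DOCHDR>. The first segment is then [start, intermediate) and the second is [intermediate, stop).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of MG4J: locates the three per-document markers in a buffer. */
final class TrecMarkerSketch {
    static final byte[] DOC_OPEN = "<DOC>".getBytes(StandardCharsets.US_ASCII);
    static final byte[] DOCHDR_CLOSE = "</DOCHDR>".getBytes(StandardCharsets.US_ASCII);
    static final byte[] DOC_CLOSE = "</DOC>".getBytes(StandardCharsets.US_ASCII);

    /** Returns the index of the first occurrence of pattern at or after from, or -1 if there is none. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    /**
     * Returns {start, intermediate, stop} for the first complete document at or after from,
     * or null if no complete document is found.
     */
    static int[] nextMarkers(byte[] data, int from) {
        int start = indexOf(data, DOC_OPEN, from);
        if (start < 0) return null;
        int headerClose = indexOf(data, DOCHDR_CLOSE, start);
        if (headerClose < 0) return null;
        // Assumption: the intermediate marker is the first byte after </DOCHDR>.
        int intermediate = headerClose + DOCHDR_CLOSE.length;
        int stop = indexOf(data, DOC_CLOSE, intermediate);
        if (stop < 0) return null;
        return new int[] { start, intermediate, stop };
    }
}
```

Calling nextMarkers(data, 0) and then again from the returned stop position walks the buffer one document at a time; a real implementation would work on a stream and record file offsets rather than array indices.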

The collection provides both sequential access to all documents via the iterator and random access to a given document. The two are implemented very differently, however: sequential iteration is much more efficient than calling document(int) repeatedly.

Author:
Alessio Orlandi, Luca Natali
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection
AbstractDocumentCollection.PropertyKeys
 
Field Summary
static String DEFAULT_BUFFER_SIZE
          Default buffer size, set up after some experiments.
protected  ObjectArrayList<it.unimi.di.mg4j.document.TRECDocumentCollection.TRECDocumentDescriptor> descriptors
          The list of document descriptors.
protected static byte[] DOC_CLOSE
           
protected static byte[] DOC_OPEN
           
protected static byte[] DOCHDR_CLOSE
           
protected static byte[] DOCHDR_OPEN
           
protected static byte[] DOCNO_CLOSE
           
protected static byte[] DOCNO_OPEN
           
protected  DocumentFactory factory
          The document factory.
 
Fields inherited from interface it.unimi.di.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
 
Constructor Summary
  TRECDocumentCollection(String[] file, DocumentFactory factory, int bufferSize, boolean useGzip)
          Creates a new TREC collection by parsing the given files.
protected TRECDocumentCollection(String[] file, DocumentFactory factory, ObjectArrayList<it.unimi.di.mg4j.document.TRECDocumentCollection.TRECDocumentDescriptor> descriptors, int bufferSize, boolean useGzip)
          Copy constructor (that is, the one used by copy()).
 
Method Summary
 void close()
          Closes this document sequence, releasing all resources.
 TRECDocumentCollection copy()
           
 Document document(int n)
          Returns the document given its index.
protected static boolean equals(byte[] a, int len, byte[] b)
           
 DocumentFactory factory()
          Returns the factory used by this sequence.
 DocumentIterator iterator()
          Returns an iterator over the sequence of documents.
static void main(String[] arg)
           
 void merge(TRECDocumentCollection other)
          Merges another collection into this one: the gzFile array is rebuilt with the other collection's array appended, and the descriptors are concatenated, rebuilding them all.
 Reference2ObjectMap<Enum<?>,Object> metadata(int index)
          Returns the metadata map for a document.
 int size()
          Returns the number of documents in this collection.
 InputStream stream(int n)
          Returns an input stream for the raw content of a document.
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, printAllDocuments, toString
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentSequence
filename, finalize, load
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface it.unimi.di.mg4j.document.DocumentSequence
filename
 

Field Detail

DEFAULT_BUFFER_SIZE

public static final String DEFAULT_BUFFER_SIZE
Default buffer size, set up after some experiments.

See Also:
Constant Field Values

factory

protected DocumentFactory factory
The document factory.


descriptors

protected transient ObjectArrayList<it.unimi.di.mg4j.document.TRECDocumentCollection.TRECDocumentDescriptor> descriptors
The list of document descriptors. We assume that descriptors within the same file are contiguous.
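
The contiguity assumption can be checked mechanically. The sketch below is only an illustration: DescriptorContiguitySketch and isContiguousByFile are hypothetical names, and each descriptor is reduced to the index of the file that contains it, which is an assumption about what a descriptor records.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of MG4J. */
final class DescriptorContiguitySketch {
    /**
     * fileIndex[i] is the (assumed) index of the file containing the i-th descriptor.
     * Returns true if every file index occupies a single uninterrupted run of positions.
     */
    static boolean isContiguousByFile(int[] fileIndex) {
        Set<Integer> finished = new HashSet<>();
        for (int i = 1; i < fileIndex.length; i++) {
            if (fileIndex[i] != fileIndex[i - 1]) {
                // The run of the previous file has ended; it must not reappear later.
                finished.add(fileIndex[i - 1]);
                if (finished.contains(fileIndex[i])) return false;
            }
        }
        return true;
    }
}
```

For instance, the sequence 0, 0, 1, 1, 2 satisfies the assumption, whereas 0, 1, 0 does not, because file 0 reappears after file 1.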


DOC_OPEN

protected static final byte[] DOC_OPEN

DOC_CLOSE

protected static final byte[] DOC_CLOSE

DOCNO_OPEN

protected static final byte[] DOCNO_OPEN

DOCNO_CLOSE

protected static final byte[] DOCNO_CLOSE

DOCHDR_OPEN

protected static final byte[] DOCHDR_OPEN

DOCHDR_CLOSE

protected static final byte[] DOCHDR_CLOSE
Constructor Detail

TRECDocumentCollection

protected TRECDocumentCollection(String[] file,
                                 DocumentFactory factory,
                                 ObjectArrayList<it.unimi.di.mg4j.document.TRECDocumentCollection.TRECDocumentDescriptor> descriptors,
                                 int bufferSize,
                                 boolean useGzip)
Copy constructor (that is, the one used by copy()). Just initializes final fields.


TRECDocumentCollection

public TRECDocumentCollection(String[] file,
                              DocumentFactory factory,
                              int bufferSize,
                              boolean useGzip)
                       throws IOException
Creates a new TREC collection by parsing the given files.

Parameters:
file - an array of file names containing documents in TREC GOV2 format.
factory - the document factory (usually, a composite one).
bufferSize - the buffer size.
useGzip - true iff the files are gzipped.
Throws:
IOException
Method Detail

equals

protected static boolean equals(byte[] a,
                                int len,
                                byte[] b)
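This method carries no description. A plausible reading, and it is only an assumption, is that it tests whether the first len bytes of a buffer coincide with a tag such as DOC_OPEN. The sketch below shows such a comparison under the hypothetical name prefixEquals; it is not taken from this class.

```java
/** Hypothetical helper, not part of MG4J. */
final class ByteTagMatchSketch {
    /**
     * Returns true if len equals b.length and the first len bytes of a equal b.
     * Assumes len does not exceed a.length.
     */
    static boolean prefixEquals(byte[] a, int len, byte[] b) {
        if (len != b.length) return false;
        // Compare from the last byte backwards; any mismatch ends the check.
        while (len-- != 0) {
            if (a[len] != b[len]) return false;
        }
        return true;
    }
}
```

With a line buffer that is reused across reads, passing the number of valid bytes as len avoids copying the buffer before comparing it with a tag.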

copy

public TRECDocumentCollection copy()
Specified by:
copy in interface DocumentCollection
Specified by:
copy in interface FlyweightPrototype<DocumentCollection>

size

public int size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.

Specified by:
size in interface DocumentCollection
Returns:
the number of documents in this collection.

document

public Document document(int n)
                  throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.

Specified by:
document in interface DocumentCollection
Parameters:
n - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the n-th document.
Throws:
IOException

stream

public InputStream stream(int n)
                   throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.

Specified by:
stream in interface DocumentCollection
Parameters:
n - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the raw content of the document as an input stream.
Throws:
IOException

metadata

public Reference2ObjectMap<Enum<?>,Object> metadata(int index)
Description copied from interface: DocumentCollection
Returns the metadata map for a document.

Specified by:
metadata in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the metadata map for the document.

factory

public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.

Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.

Specified by:
factory in interface DocumentSequence
Returns:
the factory used by this sequence.

close

public void close()
           throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.

You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.

Specified by:
close in interface DocumentSequence
Specified by:
close in interface Closeable
Overrides:
close in class AbstractDocumentSequence
Throws:
IOException

merge

public void merge(TRECDocumentCollection other)
Merges another collection into this one: the gzFile array is rebuilt with the other collection's array appended, and the descriptors are concatenated, rebuilding them all.

The other collection is assumed to contain no documents that duplicate those of this collection.
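
A merge of this kind can be sketched as follows. Everything here is illustrative: TrecMergeSketch and its Descriptor are hypothetical stand-ins, and we assume that each descriptor refers to its file by an index into the file array, so that appended descriptors must have their index shifted by the old array length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a collection holding a file list and a descriptor list. */
final class TrecMergeSketch {
    /** Hypothetical descriptor: an index into the file list plus a start offset in that file. */
    static final class Descriptor {
        final int fileIndex;
        final long start;

        Descriptor(int fileIndex, long start) {
            this.fileIndex = fileIndex;
            this.start = start;
        }
    }

    final List<String> files = new ArrayList<>();
    final List<Descriptor> descriptors = new ArrayList<>();

    TrecMergeSketch(String... files) {
        this.files.addAll(Arrays.asList(files));
    }

    /** Appends the files and descriptors of other, shifting file indices; other is left unchanged. */
    void merge(TrecMergeSketch other) {
        int shift = files.size();
        files.addAll(other.files);
        for (Descriptor d : other.descriptors) {
            // Copy rather than mutate, so that other remains usable after the merge.
            descriptors.add(new Descriptor(d.fileIndex + shift, d.start));
        }
    }
}
```

Since the other collection's descriptors are appended as a block, descriptors belonging to the same file remain adjacent after the merge, provided they were adjacent in both collections beforehand.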


iterator

public DocumentIterator iterator()
                          throws IOException
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.

Warning: this method can be safely called just one time. For instance, implementations based on standard input will usually throw an exception if this method is called twice.

Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.

Specified by:
iterator in interface DocumentSequence
Overrides:
iterator in class AbstractDocumentCollection
Returns:
an iterator over the sequence of documents.
Throws:
IOException
See Also:
DocumentCollection

main

public static void main(String[] arg)
                 throws IOException,
                        com.martiansoftware.jsap.JSAPException,
                        InstantiationException,
                        IllegalAccessException,
                        InvocationTargetException,
                        NoSuchMethodException
Throws:
IOException
com.martiansoftware.jsap.JSAPException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException