Class TRECDocumentCollection
- java.lang.Object
  - it.unimi.di.big.mg4j.document.AbstractDocumentSequence
    - it.unimi.di.big.mg4j.document.AbstractDocumentCollection
      - it.unimi.di.big.mg4j.document.TRECDocumentCollection
- All Implemented Interfaces:
DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable, AutoCloseable
public class TRECDocumentCollection extends AbstractDocumentCollection implements Serializable
A collection for the TREC GOV2 data set.
The documents are stored as a set of descriptors, each representing the (possibly gzipped) file the document is contained in and the start and stop positions in that file. To manage descriptors later we rely on SegmentedInputStream.
To interpret a file, we read up to <DOC> and place a start marker there, then advance to the header and store the URI. An intermediate marker is placed at the end of the document header tag and a stop marker just before </DOC>.
The resulting SegmentedInputStream has two segments per document. By using a CompositeDocumentFactory, the first segment is parsed by a TRECHeaderDocumentFactory, whereas the second segment is parsed by a user-provided factory (usually, an HtmlDocumentFactory).
The collection provides both sequential access to all documents via the iterator and random access to a given document. However, the two operations are performed very differently, as the sequential operation is much more efficient than calling document(long) repeatedly.
- Author:
- Alessio Orlandi, Luca Natali
- See Also:
- Serialized Form
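The marker placement described above can be illustrated with a minimal, standard-library-only sketch. It is not the MG4J implementation: the class TrecMarkerSketch, the Marks type, and the tag byte values are assumptions made for illustration, and the start marker is assumed to sit at the first byte of <DOC>. The three offsets delimit the two segments, header [start, intermediate) and body [intermediate, stop).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of MG4J: locates the three markers per document.
final class TrecMarkerSketch {
    // Assumed tag values; the documentation above does not list the actual constants.
    static final byte[] DOC_OPEN = "<DOC>".getBytes(StandardCharsets.US_ASCII);
    static final byte[] DOC_CLOSE = "</DOC>".getBytes(StandardCharsets.US_ASCII);
    static final byte[] DOCHDR_CLOSE = "</DOCHDR>".getBytes(StandardCharsets.US_ASCII);

    /** Hypothetical descriptor: three offsets that delimit two segments. */
    public static final class Marks {
        public final int start, intermediate, stop;

        Marks(int start, int intermediate, int stop) {
            this.start = start;
            this.intermediate = intermediate;
            this.stop = stop;
        }
    }

    /** Returns the first index >= from at which pattern occurs in data, or -1. */
    public static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    /** Scans the whole buffer and records one Marks instance per complete document. */
    public static List<Marks> scan(byte[] data) {
        List<Marks> result = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, DOC_OPEN, pos);            // start marker at <DOC>
            if (start < 0) break;
            int headerClose = indexOf(data, DOCHDR_CLOSE, start);
            if (headerClose < 0) break;
            int intermediate = headerClose + DOCHDR_CLOSE.length; // just past </DOCHDR>
            int stop = indexOf(data, DOC_CLOSE, intermediate);    // just before </DOC>
            if (stop < 0) break;
            result.add(new Marks(start, intermediate, stop));
            pos = stop + DOC_CLOSE.length;
        }
        return result;
    }

    /** Decodes the bytes in [from, to) as ASCII, e.g. to inspect one segment. */
    public static String segment(byte[] data, int from, int to) {
        return new String(data, from, to - from, StandardCharsets.US_ASCII);
    }
}
```

Calling TrecMarkerSketch.scan on the bytes of a file yields one Marks per document; segment(data, m.intermediate, m.stop) then gives the part that, in the real class, would be handed to the user-provided factory.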
-
-
Nested Class Summary
Nested Classes
Modifier and Type Class Description
protected static class TRECDocumentCollection.TRECDocumentDescriptor
A compact description of the location and of the internal segmentation of a TREC document inside a file.
-
Nested classes/interfaces inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
AbstractDocumentCollection.PropertyKeys
-
-
Field Summary
Fields
Modifier and Type Field Description
static String DEFAULT_BUFFER_SIZE
Default buffer size, set up after some experiments.
protected ObjectBigArrayBigList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors
The list of document descriptors.
protected static byte[] DOC_CLOSE
protected static byte[] DOC_OPEN
protected static byte[] DOCHDR_CLOSE
protected static byte[] DOCHDR_OPEN
protected static byte[] DOCNO_CLOSE
protected static byte[] DOCNO_OPEN
protected DocumentFactory factory
The document factory.
protected String[] file
The list of the files containing the documents.
protected SegmentedInputStream lastStream
The last returned stream.
protected boolean useGzip
Whether the files in file are gzipped.
-
Fields inherited from interface it.unimi.di.big.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
-
-
Constructor Summary
Constructors
Modifier Constructor Description
TRECDocumentCollection(String[] file, DocumentFactory factory, int bufferSize, boolean useGzip)
Creates a new TREC collection by parsing the given files.
protected TRECDocumentCollection(String[] file, DocumentFactory factory, ObjectBigArrayBigList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors, int bufferSize, boolean useGzip)
Copy constructor (that is, the one used by copy()).
-
Method Summary
Modifier and Type Method Description
void close()
Closes this document sequence, releasing all resources.
TRECDocumentCollection copy()
Document document(long n)
Returns the document given its index.
protected static boolean equals(byte[] a, int len, byte[] b)
DocumentFactory factory()
Returns the factory used by this sequence.
DocumentIterator iterator()
Returns an iterator over the sequence of documents.
static void main(String[] arg)
void merge(TRECDocumentCollection other)
Merges another collection into this one by rebuilding the array of files with the other collection's files appended and concatenating the descriptors, which are rebuilt in the process.
Reference2ObjectMap<Enum<?>,Object> metadata(long index)
Returns the metadata map for a document.
protected void parseContent(int fileIndex, InputStream is)
long size()
Returns the number of documents in this collection.
InputStream stream(long n)
Returns an input stream for the raw content of a document.
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, printAllDocuments, toString
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentSequence
filename, finalize, load
-
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface it.unimi.di.big.mg4j.document.DocumentSequence
filename
-
-
Field Detail
-
DEFAULT_BUFFER_SIZE
public static final String DEFAULT_BUFFER_SIZE
Default buffer size, set up after some experiments.
- See Also:
- Constant Field Values
-
file
protected String[] file
The list of the files containing the documents.
-
useGzip
protected final boolean useGzip
Whether the files in file are gzipped.
-
factory
protected DocumentFactory factory
The document factory.
-
descriptors
protected transient ObjectBigArrayBigList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors
The list of document descriptors. We assume that descriptors within the same file are contiguous.
-
lastStream
protected SegmentedInputStream lastStream
The last returned stream.
-
DOC_OPEN
protected static final byte[] DOC_OPEN
-
DOC_CLOSE
protected static final byte[] DOC_CLOSE
-
DOCNO_OPEN
protected static final byte[] DOCNO_OPEN
-
DOCNO_CLOSE
protected static final byte[] DOCNO_CLOSE
-
DOCHDR_OPEN
protected static final byte[] DOCHDR_OPEN
-
DOCHDR_CLOSE
protected static final byte[] DOCHDR_CLOSE
-
-
Constructor Detail
-
TRECDocumentCollection
protected TRECDocumentCollection(String[] file, DocumentFactory factory, ObjectBigArrayBigList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors, int bufferSize, boolean useGzip)
Copy constructor (that is, the one used by copy()). Just initializes final fields.
-
TRECDocumentCollection
public TRECDocumentCollection(String[] file, DocumentFactory factory, int bufferSize, boolean useGzip) throws IOException
Creates a new TREC collection by parsing the given files.
- Parameters:
file - an array of file names containing documents in TREC GOV2 format.
factory - the document factory (usually, a composite one).
bufferSize - the buffer size.
useGzip - true iff the files are gzipped.
- Throws:
IOException
-
-
Method Detail
-
equals
protected static boolean equals(byte[] a, int len, byte[] b)
-
parseContent
protected void parseContent(int fileIndex, InputStream is) throws IOException
- Throws:
IOException
-
copy
public TRECDocumentCollection copy()
- Specified by:
copy in interface DocumentCollection
- Specified by:
copy in interface FlyweightPrototype<DocumentCollection>
-
size
public long size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.
- Specified by:
size in interface DocumentCollection
- Returns:
- the number of documents in this collection.
-
document
public Document document(long n) throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.
- Specified by:
document in interface DocumentCollection
- Parameters:
n - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the n-th document.
- Throws:
IOException
-
stream
public InputStream stream(long n) throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.
- Specified by:
stream in interface DocumentCollection
- Parameters:
n - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the raw content of the document as an input stream.
- Throws:
IOException
-
metadata
public Reference2ObjectMap<Enum<?>,Object> metadata(long index)
Description copied from interface: DocumentCollection
Returns the metadata map for a document.
- Specified by:
metadata in interface DocumentCollection
- Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the metadata map for the document.
-
factory
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
- Specified by:
factory in interface DocumentSequence
- Returns:
- the factory used by this sequence.
-
close
public void close() throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.
You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in interface DocumentSequence
- Overrides:
close in class AbstractDocumentSequence
- Throws:
IOException
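Because the class implements AutoCloseable, the advice to always call close() maps naturally onto try-with-resources. The sketch below uses a hypothetical stand-in, DemoSequence, which is not an MG4J class, so that it runs with the standard library alone; with the real collection in the resource header, the enclosing method would also have to handle the IOException that close() declares.

```java
import java.io.Closeable;

// Hypothetical sketch, not part of MG4J: shows close() being called automatically.
final class CloseSketch {
    /** Stand-in for a document sequence; tracks whether it has been closed. */
    public static final class DemoSequence implements Closeable {
        private boolean closed;

        /** Pretends to read a document; refuses to work once closed. */
        public String next() {
            if (closed) throw new IllegalStateException("sequence already closed");
            return "document";
        }

        @Override
        public void close() {
            closed = true;
        }

        public boolean isClosed() {
            return closed;
        }
    }

    /** Reads one item and returns the sequence after the try block has closed it. */
    public static DemoSequence readOne() {
        DemoSequence sequence = new DemoSequence();
        try (DemoSequence s = sequence) {
            s.next(); // work with the sequence; close() runs even if this throws
        }
        return sequence;
    }
}
```

The try block guarantees release of resources without relying on finalisers, which is the behaviour the description above recommends.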
-
merge
public void merge(TRECDocumentCollection other)
Merges another collection into this one by rebuilding the array of files with the other collection's files appended and concatenating the descriptors, which are rebuilt in the process.
It is assumed that the passed collection contains no duplicates of documents in this collection.
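The merge step can be pictured with a small standalone sketch. It assumes, purely for illustration, that each descriptor records the index of its file in the file array, so descriptors coming from the other collection must have that index shifted by the number of files already present. MergeSketch and its Descriptor type are hypothetical and do not reflect the actual layout of TRECDocumentDescriptor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of MG4J: merges file arrays and descriptor lists.
final class MergeSketch {
    /** Assumed descriptor layout: owning file index plus a start offset. */
    public static final class Descriptor {
        public final int fileIndex;
        public final long start;

        public Descriptor(int fileIndex, long start) {
            this.fileIndex = fileIndex;
            this.start = start;
        }
    }

    public String[] files;
    public final List<Descriptor> descriptors = new ArrayList<>();

    public MergeSketch(String... files) {
        this.files = files.clone();
    }

    /** Appends the other collection's files and re-indexes its descriptors. */
    public void merge(MergeSketch other) {
        final int shift = files.length;
        // Snapshot first, so that merging a collection with itself stays safe.
        List<Descriptor> incoming = new ArrayList<>(other.descriptors);
        String[] merged = Arrays.copyOf(files, shift + other.files.length);
        System.arraycopy(other.files, 0, merged, shift, other.files.length);
        files = merged;
        for (Descriptor d : incoming) {
            // File indices of the other collection now start after our own files.
            descriptors.add(new Descriptor(d.fileIndex + shift, d.start));
        }
    }
}
```

As in the description above, the sketch performs no duplicate detection: documents present in both collections would simply appear twice.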
-
iterator
public DocumentIterator iterator() throws IOException
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.
Warning: this method can be safely called just one time. For instance, implementations based on standard input will usually throw an exception if this method is called twice.
Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.
- Specified by:
iterator in interface DocumentSequence
- Overrides:
iterator in class AbstractDocumentCollection
- Returns:
- an iterator over the sequence of documents.
- Throws:
IOException
- See Also:
DocumentCollection
-
main
public static void main(String[] arg) throws IOException, com.martiansoftware.jsap.JSAPException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
- Throws:
IOException
com.martiansoftware.jsap.JSAPException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException