Class WarcDocumentSequence
- java.lang.Object
- it.unimi.di.big.mg4j.document.AbstractDocumentSequence
- it.unimi.di.big.mg4j.document.WarcDocumentSequence
- All Implemented Interfaces:
DocumentSequence, SafelyCloseable, Closeable, Serializable, AutoCloseable
public class WarcDocumentSequence extends AbstractDocumentSequence implements Serializable
A document sequence over a set of (possibly compressed) WARC files. The metadata provided by the sequence include the encoding, the URI, and the MIME type.
If a WARC header with name “WARC-TREC-ID” is present, it will be used as TITLE.
This class will also fetch and use the BUbiNG guessed charset, if present.
As a convenience, this class provides a main method for the creation of a serialized version of the document sequence.
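The title rule described above can be illustrated with a small stand-alone sketch. It does not use the MG4J or BUbiNG classes: `WarcTitleSketch`, `parseHeaders` and `titleFor` are hypothetical names, the exact-case header lookup is a simplification, and the fallback title is an assumption, since this page does not say which title is used when the header is absent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of MG4J: shows the "WARC-TREC-ID becomes TITLE" rule.
final class WarcTitleSketch {
    static final String TREC_ID_HEADER = "WARC-TREC-ID";

    private WarcTitleSketch() {}

    // Parses "Name: value" lines into a map; lines without a colon
    // (such as the "WARC/1.0" version line) are skipped.
    static Map<String, String> parseHeaders(String headerBlock) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : headerBlock.split("\r?\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) continue;
            headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return headers;
    }

    // Returns the WARC-TREC-ID value if present; otherwise the caller-supplied
    // fallback (an assumption made for this sketch only).
    static String titleFor(Map<String, String> headers, String fallback) {
        String id = headers.get(TREC_ID_HEADER);
        return id != null ? id : fallback;
    }
}
```

A caller would parse the header block of one record and pass the resulting map to `titleFor`, together with whatever default title suits the application.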
- See Also:
- Serialized Form
-
-
Field Summary
Fields
Modifier and Type | Field | Description
protected int | bufferSize | The buffer size used for reads.
static String | DEFAULT_BUFFER_SIZE | Default buffer size, set up after some experiments.
protected DocumentFactory | factory | The user-specified factory.
protected boolean | useGzip | Whether the WARC files are gzipped.
protected String[] | warcFile | The list of WARC files.
-
Constructor Summary
Constructors
Modifier | Constructor
protected | WarcDocumentSequence(WarcDocumentSequence prototype)
public | WarcDocumentSequence(String[] warcFile, DocumentFactory factory, boolean useGzip, int bufferSize)
-
Method Summary
Modifier and Type | Method | Description
DocumentFactory | factory() | Returns the factory used by this sequence.
protected Document | getCurrentDocument(it.unimi.di.law.warc.records.WarcRecord record) |
DocumentIterator | iterator() | Returns an iterator over the sequence of documents.
static void | main(String[] args) |
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentSequence
close, filename, finalize, load
-
Field Detail
-
DEFAULT_BUFFER_SIZE
public static final String DEFAULT_BUFFER_SIZE
Default buffer size, set up after some experiments.
- See Also:
- Constant Field Values
-
factory
protected final DocumentFactory factory
The user-specified factory.
-
bufferSize
protected final int bufferSize
The buffer size used for reads.
-
useGzip
protected final boolean useGzip
Whether the WARC files are gzipped.
-
warcFile
protected final String[] warcFile
The list of WARC files.
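As a rough illustration of how the useGzip and bufferSize fields could shape the way a file is read, here is a stand-alone sketch built only on the JDK. It is not the class's actual reading code: `WarcStreamSketch` and `open` are hypothetical names, and applying the same buffer size to both the buffered stream and the gzip inflater is an assumption.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of MG4J: opens a possibly gzipped stream
// with a caller-chosen buffer size.
final class WarcStreamSketch {
    private WarcStreamSketch() {}

    static InputStream open(InputStream raw, boolean useGzip, int bufferSize) throws IOException {
        // Buffer the raw bytes first so that decompression reads in large chunks.
        InputStream buffered = new BufferedInputStream(raw, bufferSize);
        // Wrap in a gzip decoder only when the caller says the data is compressed.
        return useGzip ? new GZIPInputStream(buffered, bufferSize) : buffered;
    }
}
```

With a file on disk, the `raw` argument would typically be a `FileInputStream` for one entry of the warcFile array.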
-
-
Constructor Detail
-
WarcDocumentSequence
protected WarcDocumentSequence(WarcDocumentSequence prototype)
-
WarcDocumentSequence
public WarcDocumentSequence(String[] warcFile, DocumentFactory factory, boolean useGzip, int bufferSize)
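Because the class is Serializable and the public constructor takes the same values that appear in the field list, a serialized sequence can be thought of as a small record of file names and settings. The sketch below shows such a round trip with plain JDK serialization. `SequenceSpecSketch`, `store` and `load` are hypothetical stand-ins, not the MG4J classes, and the document factory is left out for brevity (in a real sequence it would need to be serializable as well).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a serializable sequence description; not an MG4J class.
final class SequenceSpecSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    final String[] warcFile;
    final boolean useGzip;
    final int bufferSize;

    SequenceSpecSketch(String[] warcFile, boolean useGzip, int bufferSize) {
        this.warcFile = warcFile.clone(); // defensive copy of the file list
        this.useGzip = useGzip;
        this.bufferSize = bufferSize;
    }

    // Serializes an object to a byte array using standard Java serialization.
    static byte[] store(Serializable object) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        }
        return bytes.toByteArray();
    }

    // Reads back an object previously written by store().
    static Object load(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }
}
```

Only the file names and settings travel in the serialized form here; the WARC files themselves stay where they are and must still be reachable when the object is loaded.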
-
-
Method Detail
-
factory
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
- Specified by:
factory
in interface DocumentSequence
- Returns:
- the factory used by this sequence.
-
getCurrentDocument
protected Document getCurrentDocument(it.unimi.di.law.warc.records.WarcRecord record) throws IOException
- Throws:
IOException
-
iterator
public DocumentIterator iterator() throws IOException
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.
Warning: this method can be safely called only once. For instance, implementations based on standard input will usually throw an exception if this method is called twice.
Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.
- Specified by:
iterator
in interface DocumentSequence
- Returns:
- an iterator over the sequence of documents.
- Throws:
IOException
- See Also:
DocumentCollection
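The one-call restriction on iterator() can be pictured with a minimal stand-alone sketch. It uses plain strings instead of MG4J documents, and `OneShotSequenceSketch` is a hypothetical class. Whether WarcDocumentSequence itself rejects a second call is not stated on this page, so the exception below only illustrates the general contract.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration of a sequence whose iterator() may be called only once.
final class OneShotSequenceSketch {
    private final List<String> documents;
    private final AtomicBoolean iteratorTaken = new AtomicBoolean();

    OneShotSequenceSketch(String... documents) {
        this.documents = Arrays.asList(documents);
    }

    Iterator<String> iterator() {
        // The first caller flips the flag; any later caller is rejected.
        if (!iteratorTaken.compareAndSet(false, true)) {
            throw new IllegalStateException("iterator() was already called on this sequence");
        }
        return documents.iterator();
    }
}
```

Code that needs several passes over the same data would create a new sequence for each pass, or rely on an implementation that documents support for repeated iteration.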
-
-