Class WarcDocumentSequence

    • Field Detail

      • DEFAULT_BUFFER_SIZE

        public static final String DEFAULT_BUFFER_SIZE
        Default buffer size, chosen after some experiments.
        See Also:
        Constant Field Values
      • factory

        protected final DocumentFactory factory
        The user-specified factory.
      • bufferSize

        protected final int bufferSize
        The buffer size used for reads.
      • useGzip

        protected final boolean useGzip
        Whether the WARC files are gzipped.
      • warcFile

        protected final String[] warcFile
        The list of WARC files.
    • Constructor Detail

      • WarcDocumentSequence

        public WarcDocumentSequence(String[] warcFile,
                                    DocumentFactory factory,
                                    boolean useGzip,
                                    int bufferSize)
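
        The constructor takes a gzip flag and a buffer size, but this page does not describe how they are applied. The following is a minimal sketch, using only the Java standard library, of one plausible way such parameters could drive how a single file is opened. The class WarcOpenSketch and its methods open and roundTrip are hypothetical names introduced for illustration; they are not part of this class or its library.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class WarcOpenSketch {
    /**
     * Hypothetical helper (an assumption, not the library's code): opens one file,
     * decompressing it if useGzip is set, and reads through a buffer of bufferSize bytes.
     */
    static InputStream open(String path, boolean useGzip, int bufferSize) throws IOException {
        InputStream raw = Files.newInputStream(Path.of(path));
        // GZIPInputStream(InputStream, int) takes the size of its internal input buffer.
        return useGzip
                ? new GZIPInputStream(raw, bufferSize)
                : new BufferedInputStream(raw, bufferSize);
    }

    /** Demonstration only: writes gzipped text to a temporary file and reads it back. */
    static String roundTrip(String text, int bufferSize) {
        try {
            Path tmp = Files.createTempFile("sketch", ".warc.gz");
            try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(tmp))) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            try (InputStream in = open(tmp.toString(), true, bufferSize)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } finally {
                // The input stream is already closed when this runs.
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("WARC/1.0", 8 * 1024));
    }
}
```

        In this sketch the buffer size only affects how many bytes are requested from the underlying file at a time; the decoded content is the same for any positive value.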
    • Method Detail

      • factory

        public DocumentFactory factory()
        Description copied from interface: DocumentSequence
        Returns the factory used by this sequence.

        Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.

        Specified by:
        factory in interface DocumentSequence
        Returns:
        the factory used by this sequence.
      • getCurrentDocument

        protected Document getCurrentDocument(it.unimi.di.law.warc.records.WarcRecord record)
                                       throws IOException
        Throws:
        IOException
      • iterator

        public DocumentIterator iterator()
                                  throws IOException
        Description copied from interface: DocumentSequence
        Returns an iterator over the sequence of documents.

        Warning: this method can safely be called only once. For instance, implementations based on standard input will usually throw an exception if this method is called twice.

        Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.

        Specified by:
        iterator in interface DocumentSequence
        Returns:
        an iterator over the sequence of documents.
        Throws:
        IOException
        See Also:
        DocumentCollection
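
        The single-call restriction on iterator() described above can be illustrated with a self-contained sketch. SingleUseSequence below is a hypothetical stand-in built on java.lang.Iterable; it does not use DocumentSequence or DocumentIterator, and the choice of IllegalStateException is an assumption, since the text above only says that an exception is usually thrown.

```java
import java.util.Iterator;
import java.util.List;

public class SingleUseSequenceSketch {
    /** Hypothetical stand-in for a sequence that can be traversed only once. */
    static final class SingleUseSequence implements Iterable<String> {
        private final Iterator<String> source;
        private boolean consumed;

        SingleUseSequence(Iterator<String> source) {
            this.source = source;
        }

        @Override
        public Iterator<String> iterator() {
            // A stream-like source cannot be rewound, so a second call is rejected.
            if (consumed) throw new IllegalStateException("iterator() was already called");
            consumed = true;
            return source;
        }
    }

    /** Traverses the sequence once and returns the number of documents seen. */
    static int countOnce(List<String> docs) {
        SingleUseSequence seq = new SingleUseSequence(docs.iterator());
        int n = 0;
        for (String ignored : seq) n++;
        return n;
    }

    /** Returns true if a second call to iterator() is rejected. */
    static boolean secondCallFails(List<String> docs) {
        SingleUseSequence seq = new SingleUseSequence(docs.iterator());
        seq.iterator();
        try {
            seq.iterator();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> docs = List.of("doc-a", "doc-b", "doc-c");
        System.out.println(countOnce(docs));
        System.out.println(secondCallFails(docs));
    }
}
```

        Callers that need several passes over the same data would, under this model, construct a new sequence for each pass rather than request a second iterator.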