Class InputStreamDocumentSequence
- java.lang.Object
  - java.io.InputStream
    - it.unimi.dsi.fastutil.io.MeasurableInputStream
      - it.unimi.dsi.fastutil.io.FastBufferedInputStream
        - it.unimi.di.big.mg4j.document.InputStreamDocumentSequence
- All Implemented Interfaces: DocumentSequence, MeasurableStream, RepositionableStream, Closeable, AutoCloseable
public class InputStreamDocumentSequence extends FastBufferedInputStream implements DocumentSequence
A document sequence obtained by breaking an input stream at a specified separator. This document sequence blindly passes to the indexer sequences of characters read in a specified encoding and separated by a specified byte.
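The splitting described above can be illustrated with plain `java.io`. This is only a minimal sketch, not the MG4J implementation: the class `SeparatorSplitSketch` and its `split` method are hypothetical names, and the handling of empty or unterminated trailing chunks is an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of MG4J): breaks a stream at a separator byte
// and decodes each chunk in the given encoding.
class SeparatorSplitSketch {
    static List<String> split(InputStream in, int separator, Charset encoding) throws IOException {
        List<String> documents = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == separator) {
                // A separator closes the current chunk (possibly an empty one).
                documents.add(new String(current.toByteArray(), encoding));
                current.reset();
            } else {
                current.write(b);
            }
        }
        // Assumption of this sketch: a non-empty unterminated tail is a document too.
        if (current.size() > 0) documents.add(new String(current.toByteArray(), encoding));
        return documents;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "first doc\nsecond doc\nthird doc".getBytes(StandardCharsets.UTF_8);
        // Prints [first doc, second doc, third doc]
        System.out.println(split(new ByteArrayInputStream(data), '\n', StandardCharsets.UTF_8));
    }
}
```

Unlike this sketch, which materialises every chunk as a string, the class documented here hands the bytes of each chunk to a DocumentFactory.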
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class it.unimi.dsi.fastutil.io.FastBufferedInputStream
FastBufferedInputStream.LineTerminator
-
-
Field Summary
-
Fields inherited from class it.unimi.dsi.fastutil.io.FastBufferedInputStream
ALL_TERMINATORS, avail, buffer, DEFAULT_BUFFER_SIZE, is, pos, readBytes
-
-
Constructor Summary
| Constructor | Description |
| --- | --- |
| InputStreamDocumentSequence(InputStream inputStream, int separator, DocumentFactory factory) | Creates a new document sequence based on a given input stream and separator. |
| InputStreamDocumentSequence(InputStream inputStream, int separator, DocumentFactory factory, int maxDocs) | Creates a new document sequence based on a given input stream and separator; the sequence will not return more than the given number of documents. |
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| int | available() | Returns one if there is an available byte which is not a separator, zero otherwise. |
| void | close() | Closes this document sequence, releasing all resources. |
| DocumentFactory | factory() | Returns the factory used by this sequence. |
| void | filename(CharSequence filename) | Sets the filename of this document sequence. |
| void | flush() | |
| DocumentIterator | iterator() | Returns an iterator over the sequence of documents. |
| void | mark(int readlimit) | |
| boolean | markSupported() | |
| boolean | noMoreBytes() | |
| int | read() | |
| int | read(byte[] b) | |
| int | read(byte[] b, int offset, int length) | |
| void | reset() | Deprecated. |
| long | skip(long skip) | |
-
Methods inherited from class it.unimi.dsi.fastutil.io.FastBufferedInputStream
length, noMoreCharacters, position, position, readLine, readLine, readLine, readLine
-
Methods inherited from class java.io.InputStream
nullInputStream, readAllBytes, readNBytes, readNBytes, skipNBytes, transferTo
-
-
-
-
Constructor Detail
-
InputStreamDocumentSequence
public InputStreamDocumentSequence(InputStream inputStream, int separator, DocumentFactory factory, int maxDocs)
Creates a new document sequence based on a given input stream and separator; the sequence will not return more than the given number of documents.
- Parameters:
inputStream - the input stream containing all documents.
separator - the separator.
factory - the factory that will be used to create documents.
maxDocs - the maximum number of documents returned.
-
InputStreamDocumentSequence
public InputStreamDocumentSequence(InputStream inputStream, int separator, DocumentFactory factory)
Creates a new document sequence based on a given input stream and separator.
- Parameters:
inputStream - the input stream containing all documents.
separator - the separator.
factory - the factory that will be used to create documents.
-
-
Method Detail
-
iterator
public DocumentIterator iterator()
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.
Warning: this method can be safely called just one time. For instance, implementations based on standard input will usually throw an exception if this method is called twice.
Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.
- Specified by: iterator in interface DocumentSequence
- Returns: an iterator over the sequence of documents.
- See Also: DocumentCollection
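The one-call restriction in the warning above can be pictured with a guard flag. This is a hypothetical sketch built on `java.lang.Iterable`, not MG4J code: the class name `OneShotSequenceSketch` and the choice of `IllegalStateException` are assumptions, since the text only says that an exception is usually thrown.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical illustration (not MG4J code): a sequence backed by a
// non-replayable source, so iterator() may be called only once.
class OneShotSequenceSketch implements Iterable<String> {
    private final Iterator<String> source;
    private boolean iteratorRequested;

    OneShotSequenceSketch(Iterator<String> source) {
        this.source = source;
    }

    @Override
    public Iterator<String> iterator() {
        // The underlying source cannot be rewound, so a second request is rejected.
        if (iteratorRequested) throw new IllegalStateException("iterator() may be called only once");
        iteratorRequested = true;
        return source;
    }

    public static void main(String[] args) {
        OneShotSequenceSketch sequence = new OneShotSequenceSketch(Arrays.asList("doc0", "doc1").iterator());
        for (String document : sequence) System.out.println(document);
        try {
            sequence.iterator();
        } catch (IllegalStateException e) {
            System.out.println("Second call rejected: " + e.getMessage());
        }
    }
}
```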
-
factory
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
- Specified by: factory in interface DocumentSequence
- Returns: the factory used by this sequence.
-
noMoreBytes
public boolean noMoreBytes() throws IOException
- Throws: IOException
-
read
public int read() throws IOException
- Overrides: read in class FastBufferedInputStream
- Throws: IOException
-
read
public int read(byte[] b) throws IOException
- Overrides: read in class InputStream
- Throws: IOException
-
read
public int read(byte[] b, int offset, int length) throws IOException
- Overrides: read in class FastBufferedInputStream
- Throws: IOException
-
mark
public void mark(int readlimit)
- Overrides: mark in class InputStream
-
markSupported
public boolean markSupported()
- Overrides: markSupported in class InputStream
-
available
public int available() throws IOException
Returns one if there is an available byte which is not a separator, zero otherwise.
This behaviour tries to avoid calls to InputStream.available(), which are unbelievably slow. Stream decoders presently need to know only whether a character can be read without blocking.
- Overrides: available in class FastBufferedInputStream
- Returns: one if there is an available byte which is not a separator, zero otherwise.
- Throws: IOException
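The one-or-zero contract can be sketched over an in-memory buffer. This is a hypothetical illustration, not the MG4J implementation: the class `SeparatorBoundedStreamSketch` and its `nextDocument()` method are invented for the example, and the sketch assumes that `read()` reports end of file when it reaches a separator.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch (not the MG4J implementation): a stream over an
// in-memory buffer that treats each separator byte as an end of file.
class SeparatorBoundedStreamSketch extends InputStream {
    private final byte[] buffer;
    private final int separator;
    private int pos;

    SeparatorBoundedStreamSketch(byte[] buffer, int separator) {
        this.buffer = buffer;
        this.separator = separator;
    }

    @Override
    public int available() {
        // One bounds check and one comparison: 1 if the next byte exists and
        // is not a separator, 0 otherwise.
        return pos < buffer.length && (buffer[pos] & 0xFF) != separator ? 1 : 0;
    }

    @Override
    public int read() {
        // Assumption of this sketch: a separator looks like end of file.
        if (available() == 0) return -1;
        return buffer[pos++] & 0xFF;
    }

    // Hypothetical helper: discards the rest of the current document and
    // steps over its separator; returns false when the input is exhausted.
    boolean nextDocument() {
        while (pos < buffer.length && (buffer[pos] & 0xFF) != separator) pos++;
        if (pos >= buffer.length) return false;
        pos++;
        return true;
    }

    public static void main(String[] args) {
        SeparatorBoundedStreamSketch in =
            new SeparatorBoundedStreamSketch("ab\ncd".getBytes(StandardCharsets.US_ASCII), '\n');
        do {
            StringBuilder document = new StringBuilder();
            while (in.available() == 1) document.append((char) in.read());
            System.out.println(document);
        } while (in.nextDocument());
    }
}
```

A caller that only needs to know whether one more byte of the current document can be read never has to query the underlying stream.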
-
skip
public long skip(long skip)
- Overrides: skip in class FastBufferedInputStream
-
reset
@Deprecated public void reset()
Deprecated.
- Overrides: reset in class FastBufferedInputStream
-
flush
public void flush()
- Overrides: flush in class FastBufferedInputStream
-
close
public void close() throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.
You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.
- Specified by: close in interface AutoCloseable
- Specified by: close in interface Closeable
- Specified by: close in interface DocumentSequence
- Overrides: close in class FastBufferedInputStream
- Throws: IOException
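Because this class implements AutoCloseable, one way to follow the advice above without relying on finalisers is a try-with-resources statement. The sketch below uses a hypothetical stand-in resource (`TrackedSequence`) rather than a real document sequence, so that it runs with the JDK alone. It shows that `close()` is invoked whether or not processing fails.

```java
// Hypothetical illustration: TrackedSequence stands in for a document
// sequence; only its AutoCloseable behaviour matters here.
class TryWithResourcesSketch {
    static class TrackedSequence implements AutoCloseable {
        boolean closed;

        @Override
        public void close() {
            closed = true;
        }
    }

    // Uses the sequence inside try-with-resources, optionally simulating a failure.
    static TrackedSequence process(boolean fail) {
        TrackedSequence sequence = new TrackedSequence();
        try (TrackedSequence s = sequence) {
            if (fail) throw new IllegalStateException("simulated failure");
        } catch (IllegalStateException e) {
            // By the time this handler runs, close() has already been called.
        }
        return sequence;
    }

    public static void main(String[] args) {
        System.out.println(process(false).closed); // prints true
        System.out.println(process(true).closed);  // prints true
    }
}
```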
-
filename
public void filename(CharSequence filename) throws IOException
Description copied from interface: DocumentSequence
Sets the filename of this document sequence.
Several document sequences (or collections) are stored using Java's standard serialisation mechanism; nonetheless, they require access to files that are stored as serialised filenames inside the instance. If all pieces are in the current directory, this works as expected. However, if the sequence was specified using a complete pathname, during deserialisation it will be impossible to recover the associated files. In this case, the class expects that this method is invoked over the newly deserialised instance so that pathnames can be relativised to the given filename. Classes that need this mechanism should not fail upon deserialisation if they do not find some support file, but rather wait for the first access.
In several cases, this method can be a no-op (e.g., for an InputStreamDocumentSequence or a FileSetDocumentCollection). Other implementations, such as SimpleCompressedDocumentCollection or ZipDocumentCollection, require a specific treatment. AbstractDocumentSequence implements this method as a no-op.
- Specified by: filename in interface DocumentSequence
- Parameters:
filename - the filename of this document sequence.
- Throws: IOException
-
-