InputStreamDocumentSequence (MG4J 5.1)

it.unimi.di.mg4j.document
Class InputStreamDocumentSequence

java.lang.Object
  java.io.InputStream
      it.unimi.dsi.fastutil.io.MeasurableInputStream
          it.unimi.dsi.fastutil.io.FastBufferedInputStream
              it.unimi.di.mg4j.document.InputStreamDocumentSequence

All Implemented Interfaces: DocumentSequence, MeasurableStream, RepositionableStream, Closeable

public class InputStreamDocumentSequence
extends FastBufferedInputStream
implements DocumentSequence

A document sequence obtained by breaking an input stream at a specified separator.

This document sequence blindly passes to the indexer sequences of characters read in a specified encoding and separated by a specified byte.
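
The idea of breaking a stream at a separator byte can be illustrated with plain JDK classes. The following is only a sketch under stated assumptions, not MG4J's implementation: the class `SeparatorSplitSketch` and its `split` method are hypothetical names, the chunks are decoded directly into strings instead of being handed to a `DocumentFactory`, and emitting a trailing chunk that lacks a final separator is an assumption of this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MG4J: it only illustrates separator-based splitting.
final class SeparatorSplitSketch {
    /** Returns at most maxDocs chunks of the stream, split at the separator byte and decoded with charset. */
    static List<String> split(InputStream in, int separator, Charset charset, int maxDocs) throws IOException {
        List<String> docs = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        boolean pending = false; // true while the current chunk holds bytes that were not emitted yet
        int b;
        while (docs.size() < maxDocs && (b = in.read()) != -1) {
            if (b == separator) {
                // The separator closes the current chunk and is not part of it.
                docs.add(new String(current.toByteArray(), charset));
                current.reset();
                pending = false;
            } else {
                current.write(b);
                pending = true;
            }
        }
        // Assumption of this sketch: a final chunk without a closing separator still counts.
        if (pending && docs.size() < maxDocs) docs.add(new String(current.toByteArray(), charset));
        return docs;
    }
}
```

The `maxDocs` parameter mirrors the four-argument constructor: once that many chunks have been produced, the rest of the stream is left unread.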

Nested Class Summary

Nested classes/interfaces inherited from class it.unimi.dsi.fastutil.io.FastBufferedInputStream
`FastBufferedInputStream.LineTerminator`

Field Summary

Fields inherited from class it.unimi.dsi.fastutil.io.FastBufferedInputStream
`ALL_TERMINATORS, avail, buffer, DEFAULT_BUFFER_SIZE, is, pos, readBytes`

Constructor Summary
`InputStreamDocumentSequence(InputStream inputStream, int separator, DocumentFactory factory)` Creates a new document sequence based on a given input stream and separator.
`InputStreamDocumentSequence(InputStream inputStream, int separator, DocumentFactory factory, int maxDocs)` Creates a new document sequence based on a given input stream and separator; the sequence will not return more than the given number of documents.

Method Summary
`int`	`available()` Returns one if there is an available byte which is not a separator, zero otherwise.
`void`	`close()` Closes this document sequence, releasing all resources.
`DocumentFactory`	`factory()` Returns the factory used by this sequence.
`void`	`filename(CharSequence filename)` Sets the filename of this document sequence.
`void`	`flush()`
`DocumentIterator`	`iterator()` Returns an iterator over the sequence of documents.
`void`	`mark(int readlimit)`
`boolean`	`markSupported()`
`boolean`	`noMoreBytes()`
`int`	`read()`
`int`	`read(byte[] b)`
`int`	`read(byte[] b, int offset, int length)`
`void`	`reset()` Deprecated.
`long`	`skip(long skip)`

Methods inherited from class it.unimi.dsi.fastutil.io.FastBufferedInputStream
`length, noMoreCharacters, position, position, readLine, readLine, readLine, readLine`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

InputStreamDocumentSequence

public InputStreamDocumentSequence(InputStream inputStream,
                                   int separator,
                                   DocumentFactory factory,
                                   int maxDocs)

Creates a new document sequence based on a given input stream and separator; the sequence will not return more than the given number of documents.

Parameters:
inputStream - the input stream containing all documents.
separator - the separator.
factory - the factory that will be used to create documents.
maxDocs - the maximum number of documents returned.

InputStreamDocumentSequence

public InputStreamDocumentSequence(InputStream inputStream,
                                   int separator,
                                   DocumentFactory factory)

Creates a new document sequence based on a given input stream and separator.

Parameters:
inputStream - the input stream containing all documents.
separator - the separator.
factory - the factory that will be used to create documents.

Method Detail

iterator

public DocumentIterator iterator()

Description copied from interface: DocumentSequence

Returns an iterator over the sequence of documents.

Warning: this method can safely be called only once. For instance, implementations based on standard input will usually throw an exception if this method is called twice.

Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.

Specified by: iterator in interface DocumentSequence

Returns: an iterator over the sequence of documents.
See Also: DocumentCollection
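
The call-once restriction described above can be enforced with a simple guard. This is a minimal sketch, not MG4J code: `OneShotSequence` is a hypothetical class, and throwing `IllegalStateException` is an assumption, since the documentation only says that an exception will usually be thrown.

```java
import java.util.Iterator;

// Hypothetical one-shot sequence: the underlying source can be consumed only once,
// so a second request for an iterator is rejected.
final class OneShotSequence<T> implements Iterable<T> {
    private final Iterator<T> source;
    private boolean handedOut; // set once the single iterator has been returned

    OneShotSequence(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        if (handedOut) throw new IllegalStateException("iterator() may be called only once");
        handedOut = true;
        return source;
    }
}
```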

factory

public DocumentFactory factory()

Description copied from interface: DocumentSequence

Returns the factory used by this sequence.

Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.

Specified by: factory in interface DocumentSequence

Returns: the factory used by this sequence.

noMoreBytes

public boolean noMoreBytes()
                    throws IOException

Throws: IOException

read

public int read()
         throws IOException

Overrides: read in class FastBufferedInputStream

Throws: IOException

read

public int read(byte[] b)
         throws IOException

Overrides: read in class InputStream

Throws: IOException

read

public int read(byte[] b,
                int offset,
                int length)
         throws IOException

Overrides: read in class FastBufferedInputStream

Throws: IOException

mark

public void mark(int readlimit)

Overrides: mark in class InputStream

markSupported

public boolean markSupported()

Overrides: markSupported in class InputStream

available

public int available()
              throws IOException

Returns one if there is an available byte which is not a separator, zero otherwise.

This behaviour tries to avoid calls to InputStream.available(), which are unbelievably slow. Stream decoders presently need to know only whether a character can be read without blocking.

Overrides: available in class FastBufferedInputStream

Returns: one if there is an available byte which is not a separator, zero otherwise.
Throws: IOException
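
The zero-or-one contract can be sketched with a one-byte lookahead. This is an illustration under stated assumptions, not the class's actual code: `SeparatorAwareStream` is a hypothetical wrapper, it peeks through a `PushbackInputStream` (which may block on a slow source, unlike a check against an already filled buffer), and having `read()` return -1 at a separator is an assumption made to keep the example coherent.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical wrapper, not MG4J code: shows an available() that answers only
// "is the next byte readable and not a separator?".
final class SeparatorAwareStream extends InputStream {
    private final PushbackInputStream in;
    private final int separator;

    SeparatorAwareStream(InputStream in, int separator) {
        this.in = new PushbackInputStream(in, 1); // room for one byte of lookahead
        this.separator = separator;
    }

    /** Returns 1 if the next byte exists and is not the separator, 0 otherwise. */
    @Override
    public int available() throws IOException {
        int b = in.read();
        if (b == -1) return 0;  // end of stream
        in.unread(b);           // put the byte back: this was only a peek
        return b == separator ? 0 : 1;
    }

    /** Returns the next byte, or -1 at a separator or at the end of the stream (sketch assumption). */
    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b == -1) return -1;
        if (b == separator) {
            in.unread(b);       // leave the separator in place
            return -1;
        }
        return b;
    }
}
```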

skip

public long skip(long skip)

Overrides: skip in class FastBufferedInputStream

reset

@Deprecated
public void reset()

Deprecated.

Overrides: reset in class FastBufferedInputStream

flush

public void flush()

Overrides: flush in class FastBufferedInputStream

close

public void close()
           throws IOException

Description copied from interface: DocumentSequence

Closes this document sequence, releasing all resources.

You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.

Specified by: close in interface DocumentSequence
Specified by: close in interface Closeable
Overrides: close in class FastBufferedInputStream

Throws: IOException
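
Because the sequence implements Closeable, one way to make sure close() is always called is a try-with-resources statement. The sketch below uses a hypothetical stand-in, `TrackedResource`, instead of a real document sequence so that it stays self-contained; the same pattern should apply to any Closeable.

```java
import java.io.Closeable;

final class CloseSketch {
    /** Hypothetical stand-in for a document sequence; it only records usage and closure. */
    static final class TrackedResource implements Closeable {
        int documentsSeen;
        boolean closed;

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Uses the resource, which is closed on exit even if the body throws. */
    static TrackedResource useAndClose() {
        TrackedResource resource = new TrackedResource();
        try (TrackedResource r = resource) {
            r.documentsSeen++; // placeholder for iterating over the documents
        }
        return resource;
    }
}
```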

filename

public void filename(CharSequence filename)
              throws IOException

Description copied from interface: DocumentSequence

Sets the filename of this document sequence.

Several document sequences (or collections) are stored using Java's standard serialisation mechanism; nonetheless, they require access to files that are stored as serialised filenames inside the instance. If all pieces are in the current directory, this works as expected. However, if the sequence was specified using a complete pathname, during deserialisation it will be impossible to recover the associated files. In this case, the class expects that this method is invoked over the newly deserialised instance so that pathnames can be relativised to the given filename. Classes that need this mechanism should not fail upon deserialisation if they do not find some support file, but rather wait for the first access.

In several cases, this method can be a no-op (e.g., for an InputStreamDocumentSequence or a FileSetDocumentCollection). Other implementations, such as SimpleCompressedDocumentCollection or ZipDocumentCollection, require a specific treatment. AbstractDocumentSequence implements this method as a no-op.

Specified by: filename in interface DocumentSequence

Parameters:
filename - the filename of this document sequence.
Throws: IOException
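
For implementations that do need this mechanism, relativising a stored pathname might look like the following sketch. It is an assumption about one reasonable strategy, not the code of any MG4J class: `RelativiseSketch` and `relocate` are hypothetical names, and the sketch simply looks for a support file, by its bare name, next to the file given to filename().

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: re-anchors a support file recorded at serialisation time
// to the directory of the file the sequence was just deserialised from.
final class RelativiseSketch {
    static Path relocate(String storedSupportFile, CharSequence sequenceFilename) {
        // Keep only the last path component of the stored name.
        Path name = Paths.get(storedSupportFile).getFileName();
        // Resolve it against the directory containing the sequence file.
        return Paths.get(sequenceFilename.toString()).resolveSibling(name);
    }
}
```

With these assumptions, a support file stored as `/old/place/docs.zip` and a sequence loaded from `/data/index/collection.seq` would be looked up as `/data/index/docs.zip`.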
