java.lang.Object
- it.unimi.di.big.mg4j.tool.Scan

```
public class Scan
extends Object
```
Scans a document sequence, dividing it into batches of occurrences and writing a corresponding subindex for each batch.
This class (more precisely, its run() method) reads a document sequence and produces several batches, that is, subindices corresponding to subsets of term/document pairs of the collection. A set of batches is generated for each indexed field of the collection. A main method invokes run(), setting its parameters from suitable command-line options. Usually, batches are then merged into the actual index (as happens with IndexBuilder).
Unless a serialised DocumentSequence is specified using the suitable option, an implicit InputStreamDocumentSequence is created using a separator byte (default is 10, i.e., newline). In the latter case, the factory and its properties can be set with command-line options.
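The effect of the separator byte can be illustrated with a small standalone sketch. This is not the code of InputStreamDocumentSequence: the class `DelimitedDocuments` is hypothetical, and decoding each document as UTF-8 is an assumption made only for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not MG4J code: splits a byte stream into documents at a separator byte.
final class DelimitedDocuments {
    /** Returns the documents found in the stream; list position i is the natural index i. */
    static List<String> split(InputStream in, int delimiter) {
        List<String> documents = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        try {
            int b;
            while ((b = in.read()) != -1) {
                if (b == delimiter) {
                    // The separator closes the current document (UTF-8 is an assumption).
                    documents.add(new String(current.toByteArray(), StandardCharsets.UTF_8));
                    current.reset();
                } else {
                    current.write(b);
                }
            }
            // A trailing document without a final separator is kept as well.
            if (current.size() > 0) {
                documents.add(new String(current.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return documents;
    }

    public static void main(String[] args) {
        byte[] input = "first document\nsecond document\n".getBytes(StandardCharsets.UTF_8);
        // 10 is the newline byte, the default separator mentioned above.
        System.out.println(split(new ByteArrayInputStream(input), 10));
    }
}
```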
The only mandatory argument is a basename, which will be used to stem the names of all files generated. The first batch of a field named field will use the basename basename-field@0, the second batch basename-field@1 and so on. It is also possible to specify a separate directory for batch files (e.g., for easier cleanup when they are no longer necessary).
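The naming scheme above can be sketched as follows. The class `BatchNames` and its method are hypothetical and only mirror the pattern described in the text; they are not part of MG4J.

```java
// Hypothetical helper, not part of MG4J: it only reproduces the naming scheme described above.
final class BatchNames {
    /** Returns the basename of a batch of a field, e.g. "index-title@0". */
    static String batchName(String basename, String field, int batch) {
        return basename + "-" + field + "@" + batch;
    }

    public static void main(String[] args) {
        // Prints the basenames of the first three batches of the field "title".
        for (int batch = 0; batch < 3; batch++) {
            System.out.println(batchName("index", "title", batch));
        }
    }
}
```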
Since documents are read sequentially, every document has a natural index starting from 0. If no remapping (i.e., renumbering) is specified, the document index of each document corresponds to its natural index. If, however, a remapping is specified in the form of a list of integers, the document index of a document is the integer found in the corresponding position of the list. More precisely, a remapping for N documents is a list of N distinct integers, and a document with natural index i has document index given by the i-th element of the list. This is useful when indexing statically ranked documents (e.g., if you are indexing a part of the web and would like the index to return documents with higher static rank first). If a remapping file is provided, it must be a sequence of integers in DataInput format; if N is the number of documents, the file must contain exactly N distinct integers. The integers need not be between 0 and N-1, to allow the remapping of subindices (but a warning will be logged in this case, just to be sure you know what you're doing).
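As an illustration of the remapping file, the sketch below writes and reads back a map with the standard Java data streams. It is not MG4J code: the class `RemapSketch` is hypothetical, and the use of 32-bit integers (writeInt/readInt) is an assumption about the element width.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not MG4J code. Assumes the map is stored as 32-bit integers.
final class RemapSketch {
    /** Serialises a remapping: element i is the document index of the document with natural index i. */
    static byte[] write(int[] map) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int documentIndex : map) out.writeInt(documentIndex);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads n integers and rejects the map if two of them coincide. */
    static int[] read(byte[] data, int n) {
        int[] map = new int[n];
        Set<Integer> seen = new HashSet<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < n; i++) {
                map[i] = in.readInt();
                if (!seen.add(map[i])) {
                    throw new IllegalArgumentException("Duplicate document index " + map[i]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }

    /** Returns true if every value lies in [0, N); values outside are the case that triggers a warning. */
    static boolean withinRange(int[] map) {
        for (int value : map) {
            if (value < 0 || value >= map.length) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int[] map = read(write(new int[] { 2, 0, 1 }), 3);
        // The document with natural index 0 gets document index map[0].
        System.out.println("natural 0 -> " + map[0] + ", within range: " + withinRange(map));
    }
}
```

To write an actual file, the same loop can target a DataOutputStream wrapped around a FileOutputStream.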
Every term also has an associated number, starting from 0 and assigned in lexicographic order.
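A minimal sketch of such a numbering follows. The class `TermNumbering` is hypothetical, and the comparison used (the natural ordering of Java strings) is an assumption; the ordering applied by MG4J may differ in its details.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch, not MG4J code: numbers distinct terms from 0 in sorted order.
final class TermNumbering {
    /** Returns a map from each distinct term to its number. */
    static Map<String, Integer> number(Iterable<String> terms) {
        // A TreeSet removes duplicates and sorts by the natural order of String.
        TreeSet<String> sorted = new TreeSet<>();
        for (String term : terms) sorted.add(term);
        Map<String, Integer> numbers = new LinkedHashMap<>();
        for (String term : sorted) numbers.put(term, numbers.size());
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(number(Arrays.asList("pear", "apple", "fig", "apple")));
    }
}
```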
Index types and indexing types

A standard index contains a list of terms, and for each term a posting list. Each posting must contain a document pointer and may optionally contain the count and the positions of the term (whether the last two elements appear can be specified using suitable compression flags).
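The structure of a posting list can be made concrete with a toy, uncompressed in-memory inversion. This is only an illustration of the data model: the class `ToyInversion` is hypothetical, its tokenisation (lowercasing and splitting on non-word characters) is an assumption, and MG4J itself stores compressed posting lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not MG4J code: an uncompressed inverted index kept in memory.
final class ToyInversion {
    /** term -> (document pointer -> positions of the term in that document). */
    final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Adds the words of a document to the posting lists. */
    void processDocument(int documentPointer, String text) {
        int position = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) continue; // split() may yield a leading empty string
            postings.computeIfAbsent(word, k -> new TreeMap<>())
                    .computeIfAbsent(documentPointer, k -> new ArrayList<>())
                    .add(position++);
        }
    }

    /** The count of a term in a document is the number of its positions. */
    int count(String term, int documentPointer) {
        Map<Integer, List<Integer>> list = postings.get(term);
        if (list == null || !list.containsKey(documentPointer)) return 0;
        return list.get(documentPointer).size();
    }

    public static void main(String[] args) {
        ToyInversion index = new ToyInversion();
        index.processDocument(0, "a rose is a rose");
        index.processDocument(1, "a daisy");
        System.out.println(index.postings);
    }
}
```

Dropping the position lists and keeping only their sizes gives an index with pointers and counts; keeping only the keys of the inner map gives a pointers-only index.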
The indexing type of a standard index can be Scan.IndexingType.STANDARD, Scan.IndexingType.REMAPPED or Scan.IndexingType.VIRTUAL. In the first case, we index the words occurring in documents as usual. In the second case, before writing the index all documents are renumbered following a provided map. In the third case (used only with DocumentFactory.FieldType.VIRTUAL fields) indexing is performed on a virtual document obtained by collating a number of fragments. Fragments are associated with documents by some key, and a VirtualDocumentResolver turns a key into a document natural number, so that the collation process can take place (a settable gap is inserted between fragments).
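The collation of fragments into a virtual document can be sketched as simple position bookkeeping. The class `VirtualCollation` is hypothetical; in particular, adding the gap after each fragment (rather than before it) is an assumption made for the example and may not match what MG4J does.

```java
// Hypothetical sketch, not MG4J code: tracks where the next fragment of each virtual document starts.
final class VirtualCollation {
    private final int[] currSize; // current size of each virtual document, gaps included
    private final int gap;        // artificial gap inserted between fragments

    VirtualCollation(int numVirtualDocs, int gap) {
        this.currSize = new int[numVirtualDocs];
        this.gap = gap;
    }

    /** Appends a fragment and returns the position assigned to its first word. */
    int append(int virtualDocument, int fragmentLength) {
        int start = currSize[virtualDocument];
        // Assumption: the gap follows every fragment, so the next one starts after it.
        currSize[virtualDocument] = start + fragmentLength + gap;
        return start;
    }

    public static void main(String[] args) {
        VirtualCollation collation = new VirtualCollation(2, 64);
        System.out.println(collation.append(0, 3)); // first fragment of document 0
        System.out.println(collation.append(0, 5)); // second fragment of document 0
    }
}
```

The gap keeps words of different fragments far apart, so that positional operators are unlikely to match across fragment boundaries.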
Besides storing document pointers, counts, and positions, MG4J makes it possible to store an arbitrary payload with each posting. This feature is presently used only to create payload-based indices, that is, indices without counts and positions that contain a single, dummy word #. They are actually used to store arbitrary data associated with each document, such as dates and integers: using a special syntax, it is then possible to specify range queries on the values of such fields.
The main difference between standard and payload-based indices is that the first type is handled by instances of this class, whereas the second type is handled by instances of Scan.PayloadAccumulator. The run() method creates a set of suitable instances, one for each indexed field, and feeds them in parallel with data from the appropriate field of the same document.
Note that this class uses an internal hack that mimics BitStreamIndexWriter to perform a lightweight in-memory inversion that directly generates compressed posting lists. As a consequence, codes are fixed (see CompressionFlags.DEFAULT_STANDARD_INDEX and CompressionFlags.DEFAULT_PAYLOAD_INDEX). The only choice you have is the completeness of the index, which can range from pointers to counts up to full positions. More sophisticated choices (e.g., coding, skipping structures, etc.) can be obtained when combining the batches.
Building collections while indexing

During the indexing process, a DocumentCollectionBuilder can be used to generate a document collection that copies the sequence used to generate the index. While any builder can be passed to the run() method, specifying a builder class on the command line requires that the class provide a constructor accepting a basename for the generated collection (CharSequence), the original factory (DocumentFactory) and a boolean that specifies whether the collection built should be exact (i.e., whether it should index nonwords).
A collection will be generated for each batch (the basename will be the same), so each batch can be used separately as an index with its associated collection. Finally, a ConcatenatedDocumentCollection will be used to concatenate virtually the collections associated with batches, thus providing a global collection.
Batch subdivision and content

The scanning process will try to build batches containing exactly the specified number of documents per batch (for all indexed fields). There are of course space constraints that could make building exact batches impossible, as the entire data of a batch must fit into core memory. If memory is too low, a batch will be generated with fewer documents than expected. There is also a maximum number of terms allowed, as a very large number of terms (more than a few million) can cause massive garbage collection: in that case, it is better to dump a batch and just start a new one.
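The two limits described above (documents per batch and maximum number of terms) can be simulated with a small sketch. The class `BatchPolicy` is hypothetical and ignores memory pressure; it only shows how either limit can close a batch early.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not MG4J code: decides where batches end for a list of tokenised documents.
final class BatchPolicy {
    /** Returns the number of documents placed in each batch. */
    static List<Integer> split(List<String[]> documents, int documentsPerBatch, int maxTerms) {
        List<Integer> batchSizes = new ArrayList<>();
        Set<String> terms = new HashSet<>(); // distinct terms seen in the current batch
        int inBatch = 0;
        for (String[] document : documents) {
            terms.addAll(Arrays.asList(document));
            inBatch++;
            // Close the batch when it is full or when it has accumulated too many terms.
            if (inBatch == documentsPerBatch || terms.size() >= maxTerms) {
                batchSizes.add(inBatch);
                terms.clear();
                inBatch = 0;
            }
        }
        if (inBatch > 0) batchSizes.add(inBatch); // last, possibly smaller, batch
        return batchSizes;
    }

    public static void main(String[] args) {
        List<String[]> documents = Arrays.asList(
            new String[] { "a", "b" }, new String[] { "c" }, new String[] { "d", "e", "f" },
            new String[] { "a" }, new String[] { "b" });
        System.out.println(split(documents, 3, 100)); // limited by documents per batch
        System.out.println(split(documents, 10, 3));  // limited by the number of terms
    }
}
```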
The larger the number of documents in a batch, the quicker index construction will be. Usually, a few experiments and a look at the logs suffice to find good values for the Java virtual machine maximum memory setting, the number of documents per batch and the maximum number of terms (these depend on the structure of the collection you are indexing).
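When experimenting with these parameters, it can help to see how a percentage of available memory may be computed from the standard Runtime API. The sketch below is not the check performed by this class: the class `MemoryCheck` is hypothetical and the threshold passed to `shouldFlush` is an illustrative parameter, not the value of PERC_AVAILABLE_MEMORY_DUMP.

```java
// Hypothetical sketch, not MG4J code: estimates the percentage of heap still available.
final class MemoryCheck {
    /** Returns the percentage (0..100) of the maximum heap that is not currently in use. */
    static int percAvailableMemory() {
        Runtime runtime = Runtime.getRuntime();
        long used = runtime.totalMemory() - runtime.freeMemory();
        long max = runtime.maxMemory();
        // Floating point avoids overflow when maxMemory() reports a very large limit.
        return (int) (100.0 * (max - used) / max);
    }

    /** Returns true if the available percentage is below the given (illustrative) threshold. */
    static boolean shouldFlush(int percAvailable, int dumpThreshold) {
        return percAvailable < dumpThreshold;
    }

    public static void main(String[] args) {
        int perc = percAvailableMemory();
        System.out.println("Available memory: " + perc + "%, flush: " + shouldFlush(perc, 10));
    }
}
```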
Each batch is an interleaved index. Using a suitable option, you can get for each batch an additional file with extension .terms.unsorted containing the list of indexed terms in the same order in which they were met in the document collection.
Finally, a file with extension .cluster.properties contains information about the set of batches seen as a DocumentalCluster. Besides the standard keys, the file contains IndexCluster.PropertyKeys.LOCALINDEX entries specifying the basenames of the batches, and a strategy. After manually creating suitable term maps for each batch, you will be able to access the set of batches as a single index (note, however, that standard batches are very compact but provide slow access).
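The cutpoints recorded for the batches (see the cutPoints field) suggest how a contiguous strategy can locate a document. The sketch below is an assumption about that lookup, not the code of ContiguousDocumentalStrategy: the class `CutPointLookup` is hypothetical and expects strictly increasing cutpoints, where batch k covers the document pointers from cutPoints[k] (inclusive) to cutPoints[k + 1] (exclusive).

```java
import java.util.Arrays;

// Hypothetical sketch, not MG4J code: maps a global document pointer to a batch and a local pointer.
final class CutPointLookup {
    /** Returns the index of the batch containing the given global pointer. */
    static int localIndex(long[] cutPoints, long globalPointer) {
        if (globalPointer < cutPoints[0] || globalPointer >= cutPoints[cutPoints.length - 1]) {
            throw new IndexOutOfBoundsException("Pointer out of range: " + globalPointer);
        }
        int pos = Arrays.binarySearch(cutPoints, globalPointer);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the batch precedes the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Returns the pointer of the document inside its batch. */
    static long localPointer(long[] cutPoints, long globalPointer) {
        return globalPointer - cutPoints[localIndex(cutPoints, globalPointer)];
    }

    public static void main(String[] args) {
        long[] cutPoints = { 0, 3, 5, 7 }; // three batches with 3, 2 and 2 documents
        System.out.println(localIndex(cutPoints, 4) + " " + localPointer(cutPoints, 4));
    }
}
```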

Since:

1.0

Author:

Sebastiano Vigna

Nested Class Summary

Nested Classes
Modifier and Type	Class	Description
`static class`	`Scan.Completeness`
`static class`	`Scan.IndexingType`
`protected static class`	`Scan.PayloadAccumulator`	An accumulator for payloads.
`static interface`	`Scan.VirtualDocumentFragment`	An interface that describes a virtual document fragment.

Field Summary

Fields
Modifier and Type	Field	Description
`static String`	`CLUSTER_PROPERTIES_EXTENSION`	The extension of the strategy for the cluster associated with a scan.
`protected int[][]`	`currSize`	A big array containing the current maximum size for each document, if the field indexed is virtual.
`protected LongArrayList`	`cutPoints`	The cutpoints of the batches (for building later a `ContiguousDocumentalStrategy`).
`static int`	`DEFAULT_BATCH_SIZE`	The default batch size.
`static int`	`DEFAULT_BUFFER_SIZE`	The default buffer size.
`static int`	`DEFAULT_DELIMITER`	The default delimiter separating two documents read from standard input (a newline).
`static int`	`DEFAULT_MAX_TERMS`	The default maximum number of terms.
`static int`	`DEFAULT_VIRTUAL_DOCUMENT_GAP`	The default virtual field gap.
`static int`	`INITIAL_TERM_MAP_SIZE`	The initial size of the term map.
`boolean`	`outOfMemoryError`	If true, this class experienced an `OutOfMemoryError` during some buffer reallocation.
`static int`	`PERC_AVAILABLE_MEMORY_CHECK`	When available memory goes below this threshold, we try a compaction.
`static int`	`PERC_AVAILABLE_MEMORY_DUMP`	If after compaction there is less memory (in percentage) than this value, we will flush the current batch.
`protected int`	`virtualDocumentGap`	The width of the artificial gap introduced between virtual-document fragments.

Constructor Summary

Constructors
Constructor	Description
`Scan(IOFactory ioFactory, String basename, String field, Scan.Completeness completeness, TermProcessor termProcessor, Scan.IndexingType indexingType, long numVirtualDocs, int virtualDocumentGap, int bufferSize, DocumentCollectionBuilder builder, File batchDir)`	Creates a new scanner instance.
`Scan(String basename, String field, TermProcessor termProcessor, boolean documentsAreInOrder, int bufferSize, DocumentCollectionBuilder builder, File batchDir)`	Deprecated.
`Scan(String basename, String field, TermProcessor termProcessor, Scan.IndexingType indexingType, int bufferSize, DocumentCollectionBuilder builder, File batchDir)`	Deprecated.
`Scan(String basename, String field, Scan.Completeness completeness, TermProcessor termProcessor, Scan.IndexingType indexingType, int numVirtualDocs, int virtualDocumentGap, int bufferSize, DocumentCollectionBuilder builder, File batchDir)`	Creates a new scanner instance using the `IOFactory.FILESYSTEM_FACTORY`.

Method Summary

Modifier and Type	Method	Description
`protected static String`	`batchBasename(int batch, String basename, File batchDir)`	Returns the name of a batch.
`static void`	`cleanup(IOFactory ioFactory, String basename, int batches, File batchDir)`	Cleans all intermediate files generated by a run of this class.
`void`	`close()`	Closes this pass, releasing all resources.
`protected long`	`dumpBatch()`	Dumps the current batch on disk as an index.
`static DocumentSequence`	`getSequence(String sequenceName, Class<?> factoryClass, String[] property, int delimiter, Logger logger)`	Returns the document sequence to be indexed.
`static void`	`main(String[] arg)`
`protected void`	`openSizeBitStream()`
`static int[]`	`parseFieldNames(String[] indexedFieldName, DocumentFactory factory, boolean allSupported)`
`static IOFactory`	`parseIOFactory(String ioFactorySpec)`
`static int[]`	`parseQualifiedSizes(String[] qualifiedSizes, String defaultSize, int[] indexedField, DocumentFactory factory)`
`static int[]`	`parseVirtualDocumentGap(String[] virtualDocumentGapSpec, int[] indexedField, DocumentFactory factory)`
`static VirtualDocumentResolver[]`	`parseVirtualDocumentResolver(IOFactory ioFactory, String[] virtualDocumentSpec, int[] indexedField, DocumentFactory factory)`
`void`	`processDocument(long documentPointer, WordReader wordReader)`	Processes a document.
`static void`	`run(IOFactory ioFactory, String basename, DocumentSequence documentSequence, Scan.Completeness completeness, TermProcessor termProcessor, DocumentCollectionBuilder builder, int bufferSize, int documentsPerBatch, int maxTerms, int[] indexedField, VirtualDocumentResolver[] virtualDocumentResolver, int[] virtualGap, String mapFile, long logInterval, String tempDirName)`	Runs in parallel a number of instances.
`static void`	`run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, DocumentCollectionBuilder builder, int bufferSize, int documentsPerBatch, int maxTerms, int[] indexedField, VirtualDocumentResolver[] virtualDocumentResolver, int[] virtualGap, String mapFile, long logInterval, String tempDirName)`	Deprecated.
`static void`	`run(String basename, DocumentSequence documentSequence, Scan.Completeness completeness, TermProcessor termProcessor, DocumentCollectionBuilder builder, int bufferSize, int documentsPerBatch, int maxTerms, int[] indexedField, VirtualDocumentResolver[] virtualDocumentResolver, int[] virtualGap, String mapFile, long logInterval, String tempDirName)`	Runs in parallel a number of instances using the `IOFactory.FILESYSTEM_FACTORY`.
`static void`	`saveProperties(IOFactory ioFactory, Properties properties, String filename)`
`String`	`toString()`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Field Detail
- PERC_AVAILABLE_MEMORY_CHECK
```
public static final int PERC_AVAILABLE_MEMORY_CHECK
```
  When available memory goes below this threshold, we try a compaction.
  
  See Also:
  
  Constant Field Values
- PERC_AVAILABLE_MEMORY_DUMP
```
public static final int PERC_AVAILABLE_MEMORY_DUMP
```
  If after compaction there is less memory (in percentage) than this value, we will flush the current batch.
  
  See Also:
  
  Constant Field Values
- CLUSTER_PROPERTIES_EXTENSION
```
public static final String CLUSTER_PROPERTIES_EXTENSION
```
  The extension of the strategy for the cluster associated with a scan.
  
  See Also:
  
  Constant Field Values
- INITIAL_TERM_MAP_SIZE
```
public static final int INITIAL_TERM_MAP_SIZE
```
  The initial size of the term map.
  
  See Also:
  
  Constant Field Values
- outOfMemoryError
```
public boolean outOfMemoryError
```
  If true, this class experienced an OutOfMemoryError during some buffer reallocation.
- currSize
```
protected int[][] currSize
```
  A big array containing the current maximum size for each document, if the field indexed is virtual.
- virtualDocumentGap
```
protected int virtualDocumentGap
```
  The width of the artificial gap introduced between virtual-document fragments.
- cutPoints
```
protected final LongArrayList cutPoints
```
  The cutpoints of the batches (for building later a ContiguousDocumentalStrategy).
- DEFAULT_DELIMITER
```
public static final int DEFAULT_DELIMITER
```
  The default delimiter separating two documents read from standard input (a newline).
  
  See Also:
  
  Constant Field Values
- DEFAULT_BATCH_SIZE
```
public static final int DEFAULT_BATCH_SIZE
```
  The default batch size.
  
  See Also:
  
  Constant Field Values
- DEFAULT_MAX_TERMS
```
public static final int DEFAULT_MAX_TERMS
```
  The default maximum number of terms.
  
  See Also:
  
  Constant Field Values
- DEFAULT_BUFFER_SIZE
```
public static final int DEFAULT_BUFFER_SIZE
```
  The default buffer size.
  
  See Also:
  
  Constant Field Values
- DEFAULT_VIRTUAL_DOCUMENT_GAP
```
public static final int DEFAULT_VIRTUAL_DOCUMENT_GAP
```
  The default virtual field gap.
  
  See Also:
  
  Constant Field Values

Constructor Detail
- Scan
```
@Deprecated
public Scan(String basename,
            String field,
            TermProcessor termProcessor,
            boolean documentsAreInOrder,
            int bufferSize,
            DocumentCollectionBuilder builder,
            File batchDir)
     throws IOException
```
  Deprecated.
  
  Creates a new scanner instance using the IOFactory.FILESYSTEM_FACTORY.
  
  Parameters:
  
  basename - the basename (usually a global filename followed by the field name, separated by a dash).
  
  field - the field to be indexed.
  
  termProcessor - the term processor for this index.
  
  documentsAreInOrder - if true, documents will be served in increasing order.
  
  bufferSize - the buffer size used in all I/O.
  
  builder - a builder used to create a compressed document collection on the fly.
  
  batchDir - a directory for batch files; batch names will be relativised to this directory if it is not null.
  
  Throws:
  
  IOException
- Scan
```
@Deprecated
public Scan(String basename,
            String field,
            TermProcessor termProcessor,
            Scan.IndexingType indexingType,
            int bufferSize,
            DocumentCollectionBuilder builder,
            File batchDir)
     throws IOException
```
  Deprecated.
  
  Creates a new scanner instance.
  
  Throws:
  
  IOException
- Scan
```
public Scan(String basename,
            String field,
            Scan.Completeness completeness,
            TermProcessor termProcessor,
            Scan.IndexingType indexingType,
            int numVirtualDocs,
            int virtualDocumentGap,
            int bufferSize,
            DocumentCollectionBuilder builder,
            File batchDir)
     throws IOException
```
  Creates a new scanner instance using the IOFactory.FILESYSTEM_FACTORY.
  
  Parameters:
  
  basename - the basename (usually a global filename followed by the field name, separated by a dash).
  
  field - the field to be indexed.
  
  termProcessor - the term processor for this index.
  
  indexingType - the type of indexing procedure.
  
  numVirtualDocs - the number of virtual documents that will be used, in case of a virtual index; otherwise, immaterial.
  
  virtualDocumentGap - the artificial gap introduced between virtual-document fragments, in case of a virtual index; otherwise, immaterial.
  
  bufferSize - the buffer size used in all I/O.
  
  builder - a builder used to create a compressed document collection on the fly.
  
  batchDir - a directory for batch files; batch names will be relativised to this directory if it is not null.
  
  Throws:
  
  IOException
- Scan
```
public Scan(IOFactory ioFactory,
            String basename,
            String field,
            Scan.Completeness completeness,
            TermProcessor termProcessor,
            Scan.IndexingType indexingType,
            long numVirtualDocs,
            int virtualDocumentGap,
            int bufferSize,
            DocumentCollectionBuilder builder,
            File batchDir)
     throws IOException
```
  Creates a new scanner instance.
  
  Parameters:
  
  ioFactory - the factory that will be used to perform I/O.
  
  basename - the basename (usually a global filename followed by the field name, separated by a dash).
  
  field - the field to be indexed.
  
  termProcessor - the term processor for this index.
  
  indexingType - the type of indexing procedure.
  
  numVirtualDocs - the number of virtual documents that will be used, in case of a virtual index; otherwise, immaterial.
  
  virtualDocumentGap - the artificial gap introduced between virtual-document fragments, in case of a virtual index; otherwise, immaterial.
  
  bufferSize - the buffer size used in all I/O.
  
  builder - a builder used to create a compressed document collection on the fly.
  
  batchDir - a directory for batch files; batch names will be relativised to this directory if it is not null.
  
  Throws:
  
  IOException

Method Detail

cleanup
```
public static void cleanup(IOFactory ioFactory,
                           String basename,
                           int batches,
                           File batchDir)
                    throws IOException
```
Cleans all intermediate files generated by a run of this class.

Parameters:

ioFactory - the factory that will be used to perform I/O.

basename - the basename of the run.

batches - the number of generated batches.

batchDir - if not null, a temporary directory where the batches are located.

Throws:

IOException

batchBasename
```
protected static String batchBasename(int batch,
                                      String basename,
                                      File batchDir)
```
Returns the name of a batch.
You can override this method if you prefer a different batch naming scheme.

Parameters:

batch - the batch number.

basename - the index basename.

batchDir - if not null, a temporary directory for batches.

Returns:

simply basename@batch, if batchDir is null; otherwise, we relativise the name to batchDir.
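The return value described above can be sketched as follows. The class `BatchBasenameSketch` is hypothetical, and resolving the name with `new File(batchDir, name)` is an assumption about what relativising means here; the actual method may treat basenames containing directories differently.

```java
import java.io.File;

// Hypothetical sketch, not the MG4J implementation of batchBasename().
final class BatchBasenameSketch {
    /** Returns basename@batch, resolved against batchDir when batchDir is not null. */
    static String batchBasename(int batch, String basename, File batchDir) {
        String name = basename + "@" + batch;
        // Assumption: relativising means resolving the name inside batchDir.
        return batchDir == null ? name : new File(batchDir, name).toString();
    }

    public static void main(String[] args) {
        System.out.println(batchBasename(2, "index-title", null));
        System.out.println(batchBasename(2, "index-title", new File("batches")));
    }
}
```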

dumpBatch
```
protected long dumpBatch()
                  throws IOException,
                         org.apache.commons.configuration.ConfigurationException
```
Dumps the current batch on disk as an index.

Returns:

the number of occurrences contained in the batch.

Throws:

IOException

org.apache.commons.configuration.ConfigurationException

openSizeBitStream

protected void openSizeBitStream()
                          throws IOException

Throws:

IOException

run

@Deprecated
public static void run(String basename,
                       DocumentSequence documentSequence,
                       TermProcessor termProcessor,
                       DocumentCollectionBuilder builder,
                       int bufferSize,
                       int documentsPerBatch,
                       int maxTerms,
                       int[] indexedField,
                       VirtualDocumentResolver[] virtualDocumentResolver,
                       int[] virtualGap,
                       String mapFile,
                       long logInterval,
                       String tempDirName)
                throws org.apache.commons.configuration.ConfigurationException,
                       IOException

Deprecated.

Runs in parallel a number of instances, indexing positions.

Throws:

org.apache.commons.configuration.ConfigurationException

IOException

See Also:

run(String, DocumentSequence, it.unimi.di.big.mg4j.tool.Scan.Completeness, TermProcessor, DocumentCollectionBuilder, int, int, int, int[], VirtualDocumentResolver[], int[], String, long, String)

run
```
public static void run(String basename,
                       DocumentSequence documentSequence,
                       Scan.Completeness completeness,
                       TermProcessor termProcessor,
                       DocumentCollectionBuilder builder,
                       int bufferSize,
                       int documentsPerBatch,
                       int maxTerms,
                       int[] indexedField,
                       VirtualDocumentResolver[] virtualDocumentResolver,
                       int[] virtualGap,
                       String mapFile,
                       long logInterval,
                       String tempDirName)
                throws org.apache.commons.configuration.ConfigurationException,
                       IOException
```
Runs in parallel a number of instances using the IOFactory.FILESYSTEM_FACTORY.
This convenience method takes care of instantiating one instance per indexed field and of passing the right information to each instance. All options are common to all fields, except for the number of occurrences in a batch, which can be tuned for each field separately.

Parameters:

basename - the index basename.

documentSequence - a document sequence.

completeness - the completeness level of this run.

termProcessor - the term processor for this index.

builder - if not null, a builder that will be used to create a new collection from documentSequence.

bufferSize - the buffer size used in all I/O.

documentsPerBatch - the number of documents that we should try to put in each segment.

maxTerms - the maximum number of overall (i.e., cross-field) terms in a batch.

indexedField - the fields that should be indexed, in increasing order.

virtualDocumentResolver - the array of virtual document resolvers to be used, parallel to indexedField: it can safely contain anything (even null) in correspondence to non-virtual fields, and can safely be null if no fields are virtual.

virtualGap - the array of virtual field gaps to be used, parallel to indexedField: it can safely contain anything in correspondence to non-virtual fields, and can safely be null if no fields are virtual.

mapFile - the name of a file containing a map to be applied to document indices.

logInterval - the minimum time interval between activity logs in milliseconds.

tempDirName - a directory for temporary files.

Throws:

org.apache.commons.configuration.ConfigurationException

IOException

run
```
public static void run(IOFactory ioFactory,
                       String basename,
                       DocumentSequence documentSequence,
                       Scan.Completeness completeness,
                       TermProcessor termProcessor,
                       DocumentCollectionBuilder builder,
                       int bufferSize,
                       int documentsPerBatch,
                       int maxTerms,
                       int[] indexedField,
                       VirtualDocumentResolver[] virtualDocumentResolver,
                       int[] virtualGap,
                       String mapFile,
                       long logInterval,
                       String tempDirName)
                throws org.apache.commons.configuration.ConfigurationException,
                       IOException
```
Runs in parallel a number of instances.
This convenience method takes care of instantiating one instance per indexed field and of passing the right information to each instance. All options are common to all fields, except for the number of occurrences in a batch, which can be tuned for each field separately.

Parameters:

ioFactory - the factory that will be used to perform I/O.

basename - the index basename.

documentSequence - a document sequence.

completeness - the completeness level of this run.

termProcessor - the term processor for this index.

builder - if not null, a builder that will be used to create a new collection from documentSequence.

bufferSize - the buffer size used in all I/O.

documentsPerBatch - the number of documents that we should try to put in each segment.

maxTerms - the maximum number of overall (i.e., cross-field) terms in a batch.

indexedField - the fields that should be indexed, in increasing order.

virtualDocumentResolver - the array of virtual document resolvers to be used, parallel to indexedField: it can safely contain anything (even null) in correspondence to non-virtual fields, and can safely be null if no fields are virtual.

virtualGap - the array of virtual field gaps to be used, parallel to indexedField: it can safely contain anything in correspondence to non-virtual fields, and can safely be null if no fields are virtual.

mapFile - the name of a file containing a map to be applied to document indices.

logInterval - the minimum time interval between activity logs in milliseconds.

tempDirName - a directory for temporary files.

Throws:

IOException

org.apache.commons.configuration.ConfigurationException

processDocument
```
public void processDocument(long documentPointer,
                            WordReader wordReader)
                     throws IOException
```
Processes a document.

Parameters:

documentPointer - the integer pointer associated with the document.

wordReader - the word reader associated with the document.

Throws:

IOException

saveProperties

public static void saveProperties(IOFactory ioFactory,
                                  Properties properties,
                                  String filename)
                           throws org.apache.commons.configuration.ConfigurationException,
                                  IOException

Throws:

org.apache.commons.configuration.ConfigurationException

IOException

close

public void close()
           throws org.apache.commons.configuration.ConfigurationException,
                  IOException

Closes this pass, releasing all resources.

Throws:

org.apache.commons.configuration.ConfigurationException

IOException

toString
```
public String toString()
```
Overrides:

toString in class Object

parseQualifiedSizes

public static int[] parseQualifiedSizes(String[] qualifiedSizes,
                                        String defaultSize,
                                        int[] indexedField,
                                        DocumentFactory factory)
                                 throws com.martiansoftware.jsap.ParseException

Throws:

com.martiansoftware.jsap.ParseException

parseVirtualDocumentResolver

public static VirtualDocumentResolver[] parseVirtualDocumentResolver(IOFactory ioFactory,
                                                                     String[] virtualDocumentSpec,
                                                                     int[] indexedField,
                                                                     DocumentFactory factory)

parseVirtualDocumentGap

public static int[] parseVirtualDocumentGap(String[] virtualDocumentGapSpec,
                                            int[] indexedField,
                                            DocumentFactory factory)

parseFieldNames

public static int[] parseFieldNames(String[] indexedFieldName,
                                    DocumentFactory factory,
                                    boolean allSupported)

parseIOFactory

public static IOFactory parseIOFactory(String ioFactorySpec)
                                throws IllegalArgumentException,
                                       IllegalAccessException,
                                       ClassNotFoundException,
                                       InvocationTargetException,
                                       InstantiationException,
                                       NoSuchMethodException,
                                       IOException

Throws:

IllegalArgumentException

IllegalAccessException

ClassNotFoundException

InvocationTargetException

InstantiationException

NoSuchMethodException

IOException

getSequence

public static DocumentSequence getSequence(String sequenceName,
                                           Class<?> factoryClass,
                                           String[] property,
                                           int delimiter,
                                           Logger logger)
                                    throws IllegalAccessException,
                                           InvocationTargetException,
                                           NoSuchMethodException,
                                           IOException,
                                           ClassNotFoundException,
                                           InstantiationException,
                                           IllegalArgumentException,
                                           SecurityException

Returns the document sequence to be indexed.

Parameters:

sequenceName - the name of a serialised document sequence, or null for standard input.

factoryClass - the class of the DocumentFactory that should be passed to the document sequence.

property - an array of property strings to be used in the factory initialisation.

delimiter - a delimiter in case we want to use standard input.

logger - a logger.

Returns:

the document sequence to be indexed.

Throws:

IllegalAccessException

InvocationTargetException

NoSuchMethodException

IOException

ClassNotFoundException

InstantiationException

IllegalArgumentException

SecurityException

main

public static void main(String[] arg)
                 throws com.martiansoftware.jsap.JSAPException,
                        InvocationTargetException,
                        NoSuchMethodException,
                        org.apache.commons.configuration.ConfigurationException,
                        ClassNotFoundException,
                        IOException,
                        IllegalAccessException,
                        InstantiationException

Throws:

com.martiansoftware.jsap.JSAPException

InvocationTargetException

NoSuchMethodException

org.apache.commons.configuration.ConfigurationException

ClassNotFoundException

IOException

IllegalAccessException

InstantiationException
