it.unimi.di.mg4j.tool
Class IndexBuilder

java.lang.Object
  extended by it.unimi.di.mg4j.tool.IndexBuilder

public class IndexBuilder
extends Object

An index builder.

An instance of this class exposes a run() method that will index the DocumentSequence provided at construction time by calling Scan and Combine in sequence.

Additionally, a main method provides easy access to index construction.

All indexing parameters are available either as chainable setters that can be called optionally before invoking run(), or as public mutable collections and maps. For instance,

 new IndexBuilder( "foo", sequence ).skips( true ).run();
 
will build an index with basename foo using skips. If instead we want to index just the first field of the sequence, and use a ShiftAddXorSignedStringMap as a term map, we can use the following code:
 new IndexBuilder( "foo", sequence )
     .termMapClass( ShiftAddXorSignedMinimalPerfectHash.class )
     .indexedFields( 0 ).run();
 

More sophisticated modifications can be applied using public maps:

 IndexBuilder indexBuilder = new IndexBuilder( "foo", sequence );
 indexBuilder.virtualDocumentGaps.put( 0, 30 );
 indexBuilder.virtualDocumentResolvers.put( 0, someVirtualDocumentResolver );
 indexBuilder.run();
 


Field Summary
 IntSortedSet indexedFields
          The set of indexed fields (expressed as field indices).
 Int2IntMap virtualDocumentGaps
          A map from field indices to virtual gaps.
 Int2ObjectMap<VirtualDocumentResolver> virtualDocumentResolvers
          A map from field indices to a corresponding VirtualDocumentResolver.
 
Constructor Summary
IndexBuilder(String basename, DocumentSequence documentSequence)
          Creates a new index builder with default parameters.
 
Method Summary
 IndexBuilder batchDirName(String batchDirName)
          Sets the temporary directory for batches (default: the directory containing the basename).
 IndexBuilder bufferSize(int bufferSize)
          Sets both the scan buffer size and the combine buffer size.
 IndexBuilder builder(DocumentCollectionBuilder builder)
          Sets the document collection builder (default: null).
 IndexBuilder combineBufferSize(int bufferSize)
          Sets the Combine buffer size (default: Combine.DEFAULT_BUFFER_SIZE).
 IndexBuilder documentsPerBatch(int documentsPerBatch)
          Sets the number of documents per batch (default: Scan.DEFAULT_BATCH_SIZE).
 IndexBuilder height(int height)
          Sets the skip height (default: BitStreamIndex.DEFAULT_HEIGHT).
 IndexBuilder indexedFields(int... field)
          Sets the indexed fields to those provided (default: all fields, but see indexedFields).
 IndexBuilder indexType(Combine.IndexType indexType)
          Sets the type of the index to be built (default: Combine.IndexType.QUASI_SUCCINCT).
 IndexBuilder interleaved(boolean interleaved)
          Sets the interleaved flag (default: false).
 IndexBuilder ioFactory(IOFactory ioFactory)
          Sets the I/O factory (default: IOFactory.FILESYSTEM_FACTORY).
 IndexBuilder keepBatches(boolean keepBatches)
          Sets the “keep batches” flag (default: false).
 IndexBuilder logInterval(long logInterval)
          Sets the logging time interval (default: ProgressLogger.DEFAULT_LOG_INTERVAL).
static void main(String[] arg)
           
 IndexBuilder mapFile(String mapFile)
          Sets the name of a file containing a map on the document indices (default: null).
 IndexBuilder maxTerms(int maxTerms)
          Sets the maximum number of overall (i.e., cross-field) terms per batch (default: Scan.DEFAULT_BATCH_SIZE).
 IndexBuilder pasteBufferSize(int bufferSize)
          Sets the size in bytes of the internal buffer used when pasting indices (default: Paste.DEFAULT_MEMORY_BUFFER_SIZE).
 IndexBuilder payloadWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> payloadWriterFlags)
          Sets the writer compression flags for payload-based indices (default: CompressionFlags.DEFAULT_PAYLOAD_INDEX).
 IndexBuilder quantum(int quantum)
          Sets the skip quantum (default: BitStreamIndex.DEFAULT_QUANTUM).
 IndexBuilder quasiSuccinctWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> quasiSuccinctWriterFlags)
          Sets the writer compression flags for quasi-succinct indices (default: CompressionFlags.DEFAULT_QUASI_SUCCINCT_INDEX).
 void run()
          Builds the index.
 IndexBuilder scanBufferSize(int bufferSize)
          Sets the Scan buffer size (default: Scan.DEFAULT_BUFFER_SIZE).
 IndexBuilder skipBufferSize(int bufferSize)
          Sets the size in bytes of the internal buffer used during the construction of an index with skips (default: SkipBitStreamIndexWriter.DEFAULT_TEMP_BUFFER_SIZE).
 IndexBuilder skips(boolean skips)
          Sets the skip flag (default: true).
 IndexBuilder standardWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> standardWriterFlags)
          Sets the writer compression flags for standard indices (default: CompressionFlags.DEFAULT_STANDARD_INDEX).
 IndexBuilder termMapClass(Class<? extends StringMap<? extends CharSequence>> termMapClass)
          Sets the class used to build the index term map (default: ImmutableExternalPrefixMap).
 IndexBuilder termProcessor(TermProcessor termProcessor)
          Sets the term processor (default: DowncaseTermProcessor).
 IndexBuilder virtualDocumentResolver(int field, VirtualDocumentResolver virtualDocumentResolver)
          Adds a virtual document resolver to virtualDocumentResolvers.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

indexedFields

public IntSortedSet indexedFields
The set of indexed fields (expressed as field indices). If left empty, all fields will be indexed, with the proviso that fields of type DocumentFactory.FieldType.VIRTUAL will be indexed only if they have a corresponding VirtualDocumentResolver.

An alternative, chained access to this set is provided by the method indexedFields(int[]).

After calling run(), this set will contain the fields actually indexed.
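
The rule for an empty set can be illustrated with a small, dependency-free sketch. It is not the library's implementation: the class IndexedFieldsSketch, the method effectiveFields and the two-value FieldType enum are hypothetical stand-ins, and the handling of a non-empty set (kept as given) is an assumption.

```java
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

final class IndexedFieldsSketch {
    /** Simplified, hypothetical stand-in for DocumentFactory.FieldType. */
    enum FieldType { TEXT, VIRTUAL }

    /**
     * Computes the fields that would be indexed under the documented rule:
     * an empty request means "all fields", except virtual fields that have
     * no resolver.
     */
    static SortedSet<Integer> effectiveFields(SortedSet<Integer> requested,
                                              FieldType[] types,
                                              Set<Integer> fieldsWithResolver) {
        // Assumption: an explicit, non-empty choice is kept as given.
        if (!requested.isEmpty()) return new TreeSet<>(requested);

        SortedSet<Integer> result = new TreeSet<>();
        for (int field = 0; field < types.length; field++) {
            boolean virtual = types[field] == FieldType.VIRTUAL;
            // Virtual fields are included only when a resolver is available.
            if (!virtual || fieldsWithResolver.contains(field)) result.add(field);
        }
        return result;
    }
}
```

With field types TEXT, VIRTUAL, VIRTUAL and a resolver registered only for field 2, an empty request yields the fields 0 and 2.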


virtualDocumentResolvers

public Int2ObjectMap<VirtualDocumentResolver> virtualDocumentResolvers
A map from field indices to a corresponding VirtualDocumentResolver.


virtualDocumentGaps

public Int2IntMap virtualDocumentGaps
A map from field indices to virtual gaps. Only values associated with fields of type DocumentFactory.FieldType.VIRTUAL are meaningful, and the default return value is set to Scan.DEFAULT_VIRTUAL_DOCUMENT_GAP. You can either add entries, or change the default return value.
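
The two ways of customizing gaps (adding an entry, or changing the default return value) can be sketched without any dependency as follows. The class GapLookupSketch is hypothetical and only mimics the lookup behavior of a map with a default return value; the constant ASSUMED_DEFAULT_GAP is a placeholder, not necessarily the value of Scan.DEFAULT_VIRTUAL_DOCUMENT_GAP.

```java
import java.util.HashMap;
import java.util.Map;

final class GapLookupSketch {
    /** Placeholder value; the real default is Scan.DEFAULT_VIRTUAL_DOCUMENT_GAP. */
    static final int ASSUMED_DEFAULT_GAP = 64;

    private final Map<Integer, Integer> gaps = new HashMap<>();
    private int defaultGap = ASSUMED_DEFAULT_GAP;

    /** Sets the gap for a single field. */
    void put(int field, int gap) {
        gaps.put(field, gap);
    }

    /** Changes the gap returned for every field without an explicit entry. */
    void defaultReturnValue(int gap) {
        defaultGap = gap;
    }

    /** Returns the explicit gap for the field, or the default if none was set. */
    int get(int field) {
        return gaps.getOrDefault(field, defaultGap);
    }
}
```

An explicit entry affects one field only, whereas changing the default affects every field that has no entry of its own.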

Constructor Detail

IndexBuilder

public IndexBuilder(String basename,
                    DocumentSequence documentSequence)
Creates a new index builder with default parameters.

Note, in particular, that the resulting index will be a BitStreamHPIndex (unless you require payloads, in which case it will be a BitStreamIndex with skips), and that all terms will be downcased. You can control the type of index more finely using interleaved(boolean) and skips(boolean).

Parameters:
basename - the basename from which all files will be stemmed.
documentSequence - the document sequence to be indexed.
Method Detail

ioFactory

public IndexBuilder ioFactory(IOFactory ioFactory)
Sets the I/O factory (default: IOFactory.FILESYSTEM_FACTORY).

Parameters:
ioFactory - the I/O factory.
Returns:
this index builder.

termProcessor

public IndexBuilder termProcessor(TermProcessor termProcessor)
Sets the term processor (default: DowncaseTermProcessor).

Parameters:
termProcessor - the term processor.
Returns:
this index builder.

builder

public IndexBuilder builder(DocumentCollectionBuilder builder)
Sets the document collection builder (default: null).

Parameters:
builder - a document-collection builder class that will be used to build a collection during the indexing phase.
Returns:
this index builder.

indexedFields

public IndexBuilder indexedFields(int... field)
Sets the indexed fields to those provided (default: all fields, but see indexedFields).

This is a utility method that provides a way to set indexedFields in a chainable way.

Parameters:
field - a list of fields to be indexed, that will replace the current values in indexedFields.
Returns:
this index builder.
See Also:
indexedFields

virtualDocumentResolver

public IndexBuilder virtualDocumentResolver(int field,
                                            VirtualDocumentResolver virtualDocumentResolver)
Adds a virtual document resolver to virtualDocumentResolvers.

This is a utility method that provides a way to put an element into virtualDocumentResolvers in a chainable way.

Parameters:
field - a field index.
virtualDocumentResolver - a virtual document resolver.
Returns:
this index builder.
See Also:
virtualDocumentResolvers

scanBufferSize

public IndexBuilder scanBufferSize(int bufferSize)
Sets the Scan buffer size (default: Scan.DEFAULT_BUFFER_SIZE).

Parameters:
bufferSize - a buffer size for Scan.
Returns:
this index builder.

combineBufferSize

public IndexBuilder combineBufferSize(int bufferSize)
Sets the Combine buffer size (default: Combine.DEFAULT_BUFFER_SIZE).

Parameters:
bufferSize - a buffer size for Combine.
Returns:
this index builder.

bufferSize

public IndexBuilder bufferSize(int bufferSize)
Sets both the scan buffer size and the combine buffer size.

Parameters:
bufferSize - a buffer size.
Returns:
this index builder.

skipBufferSize

public IndexBuilder skipBufferSize(int bufferSize)
Sets the size in bytes of the internal buffer used during the construction of an index with skips (default: SkipBitStreamIndexWriter.DEFAULT_TEMP_BUFFER_SIZE).

Parameters:
bufferSize - a buffer size for SkipBitStreamIndexWriter.
Returns:
this index builder.

pasteBufferSize

public IndexBuilder pasteBufferSize(int bufferSize)
Sets the size in bytes of the internal buffer used when pasting indices (default: Paste.DEFAULT_MEMORY_BUFFER_SIZE).

Parameters:
bufferSize - a buffer size for Paste.
Returns:
this index builder.

documentsPerBatch

public IndexBuilder documentsPerBatch(int documentsPerBatch)
Sets the number of documents per batch (default: Scan.DEFAULT_BATCH_SIZE).

Parameters:
documentsPerBatch - the number of documents Scan will attempt to add to each batch.
Returns:
this index builder.

maxTerms

public IndexBuilder maxTerms(int maxTerms)
Sets the maximum number of overall (i.e., cross-field) terms per batch (default: Scan.DEFAULT_BATCH_SIZE).

Parameters:
maxTerms - the maximum number of overall (i.e., cross-field) terms Scan will attempt to add to each batch.
Returns:
this index builder.

keepBatches

public IndexBuilder keepBatches(boolean keepBatches)
Sets the “keep batches” flag (default: false). If true, the temporary batch files generated during index construction will not be deleted.

Parameters:
keepBatches - the new value for the “keep batches” flag.
Returns:
this index builder.

standardWriterFlags

public IndexBuilder standardWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> standardWriterFlags)
Sets the writer compression flags for standard indices (default: CompressionFlags.DEFAULT_STANDARD_INDEX).

Parameters:
standardWriterFlags - the flags for standard indices.
Returns:
this index builder.

quasiSuccinctWriterFlags

public IndexBuilder quasiSuccinctWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> quasiSuccinctWriterFlags)
Sets the writer compression flags for quasi-succinct indices (default: CompressionFlags.DEFAULT_QUASI_SUCCINCT_INDEX).

Parameters:
quasiSuccinctWriterFlags - the flags for quasi-succinct indices.
Returns:
this index builder.

payloadWriterFlags

public IndexBuilder payloadWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> payloadWriterFlags)
Sets the writer compression flags for payload-based indices (default: CompressionFlags.DEFAULT_PAYLOAD_INDEX).

Parameters:
payloadWriterFlags - the flags for payload-based indices.
Returns:
this index builder.

skips

public IndexBuilder skips(boolean skips)
Sets the skip flag (default: true). If true, the index will have a skipping structure. The flag is a no-op unless you require an interleaved index, as high-performance indices always have skips.

Parameters:
skips - the new value for the skip flag.
Returns:
this index builder.

interleaved

public IndexBuilder interleaved(boolean interleaved)
Sets the interleaved flag (default: false). If true, the index will be forced to be an interleaved index (but note that in a number of cases, such as missing index components or payloads, the index will be necessarily interleaved).

Parameters:
interleaved - the new value for the interleaved flag.
Returns:
this index builder.

indexType

public IndexBuilder indexType(Combine.IndexType indexType)
Sets the type of the index to be built (default: Combine.IndexType.QUASI_SUCCINCT).

Parameters:
indexType - the desired index type.
Returns:
this index builder.

quantum

public IndexBuilder quantum(int quantum)
Sets the skip quantum (default: BitStreamIndex.DEFAULT_QUANTUM).

Parameters:
quantum - the skip quantum.
Returns:
this index builder.

height

public IndexBuilder height(int height)
Sets the skip height (default: BitStreamIndex.DEFAULT_HEIGHT).

Parameters:
height - the skip height.
Returns:
this index builder.

mapFile

public IndexBuilder mapFile(String mapFile)
Sets the name of a file containing a map on the document indices (default: null).

The provided file must contain integers in DataOutput format. There must be as many integers as there are documents in the collection provided at construction time, and the resulting function must be injective (i.e., there must be no duplicates).

Parameters:
mapFile - a file representing a document map (or null for no mapping).
Returns:
this index builder.
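
A map file satisfying the constraints above could be produced along the following lines. This is a minimal sketch using only the standard library: the class DocumentMapSketch and its methods are hypothetical, and it assumes that the i-th integer in the file is the image of document i.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

final class DocumentMapSketch {
    /** Returns true if no value occurs twice, that is, if the map is injective. */
    static boolean isInjective(int[] map) {
        Set<Integer> seen = new HashSet<>();
        for (int target : map) {
            if (!seen.add(target)) return false;
        }
        return true;
    }

    /** Serializes one int per document, in order, in DataOutput format (4 bytes each, big-endian). */
    static byte[] encode(int[] map) {
        if (!isInjective(map)) throw new IllegalArgumentException("The map contains duplicates");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int target : map) out.writeInt(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical map for a three-document collection.
        int[] map = { 2, 0, 1 };
        Path file = args.length > 0 ? Paths.get(args[0]) : Files.createTempFile("docmap", ".bin");
        Files.write(file, encode(map));
        System.out.println("Wrote " + map.length + " integers to " + file);
    }
}
```

The name of the resulting file is what would be passed to mapFile(String).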

logInterval

public IndexBuilder logInterval(long logInterval)
Sets the logging time interval (default: ProgressLogger.DEFAULT_LOG_INTERVAL).

Parameters:
logInterval - the logging time interval.
Returns:
this index builder.

batchDirName

public IndexBuilder batchDirName(String batchDirName)
Sets the temporary directory for batches (default: the directory containing the basename).

Parameters:
batchDirName - the name of the temporary directory for batches, or null for the directory containing the basename.
Returns:
this index builder.

termMapClass

public IndexBuilder termMapClass(Class<? extends StringMap<? extends CharSequence>> termMapClass)
Sets the class used to build the index term map (default: ImmutableExternalPrefixMap).

The only requirement for termMapClass (besides, of course, implementing StringMap) is that of having a public constructor accepting a single parameter of type Iterable<CharSequence>.

Parameters:
termMapClass - the class used to build the index term map.
Returns:
this index builder.
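
The constructor requirement can be illustrated with plain reflection. The sketch below is not how IndexBuilder is known to instantiate the map; it only shows that a class with the required constructor shape can be created reflectively from a list of terms. ToyTermMap is a hypothetical stand-in (a real term map would also implement StringMap), and the names instantiate and position are made up for this example.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;

final class TermMapConstructorSketch {
    /** Hypothetical stand-in for a term map class. */
    public static final class ToyTermMap {
        private final List<String> terms = new ArrayList<>();

        /** The required shape: public, with a single Iterable<CharSequence> parameter. */
        public ToyTermMap(Iterable<CharSequence> terms) {
            for (CharSequence term : terms) this.terms.add(term.toString());
        }

        /** Returns the position of the term in the original list, or -1 if unknown. */
        public long position(CharSequence term) {
            return terms.indexOf(term.toString());
        }
    }

    /** Instantiates a term-map class reflectively from the given terms. */
    static <T> T instantiate(Class<T> termMapClass, Iterable<CharSequence> terms)
            throws ReflectiveOperationException {
        // Generic parameters are erased at run time, so the lookup uses the raw Iterable type.
        Constructor<T> constructor = termMapClass.getConstructor(Iterable.class);
        return constructor.newInstance(terms);
    }
}
```

A class lacking such a constructor would make getConstructor fail with a NoSuchMethodException, which is one of the exceptions declared by run().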

run

public void run()
         throws ConfigurationException,
                SecurityException,
                IOException,
                URISyntaxException,
                ClassNotFoundException,
                InstantiationException,
                IllegalAccessException,
                InvocationTargetException,
                NoSuchMethodException
Builds the index.

This method simply invokes Scan and Combine using the internally stored settings, and finally builds a StringMap.

If the provided document sequence can be iterated over several times, this method can be called several times, too, rebuilding the index each time.

Throws:
ConfigurationException
SecurityException
IOException
URISyntaxException
ClassNotFoundException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException

main

public static void main(String[] arg)
                 throws com.martiansoftware.jsap.JSAPException,
                        InvocationTargetException,
                        NoSuchMethodException,
                        IllegalAccessException,
                        ConfigurationException,
                        ClassNotFoundException,
                        IOException,
                        InstantiationException,
                        URISyntaxException,
                        SecurityException,
                        IllegalArgumentException
Throws:
com.martiansoftware.jsap.JSAPException
InvocationTargetException
NoSuchMethodException
IllegalAccessException
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
URISyntaxException
SecurityException
IllegalArgumentException