Class IndexBuilder
- java.lang.Object
-
- it.unimi.di.big.mg4j.tool.IndexBuilder
-
public class IndexBuilder extends Object
An index builder.
An instance of this class exposes a run() method that will index the DocumentSequence provided at construction time by calling Scan and Combine in sequence.
Additionally, a main method provides easy access to index construction.
All indexing parameters are available either as chainable setters that can optionally be called before invoking run(), or as public mutable collections and maps. For instance,
new IndexBuilder( "foo", sequence ).skips( true ).run();
will build an index with basename foo using skips. If instead we want to index just the first field of the sequence, and use a ShiftAddXorSignedStringMap as a term map, we can use the following code:
new IndexBuilder( "foo", sequence ).termMapClass( ShiftAddXorSignedMinimalPerfectHash.class ).indexedFields( 0 ).run();
More sophisticated modifications can be applied using public maps:
IndexBuilder indexBuilder = new IndexBuilder( "foo", sequence );
indexBuilder.virtualDocumentGaps.put( 0, 30 );
indexBuilder.virtualDocumentResolvers.put( 0, someVirtualDocumentResolver );
indexBuilder.run();
-
-
Field Summary
Fields
| Modifier and Type | Field | Description |
|---|---|---|
| IntSortedSet | indexedFields | The set of indexed fields (expressed as field indices). |
| Int2IntMap | virtualDocumentGaps | A map from field indices to virtual gaps. |
| Int2ObjectMap<VirtualDocumentResolver> | virtualDocumentResolvers | A map from field indices to a corresponding VirtualDocumentResolver. |
-
Constructor Summary
Constructors
| Constructor | Description |
|---|---|
| IndexBuilder(String basename, DocumentSequence documentSequence) | Creates a new index builder with default parameters. |
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| IndexBuilder | batchDirName(String batchDirName) | Sets the temporary directory for batches (default: the directory containing the basename). |
| IndexBuilder | bufferSize(int bufferSize) | Sets both the scan buffer size and the combine buffer size. |
| IndexBuilder | builder(DocumentCollectionBuilder builder) | Sets the document collection builder (default: null). |
| IndexBuilder | combineBufferSize(int bufferSize) | Sets the Combine buffer size (default: Combine.DEFAULT_BUFFER_SIZE). |
| IndexBuilder | documentsPerBatch(int documentsPerBatch) | Sets the number of documents per batch (default: Scan.DEFAULT_BATCH_SIZE). |
| IndexBuilder | height(int height) | Sets the skip height (default: BitStreamIndex.DEFAULT_HEIGHT). |
| IndexBuilder | indexedFields(int... field) | Sets the indexed fields to those provided (default: all fields, but see indexedFields). |
| IndexBuilder | indexType(Combine.IndexType indexType) | Sets the type of the index to be built (default: Combine.IndexType.QUASI_SUCCINCT). |
| IndexBuilder | interleaved(boolean interleaved) | Sets the interleaved flag (default: false). |
| IndexBuilder | ioFactory(IOFactory ioFactory) | Sets the I/O factory (default: IOFactory.FILESYSTEM_FACTORY). |
| IndexBuilder | keepBatches(boolean keepBatches) | Sets the “keep batches” flag (default: false). |
| IndexBuilder | logInterval(long logInterval) | Sets the logging time interval (default: ProgressLogger.DEFAULT_LOG_INTERVAL). |
| static void | main(String[] arg) | |
| IndexBuilder | mapFile(String mapFile) | Sets the name of a file containing a map on the document indices (default: null). |
| IndexBuilder | maxTerms(int maxTerms) | Sets the maximum number of overall (i.e., cross-field) terms per batch (default: Scan.DEFAULT_BATCH_SIZE). |
| IndexBuilder | pasteBufferSize(int bufferSize) | Sets the size in bytes of the internal buffer used when pasting indices (default: Paste.DEFAULT_MEMORY_BUFFER_SIZE). |
| IndexBuilder | payloadWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> payloadWriterFlags) | Sets the writer compression flags for payload-based indices (default: CompressionFlags.DEFAULT_PAYLOAD_INDEX). |
| IndexBuilder | quantum(int quantum) | Sets the skip quantum (default: BitStreamIndex.DEFAULT_QUANTUM). |
| IndexBuilder | quasiSuccinctWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> quasiSuccinctWriterFlags) | Sets the writer compression flags for quasi-succinct indices (default: CompressionFlags.DEFAULT_QUASI_SUCCINCT_INDEX). |
| void | run() | Builds the index. |
| IndexBuilder | scanBufferSize(int bufferSize) | Sets the Scan buffer size (default: Scan.DEFAULT_BUFFER_SIZE). |
| IndexBuilder | skipBufferSize(int bufferSize) | Sets the size in bytes of the internal buffer used during the construction of an index with skips (default: SkipBitStreamIndexWriter.DEFAULT_TEMP_BUFFER_SIZE). |
| IndexBuilder | skips(boolean skips) | Sets the skip flag (default: true). |
| IndexBuilder | standardWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> standardWriterFlags) | Sets the writer compression flags for standard indices (default: CompressionFlags.DEFAULT_STANDARD_INDEX). |
| IndexBuilder | termMapClass(Class<? extends StringMap<? extends CharSequence>> termMapClass) | Sets the class used to build the index term map (default: ImmutableExternalPrefixMap). |
| IndexBuilder | termProcessor(TermProcessor termProcessor) | Sets the term processor (default: DowncaseTermProcessor). |
| IndexBuilder | virtualDocumentResolver(int field, VirtualDocumentResolver virtualDocumentResolver) | Adds a virtual document resolver to virtualDocumentResolvers. |
-
-
-
Field Detail
-
indexedFields
public IntSortedSet indexedFields
The set of indexed fields (expressed as field indices). If left empty, all fields will be indexed, with the proviso that fields of type DocumentFactory.FieldType.VIRTUAL will be indexed only if they have a corresponding VirtualDocumentResolver.
An alternative, chained access to this set is provided by the method indexedFields(int[]).
After calling run(), this set will contain the fields actually indexed.
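The rule for an empty set can be sketched in plain Java. This is only an illustration under stated assumptions, not MG4J code: IndexedFieldsSketch, its two-value FieldType enum and effectiveFields are hypothetical names, and a non-empty request is passed through unchanged because the text above does not say how it is validated.

```java
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical illustration of the "empty means all fields" rule; not part of MG4J. */
final class IndexedFieldsSketch {
    /** Simplified stand-in for DocumentFactory.FieldType (assumption: only two kinds matter here). */
    enum FieldType { TEXT, VIRTUAL }

    private IndexedFieldsSketch() {}

    /**
     * Returns the fields that would be indexed. An empty request selects every
     * field, except VIRTUAL fields that have no resolver. A non-empty request
     * is copied as is (assumption: validation is out of scope for this sketch).
     */
    static SortedSet<Integer> effectiveFields(SortedSet<Integer> requested, FieldType[] types, Set<Integer> fieldsWithResolver) {
        if (!requested.isEmpty()) return new TreeSet<>(requested);
        SortedSet<Integer> result = new TreeSet<>();
        for (int field = 0; field < types.length; field++) {
            // Non-virtual fields are always kept; virtual ones need a resolver.
            if (types[field] != FieldType.VIRTUAL || fieldsWithResolver.contains(field)) result.add(field);
        }
        return result;
    }
}
```

With field types TEXT, VIRTUAL, VIRTUAL and a resolver registered only for field 2, an empty request selects fields 0 and 2.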
-
virtualDocumentResolvers
public Int2ObjectMap<VirtualDocumentResolver> virtualDocumentResolvers
A map from field indices to a corresponding VirtualDocumentResolver.
-
virtualDocumentGaps
public Int2IntMap virtualDocumentGaps
A map from field indices to virtual gaps. Only values associated with fields of type DocumentFactory.FieldType.VIRTUAL are meaningful, and the default return value is set to Scan.DEFAULT_VIRTUAL_DOCUMENT_GAP. You can either add entries, or change the default return value.
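The two ways of customizing gaps (per-field entries versus a new default) behave as in the following sketch. GapTable is a hypothetical stand-in written with java.util only; it is not the Int2IntMap used by this class, and the numbers in the usage note are arbitrary placeholders rather than the actual value of Scan.DEFAULT_VIRTUAL_DOCUMENT_GAP.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an int-to-int map with a configurable default return value. */
final class GapTable {
    private final Map<Integer, Integer> gaps = new HashMap<>();
    private int defaultGap;

    GapTable(int defaultGap) {
        this.defaultGap = defaultGap;
    }

    /** Sets the gap for a single field, overriding the default for that field only. */
    void put(int field, int gap) {
        gaps.put(field, gap);
    }

    /** Changes the value returned for every field that has no explicit entry. */
    void defaultGap(int gap) {
        defaultGap = gap;
    }

    /** Returns the explicit gap for the field, or the default if none was set. */
    int get(int field) {
        return gaps.getOrDefault(field, defaultGap);
    }
}
```

For example, after new GapTable( 64 ) and put( 0, 30 ), field 0 reports a gap of 30 while every other field reports 64; calling defaultGap( 16 ) then changes the answer for those other fields only.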
-
-
Constructor Detail
-
IndexBuilder
public IndexBuilder(String basename, DocumentSequence documentSequence)
Creates a new index builder with default parameters.
Note, in particular, that the resulting index will be a BitStreamHPIndex (unless you require payloads, in which case it will be a BitStreamIndex with skips), and that all terms will be downcased. You can control the type of index more finely using interleaved(boolean) and skips(boolean).
- Parameters:
basename - the basename from which all files will be stemmed.
documentSequence - the document sequence to be indexed.
-
-
Method Detail
-
ioFactory
public IndexBuilder ioFactory(IOFactory ioFactory)
Sets the I/O factory (default: IOFactory.FILESYSTEM_FACTORY).
- Parameters:
ioFactory - the I/O factory.
- Returns:
this index builder.
-
termProcessor
public IndexBuilder termProcessor(TermProcessor termProcessor)
Sets the term processor (default: DowncaseTermProcessor).
- Parameters:
termProcessor - the term processor.
- Returns:
this index builder.
-
builder
public IndexBuilder builder(DocumentCollectionBuilder builder)
Sets the document collection builder (default: null).
- Parameters:
builder - a document-collection builder that will be used to build a collection during the indexing phase.
- Returns:
this index builder.
-
indexedFields
public IndexBuilder indexedFields(int... field)
Sets the indexed fields to those provided (default: all fields, but see indexedFields).
This is a utility method that provides a way to set indexedFields in a chainable way.
- Parameters:
field - a list of fields to be indexed, that will replace the current values in indexedFields.
- Returns:
this index builder.
- See Also:
indexedFields
-
virtualDocumentResolver
public IndexBuilder virtualDocumentResolver(int field, VirtualDocumentResolver virtualDocumentResolver)
Adds a virtual document resolver to virtualDocumentResolvers.
This is a utility method that provides a way to put an element into virtualDocumentResolvers in a chainable way.
- Parameters:
field - a field index.
virtualDocumentResolver - a virtual document resolver.
- Returns:
this index builder.
- See Also:
virtualDocumentResolvers
-
scanBufferSize
public IndexBuilder scanBufferSize(int bufferSize)
Sets the Scan buffer size (default: Scan.DEFAULT_BUFFER_SIZE).
- Parameters:
bufferSize - a buffer size for Scan.
- Returns:
this index builder.
-
combineBufferSize
public IndexBuilder combineBufferSize(int bufferSize)
Sets the Combine buffer size (default: Combine.DEFAULT_BUFFER_SIZE).
- Parameters:
bufferSize - a buffer size for Combine.
- Returns:
this index builder.
-
bufferSize
public IndexBuilder bufferSize(int bufferSize)
Sets both the scan buffer size and the combine buffer size.
- Parameters:
bufferSize - a buffer size.
- Returns:
this index builder.
-
skipBufferSize
public IndexBuilder skipBufferSize(int bufferSize)
Sets the size in bytes of the internal buffer used during the construction of an index with skips (default: SkipBitStreamIndexWriter.DEFAULT_TEMP_BUFFER_SIZE).
- Parameters:
bufferSize - a buffer size for SkipBitStreamIndexWriter.
- Returns:
this index builder.
-
pasteBufferSize
public IndexBuilder pasteBufferSize(int bufferSize)
Sets the size in bytes of the internal buffer used when pasting indices (default: Paste.DEFAULT_MEMORY_BUFFER_SIZE).
- Parameters:
bufferSize - a buffer size for Paste.
- Returns:
this index builder.
-
documentsPerBatch
public IndexBuilder documentsPerBatch(int documentsPerBatch)
Sets the number of documents per batch (default: Scan.DEFAULT_BATCH_SIZE).
- Parameters:
documentsPerBatch - the number of documents Scan will attempt to add to each batch.
- Returns:
this index builder.
-
maxTerms
public IndexBuilder maxTerms(int maxTerms)
Sets the maximum number of overall (i.e., cross-field) terms per batch (default: Scan.DEFAULT_BATCH_SIZE).
- Parameters:
maxTerms - the maximum number of overall (i.e., cross-field) terms Scan will attempt to add to each batch.
- Returns:
this index builder.
-
keepBatches
public IndexBuilder keepBatches(boolean keepBatches)
Sets the “keep batches” flag (default: false). If true, the temporary batch files generated during index construction will not be deleted.
- Parameters:
keepBatches - the new value for the “keep batches” flag.
- Returns:
this index builder.
-
standardWriterFlags
public IndexBuilder standardWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> standardWriterFlags)
Sets the writer compression flags for standard indices (default: CompressionFlags.DEFAULT_STANDARD_INDEX).
- Parameters:
standardWriterFlags - the flags for standard indices.
- Returns:
this index builder.
-
quasiSuccinctWriterFlags
public IndexBuilder quasiSuccinctWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> quasiSuccinctWriterFlags)
Sets the writer compression flags for quasi-succinct indices (default: CompressionFlags.DEFAULT_QUASI_SUCCINCT_INDEX).
- Parameters:
quasiSuccinctWriterFlags - the flags for quasi-succinct indices.
- Returns:
this index builder.
-
payloadWriterFlags
public IndexBuilder payloadWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> payloadWriterFlags)
Sets the writer compression flags for payload-based indices (default: CompressionFlags.DEFAULT_PAYLOAD_INDEX).
- Parameters:
payloadWriterFlags - the flags for payload-based indices.
- Returns:
this index builder.
-
skips
public IndexBuilder skips(boolean skips)
Sets the skip flag (default: true). If true, the index will have a skipping structure. The flag is a no-op unless you require an interleaved index, as high-performance indices always have skips.
- Parameters:
skips - the new value for the skip flag.
- Returns:
this index builder.
-
interleaved
public IndexBuilder interleaved(boolean interleaved)
Sets the interleaved flag (default: false). If true, the index will be forced to be an interleaved index (but note that in a number of cases, such as missing index components or payloads, the index will necessarily be interleaved).
- Parameters:
interleaved - the new value for the interleaved flag.
- Returns:
this index builder.
-
indexType
public IndexBuilder indexType(Combine.IndexType indexType)
Sets the type of the index to be built (default: Combine.IndexType.QUASI_SUCCINCT).
- Parameters:
indexType - the desired index type.
- Returns:
this index builder.
-
quantum
public IndexBuilder quantum(int quantum)
Sets the skip quantum (default: BitStreamIndex.DEFAULT_QUANTUM).
- Parameters:
quantum - the skip quantum.
- Returns:
this index builder.
-
height
public IndexBuilder height(int height)
Sets the skip height (default: BitStreamIndex.DEFAULT_HEIGHT).
- Parameters:
height - the skip height.
- Returns:
this index builder.
-
mapFile
public IndexBuilder mapFile(String mapFile)
Sets the name of a file containing a map on the document indices (default: null).
The provided file must contain integers in DataOutput format. There must be as many integers as there are documents in the collection provided at construction time, and the resulting function must be injective (i.e., there must be no duplicates).
- Parameters:
mapFile - a file representing a document map (or null for no mapping).
- Returns:
this index builder.
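A map file with the stated properties can be produced and checked with the standard library alone. The sketch below is not part of MG4J: DocumentMapFiles, write and isValid are hypothetical names, and it assumes that "integers in DataOutput format" means 4-byte values written with DataOutput.writeInt.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of MG4J) that writes and validates a document map file. */
final class DocumentMapFiles {
    private DocumentMapFiles() {}

    /** Writes one value per document using DataOutput.writeInt (assumption: 4 bytes each). */
    static void write(Path file, int[] map) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (int target : map) out.writeInt(target);
        }
    }

    /** Returns true if the file holds exactly numberOfDocuments ints and none is repeated. */
    static boolean isValid(Path file, int numberOfDocuments) throws IOException {
        // The size check enforces "as many integers as documents".
        if (Files.size(file) != (long) numberOfDocuments * Integer.BYTES) return false;
        Set<Integer> seen = new HashSet<>();
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(Files.newInputStream(file)))) {
            for (int i = 0; i < numberOfDocuments; i++) {
                // Set.add returns false on a repeated value, i.e. the map is not injective.
                if (!seen.add(in.readInt())) return false;
            }
        }
        return true;
    }
}
```

A file written from { 2, 0, 1 } passes the check for three documents, whereas { 1, 1, 0 } fails it because the value 1 appears twice.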
-
logInterval
public IndexBuilder logInterval(long logInterval)
Sets the logging time interval (default: ProgressLogger.DEFAULT_LOG_INTERVAL).
- Parameters:
logInterval - the logging time interval.
- Returns:
this index builder.
-
batchDirName
public IndexBuilder batchDirName(String batchDirName)
Sets the temporary directory for batches (default: the directory containing the basename).
- Parameters:
batchDirName - the name of the temporary directory for batches, or null for the directory containing the basename.
- Returns:
this index builder.
-
termMapClass
public IndexBuilder termMapClass(Class<? extends StringMap<? extends CharSequence>> termMapClass)
Sets the class used to build the index term map (default: ImmutableExternalPrefixMap).
The only requirement for termMapClass (besides, of course, implementing StringMap) is that of having a public constructor accepting a single parameter of type Iterable<CharSequence>.
- Parameters:
termMapClass - the class used to build the index term map, or null to disable the construction of a term map.
- Returns:
this index builder.
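The constructor requirement can be checked reflectively, as in the sketch below. It illustrates one way such a class could be instantiated and is not taken from the MG4J sources: ListTermMap and TermMapFactory are hypothetical, and ListTermMap deliberately does not implement StringMap so that the example depends only on the standard library.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a term map; it does NOT implement MG4J's StringMap. */
class ListTermMap {
    private final List<String> terms = new ArrayList<>();

    /** The shape of constructor the text asks for: public, with a single Iterable parameter. */
    public ListTermMap(Iterable<CharSequence> terms) {
        for (CharSequence term : terms) this.terms.add(term.toString());
    }

    /** Returns the position of the term in iteration order, or -1 if it is absent. */
    public long getLong(CharSequence term) {
        return terms.indexOf(term.toString());
    }
}

/** Hypothetical factory showing a reflective lookup of the required constructor. */
final class TermMapFactory {
    private TermMapFactory() {}

    static <T> T build(Class<T> mapClass, Iterable<CharSequence> terms) throws ReflectiveOperationException {
        // Generic parameters are erased at run time, so the lookup uses the raw Iterable type.
        // getConstructor only finds public constructors and throws NoSuchMethodException otherwise.
        Constructor<T> constructor = mapClass.getConstructor(Iterable.class);
        return constructor.newInstance(terms);
    }
}
```

Calling TermMapFactory.build( ListTermMap.class, terms ) succeeds, while a class without a public single-argument Iterable constructor (for instance String.class) is rejected with a NoSuchMethodException.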
-
run
public void run() throws org.apache.commons.configuration.ConfigurationException, SecurityException, IOException, URISyntaxException, ClassNotFoundException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Builds the index.
This method simply invokes Scan and Combine using the internally stored settings, and finally builds a StringMap.
If the provided document sequence can be iterated over several times, this method can be called several times, too, rebuilding the index each time.
- Throws:
org.apache.commons.configuration.ConfigurationException
SecurityException
IOException
URISyntaxException
ClassNotFoundException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
-
main
public static void main(String[] arg) throws com.martiansoftware.jsap.JSAPException, InvocationTargetException, NoSuchMethodException, IllegalAccessException, org.apache.commons.configuration.ConfigurationException, ClassNotFoundException, IOException, InstantiationException, URISyntaxException
- Throws:
com.martiansoftware.jsap.JSAPException
InvocationTargetException
NoSuchMethodException
IllegalAccessException
org.apache.commons.configuration.ConfigurationException
ClassNotFoundException
IOException
InstantiationException
URISyntaxException
-
-