public class URLMPHVirtualDocumentResolver extends Object implements VirtualDocumentResolver
Instances of this class store in a
all URIs from a collection, and consider a virtual-document specification a (possibly relative) URI. The
virtual-document specification is resolved against the document URI, and then the perfect hash is used
to retrieve the corresponding document.
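The two-step process described above (resolve the specification against the document URI, then look up the result) can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the class name `UriResolutionSketch` is hypothetical, a plain `HashMap` stands in for the perfect hash, and returning -1 for an unknown URI is an assumption.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: a HashMap plays the role of the perfect hash.
class UriResolutionSketch {
    private final Map<String, Long> uriToIndex = new HashMap<>();

    // Assigns document indices 0, 1, 2, ... in the order the URIs are given.
    UriResolutionSketch(String... uris) {
        long next = 0;
        for (String uri : uris) {
            uriToIndex.put(uri, next++);
        }
    }

    // Resolves a (possibly relative) specification against the URI of the
    // document it was found in, then looks up the resulting absolute URI.
    // Returning -1 for unknown URIs is an assumption of this sketch.
    long resolve(String documentUri, String spec) {
        URI resolved = URI.create(documentUri).resolve(spec);
        Long index = uriToIndex.get(resolved.toString());
        return index == null ? -1 : index;
    }
}
```

For example, with documents `http://example.com/a/index.html` and `http://example.com/a/b.html`, resolving the specification `b.html` in the context of the first document yields the index of the second.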
This class provides a main method that helps in building serialised resolvers from URI lists.
In case of pathological document collections with duplicate URIs (most notably, the GOV2 collection
used for TREC evaluations), an option makes it possible to add random noise to duplicates, so that
minimal perfect hash construction does not go into an infinite loop. It is a rather crude solution, but it
is nonsensical to have duplicate URIs in the first place. Additional options include the kind of minimal perfect
hash function you want to use (e.g., out of
sux4j) and the number of bits used to sign them.
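The idea of adding random noise to duplicates can be illustrated with the sketch below. It is only an assumption about how such an option might behave: the class `DuplicateNoiseSketch`, the method `makeUnique` and the `#`-plus-hex suffix format are all hypothetical and are not taken from the actual tool.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper: makes every key distinct so that a minimal perfect
// hash (which requires distinct keys) can be built over the result.
class DuplicateNoiseSketch {
    static List<String> makeUnique(List<String> uris, Random random) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>(uris.size());
        for (String uri : uris) {
            String key = uri;
            // First occurrences are kept unchanged; a duplicate gets a random
            // suffix, retried until the key has not been seen before.
            while (!seen.add(key)) {
                key = uri + "#" + Long.toHexString(random.nextLong());
            }
            result.add(key);
        }
        return result;
    }
}
```

The output has the same length and order as the input, so document indices are preserved; the noisy entries simply can no longer be reached through their original URI.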
Warning: up to version 5.2.1, this class was applying
resolve(CharSequence) methods. This does not happen any longer,
as it was breaking URLs such as http://en.wikipedia.org/wiki//dev/null.
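|Modifier and Type|Method and Description|
|void|context(Document document): Sets the context document.|
|long|numberOfDocuments(): Returns the number of documents handled by this resolver, if it is known.|
|long|resolve(CharSequence virtualDocumentSpec): Resolves a virtual document specification.|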
public void context(Document document)
VirtualDocumentResolver.resolve(CharSequence) will assume the virtual-document specification was found in the given document.
public long resolve(CharSequence virtualDocumentSpec)
Note that the resolution process is carried out in the context of the last document passed to
VirtualDocumentResolver.context(Document) (e.g., for relative URI resolution). If
VirtualDocumentResolver.context(Document) was never called, the behaviour is undefined.
public long numberOfDocuments()
VirtualDocumentResolver.resolve(CharSequence) will always return a number smaller than the one returned by this method.
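The calling protocol of the three methods above (set the context, then resolve, with results bounded by the number of documents) might look like the following sketch. The class `ContextProtocolSketch` is hypothetical and only mimics the method names: it takes the context as a URI string rather than a `Document`, uses a linear list search instead of a perfect hash, and throws an exception when no context was set, whereas the documentation above only says that behaviour is undefined. Returning -1 for an unresolvable specification is likewise an assumption.

```java
import java.net.URI;
import java.util.List;

// Hypothetical toy resolver illustrating the context/resolve protocol.
class ContextProtocolSketch {
    private final List<String> uris; // position in the list = document index
    private URI contextUri;          // URI of the last context document

    ContextProtocolSketch(List<String> uris) {
        this.uris = uris;
    }

    // Stand-in for context(Document): remembers the context document's URI.
    void context(String documentUri) {
        contextUri = URI.create(documentUri);
    }

    // Resolves the specification against the last context set.
    long resolve(CharSequence virtualDocumentSpec) {
        if (contextUri == null) {
            // Sketch-only choice: fail fast instead of undefined behaviour.
            throw new IllegalStateException("context() was never called");
        }
        String absolute = contextUri.resolve(virtualDocumentSpec.toString()).toString();
        return uris.indexOf(absolute); // -1 if absent (assumption)
    }

    long numberOfDocuments() {
        return uris.size();
    }
}
```

Since every index returned is a position in the list, any result of `resolve` is strictly smaller than `numberOfDocuments()`, matching the contract stated above.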