Class URLMPHVirtualDocumentResolver
- java.lang.Object
  - it.unimi.di.big.mg4j.tool.URLMPHVirtualDocumentResolver
- All Implemented Interfaces: VirtualDocumentResolver, Serializable
public class URLMPHVirtualDocumentResolver extends Object implements VirtualDocumentResolver
A virtual-document resolver based on document URIs.

Instances of this class store in a StringMap instance all URIs from a collection, and treat a virtual-document specification as a (possibly relative) URI. The virtual-document specification is resolved against the document URI, and then the perfect hash is used to retrieve the corresponding document.

This class provides a main method that helps in building serialised resolvers from URI lists. In case of pathological document collections with duplicate URIs (most notably, the GOV2 collection used for TREC evaluations), an option makes it possible to add random noise to duplicates, so that minimal perfect hash construction does not go into an infinite loop. It is a rather crude solution, but it is nonsensical to have duplicate URIs in the first place. Additional options include the kind of minimal perfect hash function you want to use (e.g., out of it.unimi.dsi.sux4j) and the number of bits used to sign them.

Warning: up to version 5.2.1, this class applied URI.normalize() in the context(Document) and resolve(CharSequence) methods. This no longer happens, as it was breaking URLs such as http://en.wikipedia.org/wiki//dev/null.
- See Also:
- Serialized Form
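The warning about URI.normalize() can be illustrated with the standard java.net.URI class alone. The sketch below is not MG4J code: NormalizeDemo and its two helper methods are hypothetical names introduced for illustration. It compares the string obtained by merely parsing a URL with the string obtained after normalisation; java.net.URI treats the empty path segment in "//" as redundant, so the normalised form no longer matches the original URL.

```java
import java.net.URI;

// Hypothetical demo class; it only uses the standard java.net.URI API.
final class NormalizeDemo {
    /** Returns the URI string after parsing and normalisation. */
    static String normalized(String uri) {
        return URI.create(uri).normalize().toString();
    }

    /** Returns the URI string after parsing only, with no normalisation. */
    static String parsedOnly(String uri) {
        return URI.create(uri).toString();
    }

    public static void main(String[] args) {
        String url = "http://en.wikipedia.org/wiki//dev/null";
        // Parsing alone keeps the double slash intact.
        System.out.println(parsedOnly(url));
        // Normalisation collapses the empty segment, yielding a different URL.
        System.out.println(normalized(url));
    }
}
```

A resolver that looks up the normalised string would therefore miss a document stored under the original, unnormalised URL.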
Constructor Summary
Constructors
Constructor: URLMPHVirtualDocumentResolver(StringMap<? extends CharSequence> url2DocumentPointer)
Method Summary
Modifier and Type, Method, Description:
- void context(Document document): Sets the context document.
- static void main(String[] arg)
- long numberOfDocuments(): Returns the number of documents handled by this resolver, if it is known.
- long resolve(CharSequence virtualDocumentSpec): Resolves a virtual document specification.
Constructor Detail
URLMPHVirtualDocumentResolver
public URLMPHVirtualDocumentResolver(StringMap<? extends CharSequence> url2DocumentPointer)
Method Detail
context
public void context(Document document)
Description copied from interface: VirtualDocumentResolver
Sets the context document. All successive calls to VirtualDocumentResolver.resolve(CharSequence) will assume the virtual-document specification was found in document.
- Specified by: context in interface VirtualDocumentResolver
- Parameters: document - the context document.
resolve
public long resolve(CharSequence virtualDocumentSpec)
Description copied from interface: VirtualDocumentResolver
Resolves a virtual document specification.

Note that the resolution process is carried out in the context of the last document passed to VirtualDocumentResolver.context(Document) (e.g., for relative URI resolution). If VirtualDocumentResolver.context(Document) was never called, the behaviour is undefined.
- Specified by: resolve in interface VirtualDocumentResolver
- Parameters: virtualDocumentSpec - the virtual document specification.
- Returns: the document virtualDocumentSpec refers to, or -1 if the specification could not be resolved.
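
The contract between context and resolve can be sketched with a small stand-in class. This is not the MG4J implementation: SketchResolver is a hypothetical name, a plain HashMap replaces the StringMap, and the context is passed as a URI string rather than a Document. Returning -1 when no context has been set is a choice made by this sketch only, since the documented behaviour in that case is undefined.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a URI-based resolver; not the MG4J class.
final class SketchResolver {
    // Assumption: a plain map from URI strings to document pointers replaces the StringMap.
    private final Map<String, Long> uriToPointer;
    // URI of the last context document, or null if context() was never called.
    private URI contextUri;

    SketchResolver(Map<String, Long> uriToPointer) {
        this.uriToPointer = new HashMap<>(uriToPointer);
    }

    /** Sets the context; here the document is represented by its URI string. */
    void context(String documentUri) {
        contextUri = URI.create(documentUri);
    }

    /** Resolves a (possibly relative) URI and returns its pointer, or -1 on failure. */
    long resolve(CharSequence virtualDocumentSpec) {
        try {
            URI uri = URI.create(virtualDocumentSpec.toString());
            if (!uri.isAbsolute()) {
                // Sketch-only choice: without a context, relative URIs are unresolvable.
                if (contextUri == null) return -1;
                uri = contextUri.resolve(uri);
            }
            Long pointer = uriToPointer.get(uri.toString());
            return pointer == null ? -1 : pointer;
        } catch (IllegalArgumentException e) {
            return -1; // the specification is not a syntactically valid URI
        }
    }

    /** Number of documents known to this sketch. */
    long numberOfDocuments() {
        return uriToPointer.size();
    }

    public static void main(String[] args) {
        Map<String, Long> uris = new HashMap<>();
        uris.put("http://example.com/a/index.html", 0L);
        uris.put("http://example.com/a/page.html", 1L);
        SketchResolver resolver = new SketchResolver(uris);
        resolver.context("http://example.com/a/index.html");
        System.out.println(resolver.resolve("page.html"));
        System.out.println(resolver.resolve("missing.html"));
    }
}
```

In this sketch the same relative specification can resolve to different documents depending on the last context set, which is why the context must be updated before resolving the specifications found in each document.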
numberOfDocuments
public long numberOfDocuments()
Description copied from interface: VirtualDocumentResolver
Returns the number of documents handled by this resolver, if it is known. A call to VirtualDocumentResolver.resolve(CharSequence) will always return a number smaller than the one returned by this method.
- Specified by: numberOfDocuments in interface VirtualDocumentResolver
- Returns: the number of documents handled by this resolver.
main
public static void main(String[] arg) throws com.martiansoftware.jsap.JSAPException, IOException
- Throws: com.martiansoftware.jsap.JSAPException, IOException