Class WikipediaDocumentSequence
- java.lang.Object
  - it.unimi.di.big.mg4j.document.AbstractDocumentSequence
    - it.unimi.di.big.mg4j.document.WikipediaDocumentSequence
- All Implemented Interfaces: DocumentSequence, SafelyCloseable, Closeable, Serializable, AutoCloseable
public class WikipediaDocumentSequence extends AbstractDocumentSequence implements Serializable
A class exhibiting a standard Wikipedia XML dump as a DocumentSequence.
Warning: this class has no connection whatsoever with WikipediaDocumentCollection.
The purpose of this class is to make indexing Wikipedia and its entity graph, starting from a pristine Wikipedia XML dump, reasonably easy. There are a few steps involved, mainly because redirects must be worked out, but the whole procedure can be carried out with very few resources. The class uses the WikiModel.toHtml(String, Appendable, String, String) method to convert the Wikipedia format into HTML, and then passes the result to a standard HtmlDocumentFactory (suggestions on alternative conversion methods are welcome). A few additional fields are handled by WikipediaDocumentSequence.WikipediaHeaderFactory.
Note that no properties are passed to the underlying HtmlDocumentFactory: if you want to set the anchor properties (see HtmlDocumentFactory.MetadataKeys), you need to use a quite humongous constructor.
Warning: there's a known bug in the creation of Wikipedia dumps. Until it is fixed, you might have to filter out duplicates before using this class.
How to index Wikipedia
To begin, download the Wikipedia XML dump (it's the “pages-articles” file; it should start with a mediawiki opening tag). This class can process the file in its compressed form, but we suggest uncompressing it with bunzip2, as processing is an order of magnitude faster. (Note that the following process will exclude namespaced pages such as Template:something and all templates, such as infoboxes, etc.; if you want to include them, you must use a different constructor.)
The first step is extracting metadata (in particular, the URLs that are necessary to index the anchor text correctly). We do not suggest specific Java options, but try to use as much memory as you can.
java it.unimi.di.big.mg4j.tool.ScanMetadata \
  -o "it.unimi.di.big.mg4j.document.WikipediaDocumentSequence(enwiki-latest-pages-articles.xml,false,http://en.wikipedia.org/wiki/,false)" \
  -u enwiki.uris -t enwiki.titles
Note that we used the ObjectParser-based constructor of this class, which makes it possible to create a WikipediaDocumentSequence instance by parsing a textual specification (see the constructor documentation for details about the parameters).
The second step consists in building a first VirtualDocumentResolver which, however, does not comprise redirect information:
java it.unimi.di.big.mg4j.tool.URLMPHVirtualDocumentResolver -o enwiki.uris enwiki.vdr
Now we need to use the ad hoc main method of this class to rescan the collection, gather the redirect information and merge it with our current resolver:
java it.unimi.di.big.mg4j.document.WikipediaDocumentSequence \
  enwiki-latest-pages-articles.xml http://en.wikipedia.org/wiki/ enwiki.uris enwiki.vdr enwikired.vdr
During this phase a rather large number of warnings about failed redirects might appear. This is normal, in particular if you do not index template pages. If you suspect an actual bug, first try indexing template pages, too. Failed redirects should be on the order of a few thousand, all due to internal inconsistencies of the dump: to verify this, check whether the target of a failed redirect appears as a page title (it shouldn't).
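The check described above can be sketched as follows. This is a hypothetical helper, not part of MG4J: it assumes you have collected the failed targets from the warnings yourself, that the page titles are available as plain strings (for instance, read from enwiki.titles under the assumption that it contains one title per line), and that an exact string comparison is adequate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of MG4J) for inspecting failed redirects. */
final class FailedRedirectCheck {
    /**
     * Returns the failed-redirect targets that DO appear among the page titles.
     * An empty result is consistent with failures caused only by dump inconsistencies.
     */
    static List<String> suspiciousTargets(Collection<String> titles, Collection<String> failedTargets) {
        // A hash set gives constant-time membership tests on the titles.
        Set<String> titleSet = new HashSet<>(titles);
        List<String> suspicious = new ArrayList<>();
        for (String target : failedTargets) {
            // Assumption: titles and targets are comparable as exact strings.
            if (titleSet.contains(target)) suspicious.add(target);
        }
        return suspicious;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        List<String> titles = Arrays.asList("Rome", "Milan", "Turin");
        List<String> failed = Arrays.asList("Atlantis (city)", "Milan");
        System.out.println(suspiciousTargets(titles, failed));
    }
}
```

Any target returned by this method exists as a page title, so its redirect should have been resolved and it deserves a closer look.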
We now have all the information required to build a complete index (we use the Porter2 stemmer in this example):
java it.unimi.di.big.mg4j.tool.IndexBuilder \
  -o "it.unimi.di.big.mg4j.document.WikipediaDocumentSequence(enwiki.xml,false,http://en.wikipedia.org/wiki/,true)" \
  --all-fields -v enwikired.vdr -t EnglishStemmer enwiki
Finally, we can build the entity graph using a bridge class that exposes any DocumentSequence with a virtual field as an ImmutableGraph of the WebGraph framework (the nodes will be in one-to-one correspondence with the documents returned by the index):
java it.unimi.dsi.big.webgraph.BVGraph \
  -s "it.unimi.di.big.mg4j.util.DocumentSequenceImmutableSequentialGraph(\"it.unimi.di.big.mg4j.document.WikipediaDocumentSequence(enwiki.xml,false,http://en.wikipedia.org/wiki/,true)\",anchor,enwikired.vdr)" \
  enwiki
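The textual specifications passed with -o and -s in the commands above differ only in a few arguments, so it may be convenient to assemble them programmatically. The following is a minimal sketch using plain string concatenation; the class and method names are hypothetical and not part of MG4J, and no escaping is attempted for file names containing commas, quotes or parentheses.

```java
/** Hypothetical helper (not part of MG4J): builds ObjectParser-style specifications as strings. */
final class WikipediaSpecs {
    static final String SEQUENCE_CLASS = "it.unimi.di.big.mg4j.document.WikipediaDocumentSequence";
    static final String GRAPH_CLASS = "it.unimi.di.big.mg4j.util.DocumentSequenceImmutableSequentialGraph";

    /** Specification for the four-argument string-based constructor. */
    static String sequenceSpec(String file, boolean bzipped, String baseURL, boolean parseText) {
        return SEQUENCE_CLASS + "(" + file + "," + bzipped + "," + baseURL + "," + parseText + ")";
    }

    /**
     * Specification for the bridge class. The inner specification is wrapped in
     * double quotes; this is the string the JVM receives, so the backslashes seen
     * in the shell command are not needed here.
     */
    static String graphSpec(String sequenceSpec, String field, String resolver) {
        return GRAPH_CLASS + "(\"" + sequenceSpec + "\"," + field + "," + resolver + ")";
    }

    public static void main(String[] args) {
        String indexing = sequenceSpec("enwiki.xml", false, "http://en.wikipedia.org/wiki/", true);
        System.out.println(indexing);
        System.out.println(graphSpec(indexing, "anchor", "enwikired.vdr"));
    }
}
```

When such a string is passed through a shell rather than, say, a ProcessBuilder argument list, the inner double quotes must be escaped as in the BVGraph command.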
Additional fields
The additional fields generated by this class (some of which are a bit hacky) are:
title
- the title of the Wikipedia page;
id
- a payload index containing the Wikipedia identifier of the page;
lastedit
- a payload index containing the last edit of the page;
category
- a field containing the categories of the page, separated by an artificial marker OXOXO (so when you look for a category as a phrase you don't get false cross-category positives);
firstpar
- a heuristically generated first paragraph of the page, useful for identification beyond the title;
redirects
- a virtual field treating the link of the page with its title and any redirect link to the page as an anchor: in practice, the field contains all names under which the page is known in Wikipedia.
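To see why the OXOXO marker in the category field matters, consider the following self-contained sketch. It does not use MG4J: it models the field as a list of lowercased, whitespace-separated tokens and a phrase query as a contiguous run of tokens, which is an assumption made purely for illustration. The class name and sample categories are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration (not MG4J code) of a separator marker between categories. */
final class CategoryMarkerSketch {
    /** Concatenates the tokens of all categories, inserting the marker between categories if it is non-null. */
    static List<String> tokens(List<String> categories, String marker) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < categories.size(); i++) {
            if (i > 0 && marker != null) out.add(marker);
            out.addAll(Arrays.asList(categories.get(i).toLowerCase(Locale.ROOT).split("\\s+")));
        }
        return out;
    }

    /** A phrase matches if its tokens appear contiguously and in order. */
    static boolean containsPhrase(List<String> tokens, String phrase) {
        List<String> wanted = Arrays.asList(phrase.toLowerCase(Locale.ROOT).split("\\s+"));
        return Collections.indexOfSubList(tokens, wanted) >= 0;
    }

    public static void main(String[] args) {
        List<String> categories = Arrays.asList("Living people", "Italian painters");
        // Without a marker, the phrase straddles two categories and matches spuriously.
        System.out.println(containsPhrase(tokens(categories, null), "people italian"));
        // With the marker, the two categories are no longer adjacent.
        System.out.println(containsPhrase(tokens(categories, "OXOXO"), "people italian"));
    }
}
```

Genuine category phrases such as "italian painters" still match in both cases, since the marker is inserted only between categories.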
Note that for each link in a disambiguation page this class will generate a fake link with the same target, but with the title of the disambiguation page as text. This is in the same spirit as the redirects field: we enrich the HTML anchor field with useful information without altering the generated graph.
- See Also:
- Serialized Form
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description)
static class WikipediaDocumentSequence.MetadataKeys
static class WikipediaDocumentSequence.SignedRedirectedStringMap
A wrapper around a signed function that remaps entries exceeding a provided threshold using a specified target array.
static class WikipediaDocumentSequence.WikipediaHeaderFactory
A factory responsible for special Wikipedia fields (see the class documentation).
-
Constructor Summary
Constructors (Constructor, Description)
WikipediaDocumentSequence(String file, boolean bzipped, String baseURL, boolean parseText)
Builds a new Wikipedia document sequence using default anchor settings and discarding namespaced pages and templates.
WikipediaDocumentSequence(String file, boolean bzipped, String baseURL, boolean parseText, boolean keepNamespaced, boolean keepTemplates)
Builds a new Wikipedia document sequence using default anchor settings.
WikipediaDocumentSequence(String file, boolean bzipped, String baseURL, boolean parseText, boolean keepNamespaced, boolean keepTemplates, int maxPreAnchor, int maxAnchor, int maxPostAnchor, String delimiter)
Builds a new Wikipedia document sequence.
WikipediaDocumentSequence(String file, String bzipped, String baseURL, String parseText)
A string-based constructor to be used with an ObjectParser.
WikipediaDocumentSequence(String file, String bzipped, String baseURL, String parseText, String keepNamespaced, String keepTemplates)
A string-based constructor to be used with an ObjectParser.
WikipediaDocumentSequence(String file, String bzipped, String baseURL, String parseText, String keepNamespaced, String keepTemplates, String maxBeforeAnchor, String maxAnchor, String maxPostAnchor)
A string-based constructor to be used with an ObjectParser.
WikipediaDocumentSequence(String file, String bzipped, String baseURL, String parseText, String keepNamespaced, String keepTemplates, String maxBeforeAnchor, String maxAnchor, String maxPostAnchor, String delimiter)
A string-based constructor to be used with an ObjectParser.
-
Method Summary
Methods (Modifier and Type, Method, Description)
DocumentFactory factory()
Returns the factory used by this sequence.
DocumentIterator iterator()
Returns an iterator over the sequence of documents.
static void main(String[] arg)
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentSequence
close, filename, finalize, load
Constructor Detail
-
WikipediaDocumentSequence
public WikipediaDocumentSequence(String file, boolean bzipped, String baseURL, boolean parseText)
Builds a new Wikipedia document sequence using default anchor settings and discarding namespaced pages and templates.
- Parameters:
file - the file containing the Wikipedia dump.
bzipped - whether file is compressed with bzip2.
baseURL - a base URL for links (e.g., for the English Wikipedia, http://en.wikipedia.org/wiki/); note that if it is nonempty this string must terminate with a slash.
parseText - whether to parse the text (this parameter is only set to false during metadata-scanning phases to speed up the scanning process).
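Since a nonempty baseURL must terminate with a slash, it may be worth checking the value before building a specification or calling a constructor. The following is a minimal sketch of such a check; the helper is hypothetical and not part of MG4J, and we make no claim about how the class itself reacts to a malformed base URL.

```java
/** Hypothetical pre-check (not part of MG4J) for the baseURL constructor argument. */
final class BaseUrlCheck {
    /** Returns baseURL unchanged if it is empty or ends with a slash; throws otherwise. */
    static String requireValidBaseURL(String baseURL) {
        if (baseURL == null) throw new NullPointerException("baseURL");
        if (!baseURL.isEmpty() && !baseURL.endsWith("/")) {
            throw new IllegalArgumentException("A nonempty base URL must terminate with a slash: " + baseURL);
        }
        return baseURL;
    }

    public static void main(String[] args) {
        // Accepted: ends with a slash.
        System.out.println(requireValidBaseURL("http://en.wikipedia.org/wiki/"));
        // Accepted: the empty string is explicitly allowed.
        System.out.println(requireValidBaseURL("").isEmpty());
    }
}
```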
-
WikipediaDocumentSequence
public WikipediaDocumentSequence(String file, boolean bzipped, String baseURL, boolean parseText, boolean keepNamespaced, boolean keepTemplates)
Builds a new Wikipedia document sequence using default anchor settings.
- Parameters:
file - the file containing the Wikipedia dump.
bzipped - whether file is compressed with bzip2.
baseURL - a base URL for links (e.g., for the English Wikipedia, http://en.wikipedia.org/wiki/); note that if it is nonempty this string must terminate with a slash.
parseText - whether to parse the text (this parameter is only set to false during metadata-scanning phases to speed up the scanning process).
keepNamespaced - whether to keep namespaced pages (e.g., Template:something pages).
keepTemplates - whether to keep templates (e.g., infoboxes, taxoboxes, etc.); we suggest passing false if you're building a Wikipedia graph.
-
WikipediaDocumentSequence
public WikipediaDocumentSequence(String file, boolean bzipped, String baseURL, boolean parseText, boolean keepNamespaced, boolean keepTemplates, int maxPreAnchor, int maxAnchor, int maxPostAnchor, String delimiter)
Builds a new Wikipedia document sequence.
- Parameters:
file - the file containing the Wikipedia dump.
bzipped - whether file is compressed with bzip2.
baseURL - a base URL for links (e.g., for the English Wikipedia, http://en.wikipedia.org/wiki/); note that if it is nonempty this string must terminate with a slash.
parseText - whether to parse the text (this parameter is only set to false during metadata-scanning phases to speed up the scanning process).
keepNamespaced - whether to keep namespaced pages (e.g., Template:something pages).
keepTemplates - whether to keep templates (e.g., infoboxes, taxoboxes, etc.); we suggest passing false if you're building a Wikipedia graph.
maxPreAnchor - maximum number of characters before an anchor.
maxAnchor - maximum number of characters in an anchor.
maxPostAnchor - maximum number of characters after an anchor.
delimiter - a token that will be inserted to delimit the anchor text, or null for no delimiter.
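The interplay of the four anchor parameters can be illustrated with a rough sketch. This is not the code used by HtmlDocumentFactory: it works on plain strings rather than parsed HTML, it truncates by raw character count, and it assumes that the delimiter is placed on both sides of the anchor text. The class name and sample text are hypothetical.

```java
/** Hypothetical illustration (not MG4J code) of the anchor-context parameters. */
final class AnchorContextSketch {
    /**
     * Keeps at most maxPreAnchor trailing characters of the text before the anchor,
     * at most maxAnchor leading characters of the anchor, and at most maxPostAnchor
     * leading characters of the text after it. A non-null delimiter is inserted on
     * both sides of the anchor text (an assumption of this sketch).
     */
    static String context(String before, String anchor, String after,
                          int maxPreAnchor, int maxAnchor, int maxPostAnchor, String delimiter) {
        String pre = before.substring(Math.max(0, before.length() - maxPreAnchor));
        String mid = anchor.substring(0, Math.min(anchor.length(), maxAnchor));
        String post = after.substring(0, Math.min(after.length(), maxPostAnchor));
        String d = delimiter == null ? "" : delimiter;
        return pre + d + mid + d + post;
    }

    public static void main(String[] args) {
        // Made-up text around a link whose anchor text is "Caravaggio".
        System.out.println(context("the painter ", "Caravaggio", " was born in Milan", 8, 20, 9, "|"));
    }
}
```

The real factory may handle whitespace and word boundaries differently; the sketch is meant only to convey what each limit bounds.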
-
WikipediaDocumentSequence
public WikipediaDocumentSequence(String file, String bzipped, String baseURL, String parseText)
A string-based constructor to be used with an ObjectParser.
-
WikipediaDocumentSequence
public WikipediaDocumentSequence(String file, String bzipped, String baseURL, String parseText, String keepNamespaced, String keepTemplates)
A string-based constructor to be used with an ObjectParser.
-
WikipediaDocumentSequence
public WikipediaDocumentSequence(String file, String bzipped, String baseURL, String parseText, String keepNamespaced, String keepTemplates, String maxBeforeAnchor, String maxAnchor, String maxPostAnchor)
A string-based constructor to be used with an ObjectParser.
-
WikipediaDocumentSequence
public WikipediaDocumentSequence(String file, String bzipped, String baseURL, String parseText, String keepNamespaced, String keepTemplates, String maxBeforeAnchor, String maxAnchor, String maxPostAnchor, String delimiter)
A string-based constructor to be used with an ObjectParser.
-
-
Method Detail
-
iterator
public DocumentIterator iterator() throws IOException
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.
Warning: this method can be safely called just one time. For instance, implementations based on standard input will usually throw an exception if this method is called twice.
Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.
- Specified by:
iterator in interface DocumentSequence
- Returns:
- an iterator over the sequence of documents.
- Throws:
IOException
- See Also:
DocumentCollection
-
factory
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
- Specified by:
factory in interface DocumentSequence
- Returns:
- the factory used by this sequence.
-
main
public static void main(String[] arg) throws ParserConfigurationException, SAXException, IOException, com.martiansoftware.jsap.JSAPException, ClassNotFoundException
- Throws:
ParserConfigurationException
SAXException
IOException
com.martiansoftware.jsap.JSAPException
ClassNotFoundException
-
-