Class WikipediaDocumentSequence

  • All Implemented Interfaces:
    DocumentSequence, SafelyCloseable, Closeable, Serializable, AutoCloseable

    public class WikipediaDocumentSequence
    extends AbstractDocumentSequence
    implements Serializable
    A class exposing a standard Wikipedia XML dump as a DocumentSequence.

    Warning: this class has no connection whatsoever with WikipediaDocumentCollection.

    The purpose of this class is to make it reasonably easy to index Wikipedia, and its entity graph, starting from a pristine Wikipedia XML dump. There are a few steps involved, mainly due to the necessity of working out redirects, but the whole procedure can be carried out with very modest resources. The class uses the WikiModel.toHtml(String, Appendable, String, String) method to convert the Wikipedia format into HTML, and then passes the result to a standard HtmlDocumentFactory (suggestions on alternative conversion methods are welcome). A few additional fields are handled by WikipediaDocumentSequence.WikipediaHeaderFactory.

    Note that no properties are passed to the underlying HtmlDocumentFactory: if you want to set the anchor properties (see HtmlDocumentFactory.MetadataKeys), you need to use a quite humongous constructor.
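
    For orientation, here is a minimal sketch of how the sequence can be consumed programmatically, using the standard DocumentSequence/DocumentIterator API (the dump file name is just an example):

     import it.unimi.di.big.mg4j.document.Document;
     import it.unimi.di.big.mg4j.document.DocumentIterator;
     import it.unimi.di.big.mg4j.document.WikipediaDocumentSequence;

     public class PrintTitles {
         public static void main(String[] args) throws Exception {
             // Plain (not bzip2-compressed) dump; parse the text.
             WikipediaDocumentSequence sequence = new WikipediaDocumentSequence(
                     "enwiki-latest-pages-articles.xml", false,
                     "http://en.wikipedia.org/wiki/", true);
             DocumentIterator iterator = sequence.iterator();
             for (Document document; (document = iterator.nextDocument()) != null; ) {
                 System.out.println(document.title()); // the title field
                 document.close(); // each document must be closed after use
             }
             iterator.close();
             sequence.close();
         }
     }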

    Warning: there's a known bug in the creation of Wikipedia dumps. Until it is fixed, you might have to filter out duplicates before using this class.

    How to index Wikipedia

    As a first step, download the Wikipedia XML dump (it's the “pages-articles” file; it should start with a mediawiki opening tag). This class can process the file in its compressed form, but we suggest uncompressing it using bunzip2, as processing is then an order of magnitude faster. (Note that the following process will exclude namespaced pages such as Template:something and all templates, such as infoboxes, etc.; if you want to include them, you must use a different constructor.)

    The first step is extracting metadata (in particular, the URLs that are necessary to index the anchor text correctly). We do not suggest specific Java options, but try to use as much memory as you can.

     java it.unimi.di.big.mg4j.tool.ScanMetadata \
       -o "it.unimi.di.big.mg4j.document.WikipediaDocumentSequence(enwiki-latest-pages-articles.xml,false,http://en.wikipedia.org/wiki/,false)" \
       -u enwiki.uris -t enwiki.titles
     

    Note that we used the ObjectParser-based constructor of this class, which makes it possible to create a WikipediaDocumentSequence instance by parsing a textual specification (see the constructor documentation for details about the parameters).
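
    The same specification can be turned into an instance programmatically; a minimal sketch, assuming dsiutils' ObjectParser.fromSpec(String, Class):

     import it.unimi.di.big.mg4j.document.DocumentSequence;
     import it.unimi.dsi.lang.ObjectParser;

     public class ParseSpec {
         public static void main(String[] args) throws Exception {
             // The same textual specification passed to ScanMetadata above.
             DocumentSequence sequence = (DocumentSequence) ObjectParser.fromSpec(
                     "it.unimi.di.big.mg4j.document.WikipediaDocumentSequence(enwiki-latest-pages-articles.xml,false,http://en.wikipedia.org/wiki/,false)",
                     DocumentSequence.class);
             sequence.close();
         }
     }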

    The second step consists in building a first VirtualDocumentResolver which, however, does not contain redirect information:

     java it.unimi.di.big.mg4j.tool.URLMPHVirtualDocumentResolver -o enwiki.uris enwiki.vdr
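
    The serialized resolver can later be loaded and queried programmatically; a sketch, assuming the standard VirtualDocumentResolver API (the page URL below is a hypothetical example):

     import it.unimi.di.big.mg4j.document.VirtualDocumentResolver;
     import it.unimi.dsi.fastutil.io.BinIO;

     public class ResolveURL {
         public static void main(String[] args) throws Exception {
             // Load the resolver serialized by URLMPHVirtualDocumentResolver.
             VirtualDocumentResolver resolver =
                 (VirtualDocumentResolver) BinIO.loadObject("enwiki.vdr");
             // Resolve a page URL to its document index (-1 if unresolvable).
             System.out.println(resolver.resolve("http://en.wikipedia.org/wiki/Mountain"));
         }
     }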
     

    Now we need to use the ad hoc main method of this class to rescan the collection, gather the redirect information and merge it with our current resolver:

     java it.unimi.di.big.mg4j.document.WikipediaDocumentSequence \
       enwiki-latest-pages-articles.xml http://en.wikipedia.org/wiki/ enwiki.uris enwiki.vdr enwikired.vdr
     

    During this phase a rather large number of warnings about failed redirects might appear. This is normal, in particular if you do not index template pages. If you suspect an actual bug, first try to index template pages, too. Failed redirects should be on the order of a few thousand, and all due to internal inconsistencies of the dump: to check that this is the case, verify whether the target of a failed redirect appears as a page title (it shouldn't).
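
    Since the titles file written by ScanMetadata contains one title per line, the check is a few lines of code (a sketch; the failed redirect's target title is passed on the command line):

     import java.io.BufferedReader;
     import java.io.FileReader;

     public class CheckRedirectTarget {
         public static void main(String[] args) throws Exception {
             final String target = args[0]; // target of a failed redirect
             boolean found = false;
             try (BufferedReader titles = new BufferedReader(new FileReader("enwiki.titles"))) {
                 for (String title; (title = titles.readLine()) != null; )
                     if (title.equals(target)) { found = true; break; }
             }
             // A missing title confirms an internal inconsistency of the dump.
             System.out.println(found ? "title found: possible bug" : "title not found: dump inconsistency");
         }
     }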

    We now have all the information required to build a complete index (we use the Porter2 stemmer in this example):

     java it.unimi.di.big.mg4j.tool.IndexBuilder \
       -o "it.unimi.di.big.mg4j.document.WikipediaDocumentSequence(enwiki.xml,false,http://en.wikipedia.org/wiki/,true)" \ 
       --all-fields -v enwikired.vdr -t EnglishStemmer enwiki
     

    Finally, we can build the entity graph using a bridge class that exposes any DocumentSequence with a virtual field as an ImmutableGraph of the WebGraph framework (the nodes will be in one-to-one correspondence with the documents returned by the sequence):

     java it.unimi.dsi.big.webgraph.BVGraph \
       -s "it.unimi.di.big.mg4j.util.DocumentSequenceImmutableSequentialGraph(\"it.unimi.di.big.mg4j.document.WikipediaDocumentSequence(enwiki.xml,false,http://en.wikipedia.org/wiki/,true)\",anchor,enwikired.vdr)" \ 
       enwiki
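
    The resulting graph can be accessed with the usual WebGraph API; a minimal sketch, assuming the basename enwiki used above:

     import it.unimi.dsi.big.webgraph.ImmutableGraph;
     import it.unimi.dsi.big.webgraph.LazyLongIterator;

     public class GraphStats {
         public static void main(String[] args) throws Exception {
             // Load the entity graph; node ids are document indices.
             ImmutableGraph graph = ImmutableGraph.load("enwiki");
             System.out.println("nodes: " + graph.numNodes());
             // Enumerate the successors (outgoing links) of node 0.
             LazyLongIterator successors = graph.successors(0);
             for (long s; (s = successors.nextLong()) != -1; )
                 System.out.println(s);
         }
     }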
     

    Additional fields

    The additional fields generated by this class (some of which are a bit hacky) are:

    title
    the title of the Wikipedia page;
    id
    a payload index containing the Wikipedia identifier of the page;
    lastedit
    a payload index containing the date of the last edit of the page;
    category
    a field containing the categories of the page, separated by an artificial marker OXOXO (so when you look for a category as a phrase you don't get false cross-category positives);
    firstpar
    a heuristically generated first paragraph of the page, useful for identification beyond the title;
    redirects
    a virtual field that treats a link to the page labeled with its title, as well as every redirect link to the page, as anchors: in practice, the field contains all names under which the page is known in Wikipedia.

    Note that for each link in a disambiguation page this class will generate a fake link with the same target, but with the title of the disambiguation page as text. This is in the same spirit as the redirects field: we enrich the HTML anchor field with useful information without altering the generated graph.

    See Also:
    Serialized Form
    • Constructor Detail

      • WikipediaDocumentSequence

        public WikipediaDocumentSequence​(String file,
                                         boolean bzipped,
                                         String baseURL,
                                         boolean parseText)
        Builds a new Wikipedia document sequence using default anchor settings and discarding namespaced pages and templates.
        Parameters:
        file - the file containing the Wikipedia dump.
        bzipped - whether file is compressed with bzip2.
        baseURL - a base URL for links (e.g., for the English Wikipedia, http://en.wikipedia.org/wiki/); note that if it is nonempty this string must terminate with a slash.
        parseText - whether to parse the text (this parameter is only set to false during metadata-scanning phases to speed up the scanning process).
      • WikipediaDocumentSequence

        public WikipediaDocumentSequence​(String file,
                                         boolean bzipped,
                                         String baseURL,
                                         boolean parseText,
                                         boolean keepNamespaced,
                                         boolean keepTemplates)
        Builds a new Wikipedia document sequence using default anchor settings.
        Parameters:
        file - the file containing the Wikipedia dump.
        bzipped - whether file is compressed with bzip2.
        baseURL - a base URL for links (e.g., for the English Wikipedia, http://en.wikipedia.org/wiki/); note that if it is nonempty this string must terminate with a slash.
        parseText - whether to parse the text (this parameter is only set to false during metadata-scanning phases to speed up the scanning process).
        keepNamespaced - whether to keep namespaced pages (e.g., Template:something pages).
        keepTemplates - whether to keep templates (e.g., infoboxes, taxoboxes, etc.); we suggest passing false if you're building a Wikipedia graph.
      • WikipediaDocumentSequence

        public WikipediaDocumentSequence​(String file,
                                         boolean bzipped,
                                         String baseURL,
                                         boolean parseText,
                                         boolean keepNamespaced,
                                         boolean keepTemplates,
                                         int maxPreAnchor,
                                         int maxAnchor,
                                         int maxPostAnchor,
                                         String delimiter)
        Builds a new Wikipedia document sequence.
        Parameters:
        file - the file containing the Wikipedia dump.
        bzipped - whether file is compressed with bzip2.
        baseURL - a base URL for links (e.g., for the English Wikipedia, http://en.wikipedia.org/wiki/); note that if it is nonempty this string must terminate with a slash.
        parseText - whether to parse the text (this parameter is only set to false during metadata-scanning phases to speed up the scanning process).
        keepNamespaced - whether to keep namespaced pages (e.g., Template:something pages).
        keepTemplates - whether to keep templates (e.g., infoboxes, taxoboxes, etc.); we suggest passing false if you're building a Wikipedia graph.
        maxPreAnchor - maximum number of characters before an anchor.
        maxAnchor - maximum number of characters in an anchor.
        maxPostAnchor - maximum number of characters after an anchor.
        delimiter - a token that will be inserted to delimit the anchor text, or null for no delimiter.
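
        For instance, a sketch of an invocation with explicit anchor settings (the numeric limits below are arbitrary examples, not recommended values):

         import it.unimi.di.big.mg4j.document.WikipediaDocumentSequence;

         public class FullConstructorExample {
             public static void main(String[] args) throws Exception {
                 WikipediaDocumentSequence sequence = new WikipediaDocumentSequence(
                         "enwiki-latest-pages-articles.xml", // dump file
                         false,                              // not bzip2-compressed
                         "http://en.wikipedia.org/wiki/",    // base URL (trailing slash)
                         true,                               // parse the text
                         false,                              // discard namespaced pages
                         false,                              // discard templates
                         32,                                 // max characters before an anchor
                         1024,                               // max characters in an anchor
                         32,                                 // max characters after an anchor
                         null);                              // no delimiter token
                 sequence.close();
             }
         }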