java.lang.Object
- it.unimi.di.big.mg4j.document.AbstractDocumentSequence
- - it.unimi.di.big.mg4j.document.WikipediaDocumentSequence

All Implemented Interfaces:

DocumentSequence, SafelyCloseable, Closeable, Serializable, AutoCloseable
```
public class WikipediaDocumentSequence
extends AbstractDocumentSequence
implements Serializable
```
A class exhibiting a standard Wikipedia XML dump as a DocumentSequence.
Warning: this class has no connection whatsoever with WikipediaDocumentCollection.
The purpose of this class is to make indexing Wikipedia and its entity graph, starting from a pristine Wikipedia XML dump, reasonably easy. There are a few steps involved, mainly due to the necessity of working out redirects, but the whole procedure can be carried out with very few resources. The class uses the WikiModel.toHtml(String, Appendable, String, String) method to convert the Wikipedia format into HTML, and then passes the result to a standard HtmlDocumentFactory (suggestions on alternative conversion methods are welcome). A few additional fields are handled by WikipediaDocumentSequence.WikipediaHeaderFactory.
Note that no properties are passed to the underlying HtmlDocumentFactory: if you want to set the anchor properties (see HtmlDocumentFactory.MetadataKeys), you need to use a quite humongous constructor.
Warning: there's a known bug in the creation of Wikipedia dumps. Until it is fixed, you might have to filter out duplicates before using this class.
How to index Wikipedia

To begin, download the Wikipedia XML dump (it's the “pages-articles” file; it should start with a mediawiki opening tag). This class can process the file in its compressed form, but we suggest uncompressing it with bunzip2, as processing is an order of magnitude faster. (Note that the following process will exclude namespaced pages such as Template:something and all templates, such as infoboxes, etc.; if you want to include them, you must use a different constructor.)
The first processing step extracts metadata (in particular, the URLs needed to index the anchor text correctly). We do not suggest specific Java options, but try to use as much memory as you can.
```
 java it.unimi.di.big.mg4j.tool.ScanMetadata \
   -o "it.unimi.di.big.mg4j.document.WikipediaDocumentSequence(enwiki-latest-pages-articles.xml,false,http://en.wikipedia.org/wiki/,false)" \
   -u enwiki.uris -t enwiki.titles
 
```
Note that we used the ObjectParser-based constructor of this class, which makes it possible to create a WikipediaDocumentSequence instance parsing a textual specification (see the constructor documentation for details about the parameters).
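To make the shape of such a textual specification concrete, here is a minimal sketch that splits a specification into a class name and its string arguments. It is not the actual ObjectParser implementation: the class SpecSketch and its split method are hypothetical, and the naive comma split ignores any quoting or escaping a real parser may support. It only illustrates why every argument arrives as a String, which is what the string-based constructors of this class are for.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits "ClassName(arg1,arg2,...)" into a class name and string arguments. */
class SpecSketch {
    /** Returns the class name followed by the arguments, all as strings. */
    static List<String> split(String spec) {
        List<String> parts = new ArrayList<>();
        int open = spec.indexOf('(');
        if (open < 0) { // no argument list at all
            parts.add(spec.trim());
            return parts;
        }
        if (!spec.endsWith(")"))
            throw new IllegalArgumentException("Missing closing parenthesis: " + spec);
        parts.add(spec.substring(0, open).trim());
        String args = spec.substring(open + 1, spec.length() - 1);
        if (!args.isEmpty()) {
            // Naive split: no quoting or escaping is handled here (an assumption of this sketch).
            for (String a : args.split(",", -1)) parts.add(a.trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints the class name followed by the four string arguments.
        System.out.println(split("it.unimi.di.big.mg4j.document.WikipediaDocumentSequence("
            + "enwiki-latest-pages-articles.xml,false,http://en.wikipedia.org/wiki/,false)"));
    }
}
```

In this sketch the values false and false reach the caller as the strings "false", so a constructor taking String parameters is the natural target.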
The second step builds a first VirtualDocumentResolver which, however, does not yet include redirect information:
```
 java it.unimi.di.big.mg4j.tool.URLMPHVirtualDocumentResolver -o enwiki.uris enwiki.vdr
 
```
Now we need to use the ad hoc main method of this class to rescan the collection, gather the redirect information and merge it with our current resolver:
```
 java it.unimi.di.big.mg4j.document.WikipediaDocumentSequence \
   enwiki-latest-pages-articles.xml http://en.wikipedia.org/wiki/ enwiki.uris enwiki.vdr enwikired.vdr
 
```
During this phase a fairly large number of warnings about failed redirects might appear. This is normal, in particular if you do not index template pages. If you suspect an actual bug, first try to index template pages, too. Failed redirects should be on the order of a few thousand, and all due to internal inconsistencies of the dump: to verify this, check whether the target of a failed redirect appears as a page title (it shouldn't).
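The check on failed redirect targets can be automated with a few lines of code. The following is a minimal sketch under stated assumptions: the titles file (such as enwiki.titles) is assumed to contain one title per line in UTF-8, and the file of failed targets is a hypothetical list you extract yourself from the warnings, one target per line. The class name FailedRedirectCheck is made up for this example, and the comparison is an exact string match with no normalization.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: reports failed-redirect targets that do appear as page titles. */
class FailedRedirectCheck {
    /** Returns the targets found among the titles; an empty result is the expected outcome. */
    static List<String> suspicious(Collection<String> titles, Collection<String> failedTargets) {
        Set<String> known = new HashSet<>(titles);
        List<String> result = new ArrayList<>();
        for (String target : failedTargets)
            if (known.contains(target)) result.add(target);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: titles file (assumed one title per line); args[1]: failed targets, one per line.
        List<String> titles = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> failed = Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8);
        for (String s : suspicious(titles, failed)) System.out.println(s);
    }
}
```

Any line printed by this sketch is a target that exists as a page title, and therefore a candidate for closer inspection.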
We now have all the information required to build a complete index (we use the Porter2 stemmer in this example):
```
 java it.unimi.di.big.mg4j.tool.IndexBuilder \
   -o "it.unimi.di.big.mg4j.document.WikipediaDocumentSequence(enwiki.xml,false,http://en.wikipedia.org/wiki/,true)" \
   --all-fields -v enwikired.vdr -t EnglishStemmer enwiki
 
```
Finally, we can build the entity graph using a bridge class that exposes any DocumentSequence with a virtual field as an ImmutableGraph of the WebGraph framework (the nodes will be in one-to-one correspondence with the documents returned by the index):
```
 java it.unimi.dsi.big.webgraph.BVGraph \
   -s "it.unimi.di.big.mg4j.util.DocumentSequenceImmutableSequentialGraph(\"it.unimi.di.big.mg4j.document.WikipediaDocumentSequence(enwiki.xml,false,http://en.wikipedia.org/wiki/,true)\",anchor,enwikired.vdr)" \
   enwiki
 
```
Additional fields

The additional fields generated by this class (some of which are a bit hacky) are:

title
the title of the Wikipedia page;
id
a payload index containing the Wikipedia identifier of the page;
lastedit
a payload index containing the last edit of the page;
category
a field containing the categories of the page, separated by an artificial marker OXOXO (so when you look for a category as a phrase you don't get false cross-category positives);
firstpar
a heuristically generated first paragraph of the page, useful for identification beyond the title;
redirects
a virtual field treating the link of the page with its title and any redirect link to the page as an anchor: in practice, the field contains all names under which the page is known in Wikipedia.
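The role of the OXOXO marker in the category field can be illustrated with a small, self-contained sketch. The class CategoryMarkerSketch and its whitespace-based phrase matching are assumptions made for this example and do not reproduce how the class or the index actually tokenizes text; the point is only that a marker token between categories prevents a phrase from matching across a category boundary.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: a marker token between categories blocks cross-category phrase matches. */
class CategoryMarkerSketch {
    static final String MARKER = "OXOXO";

    /** Joins category names, placing the marker as a separate token between them. */
    static String join(List<String> categories) {
        return String.join(" " + MARKER + " ", categories);
    }

    /** Naive, case-sensitive phrase containment over whitespace-separated tokens. */
    static boolean containsPhrase(String text, String phrase) {
        List<String> textTokens = Arrays.asList(text.split("\\s+"));
        List<String> phraseTokens = Arrays.asList(phrase.split("\\s+"));
        return Collections.indexOfSubList(textTokens, phraseTokens) >= 0;
    }

    public static void main(String[] args) {
        List<String> cats = Arrays.asList("Cities in Italy", "Capitals in Europe");
        // Without a marker, a phrase can straddle two categories.
        System.out.println(containsPhrase(String.join(" ", cats), "Italy Capitals")); // true
        // With the marker, the false positive disappears while real categories still match.
        System.out.println(containsPhrase(join(cats), "Italy Capitals"));             // false
        System.out.println(containsPhrase(join(cats), "Capitals in Europe"));         // true
    }
}
```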

Note that for each link in a disambiguation page this class will generate a fake link with the same target, but with the title of the disambiguation page as text. This is in the same spirit as the redirects field: we enrich the HTML anchor field with useful information without altering the generated graph.
See Also:

Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class	Description
`static class`	`WikipediaDocumentSequence.MetadataKeys`
`static class`	`WikipediaDocumentSequence.SignedRedirectedStringMap`	A wrapper around a signed function that remaps entries exceeding a provided threshold using a specified target array.
`static class`	`WikipediaDocumentSequence.WikipediaHeaderFactory`	A factory responsible for special Wikipedia fields (see the class documentation).
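The remapping idea described for SignedRedirectedStringMap can be sketched as follows. This is not the actual implementation: the class RedirectRemapSketch is hypothetical, a plain HashMap stands in for the signed function, and the sketch assumes that values at or above the threshold are redirect entries whose final value is looked up in the target array at position value minus threshold.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: remaps function values at or above a threshold through a target array. */
class RedirectRemapSketch {
    private final Map<String, Long> function; // stand-in for a signed function (assumption)
    private final long threshold;             // first value treated as a redirect (assumption)
    private final long[] target;              // target[i] is the value for entry threshold + i

    RedirectRemapSketch(Map<String, Long> function, long threshold, long[] target) {
        this.function = function;
        this.threshold = threshold;
        this.target = target;
    }

    /** Returns the possibly remapped value for the key, or -1 if the key is unknown. */
    long get(String key) {
        Long value = function.get(key);
        if (value == null) return -1;
        return value < threshold ? value : target[(int) (value - threshold)];
    }

    public static void main(String[] args) {
        Map<String, Long> f = new HashMap<>();
        f.put("Rome", 0L);
        f.put("Milan", 1L);
        f.put("Roma", 2L); // a redirect name, numbered after the real pages
        RedirectRemapSketch map = new RedirectRemapSketch(f, 2, new long[] { 0 });
        System.out.println(map.get("Roma"));  // 0: resolved to the same value as "Rome"
        System.out.println(map.get("Milan")); // 1: below the threshold, returned unchanged
    }
}
```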

Constructor Summary

Constructors
Constructor	Description
`WikipediaDocumentSequence(String file, boolean bzipped, String baseURL, boolean parseText)`	Builds a new Wikipedia document sequence using default anchor settings and discarding namespaced pages and templates.
`WikipediaDocumentSequence(String file, boolean bzipped, String baseURL, boolean parseText, boolean keepNamespaced, boolean keepTemplates)`	Builds a new Wikipedia document sequence using default anchor settings.
`WikipediaDocumentSequence(String file, boolean bzipped, String baseURL, boolean parseText, boolean keepNamespaced, boolean keepTemplates, int maxPreAnchor, int maxAnchor, int maxPostAnchor, String delimiter)`	Builds a new Wikipedia document sequence.
`WikipediaDocumentSequence(String file, String bzipped, String baseURL, String parseText)`	A string-based constructor to be used with an `ObjectParser`.
`WikipediaDocumentSequence(String file, String bzipped, String baseURL, String parseText, String keepNamespaced, String keepTemplates)`	A string-based constructor to be used with an `ObjectParser`.
`WikipediaDocumentSequence(String file, String bzipped, String baseURL, String parseText, String keepNamespaced, String keepTemplates, String maxBeforeAnchor, String maxAnchor, String maxPostAnchor)`	A string-based constructor to be used with an `ObjectParser`.
`WikipediaDocumentSequence(String file, String bzipped, String baseURL, String parseText, String keepNamespaced, String keepTemplates, String maxBeforeAnchor, String maxAnchor, String maxPostAnchor, String delimiter)`	A string-based constructor to be used with an `ObjectParser`.

Method Summary

Modifier and Type	Method	Description
`DocumentFactory`	`factory()`	Returns the factory used by this sequence.
`DocumentIterator`	`iterator()`	Returns an iterator over the sequence of documents.
`static void`	`main(String[] arg)`

Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentSequence
close, filename, finalize, load

Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

WikipediaDocumentSequence
```
public WikipediaDocumentSequence(String file,
                                 boolean bzipped,
                                 String baseURL,
                                 boolean parseText)
```
Builds a new Wikipedia document sequence using default anchor settings and discarding namespaced pages and templates.

Parameters:

file - the file containing the Wikipedia dump.

bzipped - whether file is compressed with bzip2.

baseURL - a base URL for links (e.g., for the English Wikipedia, http://en.wikipedia.org/wiki/); note that if it is nonempty this string must terminate with a slash.

parseText - whether to parse the text (this parameter is only set to false during metadata-scanning phases to speed up the scanning process).
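The trailing-slash requirement on baseURL is easy to get wrong, so a quick check before launching a long run may help. The sketch below is an illustration under assumptions: the class BaseUrlSketch is hypothetical, and pageURL uses plain concatenation without any of the title encoding the real class may apply. It shows only that a missing slash would fuse the base and the page name.

```java
/** Illustrative only: validates the trailing slash of a base URL before building page URLs. */
class BaseUrlSketch {
    /** Rejects a nonempty base URL that does not end with a slash. */
    static String checkBaseURL(String baseURL) {
        if (!baseURL.isEmpty() && !baseURL.endsWith("/"))
            throw new IllegalArgumentException("Base URL must end with a slash: " + baseURL);
        return baseURL;
    }

    /** Plain concatenation; any title encoding is deliberately left out of this sketch. */
    static String pageURL(String baseURL, String pageName) {
        return checkBaseURL(baseURL) + pageName;
    }

    public static void main(String[] args) {
        System.out.println(pageURL("http://en.wikipedia.org/wiki/", "Rome"));
    }
}
```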

WikipediaDocumentSequence
```
public WikipediaDocumentSequence(String file,
                                 boolean bzipped,
                                 String baseURL,
                                 boolean parseText,
                                 boolean keepNamespaced,
                                 boolean keepTemplates)
```
Builds a new Wikipedia document sequence using default anchor settings.

Parameters:

file - the file containing the Wikipedia dump.

bzipped - whether file is compressed with bzip2.

baseURL - a base URL for links (e.g., for the English Wikipedia, http://en.wikipedia.org/wiki/); note that if it is nonempty this string must terminate with a slash.

parseText - whether to parse the text (this parameter is only set to false during metadata-scanning phases to speed up the scanning process).

keepNamespaced - whether to keep namespaced pages (e.g., Template:something pages).

keepTemplates - whether to keep templates (e.g., infoboxes, taxoboxes, etc.); we suggest passing false if you're building a Wikipedia graph.

WikipediaDocumentSequence
```
public WikipediaDocumentSequence(String file,
                                 boolean bzipped,
                                 String baseURL,
                                 boolean parseText,
                                 boolean keepNamespaced,
                                 boolean keepTemplates,
                                 int maxPreAnchor,
                                 int maxAnchor,
                                 int maxPostAnchor,
                                 String delimiter)
```
Builds a new Wikipedia document sequence.

Parameters:

file - the file containing the Wikipedia dump.

bzipped - whether file is compressed with bzip2.

baseURL - a base URL for links (e.g., for the English Wikipedia, http://en.wikipedia.org/wiki/); note that if it is nonempty this string must terminate with a slash.

parseText - whether to parse the text (this parameter is only set to false during metadata-scanning phases to speed up the scanning process).

keepNamespaced - whether to keep namespaced pages (e.g., Template:something pages).

keepTemplates - whether to keep templates (e.g., infoboxes, taxoboxes, etc.); we suggest passing false if you're building a Wikipedia graph.

maxPreAnchor - maximum number of characters before an anchor.

maxAnchor - maximum number of characters in an anchor.

maxPostAnchor - maximum number of characters after an anchor.

delimiter - a token that will be inserted to delimit the anchor text, or null for no delimiter.
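To give an intuition for the four anchor parameters, here is a minimal sketch of how a bounded anchor context could be assembled. It is an assumption-laden illustration, not the logic used by HtmlDocumentFactory: the class AnchorContextSketch is hypothetical, and the sketch assumes the delimiter is placed on both sides of the anchor text, with the text before and after the anchor truncated to the given character limits.

```java
/** Illustrative only: assembles pre-anchor text, anchor text and post-anchor text with limits. */
class AnchorContextSketch {
    static String context(String before, String anchor, String after,
                          int maxPreAnchor, int maxAnchor, int maxPostAnchor, String delimiter) {
        // Keep at most the last maxPreAnchor characters preceding the anchor.
        String pre = before.substring(Math.max(0, before.length() - maxPreAnchor));
        // Keep at most the first maxAnchor characters of the anchor itself.
        String mid = anchor.substring(0, Math.min(anchor.length(), maxAnchor));
        // Keep at most the first maxPostAnchor characters following the anchor.
        String post = after.substring(0, Math.min(after.length(), maxPostAnchor));
        // A null delimiter means no delimiter; its placement here is an assumption.
        String d = delimiter == null ? "" : delimiter;
        return pre + d + mid + d + post;
    }

    public static void main(String[] args) {
        System.out.println(context("born in ", "Rome", ", Italy", 3, 10, 2, "|"));
    }
}
```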

WikipediaDocumentSequence
```
public WikipediaDocumentSequence(String file,
                                 String bzipped,
                                 String baseURL,
                                 String parseText)
```
A string-based constructor to be used with an ObjectParser.

See Also: WikipediaDocumentSequence(String, boolean, String, boolean)

WikipediaDocumentSequence
```
public WikipediaDocumentSequence(String file,
                                 String bzipped,
                                 String baseURL,
                                 String parseText,
                                 String keepNamespaced,
                                 String keepTemplates)
```
A string-based constructor to be used with an ObjectParser.

See Also: WikipediaDocumentSequence(String, boolean, String, boolean, boolean, boolean)

WikipediaDocumentSequence
```
public WikipediaDocumentSequence(String file,
                                 String bzipped,
                                 String baseURL,
                                 String parseText,
                                 String keepNamespaced,
                                 String keepTemplates,
                                 String maxBeforeAnchor,
                                 String maxAnchor,
                                 String maxPostAnchor)
```
A string-based constructor to be used with an ObjectParser.

See Also: WikipediaDocumentSequence(String, boolean, String, boolean, boolean, boolean, int, int, int, String)

WikipediaDocumentSequence
```
public WikipediaDocumentSequence(String file,
                                 String bzipped,
                                 String baseURL,
                                 String parseText,
                                 String keepNamespaced,
                                 String keepTemplates,
                                 String maxBeforeAnchor,
                                 String maxAnchor,
                                 String maxPostAnchor,
                                 String delimiter)
```
A string-based constructor to be used with an ObjectParser.

See Also: WikipediaDocumentSequence(String, boolean, String, boolean, boolean, boolean, int, int, int, String)

Method Detail
- iterator
```
public DocumentIterator iterator()
                          throws IOException
```
  Description copied from interface: DocumentSequence
  
  Returns an iterator over the sequence of documents.
  Warning: this method can be safely called just one time. For instance, implementations based on standard input will usually throw an exception if this method is called twice.
  Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.
  
  Specified by:
  
  iterator in interface DocumentSequence
  
  Returns:
  
  an iterator over the sequence of documents.
  
  Throws:
  
  IOException
  
  See Also:
  
  DocumentCollection
- factory
```
public DocumentFactory factory()
```
  Description copied from interface: DocumentSequence
  
  Returns the factory used by this sequence.
  Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
  
  Specified by:
  
  factory in interface DocumentSequence
  
  Returns:
  
  the factory used by this sequence.
- main
```
public static void main(String[] arg)
                 throws ParserConfigurationException,
                        SAXException,
                        IOException,
                        com.martiansoftware.jsap.JSAPException,
                        ClassNotFoundException
```
  Throws:
  
  ParserConfigurationException
  
  SAXException
  
  IOException
  
  com.martiansoftware.jsap.JSAPException
  
  ClassNotFoundException
