Class FilterOutWikipediaDuplicates


  • public class FilterOutWikipediaDuplicates
    extends Object
    Reads a Wikipedia XML dump and writes out the same dump with duplicate pages removed. A page is a duplicate if its title has already appeared earlier in the XML stream, so only the first page with a given title is kept.
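
    The title-based rule described above can be illustrated with a minimal sketch. This is not the actual implementation of FilterOutWikipediaDuplicates; the class name DedupSketch and the method filter are hypothetical. The sketch also assumes that each <page> and </page> tag sits on its own line and that each <title> element fits on a single line, as is typical of Wikipedia dumps.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the real FilterOutWikipediaDuplicates code.
public class DedupSketch {

    /**
     * Copies the dump from in to out, dropping every page whose title
     * has already been seen earlier in the stream.
     * Assumes <page>, </page> and <title>...</title> each occupy one line.
     */
    public static void filter(Reader in, Writer out) throws IOException {
        Set<String> seenTitles = new HashSet<>();
        StringBuilder page = null; // non-null while inside a <page> element
        String title = null;       // title of the page being buffered, if found
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (page == null) {
                if (trimmed.equals("<page>")) {
                    // Start buffering a new page.
                    page = new StringBuilder();
                    title = null;
                    page.append(line).append('\n');
                } else {
                    // Content outside any page is copied through unchanged.
                    out.write(line);
                    out.write('\n');
                }
                continue;
            }
            page.append(line).append('\n');
            if (title == null
                    && trimmed.startsWith("<title>")
                    && trimmed.endsWith("</title>")) {
                title = trimmed.substring(
                        "<title>".length(),
                        trimmed.length() - "</title>".length());
            }
            if (trimmed.equals("</page>")) {
                // Set.add returns false when the title was already present.
                // Pages without a recognizable title are kept as they are.
                if (title == null || seenTitles.add(title)) {
                    out.write(page.toString());
                }
                page = null;
            }
        }
        if (page != null) {
            // Unterminated final page: emit it rather than lose data.
            out.write(page.toString());
        }
        out.flush();
    }
}
```

    Buffering one page at a time keeps memory use proportional to the largest page plus the set of titles seen so far, which matters for multi-gigabyte dumps. A production version would more likely use a streaming XML parser than line matching.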