it.unimi.di.mg4j.document
Class WikipediaDocumentCollection

java.lang.Object
  extended by it.unimi.di.mg4j.document.AbstractDocumentSequence
      extended by it.unimi.di.mg4j.document.AbstractDocumentCollection
          extended by it.unimi.di.mg4j.document.WikipediaDocumentCollection
All Implemented Interfaces:
DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable

public class WikipediaDocumentCollection
extends AbstractDocumentCollection
implements Serializable

A DocumentCollection corresponding to a given set of files in the Yahoo! Wikipedia format.

This class provides a main method with a flexible syntax that serialises into a document collection a list of (possibly gzip'd) files given on the command line or piped into standard input. The files must come from the semantically annotated snapshot of the English Wikipedia distributed by Yahoo!. The position of each record is stored in one EliasFanoMonotoneLongBigList per file, which provides random access with very little overhead.
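
The following is a minimal sketch of how a global document index could be resolved to a file and a record position using per-file pointer lists. It is not the code of this class. RecordLocator and its methods are hypothetical, plain long[] arrays stand in for EliasFanoMonotoneLongBigList, and the layout of firstDocument (the global index of the first document of each file) is an assumption based only on the parameter names of the protected constructor.

```java
import java.util.Arrays;

/** Hypothetical helper: maps a global document index to a file and a byte offset. */
public class RecordLocator {
    /** firstDocument[f] is the global index of the first document in file f (assumed layout). */
    private final int[] firstDocument;
    /** pointers[f][k] is the offset of the k-th record in file f; plain arrays replace the compressed lists. */
    private final long[][] pointers;

    public RecordLocator(int[] firstDocument, long[][] pointers) {
        this.firstDocument = firstDocument;
        this.pointers = pointers;
    }

    /** Returns the file containing the given global index (assumes every file is non-empty). */
    public int fileOf(int index) {
        int pos = Arrays.binarySearch(firstDocument, index);
        // A miss returns -(insertionPoint) - 1; the containing file is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Returns the offset of the record with the given global index inside its file. */
    public long offsetOf(int index) {
        int f = fileOf(index);
        return pointers[f][index - firstDocument[f]];
    }
}
```

With two files holding three and two records, firstDocument would be {0, 3}, and global index 4 would resolve to the second record of the second file.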

Each column of the collection is indexed in parallel, and is accessible using its label as field name. For instance, a query like

 Washington ^ WSJ:(B\-E\:PERSON | B\-I\:PERSON)
 
will search for “Washington”, but only if the term has been annotated as a person name (note the escaping, which is necessary if you use the standard parser). See the it.unimi.di.mg4j.search package for details on the available operators.
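
As an illustration of the escaping, the sketch below builds the query shown above from unescaped tag names. QueryEscaper, escapeTag and personQuery are hypothetical names, not part of MG4J, and the helper escapes only the two characters escaped in the example (- and :); the standard parser may require other characters to be escaped as well.

```java
/** Hypothetical helper that escapes annotation tags for use in a query string. */
public class QueryEscaper {

    /** Prefixes each '-' and ':' with a backslash, as in the example query. */
    public static String escapeTag(String tag) {
        StringBuilder sb = new StringBuilder();
        for (char c : tag.toCharArray()) {
            if (c == '-' || c == ':') sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds a query matching a term only where it is annotated as a person name. */
    public static String personQuery(String term) {
        return term + " ^ WSJ:(" + escapeTag("B-E:PERSON") + " | " + escapeTag("B-I:PERSON") + ")";
    }

    public static void main(String[] args) {
        System.out.println(personQuery("Washington"));
    }
}
```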

See the collection page for more information about the tagging process.

See Also:
Serialized Form

Nested Class Summary
static class WikipediaDocumentCollection.WhitespaceWordReader
           
 
Nested classes/interfaces inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection
AbstractDocumentCollection.PropertyKeys
 
Field Summary
 
Fields inherited from interface it.unimi.di.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
 
Constructor Summary
  WikipediaDocumentCollection(String[] file, DocumentFactory factory, boolean phrase)
          Builds a document collection corresponding to a given set of Wikipedia files specified as an array.
  WikipediaDocumentCollection(String[] file, DocumentFactory factory, boolean phrase, boolean gzipped)
          Builds a document collection corresponding to a given set of (possibly gzip'd) Wikipedia files specified as an array.
protected WikipediaDocumentCollection(String[] file, DocumentFactory factory, ObjectArrayList<EliasFanoMonotoneLongBigList> pointers, int size, int[] firstDocument, boolean phrase, boolean gzipped)
           
 
Method Summary
 WikipediaDocumentCollection copy()
           
 Document document(int index)
          Returns the document given its index.
 DocumentFactory factory()
          Returns the factory used by this sequence.
 DocumentIterator iterator()
          Returns an iterator over the sequence of documents.
static void main(String[] arg)
           
 Reference2ObjectMap<Enum<?>,Object> metadata(int index)
          Returns the metadata map for a document.
 int size()
          Returns the number of documents in this collection.
 InputStream stream(int index)
          Returns an input stream for the raw content of a document.
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, printAllDocuments, toString
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentSequence
close, filename, finalize, load
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface it.unimi.di.mg4j.document.DocumentSequence
close, filename
 

Constructor Detail

WikipediaDocumentCollection

public WikipediaDocumentCollection(String[] file,
                                   DocumentFactory factory,
                                   boolean phrase)
                            throws IOException
Builds a document collection corresponding to a given set of Wikipedia files specified as an array.

Beware. This class is not guaranteed to work if files are deleted or modified after creation!

Parameters:
file - an array of the names of the files that make up the collection.
factory - the factory that will be used to create documents.
phrase - whether phrases should be indexed instead of documents.
Throws:
IOException

WikipediaDocumentCollection

public WikipediaDocumentCollection(String[] file,
                                   DocumentFactory factory,
                                   boolean phrase,
                                   boolean gzipped)
                            throws IOException
Builds a document collection corresponding to a given set of (possibly gzip'd) Wikipedia files specified as an array.

Beware. This class is not guaranteed to work if files are deleted or modified after creation!

Parameters:
file - an array of the names of the files that make up the collection.
factory - the factory that will be used to create documents.
phrase - whether phrases should be indexed instead of documents.
gzipped - whether the files in file are gzip'd.
Throws:
IOException
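
The effect of a gzipped flag can be sketched with the standard java.util.zip classes: the raw stream is wrapped in a GZIPInputStream only when the flag is set. This is an illustration under that assumption, not the implementation of this class; MaybeGzipped and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper showing how a gzipped flag can select the input stream. */
public class MaybeGzipped {

    /** Wraps the raw stream in a decompressing stream only if gzipped is true. */
    public static InputStream wrap(InputStream raw, boolean gzipped) throws IOException {
        return gzipped ? new GZIPInputStream(raw) : raw;
    }

    /** Reads the whole (possibly gzip'd) byte array as UTF-8 text. */
    public static String readAll(byte[] data, boolean gzipped) {
        try (InputStream in = wrap(new ByteArrayInputStream(data), gzipped)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            for (int n; (n = in.read(buffer)) != -1; ) out.write(buffer, 0, n);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses text with gzip; used here only to produce sample input. */
    public static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```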

WikipediaDocumentCollection

protected WikipediaDocumentCollection(String[] file,
                                      DocumentFactory factory,
                                      ObjectArrayList<EliasFanoMonotoneLongBigList> pointers,
                                      int size,
                                      int[] firstDocument,
                                      boolean phrase,
                                      boolean gzipped)

Method Detail

factory

public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.

Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.

Specified by:
factory in interface DocumentSequence
Returns:
the factory used by this sequence.

size

public int size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.

Specified by:
size in interface DocumentCollection
Returns:
the number of documents in this collection.

metadata

public Reference2ObjectMap<Enum<?>,Object> metadata(int index)
                                             throws IOException
Description copied from interface: DocumentCollection
Returns the metadata map for a document.

Specified by:
metadata in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the metadata map for the document.
Throws:
IOException

document

public Document document(int index)
                  throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.

Specified by:
document in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the index-th document.
Throws:
IOException

stream

public InputStream stream(int index)
                   throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.

Specified by:
stream in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the raw content of the document as an input stream.
Throws:
IOException

iterator

public DocumentIterator iterator()
                          throws IOException
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.

Warning: this method can safely be called only once. For instance, implementations based on standard input will usually throw an exception if this method is called twice.

Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.

Specified by:
iterator in interface DocumentSequence
Overrides:
iterator in class AbstractDocumentCollection
Returns:
an iterator over the sequence of documents.
Throws:
IOException
See Also:
DocumentCollection
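
The one-call restriction described in the warning above can be pictured with a small stand-alone sketch. OneShotSequence is a hypothetical class built on java.util.Iterator rather than on the MG4J interfaces, and the choice of IllegalStateException is an assumption; it illustrates the general DocumentSequence contract, not the behaviour of this collection.

```java
import java.util.Iterator;

/** Hypothetical sequence whose iterator() may be called only once. */
public class OneShotSequence implements Iterable<String> {
    /** The underlying single-pass source, for example lines read from standard input. */
    private final Iterator<String> source;
    /** Set once iterator() has handed out the source. */
    private boolean consumed;

    public OneShotSequence(Iterator<String> source) {
        this.source = source;
    }

    @Override
    public Iterator<String> iterator() {
        // A second call cannot rewind a single-pass source, so it fails loudly.
        if (consumed) throw new IllegalStateException("iterator() was already called");
        consumed = true;
        return source;
    }
}
```

A collection backed by files with stored record positions can instead create a fresh iterator on each call, which is why implementations of DocumentCollection may lift the restriction.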

copy

public WikipediaDocumentCollection copy()
Specified by:
copy in interface DocumentCollection
Specified by:
copy in interface FlyweightPrototype<DocumentCollection>

main

public static void main(String[] arg)
                 throws IOException,
                        com.martiansoftware.jsap.JSAPException,
                        InstantiationException,
                        IllegalAccessException,
                        InvocationTargetException,
                        NoSuchMethodException
Throws:
IOException
com.martiansoftware.jsap.JSAPException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException