java.lang.Object
it.unimi.di.mg4j.document.AbstractDocumentSequence
it.unimi.di.mg4j.document.AbstractDocumentCollection
it.unimi.di.mg4j.document.WikipediaDocumentCollection
public class WikipediaDocumentCollection
A DocumentCollection corresponding to a given set of files in the Yahoo! Wikipedia format.
This class provides a main method with a flexible syntax that serialises a list of (possibly gzip'd) files, given on the command line or piped into standard input, into a document collection. The files are to be taken from the semantically annotated snapshot of the English Wikipedia distributed by Yahoo!.
The position of each record is stored using an EliasFanoMonotoneLongBigList per file, which gives us random access with very little overhead.
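The idea of storing record positions for random access can be sketched in plain Java. This is only an illustration under stated assumptions: it uses an uncompressed long[] in place of EliasFanoMonotoneLongBigList, it treats each newline-terminated line as one record (the real record format is not described here), and the class and method names (RecordOffsets, index, record) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: not part of MG4J.
class RecordOffsets {
    /**
     * Scans the data once and returns the start offset of every record,
     * followed by a final sentinel equal to the data length.
     * Assumption: a record is one newline-terminated line.
     */
    static long[] index(byte[] data) {
        List<Long> starts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                starts.add((long) start);
                start = i + 1;
            }
        }
        // A last record without a trailing newline still counts.
        if (start < data.length) starts.add((long) start);

        long[] offsets = new long[starts.size() + 1];
        for (int i = 0; i < starts.size(); i++) offsets[i] = starts.get(i);
        offsets[starts.size()] = data.length; // sentinel: end of the last record
        return offsets;
    }

    /** Returns record i by slicing between two consecutive offsets; no scan is needed. */
    static String record(byte[] data, long[] offsets, int i) {
        int from = (int) offsets[i];
        int to = (int) offsets[i + 1];
        return new String(data, from, to - from, StandardCharsets.UTF_8).trim();
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        long[] offsets = index(data);
        System.out.println(record(data, offsets, 1)); // prints "beta"
    }
}
```

Because the offsets are monotonically non-decreasing, they are a natural fit for a compressed monotone list, which is what keeps the overhead low in the actual class.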
Each column of the collection is indexed in parallel and is accessible using its label as field name. For instance, a query like
Washington ^ WSJ:(B\-E\:PERSON | B\-I\:PERSON)
will search for “Washington”, but only if the term has been annotated as a person name (note the escaping, which is necessary if you use the standard parser). See the it.unimi.di.mg4j.search package for more information about the available operators.
See the collection page for more information about the tagging process.
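The escaping in the query above can be produced programmatically. The following is a minimal sketch, assuming that '-' and ':' are the only characters that need a backslash (these are the two escaped in the example; the standard parser may reserve others). QueryEscaper, escapeLabel, and annotatedQuery are hypothetical names, not MG4J API.

```java
// Hypothetical helper: not part of MG4J.
class QueryEscaper {
    /**
     * Prefixes '-' and ':' with a backslash and copies every other character unchanged.
     * Assumption: no other character in the label needs escaping.
     */
    static String escapeLabel(String label) {
        StringBuilder sb = new StringBuilder(label.length() + 4);
        for (int i = 0; i < label.length(); i++) {
            char c = label.charAt(i);
            if (c == '-' || c == ':') sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds a query of the form: term ^ field:(label1 | label2 | ...). */
    static String annotatedQuery(String term, String field, String... labels) {
        StringBuilder sb = new StringBuilder(term).append(" ^ ").append(field).append(":(");
        for (int i = 0; i < labels.length; i++) {
            if (i > 0) sb.append(" | ");
            sb.append(escapeLabel(labels[i]));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // Prints: Washington ^ WSJ:(B\-E\:PERSON | B\-I\:PERSON)
        System.out.println(annotatedQuery("Washington", "WSJ", "B-E:PERSON", "B-I:PERSON"));
    }
}
```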
Nested Class Summary | |
---|---|
static class | WikipediaDocumentCollection.WhitespaceWordReader |
Nested classes/interfaces inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection |
---|
AbstractDocumentCollection.PropertyKeys |
Field Summary |
---|
Fields inherited from interface it.unimi.di.mg4j.document.DocumentCollection |
---|
DEFAULT_EXTENSION |
Constructor Summary | |
---|---|
public | WikipediaDocumentCollection(String[] file, DocumentFactory factory, boolean phrase): Builds a document collection corresponding to a given set of Wikipedia files specified as an array. |
public | WikipediaDocumentCollection(String[] file, DocumentFactory factory, boolean phrase, boolean gzipped): Builds a document collection corresponding to a given set of (possibly gzip'd) Wikipedia files specified as an array. |
protected | WikipediaDocumentCollection(String[] file, DocumentFactory factory, ObjectArrayList<EliasFanoMonotoneLongBigList> pointers, int size, int[] firstDocument, boolean phrase, boolean gzipped) |
Method Summary | |
---|---|
WikipediaDocumentCollection | copy() |
Document | document(int index): Returns the document given its index. |
DocumentFactory | factory(): Returns the factory used by this sequence. |
DocumentIterator | iterator(): Returns an iterator over the sequence of documents. |
static void | main(String[] arg) |
Reference2ObjectMap<Enum<?>,Object> | metadata(int index): Returns the metadata map for a document. |
int | size(): Returns the number of documents in this collection. |
InputStream | stream(int index): Returns an input stream for the raw content of a document. |
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection |
---|
ensureDocumentIndex, printAllDocuments, toString |
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentSequence |
---|
close, filename, finalize, load |
Methods inherited from class java.lang.Object |
---|
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Methods inherited from interface it.unimi.di.mg4j.document.DocumentSequence |
---|
close, filename |
Constructor Detail |
---|
public WikipediaDocumentCollection(String[] file, DocumentFactory factory, boolean phrase) throws IOException
Beware. This class is not guaranteed to work if files are deleted or modified after creation!
Parameters:
file - an array containing the files that will be contained in the collection.
factory - the factory that will be used to create documents.
phrase - whether phrases should be indexed instead of documents.
Throws:
IOException
public WikipediaDocumentCollection(String[] file, DocumentFactory factory, boolean phrase, boolean gzipped) throws IOException
Beware. This class is not guaranteed to work if files are deleted or modified after creation!
Parameters:
file - an array containing the files that will be contained in the collection.
factory - the factory that will be used to create documents.
phrase - whether phrases should be indexed instead of documents.
gzipped - whether the files in file are gzip'd.
Throws:
IOException
protected WikipediaDocumentCollection(String[] file, DocumentFactory factory, ObjectArrayList<EliasFanoMonotoneLongBigList> pointers, int size, int[] firstDocument, boolean phrase, boolean gzipped)
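The protected constructor carries no description, but its size and firstDocument parameters suggest a mapping from a global document index to a file and a position within that file. The sketch below shows one way such a lookup could work. It is an assumption, not the documented behaviour: it supposes that firstDocument[f] holds the global index of the first document in file f, that the array is strictly increasing, and that it starts at 0. FileLocator, fileOf, and localIndex are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical illustration: not part of MG4J.
class FileLocator {
    /**
     * Returns the file that contains the document with the given global index.
     * Assumption: firstDocument[f] is the global index of the first document
     * of file f, strictly increasing, with firstDocument[0] == 0.
     */
    static int fileOf(int[] firstDocument, int index) {
        int pos = Arrays.binarySearch(firstDocument, index);
        // Exact hit: the document is the first one of file pos.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the
        // containing file is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Returns the position of the document inside its file. */
    static int localIndex(int[] firstDocument, int index) {
        return index - firstDocument[fileOf(firstDocument, index)];
    }

    public static void main(String[] args) {
        int[] firstDocument = {0, 3, 7}; // file 0 holds documents 0..2, file 1 holds 3..6, file 2 starts at 7
        System.out.println(fileOf(firstDocument, 5) + " " + localIndex(firstDocument, 5)); // prints "1 2"
    }
}
```

The local index obtained this way is what would then be used to look up the record position in the per-file pointer list.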
Method Detail |
---|
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
Specified by:
factory in interface DocumentSequence
public int size()
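Description copied from interface: DocumentCollection
Returns the number of documents in this collection.
Specified by:
size in interface DocumentCollection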
public Reference2ObjectMap<Enum<?>,Object> metadata(int index) throws IOException
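Description copied from interface: DocumentCollection
Returns the metadata map for a document.
Specified by:
metadata in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Throws:
IOException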
public Document document(int index) throws IOException
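Description copied from interface: DocumentCollection
Returns the document given its index.
Specified by:
document in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the index-th document.
Throws:
IOException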
public InputStream stream(int index) throws IOException
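Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.
Specified by:
stream in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Throws:
IOException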
public DocumentIterator iterator() throws IOException
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.
Warning: this method can be safely called only once. For instance, implementations based on standard input will usually throw an exception if this method is called twice.
Implementations may decide to lift this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.
Specified by:
iterator in interface DocumentSequence
Overrides:
iterator in class AbstractDocumentCollection
Throws:
IOException
See Also:
DocumentCollection
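The single-call warning inherited from DocumentSequence can be illustrated with a small stream-backed sequence. This sketch shows the general contract only: it uses plain java.util.Iterator over lines rather than DocumentIterator, the class name OneShotLines is hypothetical, and it says nothing about how WikipediaDocumentCollection itself behaves on a second call.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration: not part of MG4J.
class OneShotLines implements Iterable<String> {
    private final BufferedReader reader;
    private boolean consumed; // set once iterator() has been called

    OneShotLines(Reader source) {
        this.reader = new BufferedReader(source);
    }

    /** Returns an iterator over the lines; a second call fails because the stream cannot be rewound. */
    @Override
    public Iterator<String> iterator() {
        if (consumed) throw new IllegalStateException("iterator() was already called on this stream-backed sequence");
        consumed = true;
        return new Iterator<String>() {
            private String lookahead = readLine();

            private String readLine() {
                try {
                    return reader.readLine();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override
            public boolean hasNext() {
                return lookahead != null;
            }

            @Override
            public String next() {
                if (lookahead == null) throw new NoSuchElementException();
                String current = lookahead;
                lookahead = readLine();
                return current;
            }
        };
    }
}
```

A collection that supports random access can avoid this limitation by creating each iterator from the stored record positions instead of from a shared stream.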
public WikipediaDocumentCollection copy()
Specified by:
copy in interface DocumentCollection
Specified by:
copy in interface FlyweightPrototype<DocumentCollection>
public static void main(String[] arg) throws IOException, com.martiansoftware.jsap.JSAPException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Throws:
IOException
com.martiansoftware.jsap.JSAPException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException