Querying MG4J is easy if you have already used a text-indexing system. The simplest possible query is a single term, e.g., class: the answer to such a query is the set of all documents (in our case, all files among those that have been indexed) that contain the word class (or any uppercase/lowercase variant thereof).
There are several additional operators you might want to try:
AND: writing more than one term (separated by whitespace) means that you want to look for documents that contain all the specified words (not necessarily in the same order or consecutively); for example, the query InputStream Reader encoding means that you want to look for documents that contain all the given words; you can convey the same meaning by using the operator & (a.k.a. AND), thus writing InputStream & Reader & encoding instead;
OR:
if you want to write a disjunctive
query you can use the operator |
(a.k.a. OR);
thus, for example, the query InputStream | Reader |
encoding
means that you are looking for documents that
contain any of the given words;
NOT:
you can use the operator ! (a.k.a.
NOT) to mean negation; thus, for example, the query
InputStream & !Reader
means that you are
looking for documents that contain the first term but not
the second;
phrase: you can force consecutivity by using quotation marks;
thus "InputStream Reader"
means that you want to
look for documents that contain these two words
consecutively;
proximity restriction: you can limit your search to documents where the words you are searching for appear within a limited portion of the document; this is done with the tilde operator; for example, (InputStream Reader)~5 means that you are looking for documents where the two given words appear (in any order) within 5 words of each other;
ordered AND:
writing more than one term
separated by <
will find documents containing
the given terms in the specified order.
wildcard search: you can perform wildcard searches by
appending * at the end of a term; for example,
term*
will look for documents containing "term",
"terms", "termed" and so on.
parentheses: you can use parentheses to enforce priority when building complex queries; parentheses are not needed in many cases, but they are necessary, for example, when a boolean query is written within a phrase; for example, if you want to look for the word InputStream followed by Reader or Writer, you will enter the query "InputStream (Reader | Writer)".
index specifiers: by prefixing a query with the name of an index followed by a colon, you can restrict the search to that index. The name of an index is by default the name of the field that it has indexed, so title:Reader will search for Reader just in titles.
range queries: if you have created an index containing payloads (dates, integers, etc.) you can perform range queries using square brackets and two dots: for instance, assuming the existence of a field date, the query [ 20/2/2007 .. 23/2/2007 ] will search for documents whose date is between 20 February and 23 February 2007, inclusive.
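To make the proximity restriction more concrete, here is a minimal, self-contained Java sketch that does not use any MG4J class. It assumes that "within k words" means that both occurrences fit in a window of k consecutive token positions; this reading, together with the class name ProximitySketch and the method withinWindow, is an assumption made for illustration, and the exact convention used by MG4J should be checked in the documentation of the search package.

```java
public class ProximitySketch {
    /**
     * Returns true if some occurrence of the first term and some occurrence of
     * the second term fit in a window of at most k consecutive positions.
     * Both arrays must be sorted in increasing order.
     */
    static boolean withinWindow(int[] first, int[] second, int k) {
        int i = 0, j = 0;
        while (i < first.length && j < second.length) {
            // Number of positions covered by the two occurrences, ends included.
            int span = Math.abs(first[i] - second[j]) + 1;
            if (span <= k) return true;
            // Advance the pointer that lags behind: this is the only move
            // that can shrink the span.
            if (first[i] < second[j]) i++;
            else j++;
        }
        return false;
    }

    public static void main(String[] args) {
        int[] inputStream = {3, 40}; // hypothetical positions of "InputStream"
        int[] reader = {10, 43};     // hypothetical positions of "Reader"
        System.out.println(withinWindow(inputStream, reader, 5)); // prints true
        System.out.println(withinWindow(inputStream, reader, 3)); // prints false
    }
}
```

The check is symmetric in its two position lists, which mirrors the "in any order" part of the description.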
MG4J will emphasise intervals satisfying the query. By clicking on the link of a document, the document will be opened in the browser.
The description we have just given only scratches the surface of the queries you can write with MG4J: all the operators can be freely combined, obtaining very sophisticated constraints on the documents returned. More information on this topic can be found in the documentation of the package it.unimi.di.big.mg4j.search.
MG4J actually provides very sophisticated query tuning. In particular, it provides scorers, which let you reorder the documents satisfying a query depending on some criterion. To use this feature, you must use the command-line interface, although all settings will also be used for the subsequent web queries.
Type $ to get some help on the available options. A basic command is $mode, which lets you choose the kind of result: just the document number and title, the intervals, snippets and so on. Some options require a full index and a collection (for instance, snippets). The most interesting command, however, is $score, which lets you choose a scorer for your documents. For instance,
$score BM25Scorer VignaScorer
reproduces the standard settings, using a BM25 scorer and a scorer that ranks first the documents satisfying your query more frequently and in smaller intervals, linearly combined with equal weight. Scorers are described in the documentation of the package it.unimi.di.big.mg4j.search.score.
When you use a scorer, it is a good idea to use multiplexing: when multiplexing is on, each query is multiplexed to all indices (by default, a query is directed to the first index specified on the command line). Just type
$mplex on
Of course, you can always choose a specific index with the colon notation. You can also change the weight of your indices (which is particularly useful when multiplexing):
$weight text:1 title:3
In
this way, weight-based scorers will usually consider the
title
field three times more important than the
text
field.
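As a rough illustration of what index weights do, the following self-contained Java sketch combines per-field scores into a single value by a weighted sum. It is not MG4J code: the class WeightedScoreSketch, the method combine, the sample scores, and the choice of treating a field without an explicit weight as having weight 1 are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedScoreSketch {
    /**
     * Returns the sum over all fields of score * weight.
     * A field with no explicit weight is given weight 1 (an assumption).
     */
    static double combine(Map<String, Double> scoreByField, Map<String, Double> weightByField) {
        double total = 0;
        for (Map.Entry<String, Double> e : scoreByField.entrySet()) {
            total += e.getValue() * weightByField.getOrDefault(e.getKey(), 1.0);
        }
        return total;
    }

    public static void main(String[] args) {
        // Same weights as in "$weight text:1 title:3".
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("text", 1.0);
        weights.put("title", 3.0);

        // Hypothetical scores of one document on the two fields.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("text", 0.5);
        scores.put("title", 0.5);

        System.out.println(combine(scores, weights)); // prints 2.0
    }
}
```

With equal scores on the two fields, the title contributes three times as much as the text to the final value, which is the effect described above.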
You can also change the way snippets (or intervals) on display are chosen: MG4J provides an interval selector, a class that will try to choose the best intervals to be shown. You can set the maximum length of an interval, and the maximum number of intervals:
$selector 3 40
will show at most three intervals, and intervals longer than 40 characters will be broken. All these changes are reflected in the web interface.
If you want to learn more about query resolution, you should have a look at the documentation of the class it.unimi.di.big.mg4j.query.QueryEngine, which embodies all the logic used to answer queries in MG4J.
For our next example, we will put to good use the semantically annotated snapshot of the English Wikipedia created at Yahoo!. The collection exhibits Wikipedia articles as a number of parallel texts, one of which is the sequence of tokens, whereas others provide information like "this token is a person's name". MG4J provides an alignment operator that can be used with parallel texts to align results of two queries; in practice, you can ask which results of an arbitrary query match certain semantic conditions. The support has a few rough edges, but it's an interesting example nonetheless.
First of all, you must get the collection, for instance through Yahoo!. The collection is made up of a number of text files (stored, say, in /your/wiki/dir/), which must be recorded in a WikipediaDocumentCollection as follows:
find /your/wiki/dir/ -type f | \
  java it.unimi.di.big.mg4j.document.WikipediaDocumentCollection wiki.collection
Similarly to a FileSetDocumentCollection, the serialized collection will contain references to the files and also a compacted representation of pointers to the start of each record in each file. You can now index as before, and invoke Query.
In this particular case, we use tokens and some semantic
tagging.
java it.unimi.di.big.mg4j.query.Query -h \
  -c wiki.collection wiki-token wiki-WSJ
We're now ready. For instance, the query
Washington ^ WSJ:(B\-E\:PERSON | B\-I\:PERSON)
will
search for "Washington", but only in those positions that have been
marked as person names. This happens because the alignment operator
^
solves the left query, and then keeps just those
results whose positions are the same as those of the second query,
which can be on a completely different index. Note that the left and
right query of an alignment operator are completely arbitrary, and the
overall query is a standard query on the first index. Thus,
"(Washington ^ WSJ:(B\-E\:PERSON | B\-I\:PERSON)) was"
would search for "Washington" as a person name, but only if immediately followed by "was".
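The behaviour of the alignment operator described above can be sketched in a few lines of self-contained Java. The sketch does not use MG4J classes: Interval, AlignmentSketch and align are hypothetical names, and the positions are made up. It simply keeps the intervals of the left query that also appear, with the same extremes, among the intervals of the right query.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

public class AlignmentSketch {
    /** A closed interval of token positions (hypothetical helper, not an MG4J class). */
    static final class Interval {
        final int left, right;

        Interval(int left, int right) {
            this.left = left;
            this.right = right;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Interval
                    && ((Interval) o).left == left
                    && ((Interval) o).right == right;
        }

        @Override
        public int hashCode() {
            return Objects.hash(left, right);
        }

        @Override
        public String toString() {
            return "[" + left + ".." + right + "]";
        }
    }

    /** Keeps the left intervals whose positions coincide with some right interval. */
    static List<Interval> align(List<Interval> left, List<Interval> right) {
        Set<Interval> allowed = new HashSet<>(right);
        List<Interval> kept = new ArrayList<>();
        for (Interval i : left) {
            if (allowed.contains(i)) kept.add(i);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical positions of "Washington" in the token field.
        List<Interval> washington = List.of(new Interval(0, 0), new Interval(5, 5), new Interval(6, 6));
        // Hypothetical positions tagged as person names in the WSJ field.
        List<Interval> person = List.of(new Interval(4, 4), new Interval(5, 5));
        System.out.println(align(washington, person)); // prints [[5..5]]
    }
}
```

Because the two fields are parallel, position 5 denotes the same token in both, so only that occurrence of "Washington" survives.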
If you click on the title of a result, you will be brought to
the corresponding Wikipedia page, as the factory embodied in the
WikipediaDocumentCollection
sets the document
URI to the Wikipedia page. If you want a more technical view of what's
happening, you can use the GenericItem
class,
which will display in a very simple manner the content of all
fields.
Another interesting property of the Wikipedia example is that end-of-sentence markers (¶) are indexed. You can use another fairly exotic operator, Brouwerian difference, to restrict your results to matches that lie inside a sentence. The semantics of a query in MG4J is a set of minimal intervals that represent regions of text satisfying the query. For instance, for the query was killed the intervals describe the smallest regions of text in which was and killed both appear. The difference operator (a minus) eliminates the intervals generated by the left query that contain one or more intervals from the right query. Thus,
was killed - ¶
will perform the same search, but we will see only those results for which there are regions of text not containing ¶. In other words, results will be restricted to be within a sentence, as matches (i.e., again, regions of text) that cross sentence borders will be killed by the difference operator.
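The effect of the difference operator on intervals can be sketched as follows. This is a self-contained Java illustration of the rule stated above (drop every left interval that contains at least one right interval), not MG4J's implementation; Interval, DifferenceSketch and difference are hypothetical names, and the positions are made up.

```java
import java.util.ArrayList;
import java.util.List;

public class DifferenceSketch {
    /** A closed interval of token positions (hypothetical helper, not an MG4J class). */
    static final class Interval {
        final int left, right;

        Interval(int left, int right) {
            this.left = left;
            this.right = right;
        }

        /** True if the other interval lies entirely inside this one. */
        boolean contains(Interval other) {
            return left <= other.left && other.right <= right;
        }

        @Override
        public String toString() {
            return "[" + left + ".." + right + "]";
        }
    }

    /** Keeps the minuend intervals that contain no subtrahend interval. */
    static List<Interval> difference(List<Interval> minuend, List<Interval> subtrahend) {
        List<Interval> kept = new ArrayList<>();
        for (Interval m : minuend) {
            boolean containsSome = false;
            for (Interval s : subtrahend) {
                if (m.contains(s)) {
                    containsSome = true;
                    break;
                }
            }
            if (!containsSome) kept.add(m);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical minimal intervals for the query "was killed".
        List<Interval> wasKilled = List.of(new Interval(2, 3), new Interval(7, 10));
        // Hypothetical positions of the end-of-sentence marker.
        List<Interval> marker = List.of(new Interval(5, 5), new Interval(9, 9));
        System.out.println(difference(wasKilled, marker)); // prints [[2..3]]
    }
}
```

The interval from 7 to 10 spans the marker at position 9, so it crosses a sentence border and is discarded, while the interval from 2 to 3 lies inside a single sentence and is kept.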
Finally, the index remapping operator comes in handy in two situations: displaying the results of a field using another, parallel field, or applying positional operators to results from different fields. If you search for WSJ:(B\-E\:PERSON | B\-I\:PERSON), the resulting snippets will be rather ugly:
Document #205 [2.000000] Protected_areas_of_Tasmania
WSJ: ...0 0 0 B-N:CARDINAL I-N:CARDINAL I-N:CARDINAL I-N:CARDINAL 0 B-E:PERSON I-E:PERSON 0 0 0 0 0 B-N:CARDINAL 0 ...
Document #258 [1.999152] List_of_people_by_name:_Kea-Kel
WSJ: ? 0 0 B-E:PER_DESC 0 0 0 B-E:PERSON ? 0 0 0 0 0 ? 0 ...
This is correct, as MG4J is displaying results from the WSJ field. It is however easy to remap those results to another index: if we try (WSJ:(B\-E\:PERSON | B\-I\:PERSON)){{WSJ->token}}, the result will be like
Document #205 [2.000000] Protected_areas_of_Tasmania
token: ...but it contains no fewer than 495 separate Protected Areas with a total area of 22 , ...
Document #258 [1.999152] List_of_people_by_name:_Kea-Kel
token: ? List of people by name : Kea-Kel ? Access to rest of list ? Access ...
Snippets
are now represented using the parallel content of the
token
field.
Assume now that we want to find a person's name immediately followed by the term "was". A direct attempt would be trying the query "WSJ:(B\-E\:PERSON | B\-I\:PERSON) was": the result would be an error message ("The phrase operator requires subqueries on the same index"). This is correct, because the intervals returned by the two subqueries of the phrase operator are on different indices, so mixing them makes no sense. However, if you're sure that you are handling indices on parallel texts, the idea does make sense, and we can convince MG4J about this as follows:
{token, WSJ}>"(WSJ:(B\-E\:PERSON | B\-I\:PERSON){{WSJ->token}}) (token:was)"
Document #225 [1.996363] Days_of_our_Lives
token: ...family tree by way of SORAS . ? Abby was rapidly aged to a teenager . ? Abby ...
Document #467 [1.995153] Airey_Neave
token: ...in Northern Ireland . ? In 1975 , Neave was the campaign manager for Margaret Thatcher 's victorious ...
In this example, we also qualified "was" with an index selector to avoid problems in case multiplexing is on.
In this section we discuss thoroughly the construction of an index based on the TREC GOV2 collection (the Text REtrieval Conference is a series of events organized by the National Institute of Standards and Technology to evaluate information-retrieval systems scientifically and reproducibly). TREC collections must be bought to be used, but they are very commonly used for scientific work. In this example we use a TRECDocumentCollection to index GOV2 (25 million web pages). Be warned that different collections have slightly different formats, and TRECDocumentCollection might need some tweaking to work with them (we are making it more and more flexible on a per-request basis).
GOV2 data comes as a list of files in directories named GX000, GX001, and so on. The files themselves are gzipped (you can create the collection with uncompressed files, however, if you want faster access).
find GX??? -iname \*.gz | \
  java it.unimi.di.big.mg4j.document.TRECDocumentCollection \
  -f HtmlDocumentFactory -p encoding=ISO-8859-1 -z trec.collection
After some grinding, you'll get the collection. Note that this process is mainly useful for accessing the collection later in a random fashion, for instance to generate snippets.
Since we want to index anchor text, we must now generate URIs that will represent each document.
java it.unimi.di.big.mg4j.tool.ScanMetadata -S trec.collection -u trec.uris
Note the -U option. GOV2 contains many duplicate (and even triplicate) URLs, modulo trivial normalizations such as adding a slash after the host name. The -U option is a very crude way of making them unique. (A more principled mechanism would involve merging all documents with identical URLs, but that should have been addressed when GOV2 was built.)
We are now ready to build our index:
java it.unimi.di.big.mg4j.tool.IndexBuilder -S trec.collection \
  -t snowball.PorterStemmer -a -v anchor:trec.vdr trec
We set a
rather large batch size, assuming that a lot of memory is available.
As we said, Scan
will try to detect low-memory
conditions and dump batches automatically, but you can lower the batch
size, in case you run into out-of-memory errors. We also require a
downcasing Porter stemmer (all Snowball-based stemmers
downcase terms). Beware again: you will be generating hundreds of
batches, so you must be able to open a few thousand files in the
combination phase. When the indexing process is completed, you can
query the index as usual.