Querying MG4J

Querying MG4J is easy if you have already used a text-indexing system. The simplest possible query is a single term, e.g., class: the answer you will obtain from such a query is the set of all documents (in our case, all files among those that have been indexed) that contain the word class (or any uppercase/lowercase variant thereof).

There are several additional operators you might want to try: for instance, | denotes disjunction, quotation marks delimit a phrase, and the colon notation directs a (sub)query to a specific index; more exotic operators, such as alignment and Brouwerian difference, are illustrated in the following sections.

MG4J will emphasise the intervals satisfying the query. If you click on the link of a document, the document will be opened in the browser.

The description above just scratches the surface of the queries you can write with MG4J: all the operators can be freely combined, obtaining very sophisticated constraints on the documents returned. More information on this topic can be found in the documentation of the package it.unimi.di.big.mg4j.search.
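For instance, assuming indices named text and title like those used in the next section, a query such as

title:(graph | network) "shortest path"

asks for documents whose title contains graph or network and which also contain the phrase "shortest path": here | denotes disjunction, juxtaposition denotes conjunction (as in the was killed example discussed below), quotation marks delimit a phrase, and the colon notation directs a subquery to a specific index. The terms, of course, are just placeholders.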

More sophisticated queries

MG4J actually provides very sophisticated query tuning. In particular, it provides scorers, which let you reorder the documents satisfying a query depending on some criterion. To use these features, you must use the command-line interface, although all settings will also be used for subsequent web queries.

Type $ to get some help on the available options. A basic command is $mode, which lets you choose the kind of result displayed: just the document number and title, the intervals, snippets, and so on. Some options require a full index and a collection (for instance, snippets). The most interesting command, however, is $score, which lets you choose a scorer for your documents. For instance,

$score BM25Scorer VignaScorer

reproduces the standard settings, using a BM25 scorer and a scorer that ranks first documents satisfying your query more frequently and in smaller intervals, linearly combined with equal weights. Scorers are described in the documentation of the package it.unimi.di.big.mg4j.search.score.
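If you prefer, say, pure BM25 ranking, you can select a single scorer:

$score BM25Scorer

Subsequent queries, both on the command line and through the web interface, will then be ranked by BM25 alone.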

When you use a scorer, it is a good idea to use multiplexing: when multiplexing is on, each query is multiplexed to all indices (by default, a query is directed to the first index specified on the command line). Just type

$mplex on

Of course, you can always choose a specific index with the colon notation (e.g., title:class queries just the title index). You can also change the weight of your indices (which is particularly useful when multiplexing):

$weight text:1 title:3

In this way, weight-based scorers will usually consider the title field three times more important than the text field.
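Putting these commands together, a short session might look like the following (the query term is arbitrary):

$mplex on
$weight text:1 title:3
graph

The query graph is now multiplexed to both indices, with matches in the title counting three times as much as matches in the text; a query such as title:graph would instead be directed to the title index only.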

You can also change the way displayed snippets (or intervals) are chosen: MG4J provides an interval selector, a class that tries to choose the best intervals to be shown. You can set the maximum number of intervals and the maximum length of an interval:

$selector 3 40

will show at most three intervals, and intervals longer than 40 characters will be broken. All these changes are reflected in the web interface.

If you want to learn more about query resolution, you should have a look at the documentation of the class it.unimi.di.big.mg4j.query.QueryEngine, which embodies all the logic used to answer queries in MG4J.

A semantic index

For our next example, we will put to good use the semantically annotated snapshot of the English Wikipedia created at Yahoo!. The collection exhibits Wikipedia articles as a number of parallel texts, one of which is the sequence of tokens, whereas the others provide information like "this token is a person's name". MG4J provides an alignment operator that can be used with parallel texts to align the results of two queries: in practice, you can ask which results of an arbitrary query match certain semantic conditions. The support has a few rough edges, but it is an interesting example nonetheless.

First of all, you must get the collection, for instance through Yahoo!. The collection is made up of a number of text files (stored, say, in /your/wiki/dir/), which must be recorded in a WikipediaDocumentCollection as follows:

find /your/wiki/dir/ -type f | \
    java it.unimi.di.big.mg4j.document.WikipediaDocumentCollection wiki.collection

Similarly to a FileSetDocumentCollection, the serialized collection will contain references to the files, together with a compacted representation of pointers to the start of each record in each file. You can now index as before and then invoke Query; in this particular case, we use the tokens and some of the semantic tagging.
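As a reminder, the indexing step might be performed with a sketch like the following, where the index basename wiki is our own choice, made so as to match the field indices (wiki-token, wiki-WSJ) used below:

java it.unimi.di.big.mg4j.tool.IndexBuilder -S wiki.collection wiki

Once the indices are built, Query can be invoked on them: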

java it.unimi.di.big.mg4j.query.Query -h \
     -c wiki.collection wiki-token wiki-WSJ

We're now ready. For instance, the query

Washington ^ WSJ:(B\-E\:PERSON | B\-I\:PERSON)

will search for "Washington", but only in those positions that have been marked as person names. This happens because the alignment operator ^ resolves the left query, and then keeps just those results whose positions coincide with those of the second query, which can be on a completely different index. Note that the left and right queries of an alignment operator are completely arbitrary, and the overall query is a standard query on the first index. Thus,

"(Washington ^ WSJ:(B\-E\:PERSON | B\-I\:PERSON)) was"

would search for "Washington" as a person name, but only if immediately followed by "was".

If you click on the title of a result, you will be brought to the corresponding Wikipedia page, as the factory embodied in the WikipediaDocumentCollection sets the document URI to the Wikipedia page. If you want a more technical view of what is happening, you can use the GenericItem class, which displays the content of all fields in a very simple manner.

Another interesting property of the Wikipedia example is that end-of-sentence markers (¶) are indexed. You can use another fairly exotic operator, Brouwerian difference, to restrict your results to matches occurring inside a sentence. The semantics of a query in MG4J is a set of minimal intervals representing the regions of text that satisfy the query. For instance, for the query was killed the intervals describe the smallest regions of text in which was and killed appear. The difference operator (a minus sign) eliminates the intervals generated by the left query that contain one or more intervals from the right query. Thus,

was killed - ¶

will perform the same search, but only those results whose regions of text do not contain ¶ will be shown. In other words, results will be restricted to lie within a sentence, as matches (i.e., again, regions of text) that cross sentence borders will be killed by the difference operator.
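The difference operator combines freely with the other operators we have seen: for instance, the following query (in which the term president is just an arbitrary example) should keep only those results in which the person name Washington and the term president co-occur within a single sentence:

((Washington ^ WSJ:(B\-E\:PERSON | B\-I\:PERSON)) president) - ¶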

Finally, the index remapping operator comes in handy in two situations: displaying the results of a field using another, parallel field, or applying positional operators to results from different fields. If you search for WSJ:(B\-E\:PERSON | B\-I\:PERSON), the resulting snippets will be rather ugly:

Document #205 [2.000000] Protected_areas_of_Tasmania
WSJ: ...0 0 0 B-N:CARDINAL I-N:CARDINAL I-N:CARDINAL I-N:CARDINAL 
0 B-E:PERSON I-E:PERSON 0 0 0 0 0 B-N:CARDINAL 0 ... 

Document #258 [1.999152] List_of_people_by_name:_Kea-Kel
WSJ: ? 0 0 B-E:PER_DESC 0 0 0 B-E:PERSON ? 0 0 0 0 0 ? 0 ...

This is correct, as MG4J is displaying results from the WSJ field. It is, however, easy to remap those results to another index: if we try (WSJ:(B\-E\:PERSON | B\-I\:PERSON)){{WSJ->token}}, the result will look like

Document #205 [2.000000] Protected_areas_of_Tasmania
token: ...but it contains no fewer than 495 separate Protected Areas 
with a total area of 22 , ... 

Document #258 [1.999152] List_of_people_by_name:_Kea-Kel
token: ? List of people by name : Kea-Kel ? Access to rest of list ? Access ...

Snippets are now represented using the parallel content of the token field.

Assume now that we want to find a person's name immediately followed by the term "was". A direct attempt would be the query "WSJ:(B\-E\:PERSON | B\-I\:PERSON) was": the result would be an error message ("The phrase operator requires subqueries on the same index"). This is correct, because the intervals returned by the two subqueries of the phrase operator are on different indices, and mixing them makes no sense. However, if you are sure that you are handling indices on parallel texts, the idea does make sense, and we can convince MG4J of this as follows:

{token, WSJ}>"(WSJ:(B\-E\:PERSON | B\-I\:PERSON){{WSJ->token}}) (token:was)"

Document #225 [1.996363] Days_of_our_Lives
token: ...family tree by way of SORAS . ? Abby was rapidly aged to a teenager . ? Abby ...   

Document #467 [1.995153] Airey_Neave
token: ...in Northern Ireland . ? In 1975 , Neave was the campaign manager for Margaret Thatcher 's victorious ...  

In this example, we also qualified "was" with an index selector, to avoid problems in case multiplexing is on.

A TREC index

In this section we discuss thoroughly the construction of an index based on the TREC GOV2 collection (the Text REtrieval Conference is a series of events organized by the National Institute of Standards and Technology to evaluate information-retrieval systems scientifically and reproducibly). TREC collections must be purchased to be used, but they are very commonly used in scientific work. In this example we use a TRECDocumentCollection to index GOV2 (25 million web pages). Be warned that different collections have slightly different formats, and TRECDocumentCollection might need some tweaking to work with them (we are making it more and more flexible on a per-request basis).

GOV2 data comes as a list of files in directories named GX000, GX001, and so on. The files themselves are gzipped (you can create the collection with uncompressed files, however, if you want faster access).

find GX??? -iname \*.gz | \
    java it.unimi.di.big.mg4j.document.TRECDocumentCollection \
    -f HtmlDocumentFactory -p encoding=ISO-8859-1 -z trec.collection

After some grinding, you'll get the collection. Note that this process is mainly useful for accessing the collection later in a random fashion (for instance, to generate snippets).

Since we want to index anchor text, we must now generate URIs that will represent each document.

java it.unimi.di.big.mg4j.tool.ScanMetadata -S trec.collection -U -u trec.uris

Note the -U option. GOV2 contains many duplicate (and even triplicate) URLs, modulo trivial normalizations such as adding a slash after the host name. The -U option is a very crude way of making them unique. (A more principled mechanism would involve merging all documents with identical URLs, but that should have been addressed when GOV2 was built.)
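Before building the index, a note on the trec.vdr file appearing in the command below: it is assumed to be a serialized virtual-document resolver that maps the URIs collected in trec.uris back to document indices, so that the anchor text of a link can be attributed to the document it points to. MG4J provides URL-based resolvers for this purpose (see, for instance, URLMPHVirtualDocumentResolver in the documentation); the construction step is not shown here.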

We are now ready to build our index:

java it.unimi.di.big.mg4j.tool.IndexBuilder -S trec.collection \
    -t snowball.PorterStemmer -a -v anchor:trec.vdr trec

We set a rather large batch size, assuming that a lot of memory is available. As we said, Scan will try to detect low-memory conditions and dump batches automatically, but you can lower the batch size, in case you run into out-of-memory errors. We also require a downcasing Porter stemmer (all Snowball-based stemmers downcase terms). Beware again: you will be generating hundreds of batches, so you must be able to open a few thousand files in the combination phase. When the indexing process is completed, you can query the index as usual.