MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java.
MG4J is distributed under the GNU Lesser General Public License.
MG4J 5.0 brings several new features, but also source and binary incompatibilities with previous releases.
All packages now live under it.unimi.di.*, following the change of name of our department; this eases the transition and makes coexistence with previous versions possible.
DocumentIterator.nextDocument() now returns DocumentIterator.END_OF_LIST instead of -1 to denote list exhaustion.
DocumentIterator is now strictly lazy; in particular, it does not implement Iterator. Please replace calls to hasNext() with a check that DocumentIterator.nextDocument() != DocumentIterator.END_OF_LIST, or check whether the semantics of DocumentIterator.mayHaveNext() suits you.
The position-access methods of IndexIterator have been replaced by the single lazy IndexIterator.nextPosition() call, which returns IndexIterator.END_OF_POSITIONS when the positions are exhausted. Some static methods in IndexIterators should help with the transition.
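The lazy iteration pattern described above can be sketched as follows. This is a minimal illustration, not MG4J code: the Postings class is a hypothetical stand-in that only mimics the nextDocument() and nextPosition() calls, and the sentinel values it uses are assumptions made for the example. The point is the shape of the two loops, which compare the returned value against a sentinel instead of calling hasNext().

```java
import java.util.ArrayList;
import java.util.List;

class LazyIterationSketch {
    /** Hypothetical stand-in for the lazy calls described above; NOT the MG4J interfaces. */
    static final class Postings {
        static final int END_OF_LIST = Integer.MAX_VALUE;      // assumed sentinel value
        static final int END_OF_POSITIONS = Integer.MAX_VALUE; // assumed sentinel value
        private final int[] docs;        // document pointers, in increasing order
        private final int[][] positions; // positions[i] belongs to docs[i]
        private int d = -1;              // index of the current document
        private int p = 0;               // index of the next unread position

        Postings(int[] docs, int[][] positions) {
            this.docs = docs;
            this.positions = positions;
        }

        /** Advances to the next document, or returns END_OF_LIST when exhausted. */
        int nextDocument() {
            p = 0;
            if (d < docs.length) d++;
            return d < docs.length ? docs[d] : END_OF_LIST;
        }

        /** Returns the next position in the current document, or END_OF_POSITIONS. */
        int nextPosition() {
            if (d < 0 || d >= docs.length) throw new IllegalStateException("no current document");
            return p < positions[d].length ? positions[d][p++] : END_OF_POSITIONS;
        }
    }

    /** Walks documents and positions using sentinel checks only. */
    static List<String> dump(Postings it) {
        List<String> lines = new ArrayList<>();
        int doc;
        while ((doc = it.nextDocument()) != Postings.END_OF_LIST) {
            StringBuilder line = new StringBuilder("doc " + doc + ":");
            int pos;
            while ((pos = it.nextPosition()) != Postings.END_OF_POSITIONS) line.append(' ').append(pos);
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        Postings it = new Postings(new int[] {3, 7}, new int[][] {{0, 5}, {2}});
        for (String line : dump(it)) System.out.println(line);
    }
}
```

With real MG4J iterators the loops would have the same shape, with the sentinels taken from DocumentIterator.END_OF_LIST and IndexIterator.END_OF_POSITIONS.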
MG4J is vast. Some of its components are the result of long-term research efforts, and are not easy to describe in full detail. Here we give a roadmap to the documentation, so that you do not have to wander recklessly through dozens of package descriptions.
First of all, MG4J comes with a manual that describes how to build indices, and how to access them from the command line or from the web. It is a good idea to start from the manual, build and play with a few indices, and then come back to package documentation, as the latter often refers to artifacts created by index construction.
If you want to interface MG4J with your own data, you must read the package documentation of it.unimi.di.mg4j.document, which describes document sequences, collections and factories.
If you want to load and query an index, you must read the package documentation of it.unimi.di.mg4j.index, which describes indices and index readers. The package also contains the documentation about term processors, which transform terms before they are actually indexed; they are fundamental to customising the indexing process.
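To give an idea of what a term processor does, here is a minimal sketch. The SimpleTermProcessor interface is a hypothetical stand-in defined only for this example (it is not the MG4J term-processor interface, whose exact signatures are documented in it.unimi.di.mg4j.index); we assume a processor rewrites the term in place and returns false when the term should not be indexed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TermProcessorSketch {
    /** Hypothetical stand-in: rewrites a term in place; returns false to drop it. */
    interface SimpleTermProcessor {
        boolean processTerm(StringBuilder term);
    }

    /** Downcases every term and drops terms that are empty. */
    static final SimpleTermProcessor DOWNCASE = term -> {
        String lower = term.toString().toLowerCase(Locale.ROOT);
        term.setLength(0);
        term.append(lower);
        return term.length() > 0;
    };

    /** Applies a processor to each term and keeps only the terms it accepts. */
    static List<String> process(SimpleTermProcessor processor, String... terms) {
        List<String> kept = new ArrayList<>();
        for (String t : terms) {
            StringBuilder sb = new StringBuilder(t);
            if (processor.processTerm(sb)) kept.add(sb.toString());
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(process(DOWNCASE, "Managing", "", "GIGABYTES"));
    }
}
```

Whatever processor is used at indexing time must also be applied to query terms, otherwise a query for "Managing" would never match the indexed term "managing".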
If you want to have a look at your index, the package it.unimi.di.mg4j.query contains many useful classes that can help. In particular, a simple command-line tool lets you query an index using a standard syntax. The tool also makes it possible to query the index using a browser (if you plan on using the command line frequently, we suggest a utility such as rlwrap to provide command-line history and editing).
Query parsing is handled by it.unimi.di.mg4j.query.parser, which contains a simple parser generated with JavaCC. The parser generates an abstract query described by a composite object whose description is given in it.unimi.di.mg4j.query.nodes. The query can then be turned into a DocumentIterator, which will return the documents matching the query and also the document intervals satisfying the query: the minimal-interval semantics used by MG4J is described in detail in it.unimi.di.mg4j.search, which also contains a description of the syntax used by the command-line tool.
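As a rough illustration of minimal-interval semantics for a conjunctive query, the sketch below computes, for one document, the intervals of positions that contain at least one occurrence of every query term and do not contain a smaller such interval. It is a simple eager sweep written for this overview under stated assumptions (sorted, non-negative positions that are distinct across terms); it is not the algorithm MG4J uses, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MinimalIntervalSketch {
    /**
     * Returns the minimal intervals {left, right} that contain at least one
     * position of every term. Assumes positions[t] is sorted, non-negative,
     * and that no position is shared by two terms.
     */
    static List<int[]> minimalIntervals(int[][] positions) {
        int n = positions.length;
        int[] next = new int[n];   // index of the next unread position of each term
        int[] last = new int[n];   // last position read for each term, -1 if none
        Arrays.fill(last, -1);
        int seen = 0;              // number of terms with at least one position read
        int prevLeft = -1;         // left extreme of the last interval emitted
        List<int[]> result = new ArrayList<>();
        while (true) {
            // Pick the term whose next unread position is smallest.
            int best = -1;
            for (int t = 0; t < n; t++)
                if (next[t] < positions[t].length
                        && (best < 0 || positions[t][next[t]] < positions[best][next[best]])) best = t;
            if (best < 0) break;   // all lists exhausted
            int right = positions[best][next[best]++];
            if (last[best] < 0) seen++;
            last[best] = right;
            if (seen < n) continue; // some term has not appeared yet
            // The tightest interval ending at right starts at the oldest "last" position.
            int left = right;
            for (int p : last) left = Math.min(left, p);
            // Left extremes never decrease; a repeated left extreme means the
            // interval contains one already emitted, so it is not minimal.
            if (left > prevLeft) {
                result.add(new int[] {left, right});
                prevLeft = left;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // First term occurs at positions 0 and 5, second term at 2 and 9.
        List<int[]> intervals = minimalIntervals(new int[][] {{0, 5}, {2, 9}});
        System.out.println(Arrays.deepToString(intervals.toArray()));
    }
}
```

For the positions in main, the minimal intervals are [0, 2], [2, 5] and [5, 9]; the interval [0, 5] also contains both terms but is discarded because it contains [0, 2].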
Once a document iterator returning the matching documents is available, it is usually necessary to rank the documents. MG4J provides an abstract notion of Scorer and provides several examples. Scoring is a very sophisticated issue, and a lot of research has been devoted to this subject. MG4J provides implementations of some state-of-the-art scorers such as BM25, and also new scorers based on minimal-interval semantics such as VignaScorer.
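For readers unfamiliar with BM25, the following sketch computes the contribution of a single query term to the score of a document using a common textbook formulation. It is given only to make the idea concrete: the class and parameter names are hypothetical, the smoothed inverse document frequency is one of several variants in use, and the formula implemented by the MG4J scorer may differ in its details.

```java
class Bm25Sketch {
    /** Smoothed inverse document frequency; always non-negative in this variant. */
    static double idf(long numDocs, long docFreq) {
        return Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Contribution of one term to the BM25 score of one document.
     *
     * @param count        occurrences of the term in the document
     * @param docLength    length of the document, in terms
     * @param avgDocLength average document length in the collection
     * @param numDocs      number of documents in the collection
     * @param docFreq      number of documents containing the term
     * @param k1           term-frequency saturation parameter
     * @param b            length-normalisation parameter, between 0 and 1
     */
    static double termScore(int count, int docLength, double avgDocLength,
            long numDocs, long docFreq, double k1, double b) {
        double norm = k1 * (1 - b + b * docLength / avgDocLength);
        return idf(numDocs, docFreq) * count * (k1 + 1) / (count + norm);
    }

    public static void main(String[] args) {
        // k1 = 1.2 and b = 0.75 are conventional choices, used here as an assumption.
        System.out.println(termScore(3, 120, 100.0, 1000, 10, 1.2, 0.75));
    }
}
```

The score of a document for a multi-term query is the sum of the contributions of the query terms. The contribution grows with the term count but saturates, and it shrinks for documents longer than average.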
All these pieces come together in the QueryEngine, which takes one or more queries, scores their results using one or more scorers, and returns only a certain part of the results themselves, decorated with suitably selected intervals that can be used to generate snippets. The query engine has several tunable parameters, so you can adapt it to your application.
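The "score, then return only a part of the results" step can be pictured with the sketch below. Everything in it is an assumption made for illustration: scorers are modelled as plain functions from a document pointer to a score, they are combined by a weighted sum (just one possible way of combining scorers), and the names are hypothetical. It does not reproduce the QueryEngine API, and it ignores intervals and snippets.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntToDoubleFunction;

class RankAndSliceSketch {
    /** A document pointer with its combined score. */
    static final class Scored {
        final int doc;
        final double score;

        Scored(int doc, double score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * Scores each document with a weighted sum of the scorers, sorts by
     * decreasing score (ties broken by document pointer), and returns at most
     * length results starting from the given non-negative offset.
     */
    static List<Scored> topWindow(int[] docs, IntToDoubleFunction[] scorers,
            double[] weights, int offset, int length) {
        List<Scored> all = new ArrayList<>();
        for (int doc : docs) {
            double s = 0;
            for (int i = 0; i < scorers.length; i++) s += weights[i] * scorers[i].applyAsDouble(doc);
            all.add(new Scored(doc, s));
        }
        all.sort(Comparator.comparingDouble((Scored x) -> -x.score).thenComparingInt(x -> x.doc));
        int from = Math.min(offset, all.size());
        int to = Math.min(from + length, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    public static void main(String[] args) {
        // Two toy scorers: one prefers large pointers, one boosts document 1.
        IntToDoubleFunction[] scorers = { d -> d, d -> d == 1 ? 10 : 0 };
        for (Scored s : topWindow(new int[] {1, 2, 3}, scorers, new double[] {1, 1}, 0, 2))
            System.out.println(s.doc + " " + s.score);
    }
}
```

The offset and length arguments play the role of a result page: a web interface would typically ask for the first ten results, then the next ten, and so on.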
We suggest that you play with the command-line tool and the associated web interface to become familiar with the inner workings of the query engine.
MG4J requires Java ≥6 and relies on the DSI utilities and on two packages providing high-performance containers and algorithms, namely fastutil 6.4 or greater and Sux4J. Command-line parsing and support require JSAP. Factories and collections use pdfbox and a JavaMail implementation. The HTTP interface uses the Jetty 6 HTTP server, velocity, velocity-tools and the servlet APIs. MG4J also uses a number of useful libraries from the Jakarta commons project, including collections, lang, configuration and io. All logging is performed using log4j. Compiling MG4J requires javacc and jars from Tika (and related dependencies).