it.unimi.di.mg4j.util.parser.callback
Class AnchorExtractor

java.lang.Object
  extended by it.unimi.dsi.parser.callback.DefaultCallback
      extended by it.unimi.di.mg4j.util.parser.callback.AnchorExtractor
All Implemented Interfaces:
Callback

public class AnchorExtractor
extends DefaultCallback

A callback extracting anchor text. When instantiating the extractor, you can specify the maximum number of characters to be considered before the anchor, within the anchor, and after the anchor. Before the anchor, only the last characters are taken into consideration; within and after the anchor, only the first ones.

At the end of parsing, the result (the list of anchors) is available in anchors, whose elements provide the content of the href attribute and the text of the anchor together with the text around it. The text is modified, however, so that fragments of words at the beginning of the pre-anchor context, or at the end of the post-anchor context, are cut away.

For example, a fragment like: ...foo fOO FOO FOO ANCHOR TEXT BAR BAR BAr bar... (where the uppercase part represents the pre- and post-anchor context) generates the element Anchor("xxx", "FOO FOO ANCHOR TEXT BAR BAR"), where "xxx" stands for the content of the href attribute.
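The trimming rule described above can be illustrated with a small, self-contained sketch. This is not the code of AnchorExtractor: the class AnchorContextSketch and its methods trimBefore, trimAnchor, trimAfter and context are hypothetical names introduced only for illustration, and the sketch assumes that words are delimited by whitespace and that the three limits are character counts, as stated in the class description.

```java
// Illustrative sketch only: NOT the MG4J implementation of AnchorExtractor.
// All names below are hypothetical; words are assumed to be whitespace-delimited.
class AnchorContextSketch {

    /** Keeps at most maxBefore trailing characters of the pre-anchor text,
     *  dropping a leading word fragment if the cut fell inside a word. */
    static String trimBefore(String pre, int maxBefore) {
        if (maxBefore <= 0) return "";
        if (pre.length() <= maxBefore) return pre;
        int cut = pre.length() - maxBefore;
        String tail = pre.substring(cut);
        boolean midWord = !Character.isWhitespace(pre.charAt(cut - 1))
                && !Character.isWhitespace(tail.charAt(0));
        if (!midWord) return tail;
        for (int i = 0; i < tail.length(); i++)
            if (Character.isWhitespace(tail.charAt(i))) return tail.substring(i + 1);
        return ""; // the whole window was a single word fragment
    }

    /** Keeps at most maxAfter leading characters of the post-anchor text,
     *  dropping a trailing word fragment if the cut fell inside a word. */
    static String trimAfter(String post, int maxAfter) {
        if (maxAfter <= 0) return "";
        if (post.length() <= maxAfter) return post;
        String head = post.substring(0, maxAfter);
        boolean midWord = !Character.isWhitespace(post.charAt(maxAfter))
                && !Character.isWhitespace(head.charAt(maxAfter - 1));
        if (!midWord) return head;
        for (int i = head.length() - 1; i >= 0; i--)
            if (Character.isWhitespace(head.charAt(i))) return head.substring(0, i);
        return ""; // the whole window was a single word fragment
    }

    /** Keeps at most maxAnchor leading characters of the anchor text (no fragment trimming). */
    static String trimAnchor(String anchor, int maxAnchor) {
        return anchor.length() <= maxAnchor ? anchor : anchor.substring(0, Math.max(0, maxAnchor));
    }

    /** Concatenates the trimmed pre-anchor context, anchor text and post-anchor context. */
    static String context(String pre, String anchor, String post,
                          int maxBefore, int maxAnchor, int maxAfter) {
        return trimBefore(pre, maxBefore) + trimAnchor(anchor, maxAnchor) + trimAfter(post, maxAfter);
    }

    public static void main(String[] args) {
        // The fragment from the example above, with an 11-character window on each side.
        System.out.println(context("foo fOO FOO FOO ", "ANCHOR TEXT", " BAR BAR BAr bar", 11, 11, 11));
        // prints: FOO FOO ANCHOR TEXT BAR BAR
    }
}
```

With a window of 11 characters on each side, the raw windows are "OO FOO FOO " and " BAR BAR BA"; the fragments "OO" and "BA" are cut away, which reproduces the text of the element shown in the example.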


Nested Class Summary
static class AnchorExtractor.Anchor
          A class representing an anchor.
 
Field Summary
 ObjectList<AnchorExtractor.Anchor> anchors
          The resulting list of anchors.
static boolean DEBUG
           
static Logger LOGGER
           
 
Fields inherited from interface it.unimi.dsi.parser.callback.Callback
EMPTY_CALLBACK_ARRAY
 
Constructor Summary
AnchorExtractor(int maxBefore, int maxAnchor, int maxAfter)
           
 
Method Summary
 boolean characters(char[] characters, int offset, int length, boolean flowBroken)
           
 void configure(BulletParser parser)
           
 void endDocument()
           
 boolean endElement(Element element)
           
 void startDocument()
           
 boolean startElement(Element element, Map<Attribute,MutableString> attrMap)
           
 
Methods inherited from class it.unimi.dsi.parser.callback.DefaultCallback
cdata, getInstance
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOGGER

public static final Logger LOGGER

DEBUG

public static final boolean DEBUG
See Also:
Constant Field Values

anchors

public final ObjectList<AnchorExtractor.Anchor> anchors
The resulting list of anchors.

Constructor Detail

AnchorExtractor

public AnchorExtractor(int maxBefore,
                       int maxAnchor,
                       int maxAfter)
Parameters:
maxBefore - maximum number of characters to be considered before the anchor.
maxAnchor - maximum number of characters to be considered within the anchor.
maxAfter - maximum number of characters to be considered after the anchor.
Method Detail

configure

public void configure(BulletParser parser)
Specified by:
configure in interface Callback
Overrides:
configure in class DefaultCallback

startDocument

public void startDocument()
Specified by:
startDocument in interface Callback
Overrides:
startDocument in class DefaultCallback

endDocument

public void endDocument()
Specified by:
endDocument in interface Callback
Overrides:
endDocument in class DefaultCallback

startElement

public boolean startElement(Element element,
                            Map<Attribute,MutableString> attrMap)
Specified by:
startElement in interface Callback
Overrides:
startElement in class DefaultCallback

endElement

public boolean endElement(Element element)
Specified by:
endElement in interface Callback
Overrides:
endElement in class DefaultCallback

characters

public boolean characters(char[] characters,
                          int offset,
                          int length,
                          boolean flowBroken)
Specified by:
characters in interface Callback
Overrides:
characters in class DefaultCallback