it.unimi.di.mg4j.util.parser.callback
Class AnchorExtractor
java.lang.Object
it.unimi.dsi.parser.callback.DefaultCallback
it.unimi.di.mg4j.util.parser.callback.AnchorExtractor
- All Implemented Interfaces:
- Callback
public class AnchorExtractor
- extends DefaultCallback
A callback extracting anchor text. When instantiating the extractor, you can specify the number of characters to
be considered before the anchor, after the anchor, or inside the anchor (only the first characters are taken into
consideration in the last two cases, and only the last ones in the first case).
At the end of parsing, the result (the list of anchors) is available in anchors, whose
elements provide the content of the href attribute and
the text of the anchor and around the anchor; the text is, however, modified so that fragments of words at the beginning
of the pre-anchor context, or at the end of the post-anchor context, are cut away.
For example, a fragment like:
...foo fOO FOO FOO ANCHOR TEXT BAR BAR BAr bar...
(where the uppercase part represents the anchor together with its pre- and post-anchor context) generates the element
Anchor("xxx", "FOO FOO ANCHOR TEXT BAR BAR")
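The trimming rule described above can be illustrated with a small, self-contained sketch. This is not the AnchorExtractor implementation: the class AnchorContextSketch, the method context and its index-based parameters are hypothetical names introduced only for this example, and limits are assumed to be expressed in characters, as in the class description. The sketch takes the last maxBefore characters before the anchor, the first maxAnchor characters of the anchor and the first maxAfter characters after it, then drops any word fragment at the outer edges.

```java
/** Illustrative sketch only; this is NOT the AnchorExtractor implementation. */
final class AnchorContextSketch {

    /**
     * Returns the anchor text surrounded by its context, with word fragments
     * removed at the start of the pre-anchor context and at the end of the
     * post-anchor context. All limits are in characters (an assumption).
     *
     * @param anchorStart index of the first anchor character (inclusive)
     * @param anchorEnd   index just past the last anchor character (exclusive)
     */
    static String context(String text, int anchorStart, int anchorEnd,
                          int maxBefore, int maxAnchor, int maxAfter) {
        // Keep only the last maxBefore characters before the anchor.
        int beforeStart = Math.max(0, anchorStart - maxBefore);
        // If the cut fell inside a word, skip forward past that fragment.
        if (beforeStart > 0 && !Character.isWhitespace(text.charAt(beforeStart - 1))) {
            while (beforeStart < anchorStart
                    && !Character.isWhitespace(text.charAt(beforeStart))) {
                beforeStart++;
            }
        }

        // Keep only the first maxAfter characters after the anchor.
        int afterEnd = Math.min(text.length(), anchorEnd + maxAfter);
        // If the cut fell inside a word, move back before that fragment.
        if (afterEnd < text.length() && !Character.isWhitespace(text.charAt(afterEnd))) {
            while (afterEnd > anchorEnd
                    && !Character.isWhitespace(text.charAt(afterEnd - 1))) {
                afterEnd--;
            }
        }

        // Keep only the first maxAnchor characters of the anchor itself.
        int anchorCut = Math.min(anchorEnd, anchorStart + maxAnchor);

        return (text.substring(beforeStart, anchorStart)
                + text.substring(anchorStart, anchorCut)
                + text.substring(anchorEnd, afterEnd)).trim();
    }

    public static void main(String[] args) {
        String text = "foo fOO FOO FOO ANCHOR TEXT BAR BAR BAr bar";
        int start = text.indexOf("ANCHOR TEXT");
        int end = start + "ANCHOR TEXT".length();
        // 11 characters of context on each side reach into "fOO" and "BAr".
        System.out.println(context(text, start, end, 11, 100, 11));
        // prints "FOO FOO ANCHOR TEXT BAR BAR"
    }
}
```

With 11 characters of context on each side, the raw window starts inside "fOO" and ends inside "BAr"; both fragments are dropped, which reproduces the text of the element shown above.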
Constructor Summary
AnchorExtractor(int maxBefore,
int maxAnchor,
int maxAfter)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOGGER
public static final Logger LOGGER
DEBUG
public static final boolean DEBUG
- See Also:
- Constant Field Values
anchors
public final ObjectList<AnchorExtractor.Anchor> anchors
- The resulting list of anchors.
AnchorExtractor
public AnchorExtractor(int maxBefore,
int maxAnchor,
int maxAfter)
- Parameters:
maxBefore
- maximum number of words to be considered before the anchor.
maxAfter
- maximum number of words to be considered after the anchor.
configure
public void configure(BulletParser parser)
- Specified by:
configure
in interface Callback
- Overrides:
configure
in class DefaultCallback
startDocument
public void startDocument()
- Specified by:
startDocument
in interface Callback
- Overrides:
startDocument
in class DefaultCallback
endDocument
public void endDocument()
- Specified by:
endDocument
in interface Callback
- Overrides:
endDocument
in class DefaultCallback
startElement
public boolean startElement(Element element,
Map<Attribute,MutableString> attrMap)
- Specified by:
startElement
in interface Callback
- Overrides:
startElement
in class DefaultCallback
endElement
public boolean endElement(Element element)
- Specified by:
endElement
in interface Callback
- Overrides:
endElement
in class DefaultCallback
characters
public boolean characters(char[] characters,
int offset,
int length,
boolean flowBroken)
- Specified by:
characters
in interface Callback
- Overrides:
characters
in class DefaultCallback