Class AnchorExtractor
java.lang.Object
  it.unimi.dsi.parser.callback.DefaultCallback
    it.unimi.di.big.mg4j.util.parser.callback.AnchorExtractor
All Implemented Interfaces: Callback
public class AnchorExtractor extends DefaultCallback
A callback extracting anchor text. When instantiating the extractor, you can specify the maximum number of characters to be considered before the anchor, within the anchor, and after the anchor. For the anchor text and the post-anchor context, just the first characters are taken into consideration; for the pre-anchor context, just the last ones.
At the end of parsing, the result (the list of anchors) is available in anchors, whose elements provide the content of the href attribute and the text of the anchor together with the text around it. The text is, however, modified so that fragments of words at the beginning of the pre-anchor context, or at the end of the post-anchor context, are cut away.
For example, a fragment like:
...foo fOO FOO FOO ANCHOR TEXT BAR BAR BAr bar...
(where the uppercase part represents the pre- and post-anchor context) generates the element Anchor("xxx", "FOO FOO ANCHOR TEXT BAR BAR").
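The trimming rule described above can be illustrated with a small standalone sketch. It is not the MG4J implementation and does not use the library: the class AnchorContextSketch and its methods trimPre, trimPost and context are hypothetical names introduced only for illustration, and the sketch assumes that words are separated by whitespace, that the three limits are non-negative, and that the three parts are joined by single spaces.

```java
// Hypothetical sketch of the context-trimming rule; NOT the MG4J implementation.
class AnchorContextSketch {

    /** Keeps at most the last maxPre characters of the pre-anchor text,
     *  then drops a leading word fragment if the cut split a word. */
    static String trimPre(String pre, int maxPre) {
        if (pre.length() <= maxPre) return pre.trim();
        int cut = pre.length() - maxPre;
        String tail = pre.substring(cut);
        // The cut splits a word when both neighbouring characters are non-whitespace.
        boolean splitsWord = !tail.isEmpty()
                && !Character.isWhitespace(pre.charAt(cut - 1))
                && !Character.isWhitespace(tail.charAt(0));
        if (splitsWord) {
            int ws = 0;
            while (ws < tail.length() && !Character.isWhitespace(tail.charAt(ws))) ws++;
            tail = tail.substring(ws); // empty if the whole tail is one fragment
        }
        return tail.trim();
    }

    /** Keeps at most the first maxPost characters of the post-anchor text,
     *  then drops a trailing word fragment if the cut split a word. */
    static String trimPost(String post, int maxPost) {
        if (post.length() <= maxPost) return post.trim();
        String head = post.substring(0, maxPost);
        boolean splitsWord = !head.isEmpty()
                && !Character.isWhitespace(post.charAt(maxPost))
                && !Character.isWhitespace(head.charAt(head.length() - 1));
        if (splitsWord) {
            int ws = head.length();
            while (ws > 0 && !Character.isWhitespace(head.charAt(ws - 1))) ws--;
            head = head.substring(0, ws);
        }
        return head.trim();
    }

    /** Joins the trimmed pre-anchor context, the (possibly truncated) anchor
     *  text, and the trimmed post-anchor context with single spaces. */
    static String context(String pre, String anchor, String post,
                          int maxPre, int maxAnchor, int maxPost) {
        String a = anchor.length() > maxAnchor ? anchor.substring(0, maxAnchor) : anchor;
        StringBuilder sb = new StringBuilder();
        for (String part : new String[] { trimPre(pre, maxPre), a.trim(), trimPost(post, maxPost) }) {
            if (part.isEmpty()) continue;
            if (sb.length() > 0) sb.append(' ');
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Reproduces the fragment from the class description with an 11-character context.
        System.out.println(context("...foo fOO FOO FOO ", "ANCHOR TEXT", " BAR BAR BAr bar...", 11, 100, 11));
        // prints "FOO FOO ANCHOR TEXT BAR BAR"
    }
}
```

With an 11-character window on each side, the pre-anchor context starts inside "fOO" and the post-anchor context ends inside "BAr", so both fragments are discarded and only whole words remain.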
-
-
Nested Class Summary
Modifier and Type    Class                     Description
static class         AnchorExtractor.Anchor    A class representing an anchor.
-
Field Summary
Modifier and Type                     Field      Description
ObjectList<AnchorExtractor.Anchor>    anchors    The resulting list of anchors.
static boolean                        DEBUG
static Logger                         LOGGER
-
Fields inherited from interface it.unimi.dsi.parser.callback.Callback
EMPTY_CALLBACK_ARRAY
-
-
Constructor Summary
Constructor                                                                              Description
AnchorExtractor(int maxPreAnchor, int maxAnchor, int maxPostAnchor)                      Creates a new anchor extractor.
AnchorExtractor(int maxPreAnchor, int maxAnchor, int maxPostAnchor, String delimiter)    Creates a new anchor extractor.
-
Method Summary
Modifier and Type    Method
boolean              characters(char[] characters, int offset, int length, boolean flowBroken)
void                 configure(BulletParser parser)
void                 endDocument()
boolean              endElement(Element element)
void                 startDocument()
boolean              startElement(Element element, Map<Attribute,MutableString> attrMap)
-
Methods inherited from class it.unimi.dsi.parser.callback.DefaultCallback
cdata, getInstance
Field Detail
-
LOGGER
public static final Logger LOGGER
-
DEBUG
public static final boolean DEBUG
- See Also:
- Constant Field Values
-
anchors
public final ObjectList<AnchorExtractor.Anchor> anchors
The resulting list of anchors.
-
-
Constructor Detail
-
AnchorExtractor
public AnchorExtractor(int maxPreAnchor, int maxAnchor, int maxPostAnchor)
Creates a new anchor extractor.
Parameters:
maxPreAnchor - maximum number of characters before an anchor.
maxAnchor - maximum number of characters in an anchor.
maxPostAnchor - maximum number of characters after an anchor.
-
AnchorExtractor
public AnchorExtractor(int maxPreAnchor, int maxAnchor, int maxPostAnchor, String delimiter)
Creates a new anchor extractor.
Parameters:
maxPreAnchor - maximum number of characters before an anchor.
maxAnchor - maximum number of characters in an anchor.
maxPostAnchor - maximum number of characters after an anchor.
delimiter - a token that will be inserted to delimit the anchor text, or null for no delimiter.
-
-
Method Detail
-
configure
public void configure(BulletParser parser)
Specified by: configure in interface Callback
Overrides: configure in class DefaultCallback
-
startDocument
public void startDocument()
Specified by: startDocument in interface Callback
Overrides: startDocument in class DefaultCallback
-
endDocument
public void endDocument()
Specified by: endDocument in interface Callback
Overrides: endDocument in class DefaultCallback
-
startElement
public boolean startElement(Element element, Map<Attribute,MutableString> attrMap)
Specified by: startElement in interface Callback
Overrides: startElement in class DefaultCallback
-
endElement
public boolean endElement(Element element)
Specified by: endElement in interface Callback
Overrides: endElement in class DefaultCallback
-
characters
public boolean characters(char[] characters, int offset, int length, boolean flowBroken)
Specified by: characters in interface Callback
Overrides: characters in class DefaultCallback