public class SimpleParser extends BaseParser
Modifier and Type | Field and Description |
---|---|
protected BaseContentExtractor | _contentExtractor |
protected BaseLinkExtractor | _linkExtractor |
protected org.apache.tika.parser.ParseContext | _parseContext |
Constructor and Description |
---|
SimpleParser() |
SimpleParser(BaseContentExtractor contentExtractor, BaseLinkExtractor linkExtractor, ParserPolicy parserPolicy) |
SimpleParser(BaseContentExtractor contentExtractor, BaseLinkExtractor linkExtractor, ParserPolicy parserPolicy, boolean includeMarkup) |
SimpleParser(BaseContentExtractor contentExtractor, BaseLinkExtractor linkExtractor, ParserPolicy parserPolicy, org.apache.tika.parser.ParseContext parseContext) |
SimpleParser(ParserPolicy parserPolicy) |
SimpleParser(ParserPolicy parserPolicy, boolean includeMarkup) |
Modifier and Type | Method and Description |
---|---|
org.apache.tika.parser.Parser | getTikaParser() |
protected void | init() |
boolean | isExtractLanguage() |
ParsedDatum | parse(FetchedDatum fetchedDatum) |
void | setExtractLanguage(boolean extractLanguage) |
Methods inherited from class BaseParser: getCharset, getContentLocation, getLanguage, getParserPolicy
protected BaseContentExtractor _contentExtractor
protected BaseLinkExtractor _linkExtractor
protected org.apache.tika.parser.ParseContext _parseContext
public SimpleParser()
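The parameter notes further down describe contentExtractor and linkExtractor as being used "instead of new SimpleContentExtractor()" and "instead of new SimpleLinkExtractor()", which suggests that the shorter constructors fall back to those defaults. The following is a minimal sketch of that defaulting pattern only. Every type in it (ContentExtractor, LinkExtractor, Policy, Parser) is a hypothetical stand-in written for illustration, not the Bixo class of similar name, and the actual chaining inside SimpleParser is an assumption.

```java
public class DefaultingConstructorsSketch {

    // Hypothetical stand-ins; the real extractor and policy classes are not reproduced here.
    interface ContentExtractor { String name(); }
    interface LinkExtractor { String name(); }
    static class Policy { }

    static class Parser {
        final ContentExtractor contentExtractor;
        final LinkExtractor linkExtractor;
        final Policy policy;

        // No-argument form: assumed to use a default policy.
        Parser() {
            this(new Policy());
        }

        // Policy-only form: assumed to supply default extractors.
        Parser(Policy policy) {
            this(() -> "default-content", () -> "default-links", policy);
        }

        // Full form: every collaborator is supplied by the caller.
        Parser(ContentExtractor contentExtractor, LinkExtractor linkExtractor, Policy policy) {
            this.contentExtractor = contentExtractor;
            this.linkExtractor = linkExtractor;
            this.policy = policy;
        }
    }

    // Reports which extractors a parser ended up with.
    static String describe(Parser parser) {
        return parser.contentExtractor.name() + "/" + parser.linkExtractor.name();
    }

    public static void main(String[] args) {
        System.out.println(describe(new Parser()));
        System.out.println(describe(
            new Parser(() -> "custom-content", () -> "custom-links", new Policy())));
    }
}
```

In this sketch the shorter forms delegate to the longest one, so a caller who only needs defaults never has to name an extractor.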
public SimpleParser(ParserPolicy parserPolicy)
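The notes on the other constructors point to this policy-only form as the way to control which link tags and attributes are processed: configure the policy with setLinkTags and setLinkAttributeTypes, then pass it in. The sketch below illustrates that idea in a self-contained way. Its ParserPolicy and PolicyOnlyParser classes are hypothetical stand-ins, not the Bixo classes. Only the two setter names and their Set<String> parameters come from this page; the getters, the isLink helper, and the case-insensitive matching are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PolicyDrivenLinksSketch {

    // Hypothetical stand-in for a parser policy. The setters mirror the names
    // on this page; the getters are assumed for the sketch.
    static class ParserPolicy {
        private Set<String> linkTags = new HashSet<>();
        private Set<String> linkAttributeTypes = new HashSet<>();

        void setLinkTags(Set<String> linkTags) { this.linkTags = linkTags; }
        void setLinkAttributeTypes(Set<String> types) { this.linkAttributeTypes = types; }
        Set<String> getLinkTags() { return linkTags; }
        Set<String> getLinkAttributeTypes() { return linkAttributeTypes; }
    }

    // Hypothetical stand-in for a parser constructed from a policy alone.
    static class PolicyOnlyParser {
        private final ParserPolicy policy;

        PolicyOnlyParser(ParserPolicy policy) { this.policy = policy; }

        // A (tag, attribute) pair counts as a link only if the policy lists both.
        boolean isLink(String tag, String attribute) {
            return policy.getLinkTags().contains(tag.toLowerCase(Locale.ROOT))
                && policy.getLinkAttributeTypes().contains(attribute.toLowerCase(Locale.ROOT));
        }
    }

    // Configures a policy, builds the parser from it, and filters sample pairs.
    static List<String> acceptedLinks() {
        ParserPolicy policy = new ParserPolicy();
        policy.setLinkTags(new HashSet<>(Arrays.asList("a", "img")));
        policy.setLinkAttributeTypes(new HashSet<>(Arrays.asList("href", "src")));
        PolicyOnlyParser parser = new PolicyOnlyParser(policy);

        String[][] candidates = {
            {"a", "href"}, {"IMG", "src"}, {"script", "src"}, {"a", "title"}
        };
        List<String> accepted = new ArrayList<>();
        for (String[] candidate : candidates) {
            if (parser.isLink(candidate[0], candidate[1])) {
                accepted.add(candidate[0] + "@" + candidate[1]);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(acceptedLinks());
    }
}
```

The point of the design is that the set of tags and attributes lives in the policy object, so no custom link extractor has to be written just to change it.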
public SimpleParser(BaseContentExtractor contentExtractor, BaseLinkExtractor linkExtractor, ParserPolicy parserPolicy)
Parameters:
contentExtractor - to use instead of new SimpleContentExtractor()
linkExtractor - to use instead of new SimpleLinkExtractor()
parserPolicy - to customize operation of the parser
Note: there is no need to construct your own SimpleLinkExtractor simply to control the set of link tags and attributes it processes. Instead, use ParserPolicy.setLinkTags(java.util.Set<java.lang.String>) and ParserPolicy.setLinkAttributeTypes(java.util.Set<java.lang.String>), and then pass this policy to SimpleParser(ParserPolicy).
public SimpleParser(ParserPolicy parserPolicy, boolean includeMarkup)
Parameters:
parserPolicy - to customize operation of the parser
includeMarkup - true if output should be raw HTML, versus extracted text
Note: there is no need to construct your own SimpleLinkExtractor simply to control the set of link tags and attributes it processes. Instead, use ParserPolicy.setLinkTags(java.util.Set<java.lang.String>) and ParserPolicy.setLinkAttributeTypes(java.util.Set<java.lang.String>), and then pass this policy to SimpleParser(ParserPolicy).
public SimpleParser(BaseContentExtractor contentExtractor, BaseLinkExtractor linkExtractor, ParserPolicy parserPolicy, boolean includeMarkup)
Parameters:
parserPolicy - to customize operation of the parser
includeMarkup - true if output should be raw HTML, versus extracted text
Note: there is no need to construct your own SimpleLinkExtractor simply to control the set of link tags and attributes it processes. Instead, use ParserPolicy.setLinkTags(java.util.Set<java.lang.String>) and ParserPolicy.setLinkAttributeTypes(java.util.Set<java.lang.String>), and then pass this policy to SimpleParser(ParserPolicy).
public SimpleParser(BaseContentExtractor contentExtractor, BaseLinkExtractor linkExtractor, ParserPolicy parserPolicy, org.apache.tika.parser.ParseContext parseContext)
Parameters:
contentExtractor - to use instead of new SimpleContentExtractor()
linkExtractor - to use instead of new SimpleLinkExtractor()
parserPolicy - to customize operation of the parser
parseContext - used to pass context info to the parser
Note: there is no need to construct your own SimpleLinkExtractor simply to control the set of link tags and attributes it processes. Instead, use ParserPolicy.setLinkTags(java.util.Set<java.lang.String>) and ParserPolicy.setLinkAttributeTypes(java.util.Set<java.lang.String>), and then pass this policy to SimpleParser(ParserPolicy).
protected void init()
public org.apache.tika.parser.Parser getTikaParser()
public void setExtractLanguage(boolean extractLanguage)
public boolean isExtractLanguage()
public ParsedDatum parse(FetchedDatum fetchedDatum) throws java.lang.Exception
parse in class BaseParser
Throws: java.lang.Exception
Copyright © 2012 Bixo Labs