Class PdfDocumentFactory
-
- All Implemented Interfaces:
DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable
public class PdfDocumentFactory extends AbstractSimpleTikaDocumentFactory
A document factory for the PDF format. The metadata fields that this factory will attempt to parse are Metadata.TITLE, MSOffice.AUTHOR, Metadata.CREATOR, MSOffice.KEYWORDS, Metadata.SUBJECT, producer, created, trapped, and HttpHeaders.LAST_MODIFIED.
- Author:
- Salvatore Insalaco
- See Also:
- Serialized Form
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
PropertyBasedDocumentFactory.MetadataKeys
-
Nested classes/interfaces inherited from interface it.unimi.di.big.mg4j.document.DocumentFactory
DocumentFactory.FieldType
-
-
Field Summary
-
Fields inherited from class it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
defaultMetadata
-
-
Constructor Summary
Constructors
PdfDocumentFactory()
PdfDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
PdfDocumentFactory(Properties properties)
PdfDocumentFactory(String[] property)
-
Method Summary
Modifier and Type Method Description
protected org.apache.tika.parser.Parser getParser()
The parser used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.
protected List<TikaField> metadataFields()
The list of Tika fields (apart from content) that this factory provides; the default implementation returns the empty list, so most subclasses will want to override this method.
-
Methods inherited from class it.unimi.di.big.mg4j.document.tika.AbstractSimpleTikaDocumentFactory
copy, fields, getDocument, parseProperty
-
Methods inherited from class it.unimi.di.big.mg4j.document.tika.AbstractTikaDocumentFactory
fieldIndex, fieldName, fieldType, numberOfFields
-
Methods inherited from class it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey, toString
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentFactory
ensureFieldIndex
-
Constructor Detail
-
PdfDocumentFactory
public PdfDocumentFactory()
-
PdfDocumentFactory
public PdfDocumentFactory(Properties properties) throws org.apache.commons.configuration.ConfigurationException
- Throws:
org.apache.commons.configuration.ConfigurationException
-
PdfDocumentFactory
public PdfDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
-
PdfDocumentFactory
public PdfDocumentFactory(String[] property) throws org.apache.commons.configuration.ConfigurationException
- Throws:
org.apache.commons.configuration.ConfigurationException
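The String[] constructor is convenient when options arrive as plain strings, for example from a command line. The sketch below assumes each string has the form key=value; that form is an assumption, not something this page states. It uses only the standard library and shows how such an array could be split into key/value pairs. The class PropertyStringsSketch, the method parse and the keys "encoding" and "note" are hypothetical names chosen for illustration; this is not the parsing code used by the factory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of MG4J: splits "key=value" strings into a map.
final class PropertyStringsSketch {

    // Assumes every entry has the form key=value. Only the first '=' separates
    // key from value, so a value may itself contain '='.
    static Map<String, String> parse(String[] property) {
        Map<String, String> parsed = new LinkedHashMap<>();
        for (String entry : property) {
            int eq = entry.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value, got: " + entry);
            }
            parsed.put(entry.substring(0, eq).trim(), entry.substring(eq + 1).trim());
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Illustrative keys only; they are not a list of keys the factory accepts.
        Map<String, String> parsed = parse(new String[] { "encoding=UTF-8", "note=a=b" });
        System.out.println(parsed);
    }
}
```

With the real library on the classpath, the equivalent call would presumably pass the same kind of array straight to the PdfDocumentFactory(String[] property) constructor and handle the declared ConfigurationException.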
-
-
Method Detail
-
getParser
protected org.apache.tika.parser.Parser getParser()
Description copied from class: AbstractSimpleTikaDocumentFactory
The parser used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.
- Specified by:
getParser in class AbstractSimpleTikaDocumentFactory
- Returns:
- the parser used to parse this kind of document.
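The "always return the same instance" contract can be met by keeping the parser in a static final field. The following is a minimal sketch of that pattern with stand-in types, so that it compiles without Tika or MG4J on the classpath: Parser here is a hypothetical one-method interface, not org.apache.tika.parser.Parser, and BaseFactory and PdfFactory are illustrative names, not the library classes.

```java
// Self-contained sketch of the shared-parser pattern; all types are stand-ins.
final class SharedParserSketch {

    // Stand-in for a Tika parser; assumed immutable and thread-safe.
    interface Parser {
        String format();
    }

    // Mirrors the abstract getParser() contract described above.
    abstract static class BaseFactory {
        protected abstract Parser getParser();
    }

    static final class PdfFactory extends BaseFactory {
        // Created once and shared by every factory instance and every thread.
        // Sharing is safe only because the parser holds no mutable state.
        private static final Parser PARSER = () -> "pdf";

        @Override
        protected Parser getParser() {
            return PARSER; // never allocate a new parser per call
        }
    }

    public static void main(String[] args) {
        PdfFactory a = new PdfFactory();
        PdfFactory b = new PdfFactory();
        // Both factories hand out the very same parser object.
        System.out.println(a.getParser() == b.getParser());
    }
}
```

Sharing one instance avoids repeated parser construction when many factory copies are in use, for example one copy per thread.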
-
metadataFields
protected List<TikaField> metadataFields()
Description copied from class: AbstractSimpleTikaDocumentFactory
The list of Tika fields (apart from content) that this factory provides; the implementation in AbstractSimpleTikaDocumentFactory returns the empty list, so most subclasses will want to override this method.
- Overrides:
metadataFields in class AbstractSimpleTikaDocumentFactory
- Returns:
- the list of Tika fields that this factory provides.
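The override relationship described above can be sketched as follows: a base class whose metadataFields() returns the empty list, and a format-specific subclass that returns its own fixed list. All types are stand-ins so the example compiles on its own: Field is a hypothetical replacement for TikaField, and the field names are illustrative labels following the order of the class description, not the actual values of the Tika constants.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Self-contained sketch of overriding metadataFields(); all types are stand-ins.
final class MetadataFieldsSketch {

    // Hypothetical stand-in for TikaField: just a field name.
    static final class Field {
        final String name;

        Field(String name) {
            this.name = name;
        }
    }

    static class BaseFactory {
        // Default: no metadata fields beyond the document content.
        protected List<Field> metadataFields() {
            return Collections.emptyList();
        }
    }

    static final class PdfFactory extends BaseFactory {
        // Built once; unmodifiable so that callers cannot alter the shared list.
        private static final List<Field> FIELDS = Collections.unmodifiableList(Arrays.asList(
                new Field("title"), new Field("author"), new Field("creator"),
                new Field("keywords"), new Field("subject"), new Field("producer"),
                new Field("created"), new Field("trapped"), new Field("last-modified")));

        @Override
        protected List<Field> metadataFields() {
            return FIELDS;
        }
    }

    public static void main(String[] args) {
        for (Field field : new PdfFactory().metadataFields()) {
            System.out.println(field.name);
        }
    }
}
```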