Interface DocumentFactory
-
- All Superinterfaces:
FlyweightPrototype<DocumentFactory>, Serializable
- All Known Implementing Classes:
AbstractDocumentFactory, AbstractSimpleTikaDocumentFactory, AbstractTikaDocumentFactory, AutoDetectDocumentFactory, CompositeDocumentFactory, DispatchingDocumentFactory, EPUBDocumentFactory, HtmlDocumentFactory, HtmlDocumentFactory, IdentityDocumentFactory, MSOfficeDocumentFactory, OOXMLDocumentFactory, OpenDocumentDocumentFactory, PdfDocumentFactory, PropertyBasedDocumentFactory, ReplicatedDocumentFactory, RTFDocumentFactory, SubDocumentFactory, TextDocumentFactory, TRECHeaderDocumentFactory, WikipediaDocumentSequence.WikipediaHeaderFactory, XMLDocumentFactory, ZipDocumentCollection.ZipFactory
public interface DocumentFactory extends Serializable, FlyweightPrototype<DocumentFactory>
A factory parsing and building documents of the same type.

Each document produced by the same factory has a number of fields, which represent units of information that should be indexed separately. The number of available fields can be obtained by calling numberOfFields(), their types by calling fieldType(int), and their symbolic names by calling fieldName(int).

Factories contain the parsing and document-level breaking logic. For instance, a factory for HTML documents might extract the text into a title and a body, and expose them as DocumentFactory.FieldType.TEXT fields. Additionally, the last modification date might be exposed as a DocumentFactory.FieldType.DATE field, and so on.

Warning: implementations of this interface are not required to be thread-safe, but they provide flyweight copies. The copy() method is strengthened so as to return an instance of this interface.
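The field-inspection methods described above can be exercised with a simple loop. The following is a minimal sketch, not library code: `FieldDescriber` is a hypothetical stand-in that mirrors only the three field-related methods of this interface, its `FieldType` enum lists only the two types mentioned in the text, and the field names `title`, `body` and `lastModified` are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class FieldListing {
    // Builds one "index: name (TYPE)" line per field exposed by the factory.
    static List<String> describe(FieldDescriber factory) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < factory.numberOfFields(); i++) {
            lines.add(i + ": " + factory.fieldName(i) + " (" + factory.fieldType(i) + ")");
        }
        return lines;
    }

    public static void main(String[] args) {
        describe(new HtmlLikeFields()).forEach(System.out::println);
    }
}

// Hypothetical stand-in for the field-related part of DocumentFactory,
// so that the sketch compiles without the library on the classpath.
interface FieldDescriber {
    enum FieldType { TEXT, DATE } // assumption: only the types named in the text

    int numberOfFields();

    String fieldName(int field);

    FieldType fieldType(int field);
}

// Toy description of an HTML-like factory: two text fields and one date field.
class HtmlLikeFields implements FieldDescriber {
    private static final String[] NAMES = { "title", "body", "lastModified" };
    private static final FieldType[] TYPES = { FieldType.TEXT, FieldType.TEXT, FieldType.DATE };

    @Override
    public int numberOfFields() { return NAMES.length; }

    @Override
    public String fieldName(int field) { return NAMES[field]; }

    @Override
    public FieldType fieldType(int field) { return TYPES[field]; }
}
```

With a real factory, the same loop would call the methods on the `DocumentFactory` instance directly.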
-
-
Nested Class Summary
Nested Classes
Modifier and Type Interface Description
static class DocumentFactory.FieldType
A field type.
-
Method Summary
Modifier and Type Method Description
DocumentFactory copy()
int fieldIndex(String fieldName)
Returns the index of a field, given its symbolic name.
String fieldName(int field)
Returns the symbolic name of a field.
DocumentFactory.FieldType fieldType(int field)
Returns the type of a field.
Document getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata)
Returns the document obtained by parsing the given byte stream.
int numberOfFields()
Returns the number of fields present in the documents produced by this factory.
-
-
-
Method Detail
-
numberOfFields
int numberOfFields()
Returns the number of fields present in the documents produced by this factory.
- Returns:
the number of fields present in the documents produced by this factory.
-
fieldName
String fieldName(int field)
Returns the symbolic name of a field.
- Parameters:
field - the index of a field (between 0 inclusive and numberOfFields() exclusive).
- Returns:
the symbolic name of the field-th field.
-
fieldIndex
int fieldIndex(String fieldName)
Returns the index of a field, given its symbolic name.
- Parameters:
fieldName - the name of a field of this factory.
- Returns:
the corresponding index, or -1 if there is no field with name fieldName.
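One way to honor the contract above (a valid index for known names, -1 otherwise) is a reverse lookup table built once from the field names. This is only a sketch under stated assumptions: `FieldNames` is a hypothetical helper, not a class of the library, and the names `title`, `body` and `author` are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

class FieldIndexDemo {
    public static void main(String[] args) {
        FieldNames names = new FieldNames("title", "body");
        System.out.println(names.fieldIndex("body"));   // a known field
        System.out.println(names.fieldIndex("author")); // an unknown field
    }
}

// Hypothetical helper (not part of the library): stores field names
// together with a name-to-index map for constant-time lookup.
class FieldNames {
    private final String[] names;
    private final Map<String, Integer> indexOf = new HashMap<>();

    FieldNames(String... names) {
        this.names = names.clone();
        for (int i = 0; i < this.names.length; i++) {
            indexOf.put(this.names[i], i);
        }
    }

    int numberOfFields() { return names.length; }

    String fieldName(int field) { return names[field]; }

    // Returns the index of the named field, or -1 if no such field exists.
    int fieldIndex(String fieldName) {
        Integer index = indexOf.get(fieldName);
        return index == null ? -1 : index;
    }
}
```

For every valid index `i`, `fieldIndex(fieldName(i))` returns `i` as long as field names are distinct, which this sketch assumes.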
-
fieldType
DocumentFactory.FieldType fieldType(int field)
Returns the type of a field.
The possible types are defined in DocumentFactory.FieldType.
- Parameters:
field - the index of a field (between 0 inclusive and numberOfFields() exclusive).
- Returns:
the type of the field-th field.
-
getDocument
Document getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata) throws IOException
Returns the document obtained by parsing the given byte stream.
The parameter metadata makes up for the lack of a simple keyword-based parameter-passing system in Java. This method might take several different types of “suggestions” which have been gathered by the collection: typically, the document title, a URI representing the document, its MIME type, its encoding and so on. Some of this information might be set by default (as happens, for instance, in a PropertyBasedDocumentFactory). Implementations of this method must consult the metadata provided by the collection, possibly complete them with default factory metadata, and proceed to the document construction.
- Parameters:
rawContent - the raw content from which the document should be extracted; it must not be closed, as resource management is a responsibility of the DocumentCollection.
metadata - a map from enums (e.g., keys taken from PropertyBasedDocumentFactory) to various kinds of objects.
- Returns:
the document obtained by parsing the given byte stream.
- Throws:
IOException
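The metadata handling described above (consult what the collection provides, fall back to factory defaults, leave the stream open) can be sketched as follows. Everything here is an assumption made for illustration: the `Key` enum and its constants are hypothetical, `java.util.IdentityHashMap` stands in for `Reference2ObjectMap` so that the example needs no external library, and the "document" is reduced to a plain string.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.IdentityHashMap;
import java.util.Map;

class MetadataDemo {
    // Hypothetical metadata keys; a real factory would define its own.
    enum Key { TITLE, ENCODING }

    // Factory-level defaults, used when the collection supplies no value.
    static final Map<Enum<?>, Object> DEFAULTS = new IdentityHashMap<>();
    static {
        DEFAULTS.put(Key.TITLE, "(untitled)");
        DEFAULTS.put(Key.ENCODING, "UTF-8");
    }

    // Prefers the collection-provided value and falls back to the default.
    static Object resolve(Map<Enum<?>, Object> metadata, Enum<?> key) {
        Object value = metadata.get(key);
        return value != null ? value : DEFAULTS.get(key);
    }

    // Reads the whole stream but deliberately does not close it:
    // the caller (the collection, in the real API) owns the stream.
    static String parse(InputStream rawContent, Map<Enum<?>, Object> metadata) throws IOException {
        Charset charset = Charset.forName((String) resolve(metadata, Key.ENCODING));
        String body = new String(rawContent.readAllBytes(), charset);
        return resolve(metadata, Key.TITLE) + ": " + body;
    }

    public static void main(String[] args) throws IOException {
        Map<Enum<?>, Object> metadata = new IdentityHashMap<>();
        metadata.put(Key.TITLE, "Greeting");
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(parse(in, metadata));
    }
}
```

Keying the map by enum constants gives each option a typed, collision-free name, which is the role keyword arguments would play in a language that has them.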
-
copy
DocumentFactory copy()
- Specified by:
copy in interface FlyweightPrototype<DocumentFactory>
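Because implementations are not required to be thread-safe, a typical pattern is to hand each thread its own copy. The sketch below illustrates the idea with hypothetical types: `Copyable` stands in for `FlyweightPrototype`, and `CountingFactory` is a toy class whose copies share immutable configuration while keeping mutable state separate. It is not how any particular implementation in the library works.

```java
class CopyDemo {
    public static void main(String[] args) {
        CountingFactory original = new CountingFactory("UTF-8");
        CountingFactory copy = original.copy(); // e.g., one copy per worker thread
        original.recordDocument();
        System.out.println(original.documentsSeen() + " " + copy.documentsSeen());
    }
}

// Hypothetical stand-in for FlyweightPrototype<T>.
interface Copyable<T> {
    T copy();
}

// Toy factory: the encoding is immutable and can be shared freely,
// while the counter is mutable and therefore private to each copy.
class CountingFactory implements Copyable<CountingFactory> {
    private final String encoding;
    private int documentsSeen;

    CountingFactory(String encoding) { this.encoding = encoding; }

    // The type parameter makes copy() return the factory type, so no cast is needed.
    @Override
    public CountingFactory copy() { return new CountingFactory(encoding); }

    void recordDocument() { documentsSeen++; }

    int documentsSeen() { return documentsSeen; }

    String encoding() { return encoding; }
}
```

Each copy starts with fresh mutable state, so work done through one instance does not affect the others.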
-
-