it.unimi.di.mg4j.document
Class IdentityDocumentFactory

java.lang.Object
  extended by it.unimi.di.mg4j.document.AbstractDocumentFactory
      extended by it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
          extended by it.unimi.di.mg4j.document.IdentityDocumentFactory
All Implemented Interfaces:
DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable

public class IdentityDocumentFactory
extends PropertyBasedDocumentFactory

A factory that provides a single field containing just the raw input stream; the encoding is set using the property PropertyBasedDocumentFactory.MetadataKeys.ENCODING. The field is named text, but you can change the name using the property fieldname.

By default, the WordReader provided by this factory is just a FastBufferedReader, but you can specify an alternative word reader using the property PropertyBasedDocumentFactory.MetadataKeys.WORDREADER. For instance, if you need to index a list of identifiers to retrieve documents from the collection more easily, you can use a LineWordReader to index each line of a file as a whole.
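
To make the effect of a line-oriented word reader concrete, here is a minimal sketch that uses only the JDK. It is not MG4J code: the class LineTokenSketch and the method lineTokens are hypothetical names, and skipping empty lines is a choice of this sketch, not documented behaviour of LineWordReader. It decodes a raw byte stream with a given encoding and treats every line as one indivisible token, which is what you want when each line holds an identifier.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, NOT part of MG4J: illustrates "one line = one token".
class LineTokenSketch {
    static List<String> lineTokens(InputStream rawContent, Charset encoding) {
        List<String> tokens = new ArrayList<>();
        // The caller owns rawContent, so the stream is deliberately not closed here.
        BufferedReader reader = new BufferedReader(new InputStreamReader(rawContent, encoding));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption of this sketch: empty lines produce no token.
                if (!line.isEmpty()) tokens.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        byte[] bytes = "doc-0001\nreport 2009 final\n".getBytes(StandardCharsets.UTF_8);
        // Each line survives as a single token, inner spaces included.
        System.out.println(lineTokens(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
    }
}
```

With a whitespace-based reader the second line would split into three words; here it stays one searchable unit.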

See Also:
Serialized Form

Nested Class Summary
static class IdentityDocumentFactory.MetadataKeys
          Case-insensitive keys for metadata.
 
Nested classes/interfaces inherited from interface it.unimi.di.mg4j.document.DocumentFactory
DocumentFactory.FieldType
 
Field Summary
 
Fields inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
defaultMetadata
 
Constructor Summary
IdentityDocumentFactory()
           
IdentityDocumentFactory(Properties properties)
           
IdentityDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
           
IdentityDocumentFactory(String[] property)
           
 
Method Summary
 IdentityDocumentFactory copy()
           
 int fieldIndex(String fieldName)
          Returns the index of a field, given its symbolic name.
 String fieldName(int field)
          Returns the symbolic name of a field.
 DocumentFactory.FieldType fieldType(int field)
          Returns the type of a field.
 Document getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata)
          Returns the document obtained by parsing the given byte stream.
 int numberOfFields()
          Returns the number of fields present in the documents produced by this factory.
protected  boolean parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata)
          Parses a property with given key and value, adding it to the given map.
 
Methods inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentFactory
ensureFieldIndex, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

IdentityDocumentFactory

public IdentityDocumentFactory()

IdentityDocumentFactory

public IdentityDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)

IdentityDocumentFactory

public IdentityDocumentFactory(Properties properties)
                        throws ConfigurationException
Throws:
ConfigurationException

IdentityDocumentFactory

public IdentityDocumentFactory(String[] property)
                        throws ConfigurationException
Throws:
ConfigurationException
Method Detail

parseProperty

protected boolean parseProperty(String key,
                                String[] values,
                                Reference2ObjectMap<Enum<?>,Object> metadata)
                         throws ConfigurationException
Description copied from class: PropertyBasedDocumentFactory
Parses a property with given key and value, adding it to the given map.

Currently this implementation just parses the PropertyBasedDocumentFactory.MetadataKeys.LOCALE property.

Subclasses should do their own parsing, returning true in case of success and returning super.parseProperty() otherwise.

Overrides:
parseProperty in class PropertyBasedDocumentFactory
Parameters:
key - the property key.
values - the property value; this is an array, because properties may have a list of comma-separated values.
metadata - the metadata map.
Returns:
true if the property was parsed correctly, false if it was ignored.
Throws:
ConfigurationException
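
The delegation rule described for parseProperty (handle your own keys, otherwise defer to the superclass) can be sketched with plain JDK types. Everything below is a hypothetical model, not the MG4J classes: BaseFactorySketch, SingleFieldFactorySketch, and ParsePropertySketch are invented names, a java.util.Map stands in for the metadata map, and string keys stand in for the enum keys used by the real factories.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical base class: models a parent that understands only "locale".
class BaseFactorySketch {
    protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
        if ("locale".equalsIgnoreCase(key)) { // keys are matched case-insensitively
            metadata.put("locale", Locale.forLanguageTag(values[0]));
            return true;
        }
        return false; // unknown key: ignored
    }
}

// Hypothetical subclass: parses its own key first, then delegates.
class SingleFieldFactorySketch extends BaseFactorySketch {
    @Override
    protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
        if ("fieldname".equalsIgnoreCase(key)) {
            metadata.put("fieldname", values[0]);
            return true;
        }
        return super.parseProperty(key, values, metadata);
    }
}

class ParsePropertySketch {
    // Parses "key=value[,value...]" strings; properties nobody recognises are skipped.
    static Map<String, Object> parseAll(String... properties) {
        SingleFieldFactorySketch factory = new SingleFieldFactorySketch();
        Map<String, Object> metadata = new HashMap<>();
        for (String property : properties) {
            String[] keyValue = property.split("=", 2);
            factory.parseProperty(keyValue[0], keyValue[1].split(","), metadata);
        }
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(parseAll("fieldname=id", "LOCALE=en-US", "bogus=1"));
    }
}
```

Because each level returns false only after its parent has also declined, a caller can tell from a single boolean whether any class in the hierarchy accepted the property.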

copy

public IdentityDocumentFactory copy()

numberOfFields

public int numberOfFields()
Description copied from interface: DocumentFactory
Returns the number of fields present in the documents produced by this factory.

Returns:
the number of fields present in the documents produced by this factory.

fieldName

public String fieldName(int field)
Description copied from interface: DocumentFactory
Returns the symbolic name of a field.

Parameters:
field - the index of a field (between 0 inclusive and DocumentFactory.numberOfFields() exclusive).
Returns:
the symbolic name of the field-th field.

fieldIndex

public int fieldIndex(String fieldName)
Description copied from interface: DocumentFactory
Returns the index of a field, given its symbolic name.

Parameters:
fieldName - the name of a field of this factory.
Returns:
the corresponding index, or -1 if there is no field with name fieldName.

fieldType

public DocumentFactory.FieldType fieldType(int field)
Description copied from interface: DocumentFactory
Returns the type of a field.

The possible types are defined in DocumentFactory.FieldType.

Parameters:
field - the index of a field (between 0 inclusive and DocumentFactory.numberOfFields() exclusive).
Returns:
the type of the field-th field.
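
For a factory that exposes exactly one field, the contracts of numberOfFields, fieldName, fieldIndex, and fieldType reduce to a few lines. The sketch below is a hypothetical JDK-only model, not the MG4J implementation: the class SingleFieldSketch, its local FieldType enum, the choice of TEXT as the field type, and the exception thrown for an out-of-range index are all assumptions made for illustration.

```java
// Hypothetical stand-in for a one-field factory; NOT the MG4J class.
class SingleFieldSketch {
    enum FieldType { TEXT } // assumption: the single field is textual

    private final String name;

    SingleFieldSketch(String name) {
        this.name = name;
    }

    int numberOfFields() {
        return 1;
    }

    String fieldName(int field) {
        ensureFieldIndex(field);
        return name;
    }

    // Returns -1 for an unknown name, as the fieldIndex contract requires.
    int fieldIndex(String fieldName) {
        return name.equals(fieldName) ? 0 : -1;
    }

    FieldType fieldType(int field) {
        ensureFieldIndex(field);
        return FieldType.TEXT;
    }

    // Assumption: an invalid index is reported with IndexOutOfBoundsException.
    private void ensureFieldIndex(int field) {
        if (field < 0 || field >= numberOfFields()) {
            throw new IndexOutOfBoundsException("No such field: " + field);
        }
    }

    public static void main(String[] args) {
        SingleFieldSketch factory = new SingleFieldSketch("text");
        System.out.println(factory.fieldName(0) + " " + factory.fieldIndex("title"));
    }
}
```

Note that fieldIndex signals an unknown name with a return value, while the index-based methods treat an out-of-range argument as a caller error.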

getDocument

public Document getDocument(InputStream rawContent,
                            Reference2ObjectMap<Enum<?>,Object> metadata)
Description copied from interface: DocumentFactory
Returns the document obtained by parsing the given byte stream.

The metadata parameter compensates for the lack of a simple keyword-based parameter-passing system in Java. Through it, this method can receive several different types of “suggestions” gathered by the collection: typically, the document title, a URI representing the document, its MIME type, its encoding and so on. Some of this information might be set by default (as happens, for instance, in a PropertyBasedDocumentFactory). Implementations of this method must consult the metadata provided by the collection, possibly complete it with default factory metadata, and then proceed to construct the document.

Parameters:
rawContent - the raw content from which the document should be extracted; it must not be closed, as resource management is a responsibility of the DocumentCollection.
metadata - a map from enums (e.g., keys taken in PropertyBasedDocumentFactory) to various kinds of objects.
Returns:
the document obtained by parsing the given byte stream.
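
The lookup order described for getDocument (metadata supplied by the collection first, factory defaults second) can be modelled as follows. This is a hedged, JDK-only sketch: MetadataResolveSketch, its Keys enum, and both methods are hypothetical names, and java.util.IdentityHashMap stands in for a reference-keyed map such as Reference2ObjectMap.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical model of "per-document metadata overrides factory defaults".
class MetadataResolveSketch {
    enum Keys { ENCODING, TITLE } // illustrative keys, not the MG4J enum

    // Returns the per-document value if present, else the default, else null.
    static Object resolve(Enum<?> key, Map<Enum<?>, Object> metadata, Map<Enum<?>, Object> defaults) {
        Object value = metadata.get(key);
        return value != null ? value : defaults.get(key);
    }

    // Same lookup, but a missing value is treated as an error.
    static Object resolveNotNull(Enum<?> key, Map<Enum<?>, Object> metadata, Map<Enum<?>, Object> defaults) {
        Object value = resolve(key, metadata, defaults);
        if (value == null) throw new NoSuchElementException("No value for " + key);
        return value;
    }

    public static void main(String[] args) {
        Map<Enum<?>, Object> defaults = new IdentityHashMap<>();
        defaults.put(Keys.ENCODING, "UTF-8");

        Map<Enum<?>, Object> metadata = new IdentityHashMap<>();
        metadata.put(Keys.TITLE, "A document");

        // ENCODING is absent from the document metadata, so the default applies.
        System.out.println(resolve(Keys.ENCODING, metadata, defaults));

        // Once the collection supplies an encoding, it wins over the default.
        metadata.put(Keys.ENCODING, "ISO-8859-1");
        System.out.println(resolve(Keys.ENCODING, metadata, defaults));
    }
}
```

Keying the maps by enum constants rather than strings gives each piece of metadata a typo-proof name, which is the role the metadata parameter plays in place of keyword arguments.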