Class FileSetDocumentCollection
- java.lang.Object
-
- it.unimi.di.big.mg4j.document.AbstractDocumentSequence
-
- it.unimi.di.big.mg4j.document.AbstractDocumentCollection
-
- it.unimi.di.big.mg4j.document.FileSetDocumentCollection
-
- All Implemented Interfaces:
DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable, AutoCloseable
public class FileSetDocumentCollection extends AbstractDocumentCollection implements Serializable
A DocumentCollection corresponding to a given set of files.
This class provides a main method with a flexible syntax that serialises into a document collection a list of files given on the command line or piped into standard input. Optionally, you can provide a parallel list of URIs that will be associated with each file.
Warning: the number of files is limited by Integer.MAX_VALUE.
- See Also:
- Serialized Form
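As a rough illustration of the kind of input the main method is described as accepting (a list of files, optionally with a parallel list of URIs), the following standard-library-only sketch reads file names one per line and derives a URI for each. The class FileListSketch, its two methods, the one-name-per-line format, and the choice of file: URIs are all assumptions made for this example; they are not part of MG4J and do not reproduce the actual command-line syntax.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.Reader;

/** Hypothetical helper, not part of MG4J. */
final class FileListSketch {
    /** Reads one file name per line, trimming whitespace and skipping blank lines. */
    static String[] readFileNames(Reader in) {
        return new BufferedReader(in).lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .toArray(String[]::new);
    }

    /** Builds an array parallel to file, using each file's file: URI (an assumed convention). */
    static String[] defaultUris(String[] file) {
        String[] uri = new String[file.length];
        for (int i = 0; i < file.length; i++) {
            uri[i] = new File(file[i]).toURI().toString();
        }
        return uri;
    }
}
```

The two resulting arrays have the shape expected by the FileSetDocumentCollection(String[] file, String[] uri, DocumentFactory factory) constructor.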
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
AbstractDocumentCollection.PropertyKeys
-
-
Field Summary
-
Fields inherited from interface it.unimi.di.big.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
-
-
Constructor Summary
Constructors
Constructor, Description
FileSetDocumentCollection(String[] file, DocumentFactory factory)
- Builds a document collection corresponding to a given set of files specified as an array.
FileSetDocumentCollection(String[] file, String[] uri, DocumentFactory factory)
- Builds a document collection corresponding to a given set of files specified as an array and a parallel array of URIs, one for each file.
-
Method Summary
Modifier and Type, Method, Description
void close()
- Closes this document sequence, releasing all resources.
FileSetDocumentCollection copy()
Document document(long index)
- Returns the document given its index.
DocumentFactory factory()
- Returns the factory used by this sequence.
static void main(String[] arg)
Reference2ObjectMap<Enum<?>,Object> metadata(long index)
- Returns the metadata map for a document.
long size()
- Returns the number of documents in this collection.
InputStream stream(long index)
- Returns an input stream for the raw content of a document.
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, iterator, printAllDocuments, toString
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentSequence
filename, finalize, load
-
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface it.unimi.di.big.mg4j.document.DocumentSequence
filename
-
-
-
-
Constructor Detail
-
FileSetDocumentCollection
public FileSetDocumentCollection(String[] file, DocumentFactory factory)
Builds a document collection corresponding to a given set of files specified as an array.
Beware. This class is not guaranteed to work if files are deleted or modified after creation!
- Parameters:
file
- an array containing the files that will be contained in the collection.
factory
- the factory that will be used to create documents.
-
FileSetDocumentCollection
public FileSetDocumentCollection(String[] file, String[] uri, DocumentFactory factory)
Builds a document collection corresponding to a given set of files specified as an array and a parallel array of URIs, one for each file.
Beware. This class is not guaranteed to work if files are deleted or modified after creation!
- Parameters:
file
- an array containing the files that will be contained in the collection.
uri
- an array, parallel to file, containing URIs to be associated with each element of file.
factory
- the factory that will be used to create documents.
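Since uri must be parallel to file, a caller may want to verify this before constructing the collection. This page does not say whether the constructor performs such a check itself, so the sketch below is only a defensive pre-flight helper; ParallelArraysSketch and requireParallel are hypothetical names, not MG4J API.

```java
/** Hypothetical helper, not part of MG4J. */
final class ParallelArraysSketch {
    /** Fails fast if the two arrays are missing or do not have the same length. */
    static void requireParallel(String[] file, String[] uri) {
        if (file == null || uri == null) {
            throw new IllegalArgumentException("file and uri must not be null");
        }
        if (file.length != uri.length) {
            throw new IllegalArgumentException(
                    "Expected " + file.length + " URIs, got " + uri.length);
        }
    }
}
```

Calling requireParallel(file, uri) immediately before the constructor turns a silent mismatch into an early, descriptive failure.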
-
-
Method Detail
-
factory
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
- Specified by:
factory in interface DocumentSequence
- Returns:
- the factory used by this sequence.
-
size
public long size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.
- Specified by:
size in interface DocumentCollection
- Returns:
- the number of documents in this collection.
-
metadata
public Reference2ObjectMap<Enum<?>,Object> metadata(long index)
Description copied from interface: DocumentCollection
Returns the metadata map for a document.
- Specified by:
metadata in interface DocumentCollection
- Parameters:
index
- an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the metadata map for the document.
-
document
public Document document(long index) throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.
- Specified by:
document in interface DocumentCollection
- Parameters:
index
- an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the index-th document.
- Throws:
IOException
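The index parameter is a long, yet the files are supplied as a String[] whose length cannot exceed Integer.MAX_VALUE. The sketch below shows one way such an index could be range-checked and narrowed to an array position. It illustrates the documented contract (0 inclusive, size exclusive) only; IndexSketch and toArrayPosition are hypothetical names, and how this class or the inherited ensureDocumentIndex actually performs the check is not shown on this page.

```java
/** Hypothetical helper, not part of MG4J. */
final class IndexSketch {
    /** Checks that 0 <= index < size and returns the index as an int array position. */
    static int toArrayPosition(long index, int size) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException(
                    "Index " + index + " is not in [0.." + size + ")");
        }
        // The cast is lossless here because index < size <= Integer.MAX_VALUE.
        return (int) index;
    }
}
```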
-
stream
public InputStream stream(long index) throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.
- Specified by:
stream in interface DocumentCollection
- Parameters:
index
- an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the raw content of the document as an input stream.
- Throws:
IOException
-
copy
public FileSetDocumentCollection copy()
- Specified by:
copy in interface DocumentCollection
- Specified by:
copy in interface FlyweightPrototype<DocumentCollection>
-
close
public void close() throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.
You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in interface DocumentSequence
- Overrides:
close in class AbstractDocumentSequence
- Throws:
IOException
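Because this class implements AutoCloseable, the advice to always call close() can be followed with a try-with-resources statement instead of relying on finalisers. The sketch below demonstrates the pattern with a stand-in resource so that it runs without MG4J on the classpath; TrackingSequence and useAndClose are hypothetical names used only for illustration.

```java
import java.io.Closeable;

/** Hypothetical demonstration, not part of MG4J. */
final class CloseSketch {
    /** Stand-in for a document sequence; it only records whether close() ran. */
    static final class TrackingSequence implements Closeable {
        boolean closed;

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Uses the resource in try-with-resources and returns it for inspection. */
    static TrackingSequence useAndClose() {
        TrackingSequence sequence = new TrackingSequence();
        try (TrackingSequence s = sequence) {
            // Work with s here; close() runs when this block exits,
            // whether normally or because of an exception.
        }
        return sequence;
    }
}
```

With the real class, the same statement shape would apply to a FileSetDocumentCollection, with the IOException declared by close() handled or propagated by the caller.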
-
main
public static void main(String[] arg) throws IOException, com.martiansoftware.jsap.JSAPException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
- Throws:
IOException
com.martiansoftware.jsap.JSAPException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
-
-