Package ai.djl.basicdataset.nlp
Class TextDataset
java.lang.Object
ai.djl.training.dataset.RandomAccessDataset
ai.djl.basicdataset.nlp.TextDataset
- All Implemented Interfaces:
ai.djl.training.dataset.Dataset
- Direct Known Subclasses:
GoEmotions, PennTreebankText, StanfordMovieReview, StanfordQuestionAnsweringDataset, TatoebaEnglishFrenchDataset, UniversalDependenciesEnglishEWT
public abstract class TextDataset
extends ai.djl.training.dataset.RandomAccessDataset
TextDataset is an abstract base class for natural language processing datasets in which
the source, the target, or both are text-based data.
The TextDataset fetches the data as Strings, processes it as required, and creates
embeddings for the tokens. Embeddings can be either pre-trained or trained on the fly.
A pre-trained TextEmbedding must be set in the TextDataset.Builder. If no embedding is set,
the dataset creates a TrainableWordEmbedding from the Vocabulary built within the dataset.
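The embedding fallback described above can be pictured with a minimal plain-Java sketch. It does not use DJL: EmbeddingChoiceSketch, Embedding, and resolve are hypothetical names, and the "trainable" embedding is assumed to be nothing more than small random vectors keyed by the vocabulary tokens.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical stand-in, NOT the DJL API: use the pre-trained embedding if one
// was configured, otherwise build a trainable one from the vocabulary.
final class EmbeddingChoiceSketch {

    /** Maps each known token to a vector; a stand-in for a text embedding. */
    static final class Embedding {
        final Map<String, float[]> vectors;
        final boolean trainable;

        Embedding(Map<String, float[]> vectors, boolean trainable) {
            this.vectors = vectors;
            this.trainable = trainable;
        }
    }

    /** Returns the configured embedding, or a freshly initialised trainable one. */
    static Embedding resolve(Embedding preTrained, List<String> vocabulary, int dim) {
        if (preTrained != null) {
            return preTrained; // a pre-trained embedding always wins
        }
        Random random = new Random(0); // fixed seed keeps the sketch reproducible
        Map<String, float[]> vectors = new LinkedHashMap<>();
        for (String token : vocabulary) {
            float[] vector = new float[dim];
            for (int i = 0; i < dim; i++) {
                vector[i] = (float) random.nextGaussian() * 0.01f;
            }
            vectors.put(token, vector);
        }
        return new Embedding(vectors, true);
    }

    public static void main(String[] args) {
        Embedding embedding = resolve(null, List.of("the", "movie", "was", "great"), 4);
        System.out.println(embedding.trainable + " " + embedding.vectors.keySet());
    }
}
```

Passing a non-null first argument returns that object unchanged, which mirrors the rule that a pre-trained embedding has to be supplied up front through the builder.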
-
Nested Class Summary
Nested Classes
static class TextDataset.Builder<T extends TextDataset.Builder<T>>
    Abstract Builder that helps build a TextDataset.
static final class TextDataset.Sample
    A class that stores TextDataset sample information.
Nested classes/interfaces inherited from class ai.djl.training.dataset.RandomAccessDataset
ai.djl.training.dataset.RandomAccessDataset.BaseBuilder<T extends ai.djl.training.dataset.RandomAccessDataset.BaseBuilder<T>>
Nested classes/interfaces inherited from interface ai.djl.training.dataset.Dataset
ai.djl.training.dataset.Dataset.Usage
-
Field Summary
Fields
protected ai.djl.ndarray.NDManager manager
protected ai.djl.repository.MRL mrl
protected boolean prepared
protected List<TextDataset.Sample> samples
protected TextData sourceTextData
protected TextData targetTextData
protected ai.djl.training.dataset.Dataset.Usage usage
Fields inherited from class ai.djl.training.dataset.RandomAccessDataset
dataBatchifier, device, labelBatchifier, limit, pipeline, prefetchNumber, sampler, targetPipeline
-
Constructor Summary
Constructors
TextDataset(TextDataset.Builder<?> builder)
    Creates a new instance of RandomAccessDataset with the given necessary configurations.
-
Method Summary
getProcessedText(long index, boolean source)
    Gets the processed textual input.
getRawText(long index, boolean source)
    Gets the raw textual input.
getSamples()
    Returns a list of sample information.
ai.djl.modality.nlp.embedding.TextEmbedding getTextEmbedding(boolean source)
    Gets the word embedding used while pre-processing the dataset.
ai.djl.modality.nlp.Vocabulary getVocabulary(boolean source)
    Gets the DefaultVocabulary built while preprocessing the text data.
protected void preprocess(List<String> newTextData, boolean source)
    Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings.
Methods inherited from class ai.djl.training.dataset.RandomAccessDataset
availableSize, get, getData, getData, getData, getData, newSubDataset, newSubDataset, randomSplit, size, subDataset, subDataset, subDataset, subDataset, toArray
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface ai.djl.training.dataset.Dataset
matchingTranslatorOptions, prepare, prepare
-
Field Details
-
sourceTextData
protected TextData sourceTextData
-
targetTextData
protected TextData targetTextData
-
manager
protected ai.djl.ndarray.NDManager manager
-
usage
protected ai.djl.training.dataset.Dataset.Usage usage
-
mrl
protected ai.djl.repository.MRL mrl
-
prepared
protected boolean prepared
-
samples
protected List<TextDataset.Sample> samples
-
-
Constructor Details
-
TextDataset
TextDataset(TextDataset.Builder<?> builder)
Creates a new instance of RandomAccessDataset with the given necessary configurations.
- Parameters:
builder - a builder with the necessary configurations
-
-
Method Details
-
getTextEmbedding
public ai.djl.modality.nlp.embedding.TextEmbedding getTextEmbedding(boolean source)
Gets the word embedding used while pre-processing the dataset. This method must be called after preprocess has been called on this instance.
- Parameters:
source - whether to get source or target text embedding
- Returns:
the text embedding
-
getVocabulary
public ai.djl.modality.nlp.Vocabulary getVocabulary(boolean source)
Gets the DefaultVocabulary built while preprocessing the text data.
- Parameters:
source - whether to get source or target vocabulary
- Returns:
the DefaultVocabulary
-
getRawText
Gets the raw textual input.
- Parameters:
index - the index of the text input
source - whether to get text from source or target
- Returns:
the raw text
-
getProcessedText
Gets the processed textual input.
- Parameters:
index - the index of the text input
source - whether to get text from source or target
- Returns:
the processed text
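To make the role of the source flag concrete, here is a minimal plain-Java sketch covering both getRawText and getProcessedText. It is not DJL code: SourceTargetSketch is a hypothetical class, the index is an int rather than a long, the return types are assumptions (this page does not show them), and "processing" is reduced to lower-casing and whitespace splitting.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in, NOT the DJL API: one boolean flag selects between
// the source side and the target side of a paired text dataset.
final class SourceTargetSketch {
    private final List<String> sourceRaw;
    private final List<String> targetRaw;

    SourceTargetSketch(List<String> sourceRaw, List<String> targetRaw) {
        this.sourceRaw = sourceRaw;
        this.targetRaw = targetRaw;
    }

    /** Returns the unmodified sentence at the given index. */
    String getRawText(int index, boolean source) {
        return (source ? sourceRaw : targetRaw).get(index);
    }

    /** Returns the sentence after a toy processing step (assumed: lower-case + split). */
    List<String> getProcessedText(int index, boolean source) {
        String raw = getRawText(index, source);
        return Arrays.asList(raw.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    public static void main(String[] args) {
        SourceTargetSketch data =
                new SourceTargetSketch(List.of("Go now ."), List.of("Va maintenant ."));
        System.out.println(data.getRawText(0, true));
        System.out.println(data.getProcessedText(0, false));
    }
}
```

Passing true reads from the source text and false from the target text, so a translation-style dataset can expose both sides through the same two methods.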
-
getSamples
Returns a list of sample information.
- Returns:
a list of sample information
-
preprocess
protected void preprocess(List<String> newTextData, boolean source) throws ai.djl.modality.nlp.embedding.EmbeddingException
Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings.
- Parameters:
newTextData - list of all unprocessed sentences in the dataset
source - whether the text data provided is source or target
- Throws:
ai.djl.modality.nlp.embedding.EmbeddingException - if there is an error while embedding input
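The order of the pre-processing steps can be sketched in plain Java. This is an illustration under stated assumptions, not the DJL implementation: PreprocessSketch and toIndices are hypothetical names, tokenising is a whitespace split, text processors are modelled as UnaryOperator<List<String>>, index 0 is reserved for an assumed "<unk>" token, and the embedding step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical stand-in, NOT the DJL API: tokenise, apply processors, build a vocabulary.
final class PreprocessSketch {
    static final String UNKNOWN = "<unk>"; // assumed placeholder for unseen tokens

    final List<List<String>> processed = new ArrayList<>();
    final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    void preprocess(List<String> sentences, List<UnaryOperator<List<String>>> processors) {
        vocabulary.putIfAbsent(UNKNOWN, 0);
        for (String sentence : sentences) {
            // Step 1: tokenise on whitespace.
            List<String> tokens = new ArrayList<>(Arrays.asList(sentence.trim().split("\\s+")));
            // Step 2: apply each text processor in order.
            for (UnaryOperator<List<String>> processor : processors) {
                tokens = processor.apply(tokens);
            }
            // Step 3: give every unseen token the next free index.
            for (String token : tokens) {
                vocabulary.putIfAbsent(token, vocabulary.size());
            }
            processed.add(tokens);
        }
    }

    /** Maps tokens to vocabulary indices; unknown tokens map to index 0. */
    List<Integer> toIndices(List<String> tokens) {
        return tokens.stream()
                .map(token -> vocabulary.getOrDefault(token, 0))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        UnaryOperator<List<String>> lowerCase = tokens -> tokens.stream()
                .map(token -> token.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
        PreprocessSketch sketch = new PreprocessSketch();
        sketch.preprocess(List.of("The movie was great", "The plot was thin"), List.of(lowerCase));
        System.out.println(sketch.vocabulary);
        System.out.println(sketch.toIndices(List.of("the", "plot", "twist")));
    }
}
```

Because the vocabulary only exists once this step has run, accessors that depend on it (such as getVocabulary and getTextEmbedding) are meaningful only after pre-processing.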
-