public abstract class TextDataset
extends ai.djl.training.dataset.RandomAccessDataset
TextDataset is an abstract dataset for natural language processing tasks in which
either the source or the target is text-based data.
The TextDataset fetches the data as String values, processes it as required, and
creates embeddings for the tokens. Embeddings can be either pre-trained or trained
on the go. A pre-trained TextEmbedding must be set in the TextDataset.Builder. If no
embeddings are set, the dataset creates a TrainableWordEmbedding from the Vocabulary
created within the dataset.
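The flow described above (raw String, then tokens, then a vocabulary that later indexes an embedding) can be illustrated with a minimal, library-free sketch. Everything in it is an assumption made for illustration: the class TextPipelineSketch and its methods tokenize, buildVocabulary and toIndices are hypothetical and are not part of DJL, and the whitespace tokenisation is a stand-in for whatever TextProcessors the dataset is configured with.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, library-free sketch; none of these names are DJL API.
final class TextPipelineSketch {

    /** Splits a raw sentence on whitespace and lower-cases each token. */
    static List<String> tokenize(String raw) {
        List<String> tokens = new ArrayList<>();
        for (String piece : raw.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Assigns each distinct token an index in order of first appearance. */
    static Map<String, Integer> buildVocabulary(List<List<String>> sentences) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (List<String> sentence : sentences) {
            for (String token : sentence) {
                vocabulary.putIfAbsent(token, vocabulary.size());
            }
        }
        return vocabulary;
    }

    /** Maps tokens to vocabulary indices; unknown tokens map to -1. */
    static int[] toIndices(List<String> tokens, Map<String, Integer> vocabulary) {
        int[] indices = new int[tokens.size()];
        for (int i = 0; i < indices.length; i++) {
            indices[i] = vocabulary.getOrDefault(tokens.get(i), -1);
        }
        return indices;
    }
}
```

In a real dataset the indices would be used to look up rows of a pre-trained or trainable embedding; the sketch stops at the index step because the embedding lookup depends on the chosen TextEmbedding.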
| Modifier and Type | Class and Description |
|---|---|
| static class | TextDataset.Builder<T extends TextDataset.Builder<T>>: Abstract Builder that helps build a TextDataset. |
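The bound T extends TextDataset.Builder<T> is a self-typed builder: setters declared on the abstract builder return the concrete builder type, so subclass-specific setters stay available in a call chain. The following is a minimal sketch of that pattern only. The classes SketchDataset and ReviewDataset, and the methods setBatchSize and optLowerCase, are hypothetical names chosen for illustration and do not describe DJL's actual builder methods.

```java
// Hypothetical sketch of a self-typed builder; names are illustrative, not DJL's.
abstract class SketchDataset {
    final int batchSize;

    SketchDataset(Builder<?> builder) {
        this.batchSize = builder.batchSize;
    }

    abstract static class Builder<T extends Builder<T>> {
        int batchSize = 1;

        /** Returns this builder typed as the concrete subclass. */
        protected abstract T self();

        /** Shared setter: returns T so chaining keeps the concrete type. */
        T setBatchSize(int batchSize) {
            this.batchSize = batchSize;
            return self();
        }
    }
}

final class ReviewDataset extends SketchDataset {
    final boolean lowerCase;

    private ReviewDataset(Builder builder) {
        super(builder);
        this.lowerCase = builder.lowerCase;
    }

    static final class Builder extends SketchDataset.Builder<Builder> {
        boolean lowerCase;

        @Override
        protected Builder self() {
            return this;
        }

        /** Subclass-specific option, still reachable after setBatchSize(...). */
        Builder optLowerCase(boolean lowerCase) {
            this.lowerCase = lowerCase;
            return this;
        }

        ReviewDataset build() {
            return new ReviewDataset(this);
        }
    }
}
```

With this shape, a chain such as new ReviewDataset.Builder().setBatchSize(32).optLowerCase(true).build() compiles without casts, because setBatchSize returns ReviewDataset.Builder and not the abstract base type.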
| Modifier and Type | Field and Description |
|---|---|
| protected ai.djl.ndarray.NDManager | manager |
| protected TextData | sourceTextData |
| protected TextData | targetTextData |
| Constructor and Description |
|---|
| TextDataset(TextDataset.Builder<?> builder): Creates a new instance of RandomAccessDataset with the given necessary configurations. |
| Modifier and Type | Method and Description |
|---|---|
| java.util.List<java.lang.String> | getProcessedText(long index, boolean source): Gets the processed textual input. |
| java.lang.String | getRawText(long index, boolean source): Gets the raw textual input. |
| ai.djl.modality.nlp.embedding.TextEmbedding | getTextEmbedding(boolean source): Gets the word embedding used while pre-processing the dataset. |
| ai.djl.modality.nlp.SimpleVocabulary | getVocabulary(boolean source): Gets the SimpleVocabulary built while preprocessing the text data. |
| protected void | preprocess(java.util.List<java.lang.String> newTextData, boolean source): Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings. |
protected TextData sourceTextData
protected TextData targetTextData
protected ai.djl.ndarray.NDManager manager

public TextDataset(TextDataset.Builder<?> builder)
Creates a new instance of RandomAccessDataset with the given necessary configurations.
Parameters:
builder - a builder with the necessary configurations

public ai.djl.modality.nlp.embedding.TextEmbedding getTextEmbedding(boolean source)
Gets the word embedding used while pre-processing the dataset.
Parameters:
source - whether to get source or target text embedding

public ai.djl.modality.nlp.SimpleVocabulary getVocabulary(boolean source)
Gets the SimpleVocabulary built while preprocessing the text data.
Parameters:
source - whether to get source or target vocabulary
Returns:
the SimpleVocabulary

public java.lang.String getRawText(long index, boolean source)
Gets the raw textual input.
Parameters:
index - the index of the text input
source - whether to get text from source or target

public java.util.List<java.lang.String> getProcessedText(long index, boolean source)
Gets the processed textual input.
Parameters:
index - the index of the text input
source - whether to get text from source or target

protected void preprocess(java.util.List<java.lang.String> newTextData, boolean source)
Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings.
Parameters:
newTextData - list of all unprocessed sentences in the dataset
source - whether the text data provided is source or target
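
Every accessor above takes a boolean source flag that selects between the source-side and target-side text. The sketch below shows one way such a flag could route calls to two parallel stores. It is an assumption-laden illustration, not DJL's implementation: ParallelTextSketch is a hypothetical class, plain lists stand in for the TextData fields, and lower-casing plus whitespace splitting stands in for the configured TextProcessors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of source/target routing; not DJL code.
final class ParallelTextSketch {
    private final List<String> sourceRaw = new ArrayList<>();
    private final List<String> targetRaw = new ArrayList<>();
    private final List<List<String>> sourceProcessed = new ArrayList<>();
    private final List<List<String>> targetProcessed = new ArrayList<>();

    /** Stores raw sentences and a tokenised copy on the selected side. */
    void preprocess(List<String> newTextData, boolean source) {
        List<String> raw = source ? sourceRaw : targetRaw;
        List<List<String>> processed = source ? sourceProcessed : targetProcessed;
        for (String sentence : newTextData) {
            raw.add(sentence);
            String normalised = sentence.trim().toLowerCase(Locale.ROOT);
            processed.add(Arrays.asList(normalised.split("\\s+")));
        }
    }

    /** Returns the unmodified sentence at the given index. */
    String getRawText(long index, boolean source) {
        return (source ? sourceRaw : targetRaw).get(Math.toIntExact(index));
    }

    /** Returns the tokenised sentence at the given index. */
    List<String> getProcessedText(long index, boolean source) {
        return (source ? sourceProcessed : targetProcessed).get(Math.toIntExact(index));
    }
}
```

Calling preprocess once with source set to true and once with it set to false fills both sides, after which the same index can be read from either side by flipping the flag.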