Class TextDataset

java.lang.Object
ai.djl.training.dataset.RandomAccessDataset
ai.djl.basicdataset.nlp.TextDataset
All Implemented Interfaces:
ai.djl.training.dataset.Dataset
Direct Known Subclasses:
GoEmotions, PennTreebankText, StanfordMovieReview, StanfordQuestionAnsweringDataset, TatoebaEnglishFrenchDataset, UniversalDependenciesEnglishEWT

public abstract class TextDataset extends ai.djl.training.dataset.RandomAccessDataset
TextDataset is an abstract dataset for natural language processing tasks in which the source, the target, or both are text-based data.

TextDataset fetches the data as Strings, processes it as required, and creates embeddings for the tokens. Embeddings can be either pre-trained or trained on the go. A pre-trained TextEmbedding must be set in the TextDataset.Builder. If no embedding is set, the dataset creates a TrainableWordEmbedding from the Vocabulary built within the dataset.
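The fallback described above can be pictured with a small plain-Java sketch. It is an illustration under stated assumptions, not DJL code: `ToyEmbedding`, `pretrained`, `trainable`, and `choose` are hypothetical names invented here, and the random initialisation is only a placeholder for whatever a real trainable embedding does.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Toy illustration only: none of these types or methods belong to DJL. */
class EmbeddingChoiceSketch {

    /** Hypothetical stand-in for a text embedding: maps a token to a vector. */
    interface ToyEmbedding {
        float[] embed(String token);

        boolean isTrainable();
    }

    /** Wraps fixed vectors supplied up front, as a pre-trained embedding would be. */
    static ToyEmbedding pretrained(Map<String, float[]> vectors, int dim) {
        return new ToyEmbedding() {
            @Override
            public float[] embed(String token) {
                float[] vector = vectors.get(token);
                return vector != null ? vector : new float[dim]; // unknown token: zero vector
            }

            @Override
            public boolean isTrainable() {
                return false;
            }
        };
    }

    /** Builds one small random row per vocabulary entry (placeholder initialisation). */
    static ToyEmbedding trainable(List<String> vocabulary, int dim, long seed) {
        Random random = new Random(seed);
        Map<String, float[]> table = new HashMap<>();
        for (String token : vocabulary) {
            float[] row = new float[dim];
            for (int i = 0; i < dim; i++) {
                row[i] = (random.nextFloat() - 0.5f) * 0.1f;
            }
            table.put(token, row);
        }
        return new ToyEmbedding() {
            @Override
            public float[] embed(String token) {
                return table.getOrDefault(token, new float[dim]);
            }

            @Override
            public boolean isTrainable() {
                return true;
            }
        };
    }

    /** Uses the configured embedding if present, otherwise falls back to a trainable one. */
    static ToyEmbedding choose(ToyEmbedding configured, List<String> vocabulary, int dim) {
        return configured != null ? configured : trainable(vocabulary, dim, 42L);
    }

    public static void main(String[] args) {
        ToyEmbedding fallback = choose(null, List.of("the", "cat", "sat"), 4);
        System.out.println("trainable fallback used: " + fallback.isTrainable());
    }
}
```

The point of the sketch is the ordering: an explicitly configured embedding takes precedence, and the vocabulary is only needed to size the trainable fallback.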

  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static class 
    TextDataset.Builder
    Abstract Builder that helps build a TextDataset.
    static final class 
    TextDataset.Sample
    A class that stores TextDataset sample information.

    Nested classes/interfaces inherited from class ai.djl.training.dataset.RandomAccessDataset

    ai.djl.training.dataset.RandomAccessDataset.BaseBuilder<T extends ai.djl.training.dataset.RandomAccessDataset.BaseBuilder<T>>

    Nested classes/interfaces inherited from interface ai.djl.training.dataset.Dataset

    ai.djl.training.dataset.Dataset.Usage
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
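    protected ai.djl.ndarray.NDManager
    manager
    protected ai.djl.repository.MRL
    mrl
    protected boolean
    prepared
    protected List<TextDataset.Sample>
    samples
    protected TextData
    sourceTextData
    protected TextData
    targetTextData
    protected ai.djl.training.dataset.Dataset.Usage
    usage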

    Fields inherited from class ai.djl.training.dataset.RandomAccessDataset

    dataBatchifier, device, labelBatchifier, limit, pipeline, prefetchNumber, sampler, targetPipeline
  • Constructor Summary

    Constructors
    Constructor
    Description
    TextDataset(TextDataset.Builder<?> builder)
    Creates a new instance of TextDataset with the given necessary configurations.
  • Method Summary

    Modifier and Type
    Method
    Description
    List<String>
    getProcessedText(long index, boolean source)
    Gets the processed textual input.
    String
    getRawText(long index, boolean source)
    Gets the raw textual input.
    List<TextDataset.Sample>
    getSamples()
    Returns a list of sample information.
    ai.djl.modality.nlp.embedding.TextEmbedding
    getTextEmbedding(boolean source)
    Gets the word embedding used while pre-processing the dataset.
    ai.djl.modality.nlp.Vocabulary
    getVocabulary(boolean source)
    Gets the DefaultVocabulary built while preprocessing the text data.
    protected void
    preprocess(List<String> newTextData, boolean source)
    Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings.

    Methods inherited from class ai.djl.training.dataset.RandomAccessDataset

    availableSize, get, getData, getData, getData, getData, newSubDataset, newSubDataset, randomSplit, size, subDataset, subDataset, subDataset, subDataset, toArray

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface ai.djl.training.dataset.Dataset

    matchingTranslatorOptions, prepare, prepare
  • Field Details

    • sourceTextData

      protected TextData sourceTextData
    • targetTextData

      protected TextData targetTextData
    • manager

      protected ai.djl.ndarray.NDManager manager
    • usage

      protected ai.djl.training.dataset.Dataset.Usage usage
    • mrl

      protected ai.djl.repository.MRL mrl
    • prepared

      protected boolean prepared
    • samples

      protected List<TextDataset.Sample> samples
  • Constructor Details

    • TextDataset

      public TextDataset(TextDataset.Builder<?> builder)
      Creates a new instance of TextDataset with the given necessary configurations.
      Parameters:
      builder - a builder with the necessary configurations
  • Method Details

    • getTextEmbedding

      public ai.djl.modality.nlp.embedding.TextEmbedding getTextEmbedding(boolean source)
      Gets the word embedding used while pre-processing the dataset. This method must be called after preprocess has been called on this instance.
      Parameters:
      source - whether to get source or target text embedding
      Returns:
      the text embedding
    • getVocabulary

      public ai.djl.modality.nlp.Vocabulary getVocabulary(boolean source)
      Gets the DefaultVocabulary built while preprocessing the text data.
      Parameters:
      source - whether to get source or target vocabulary
      Returns:
      the DefaultVocabulary
    • getRawText

      public String getRawText(long index, boolean source)
      Gets the raw textual input.
      Parameters:
      index - the index of the text input
      source - whether to get text from source or target
      Returns:
      the raw text
    • getProcessedText

      public List<String> getProcessedText(long index, boolean source)
      Gets the processed textual input.
      Parameters:
      index - the index of the text input
      source - whether to get text from source or target
      Returns:
      the processed text
    • getSamples

      public List<TextDataset.Sample> getSamples()
      Returns a list of sample information.
      Returns:
      a list of sample information
    • preprocess

      protected void preprocess(List<String> newTextData, boolean source) throws ai.djl.modality.nlp.embedding.EmbeddingException
      Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings.
      Parameters:
      newTextData - list of all unprocessed sentences in the dataset
      source - whether the text data provided is source or target
      Throws:
      ai.djl.modality.nlp.embedding.EmbeddingException - if there is an error while embedding input
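
The `source` flag used by preprocess and the accessor methods above can be illustrated with a toy model in plain Java. This is a sketch under assumptions, not DJL's implementation: the class `PreprocessSketch`, its inner `Side` type, whitespace tokenisation, and lower-casing as the only processing step are all hypothetical choices made for the example, and the embedding step is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy model of the preprocessing flow; it does not use or reproduce DJL code. */
class PreprocessSketch {

    /** Hypothetical per-side state: raw sentences, processed tokens, vocabulary. */
    static final class Side {
        final List<String> raw = new ArrayList<>();
        final List<List<String>> processed = new ArrayList<>();
        final Map<String, Integer> vocabulary = new LinkedHashMap<>();
    }

    private final Side source = new Side();
    private final Side target = new Side();

    /** Selects the source or target state, mirroring the boolean flag in the API. */
    private Side side(boolean isSource) {
        return isSource ? source : target;
    }

    /** Tokenises on whitespace, lower-cases each token, and extends the vocabulary. */
    void preprocess(List<String> newTextData, boolean isSource) {
        Side s = side(isSource);
        for (String sentence : newTextData) {
            List<String> tokens = new ArrayList<>();
            for (String token : sentence.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // skip the empty token produced by a blank sentence
                }
                String processed = token.toLowerCase(Locale.ROOT);
                tokens.add(processed);
                // Assign the next free index the first time a token is seen.
                s.vocabulary.putIfAbsent(processed, s.vocabulary.size());
            }
            s.raw.add(sentence);
            s.processed.add(tokens);
        }
    }

    String getRawText(long index, boolean isSource) {
        return side(isSource).raw.get(Math.toIntExact(index));
    }

    List<String> getProcessedText(long index, boolean isSource) {
        return side(isSource).processed.get(Math.toIntExact(index));
    }

    Map<String, Integer> getVocabulary(boolean isSource) {
        return side(isSource).vocabulary;
    }

    public static void main(String[] args) {
        PreprocessSketch data = new PreprocessSketch();
        data.preprocess(List.of("Hello world", "hello DJL"), true);
        data.preprocess(List.of("Bonjour le monde"), false);
        System.out.println(data.getRawText(1, true));
        System.out.println(data.getProcessedText(1, true));
        System.out.println(data.getVocabulary(false).keySet());
    }
}
```

Keeping separate state per side is what lets one index address both the source sentence and its target counterpart, with the boolean flag choosing which of the two to read.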