Class UniversalDependenciesEnglishEWT

java.lang.Object
ai.djl.training.dataset.RandomAccessDataset
ai.djl.basicdataset.nlp.TextDataset
ai.djl.basicdataset.nlp.UniversalDependenciesEnglishEWT
All Implemented Interfaces:
ai.djl.training.dataset.Dataset

public class UniversalDependenciesEnglishEWT extends TextDataset
A Gold Standard Universal Dependencies Corpus for English, built over the source material of the English Web Treebank LDC2012T13.
  • Constructor Details

    • UniversalDependenciesEnglishEWT

      protected UniversalDependenciesEnglishEWT(UniversalDependenciesEnglishEWT.Builder builder)
      Creates a new instance of UniversalDependenciesEnglishEWT.
      Parameters:
      builder - the builder object to build from
  • Method Details

    • builder

      public static UniversalDependenciesEnglishEWT.Builder builder()
      Creates a new builder to build a UniversalDependenciesEnglishEWT.
      Returns:
      a new builder
    • prepare

      public void prepare(ai.djl.util.Progress progress) throws IOException, ai.djl.modality.nlp.embedding.EmbeddingException
      Prepares the dataset for use with tracked progress. This method parses the TXT file, adds the texts to sourceTextData, and adds the Universal POS tags to universalPosTags. Only sourceTextData is then preprocessed.
      Parameters:
      progress - the progress tracker
      Throws:
      IOException - for various exceptions depending on the dataset
      ai.djl.modality.nlp.embedding.EmbeddingException - if there are exceptions during the embedding process
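
The parsing step described for prepare can be pictured with a small standalone sketch. It is not the DJL implementation. It assumes the file follows the CoNLL-U layout used by Universal Dependencies treebanks (a "# text = ..." comment per sentence, ten tab-separated fields per token with the Universal POS tag in the fourth field, and a blank line between sentences); the source only says that a TXT file is parsed. The names ConlluSketch, Sentence, and parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of DJL: collects sentence texts and UPOS tags. */
final class ConlluSketch {

    /** One parsed sentence: its raw text and the Universal POS tag of each token. */
    static final class Sentence {
        final String text;
        final List<String> upos;

        Sentence(String text, List<String> upos) {
            this.text = text;
            this.upos = upos;
        }
    }

    private static final String TEXT_PREFIX = "# text = ";

    /** Parses lines assumed to be in CoNLL-U layout into sentences. */
    static List<Sentence> parse(List<String> lines) {
        List<Sentence> sentences = new ArrayList<>();
        String text = null;
        List<String> tags = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith(TEXT_PREFIX)) {
                // The sentence text plays the role of an entry in sourceTextData.
                text = line.substring(TEXT_PREFIX.length());
            } else if (line.trim().isEmpty()) {
                // A blank line closes the current sentence.
                if (text != null) {
                    sentences.add(new Sentence(text, tags));
                }
                text = null;
                tags = new ArrayList<>();
            } else if (!line.startsWith("#")) {
                String[] fields = line.split("\t");
                // Keep plain token rows only; skip ranges ("1-2") and empty nodes ("1.1").
                if (fields.length >= 4 && fields[0].matches("\\d+")) {
                    tags.add(fields[3]);
                }
            }
        }
        // Flush the last sentence if the input does not end with a blank line.
        if (text != null) {
            sentences.add(new Sentence(text, tags));
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "# sent_id = demo-1",
                "# text = Dogs bark.",
                "1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_",
                "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No",
                "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_",
                "");
        for (Sentence s : parse(sample)) {
            System.out.println(s.text + " -> " + s.upos);
        }
    }
}
```

In this sketch only the sentence text would go on to tokenization and embedding, which mirrors the statement that only sourceTextData is preprocessed; the tag lists are kept as they were read.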
    • get

      public ai.djl.training.dataset.Record get(ai.djl.ndarray.NDManager manager, long index)
      Gets the Record for the given index from the dataset.
      Specified by:
      get in class ai.djl.training.dataset.RandomAccessDataset
      Parameters:
      manager - the manager used to create the arrays
      index - the index of the requested data item
      Returns:
      a Record that contains the data and label of the requested data item. The data NDList contains one NDArray representing the text embedding. The label NDList contains one NDArray holding the index of the Universal POS tag of each token. For the index assigned to each Universal POS tag, see the enum class UniversalDependenciesEnglishEWT.UniversalPosTag.
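
To make the label layout concrete, the following standalone sketch turns a sentence's tag names into one integer index per token. The enum PosTag is a hypothetical stand-in that lists the 17 Universal POS tags in alphabetical order; the real indices are defined by UniversalDependenciesEnglishEWT.UniversalPosTag and may differ from the ordering assumed here.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of DJL: maps UPOS tag names to label indices. */
final class PosLabelSketch {

    /** Stand-in for UniversalPosTag; the alphabetical ordering is an assumption. */
    enum PosTag {
        ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM,
        PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
    }

    /**
     * Returns one index per token, using the enum ordinal as the index.
     * Throws IllegalArgumentException for a tag name that is not in PosTag.
     */
    static int[] toIndices(List<String> tags) {
        int[] indices = new int[tags.size()];
        for (int i = 0; i < indices.length; i++) {
            indices[i] = PosTag.valueOf(tags.get(i)).ordinal();
        }
        return indices;
    }

    public static void main(String[] args) {
        // Tags for the three tokens of "Dogs bark."
        int[] labels = toIndices(Arrays.asList("NOUN", "VERB", "PUNCT"));
        System.out.println(Arrays.toString(labels)); // prints [7, 15, 12]
    }
}
```

An array like this corresponds to the single NDArray in the label NDList: its length equals the number of tokens in the sentence, so it varies from record to record.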
    • availableSize

      protected long availableSize()
      Returns the number of records available to be read in this Dataset.
      Specified by:
      availableSize in class ai.djl.training.dataset.RandomAccessDataset
      Returns:
      the number of records available to be read in this Dataset