Package ai.djl.basicdataset.nlp
Class UniversalDependenciesEnglishEWT
java.lang.Object
ai.djl.training.dataset.RandomAccessDataset
ai.djl.basicdataset.nlp.TextDataset
ai.djl.basicdataset.nlp.UniversalDependenciesEnglishEWT
- All Implemented Interfaces:
ai.djl.training.dataset.Dataset
A Gold Standard Universal Dependencies Corpus for English, built over the source material of the
English Web Treebank LDC2012T13.
-
Nested Class Summary
Nested classes/interfaces inherited from class ai.djl.basicdataset.nlp.TextDataset
TextDataset.Sample
Nested classes/interfaces inherited from class ai.djl.training.dataset.RandomAccessDataset
ai.djl.training.dataset.RandomAccessDataset.BaseBuilder<T extends ai.djl.training.dataset.RandomAccessDataset.BaseBuilder<T>>
Nested classes/interfaces inherited from interface ai.djl.training.dataset.Dataset
ai.djl.training.dataset.Dataset.Usage
-
Field Summary
Fields inherited from class ai.djl.basicdataset.nlp.TextDataset
manager, mrl, prepared, samples, sourceTextData, targetTextData, usage
Fields inherited from class ai.djl.training.dataset.RandomAccessDataset
dataBatchifier, device, labelBatchifier, limit, pipeline, prefetchNumber, sampler, targetPipeline
-
Constructor Summary
protected UniversalDependenciesEnglishEWT(builder): Creates a new instance of UniversalDependenciesEnglishEWT.
-
Method Summary
protected long availableSize(): Returns the number of records available to be read in this Dataset.
builder(): Creates a new builder to build a UniversalDependenciesEnglishEWT.
ai.djl.training.dataset.Record get(ai.djl.ndarray.NDManager manager, long index): Gets the Record for the given index from the dataset.
void prepare(ai.djl.util.Progress progress): Prepares the dataset for use with tracked progress.
Methods inherited from class ai.djl.basicdataset.nlp.TextDataset
getProcessedText, getRawText, getSamples, getTextEmbedding, getVocabulary, preprocess
Methods inherited from class ai.djl.training.dataset.RandomAccessDataset
getData, getData, getData, getData, newSubDataset, newSubDataset, randomSplit, size, subDataset, subDataset, subDataset, subDataset, toArray
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface ai.djl.training.dataset.Dataset
matchingTranslatorOptions, prepare
-
Constructor Details
-
UniversalDependenciesEnglishEWT
Creates a new instance of UniversalDependenciesEnglishEWT.
- Parameters:
builder - the builder object to build from
-
-
Method Details
-
builder
Creates a new builder to build a UniversalDependenciesEnglishEWT.
- Returns:
a new builder
-
prepare
public void prepare(ai.djl.util.Progress progress) throws IOException, ai.djl.modality.nlp.embedding.EmbeddingException
Prepares the dataset for use with tracked progress. This method parses the TXT file, adds the texts to sourceTextData, and adds the Universal POS tags to universalPosTags. Only sourceTextData is then preprocessed.
- Parameters:
progress - the progress tracker
- Throws:
IOException - for various exceptions depending on the dataset
ai.djl.modality.nlp.embedding.EmbeddingException - if there are exceptions during the embedding process
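The parsing step described for prepare can be pictured with a small standalone sketch. It assumes the TXT file follows the CoNLL-U layout used by Universal Dependencies treebanks (ten tab-separated fields per token, with FORM in the second column and UPOS in the fourth, "#" comment lines, and blank lines between sentences). The class ConlluSketch and its Sentence type are hypothetical names for illustration only; they are not part of DJL and do not reproduce its actual parsing code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not DJL code: splits CoNLL-U style text into
// per-sentence token lists and Universal POS tag lists.
class ConlluSketch {

    /** One sentence: surface tokens and their UPOS tags, aligned by position. */
    static final class Sentence {
        final List<String> tokens = new ArrayList<>();
        final List<String> tags = new ArrayList<>();
    }

    static List<Sentence> parse(String text) {
        List<Sentence> sentences = new ArrayList<>();
        Sentence current = new Sentence();
        for (String rawLine : text.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                // A blank line closes the current sentence.
                if (!current.tokens.isEmpty()) {
                    sentences.add(current);
                    current = new Sentence();
                }
                continue;
            }
            if (line.startsWith("#")) {
                continue; // comment line such as "# text = ..."
            }
            String[] fields = line.split("\t");
            if (fields.length < 4) {
                continue; // not a token line
            }
            // Skip multiword token ranges ("1-2") and empty nodes ("1.1").
            if (fields[0].contains("-") || fields[0].contains(".")) {
                continue;
            }
            current.tokens.add(fields[1]); // FORM column
            current.tags.add(fields[3]);   // UPOS column
        }
        if (!current.tokens.isEmpty()) {
            sentences.add(current); // last sentence without trailing blank line
        }
        return sentences;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "# text = Dogs bark.",
                "1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_",
                "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No",
                "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_",
                "");
        for (Sentence s : parse(sample)) {
            System.out.println(s.tokens + " -> " + s.tags);
        }
    }
}
```

In this picture, the token lists play the role of the texts added to sourceTextData and the tag lists play the role of universalPosTags; only the former would go on to preprocessing and embedding.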
-
get
public ai.djl.training.dataset.Record get(ai.djl.ndarray.NDManager manager, long index)
Gets the Record for the given index from the dataset.
- Specified by:
get in class ai.djl.training.dataset.RandomAccessDataset
- Parameters:
manager - the manager used to create the arrays
index - the index of the requested data item
- Returns:
a Record that contains the data and label of the requested data item. The data NDList contains one NDArray representing the text embedding. The label NDList contains one NDArray holding the indices of the Universal POS tags of each token. For the index of each Universal POS tag, see the enum class UniversalDependenciesEnglishEWT.UniversalPosTag.
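To make the label layout concrete, the following sketch shows one way a sequence of tag names can become a sequence of integer indices. The enum Tag below is a hypothetical stand-in that lists the 17 Universal POS tags in alphabetical order; the constant order, and therefore the index values, of the real UniversalDependenciesEnglishEWT.UniversalPosTag enum may differ, so consult that enum for the actual mapping.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not DJL code: maps UPOS tag names to integer
// indices using the ordinal of a stand-in enum.
class PosTagIndexSketch {

    /** Stand-in for the library's tag enum; order here is an assumption. */
    enum Tag {
        ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM,
        PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
    }

    /** Returns one index per token; throws IllegalArgumentException on an unknown tag. */
    static int[] toIndices(List<String> tags) {
        int[] indices = new int[tags.size()];
        for (int i = 0; i < tags.size(); i++) {
            indices[i] = Tag.valueOf(tags.get(i)).ordinal();
        }
        return indices;
    }

    public static void main(String[] args) {
        int[] label = toIndices(Arrays.asList("NOUN", "VERB", "PUNCT"));
        System.out.println(Arrays.toString(label));
    }
}
```

An array like this, one entry per token, corresponds to the contents of the single label NDArray described above.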
-
availableSize
protected long availableSize()
Returns the number of records available to be read in this Dataset.
- Specified by:
availableSize in class ai.djl.training.dataset.RandomAccessDataset
- Returns:
the number of records available to be read in this Dataset
-