Package ai.djl.basicdataset.nlp
Class StanfordQuestionAnsweringDataset
java.lang.Object
ai.djl.training.dataset.RandomAccessDataset
ai.djl.basicdataset.nlp.TextDataset
ai.djl.basicdataset.nlp.StanfordQuestionAnsweringDataset
- All Implemented Interfaces:
ai.djl.training.dataset.Dataset, ai.djl.training.dataset.RawDataset<Object>
public class StanfordQuestionAnsweringDataset
extends TextDataset
implements ai.djl.training.dataset.RawDataset<Object>
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of
questions posed by crowdworkers on a set of Wikipedia articles. Each question is either answered
by a segment of text (a span) from the corresponding reading passage or is unanswerable.
Nested Class Summary
Nested classes/interfaces inherited from class ai.djl.basicdataset.nlp.TextDataset
TextDataset.Sample
Nested classes/interfaces inherited from class ai.djl.training.dataset.RandomAccessDataset
ai.djl.training.dataset.RandomAccessDataset.BaseBuilder<T extends ai.djl.training.dataset.RandomAccessDataset.BaseBuilder<T>>
Nested classes/interfaces inherited from interface ai.djl.training.dataset.Dataset
ai.djl.training.dataset.Dataset.Usage
Field Summary
Fields inherited from class ai.djl.basicdataset.nlp.TextDataset
manager, mrl, prepared, samples, sourceTextData, targetTextData, usage
Fields inherited from class ai.djl.training.dataset.RandomAccessDataset
dataBatchifier, device, labelBatchifier, limit, pipeline, prefetchNumber, sampler, targetPipeline
Constructor Summary
protected StanfordQuestionAnsweringDataset
    Creates a new instance of StanfordQuestionAnsweringDataset.
Method Summary
protected long availableSize()
    Returns the number of records available to be read in this Dataset.
builder()
    Creates a new builder to build a StanfordQuestionAnsweringDataset.
ai.djl.training.dataset.Record get(ai.djl.ndarray.NDManager manager, long index)
    Gets the Record for the given index from the dataset.
getData()
    Gets data from the SQuAD dataset.
void prepare(ai.djl.util.Progress progress)
    Prepares the dataset for use with tracked progress.
protected void preprocess(List<String> newTextData, boolean source)
    Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings.
Methods inherited from class ai.djl.basicdataset.nlp.TextDataset
getProcessedText, getRawText, getSamples, getTextEmbedding, getVocabulary
Methods inherited from class ai.djl.training.dataset.RandomAccessDataset
getData, getData, getData, getData, newSubDataset, newSubDataset, randomSplit, size, subDataset, subDataset, subDataset, subDataset, toArray
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface ai.djl.training.dataset.Dataset
getData, getData, matchingTranslatorOptions, prepare
Constructor Details
StanfordQuestionAnsweringDataset
Creates a new instance of StanfordQuestionAnsweringDataset.
- Parameters:
builder - the builder object to build from
Method Details
builder
Creates a new builder to build a StanfordQuestionAnsweringDataset.
- Returns:
a new builder
prepare
public void prepare(ai.djl.util.Progress progress) throws IOException, ai.djl.modality.nlp.embedding.EmbeddingException
Prepares the dataset for use with tracked progress. This method parses the JSON file, adds the question, context, and title to sourceTextData, and adds the answers to targetTextData. Both lists are then preprocessed.
- Specified by:
prepare in interface ai.djl.training.dataset.Dataset
- Parameters:
progress - the progress tracker
- Throws:
IOException - for various exceptions depending on the dataset
ai.djl.modality.nlp.embedding.EmbeddingException - if there are exceptions during the embedding process
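The flattening that prepare describes can be illustrated with a minimal sketch that uses only the Java standard library and does not call DJL. It assumes the publicly documented SQuAD JSON layout (data, paragraphs, qas, answers); the class SquadFlattenSketch, its Flattened holder, and the order in which texts are appended are hypothetical and are not taken from the DJL implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of DJL: flattens SQuAD-shaped data into two text lists. */
public class SquadFlattenSketch {

    /** Holds source texts (title, context, question) and target texts (answers). */
    public static final class Flattened {
        public final List<String> source = new ArrayList<>();
        public final List<String> target = new ArrayList<>();
        public int questionCount;
    }

    /** Walks the assumed SQuAD layout and collects texts; the append order is an assumption. */
    @SuppressWarnings("unchecked")
    public static Flattened flatten(Map<String, Object> root) {
        Flattened out = new Flattened();
        List<Map<String, Object>> articles = (List<Map<String, Object>>) root.get("data");
        for (Map<String, Object> article : articles) {
            out.source.add((String) article.get("title"));
            List<Map<String, Object>> paragraphs =
                    (List<Map<String, Object>>) article.get("paragraphs");
            for (Map<String, Object> paragraph : paragraphs) {
                out.source.add((String) paragraph.get("context"));
                List<Map<String, Object>> qas =
                        (List<Map<String, Object>>) paragraph.get("qas");
                for (Map<String, Object> qa : qas) {
                    out.source.add((String) qa.get("question"));
                    out.questionCount++;
                    // An unanswerable question simply has an empty answer list.
                    List<Map<String, Object>> answers =
                            (List<Map<String, Object>>) qa.get("answers");
                    for (Map<String, Object> answer : answers) {
                        out.target.add((String) answer.get("text"));
                    }
                }
            }
        }
        return out;
    }

    /** Builds a tiny in-memory sample in the assumed SQuAD layout (made-up content). */
    public static Map<String, Object> sample() {
        Map<String, Object> answer = Map.of("text", "Paris", "answer_start", 25);
        Map<String, Object> answerable =
                Map.of("question", "What is the capital of France?", "answers", List.of(answer));
        Map<String, Object> unanswerable =
                Map.of("question", "Who is the mayor?", "answers", List.of());
        Map<String, Object> paragraph =
                Map.of("context", "The capital of France is Paris.",
                        "qas", List.of(answerable, unanswerable));
        Map<String, Object> article =
                Map.of("title", "France", "paragraphs", List.of(paragraph));
        return Map.of("version", "v2.0", "data", List.of(article));
    }

    public static void main(String[] args) {
        Flattened f = flatten(sample());
        System.out.println("source texts: " + f.source.size());
        System.out.println("target texts: " + f.target.size());
        System.out.println("questions:    " + f.questionCount);
    }
}
```

With this two-question sample, the sketch collects four source texts and one target text, which shows why the number of question records need not match the length of either text list.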
get
public ai.djl.training.dataset.Record get(ai.djl.ndarray.NDManager manager, long index)
Gets the Record for the given index from the dataset.
- Specified by:
get in class ai.djl.training.dataset.RandomAccessDataset
- Parameters:
manager - the manager used to create the arrays
index - the index of the requested data item
- Returns:
a Record that contains the data and label of the requested data item. The data NDList contains three NDArrays representing the embedded title, context, and question, which are named accordingly. The label NDList contains multiple NDArrays, one for each embedded answer.
availableSize
protected long availableSize()
Returns the number of records available to be read in this Dataset. In this implementation, the number of available records is the size of questionInfoList.
- Specified by:
availableSize in class ai.djl.training.dataset.RandomAccessDataset
- Returns:
the number of records available to be read in this Dataset
getData
Gets data from the SQuAD dataset. This method returns the whole dataset directly as a single object.
- Specified by:
getData in interface ai.djl.training.dataset.RawDataset<Object>
- Returns:
an instance of Object structured like the JSON file, e.g. Map<String, List<Map<...>>>
- Throws:
IOException
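Because the return type is Object, callers have to check and cast each level themselves. The following is a minimal sketch of such a walk, using only the Java standard library. It assumes the object is made of nested java.util.Map and java.util.List instances keyed as in the public SQuAD JSON layout (data, paragraphs, qas, question); the class RawSquadWalk and its methods are hypothetical, and the sample object is built by hand rather than obtained from getData().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of DJL: reads questions out of an untyped, JSON-shaped object. */
public class RawSquadWalk {

    /** Collects every question string, checking the runtime type at each level. */
    public static List<String> questions(Object raw) {
        List<String> result = new ArrayList<>();
        for (Object article : asList(asMap(raw).get("data"))) {
            for (Object paragraph : asList(asMap(article).get("paragraphs"))) {
                for (Object qa : asList(asMap(paragraph).get("qas"))) {
                    Object question = asMap(qa).get("question");
                    if (question instanceof String) {
                        result.add((String) question);
                    }
                }
            }
        }
        return result;
    }

    /** Casts to Map or fails with a readable message instead of a ClassCastException. */
    private static Map<?, ?> asMap(Object node) {
        if (node instanceof Map) {
            return (Map<?, ?>) node;
        }
        throw new IllegalArgumentException("Expected a JSON object but got: " + node);
    }

    /** Casts to List or fails with a readable message instead of a ClassCastException. */
    private static List<?> asList(Object node) {
        if (node instanceof List) {
            return (List<?>) node;
        }
        throw new IllegalArgumentException("Expected a JSON array but got: " + node);
    }

    /** Hand-built stand-in for the raw object (made-up content, assumed layout). */
    public static Object sample() {
        Map<String, Object> qa1 = Map.of("question", "What is the capital of France?");
        Map<String, Object> qa2 = Map.of("question", "Which river crosses Paris?");
        Map<String, Object> paragraph =
                Map.of("context", "Paris is the capital of France.", "qas", List.of(qa1, qa2));
        Map<String, Object> article =
                Map.of("title", "France", "paragraphs", List.of(paragraph));
        Map<String, Object> root = Map.of("data", List.of(article));
        return root;
    }

    public static void main(String[] args) {
        // In real use, the argument would be the object returned by getData().
        for (String question : questions(sample())) {
            System.out.println(question);
        }
    }
}
```

Checking with instanceof before each cast keeps a layout mismatch from surfacing as an opaque ClassCastException deep inside the loop.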
preprocess
protected void preprocess(List<String> newTextData, boolean source) throws ai.djl.modality.nlp.embedding.EmbeddingException
Performs pre-processing steps on text data such as tokenising, applying TextProcessors, building the vocabulary, and creating word embeddings. Because the number of records in this dataset does not equal the length of sourceTextData or targetTextData, the limit has to be handled explicitly here.
- Overrides:
preprocess in class TextDataset
- Parameters:
newTextData - list of all unprocessed sentences in the dataset
source - whether the text data provided is source or target
- Throws:
ai.djl.modality.nlp.embedding.EmbeddingException - if there is an error while embedding input