mtyrrell/CPU_Conditional_Classifier


mtyrrell committed on Jul 23, 2023

Commit bbe3bbb · 1 Parent(s): d3d9b5e

Update README.md

Files changed (1)

README.md +1 -1

@@ -59,7 +59,7 @@ The pre-processing operations used to produce the final training dataset were as
 3. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
 4. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
 5. The 'match_onanswer' and 'answerWordcount' are used conditionally to select high quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
-6. Data is then augmented using sentence shuffle from the ```albumentations``` library
+6. Data is then augmented using sentence shuffle from the ```albumentations``` library (NLP methods insertion and substitution were also tried, but lowered the performance of the model and were therefore not included in the final training data)
 ## Training procedure
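
For readers of this diff, pre-processing steps 3 to 6 could be sketched roughly as below. This is an illustration only and not the author's code: the column names come from the README, while the `"en"` language code, the three thresholds (and the assumption that 'match_onanswer' is a fraction between 0 and 1), and the function names `preprocess` and `shuffle_sentences` are hypothetical. The sentence shuffle is a plain-Python stand-in, since the README does not show the actual library call.

```python
import random

import pandas as pd


def preprocess(df, high_match=0.8, low_match=0.5, min_words=30):
    """Sketch of steps 3-5; all three thresholds are hypothetical."""
    df = df.copy()

    # Step 3: use the translation when one exists and the text is not English.
    # The language code "en" is an assumption about how English is encoded.
    use_translation = df["context_translated"].notna() & (df["language"] != "en")
    df["context"] = [
        translated if swap else original
        for original, translated, swap in zip(
            df["context"], df["context_translated"], use_translation
        )
    ]

    # Step 4: one row per text sample; other columns (labels) repeat per row.
    df = df.explode("context").reset_index(drop=True)

    # Step 5: keep high-match rows, or lower-match rows with long answers.
    keep = (df["match_onanswer"] >= high_match) | (
        (df["match_onanswer"] >= low_match) & (df["answerWordcount"] >= min_words)
    )
    return df[keep].reset_index(drop=True)


def shuffle_sentences(text, rng):
    """Step 6 stand-in: reorder sentences using a naive split on periods."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) < 2:
        return text
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."
```

Passing a seeded `random.Random` instance to `shuffle_sentences` keeps the augmentation reproducible. The naive split on periods will mishandle abbreviations and decimals, so a proper sentence tokenizer would be a better choice for real data.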