Closed
Labels
feature (New feature or request)
Description
Issues with current Specials:
- User can't modify the string value of the special
- Text can't be printed out with join after reverse numericalize due to enum type (common use case)
    original_text = ' '.join(vocab.reverse_numericalize(batch_x.text))  # join fails: itos contains both `str` and `SpecialVocabSymbols`
Idea:
Make Specials subclass str. Inheriting from the Special base class identifies a string as a Special. Each Special has an apply method that knows how to apply it to a token and/or a sequence. Example:
    import abc

    class Special(str):
        # Base class, defined first so the subclasses below can inherit from it
        @abc.abstractmethod
        def apply(self, sequence_or_token):
            # Method is used ONLY in Vocab.numericalize
            if isinstance(sequence_or_token, str):
                # Apply to token
                pass
            elif isinstance(sequence_or_token, (list, tuple)):
                # Apply to sequence
                pass

    class EOS(Special):
        def apply(self, sequence_or_token):
            # Core special, handled by Vocab
            if isinstance(sequence_or_token, str):
                raise ValueError("EOS can only be applied to a sequence")
            elif isinstance(sequence_or_token, list):
                # Extend with self; the special is itself a str, so no .data attribute is needed
                return sequence_or_token + [self]

    class UNK(Special):
        def apply(self, sequence_or_token):
            # Core special, handled by Vocab
            pass
This allows us to:
    eos = EOS('<eos>')
    sequence = ['this', 'is', 'a', 'sequence']
    print(' '.join(eos.apply(sequence)))
    # prints: this is a sequence <eos>
So the user can define the string for the special (issue 1), and joining the output works because every special is a str (issue 2).
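To illustrate the round trip through a vocabulary, here is a minimal sketch assuming a plain list and dict as stand-ins for `itos` and `stoi` (these stand-ins and the `'</s>'` surface string are assumptions for illustration, not the current Vocab implementation). It shows that a str-subclassed special can sit next to ordinary tokens, be looked up by its string value, and be joined directly.

```python
class Special(str):
    """Marker base class: every instance is a special and also a plain str."""

    def apply(self, sequence_or_token):
        raise NotImplementedError


class EOS(Special):
    def apply(self, sequence_or_token):
        if isinstance(sequence_or_token, str):
            raise ValueError("EOS can only be applied to a sequence")
        # Append the special itself; it already behaves like a string.
        return list(sequence_or_token) + [self]


# Hypothetical stand-ins for the vocabulary tables.
eos = EOS("</s>")  # the user chooses the surface string
itos = ["this", "is", "a", "sequence", eos]
stoi = {token: index for index, token in enumerate(itos)}

# Numericalize, then reverse numericalize and join.
indices = [stoi[token] for token in eos.apply(["this", "is", "a", "sequence"])]
original_text = " ".join(itos[index] for index in indices)
print(original_text)
```

Because a str subclass keeps the hash and equality of str, `stoi["</s>"]` and `stoi[eos]` resolve to the same index, while `isinstance(token, Special)` still tells specials apart from ordinary tokens.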
Up for discussion:
- Do we introduce an inheritance hierarchy for specials? A natural hierarchy is:
  - CoreSpecial (PAD, UNK -- behavior hardcoded in Vocab, apply isn't used)
  - TokenSpecial (applied at token level for efficiency; for example, substituting numbers with a placeholder or masking tokens can be a special instead of a hook)
  - SequenceSpecial (anything that works at sequence level: EOS, BOS, maybe MASK)
- Referencing specials
  - The Vocab needs to find the core specials in order to provide them to Field (e.g. for padding)
  - The Vocab.padding_index method has to check for `if PAD in self.stoi` or `if PAD in self.specials` (TBD: maybe make the list of specials an attribute of Vocab)
    - Proposal (maybe bad): make `__hash__` and `__eq__` of specials trigger on the concrete class, and not on the string
      - Required: there can be only one of each special in the Vocab (natural)
      - Checking for `if PAD in stoi` would essentially check `type(key) is PAD` for a stored key instead of `key == str(PAD)` (illustrative)
    - Alternative:
      - Check
            for special in self.specials:
                if type(special) is PAD:
                    return self.stoi[special]
      - Requires storing all specials as an attribute (probably nicer)
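The hierarchy and the "Alternative" lookup above could be sketched roughly as follows. This is only a sketch under assumptions: `SketchVocab` and its `_index_of` helper are hypothetical names for illustration, not the existing Vocab, and the default no-op `apply` is one possible choice for core specials.

```python
class Special(str):
    def apply(self, sequence_or_token):
        # Default: leave the input unchanged (core specials never use apply).
        return sequence_or_token


class CoreSpecial(Special):
    """Behavior hardcoded in the vocabulary."""


class TokenSpecial(Special):
    """Applied to single tokens."""


class SequenceSpecial(Special):
    """Applied to whole sequences."""


class PAD(CoreSpecial):
    pass


class UNK(CoreSpecial):
    pass


class EOS(SequenceSpecial):
    def apply(self, sequence):
        return list(sequence) + [self]


class SketchVocab:
    """Hypothetical minimal vocabulary that stores its specials as an attribute."""

    def __init__(self, tokens, specials=()):
        self.specials = tuple(specials)
        classes = [type(special) for special in self.specials]
        if len(set(classes)) != len(classes):
            raise ValueError("only one special of each class is allowed")
        self.itos = list(self.specials) + [t for t in tokens if t not in self.specials]
        self.stoi = {token: index for index, token in enumerate(self.itos)}

    def _index_of(self, special_class):
        # Find the special by its concrete class, not by its string value.
        for special in self.specials:
            if type(special) is special_class:
                return self.stoi[special]
        raise ValueError(f"no {special_class.__name__} special in this vocabulary")

    @property
    def padding_index(self):
        return self._index_of(PAD)


vocab = SketchVocab(["a", "b"], specials=[UNK("<unk>"), PAD("[PAD]"), EOS("<eos>")])
print(vocab.padding_index, vocab.itos[vocab.padding_index])
```

Looking specials up by class keeps the string value free for the user to choose, and leaves `__hash__` and `__eq__` untouched, so `stoi` lookups by plain string keep working.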