Proposal: specials revamp #216

@mttk

Issues with current Specials:

  • Users can't modify the string value of a special
  • Text can't be printed with join after reverse numericalization, because itos holds enum members alongside strings (a common use case):
original_text = ' '.join(vocab.reverse_numericalize(batch_x.text)) # itos contains both `str` and `SpecialVocabSymbols`
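The failure can be reproduced in isolation. The sketch below is a stand-in rather than the library's code: `SpecialVocabSymbols` is re-declared here as a plain `Enum` with assumed members, and `itos` is a hand-built list. It also shows the kind of manual conversion a user currently has to write to get printable text.

```python
from enum import Enum


class SpecialVocabSymbols(Enum):
    # Assumed stand-in for the current enum-based specials
    UNK = "<unk>"
    PAD = "<pad>"


# itos mixes plain strings with enum members, as in the current Vocab
itos = ["this", "is", SpecialVocabSymbols.UNK, "text"]

try:
    " ".join(itos)
    join_failed = False
except TypeError:
    # str.join only accepts str instances, so the enum member is rejected
    join_failed = True

# Workaround needed today: unwrap enum members by hand before joining
workaround = " ".join(t.value if isinstance(t, Enum) else t for t in itos)
print(join_failed, workaround)
```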

Idea:
Make Specials subclass str. Inheriting from the Special base class marks a string as a special. Each Special has an apply method that knows how to apply it to a token and/or a sequence. Example:

import abc

class Special(str):
    # Defined first so that the subclasses below can inherit from it.
    # Without an ABC metaclass, abstractmethod only documents intent.
    @abc.abstractmethod
    def apply(self, sequence_or_token):
        # Method is used ONLY in Vocab.numericalize
        if isinstance(sequence_or_token, str):
            # Apply to token
            pass
        elif isinstance(sequence_or_token, (list, tuple)):
            # Apply to sequence
            pass

class EOS(Special):
    def apply(self, sequence_or_token):
        # Core special, handled by Vocab
        if isinstance(sequence_or_token, str):
            raise ValueError("EOS can only be applied to a sequence")
        # Extend with self; a str subclass has no .data attribute,
        # the instance itself is the string
        return list(sequence_or_token) + [self]

class UNK(Special):
    def apply(self, sequence_or_token):
        # Core special, handled by Vocab
        pass

This allows us to:

eos = EOS('<eos>')
sequence = ['this', 'is', 'a', 'sequence']
print(' '.join(eos.apply(sequence)))
# prints: this is a sequence <eos>

So the user can define (1) the string for the special and (2) how the special is applied.
Up for discussion:

  1. Do we introduce an inheritance hierarchy for specials? A natural hierarchy is:
  • CoreSpecial (PAD, UNK -- behavior hardcoded in vocab, apply isn't used)
  • TokenSpecial (applied at token level for efficiency. Example: substituting numbers with a placeholder token, or masking tokens, can be a special instead of a hook)
  • SequenceSpecial (anything that works at sequence level: EOS, BOS, maybe MASK)
  2. Referencing specials
  • The Vocab needs to find the core specials in order to provide them to Field (e.g. for padding)
  • The Vocab.padding_index method has to check whether PAD is in self.stoi/self.specials (TBD: maybe make the list of specials an attribute of the vocab)
    • Proposal (maybe bad): make __hash__ and __eq__ of specials depend on the concrete class, and not on the string
      • Required: there can be only one of each special in the Vocab (natural)
      • Checking for PAD in stoi would then essentially test stoi[idx] == PAD.__class__ instead of stoi[idx] == str(PAD) (illustrative)
  • Alternative: iterate over the stored specials and match on type
for special in self.specials:
    if type(special) is PAD:
        return self.stoi[special]
    • Requires storing all specials as an attribute (probably nicer)
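
To make the two discussion points concrete, here is a minimal sketch of the proposed hierarchy together with the type-based lookup from the alternative. Everything in it is an assumption for illustration: `TinyVocab`, `MaskDigits`, the `"<num>"` string, and the `specials` attribute are hypothetical names, and the real Vocab would look different.

```python
class Special(str):
    """Base class: marks a string as a special."""

    def apply(self, sequence_or_token):
        raise NotImplementedError


class CoreSpecial(Special):
    """Behavior is hardcoded in the vocab; apply is a no-op."""

    def apply(self, sequence_or_token):
        return sequence_or_token


class TokenSpecial(Special):
    """Applied to one token at a time."""


class SequenceSpecial(Special):
    """Applied to a whole sequence."""


class PAD(CoreSpecial):
    pass


class UNK(CoreSpecial):
    pass


class EOS(SequenceSpecial):
    def apply(self, sequence):
        return list(sequence) + [self]


class MaskDigits(TokenSpecial):
    # Hypothetical token-level special: replaces numeric tokens with itself
    def apply(self, token):
        return self if token.isdigit() else token


class TinyVocab:
    """Hypothetical stand-in for Vocab; keeps specials as an attribute."""

    def __init__(self, specials, tokens):
        self.specials = list(specials)
        self.itos = self.specials + list(tokens)
        self.stoi = {tok: idx for idx, tok in enumerate(self.itos)}

    def padding_index(self):
        # Find PAD by concrete type, whatever string the user chose
        for special in self.specials:
            if type(special) is PAD:
                return self.stoi[special]
        raise ValueError("Vocab has no PAD special")


vocab = TinyVocab([UNK("<unk>"), PAD("[PAD]")], ["a", "b"])
print(vocab.padding_index())

eos, num = EOS("</s>"), MaskDigits("<num>")
tokens = [num.apply(t) for t in ["room", "42"]]
print(" ".join(eos.apply(tokens)))
```

Because the specials keep the default str hash and equality here, `vocab.stoi["[PAD]"]` still resolves with a plain string. Switching `__hash__` and `__eq__` to the concrete class, as in the first proposal, would give that up in exchange for lookups that do not depend on the chosen string.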
