Vocab specials #230
Removing punctuation as a posttokenization hook
-----------------------------------------------

We will now similarly define a posttokenization hook to remove punctuation. We will use the punctuation list from Python's built-in ``string`` module, which we will store as an attribute of our hook.

.. code-block:: python

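The hook code itself is elided in this diff hunk. A minimal standalone sketch of such a posttokenization hook follows; the class name and the ``(raw, tokenized)`` hook signature here are assumptions for illustration, not the guide's exact code:

```python
import string

# Hypothetical sketch of a punctuation-removal posttokenization hook.
class RemovePunctuation:
    def __init__(self):
        # Store the punctuation list as an attribute, as described above.
        self.punct = set(string.punctuation)

    def __call__(self, raw, tokenized):
        # Posttokenization hooks receive the raw and tokenized data and
        # return both; here we filter out punctuation-only tokens.
        return raw, [tok for tok in tokenized if tok not in self.punct]

raw, tokens = RemovePunctuation()(None, ["Hello", ",", "world", "!"])
```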
We have prepared a number of predefined hooks which are ready for you to use. You can see them here: :ref:`predefined-hooks`.

.. _specials:

Special tokens
==============
We have mentioned special tokens earlier, but now is the time to elaborate on what exactly they are. In Podium, each special token is a subclass of the Python ``str`` type which also encapsulates the functionality for adding that special token to the tokenized sequence. The ``Vocab`` handles special tokens differently -- each special token is guaranteed a place in the ``Vocab``, which is what makes them... *special*.

Since special tokens were designed to be extensible, we will take a brief look at how they are implemented, so we can better understand how to use them. We mentioned that each special token is a subclass of the Python string, but there is an intermediary -- the :class:`podium.storage.vocab.Special` base class. The ``Special`` base class implements the following functionality, while still being an instance of a string:

1. Extends the constructor of the special token with default-value functionality. The default value for each special token should be set via the ``default_value`` class attribute; if another value is passed upon creation, that value is used instead.
2. Adds a stub ``apply`` method which accepts a sequence of tokens and adds the special token to that sequence. In essence, the ``apply`` method is a post-tokenization hook (applied to the tokenized sequence after all other post-tokenization hooks) which doesn't see the raw data and whose job is to add the special token to the sequence or replace some of the existing tokens with it. The special tokens are applied after all post-tokenization hooks, in the order they are passed to the :class:`podium.storage.vocab.Vocab` constructor. Each concrete implementation of a special token has to implement this method.
3. Implements singleton-like hash and equality checks. The ``Special`` class overrides the default hash and equality methods and, instead of checking for string-value equality, checks for *class-name equality*. We use this type of check to ensure that each ``Vocab`` has a single instance of each special token, and to simplify referencing and containment checks.

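The three behaviors above can be sketched in a standalone snippet (illustrative only, not Podium's actual ``Special`` implementation):

```python
# Standalone sketch of the Special base class behaviors described above.
class Special(str):
    default_value = None

    def __new__(cls, token=None):
        # (1) Fall back to the class-level default value when no token is given.
        return super().__new__(cls, token if token is not None else cls.default_value)

    def apply(self, sequence):
        # (2) Stub: concrete specials add their token to the tokenized sequence.
        raise NotImplementedError

    # (3) Singleton-like behavior: hash and equality are keyed to the class
    # name rather than the string value.
    def __hash__(self):
        return hash(type(self).__name__)

    def __eq__(self, other):
        return type(self).__name__ == type(other).__name__

class UNK(Special):
    default_value = "<UNK>"

default, custom = UNK(), UNK("<unknown>")
```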
There are a number of special tokens used throughout NLP for various purposes. The most frequently used are the unknown token (UNK), which is used as a catch-all substitute for tokens which are not present in the vocabulary, and the padding token (PAD), which is used to pack variable-length sequences into fixed-size batch tensors.
Alongside these two, common special tokens include the beginning-of-sequence and end-of-sequence tokens (BOS, EOS), the separator token (SEP), and the mask token introduced in BERT (MASK).

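As a small illustration of why the padding token exists (plain Python, not Podium's API), padding packs variable-length sequences into equal-length rows:

```python
# Illustrative only: a padding token lets variable-length token sequences
# be packed into equal-length rows, ready for a fixed-size batch tensor.
PAD = "<PAD>"

def pad_batch(sequences):
    # Pad every sequence to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [PAD] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([["a", "slick", "melodrama"], ["engrossing"]])
```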
To better understand how specials work, we will walk through the implementation of one of the special tokens implemented in Podium: the beginning-of-sequence (BOS) token.

Collaborator: Do you maybe happen to know a resource which contains typical Specials used in NLP we could link here? After a quick Google search I could not find one.

Author: Vocabs in transformers (or tokenizers? not sure where they delegated the vocab) had quite a large number of reserved tokens.

Collaborator: Yes, but this is the best I could find: https://huggingface.co/transformers/main_classes/tokenizer.html#pretrainedtokenizer

.. code-block:: python

    >>> from podium.storage.vocab import Special
    >>> class BOS(Special):
    ...     default_value = "<BOS>"
    ...
    ...     def apply(self, sequence):
    ...         # Prepend to the sequence
    ...         return [self] + sequence
    ...
    >>> bos = BOS()
    >>> print(bos)
    <BOS>
This code block is the full implementation of a special token! All we needed to do was set the default value and implement the ``apply`` method. The default value is ``None`` by default; if it is not set, you have to make sure a value is passed upon construction, like so:

.. code-block:: python

    >>> my_bos = BOS("<MY_BOS>")
    >>> print(my_bos)
    <MY_BOS>
    >>> print(bos == my_bos)
    True
We can also see that, although we have changed the string representation of the special token, the equality check still returns ``True`` due to the ``Special`` base class overrides mentioned earlier.

To see the effect of the ``apply`` method, we will once again take a look at the SST dataset:

.. code-block:: python

    >>> from podium import Vocab, Field, LabelField
    >>> from podium.datasets import SST
    >>>
    >>> vocab = Vocab(specials=(bos,))
    >>> text = Field(name='text', numericalizer=vocab)
    >>> label = LabelField(name='label')
    >>> fields = {'text': text, 'label': label}
    >>> sst_train, sst_test, sst_dev = SST.get_dataset_splits(fields=fields)
    >>> print(sst_train[222].text)
    (None, ['<BOS>', 'A', 'slick', ',', 'engrossing', 'melodrama', '.'])
Here we can see that the special token was indeed added to the beginning of the tokenized sequence.

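The same effect can be reproduced by calling ``apply`` directly on a token list. The snippet below uses a standalone stand-in for the ``BOS`` class defined earlier, so it runs without Podium installed:

```python
class BOS(str):
    # Standalone stand-in for the BOS special shown earlier.
    def __new__(cls, token="<BOS>"):
        return super().__new__(cls, token)

    def apply(self, sequence):
        # Prepend the token, exactly as in the earlier implementation.
        return [self] + sequence

tokens = BOS().apply(["A", "slick", "melodrama", "."])
```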
Finally, it is important to note that there is an implicit distinction between special tokens. The unknown (:class:`podium.storage.vocab.UNK`) and padding (:class:`podium.storage.vocab.PAD`) special tokens are what we refer to as **core** special tokens: their functionality is hardcoded in the implementation of the ``Vocab`` because they are deeply integrated with the way iterators and numericalization work.
The only difference between regular and core specials is that core specials are added to the sequence by other Podium classes (their behavior is hardcoded) rather than by their ``apply`` method.

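As a rough illustration of the unknown token's role (this sketches typical assumed behavior, not Podium's internals), numericalization usually falls back to the UNK index for out-of-vocabulary tokens:

```python
# Hypothetical sketch: out-of-vocabulary tokens fall back to UNK's index
# during numericalization.
UNK = "<UNK>"
stoi = {UNK: 0, "a": 1, "slick": 2, "melodrama": 3}

def numericalize(tokens):
    # Tokens missing from the vocabulary map to the UNK index.
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]

ids = numericalize(["a", "engrossing", "melodrama"])
```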
Custom numericalization functions
=================================

Special tokens
==============
.. autoclass:: podium.vocab.Special
   :members:
   :no-undoc-members:

The unknown token
^^^^^^^^^^^^^^^^^
.. autoclass:: podium.vocab.UNK
   :members:
   :no-undoc-members:

The padding token
^^^^^^^^^^^^^^^^^
.. autoclass:: podium.vocab.PAD
   :members:
   :no-undoc-members:

The beginning-of-sequence token
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: podium.vocab.BOS
   :members:
   :no-undoc-members:

The end-of-sequence token
^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: podium.vocab.EOS
   :members:
   :no-undoc-members: