7 changes: 3 additions & 4 deletions README.md
@@ -64,7 +64,7 @@ For usage examples see the documentation pages [walkthrough](http://takelab.fer.
Use some of our pre-defined datasets:

```python
>>> from podium.datasets import SST
>>> from podium import SST
>>> sst_train, sst_test, sst_dev = SST.get_dataset_splits()
>>> print(sst_train)
SST({
@@ -93,7 +93,7 @@ Load datasets from [🤗/datasets](https://github.com/huggingface/datasets):

```python

>>> from podium.datasets.hf import HFDatasetConverter
>>> from podium import HFDatasetConverter
>>> import datasets
>>> # Load the huggingface dataset
>>> imdb = datasets.load_dataset('imdb')
@@ -124,8 +124,7 @@ Load datasets from [🤗/datasets](https://github.com/huggingface/datasets):
Load your own dataset from a standardized tabular format (e.g. `csv`, `tsv`, `jsonl`):

```python
>>> from podium.datasets import TabularDataset
>>> from podium import Vocab, Field, LabelField
>>> from podium import Vocab, Field, LabelField, TabularDataset
>>> fields = {'premise': Field('premise', numericalizer=Vocab()),
... 'hypothesis': Field('hypothesis', numericalizer=Vocab()),
... 'label': LabelField('label')}
22 changes: 8 additions & 14 deletions docs/source/advanced.rst
@@ -1,8 +1,6 @@
.. testsetup:: *

from podium import Field, LabelField, Vocab, Iterator, TabularDataset
from podium.datasets import SST
from podium.vectorizers import GloVe, TfIdfVectorizer
from podium import Field, LabelField, Vocab, Iterator, TabularDataset, SST, GloVe, TfIdfVectorizer

The Podium data flow
====================
@@ -14,7 +12,7 @@ The data is processed immediately when the instance is loaded from disk and then

.. doctest:: sst_field

>>> from podium.datasets import SST
>>> from podium import SST
>>> sst_train, sst_test, sst_dev = SST.get_dataset_splits()
>>> print(sst_train[222])
Example({'text': (None, ['A', 'slick', ',', 'engrossing', 'melodrama', '.']), 'label': (None, 'positive')})
@@ -159,7 +157,7 @@ To better understand how specials work, we will walk through the implementation

.. doctest:: specials

>>> from podium.vocab import Special
>>> from podium import Special
>>> class BOS(Special):
... default_value = "<BOS>"
...
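The subclassing pattern above can be sketched in self-contained Python (a simplified stand-in for illustration only; the class below and its ``apply`` hook are hypothetical, not Podium's actual implementation):

```python
# Simplified sketch of a BOS-style special token (not Podium's actual class).
class Special(str):
    default_value = None

    def __new__(cls, value=None):
        # Fall back to the subclass-level default when no value is given.
        return super().__new__(cls, value if value is not None else cls.default_value)

    def apply(self, sequence):
        # Default behavior in this sketch: prepend the special to a tokenized sequence.
        return [str(self)] + list(sequence)

class BOS(Special):
    default_value = "<BOS>"

bos = BOS()
print(bos)                    # <BOS>
print(bos.apply(["a", "b"]))  # ['<BOS>', 'a', 'b']
```

Since the special is itself a string subclass, it can live directly inside a vocabulary's token list while still carrying behavior.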
@@ -187,8 +185,7 @@ To see the effect of the ``apply`` method, we will once again take a look at the

.. doctest:: specials

>>> from podium import Vocab, Field, LabelField
>>> from podium.datasets import SST
>>> from podium import Vocab, Field, LabelField, SST
>>>
>>> vocab = Vocab(specials=(bos,))
>>> text = Field(name='text', numericalizer=vocab)
@@ -236,8 +233,7 @@ We have so far covered the case where you have a single input column, tokenize a

.. doctest:: multioutput

>>> from podium import Vocab, Field, LabelField
>>> from podium.datasets import SST
>>> from podium import Vocab, Field, LabelField, SST
>>> char = Field(name='char', numericalizer=Vocab(), tokenizer=list)
>>> text = Field(name='word', numericalizer=Vocab())
>>> label = LabelField(name='label')
@@ -303,8 +299,7 @@ For this reason, usage of :class:`podium.datasets.BucketIterator` is recommended

.. code-block:: python

>>> from podium import Vocab, Field, LabelField
>>> from podium.datasets import SST, IMDB
>>> from podium import Vocab, Field, LabelField, SST, IMDB
>>> vocab = Vocab()
>>> text = Field(name='text', numericalizer=vocab)
>>> label = LabelField(name='label')
@@ -343,7 +338,7 @@ The ``bucket_sort_key`` function defines how the instances in the dataset should
For Iterator, padding = 148141 out of 281696 = 52.588961149608096%
For BucketIterator, padding = 2125 out of 135680 = 1.5661851415094339%
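The shape of those numbers can be reproduced in spirit with a standalone sketch (toy sentence lengths, no Podium dependency): batching in the original order versus batching after sorting by length, which is essentially what a bucketing iterator does.

```python
# Toy illustration of why bucketing by length reduces padding.
def padding_stats(lengths, batch_size):
    total, padding = 0, 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        width = max(batch)  # every instance is padded to the batch maximum
        total += width * len(batch)
        padding += sum(width - n for n in batch)
    return padding, total

lengths = [3, 50, 4, 47, 5, 52, 6, 49]  # alternating short/long sentences
pad_plain, tot_plain = padding_stats(lengths, batch_size=2)
pad_sorted, tot_sorted = padding_stats(sorted(lengths), batch_size=2)
print(f"plain:  padding = {pad_plain} out of {tot_plain}")    # 180 out of 396
print(f"sorted: padding = {pad_sorted} out of {tot_sorted}")  # 6 out of 222
```

Sorting groups similarly-sized instances together, so each batch pads far less, and the total number of processed tokens shrinks as well.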

As we can see, the difference between using a regular Iterator and a BucketIterator is massive. Not only do we reduce the amount of padding, we have reduced the total amount of tokens processed by about 50%. The SST dataset, however, is a relatively small dataset so this experiment might be a bit biased. Let's take a look at the same statistics for the :class:`podium.datasets.IMDB` dataset. After changing the highligted data loading line in the first snippet to:
As we can see, the difference between using a regular ``Iterator`` and a ``BucketIterator`` is massive. Not only do we reduce the amount of padding, we have reduced the total number of tokens processed by about 50%. The SST dataset, however, is a relatively small dataset so this experiment might be a bit biased. Let's take a look at the same statistics for the :class:`podium.datasets.IMDB` dataset. After changing the highlighted data loading line in the first snippet to:

.. code-block:: python

@@ -374,8 +369,7 @@ As an example, we will again turn to the SST dataset and some of our previously
.. doctest:: saveload
:options: +NORMALIZE_WHITESPACE

>>> from podium import Vocab, Field, LabelField
>>> from podium.datasets import SST
>>> from podium import Vocab, Field, LabelField, SST
>>>
>>> vocab = Vocab(max_size=5000, min_freq=2)
>>> text = Field(name='text', numericalizer=vocab)
5 changes: 2 additions & 3 deletions docs/source/faq.rst
@@ -9,7 +9,7 @@ FAQ

.. code-block:: python

>>> from podium.datasets import SST
>>> from podium import SST
>>> sst_train, sst_test, sst_dev = SST.get_dataset_splits()
>>> x, y = sst_train.batch()
>>> print(x.text.shape, y.label.shape, sep='\n')
@@ -20,8 +20,7 @@ Be aware that you will get a dataset as a matrix by default -- meaning that all

.. code-block:: python

>>> from podium.datasets import SST
>>> from podium import Vocab, Field, LabelField
>>> from podium import Vocab, Field, LabelField, SST
>>> text = Field(name='text', numericalizer=Vocab(), disable_batch_matrix=True)
>>> label = LabelField(name='label')
>>> fields = {'text':text, 'label':label}
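What the ``disable_batch_matrix`` flag changes can be illustrated with plain Python padding logic (a toy contrast, not Podium's implementation): a matrix batch pads every row to the longest instance, while a list batch keeps ragged rows.

```python
# Toy contrast: matrix-style batch (padded) vs list-style batch (ragged rows).
def batch_as_matrix(rows, pad=0):
    width = max(len(r) for r in rows)
    return [r + [pad] * (width - len(r)) for r in rows]

def batch_as_list(rows):
    return [list(r) for r in rows]  # lengths preserved, no padding added

rows = [[1, 2, 3], [4], [5, 6]]
print(batch_as_matrix(rows))  # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
print(batch_as_list(rows))    # [[1, 2, 3], [4], [5, 6]]
```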
5 changes: 2 additions & 3 deletions docs/source/preprocessing.rst
@@ -43,8 +43,7 @@ Regex Replace

.. code-block:: python

>>> from podium import Field, LabelField, Vocab
>>> from podium.datasets import SST
>>> from podium import Field, LabelField, Vocab, SST
>>>
>>> text = Field('text', numericalizer=Vocab())
>>> label = LabelField('label')
@@ -123,7 +122,7 @@ Truecase

.. code-block:: python

>>> from podium.preproc import truecase
>>> from podium import truecase
>>> apply_truecase = truecase(oov='as-is')
>>> print(apply_truecase('hey, what is the weather in new york?'))
Hey, what is the weather in New York?
17 changes: 7 additions & 10 deletions docs/source/walkthrough.rst
@@ -1,9 +1,7 @@

.. testsetup:: *

from podium import Field, LabelField, Vocab, Iterator, TabularDataset
from podium.datasets import SST
from podium.vectorizers import GloVe, TfIdfVectorizer
from podium import Field, LabelField, Vocab, Iterator, TabularDataset, SST, GloVe, TfIdfVectorizer


Walkthrough
@@ -29,7 +27,7 @@ One built-in dataset available in Podium is the `Stanford Sentiment Treebank <ht
.. doctest:: sst
:options: +NORMALIZE_WHITESPACE

>>> from podium.datasets import SST
>>> from podium import SST
>>> sst_train, sst_test, sst_valid = SST.get_dataset_splits() # doctest:+ELLIPSIS
>>> print(sst_train)
SST({
@@ -100,7 +98,7 @@ This way, we can define a static dictionary which we might have obtained on anot

.. doctest:: custom_vocab

>>> from podium.vocab import UNK
>>> from podium import UNK
>>> custom_itos = [UNK(), 'this', 'is', 'a', 'sample']
>>> vocab = Vocab.from_itos(custom_itos)
>>> print(vocab)
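The idea behind building a vocabulary from a fixed itos list can be sketched without Podium (a minimal, hypothetical vocab where index 0 stands in for the unknown token):

```python
# Minimal vocab built from a fixed itos list; unknown words map to the UNK index.
UNK_TOKEN = "<UNK>"

def make_vocab(itos):
    stoi = {token: index for index, token in enumerate(itos)}
    unk_index = stoi[UNK_TOKEN]
    # Return a numericalizer closure over the frozen mapping.
    return lambda tokens: [stoi.get(t, unk_index) for t in tokens]

numericalize = make_vocab([UNK_TOKEN, "this", "is", "a", "sample"])
print(numericalize(["this", "is", "unseen"]))  # [1, 2, 0]
```

Because the itos list is fixed up front, the mapping is deterministic and reusable across datasets, which is the point of supplying a predefined dictionary.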
@@ -285,7 +283,7 @@ The output of the function call is a numpy matrix of word embeddings which you c

.. code-block:: python

>>> from podium.vectorizers import GloVe
>>> from podium import GloVe
>>> vocab = fields['text'].vocab
>>> glove = GloVe()
>>> embeddings = glove.load_vocab(vocab)
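The essential contract here, that row *i* of the returned matrix is the vector for vocabulary entry *i*, can be sketched with a toy pretrained table (made-up two-dimensional vectors, no download; not GloVe's real data):

```python
# Toy stand-in for aligning pretrained vectors with a vocabulary.
pretrained = {                 # pretend this was parsed from an embeddings file
    "cat": [0.1, 0.2],
    "dog": [0.3, 0.4],
}

def load_vocab(itos, dim=2):
    # Row i holds the vector for itos[i]; out-of-table words get zero vectors.
    return [pretrained.get(token, [0.0] * dim) for token in itos]

embeddings = load_vocab(["cat", "unseen", "dog"])
print(embeddings)  # [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]
```

This row alignment is what lets you feed numericalized indices straight into an embedding lookup layer.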
@@ -308,8 +306,7 @@ As we intend to use the whole dataset at once, we will also set ``disable_batch_

.. doctest:: vectorizer

>>> from podium.datasets import SST
>>> from podium import Vocab, Field, LabelField
>>> from podium import Vocab, Field, LabelField, SST
>>> vocab = Vocab(max_size=5000)
>>> text = Field(name='text', numericalizer=vocab, disable_batch_matrix=True)
>>> label = LabelField(name='label')
@@ -320,7 +317,7 @@ Since the Tf-Idf vectorizer needs information from the dataset to compute the in

.. doctest:: vectorizer

>>> from podium.vectorizers.tfidf import TfIdfVectorizer
>>> from podium import TfIdfVectorizer
>>> tfidf_vectorizer = TfIdfVectorizer()
>>> tfidf_vectorizer.fit(dataset=sst_train, field=text)
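The fit-then-transform flow can be sketched in plain Python (a standard smoothed tf-idf; Podium's exact weighting scheme may differ): ``fit`` learns document frequencies from the corpus, and ``transform`` weights a document's term counts by the learned inverse document frequencies.

```python
import math

# Minimal tf-idf: fit learns document frequencies, transform weights counts.
def fit(docs):
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    # Smoothed idf, as in common formulations.
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def transform(doc, idf):
    return {term: doc.count(term) * idf.get(term, 0.0) for term in set(doc)}

docs = [["good", "movie"], ["bad", "movie"], ["good", "plot"]]
idf = fit(docs)
weights = transform(["good", "good", "movie"], idf)
print(weights)
```

This split is why ``fit`` must see the training dataset before any document can be transformed.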

@@ -433,7 +430,7 @@ You can load a dataset in 🤗/datasets and then convert it to a Podium dataset

.. code-block:: python

>>> from podium.datasets.hf import HFDatasetConverter
>>> from podium import HFDatasetConverter
>>> import datasets
>>> # Loading a huggingface dataset returns an instance of DatasetDict
>>> # which contains the dataset splits (usually: train, valid, test,