Hi @Or-Tal @nadav366,
Thank you for providing the no-transformer version of the PAST model.
While trying to train the streamable PAST model, I noticed that no causal mask is applied in the BERT transformer encoder. Has an experiment with a causal mask already been tried? If so, does it reduce performance?
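
To make clear what I mean by a causal mask, here is a minimal sketch. It assumes a generic PyTorch `nn.TransformerEncoder` as a stand-in, not the actual PAST encoder code, and the sizes (`d_model=16`, `seq_len=6`, and so on) are arbitrary values chosen for illustration. It checks that, with the mask, changing the last frame leaves the outputs of earlier frames unchanged, which is the property I would expect a streamable model to need.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the encoder; sizes are arbitrary, not PAST's.
layer = nn.TransformerEncoderLayer(
    d_model=16, nhead=4, dim_feedforward=32, dropout=0.0, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2).eval()

seq_len = 6
# Boolean mask: True marks positions a frame must NOT attend to (its future).
causal_mask = torch.triu(
    torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
)

x = torch.randn(1, seq_len, 16)
x_changed = x.clone()
x_changed[:, -1] = torch.randn(16)  # replace only the last frame

with torch.no_grad():
    # With the causal mask, earlier frames cannot see the last frame.
    out_a = encoder(x, mask=causal_mask)
    out_b = encoder(x_changed, mask=causal_mask)
    # Without a mask, attention is bidirectional.
    full_a = encoder(x)
    full_b = encoder(x_changed)

# Earlier outputs are identical under the causal mask ...
causal_ok = torch.allclose(out_a[:, :-1], out_b[:, :-1], atol=1e-5)
# ... but differ without it, because the future leaks into the past.
bidirectional_leaks = not torch.allclose(
    full_a[:, :-1], full_b[:, :-1], atol=1e-5
)
print(causal_ok, bidirectional_leaks)
```

Without such a mask, each frame's output depends on later frames, so I am not sure how the encoder can run in a streaming setting.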
Thanks