Tensor masking

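Using tensor_iterator to iterate over a slice of a tensor is quite slow. This is due to the complex index arithmetic needed for each element and the non-contiguity of the slice's data in memory. Masking the tensor might be a better approach: the operation then runs over the whole contiguous tensor, and the mask selects which elements contribute. It is also easier to parallelize operations on masked tensors than on a custom iterator.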
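The difference between the two approaches can be sketched as follows. This is only an illustration under stated assumptions: it uses NumPy as a stand-in, it does not reproduce tensor_iterator, and the function names sum_slice_by_iteration and sum_slice_by_mask are hypothetical.

```python
import numpy as np

def sum_slice_by_iteration(t, rows, cols):
    """Element-by-element loop over a slice (stand-in for a custom iterator)."""
    total = 0.0
    for i in range(rows.start, rows.stop):
        for j in range(cols.start, cols.stop):
            # Each access recomputes a multi-dimensional index.
            total += t[i, j]
    return total

def sum_slice_by_mask(t, rows, cols):
    """Same reduction, expressed as a mask over the whole contiguous tensor."""
    mask = np.zeros(t.shape, dtype=bool)
    mask[rows, cols] = True
    # Selected elements are kept, all others are replaced by 0, so the
    # reduction is a single vectorized pass over the full tensor.
    return np.where(mask, t, 0.0).sum()

# Hypothetical example data: a 4x6 tensor and a 2x3 slice of it.
t = np.arange(24, dtype=np.float64).reshape(4, 6)
rows, cols = slice(1, 3), slice(2, 5)

iter_result = sum_slice_by_iteration(t, rows, cols)
mask_result = sum_slice_by_mask(t, rows, cols)
```

Both functions return the same value. Note that the masked version touches every element of the tensor, not only the slice, so whether it is actually faster likely depends on how large the slice is relative to the full tensor.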