Shared memory for the slice & cat operators is cool, but it introduces too much complexity in the code. The idea behind it was that slice & cat would be used intensively inside other functions such as reduce(axis).
However, the complexity it introduces seems too high compared to the advantages it brings (i.e. the iterator etc.). Plan: move the current main branch to a side branch and rework slice & cat, removing the range attribute from tensors. Maybe create a TensorView struct for everything related to the iterator.
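
A rough sketch of what that split could look like, in Rust. Everything here is an assumption made for illustration, not the project's actual code: the names `Tensor`, `TensorView`, `view`, `slice` and `cat0`, the `f32` element type, and the row-major layout are all hypothetical. The point is only that the tensor owns plain contiguous data with no range attribute, while the per-axis ranges and the iteration logic live in a separate borrowed view.

```rust
use std::ops::Range;

/// Hypothetical owning tensor: contiguous row-major data, no `range` attribute.
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub shape: Vec<usize>,
    pub data: Vec<f32>,
}

/// Hypothetical borrowed window over a tensor. All range and iterator
/// state lives here instead of on `Tensor`.
pub struct TensorView<'a> {
    tensor: &'a Tensor,
    ranges: Vec<Range<usize>>, // one half-open range per axis
}

impl Tensor {
    pub fn new(shape: Vec<usize>, data: Vec<f32>) -> Self {
        assert_eq!(shape.iter().product::<usize>(), data.len());
        Tensor { shape, data }
    }

    /// Borrow a window of the tensor without copying any data.
    pub fn view(&self, ranges: Vec<Range<usize>>) -> TensorView<'_> {
        assert_eq!(ranges.len(), self.shape.len());
        for (r, &dim) in ranges.iter().zip(&self.shape) {
            assert!(r.start <= r.end && r.end <= dim);
        }
        TensorView { tensor: self, ranges }
    }

    /// Copying slice: the result owns its data and shares nothing.
    pub fn slice(&self, ranges: Vec<Range<usize>>) -> Tensor {
        self.view(ranges).to_tensor()
    }

    /// Copying concatenation along axis 0 (simplest case for row-major data).
    pub fn cat0(a: &Tensor, b: &Tensor) -> Tensor {
        assert!(!a.shape.is_empty() && a.shape.len() == b.shape.len());
        assert_eq!(&a.shape[1..], &b.shape[1..]);
        let mut shape = a.shape.clone();
        shape[0] += b.shape[0];
        let mut data = a.data.clone();
        data.extend_from_slice(&b.data);
        Tensor::new(shape, data)
    }
}

impl<'a> TensorView<'a> {
    /// Shape of the viewed window.
    pub fn shape(&self) -> Vec<usize> {
        self.ranges.iter().map(|r| r.end - r.start).collect()
    }

    /// Iterate over the viewed elements in row-major order.
    pub fn iter(&self) -> impl Iterator<Item = f32> + 'a {
        let tensor = self.tensor;
        let shape = self.shape();
        let starts: Vec<usize> = self.ranges.iter().map(|r| r.start).collect();
        let total: usize = shape.iter().product();
        (0..total).map(move |mut flat| {
            // Decompose the flat view index from the last axis to the first,
            // accumulating the offset into the parent's row-major data.
            let mut offset = 0;
            let mut stride = 1;
            for axis in (0..shape.len()).rev() {
                let idx = flat % shape[axis] + starts[axis];
                flat /= shape[axis];
                offset += idx * stride;
                stride *= tensor.shape[axis];
            }
            tensor.data[offset]
        })
    }

    /// Materialize the view into an owning tensor.
    pub fn to_tensor(&self) -> Tensor {
        Tensor::new(self.shape(), self.iter().collect())
    }
}
```

With this layout, a function like reduce(axis) could take a `TensorView` when it only needs to read a window, and call `slice` or `cat0` when it needs an independent tensor. The trade-off is an extra copy in the owning path in exchange for a `Tensor` type that carries no range bookkeeping.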