Machine Learning Tasks for Long Texts using BELT (BERT for Longer Texts)
To make public datasets more suitable for training and fine-tuning LLMs (Large Language Models), they need to be presented in a consistent, structured format. The DatasetFormatter class below is a skeleton for doing so. This notebook also shows how to handle texts that exceed the maximum number of tokens a transformer can accept, using the BELT model for classification tasks, and how to compute the vector similarity between two documents from scratch.
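As a rough illustration of the long-text idea, the sketch below splits a token sequence into overlapping windows that fit a model's input limit, then compares two documents by the cosine similarity of their averaged chunk vectors. It is a minimal sketch under stated assumptions, not the BELT library's API: the function names `split_into_chunks`, `mean_pool`, and `cosine_similarity` are hypothetical, the default `chunk_size=510` assumes a 512-token BERT limit minus two special tokens, and `stride=255` is an arbitrary choice of overlap. The chunk vectors are assumed to come from some embedding model that is not shown here.

```python
import math
from typing import List, Sequence


def split_into_chunks(
    token_ids: Sequence[int], chunk_size: int = 510, stride: int = 255
) -> List[List[int]]:
    """Split token ids into overlapping windows of at most chunk_size tokens.

    Hypothetical helper: chunk_size and stride are illustrative defaults.
    """
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    if len(token_ids) == 0:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + chunk_size]))
        # Stop once the current window reaches the end of the sequence.
        if start + chunk_size >= len(token_ids):
            break
        start += stride
    return chunks


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-chunk vectors element-wise into one document vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In this sketch, each chunk would be embedded separately and the resulting vectors passed to `mean_pool` before calling `cosine_similarity`. Mean pooling is only one possible aggregation; taking the maximum over chunks is another common choice.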