How do we fine-tune LLaVA for object detection, or for predicting a trajectory of actions? How would that work? I assume I would need a regression-based loss such as MSE, and that instead of outputting text the model would output a set of coordinates. From other repos, though, it seems that fine-tuning for regression tasks doesn't work well.
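To make the regression idea concrete, here is a minimal, framework-agnostic sketch of the kind of loss in question. It assumes the model has somehow been given a head that emits four floats per box, which is an assumption and not something LLaVA provides out of the box. The names `normalize_box` and `mse_loss` are hypothetical helpers written for illustration, not part of any LLaVA codebase.

```python
from typing import List, Sequence


def normalize_box(box: Sequence[float], width: float, height: float) -> List[float]:
    """Scale a pixel-space (x1, y1, x2, y2) box into the [0, 1] range.

    Normalizing keeps regression targets on the same scale
    regardless of image resolution.
    """
    x1, y1, x2, y2 = box
    return [x1 / width, y1 / height, x2 / width, y2 / height]


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and target coordinates."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Hypothetical example: a 256x256 image with one ground-truth box.
target = normalize_box((64, 32, 192, 160), width=256, height=256)

# Stand-in for the four floats a regression head would output (assumed).
pred = [0.5, 0.5, 0.5, 0.5]

loss = mse_loss(pred, target)
```

In an actual training loop the same computation would run on tensors so that gradients can flow back into the model; this sketch only shows what the targets and the loss would look like.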