VacoCam uses a Visual Large Language Model to generate focused sports video from unfocused raw footage.
I learned how to maintain an object-detection dataset and how to train and evaluate a custom YOLOv8 model.
The main challenge was deciding how to present the tracked information to the VLLM and how to evaluate the performance of different approaches.
I created a custom benchmark that let me compare methods, and I read many papers on arXiv for inspiration.
I started this project knowing absolutely nothing about LLMs or AI models in general.
I used Label Studio and Weights & Biases, and learned Python along the way.
This was my first introduction to the field, and I challenged myself to learn through a moderately ambitious project.
Built with YOLOv8 for object detection, Gemini 1.0 for visual understanding, and ffmpeg for video rendering.
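To illustrate the kind of glue between detection and rendering, here is a minimal sketch of one way to turn per-frame detection centers (e.g. from a YOLOv8 tracker) into a smoothed virtual-camera crop for ffmpeg. All names and parameters here are illustrative assumptions, not VacoCam's actual code:

```python
def smooth_path(centers, alpha=0.2):
    """Exponentially smooth a list of (x, y) detection centers
    so the virtual camera does not jitter frame to frame."""
    smoothed = []
    x, y = centers[0]
    for cx, cy in centers:
        x = alpha * cx + (1 - alpha) * x
        y = alpha * cy + (1 - alpha) * y
        smoothed.append((x, y))
    return smoothed

def crop_filter(center, crop_w, crop_h, frame_w, frame_h):
    """Build an ffmpeg crop filter string centered on `center`,
    clamped so the crop window stays inside the source frame."""
    x = min(max(center[0] - crop_w / 2, 0), frame_w - crop_w)
    y = min(max(center[1] - crop_h / 2, 0), frame_h - crop_h)
    return f"crop={crop_w}:{crop_h}:{x:.0f}:{y:.0f}"

# e.g. a 720p crop centered on (960, 540) in a 1080p frame:
# crop_filter((960, 540), 1280, 720, 1920, 1080) -> "crop=1280:720:320:180"
```

The filter string can then be passed to ffmpeg as `-vf "crop=..."` per segment; the smoothing factor trades responsiveness against camera steadiness.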