Research Scientist - (CIFRE PhD) - (X/F/M)
👉 THE SUBJECT
Moments Lab is a French company pioneering video understanding for media, sports and entertainment, with offices in New York and Paris. Its flagship technology, MXT, solves multiple problems in today’s media workflows: it describes and categorizes videos at various levels, automatically extracts chapters and finds highlights in the wild.
This industrial PhD, in close partnership with the Institut Polytechnique de Paris, is an opportunity to lay the groundwork for future iterations of video understanding systems, in which VLMs play a crucial role.
Indeed, Vision-Language Models (VLMs) have significantly enhanced multimodal understanding, particularly in the context of video indexing, enabling precise question answering and rich semantic descriptions.
However, their reliance on visual and textual inputs alone limits their performance in complex, real-world scenarios. Incorporating additional modalities, notably audio and sparse textual metadata such as file annotations, presents an opportunity to improve accuracy. More recently, small VLMs have shown increasingly strong performance while ensuring efficient and scalable video processing (Marafioti et al., 2025).
This work will be an opportunity either to propose a new small architecture or to explore an omni-modal approach (Chen et al., 2024) that integrates audio streams and sparse textual metadata into existing vision-language models. Among the desired capabilities of the resulting model will be the temporal localization (Liu et al., 2025) of certain events within a video: actions, emotions, transitions, etc.
The main goal is to reach state-of-the-art performance across video captioning, video question answering and video chaptering with an efficient architecture. Established benchmarks such as VideoMME, TempCompass and TimeScope will be considered, as well as potential new benchmarks created during the PhD.
The PhD student will additionally investigate how the proposed system can use meta-learning to seamlessly adapt to new tasks at inference time, without needing to store a separate model for each task. Particular emphasis will be placed on enabling person re-identification (Hill et al., 2025) within a long video while performing captioning (Han et al., 2024).
Short bibliography
- Marafioti, A., Zohar, O., Farré, M., Noyan, M., Bakouch, E., Cuenca, P., Zakka, C., Allal, L.B., Lozhkov, A., Tazi, N. and Srivastav, V., 2025. SmolVLM: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299. https://arxiv.org/pdf/2504.05299
- Chen, L., Hu, H., Zhang, M., Chen, Y., Wang, Z., Li, Y., Shyam, P., Zhou, T., Huang, H., Yang, M.H. and Gong, B., 2024. OmnixR: Evaluating omni-modality language models on reasoning across modalities. arXiv preprint arXiv:2410.12219. https://arxiv.org/pdf/2410.12219v1
- Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., Xi, H., Cao, S., Gu, Y., Li, D. and Li, X., 2025. NVILA: Efficient frontier visual language models. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 4122-4134). https://arxiv.org/pdf/2412.04468
- Hill, C., Yellin, F., Regmi, K., Du, D. and McCloskey, S., 2025. Re-identifying people in video via learned temporal attention and multi-modal foundation models. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (pp. 6259-6268). IEEE. https://openaccess.thecvf.com/content/WACV2025/papers/Hill_Re-Identifying_People_in_Video_via_Learned_Temporal_Attention_and_Multi-Modal_WACV_2025_paper.pdf
- Han, T., Bain, M., Nagrani, A., Varol, G., Xie, W. and Zisserman, A., 2024. AutoAD III: The prequel - back to the pixels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 18164-18174). https://www.robots.ox.ac.uk/~vgg/research/autoad/
👉 SUPERVISION
Academic supervisor: Prof. Mounim El Yacoubi
Industry supervisor: Dr. Yannis Tevissen
👉 HOW OUR REMOTE SETUP WORKS
- Your first days at Moments Lab will be based in the office for onboarding, shadowing, and meeting key colleagues.
- After that, you have the flexibility to choose how you work: coming into the office every day, working fully remotely, or finding a balance that suits you, such as a local co-working space or a café. Our priority is that you feel fulfilled and can perform at your best.
- That being said, while we have a strong remote policy, this role requires frequent visits to our Paris office. We also value in-person connections and come together a few times a year for both work and social events, including our annual two-day “Out of the Lab” meetup.
👉 THE CONDITIONS OF EMPLOYMENT
- This job is based in Paris 🇫🇷 with regular academic meetings in Palaiseau
- Apart from these recurring meetings with the academic partner, remote work is possible at the employee’s convenience
- We offer a range of benefits, including lunch vouchers; you can view them all on this page
- The salary range for this position is between 30k€ and 40k€ yearly gross, depending on the candidate’s experience
THE RECRUITMENT PROCESS 👇
3 steps to join us:
- 📞 First, a 15-minute Google Meet call with our Head of People.
- 📹 The second interview is a meeting with two members of the company, including one from your future team. We will discuss your past experiences and projects before presenting the details of your future position.
- 📽️ The last meeting will be with your manager and the academic partner, and will go more in depth into the subject you will be working on.
🤝 Offer
⏰ We don’t want to leave you hanging, so we aim to take no more than 4 weeks to get to offer stage for this position.
How to apply
There’s no need to write a cover letter, but we’d like you to tell us, in your own way, a bit about who you are, what you like and how you see life. Don’t forget to send us your CV/resume in English.
❌ Please note that any CV/resume not written in English will be automatically rejected. Also, as part of your application, please let us know when you’d be available to start.
- Department: Product & Technology
- Locations: The 🇫🇷 Lab
- Remote status: Hybrid
