
Add video support #430 (Draft)

iejMac wants to merge 28 commits into main

Conversation

@iejMac (Contributor) commented Feb 15, 2023

The current plan is to use a ViViT architecture (factorised-encoder variant) with the image and text encoders initialized from CLIP weights. We could also opt to train this CoCa-style; in that case we'd initialize the image encoder and the entire text tower (including the decoder) from CoCa weights, so we'd likely want to train an L/14-size model.

This PR will likely require the addition/adaptation of:

  • vivit.py - the implementation of the ViViT model, built from two stacked transformers out of our transformer.py. It would implement the whole CLIP-like model used in train.py (a rough sketch of the factorised-encoder idea is below this list).
  • data.py - needs a video dataloader
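
For concreteness, here is a minimal sketch of the factorised-encoder idea, with hypothetical module and class names rather than this PR's actual vivit.py: a spatial transformer embeds each frame independently, then a temporal transformer attends across the per-frame embeddings.

```python
import torch
import torch.nn as nn

class FactorisedVideoEncoder(nn.Module):
    """Sketch of ViViT's factorised-encoder variant (Model 2 in the paper)."""

    def __init__(self, spatial: nn.Module, embed_dim: int,
                 temporal_layers: int = 4, heads: int = 8):
        super().__init__()
        self.spatial = spatial  # per-frame image encoder, e.g. a CLIP ViT
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, temporal_layers)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, channels, height, width)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)      # (b * t, C, H, W)
        embeds = self.spatial(frames)     # (b * t, embed_dim) per-frame embeddings
        embeds = embeds.view(b, t, -1)    # (b, t, embed_dim)
        embeds = self.temporal(embeds)    # attend across time only
        return embeds.mean(dim=1)         # mean-pool to one video embedding
```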

Will use this PR to track progress. Lmk if there's a better way of approaching this
@rwightman @rom1504 @mitchellnw

@iejMac (Contributor, Author) commented Feb 15, 2023

Hmmm, I don't see how we can reliably train this, though. For L/14 the max local batch size on a 40GB GPU is a few hundred images, so if a video has 100 frames (at 1 FPS) that gives us a local batch size under 10 for rather short videos, which probably still won't work very well unless we have far more GPUs than we do.
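
To make the arithmetic explicit (illustrative numbers, not measurements):

```python
max_images_per_gpu = 256        # rough L/14 capacity on a 40 GB GPU (assumed)
frames_per_video = 100          # e.g. a 100 s clip sampled at 1 FPS
local_batch_size = max_images_per_gpu // frames_per_video
print(local_batch_size)         # 2 -> far below a useful contrastive batch
```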

@rom1504 (Collaborator) commented Feb 15, 2023 via email

@lucidrains (Contributor) commented Feb 15, 2023

@iejMac nice! i can contribute to this

i believe for video, we can do much more aggressive patch dropout in the beginning. well, if the video does not resemble this lol
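
For reference, a standalone sketch of patch dropout under assumed tensor shapes (not taken from this repo): keep a random subset of patch tokens per sample during training.

```python
import torch

def patch_dropout(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep a random subset of patch tokens per sample.

    tokens: (batch, num_patches, dim); keep_ratio in (0, 1].
    """
    b, n, _ = tokens.shape
    num_keep = max(1, int(n * keep_ratio))
    scores = torch.rand(b, n, device=tokens.device)
    keep_idx = scores.topk(num_keep, dim=-1).indices    # (b, num_keep)
    batch_idx = torch.arange(b, device=tokens.device)[:, None]
    return tokens[batch_idx, keep_idx]                  # (b, num_keep, dim)
```

Dropping, say, 75% of patches shrinks the per-frame token count 4x (and the quadratic attention cost even more), which compounds across 100 frames.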

@iejMac marked this pull request as draft February 16, 2023 01:21
@iejMac (Contributor, Author) commented Feb 16, 2023

@lucidrains cool! I'll start filling out the code a bit today. And yeah good idea with aggressive patch dropout. So currently we have:

  • aggressive patch dropout
  • grad accumulation

as tricks to make this a bit more tractable (a sketch of the accumulation loop follows). Anything that maximizes batch size here will be really important.
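
A minimal sketch of a gradient accumulation loop (hypothetical training-loop code, generic contrastive setup). One caveat: with a contrastive loss, plain accumulation does not by itself enlarge the in-batch negative pool; that takes caching features across micro-batches.

```python
def train_with_accumulation(model, loss_fn, loader, optimizer, accum_steps=8):
    """Simulate a local batch accum_steps times larger than memory allows."""
    optimizer.zero_grad()
    for step, (videos, texts) in enumerate(loader):
        video_emb, text_emb = model(videos, texts)
        loss = loss_fn(video_emb, text_emb) / accum_steps  # scale so grads average
        loss.backward()                                    # grads accumulate in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```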

@iejMac (Contributor, Author) commented Feb 16, 2023

Ah, another thing I want to add here: simultaneous (image, video) - text training, i.e. the final model should handle both temporal sequences and static images. The thing I'm unsure about is whether we apply the image loss after the temporal decoder, essentially saying "the time transformer should understand single images", or before it, on the spatial transformer side (both placements sketched below).
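
To pin down the two options, here are hypothetical helpers reusing the FactorisedVideoEncoder sketch from earlier (illustrative only):

```python
def embed_image_via_temporal(encoder, image):
    # Option A: treat the image as a one-frame video; the image loss then
    # flows through the temporal transformer, forcing it to handle stills.
    video = image.unsqueeze(1)      # (batch, 1, C, H, W)
    return encoder(video)

def embed_image_spatial_only(encoder, image):
    # Option B: apply the image loss directly to the spatial tower's
    # output, leaving the temporal transformer untouched by image batches.
    return encoder.spatial(image)   # (batch, embed_dim)
```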

@iejMac (Contributor, Author) commented Feb 21, 2023

@lucidrains, does anything look wrong to you in the modeling code, specifically this ViViT class? I'm getting very strange loss curves and was wondering if you might have an idea.

@iejMac (Contributor, Author) commented Feb 21, 2023

It could also be the dataloader code, but I figured I'd ask you about the model since I'm comparing against your vivit implementation.

@lucidrains (Contributor)
@iejMac nice! i'll do a code review later this week when i find some downtime

@iejMac (Contributor, Author) commented Mar 27, 2023

Next task: initialize the spatial and text transformers from a pretrained CLIP model.
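
A rough sketch of how the warm start could look. The open_clip call is real; the video model's spatial/text submodule names are hypothetical and assume CLIP-compatible shapes:

```python
import open_clip

# Load pretrained CLIP weights (real open_clip API).
clip_model, _, _ = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='openai')

def init_from_clip(video_model, clip_model):
    # Hypothetical submodule names: copy CLIP's visual tower into the
    # spatial transformer and its text transformer into the text tower;
    # only the temporal transformer stays randomly initialized.
    video_model.spatial.load_state_dict(clip_model.visual.state_dict())
    video_model.text.load_state_dict(clip_model.transformer.state_dict())
    # In practice you'd also copy token_embedding, positional_embedding,
    # ln_final, and text_projection from the CLIP text tower.
```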

@iejMac (Contributor, Author) commented Apr 4, 2023

Next things to do:
