
Combining CLIPA-v2 and SigLIP (both big_vision based) models #660

Merged
merged 24 commits into main from siglip_clipa_models
Oct 20, 2023

Conversation

rwightman
Collaborator

No description provided.

@rwightman rwightman mentioned this pull request Oct 6, 2023
@rwightman
Collaborator Author

Almost there on this one; it got a bit epic. Still some cleanup to do, but I want to get this in for more eyes.

Main goal is to support SigLIP pretrained weights and CLIPA-v2 (I don't think supporting v1 is worth it at this point)

There is associated cleanup, plus additions to support:

  • HF tokenization with extra post-processing
  • Different image interpolation and resizing modes, with support for setting them via pretrained configs
  • Builtin TextTransformer pooling options (first/last/argmax)
  • Builtin ViT pooling options (tok/avg)
    • avg pool is always over spatial tokens; even though some models were trained to include the cls token, it still works well, so I'm not going to support both anymore
  • Tried to merge the ViT pool cleanly with the attentional pooling for CoCa and provide a path to fix the bugs in that (cascade or parallel)
  • Allow the final LN in ViT to be applied before or after the pool; this differs across models (CLIPA applies it after avg pool, SigLIP before). See the sketch after this list.
  • Removed input patchnorm for ease of future maintenance
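
For illustration, here's a rough sketch of how the ViT pooling options and the final-LN ordering could fit together (a sketch only; names and layout are assumptions, not the exact PR code):

import torch
import torch.nn as nn

# Sketch: `x` is a ViT token sequence [batch, 1 + num_patches, dim]
# with a leading class token.
def vit_pool(x: torch.Tensor, norm: nn.LayerNorm,
             pool_type: str = 'tok', final_ln_after_pool: bool = False):
    if not final_ln_after_pool:
        x = norm(x)                    # SigLIP-style: LN before the pool
    if pool_type == 'avg':
        pooled = x[:, 1:].mean(dim=1)  # avg is always over spatial tokens only
    else:                              # 'tok'
        pooled = x[:, 0]               # class token
    if final_ln_after_pool:
        pooled = norm(pooled)          # CLIPA-style: LN after the avg pool
    return pooled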

For @zw615

  • I did not find any noteworthy difference between the tanh approximation and normal GELU. This matches my past experience porting weights from JAX-trained models, so I'll leave that out for simplicity
  • The TF tokenizer is insanely slow as used and not acceptable as a dependency; HF bert-base-uncased appears to match with an added post-processing step (see the sketch below)
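
A hedged sketch of that tokenizer approach (the exact post-processing step in the PR may differ; zeroing the [SEP] token id is shown purely as an illustration):

import torch
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize(texts, context_length: int = 64) -> torch.Tensor:
    input_ids = hf_tokenizer(
        texts, return_tensors='pt', padding='max_length',
        truncation=True, max_length=context_length).input_ids
    # hypothetical post-processing step: zero out the [SEP] token id
    input_ids[input_ids == hf_tokenizer.sep_token_id] = 0
    return input_ids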

I still have more testing to do; I've run through all the SigLIP weights and ~50% of the v2 CLIPA weights. At the large end, both the SO400M SigLIP and the new G/14 CLIPA check out at a hair past 83% top-1.

There is a drop of ~0-0.2 in the zero-shot scores relative to the JAX versions; this fits past experience.

I still need to do some more testing and push weights to the hub; CoCa is probably broken right now.

@rwightman
Collaborator Author

@zw615 do you want to set up an organization on the HF hub so we can push under something official? Or should I push into the timm organization myself? It's much easier to have the weights on the hub.

If you do set up an org I can provide some command lines to push once I have everything else ready.

@rom1504
Collaborator

rom1504 commented Oct 11, 2023

Can you fix the merge conflict so the tests will run, please?



@torch.no_grad()
def load_big_vision_weights(model: CustomTextCLIP, checkpoint_path: str):
Collaborator


How stable do we expect this to be? Should we somehow (at least with a comment) lock to a specific commit of big_vision?

Collaborator Author


@rom1504 it works for big_vision SigLIP weights -> timm + builtin text models only. I don't see why it wouldn't be stable; it has nothing to do with the code revision, only the weight files are relevant. This wouldn't work for their LiT models, but it could be extended, etc.

I could drop this code after I convert and push to the HF hub, but it could be a useful reference or serve future models. It's isolated to checkpoint loading from an npz/npy file and has no extra deps.
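
As a rough illustration of that conversion path (key names and tensor layouts below are assumptions; real big_vision checkpoints differ per model family):

import numpy as np
import torch

@torch.no_grad()
def load_big_vision_weights_sketch(model, checkpoint_path: str):
    w = np.load(checkpoint_path)  # npz: flat mapping of param path -> array
    # JAX conv kernels are HWIO; torch Conv2d weights are OIHW
    kernel = torch.from_numpy(w['img/embedding/kernel'])
    model.visual.trunk.patch_embed.proj.weight.copy_(kernel.permute(3, 2, 0, 1))
    model.visual.trunk.patch_embed.proj.bias.copy_(
        torch.from_numpy(w['img/embedding/bias']))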

Collaborator


OK, you're saying this will work for some specific big_vision-trained checkpoints.

Collaborator


Then maybe we could indicate those above the function?

Collaborator Author


Sure, I'll add a comment that it only supports the big_vision SigLIP weights right now; any other big_vision weights (present or future) could be added by expanding the support to check for various keys in the numpy archive...

@rwightman
Collaborator Author

Can you fix the merge conflict so the tests will run, please?

Pretty sure it's broken at this point; I have some more things to do before I'd expect a pass. Will merge main tomorrow. I'd like bigger-picture feedback/comments early, regardless of tests.

@rwightman
Collaborator Author

Also @gabrielilharco if you have any comments...

@rwightman
Collaborator Author

@gpucce while you're poking around here, there are changes impacting CoCa in this PR (and probably some of your future adds) ... just a heads up, I put a sketch of an attentional pool as described in the paper in there, while keeping bwd compat

@gabrielilharco
Collaborator

Really cool @rwightman! At a high level the design looks good to me. Since a lot changed and some of it affects existing models, we should make sure the performance of all pretrained models we support remains the same after the changes.

@gpucce
Contributor

gpucce commented Oct 11, 2023

@gpucce while you're poking around here, there are changes impacting CoCa in this PR (and probably some of your future adds) ... just a heads up, I put a sketch of an attentional pool as described in the paper in there, while keeping bwd compat

Thanks, I'll try and keep it all together. There was another small PR trying to address this, #551, fixing another small issue; don't know if you ever had time to take a look.

@rom1504
Collaborator

rom1504 commented Oct 12, 2023

It's a lot of changes, but it looks OK to me. It would definitely be safer to merge (and potentially revert) small PRs bringing more scoped changes.

My main worry is that I can't tell from the code whether things still work.

The tests partially cover that inference still works the same; maybe they need to be expanded to the new models.

What about training? How do we check?

@rwightman
Collaborator Author

rwightman commented Oct 12, 2023

@rom1504 yeah, smaller is usually better, but CLIPA and SigLIP are inter-related in a non-trivial way: they both come from big_vision, but with differences that conflict with past decisions made here, and they also need extra (but different) tokenizer support, preprocessing, etc.

As a counterpoint, things would have ended up more of a mess if CLIPA had been merged as-is, then someone else added SigLIP on top, and the other fixes landed after that, etc. At least this way it's pulled together with a goal of leaving things a bit better in the end, with hopefully a cleaner base to add more.

@rwightman
Collaborator Author

@rom1504 I feel testing inference will be enough here; the training risk should be low, and the breaks should all be detectable by running inference on pretrained models across all model types.

@rom1504
Collaborator

rom1504 commented Oct 17, 2023

Looks like the tests are passing.

Once this is merged, we can adapt

models_to_test = [(model_name, False) for model_name in models_to_test]

to make sure SigLIP models also get into the non-regression tests.
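
For example, something roughly like this could work (a sketch; exact model names are assumptions):

import open_clip

# extend the existing `models_to_test` list from the test file (assumed
# defined as in the quoted line) with the new SigLIP configs
siglip_models = [m for m in open_clip.list_models() if 'siglip' in m.lower()]
models_to_test += [(name, False) for name in siglip_models]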

@gabrielilharco
Collaborator

@gabrielilharco okay! reworked to use a class attribute for context_length with an optional override via the fn argument (so backwards compat behaviour is maintained).

Nice @rwightman! Looking great. I'm able to run evals now for all the new models; will send results once they're done.
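
For reference, the context_length override pattern described in the quoted comment might look roughly like this (a minimal sketch with illustrative names, not the exact PR code):

from typing import List, Optional
import torch

class TokenizerSketch:
    context_length: int = 77  # class-level default

    def __call__(self, texts: List[str],
                 context_length: Optional[int] = None) -> torch.Tensor:
        # the fn argument takes precedence; fall back to the attribute default
        context_length = context_length or self.context_length
        # ... encode `texts`, then pad/truncate to `context_length` ...
        return torch.zeros(len(texts), context_length, dtype=torch.long)  # stub output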

@rwightman
Collaborator Author

@gpucce think I found it: the LN wasn't being applied properly with attn_pool set; it has to be done differently for the legacy (current) CoCa and the future double-pool options...

@gpucce
Contributor

gpucce commented Oct 19, 2023

@rwightman it seems there is still something different with the captions; let me try to take a better look at the outputs.

Edit: that is, if you meant that the last commit should be the fix.

@gabrielilharco
Collaborator

I'm seeing small discrepancies in performance for CoCa models too after the latest change. Some sample results are below.

Before (without this PR): {'Caltech-101': 0.943293454951418, 'CIFAR-10': 0.9717, 'CIFAR-100': 0.8318000000000001, 'CLEVR Counts': 0.35646666666666604, 'CLEVR Distance': 0.23646666666666602, 'Country211': 0.25459715639810404, 'Describable Textures': 0.627127659574468, 'EuroSAT': 0.685, 'FGVC Aircraft': 0.362210338680926, 'Food-101': 0.9044752475247521, 'GTSRB': 0.557244655581947, 'ImageNet 1k': 0.7564200000000001, 'ImageNet Sketch': 0.645915620271571, 'ImageNet v2': 0.6794, ...}

After the latest changes: {'Caltech-101': 0.9430201183691431, 'CIFAR-10': 0.9722000000000001, 'CIFAR-100': 0.8318000000000001, 'CLEVR Counts': 0.37806666666666605, 'CLEVR Distance': 0.2446, 'Country211': 0.25507109004739303, 'Describable Textures': 0.623936170212765, 'EuroSAT': 0.675185185185185, 'FGVC Aircraft': 0.358957219251336, 'Food-101': 0.903801980198019, 'GTSRB': 0.562391132224861, 'ImageNet 1k': 0.7560600000000001, 'ImageNet Sketch': 0.6453064512959571, 'ImageNet v2': 0.6798000000000001, ...}

@rwightman
Collaborator Author

Oh fun. Yeah, I thought that last change would bring the CoCa ViT outputs in line with the old ones; they would have been different (for the tokens output) before, but the pooled output should have been the same...

@rwightman
Collaborator Author

FWIW, it's probably worth checking CoCa on the main branch to ensure it was in a good state there too, in case there was a prior regression we didn't pick up

…atch prev CoCa behaviour, but at odds with argmax which leaves special tokens in (not consistent)
@gpucce
Contributor

gpucce commented Oct 19, 2023

@rwightman until Booster is up (tomorrow, I think) I can't run the evals again: the GPU I have where I am is full right now, and Leonardo is also under maintenance. Currently I am checking instances by hand.

@gabrielilharco
Collaborator

FWIW, it's probably worth checking CoCa on the main branch to ensure it was in a good state there too, in case there was a prior regression we didn't pick up

@rwightman good point, I'm running some evals on main now and they are matching the numbers from d7542e4 so far

@rwightman
Collaborator Author

@gabrielilharco @gpucce okay, so I did have another issue in the text pool: the 'tokens' output included the cls token, which would impact generation but not zero-shot eval. That's fixed in the very latest. But for both of the previous fixes, the broken path was in the 'tokens' output from the vision or text encoder, so zero-shot should have been consistent with main.
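
A sketch of the first/last/argmax text pooling variants under discussion (close in spirit to what's described above, not necessarily the exact PR code):

import torch

def text_global_pool(x: torch.Tensor, text=None, pool_type: str = 'argmax'):
    if pool_type == 'first':
        pooled, tokens = x[:, 0], x[:, 1:]    # drop the leading special token
    elif pool_type == 'last':
        pooled, tokens = x[:, -1], x[:, :-1]  # drop the trailing special token
    elif pool_type == 'argmax':
        # pool at the highest token id (EOT); note the special tokens stay
        # in `tokens` here, the inconsistency noted in the commit message above
        assert text is not None
        pooled = x[torch.arange(x.shape[0]), text.argmax(dim=-1)]
        tokens = x
    else:
        pooled = tokens = x
    return pooled, tokens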

@rwightman
Collaborator Author

@gabrielilharco nice, the evals are looking good, but I double-checked that bigG CLIPA; it should be 83. Looks like I mixed up a config value, durr... confirming.

@gpucce
Contributor

gpucce commented Oct 20, 2023

@rwightman eval for the already fine-tuned model is consistent with main; I am running fine-tuning now.

@gpucce
Contributor

gpucce commented Oct 20, 2023

@rwightman eval for the already fine-tuned model is consistent with main; I am running fine-tuning now.

fine-tuning seems fine too

@rwightman
Collaborator Author

@gpucce nice, thanks!

okay, so are we ready to merge? @gabrielilharco

  • the bigG CLIPA needs an updated eval after the last fix to its config
  • I can get the CLIPA HF hub models moved to a UCSC-related location @zw615 (I think I can have someone move the ones I've already uploaded to a different org, and then the README can be updated as desired). Also, the bigG 224x224 model is missing; it has the same link as the 336?

@gabrielilharco
Collaborator

gabrielilharco commented Oct 20, 2023

@rwightman all good from my side! I'm getting 83.09 for bigG CLIPA on ImageNet. Full evals are still running (it's slow since the model is so big); I'll update the numbers once they're done.

@rwightman
Collaborator Author

OK, I will merge now so we can get the changes and various fixes in.

Maybe hold off on a versioned release for another day or two to give a few early adopters a chance to try the main branch?

We should still add some text to the README / PRETRAINED.md about CLIPA-v2 and SigLIP, but we can think about that one...

@rwightman rwightman merged commit a5f3ae9 into main Oct 20, 2023
5 checks passed
@rwightman rwightman deleted the siglip_clipa_models branch October 20, 2023 15:52
@rom1504
Collaborator

rom1504 commented Oct 20, 2023 via email

@gpucce
Contributor

gpucce commented Oct 20, 2023

@rom1504 @rwightman I can add the regression tests for generation.

@zw615
Contributor

zw615 commented Oct 21, 2023

Hi @rwightman, sure! I will discuss this HF organization thing with the other co-authors and work on it; it should be finished real quick.

@gpucce nice, thanks!

okay, so are we ready to merge? @gabrielilharco

* the bigG CLIPA needs an updated eval after the last fix to its config

* I can get the CLIPA HF hub models moved to a UCSC-related location @zw615 (I think I can have someone move the ones I've already uploaded to a different org, and then the README can be updated as desired). Also, the bigG 224x224 model is missing; it has the same link as the 336?

@zw615
Contributor

zw615 commented Oct 22, 2023

@rwightman Hi Ross, could you please move those weights to the UCSC-VLAA organization? (https://huggingface.co/UCSC-VLAA)

@rwightman
Collaborator Author

rwightman commented Oct 22, 2023 via email
