
Combining CLIPA-v2 and SigLIP (both big_vision based) models #660

Merged
merged 24 commits into main from siglip_clipa_models
Oct 20, 2023

Conversation

rwightman
Collaborator

No description provided.

@rwightman rwightman mentioned this pull request Oct 6, 2023
@rwightman
Collaborator Author

Almost there on this one; it got a bit epic. Still some cleanup to do, but I want to get this in for more eyes.

Main goal is to support SigLIP pretrained weights and CLIPA-v2 (I don't think supporting v1 is worth it at this point)

There is associated cleanup, plus additions to support:

  • HF tokenization with extra post-processing
  • Different image interpolation and resizing modes, with support for setting them via pretrained configs
  • Builtin TextTransformer pooling options (first/last/argmax)
  • Builtin ViT pooling options (tok/avg)
    • avg pool is always over spatial tokens; even though some models were trained to include the cls token, it still works well, so I'm not going to support both anymore
  • Tried to merge the ViT pool cleanly with the attentional pooling for CoCa and provide a path to fix the bugs in that (cascade or parallel)
  • Allow the final LN in ViT to be applied before or after the pool; this differs across models (CLIPA applies it after avg pool, SigLIP before). See the sketch after this list.
  • Removed input patchnorm for ease of future maintenance
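
For illustration, here's a rough sketch of how the ViT pooling options and the final-LN ordering could fit together (a sketch only; names and layout are assumptions, not the exact PR code):

import torch
import torch.nn as nn

# Sketch: `x` is a ViT token sequence [batch, 1 + num_patches, dim]
# with a leading class token.
def vit_pool(x: torch.Tensor, norm: nn.LayerNorm,
             pool_type: str = 'tok', final_ln_after_pool: bool = False):
    if not final_ln_after_pool:
        x = norm(x)                    # SigLIP-style: LN before the pool
    if pool_type == 'avg':
        pooled = x[:, 1:].mean(dim=1)  # avg is always over spatial tokens only
    else:                              # 'tok'
        pooled = x[:, 0]               # class token
    if final_ln_after_pool:
        pooled = norm(pooled)          # CLIPA-style: LN after the avg pool
    return pooled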

For @zw615

  • I did not find any noteworthy difference between the tanh approximation and normal GELU. This matches my past experience porting weights from JAX-trained models, so I'll leave that out for simplicity
  • The TF tokenizer is insanely slow as used and not acceptable as a dependency; HF bert-base-uncased appears to match with an added post-processing step (see the sketch below)
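
A hedged sketch of that tokenizer approach (the exact post-processing step in the PR may differ; zeroing the [SEP] token id is shown purely as an illustration):

import torch
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize(texts, context_length: int = 64) -> torch.Tensor:
    input_ids = hf_tokenizer(
        texts, return_tensors='pt', padding='max_length',
        truncation=True, max_length=context_length).input_ids
    # hypothetical post-processing step: zero out the [SEP] token id
    input_ids[input_ids == hf_tokenizer.sep_token_id] = 0
    return input_ids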

I still have more testing to do; I've run through all the SigLIP weights and ~50% of the v2 CLIPA weights. At the large end, both the SO400M SigLIP and the new G/14 CLIPA check out at a hair past 83% top-1.

There is a drop of ~0-0.2 in the zero-shot scores relative to the JAX versions; this fits past experience.

I still need to do some more testing and push weights to the hub; CoCa is probably broken right now.

@rwightman
Collaborator Author

@zw615 do you want to set up an organization on the HF hub so we can push under something official? Or should I push into the timm organization myself? It's much easier to have the weights on the hub.

If you do set up an org I can provide some command lines to push once I have everything else ready.

@rom1504
Collaborator

rom1504 commented Oct 11, 2023

Can you fix the merge conflict so the tests will run, please?



@torch.no_grad()
def load_big_vision_weights(model: CustomTextCLIP, checkpoint_path: str):
Collaborator


How stable do we expect this to be? Should we somehow (at least with a comment) lock to a specific commit of big_vision?

Collaborator Author


@rom1504 it works for big_vision SigLIP weights -> timm + builtin text models only. I don't see why it wouldn't be stable; it has nothing to do with the code revision, only the weight files are relevant. This wouldn't work for their LiT models, but it could be extended, etc.

I could drop this code after I convert and push to the HF hub, but it could be a useful reference or serve future models. It's isolated to checkpoint loading from an npz/npy file and has no extra deps.
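
As a rough illustration of that conversion path (key names and tensor layouts below are assumptions; real big_vision checkpoints differ per model family):

import numpy as np
import torch

@torch.no_grad()
def load_big_vision_weights_sketch(model, checkpoint_path: str):
    w = np.load(checkpoint_path)  # npz: flat mapping of param path -> array
    # JAX conv kernels are HWIO; torch Conv2d weights are OIHW
    kernel = torch.from_numpy(w['img/embedding/kernel'])
    model.visual.trunk.patch_embed.proj.weight.copy_(kernel.permute(3, 2, 0, 1))
    model.visual.trunk.patch_embed.proj.bias.copy_(
        torch.from_numpy(w['img/embedding/bias']))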

Collaborator


OK, you're saying this will work for some specific big_vision-trained checkpoints.

Collaborator


Then maybe we could indicate those above the function?

Collaborator Author


Sure, I'll add a comment that it only supports the big_vision SigLIP weights right now; any other big_vision weights (present or future) could be added by expanding the support to check for various keys in the numpy archive...

@rwightman
Collaborator Author

Can you fix the merge conflict so the tests will run, please?

Pretty sure it's broken at this point; I have some more things to do before I'd expect a pass. Will merge main tomorrow. I'd like bigger-picture feedback/comments early, regardless of tests.

@rwightman
Collaborator Author

Also @gabrielilharco if you have any comments...

@rwightman
Collaborator Author

@gpucce while you're poking around here, there are changes impacting CoCa in this PR (and probably some of your future adds) ... just a heads up, I put a sketch of an attentional pool as described in the paper in there, while keeping bwd compat

@gabrielilharco
Collaborator

Really cool @rwightman! At a high level the design looks good to me. Since a lot changed and some of it affects existing models, we should make sure the performance of all pretrained models we support remains the same after the changes.

@gpucce
Contributor

gpucce commented Oct 11, 2023

@gpucce while you're poking around here, there are changes impacting CoCa in this PR (and probably some of your future adds) ... just a heads up, I put a sketch of an attentional pool as described in the paper in there, while keeping bwd compat

Thanks, I'll try and keep it all together. There was another small PR trying to address this, #551, fixing another small issue; don't know if you ever had time to take a look.

@rom1504
Collaborator

rom1504 commented Oct 12, 2023

It's a lot of changes, but it looks OK to me. It would definitely be safer to merge (and potentially revert) small PRs bringing more scoped changes.

My main worry is that I can't tell from the code whether things still work.

The tests partially cover that inference still works the same; maybe they need to be expanded to the new models.

What about training? How do we check?

@rwightman
Collaborator Author

rwightman commented Oct 12, 2023

@rom1504 yeah, smaller is usually better, but CLIPA and SigLIP are inter-related in a non-trivial way: they both come from big_vision, but with differences that conflict with past decisions made here, and they also need extra (but different) tokenizer support, preprocessing, etc.

As a counterpoint, things would have ended up more of a mess if CLIPA had been merged as-is, then someone else added SigLIP on top, and the other fixes landed after that, etc. At least this way it's pulled together with a goal of leaving things a bit better in the end, with hopefully a cleaner base to add more.

@rwightman
Collaborator Author

@rom1504 I feel testing inference will be enough here; the training risk should be low, and the breaks should all be detectable by running inference on pretrained models across all model types.

@rom1504
Collaborator

rom1504 commented Oct 17, 2023

Looks like the tests are passing.

Once this is merged, we can adapt

models_to_test = [(model_name, False) for model_name in models_to_test]

to make sure SigLIP models also get into the non-regression tests.
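
For example, something roughly like this could work (a sketch; exact model names are assumptions):

import open_clip

# extend the existing `models_to_test` list from the test file (assumed
# defined as in the quoted line) with the new SigLIP configs
siglip_models = [m for m in open_clip.list_models() if 'siglip' in m.lower()]
models_to_test += [(name, False) for name in siglip_models]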

@gabrielilharco
Collaborator

@gabrielilharco okay! reworked to use a class attribute for context_length with an optional override via the fn argument (so backwards compat behaviour is maintained).

Nice @rwightman! Looking great. I'm able to run evals now for all the new models; will send results once they're done.
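
For reference, the context_length override pattern described in the quoted comment might look roughly like this (a minimal sketch with illustrative names, not the exact PR code):

from typing import List, Optional
import torch

class TokenizerSketch:
    context_length: int = 77  # class-level default

    def __call__(self, texts: List[str],
                 context_length: Optional[int] = None) -> torch.Tensor:
        # the fn argument takes precedence; fall back to the attribute default
        context_length = context_length or self.context_length
        # ... encode `texts`, then pad/truncate to `context_length` ...
        return torch.zeros(len(texts), context_length, dtype=torch.long)  # stub output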

@rwightman
Collaborator Author

@gpucce think I found it: the LN wasn't being applied properly with attn_pool set; it has to be done differently for the legacy (current) CoCa and the future double-pool options...

@gpucce
Contributor

gpucce commented Oct 19, 2023

@rwightman it seems there is still something different with the captions; let me try to take a better look at the outputs.

Edit: that is, if you meant that the last commit should be the fix.

@gabrielilharco
Collaborator

I'm seeing small discrepancies in performance for CoCa models too after the latest change. Some sample results are below.

Before (without this PR): {'Caltech-101': 0.943293454951418, 'CIFAR-10': 0.9717, 'CIFAR-100': 0.8318000000000001, 'CLEVR Counts': 0.35646666666666604, 'CLEVR Distance': 0.23646666666666602, 'Country211': 0.25459715639810404, 'Describable Textures': 0.627127659574468, 'EuroSAT': 0.685, 'FGVC Aircraft': 0.362210338680926, 'Food-101': 0.9044752475247521, 'GTSRB': 0.557244655581947, 'ImageNet 1k': 0.7564200000000001, 'ImageNet Sketch': 0.645915620271571, 'ImageNet v2': 0.6794, ...}

After the latest changes: {'Caltech-101': 0.9430201183691431, 'CIFAR-10': 0.9722000000000001, 'CIFAR-100': 0.8318000000000001, 'CLEVR Counts': 0.37806666666666605, 'CLEVR Distance': 0.2446, 'Country211': 0.25507109004739303, 'Describable Textures': 0.623936170212765, 'EuroSAT': 0.675185185185185, 'FGVC Aircraft': 0.358957219251336, 'Food-101': 0.903801980198019, 'GTSRB': 0.562391132224861, 'ImageNet 1k': 0.7560600000000001, 'ImageNet Sketch': 0.6453064512959571, 'ImageNet v2': 0.6798000000000001, ...}

@rwightman
Collaborator Author

Oh fun. Yeah, I thought that last change would bring the CoCa ViT outputs in line with the old ones; they would have been different (for the tokens output) before, but the pooled output should have been the same...

@rwightman
Collaborator Author

FWIW, it's probably worth checking CoCa on the main branch to ensure it was in a good state there too, in case there was a prior regression we didn't pick up

…atch prev CoCa behaviour, but at odds with argmax which leaves special tokens in (not consistent)
@gpucce
Contributor

gpucce commented Oct 19, 2023

@rwightman until Booster is up (tomorrow, I think) I can't run the evals again: the GPU I have where I am is full right now, and Leonardo is also under maintenance. Currently I am checking instances by hand.

@gabrielilharco
Collaborator

FWIW, it's probably worth checking CoCa on the main branch to ensure it was in a good state there too, in case there was a prior regression we didn't pick up

@rwightman good point, I'm running some evals on main now and they are matching the numbers from d7542e4 so far

@rwightman
Collaborator Author

@gabrielilharco @gpucce okay, so I did have another issue in the text pool: the 'tokens' output included the cls token, which would impact generation but not zero-shot eval. That's fixed in the very latest. But for both of the previous fixes, the broken path was in the 'tokens' output from the vision or text encoder, so zero-shot should have been consistent with main.
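
A sketch of the first/last/argmax text pooling variants under discussion (close in spirit to what's described above, not necessarily the exact PR code):

import torch

def text_global_pool(x: torch.Tensor, text=None, pool_type: str = 'argmax'):
    if pool_type == 'first':
        pooled, tokens = x[:, 0], x[:, 1:]    # drop the leading special token
    elif pool_type == 'last':
        pooled, tokens = x[:, -1], x[:, :-1]  # drop the trailing special token
    elif pool_type == 'argmax':
        # pool at the highest token id (EOT); note the special tokens stay
        # in `tokens` here, the inconsistency noted in the commit message above
        assert text is not None
        pooled = x[torch.arange(x.shape[0]), text.argmax(dim=-1)]
        tokens = x
    else:
        pooled = tokens = x
    return pooled, tokens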

@rwightman
Collaborator Author

@gabrielilharco nice, the evals are looking good, but I double-checked that bigG CLIPA; it should be 83. Looks like I mixed up a config value, durr... confirming.

@gpucce
Contributor

gpucce commented Oct 20, 2023

@rwightman eval for the already fine-tuned model is consistent with main; I am running fine-tuning now.

@gpucce
Contributor

gpucce commented Oct 20, 2023

@rwightman eval for the already fine-tuned model is consistent with main; I am running fine-tuning now.

fine-tuning seems fine too

@rwightman
Collaborator Author

@gpucce nice, thanks!

okay, so are we ready to merge? @gabrielilharco

  • the bigG CLIPA needs an updated eval after the last fix to its config
  • I can get the CLIPA HF hub models moved to a UCSC-related location @zw615 (I think I can have someone move the ones I've already uploaded to a different org, and then the README can be updated as desired). Also, the bigG 224x224 model is missing; it has the same link as the 336?

@gabrielilharco
Collaborator

gabrielilharco commented Oct 20, 2023

@rwightman all good from my side! I'm getting 83.09 for bigG CLIPA on ImageNet. Full evals are still running (it's slow since the model is so big); I'll update the numbers once they're done.

@rwightman
Collaborator Author

OK, I will merge now so we can get the changes and various fixes in.

Maybe hold off on a versioned release for another day or two to give a few early adopters a chance to try the main branch?

We should still add some text to the README / PRETRAINED.md about CLIPA-v2 and SigLIP, but we can think about that one...

@rwightman rwightman merged commit a5f3ae9 into main Oct 20, 2023
5 checks passed
@rwightman rwightman deleted the siglip_clipa_models branch October 20, 2023 15:52
@rom1504
Collaborator

rom1504 commented Oct 20, 2023 via email

@gpucce
Contributor

gpucce commented Oct 20, 2023

@rom1504 @rwightman I can add the regression tests for generation.

@zw615
Contributor

zw615 commented Oct 21, 2023

Hi @rwightman, sure! I will discuss this HF organization thing with the other co-authors and work on it; it should be finished real quick.

@gpucce nice, thanks!

okay, so are we ready to merge? @gabrielilharco

* the bigG CLIPA needs an updated eval after the last fix to its config

* I can get the CLIPA HF hub models moved to a UCSC-related location @zw615 (I think I can have someone move the ones I've already uploaded to a different org, and then the README can be updated as desired). Also, the bigG 224x224 model is missing; it has the same link as the 336?

@zw615
Contributor

zw615 commented Oct 22, 2023

@rwightman Hi Ross, could you please move those weights to the UCSC-VLAA organization? (https://huggingface.co/UCSC-VLAA)

@rwightman
Collaborator Author

rwightman commented Oct 22, 2023 via email
