
Bug: JSON Schema - enum behind a $ref generates an object with unrestricted properties #8073

Open
cikkle opened this issue Jun 23, 2024 · 3 comments · May be fixed by #8199
Labels: bug-unconfirmed, low severity


cikkle commented Jun 23, 2024

What happened?

I'm using the json_schema feature in llama-server. With a simple prompt like "Write a dialog between Alice and Biff", if I send a schema like:

{
    "type": "array",
    "minItems": 15,
    "maxItems": 15,
    "items": { "$ref": "#/$defs/TALK" },

    "$defs": {
        "TALK": {
            "type": "object",
            "required": [ "character", "emote", "dialog" ],
            "properties": {
                "character": { "enum": [ "Alice", "Biff"] },
                "emote": { "enum": ["EXCLAMATION", "CONFUSION", "CHEERFUL", "LOVE", "ANGRY", "NERVOUS", "ANNOYED", "SILENCE", "INSPIRED", "SLEEPING"] },
                "dialog": {
                    "type": "string",
                    "minLength": 1,
                    "maxLength": 200
                }
            }
        }
    }
}

I get back an array of responses in the format I'd expect, like:

{ "character": "Alice", "emote": "SILENCE", "dialog": "I'm just saying, it's not like you to be so... quiet. Is everything alright?" }
{"character": "Biff", "emote": "NERVOUS", "dialog": "Yeah, everything's fine. Just... busy. You know how it is." }

Things stop working right if I try to put the enums in separate definitions. The following schema:

{
    "type": "array",
    "minItems": 15,
    "maxItems": 15,
    "items": { "$ref": "#/$defs/TALK" },

    "$defs": {
        "characters": { "enum": ["Biff", "Alice"] },
        "emotes": { "enum": ["EXCLAMATION", "CONFUSION", "CHEERFUL", "LOVE", "ANGRY"] },

        "TALK": {
            "type": "object",
            "required": [ "character", "emote", "dialog" ],
            "properties": {
                "character": { "$ref": "#/$defs/characters" },
                "emote": { "$ref": "#/$defs/emotes" },
                "dialog": {
                    "type": "string",
                    "minLength": 1,
                    "maxLength": 200
                }
            }
        }
    }
}

...gives me arbitrary things like:

{ "character": {"name": "Alice","description": "Alice, a young woman, has a bright and curious expression on her face."},
{"emotion": "curious"}
 { "character": {"name": "Biff","description": "Biff, a friendly-looking man, has a warm smile and a hint of mischief in his eyes."},
{"emotion": "amused"}

The output should follow the same format in both cases, but instead I get an object with arbitrary properties in place of each enum, and possibly more random content afterward (in this run it was a bonus object tagging along, but it varies).

Notably, if I reorder the properties to put "dialog" before "character", I actually get the dialog property and string I asked for, so things only seem to go off the rails when generation reaches one of the referenced enums.

I'm aware json_schema currently has some known bugs and features yet to be implemented, but I didn't see anything in the readme I thought this would fall under. Terminal output from llama-server doesn't appear to show anything relevant, but it's included for completeness.

Name and Version

o0@hades:/ai/llama.cpp$ ./llama-cli --version
version: 3203 (b5a5f34)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

INFO [                    main] build info | tid="139722331939776" timestamp=1719130838 build=3203 commit="b5a5f34e"
INFO [                    main] system info | tid="139722331939776" timestamp=1719130838 n_threads=12 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from ../models/text/L3-8B-Stheno-v3.2-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = L3-8B-Stheno-v3.2
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/L3-8B-Stheno-v3.2-GGUF/L3-8B-...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/calibration_datav3.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = L3-8B-Stheno-v3.2
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
  Device 1: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size =    0.44 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  3757.53 MiB
llm_load_tensors:      ROCm1 buffer size =  3847.80 MiB
llm_load_tensors:        CPU buffer size =   532.31 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   416.50 MiB
llama_kv_cache_init:      ROCm1 KV buffer size =   367.50 MiB
llama_new_context_with_model: KV self size  =  784.00 MiB, K (q8_0):  272.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      ROCm0 compute buffer size =   640.01 MiB
llama_new_context_with_model:      ROCm1 compute buffer size =   640.02 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    72.02 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 3
INFO [                    init] initializing slots | tid="139722331939776" timestamp=1719130849 n_slots=1
INFO [                    init] new slot | tid="139722331939776" timestamp=1719130849 id_slot=0 n_ctx_slot=8192
INFO [                    main] model loaded | tid="139722331939776" timestamp=1719130849
INFO [                    main] chat template | tid="139722331939776" timestamp=1719130849 chat_example="<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi there<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" built_in=true
INFO [                    main] HTTP server listening | tid="139722331939776" timestamp=1719130849 n_threads_http="23" port="5000" hostname="0.0.0.0"
INFO [            update_slots] all slots are idle | tid="139722331939776" timestamp=1719130849
INFO [   launch_slot_with_task] slot is processing task | tid="139722331939776" timestamp=1719131079 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="139722331939776" timestamp=1719131079 id_slot=0 id_task=0 p0=0
INFO [           print_timings] prompt eval time     =     111.88 ms /    55 tokens (    2.03 ms per token,   491.61 tokens per second) | tid="139722331939776" timestamp=1719131141 id_slot=0 id_task=0 t_prompt_processing=111.878 n_prompt_tokens_processed=55 t_token=2.0341454545454547 n_tokens_second=491.60692897620623
INFO [           print_timings] generation eval time =   61940.54 ms /  1522 runs   (   40.70 ms per token,    24.57 tokens per second) | tid="139722331939776" timestamp=1719131141 id_slot=0 id_task=0 t_token_generation=61940.538 n_decoded=1522 t_token=40.696805519053875 n_tokens_second=24.57195318516607
INFO [           print_timings]           total time =   62052.42 ms | tid="139722331939776" timestamp=1719131141 id_slot=0 id_task=0 t_prompt_processing=111.878 t_token_generation=61940.538 t_total=62052.416
INFO [            update_slots] slot released | tid="139722331939776" timestamp=1719131141 id_slot=0 id_task=0 n_ctx=8192 n_past=1576 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="139722331939776" timestamp=1719131141
INFO [            update_slots] all slots are idle | tid="139722331939776" timestamp=1719131141
INFO [            update_slots] all slots are idle | tid="139722331939776" timestamp=1719131253
cikkle added the bug-unconfirmed and low severity labels on Jun 23, 2024
ochafik (Collaborator) commented Jun 24, 2024

Hi @cikkle, thanks for the report!

As you can see in the docs, support for external $refs hasn't been implemented in the C++ JSON schema -> grammar converter yet (it will need to use CURL; it's on my todo list).

I guess we should find a way to issue warnings about it, or even hard-fail (currently we just silently degrade to "anything goes", so the items in your example can be any JSON object). In the meantime, as a workaround, you might want to run python examples/json_schema_to_grammar.py schema.json and pass the resulting grammar to your server call, if possible. Alternatively, paste the schema into your prompt and rely on the model's good will for the items, with the schema constraint only covering the high-level shape of the output.
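
A minimal sketch of that workaround (assuming the same server on port 5000, the requests library, and a grammar file produced beforehand with the Python converter; file names are illustrative):

# Sketch: convert the schema to a GBNF grammar offline, then pass the grammar
# (rather than the schema) to the server, bypassing the C++ converter.
# Grammar produced beforehand with:
#   python examples/json_schema_to_grammar.py schema.json > schema.gbnf
import requests

with open("schema.gbnf") as f:
    grammar = f.read()

resp = requests.post(
    "http://localhost:5000/completion",
    json={
        "prompt": "Write a dialog between Alice and Biff",
        "grammar": grammar,  # GBNF text accepted by /completion
        "n_predict": 2048,
    },
)
print(resp.json()["content"])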

Cheers

cikkle (Author) commented Jun 24, 2024

Sorry, I might have a pretty basic misunderstanding of the spec, the docs, or the terms used. I took "external" to mean a reference to a schema in another file; the characters and emotes in the $defs block in the second example should all be local, right? Even given the notes in the docs, I thought that would work.

Anyhow, for my use case I don't have much of a problem with just copying enums around the schema where they need to go, but the fallback behavior did surprise me, so thanks for looking at this.
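
A small illustrative sketch of doing that copying programmatically, resolving local "#/$defs/..." references client-side before sending the schema (the helper is hypothetical and not part of llama.cpp; it assumes no recursive refs):

# Sketch: inline local "#/$defs/..." references so the schema sent to the
# server no longer contains $ref. Simple pointer handling only (no "~" escaping),
# which is enough for the schema above; do not use with recursive refs.
import json

def inline_local_refs(node, root):
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/"):
            target = root
            for part in ref[2:].split("/"):
                target = target[part]
            return inline_local_refs(target, root)
        return {key: inline_local_refs(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [inline_local_refs(value, root) for value in node]
    return node

with open("schema.json") as f:
    schema = json.load(f)

resolved = inline_local_refs(schema, schema)
resolved.pop("$defs", None)  # no longer needed once every $ref is inlined
print(json.dumps(resolved, indent=2))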

ochafik (Collaborator) commented Jun 24, 2024

> Sorry, I might have a pretty basic misunderstanding of the spec, the docs, or the terms used. I took "external" to mean a reference to a schema in another file; the characters and emotes in the $defs block in the second example should all be local, right? Even given the notes in the docs, I thought that would work.

Ohh sorry I completely misread your bug report, thanks for the clarification 🫣

> Anyhow, for my use case I don't have much of a problem with just copying enums around the schema where they need to go, but the fallback behavior did surprise me, so thanks for looking at this.

It looks like it's a bug specific to the C++ implementation of the JSON Schema -> grammar conversion; I'll try to send a fix shortly.

In the meantime, you can use python examples/json_schema_to_grammar.py schema.json, which seems to work with your example.
