ggml inference time is significantly slower than onnxruntime #841

Open
Francis235 opened this issue May 28, 2024 · 6 comments

@Francis235 commented May 28, 2024

I am using ggml to deploy a MobileNetV2 model, and compared with the same model deployed via onnxruntime, ggml's inference time is nearly 100 times longer. The relevant part of my ggml inference code is as follows:


float * mobilenetv2_inference(ggml_tensor * input, mobilenetv2_model model, ggml_context * ctx0) {
    ggml_tensor * result = apply_conv2d(ctx0, input, model.conv2d_layers[0]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[1]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[2]);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[3]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[4]);
    ggml_tensor * result_res = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[5]);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[6]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[7]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[8]);
    result = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[9]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[10]);
    result_res = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[11]);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[12]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[13]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[14]);
    result_res = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[15]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[16]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[17]);
    result = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[18]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[19]);
    result_res = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[20]);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[21]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[22]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[23]);
    result_res = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[24]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[25]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[26]);
    result_res = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[27]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[28]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[29]);
    result = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[30]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[31]);
    result_res = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[32]);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[33]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[34]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[35]);
    result_res = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[36]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[37]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[38]);
    result = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[39]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[40]);
    result_res = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[41]);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[42]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[43]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[44]);
    result_res = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[45]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[46]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[47]);
    result = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[48]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[49]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[50]);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[51]);
    result = ggml_pool_2d(ctx0, result, GGML_OP_POOL_AVG, 7, 7, 1, 1, 0, 0);
    result = ggml_reshape_2d(ctx0, result, result->ne[2], result->ne[3]);
    result = ggml_mul_mat(ctx0, model.gemm_layers[0].weights, result);
    result = ggml_add(ctx0, result, model.gemm_layers[0].biases);

    struct ggml_cgraph * gf = ggml_new_graph(ctx0);

    ggml_build_forward_expand(gf, result);

    const int64_t t_start_ms = ggml_time_ms();
    ggml_graph_compute_with_ctx(ctx0, gf, 1);
    const int64_t t_end_ms = ggml_time_ms();
    std::cout << "ggml_graph_compute_with_ctx exec time(ms): " << t_end_ms-t_start_ms << std::endl;

    float * output = ggml_get_data_f32(result);

    return output;
}

Is there something wrong with the way I build the model? Do you have any suggestions? Thanks in advance.

@ggerganov (Owner)

Did you build in Release? What do the apply_ functions do?

The ggml convolution operations are for sure not very optimal, but 100x difference is too much
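For reference on where the time goes: ggml_conv_2d is implemented as an im2col unfold followed by one big ggml_mul_mat (plus reshapes/permutes), so every convolution materializes an unfolded copy of its input before the mat mul runs. Below is a simplified sketch of that lowering with placeholder names; treat the exact reshape and layout details as approximate, the real code lives in ggml's source:

// simplified sketch of ggml's conv2d lowering: im2col + mat mul (illustrative, not this issue's code)
// kernel: [KW, KH, IC, OC], input: [W, H, IC, N]
static ggml_tensor * conv2d_as_im2col_matmul(ggml_context * ctx,
                                             ggml_tensor * kernel, ggml_tensor * input,
                                             int s, int p, int d) {
    // unfold the input into patches so the convolution becomes a mat mul
    ggml_tensor * im2col = ggml_im2col(ctx, kernel, input,
                                       s, s, p, p, d, d,
                                       /*is_2D=*/true, /*dst_type=*/kernel->type);
    // flatten both operands and do a single matrix multiplication
    ggml_tensor * a = ggml_reshape_2d(ctx, im2col,
                                      im2col->ne[0], im2col->ne[1] * im2col->ne[2] * im2col->ne[3]);
    ggml_tensor * b = ggml_reshape_2d(ctx, kernel,
                                      kernel->ne[0] * kernel->ne[1] * kernel->ne[2], kernel->ne[3]);
    // the real implementation reshapes/permutes the result back to image layout afterwards
    return ggml_mul_mat(ctx, a, b);
}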

@Francis235 (Author)

Thanks for your reply. I build from the master branch. The apply_ functions are wrappers around ggml_conv_2d / ggml_conv_depthwise_2d, as follows:

static ggml_tensor * apply_conv2d_no_clamp(ggml_context * ctx, ggml_tensor * input, const conv2d_layer & layer)
{
    ggml_tensor * result = ggml_conv_2d(ctx, layer.weights, input,
        layer.stride, layer.stride,
        layer.padding, layer.padding,
        layer.dilation, layer.dilation);
    return result;
}

static ggml_tensor * apply_conv2d(ggml_context * ctx, ggml_tensor * input, const conv2d_layer & layer)
{
    ggml_tensor * result = ggml_conv_2d(ctx, layer.weights, input, layer.stride, layer.stride, layer.padding, layer.padding, layer.dilation, layer.dilation);
    result = ggml_clamp(ctx, result, 0.0f, 6.0f);
    return result;
}

static ggml_tensor * apply_conv_depthwise_2d(ggml_context * ctx, ggml_tensor * input, const conv2d_layer & layer)
{
    ggml_tensor * result = ggml_conv_depthwise_2d(ctx, layer.weights, input, layer.stride, layer.stride, layer.padding, layer.padding, layer.dilation, layer.dilation);
    result = ggml_clamp(ctx, result, 0.0f, 6.0f);
    return result;
}

@Francis235 (Author)

> Did you build in Release? What do the apply_ functions do?
>
> The ggml convolution operations are for sure not very optimal, but 100x difference is too much

I tested mobilenetv2 inference on the release branch code, and the inference time was about the same.

@ggerganov (Owner)

By Release I mean to build with -O3 optimization flags. What hardware are you running on?
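For reference, with ggml's CMake build this is typically just a Release configuration (Release implies -O3 with the usual GCC/Clang toolchains); exact commands may differ for your setup:

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j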

@Francis235 (Author)

> By Release I mean to build with -O3 optimization flags. What hardware are you running on?

I rebuilt with the -O3 flag and inference got faster, but it is still not ideal: about 15x slower than onnxruntime. I am testing on my PC, CPU: Intel(R) Core(TM) i7-7560U @ 2.40GHz.

@ggerganov (Owner)

Make sure you are building with AVX2 support and ramp up the threads a bit:

const int n_threads = 4;
ggml_graph_compute_with_ctx(ctx0, gf, n_threads);
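If useful, ggml also exposes ggml_cpu_has_avx2() and related helpers, so a quick check (just a sketch, not part of the code above) can confirm which SIMD paths your build was compiled with:

// sketch: print which SIMD features this ggml build was compiled with
printf("AVX: %d, AVX2: %d, FMA: %d, F16C: %d\n",
       ggml_cpu_has_avx(), ggml_cpu_has_avx2(), ggml_cpu_has_fma(), ggml_cpu_has_f16c());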
