ggml inference time is significantly slower than onnxruntime #841

Open
Francis235 opened this issue May 28, 2024 · 6 comments

@Francis235 commented May 28, 2024

I am using ggml to deploy a MobileNetV2 model, and compared with the same model deployed via onnxruntime, ggml's inference time is nearly 100 times longer. The relevant part of my ggml inference code is as follows:


float * mobilenetv2_inference(ggml_tensor * input, mobilenetv2_model model, ggml_context * ctx0) {
    ggml_tensor * result = apply_conv2d(ctx0, input, model.conv2d_layers[0]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[1]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[2]);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[3]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[4]);
    ggml_tensor * result_res = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[5]);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[6]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[7]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[8]);
    result = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[9]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[10]);
    result_res = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[11]);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[12]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[13]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[14]);
    result_res = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[15]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[16]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[17]);
    result = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[18]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[19]);
    result_res = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[20]);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[21]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[22]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[23]);
    result_res = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[24]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[25]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[26]);
    result_res = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[27]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[28]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[29]);
    result = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[30]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[31]);
    result_res = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[32]);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[33]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[34]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[35]);
    result_res = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[36]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[37]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[38]);
    result = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[39]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[40]);
    result_res = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[41]);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[42]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[43]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[44]);
    result_res = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result_res, model.conv2d_layers[45]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[46]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[47]);
    result = ggml_add(ctx0, result_res, result);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[48]);
    result = apply_conv_depthwise_2d(ctx0, result, model.conv2d_layers[49]);
    result = apply_conv2d_no_clamp(ctx0, result, model.conv2d_layers[50]);
    result = apply_conv2d(ctx0, result, model.conv2d_layers[51]);
    result = ggml_pool_2d(ctx0, result, GGML_OP_POOL_AVG, 7, 7, 1, 1, 0, 0);
    result = ggml_reshape_2d(ctx0, result, result->ne[2], result->ne[3]);
    result = ggml_mul_mat(ctx0, model.gemm_layers[0].weights, result);
    result = ggml_add(ctx0, result, model.gemm_layers[0].biases);

    struct ggml_cgraph * gf = ggml_new_graph(ctx0);

    ggml_build_forward_expand(gf, result);

    const int64_t t_start_ms = ggml_time_ms();
    ggml_graph_compute_with_ctx(ctx0, gf, 1);
    const int64_t t_end_ms = ggml_time_ms();
    std::cout << "ggml_graph_compute_with_ctx exec time(ms): " << t_end_ms-t_start_ms << std::endl;

    float * output = ggml_get_data_f32(result);

    return output;
}

Is there something wrong with the way I build the model? Do you have any suggestions? Thanks in advance.

@ggerganov (Owner)

Did you build in Release? What do the apply_ functions do?

The ggml convolution operations are for sure not very optimal, but 100x difference is too much
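For reference on where the time goes: ggml_conv_2d is implemented as an im2col unfold followed by one big ggml_mul_mat (plus reshapes/permutes), so every convolution materializes an unfolded copy of its input before the mat mul runs. Below is a simplified sketch of that lowering with placeholder names; treat the exact reshape and layout details as approximate, the real code lives in ggml's source:

// simplified sketch of ggml's conv2d lowering: im2col + mat mul (illustrative, not this issue's code)
// kernel: [KW, KH, IC, OC], input: [W, H, IC, N]
static ggml_tensor * conv2d_as_im2col_matmul(ggml_context * ctx,
                                             ggml_tensor * kernel, ggml_tensor * input,
                                             int s, int p, int d) {
    // unfold the input into patches so the convolution becomes a mat mul
    ggml_tensor * im2col = ggml_im2col(ctx, kernel, input,
                                       s, s, p, p, d, d,
                                       /*is_2D=*/true, /*dst_type=*/kernel->type);
    // flatten both operands and do a single matrix multiplication
    ggml_tensor * a = ggml_reshape_2d(ctx, im2col,
                                      im2col->ne[0], im2col->ne[1] * im2col->ne[2] * im2col->ne[3]);
    ggml_tensor * b = ggml_reshape_2d(ctx, kernel,
                                      kernel->ne[0] * kernel->ne[1] * kernel->ne[2], kernel->ne[3]);
    // the real implementation reshapes/permutes the result back to image layout afterwards
    return ggml_mul_mat(ctx, a, b);
}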

@Francis235 (Author)

Thanks for your reply. I build from the master branch. The apply_ functions are wrappers around ggml_conv_2d / ggml_conv_depthwise_2d, as follows:

static ggml_tensor * apply_conv2d_no_clamp(ggml_context * ctx, ggml_tensor * input, const conv2d_layer & layer)
{
    ggml_tensor * result = ggml_conv_2d(ctx, layer.weights, input,
        layer.stride, layer.stride,
        layer.padding, layer.padding,
        layer.dilation, layer.dilation);
    return result;
}

static ggml_tensor * apply_conv2d(ggml_context * ctx, ggml_tensor * input, const conv2d_layer & layer)
{
    ggml_tensor * result = ggml_conv_2d(ctx, layer.weights, input, layer.stride, layer.stride, layer.padding, layer.padding, layer.dilation, layer.dilation);
    result = ggml_clamp(ctx, result, 0.0f, 6.0f);
    return result;
}

static ggml_tensor * apply_conv_depthwise_2d(ggml_context * ctx, ggml_tensor * input, const conv2d_layer & layer)
{
    ggml_tensor * result = ggml_conv_depthwise_2d(ctx, layer.weights, input, layer.stride, layer.stride, layer.padding, layer.padding, layer.dilation, layer.dilation);
    result = ggml_clamp(ctx, result, 0.0f, 6.0f);
    return result;
}

@Francis235 (Author)

> Did you build in Release? What do the apply_ functions do?
>
> The ggml convolution operations are for sure not very optimal, but 100x difference is too much

I tested mobilenetv2 inference on the release branch code, and the inference time was about the same.

@ggerganov (Owner)

By Release I mean to build with -O3 optimization flags. What hardware are you running on?
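For reference, with ggml's CMake build this is typically just a Release configuration (Release implies -O3 with the usual GCC/Clang toolchains); exact commands may differ for your setup:

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j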

@Francis235 (Author)

> By Release I mean to build with -O3 optimization flags. What hardware are you running on?

I rebuilt with the -O3 flag and inference got faster, but it is still not ideal: about 15x slower than onnxruntime. I am testing on my PC, CPU: Intel(R) Core(TM) i7-7560U @ 2.40GHz.

@ggerganov (Owner)

Make sure you are building with AVX2 support and ramp up the threads a bit:

const int n_threads = 4;
ggml_graph_compute_with_ctx(ctx0, gf, n_threads);
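If useful, ggml also exposes ggml_cpu_has_avx2() and related helpers, so a quick check (just a sketch, not part of the code above) can confirm which SIMD paths your build was compiled with:

// sketch: print which SIMD features this ggml build was compiled with
printf("AVX: %d, AVX2: %d, FMA: %d, F16C: %d\n",
       ggml_cpu_has_avx(), ggml_cpu_has_avx2(), ggml_cpu_has_fma(), ggml_cpu_has_f16c());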
