[SUPPORT] - Performance Variation in Hudi 0.14 #11481

Open
RuyRoaV opened this issue Jun 21, 2024 · 3 comments

@RuyRoaV

RuyRoaV commented Jun 21, 2024

Describe the problem you faced

We have a Glue 4.0 job that performs an upsert on a Hudi-managed COW table. On some occasions the Glue job runs in under 5 minutes, whereas on others it runs for up to 20 minutes. In the slow runs, we have noticed that the job spends over 17 minutes on the "count at HoodieSparkSqlWriter.scala:1072" action; in other runs this action takes only around 1 minute.

Regarding some specifications for the table:

We have 3 partition fields:

  • year : int
  • month: int
  • day : int

A precombine field:

  • epoch: bigint

and 3 recordkey fields:

  • node_id : string
  • container_id : string
  • container_label: string

You can see more about the table description here:

Screenshot 2024-06-21 at 13 31 19

We are also using a BLOOM type index and these are some other configurations that we are setting.

Screenshot 2024-06-21 at 13 12 16
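For readers who cannot see the screenshots, the table layout described above could be expressed roughly as the following Hudi write options. This is only a sketch: the table name `container_metrics` is hypothetical, the use of `ComplexKeyGenerator` for the composite key is an assumption, and the other configurations shown in the screenshot are not reproduced here.

```python
# Sketch of Hudi write options for the table described in this issue.
# Assumptions: the table name is hypothetical, and ComplexKeyGenerator is
# assumed because the record key is made of several fields.
hudi_options = {
    "hoodie.table.name": "container_metrics",  # hypothetical name
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    # Three record key fields, comma separated.
    "hoodie.datasource.write.recordkey.field": "node_id,container_id,container_label",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    # Three partition fields.
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.precombine.field": "epoch",
    "hoodie.index.type": "BLOOM",
}

# In the Glue job these options would typically be passed to the writer,
# where `df` and `base_path` are placeholders for the job's own values:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```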

Could you please advise us on which actions we should take to bring down the execution time?

Expected behavior

We would like to understand why we are seeing this variation in execution times, and we would appreciate advice on the actions needed to prevent this behaviour.

Environment Description

  • Glue version: 4

  • Worker Type: G.2x

  • Hudi version : 0.14.1

  • Spark version : 3.3

  • Max DPU Capacity: 120

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : No

@ad1happy2go
Collaborator

@RuyRoaV Can you provide the event logs or Spark UI screenshots?

Regarding the configurations, I recommend not archiving beyond the savepoint. You can also try the SIMPLE index once: for use cases where most of the file groups are updated, the SIMPLE index performs much better.
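A trial run with these two suggestions might look like the sketch below. It assumes the job's options are held in a plain dict and that archiving beyond the savepoint is currently enabled (the original settings are only visible in the screenshot); `with_simple_index` is a hypothetical helper, not part of Hudi.

```python
def with_simple_index(options):
    """Return a copy of the write options adjusted for a SIMPLE index trial.

    Hypothetical helper: it only edits a dict and does not call Hudi.
    """
    updated = dict(options)  # leave the caller's dict unchanged
    updated["hoodie.index.type"] = "SIMPLE"
    # Do not archive timeline instants beyond the savepoint.
    updated["hoodie.archive.beyond.savepoint"] = "false"
    return updated


# Assumed starting point; the real values are in the job's own configuration.
current_options = {
    "hoodie.index.type": "BLOOM",
    "hoodie.archive.beyond.savepoint": "true",
}
trial_options = with_simple_index(current_options)
```

Comparing the duration of the count stage across a few runs with `current_options` and `trial_options` would show whether the index lookup is what drives the variation.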

@RuyRoaV
Author

RuyRoaV commented Jun 24, 2024

Hello @ad1happy2go

I have attached some screenshots of the Spark UI. Is there any specific screen that you'd like to see?

Screenshot 2024-06-24 at 13 23 33

Screenshot 2024-06-24 at 13 23 53

Thanks for the input, will take that into account. In some other GitHub issues I've also seen switching to an RLI (record-level index) being recommended. Would that work for a COW table, or would the SIMPLE index still be a better approach?

Best regards,

@ad1happy2go
Collaborator

@RuyRoaV RLI will work if you need a global index. It works for COW tables as well.
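For reference, enabling RLI on Hudi 0.14 is usually a matter of turning on the metadata table's record index and selecting it as the index type. The snippet below is a minimal sketch under that assumption; `with_record_level_index` is a hypothetical helper, and the starting options are illustrative rather than taken from the job.

```python
def with_record_level_index(options):
    """Return a copy of the write options with the record-level index enabled.

    Hypothetical helper: it only edits a dict and does not call Hudi.
    """
    updated = dict(options)  # leave the caller's dict unchanged
    # The record index lives in the metadata table, so that must be enabled.
    updated["hoodie.metadata.enable"] = "true"
    updated["hoodie.metadata.record.index.enable"] = "true"
    updated["hoodie.index.type"] = "RECORD_INDEX"
    return updated


# Illustrative starting point only.
rli_options = with_record_level_index({"hoodie.index.type": "BLOOM"})
```

Because RLI is a global index, record keys are treated as unique across all partitions, which is worth checking against how `node_id`, `container_id`, and `container_label` behave across days before switching.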
