Describe the problem you faced
We have a Glue 4.0 job that performs an upsert on a Hudi-managed COW table. On some occasions the Glue job runs in under 5 minutes, whereas on others it runs for up to 20 minutes. In those slow runs, we have noticed that the `count at HoodieSparkSqlWriter.scala:1072` action takes over 17 minutes; in other runs it only takes around 1 minute.
Some specifications for the table:
We have 3 partition fields:
year : int
month: int
day : int
A precombine field:
epoch: bigint
and 3 recordkey fields:
node_id : string
container_id : string
container_label: string
You can see more about the table description here:
We are also using a BLOOM-type index; these are some of the other configurations that we are setting.
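For reference, a write configuration matching the table described above might look like the following. This is a hypothetical sketch: the table name and S3 path are assumptions, and the actual job may set additional options not shown here.

```python
# Sketch of Hudi write options for an upsert on this COW table.
# Field names (recordkey, precombine, partition) are taken from the
# issue description; the table name is a placeholder.
hudi_options = {
    "hoodie.table.name": "example_table",  # assumed name
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "node_id,container_id,container_label",
    "hoodie.datasource.write.precombine.field": "epoch",
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.index.type": "BLOOM",
}

# In the Glue job this would be applied roughly as:
# df.write.format("hudi").options(**hudi_options).mode("append").save(s3_path)
```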
Could you please advise us on which actions we should take to bring down the execution time?
Expected behavior
We would like to understand why we are seeing this variation in the execution times, and would appreciate advice on the actions needed to prevent this behaviour.
Environment Description
Glue version: 4
Worker Type: G.2x
Hudi version : 0.14.1
Spark version : 3.3
Max DPU Capacity: 120
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : No
Regarding configurations, I recommend not enabling archiving beyond the savepoint. You could also try the SIMPLE index: for use cases where most of the file groups are updated, the SIMPLE index performs much better.
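The suggestion above could be tried by adjusting two options in the job's Hudi configuration. This is a sketch based on the option names in the Hudi 0.14 configuration reference; `hoodie.archive.beyond.savepoint` defaults to false, so it only needs to be set explicitly if it was previously enabled.

```python
# Sketch: switch the index type to SIMPLE and keep archiving from
# going beyond the last savepoint, per the suggestion above.
tuning_options = {
    "hoodie.index.type": "SIMPLE",
    "hoodie.archive.beyond.savepoint": "false",
}

# These would be merged into the existing write options, e.g.:
# df.write.format("hudi").options(**hudi_options, **tuning_options)...
```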
I have attached some screenshots of the Spark UI. Is there any specific screen that you'd like to see?
Thanks for the input, I will take that into account. I've also seen changing to an RLI (record-level index) recommended on some other GitHub issues. Would that work for a COW table, or would the SIMPLE index still be a better approach?
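For context, the record-level index mentioned above was introduced in Hudi 0.14 and, as I understand it, is stored in the metadata table and is not tied to a specific table type. Enabling it would look roughly like the following sketch; option names are from the 0.14 release notes and may need verification against the deployed version.

```python
# Sketch: enabling the record-level index (RLI) in Hudi 0.14.x.
# RLI lives in the metadata table, so the metadata table and the
# record index partition both need to be enabled.
rli_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.record.index.enable": "true",
    "hoodie.index.type": "RECORD_INDEX",
}
```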