Large Task Deserialization Time during Optimization #333

Jiaweihu08 · 2024-06-13T09:28:02Z

What went wrong?

Task deserialization time is very high during optimization, specifically in the last collect from `RollupDataWriter.compact()`.

`IndexStatus.cubeStatuses` is packaged within each task, and its size grows with the size of the metadata.
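To illustrate why packaging a large structure inside each task is costly, here is a minimal sketch that contrasts capturing a map in the task closure with shipping it as a Spark broadcast variable. It is not the project's code: `BroadcastSketch`, `lookupWithClosure`, `lookupWithBroadcast`, and the `Map[String, Double]` standing in for the cube metadata are all hypothetical names and shapes chosen for the example. Only `SparkContext.broadcast`, `parallelize`, `map`, and `collect` are actual Spark APIs.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.SparkSession

object BroadcastSketch {

  // Closure capture: `weights` is part of the serialized task closure,
  // so every task has to deserialize the whole map before doing any work.
  def lookupWithClosure(
      sc: SparkContext,
      weights: Map[String, Double],
      ids: Seq[String]): Array[Double] =
    sc.parallelize(ids, numSlices = 4)
      .map(id => weights.getOrElse(id, 0.0))
      .collect()

  // Broadcast: the closure only carries a small handle; each executor
  // fetches and deserializes the value once and its tasks reuse it.
  def lookupWithBroadcast(
      sc: SparkContext,
      weights: Map[String, Double],
      ids: Seq[String]): Array[Double] = {
    val weightsBc: Broadcast[Map[String, Double]] = sc.broadcast(weights)
    try {
      sc.parallelize(ids, numSlices = 4)
        .map(id => weightsBc.value.getOrElse(id, 0.0))
        .collect()
    } finally {
      weightsBc.destroy() // release the broadcast blocks once the job is done
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("broadcast-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical stand-in for large index metadata (assumption, not the real type).
    val weights: Map[String, Double] =
      (0 until 100000).map(i => s"cube-$i" -> i.toDouble).toMap
    val ids = Seq("cube-1", "cube-42", "cube-99999", "missing")

    val viaClosure = lookupWithClosure(sc, weights, ids)
    val viaBroadcast = lookupWithBroadcast(sc, weights, ids)

    // Both strategies return the same values; only the shipping cost differs.
    assert(viaClosure.sameElements(viaBroadcast))
    spark.stop()
  }
}
```

Both functions compute the same result. The difference should show up in the Task Deserialization Time column of the Spark UI as the map grows.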

How to reproduce?

Optimize a relatively large table and compare the Task Deserialization Time of the second collect with that of the execute. The values from the execute should be an order of magnitude smaller.
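Besides reading the values off the Spark UI, the comparison can be made programmatically. The following is a rough sketch, assuming a listener is registered before running the optimization. `DeserializeTimeListener` and `totalsByStage` are hypothetical names for this example, while `SparkListener`, `SparkListenerTaskEnd`, and `TaskMetrics.executorDeserializeTime` (in milliseconds) are part of Spark's public API.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import scala.collection.mutable

// Sums the executor-side task deserialization time per stage.
class DeserializeTimeListener extends SparkListener {
  private val totals = mutable.Map.empty[Int, Long].withDefaultValue(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    val metrics = taskEnd.taskMetrics
    // taskMetrics can be null for tasks that failed before reporting metrics.
    if (metrics != null) {
      totals(taskEnd.stageId) += metrics.executorDeserializeTime
    }
  }

  // Snapshot of the accumulated milliseconds, keyed by stage id.
  def totalsByStage: Map[Int, Long] = synchronized(totals.toMap)
}
```

A possible usage: create the listener, register it with `spark.sparkContext.addSparkListener(listener)`, run the optimization, and then inspect `listener.totalsByStage`. Listener events are delivered asynchronously, so the totals may lag slightly behind the end of the job.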

2. Branch and commit id: `main`, `b7f1906`

3. Spark version: `3.5.0`

4. Hadoop version: `3.3.4`

5. How are you running Spark? `locally`, `distributed`

Jiaweihu08 added the bug Something isn't working label Jun 13, 2024

Jiaweihu08 linked a pull request Jun 13, 2024 that will close this issue

Issue 333: Broadcast cube weights during optimization file writing #334

Open

6 tasks

cugni mentioned this issue Jun 14, 2024

Dataset api #335

Draft

3 tasks

cdelfosse assigned Jiaweihu08 Jun 17, 2024
