POC: Tempting to introduce peer-to-peer chunk data exchange #728

Draft · wants to merge 22 commits into main

Conversation

@jqdai commented Sep 28, 2023

What do these changes do?

This is a proof-of-concept draft pull request.

In Xorbits, storage is used to hold intermediate data and final results during computation, supporting various types of storage such as GPU memory, main memory, and disk. Currently in Xorbits, data produced by workers is stored and managed by a centralized storage_api.

This project introduces peer-to-peer data storage and communication, where each Xorbits worker holds its own data locally. A meta_api maintains the keys of data together with the address of the worker that produced them. Each subtask runner holds an independent RunnerStorage that maintains all data created in that runner and responds to requests for data it holds. When a runner needs non-local data, it looks up the meta_api to find the address of the runner holding that data, then fetches the data directly. A centralized data store is thus no longer necessary, which may yield a speedup.
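For illustration, here is a minimal sketch of the fetch path described above. `mo.actor_ref` and `RunnerStorageActor.gen_uid` appear later in this PR's diff; the import paths, the `"slot_ids"` meta field, and the `get_data` call are assumptions about the draft's interfaces, not code copied from the diff:

```python
from xorbits._mars import oscar as mo
# Assumed location of the actor introduced by this PR; the real path may differ.
from xorbits._mars.services.subtask.worker.runner import RunnerStorageActor


async def fetch_chunk(meta_api, data_key):
    # 1. Ask the meta service which band/slot produced the chunk.
    #    "bands" is an existing chunk-meta field; "slot_ids" is assumed here.
    meta = await meta_api.get_chunk_meta(data_key, fields=["bands", "slot_ids"])
    address, band_name = meta["bands"][0]
    slot_id = meta["slot_ids"][0]
    # 2. Resolve a ref to the RunnerStorage of the producing runner.
    runner_storage = await mo.actor_ref(
        uid=RunnerStorageActor.gen_uid(band_name, slot_id),
        address=address,
    )
    # 3. Fetch the data directly from the peer; no centralized storage involved.
    return await runner_storage.get_data(data_key)  # get_data name assumed
```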

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass

@XprobeBot XprobeBot added the gpu label Sep 28, 2023
@XprobeBot XprobeBot added this to the v0.7.0 milestone Sep 28, 2023
@jqdai jqdai changed the title POC: temping to introduce peer-to-peer chunk data exchange POC: Tempting to introduce peer-to-peer chunk data exchange Sep 29, 2023
# )
# self.result.status = SubtaskStatus.errored
# raise
runner_storage: RunnerStorageActor = await mo.create_actor(
Contributor:

Why is this creation needed? The RunnerStorageActor is already created by SubtaskRunnerManagerActor.

Author:

There used to be a 'RunnerStorage not found' error during debugging. The implementation has now been changed to raise an exception when the runner storage is not found. Thank you.
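For concreteness, a minimal sketch of the "look up, don't create" behavior this exchange settles on, with `RunnerStorageActor` assumed to be importable as in the earlier sketch; the exception handling is an assumption, since the exact error oscar raises for a missing actor is not shown in this thread:

```python
from xorbits._mars import oscar as mo


async def _get_runner_storage(address: str, band_name: str, slot_id: int):
    # SubtaskRunnerManagerActor already creates the RunnerStorageActor, so we
    # only look up a ref here and surface an error if the actor is missing,
    # rather than calling mo.create_actor a second time.
    try:
        return await mo.actor_ref(
            uid=RunnerStorageActor.gen_uid(band_name, slot_id),
            address=address,
        )
    except Exception as exc:  # assumption: oscar raises when the actor is absent
        raise RuntimeError(
            f"RunnerStorage not found for band {band_name!r}, slot {slot_id}"
        ) from exc
```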

@@ -341,13 +388,40 @@ async def _store_data(self, chunk_graph: ChunkGraph):
storage_level,
)
if puts:
try:
runner_storage: RunnerStorageActor = await mo.actor_ref(
uid=RunnerStorageActor.gen_uid(self._band[1], self._slot_id),
Contributor:

Just store the data in the current RunnerStorageActor. It's fine to get the actor ref and call the put_data API; in-process actor communication is automatically optimized into a function call.

Author:

Here we use mo.actor_ref to get the runner storage of the current subtask runner because RunnerStorage is currently not a property of the SubtaskProcessor or SubtaskRunnerActor class; the processors of a runner therefore may not have direct access to that runner's storage. I wonder if there is a better or more direct implementation.
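One possible shape for the "more direct implementation" asked about above, sketched as an assumption rather than as this PR's actual design: the owning SubtaskRunnerActor resolves the ref once and injects it into each processor. The constructor parameters and the `put_data` call below are hypothetical:

```python
class SubtaskProcessor:
    def __init__(self, subtask, band, slot_id, runner_storage, **kwargs):
        # Hypothetical extra argument: the owning SubtaskRunnerActor resolves
        # the RunnerStorageActor ref once and injects it here, so processors
        # never need to call mo.actor_ref (or guess an address) themselves.
        self._subtask = subtask
        self._band = band
        self._slot_id = slot_id
        self._runner_storage = runner_storage

    async def _store_data(self, chunk_graph):
        ...
        # Store locally through the injected ref; in-process actor calls are
        # optimized to plain function calls, as the reviewer notes above.
        await self._runner_storage.put_data(...)  # put_data signature assumed
```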

Author:

Additionally, I am sometimes unsure whether the address of the runner storage should be set to the processor's supervisor_address or to the address stored in the band (band[0], specifically). The current implementation uses the former, but I am not entirely sure about it. SubtaskRunnerActor uses self.address, which is a property not visible in its constructor.

put_infos = asyncio.create_task(self._storage_api.put.batch(*puts))
try:
store_infos = await put_infos
for store_key, store_info in zip(stored_keys, store_infos):
data_key_to_store_size[store_key] = store_info.store_size
data_key_to_memory_size[store_key] = store_info.memory_size
data_key_to_object_id[store_key] = store_info.object_id
data_key_to_band[store_key] = self._band
data_key_to_slot_id[store_key] = self._slot_id
Contributor:

Is the slot id not stored in the meta?

Author:

In line 562 of python/xorbits/_mars/services/subtask/worker/processor.py, in _store_meta(), we directly add the current band and slot_id (self._band and self._slot_id, specifically). For that reason we did not modify the original _store_mapper_data of SubtaskProcessor to store mappings from data_key to band or slot_id. Is this still necessary?
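A sketch of what this reply describes, assuming Mars' set_chunk_meta accepts a bands field and treating slot_ids as this PR's addition (the field name is assumed, not taken from the diff):

```python
async def _store_meta(self, chunk, memory_size: int, store_size: int):
    # Alongside the usual size fields, record which band and slot produced
    # the chunk, so peers can later locate its RunnerStorage via the meta.
    await self._meta_api.set_chunk_meta(
        chunk,
        memory_size=memory_size,
        store_size=store_size,
        bands=[self._band],        # existing chunk-meta field
        slot_ids=[self._slot_id],  # field added by this PR; name assumed
    )
```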

Author:

Meanwhile, in python/xorbits/_mars/services/storage/handler.py#L639, StorageHandler calls the modified meta_api.add_chunk_bands, which needs [slot_id] as an additional input that StorageHandlerActor itself does not provide. This is still unsolved.
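To make the unresolved mismatch concrete, a rough sketch; the signatures below paraphrase this comment and are not copied from the diff:

```python
class MetaAPI:
    # Modified signature as described in the comment: slot ids now travel
    # alongside bands.
    async def add_chunk_bands(self, object_id, bands, slot_ids):
        ...


async def _handler_call_site(meta_api, object_id, bands):
    # StorageHandlerActor (handler.py#L639, paraphrased) knows which bands it
    # wrote to, but holds no slot id of its own, so the new required argument
    # cannot be filled in; this is the open issue.
    await meta_api.add_chunk_bands(object_id, bands, slot_ids=None)  # unresolved
```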

@XprobeBot XprobeBot modified the milestones: v0.7.0, v0.7.1 Oct 23, 2023
@XprobeBot XprobeBot modified the milestones: v0.7.1, v0.7.2 Nov 21, 2023
@XprobeBot XprobeBot modified the milestones: v0.7.2, v0.7.3 Jan 5, 2024