
feat: offload shards to S3 upload/download paths #5119

Open

moogacs wants to merge 26 commits into main from offload-upload-download
Conversation

@moogacs moogacs commented Jun 10, 2024

What's being changed:

This PR implements the tenant offloading feature to cloud storage.

  • it updates the UpdateTenants endpoint so that it can return the state after updating it (read after write), to handle cases like going from FROZEN to FREEZING.
  • all shard processes, such as upload/download, are asynchronous in nature.
  • it updates RAFT by adding new commands and an in-memory ShardProcess map, so that all nodes can upload/download and keep track of their status (see the sketch after this list).
  • for upload, all nodes are required to upload their shard content, and afterwards it is deleted on each node.
  • for download, one successful download is sufficient.
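
A minimal sketch, assuming hypothetical names (the PR's actual ShardProcess type may look different), of how an in-memory map could track per-node status, where an upload only completes once every node has uploaded its shard and a download completes as soon as any node succeeds:

package offload

import "sync"

// opStatus is a hypothetical per-node status for one offload operation.
type opStatus int

const (
	statusPending opStatus = iota
	statusSuccess
	statusFailed
)

// shardProcess sketches an in-memory record of one upload or download
// operation across all nodes that own a shard.
type shardProcess struct {
	mu       sync.Mutex
	isUpload bool
	nodes    map[string]opStatus // node name -> reported status
}

// setNodeStatus records the outcome reported by a single node.
func (p *shardProcess) setNodeStatus(node string, s opStatus) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.nodes[node] = s
}

// done reports whether the overall operation has completed:
// uploads require every node to succeed, downloads only need one.
func (p *shardProcess) done() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.isUpload {
		for _, s := range p.nodes {
			if s != statusSuccess {
				return false
			}
		}
		return true
	}
	for _, s := range p.nodes {
		if s == statusSuccess {
			return true
		}
	}
	return false
}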

Limitations to be addressed later on

  • on shard rebalance, the offloaded data needs to be adjusted accordingly, either by removing shards from or adding them to cloud storage


State machine for tenant status (a validation sketch in Go follows this list)

  • HOT
    • if the API returns this status, the shard is ready to use
    • if the user specifies this status, when the status is any other status, the shard is warmed all the way to HOT (this includes pulling from cloud storage if it was FROZEN before)
    • if the user specifies this status, when the status is already HOT, it’s a no-op
  • COLD
    • if the API returns this status, the shard is on disk, but not ready to be used
    • if the user specifies this status, when the status is FROZEN, the shard is warmed to COLD, i.e. pulled from cloud storage <-- new
    • if the user specifies this status, when the status is HOT, the shard is deactivated to COLD <--- already exists
    • if the user specifies this status, when the status is already COLD, it’s a no-op
  • FROZEN
    • if the API returns this status, the shard is on cloud storage and not on local disk
    • if the user specifies this status, when the status is any other status, the shard is frozen all the way to FROZEN, i.e. offloading it to cloud storage <-- new
    • if the user specifies this status, when the status is already FROZEN, it’s a no-op
  • UNFREEZING
    • if the API returns this status, the shard is currently being warmed up, any shard change attempt will fail
    • the user cannot set this status, this is a read-only status that only the server can return
    • NOTE: it’s enough to only have this status when going from FROZEN to a target status; going from COLD->HOT is fast enough to skip this intermediate status
  • FREEZING
    • if the API returns this status, the shard is currently being frozen, any shard change attempt will fail
    • the user cannot set this status, this is a read-only status that only the server can return
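
As a rough illustration of the transition rules above (a sketch only; names and types are hypothetical, not the PR's actual code), a validation of user-requested status changes could look like this in Go:

package tenantstatus

import "fmt"

// Tenant activity statuses as described in the state machine above.
const (
	Hot        = "HOT"
	Cold       = "COLD"
	Frozen     = "FROZEN"
	Freezing   = "FREEZING"   // read-only, set by the server
	Unfreezing = "UNFREEZING" // read-only, set by the server
)

// validateRequested checks a user-requested status change against the
// current status, following the rules described above.
func validateRequested(current, requested string) error {
	switch requested {
	case Freezing, Unfreezing:
		// read-only statuses can only be returned by the server
		return fmt.Errorf("status %q cannot be set by the user", requested)
	case Hot, Cold, Frozen:
		// while a freeze/unfreeze is in flight, any change attempt fails
		if current == Freezing || current == Unfreezing {
			return fmt.Errorf("shard is busy (%s), retry later", current)
		}
		// everything else is allowed, including no-ops when current == requested
		return nil
	default:
		return fmt.Errorf("unknown status %q", requested)
	}
}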

Missing in the current implementation

  • metrics for actual upload and download will be added as a follow-up
  • delete from cloud after download

Links
offload module
Issue

Usage

To run it locally, we need to run MinIO first and then run the Weaviate cluster.

  1. Run MinIO via Docker:
docker run \
-p 9000:9000 \
-p 9001:9001 \
--user $(id -u):$(id -g) \
--name minio1 \
-e "MINIO_ROOT_USER=aws_access_key" \
-e "MINIO_ROOT_PASSWORD=aws_secret_key" \
-v ${HOME}/minio/data:/data \
quay.io/minio/minio server /data --console-address ":9001"
  2. Run Weaviate with the offload-module, already configured to use MinIO with the keys from the previous command (a tenant-status request sketch follows these commands):
./tools/dev/run_dev_server.sh local-offload-s3
./tools/dev/run_dev_server.sh second-offload-s3
./tools/dev/run_dev_server.sh third-offload-s3
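
Once the cluster is up, a tenant can be frozen via the tenants endpoint. The sketch below is only an illustration: it assumes the usual PUT /v1/schema/{class}/tenants endpoint, a placeholder class MultiTenantClass with tenant tenantA, and that the first dev node listens on localhost:8080.

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder class and tenant names; adjust them to your schema.
	url := "http://localhost:8080/v1/schema/MultiTenantClass/tenants"
	body := []byte(`[{"name": "tenantA", "activityStatus": "FROZEN"}]`)

	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// With the read-after-write change in this PR, the response should
	// reflect the transitional state (e.g. FREEZING) rather than simply
	// echoing the requested one.
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}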

Offload module config env vars (source code); a config-reading sketch in Go follows the list

  • OFFLOAD_S3_ENDPOINT

    default: will read the AWS config from the environment

  • S3_ENDPOINT_URL

    default: will read the AWS config
    e.g. in case we want to point to a local MinIO; otherwise don’t set anything and it will be read automatically when the AWS config is exported

    export S3_ENDPOINT_URL="http://localhost:9000"
  • OFFLOAD_S3_BUCKET : the name of the bucket used to store the offloaded tenants

    default weaviate-offload

  • OFFLOAD_S3_CONCURRENCY : pass concurrency to s5cmd

    default 25

  • OFFLOAD_TIMEOUT : this is the context timeout for all the module requests (upload, download, create bucket)

    default 10 sec
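
For reference, a minimal sketch (assumed, not the module's actual code) of how these variables and their documented defaults could be read:

package offloadconfig

import (
	"os"
	"strconv"
	"time"
)

// config mirrors the documented env vars and their defaults.
type config struct {
	Endpoint    string        // OFFLOAD_S3_ENDPOINT; empty means use the AWS config
	Bucket      string        // OFFLOAD_S3_BUCKET, default "weaviate-offload"
	Concurrency int           // OFFLOAD_S3_CONCURRENCY, default 25
	Timeout     time.Duration // OFFLOAD_TIMEOUT, default 10s
}

// fromEnv reads the environment, falling back to the defaults above.
func fromEnv() config {
	cfg := config{
		Endpoint:    os.Getenv("OFFLOAD_S3_ENDPOINT"),
		Bucket:      "weaviate-offload",
		Concurrency: 25,
		Timeout:     10 * time.Second,
	}
	if b := os.Getenv("OFFLOAD_S3_BUCKET"); b != "" {
		cfg.Bucket = b
	}
	if c, err := strconv.Atoi(os.Getenv("OFFLOAD_S3_CONCURRENCY")); err == nil && c > 0 {
		cfg.Concurrency = c
	}
	if t, err := strconv.Atoi(os.Getenv("OFFLOAD_TIMEOUT")); err == nil && t > 0 {
		cfg.Timeout = time.Duration(t) * time.Second
	}
	return cfg
}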

Review checklist

  • Documentation has been updated, if necessary. Link to changed documentation:
  • Chaos pipeline run or not necessary. Link to pipeline:
  • All new code is covered by tests where it is reasonable.
  • Performance tests have been run or not necessary.

Base automatically changed from offload-module to main June 10, 2024 16:07
@moogacs moogacs force-pushed the offload-upload-download branch 13 times, most recently from a3ed0e4 to d7dd6f5 on June 12, 2024 13:30
@moogacs moogacs marked this pull request as ready for review June 12, 2024 13:37
@moogacs moogacs requested a review from a team as a code owner June 12, 2024 13:37
defer idx.backupMutex.RUnlock()

eg := enterrors.NewErrorGroupWrapper(m.logger)
eg.SetLimit(_NUMCPU * 2)

Contributor:

maybe we should not go over the _NUMCPU value, wdyt? I have seen that in most places we are constraining goroutines to just the _NUMCPU value

Contributor (Author):

well, then we have a mix, because in the same file we have eg.SetLimit(_NUMCPU * 2). I don't have a hard opinion on it; for sure I can adjust it

Contributor (Author):

I have given it a thought and I think it would be beneficial to speed up download/upload 🤔, given both are I/O-bound ops rather than CPU-heavy ops
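
For context, the pattern under discussion, written here with plain golang.org/x/sync/errgroup rather than Weaviate's enterrors wrapper (a sketch, not the PR's code): for I/O-bound transfers a limit above the CPU count keeps more uploads/downloads in flight.

package main

import (
	"fmt"
	"runtime"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	numCPU := runtime.GOMAXPROCS(0)

	var eg errgroup.Group
	// 2*numCPU is the value being discussed; CPU-bound work would
	// usually stay at numCPU instead.
	eg.SetLimit(numCPU * 2)

	for i := 0; i < 10; i++ {
		i := i
		eg.Go(func() error {
			time.Sleep(100 * time.Millisecond) // stand-in for an S3 upload/download
			fmt.Println("transfer", i, "done")
			return nil
		})
	}
	if err := eg.Wait(); err != nil {
		fmt.Println("error:", err)
	}
}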

@moogacs moogacs force-pushed the offload-upload-download branch 2 times, most recently from 271ffeb to d710625 on June 13, 2024 09:55
@moogacs moogacs force-pushed the offload-upload-download branch 4 times, most recently from c67d276 to 4d2198e on June 13, 2024 15:38

sonarcloud bot commented Jun 13, 2024

Quality Gate failed

Failed conditions
39.7% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

@moogacs moogacs changed the title from "feat: offload to S3 upload/download paths" to "feat: offload shards to S3 upload/download paths" on Jun 14, 2024
@@ -58,6 +58,7 @@ func startMinIO(ctx context.Context, networkName string) (*DockerContainer, erro
}
envSettings := make(map[string]string)
envSettings["BACKUP_S3_ENDPOINT"] = fmt.Sprintf("%s:%s", MinIO, port.Port())
envSettings["S3_ENDPOINT_URL"] = fmt.Sprintf("http://%s:%s", MinIO, port.Port())

Contributor:

what is the difference between OFFLOAD_S3_ENDPOINT and S3_ENDPOINT_URL? can we just use OFFLOAD_S3_ENDPOINT? this brings a little bit of ambiguity
