
feat: offload shards to S3 upload/download paths #5119

Open

moogacs wants to merge 26 commits into main from offload-upload-download
Conversation

@moogacs moogacs commented Jun 10, 2024

What's being changed:

This PR implements the tenant offloading feature to cloud storage.

  • it updates the UpdateTenants endpoint so that it can return the state after updating it (read after write), to handle cases like going from FROZEN to FREEZING.
  • all shard processes, such as upload/download, are asynchronous in nature.
  • it updates RAFT by adding new commands and an in-memory ShardProcess map, so that all nodes can upload/download and keep track of their status (see the sketch after this list).
  • for upload, all nodes are required to upload their shard content, and afterwards it is deleted on each node.
  • for download, one successful download is sufficient.
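
A minimal sketch, assuming hypothetical names (the PR's actual ShardProcess type may look different), of how an in-memory map could track per-node status, where an upload only completes once every node has uploaded its shard and a download completes as soon as any node succeeds:

package offload

import "sync"

// opStatus is a hypothetical per-node status for one offload operation.
type opStatus int

const (
	statusPending opStatus = iota
	statusSuccess
	statusFailed
)

// shardProcess sketches an in-memory record of one upload or download
// operation across all nodes that own a shard.
type shardProcess struct {
	mu       sync.Mutex
	isUpload bool
	nodes    map[string]opStatus // node name -> reported status
}

// setNodeStatus records the outcome reported by a single node.
func (p *shardProcess) setNodeStatus(node string, s opStatus) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.nodes[node] = s
}

// done reports whether the overall operation has completed:
// uploads require every node to succeed, downloads only need one.
func (p *shardProcess) done() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.isUpload {
		for _, s := range p.nodes {
			if s != statusSuccess {
				return false
			}
		}
		return true
	}
	for _, s := range p.nodes {
		if s == statusSuccess {
			return true
		}
	}
	return false
}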

Limitations to be addressed later on

  • on shard rebalance, the offloaded data needs to be adjusted accordingly, either by removing shards from or adding them to cloud storage


State machine for tenant status (a validation sketch in Go follows this list)

  • HOT
    • if the API returns this status, the shard is ready to use
    • if the user specifies this status, when the status is any other status, the shard is warmed all the way to HOT (this includes pulling from cloud storage if it was FROZEN before)
    • if the user specifies this status, when the status is already HOT, it’s a no-op
  • COLD
    • if the API returns this status, the shard is on disk, but not ready to be used
    • if the user specifies this status, when the status is FROZEN, the shard is warmed to COLD, i.e. pulled from cloud storage <-- new
    • if the user specifies this status, when the status is HOT, the shard is deactivated to COLD <--- already exists
    • if the user specifies this status, when the status is already COLD, it’s a no-op
  • FROZEN
    • if the API returns this status, the shard is on cloud storage and not on local disk
    • if the user specifies this status, when the status is any other status, the shard is frozen all the way to FROZEN, i.e. offloading it to cloud storage <-- new
    • if the user specifies this status, when the status is already FROZEN, it’s a no-op
  • UNFREEZING
    • if the API returns this status, the shard is currently being warmed up, any shard change attempt will fail
    • the user cannot set this status, this is a read-only status that only the server can return
    • NOTE: it’s enough to only have this status when going from FROZEN to a target status; going from COLD->HOT is fast enough to skip this intermediate status
  • FREEZING
    • if the API returns this status, the shard is currently being frozen, any shard change attempt will fail
    • the user cannot set this status, this is a read-only status that only the server can return
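
As a rough illustration of the transition rules above (a sketch only; names and types are hypothetical, not the PR's actual code), a validation of user-requested status changes could look like this in Go:

package tenantstatus

import "fmt"

// Tenant activity statuses as described in the state machine above.
const (
	Hot        = "HOT"
	Cold       = "COLD"
	Frozen     = "FROZEN"
	Freezing   = "FREEZING"   // read-only, set by the server
	Unfreezing = "UNFREEZING" // read-only, set by the server
)

// validateRequested checks a user-requested status change against the
// current status, following the rules described above.
func validateRequested(current, requested string) error {
	switch requested {
	case Freezing, Unfreezing:
		// read-only statuses can only be returned by the server
		return fmt.Errorf("status %q cannot be set by the user", requested)
	case Hot, Cold, Frozen:
		// while a freeze/unfreeze is in flight, any change attempt fails
		if current == Freezing || current == Unfreezing {
			return fmt.Errorf("shard is busy (%s), retry later", current)
		}
		// everything else is allowed, including no-ops when current == requested
		return nil
	default:
		return fmt.Errorf("unknown status %q", requested)
	}
}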

Missing in the current implementation

  • metrics for actual upload and download will be added as a follow-up
  • delete from cloud after download

Links
offload module
Issue

Usage

To run it locally, we need to run MinIO first and then run the Weaviate cluster.

  1. Run MinIO via Docker:
docker run \
-p 9000:9000 \
-p 9001:9001 \
--user $(id -u):$(id -g) \
--name minio1 \
-e "MINIO_ROOT_USER=aws_access_key" \
-e "MINIO_ROOT_PASSWORD=aws_secret_key" \
-v ${HOME}/minio/data:/data \
quay.io/minio/minio server /data --console-address ":9001"
  2. Run Weaviate with the offload-module, already configured to use MinIO with the keys from the previous command (a tenant-status request sketch follows these commands):
./tools/dev/run_dev_server.sh local-offload-s3
./tools/dev/run_dev_server.sh second-offload-s3
./tools/dev/run_dev_server.sh third-offload-s3
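
Once the cluster is up, a tenant can be frozen via the tenants endpoint. The sketch below is only an illustration: it assumes the usual PUT /v1/schema/{class}/tenants endpoint, a placeholder class MultiTenantClass with tenant tenantA, and that the first dev node listens on localhost:8080.

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder class and tenant names; adjust them to your schema.
	url := "http://localhost:8080/v1/schema/MultiTenantClass/tenants"
	body := []byte(`[{"name": "tenantA", "activityStatus": "FROZEN"}]`)

	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// With the read-after-write change in this PR, the response should
	// reflect the transitional state (e.g. FREEZING) rather than simply
	// echoing the requested one.
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}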

Offload module config env vars (source code); a config-reading sketch in Go follows the list

  • OFFLOAD_S3_ENDPOINT

    default: will read the AWS config from the environment

  • S3_ENDPOINT_URL

    default: will read the AWS config
    e.g. in case we want to point to a local MinIO; otherwise don’t set anything and it will be read automatically when the AWS config is exported

    export S3_ENDPOINT_URL="http://localhost:9000"
  • OFFLOAD_S3_BUCKET : the name of the bucket used to store the offloaded tenants

    default weaviate-offload

  • OFFLOAD_S3_CONCURRENCY : pass concurrency to s5cmd

    default 25

  • OFFLOAD_TIMEOUT : this is the context timeout for all the module requests (upload, download, create bucket)

    default 10 sec
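
For reference, a minimal sketch (assumed, not the module's actual code) of how these variables and their documented defaults could be read:

package offloadconfig

import (
	"os"
	"strconv"
	"time"
)

// config mirrors the documented env vars and their defaults.
type config struct {
	Endpoint    string        // OFFLOAD_S3_ENDPOINT; empty means use the AWS config
	Bucket      string        // OFFLOAD_S3_BUCKET, default "weaviate-offload"
	Concurrency int           // OFFLOAD_S3_CONCURRENCY, default 25
	Timeout     time.Duration // OFFLOAD_TIMEOUT, default 10s
}

// fromEnv reads the environment, falling back to the defaults above.
func fromEnv() config {
	cfg := config{
		Endpoint:    os.Getenv("OFFLOAD_S3_ENDPOINT"),
		Bucket:      "weaviate-offload",
		Concurrency: 25,
		Timeout:     10 * time.Second,
	}
	if b := os.Getenv("OFFLOAD_S3_BUCKET"); b != "" {
		cfg.Bucket = b
	}
	if c, err := strconv.Atoi(os.Getenv("OFFLOAD_S3_CONCURRENCY")); err == nil && c > 0 {
		cfg.Concurrency = c
	}
	if t, err := strconv.Atoi(os.Getenv("OFFLOAD_TIMEOUT")); err == nil && t > 0 {
		cfg.Timeout = time.Duration(t) * time.Second
	}
	return cfg
}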

Review checklist

  • Documentation has been updated, if necessary. Link to changed documentation:
  • Chaos pipeline run or not necessary. Link to pipeline:
  • All new code is covered by tests where it is reasonable.
  • Performance tests have been run or not necessary.

Base automatically changed from offload-module to main June 10, 2024 16:07
@moogacs moogacs force-pushed the offload-upload-download branch 13 times, most recently from a3ed0e4 to d7dd6f5 on June 12, 2024 13:30
@moogacs moogacs marked this pull request as ready for review June 12, 2024 13:37
@moogacs moogacs requested a review from a team as a code owner June 12, 2024 13:37
defer idx.backupMutex.RUnlock()

eg := enterrors.NewErrorGroupWrapper(m.logger)
eg.SetLimit(_NUMCPU * 2)

Contributor:

maybe we should not go over the _NUMCPU value, wdyt? I have seen that in most places we are constraining goroutines to just the _NUMCPU value

Contributor (Author):

well, then we have a mix, because in the same file we have eg.SetLimit(_NUMCPU * 2). I don't have a hard opinion on it; for sure I can adjust it

Contributor (Author):

I have given it a thought and I think it would be beneficial to speed up download/upload 🤔, given both are I/O-bound ops rather than CPU-heavy ops
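
For context, the pattern under discussion, written here with plain golang.org/x/sync/errgroup rather than Weaviate's enterrors wrapper (a sketch, not the PR's code): for I/O-bound transfers a limit above the CPU count keeps more uploads/downloads in flight.

package main

import (
	"fmt"
	"runtime"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	numCPU := runtime.GOMAXPROCS(0)

	var eg errgroup.Group
	// 2*numCPU is the value being discussed; CPU-bound work would
	// usually stay at numCPU instead.
	eg.SetLimit(numCPU * 2)

	for i := 0; i < 10; i++ {
		i := i
		eg.Go(func() error {
			time.Sleep(100 * time.Millisecond) // stand-in for an S3 upload/download
			fmt.Println("transfer", i, "done")
			return nil
		})
	}
	if err := eg.Wait(); err != nil {
		fmt.Println("error:", err)
	}
}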

@moogacs moogacs force-pushed the offload-upload-download branch 2 times, most recently from 271ffeb to d710625 on June 13, 2024 09:55
@moogacs moogacs force-pushed the offload-upload-download branch 4 times, most recently from c67d276 to 4d2198e on June 13, 2024 15:38

sonarcloud bot commented Jun 13, 2024

Quality Gate failed

Failed conditions
39.7% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

@moogacs moogacs changed the title from "feat: offload to S3 upload/download paths" to "feat: offload shards to S3 upload/download paths" on Jun 14, 2024
@@ -58,6 +58,7 @@ func startMinIO(ctx context.Context, networkName string) (*DockerContainer, erro
}
envSettings := make(map[string]string)
envSettings["BACKUP_S3_ENDPOINT"] = fmt.Sprintf("%s:%s", MinIO, port.Port())
envSettings["S3_ENDPOINT_URL"] = fmt.Sprintf("http://%s:%s", MinIO, port.Port())

Contributor:

what is the difference between OFFLOAD_S3_ENDPOINT and S3_ENDPOINT_URL? can we just use OFFLOAD_S3_ENDPOINT? this brings a little bit of ambiguity
