proposal: Reliability of kmesh restart #417

Open: nlgwcy wants to merge 3 commits into main

@nlgwcy (Contributor) commented Jun 5, 2024

What type of PR is this?

/kind enhancement
What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #322

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


@kmesh-bot added the kind/enhancement label Jun 5, 2024
@kmesh-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from nlgwcy. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


In normal and abnormal restart scenarios, the data plane governance function is not affected, including:

1. After the pod is restarted, it is automatically restored to the status before the restart.
Member

Suggested change
1. After the pod is restarted, it is automatically restored to the status before the restart.
1. After kmesh daemon is restarted, it is automatically restored to the status before.

In normal and abnormal restart scenarios, the data plane governance function is not affected, including:

1. After the pod is restarted, it is automatically restored to the status before the restart.
2. During the Kmesh restart, newly established links have the governance capability.
Member

Suggested change
2. During the Kmesh restart, newly established links have the governance capability.
2. During the Kmesh restart, newly established connections should be governed by Kmesh.


1. After the pod is restarted, it is automatically restored to the status before the restart.
2. During the Kmesh restart, newly established links have the governance capability.
3. After the Kmesh is restarted, the newly created link has the governance capability.
Member

ditto

2. During the Kmesh restart, newly established links have the governance capability.
3. After the Kmesh is restarted, the newly created link has the governance capability.
4. The governance capability of old links is not affected by restart.
5. When Kmesh is deleted, resources can be completely cleared to avoid residual resources.
Member

Suggested change
5. When Kmesh is deleted, resources can be completely cleared to avoid residual resources.
5. After Kmesh is uninstalled, resources should be completely cleared to avoid leaking resources.

2. After the Kmesh is restarted, the configuration restoration module is added to restore the governance information of the pods managed by the Kmesh on the node.
3. In the scenario where the Kmesh is deleted, the bpf prog and bpf map resources must be cleared in a timely manner to ensure that no resource leakage occurs.

When the Kmesh is started, the bpf prog and bpf map may already exist on the node. In this case, compatibility processing is required.
Member

Can you elaborate on compatibility processing?


| Scenario | workload model | Kmesh management | Kmesh byPass | Auth | waypoint Interconnection |
| ------------ | -------------- | ---------------- | ------------ | ---- | ------------------------------------------------------------ |
| Existed link | Y | Y | Y | Y | P(If the connection is complete but the send request has not been sent) |
Member

What do Y and N mean?

- eBPF management

- To keep the governance capability during the Kmesh restart, you need to place the bpf prog and bpf map pin to the /sys/fs/bpf directory on the node.
- During the startup of the Kmesh, the BPF PROG is replaced and the BPF MAP is inherited.
Member

BPF PROG is replaced

How will this influence the governing prog?

- Configure recovery

- Workload resources
- During the startup of the Kmesh, the local bpf map data is restored to the user-mode cache before the Kmesh subscribes to workload model data from the Istiod.
Member

Based on our offline discussion, it is still not clear how the user-space cache can be restored correctly.

- During the startup of the Kmesh, the local bpf map data is restored to the user-mode cache before the Kmesh subscribes to workload model data from the Istiod.
- Incremental subscription to Istiod based on workload resources in the local cache;
- Kmesh management/byPass
- Move the management logic from kmesh-plg to kmesh-daemon.
Member

kmesh-plg/kmesh-cni ?

- Authentication
- Restore the user-mode authentication cache information based on the authentication rule delivered by the Istiod.
- Waypoint interconnection
- Restore the user-mode authentication cache information based on the authentication rule delivered by the Istiod. The procedure is the same as that for restoring workload resources.
Member

?


- Postponed clearance

When the Kmesh exits, start a delay clearing task on the node. If the Kmesh restarts abnormally, delete the scheduled task after the restart.
Member

Alternatively, we can detect whether there is a new kmesh daemon pod on the same node, and only clear the resources if no such pod exists.
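
A minimal sketch of this alternative, assuming the exiting daemon can obtain a list of pods with their node and running state. The `pod` type and `shouldCleanup` function are hypothetical and are not Kubernetes or Kmesh types; how the list is fetched is deliberately left out.

```go
package main

import "fmt"

// pod is a hypothetical, minimal view of a pod; it is not a Kubernetes type.
type pod struct {
	UID     string
	Node    string
	IsKmesh bool // true if the pod belongs to the kmesh daemon
	Running bool
}

// shouldCleanup reports whether the exiting daemon (selfUID on node) may
// remove pinned BPF resources: only when no other running kmesh daemon pod
// exists on the same node.
func shouldCleanup(pods []pod, node, selfUID string) bool {
	for _, p := range pods {
		if p.IsKmesh && p.Running && p.Node == node && p.UID != selfUID {
			return false
		}
	}
	return true
}

func main() {
	pods := []pod{
		{UID: "old", Node: "node-a", IsKmesh: true, Running: true},
		{UID: "new", Node: "node-a", IsKmesh: true, Running: true},
	}
	fmt.Println(shouldCleanup(pods, "node-a", "old")) // false: a replacement pod exists
}
```

Compared with the delayed clearing task, this check needs no timer to cancel after an abnormal restart, but it depends on the pod list being current at exit time.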

Collaborator

@LiZhenCheng9527 left a comment

image
Is this normal?


#### Status Quo and Strategy Analysis

The following uses the workload mode as an example to describe the operations on each BPF map during Kmesh governance:
Collaborator

Is it "bpf map" or "BPF map"? The term needs to be standardised throughout this document.


- The bpf prog and map need to be maintained during restart.

- After the Kmesh is restarted:
Collaborator

Suggested change
- After the Kmesh is restarted:
- After Kmesh restart:

- The bpf prog and map need to be maintained during restart.
- After the Kmesh is started, the data in the map needs to be restored and the data needs to be synchronized with the control plane.

Note: Currently, Kmesh supports only link-level authentication.
Collaborator

Suggested change
Note: Currently, Kmesh supports only link-level authentication.
Note: Currently, Kmesh only supports link-level authentication.

@hzxuzhonghu
Member

@nlgwcy Please update

Labels: kind/enhancement, size/L

Successfully merging this pull request may close these issues:
Support automatically manage workloads after kmesh rollout

4 participants