Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: OnExit node can retry when retry a workflow (#13184) #13185

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

JasonChen86899
Copy link

Fixes #13184

Motivation

fix retry logic

Modifications

  1. function FormulateRetryWorkflow in workflow/util/util.go file
  2. function isDescendantNodeSucceeded in workflow/util/util.go file

Verification

OnExit node and its parent can be retried when a workflow is retried
yaml: test/e2e/testdata/retry-workflow-with-continueon.yaml

  1. do retrying
image
  1. done
image

@JasonChen86899 JasonChen86899 changed the title Fix: OnExit node can retry when retry a workflow (#13184) fix: OnExit node can retry when retry a workflow (#13184) Jun 14, 2024
@tooptoop4
Copy link
Contributor

what if there are multiple onExit nodes?

@agilgur5 agilgur5 added area/exit-handler area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries labels Jun 14, 2024
@JasonChen86899
Copy link
Author

JasonChen86899 commented Jun 14, 2024

what if there are multiple onExit nodes?

The logic of multiple onExit nodes and one onExit node is the same, no mater parent node status, the onExit node will be retried

There are two types of failed parent nodes that can be retried:

  1. There are no successful child nodes other than the onExit node
  2. In addition to the onExit node, there are other successful child nodes, and all of these nodes are required to retry at the same time

@@ -905,7 +912,7 @@ func FormulateRetryWorkflow(ctx context.Context, wf *wfv1.Workflow, restartSucce
}
switch node.Phase {
case wfv1.NodeSucceeded, wfv1.NodeSkipped:
if strings.HasPrefix(node.Name, onExitNodeName) || doForceResetNode {
if strings.HasSuffix(node.Name, onExitNodeNameSuffix) || doForceResetNode {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Succeeded child nodes of onExitNode will not be retried?

@jswxstw
Copy link
Member

jswxstw commented Jun 19, 2024

Can you try this workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: workflow-exit-handler-demo
spec:
  entrypoint: main
  onExit: exit-handler
  templates:
  - name: main
    steps:
    - - name: fail
        template: fail
  - name: exit-handler
    steps:
    - - name: alarm
        template: alarm
  - name: alarm
    container:
      image: alpine:3.18
      command: [sh, -c]
      args: ["echo alarm"]
  - name: fail
    container:
      image: alpine:3.18
      command: [sh, -c]
      args: ["exit 1"]

Can step alarm in the exit handler be retried after manual retry?

@JasonChen86899
Copy link
Author

@jswxstw Thank you very much for the example, I found that the alarm was not re-executed after I performed a retry, based on this example I re-modified the code logic, see the latest code submission for details

@jswxstw
Copy link
Member

jswxstw commented Jun 24, 2024

@jswxstw Thank you very much for the example, I found that the alarm was not re-executed after I performed a retry, based on this example I re-modified the code logic, see the latest code submission for details

There is already a function to determine whether or not node run as exit handler. You should consider using it directly.

// IsExitNode returns whether or not node run as exit handler.
func (ws NodeStatus) IsExitNode() bool {
return strings.HasSuffix(ws.DisplayName, ".onExit")
}

However, this function has a bug now: the DisplayName of child nodes in exit handler(if it runs as Steps) misses the .onExit suffix.
I fixed it in #13016 as below, but this requires confirmation from other reviewers.
image

@JasonChen86899
Copy link
Author

@jswxstw The function you provided is great, I will modify the code based on it, thanks

Review your change: All children of onExit are also added with the .OnExit suffix. The current onExit child node does not have this suffix, which is why running your example makes it impossible to retry with my pervious code.
I have a question for your changes: how about DAG, does DAG have such a bug as well?

So isExitNode function that I think can do this: recursively determine whether the parent group node contains the .onExit suffix. Then the children node name no need be added this suffix (related pr:#13016)

@jswxstw
Copy link
Member

jswxstw commented Jun 24, 2024

I have a question for your changes: how about DAG, does DAG have such a bug as well?

Good question. DAG do have the same bug.

So isExitNode function that I think can do this: recursively determine whether the parent group node contains the .onExit suffix. Then the children node name no need be added this suffix (related pr:#13016)

Good idea! But isExitNode is just the function of NodeStatus, it can not access its root node with boundaryID.

@JasonChen86899
Copy link
Author

Good idea! But isExitNode is just the function of NodeStatus, it can not access its root node with boundaryID.

boundaryID is parent group node just like this function resetConnectedParentGroupNodes

func resetConnectedParentGroupNodes(oldWF *wfv1.Workflow, newWF *wfv1.Workflow, currentNode wfv1.NodeStatus, resetParentGroupNodes []string) (*wfv1.Workflow, []string) {

@jswxstw
Copy link
Member

jswxstw commented Jun 24, 2024

But isExitNode is just the function of NodeStatus, it can not access its root node with boundaryID.

#13016 add a new function IsPartOfExitHandler as below, which applies to exit handler of types such as Steps, DAG, Container, and ContainerSet.

// IsPartOfExitHandler returns whether node is part of exit handler.
func (woc *wfOperationCtx) IsPartOfExitHandler(node wfv1.NodeStatus) bool {
	if node.BoundaryID == "" {
		return node.IsExitNode()
	}
	if boundaryNode, err := woc.wf.Status.Nodes.Get(node.BoundaryID); err == nil {
		return boundaryNode.IsExitNode()
	}
	return false
}

@JasonChen86899
Copy link
Author

@jswxstw the code has been updated and can be reviewed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/exit-handler area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries
Projects
None yet
4 participants