Change Flyte CR naming scheme to better support namespace_mapping #5480

Open
wants to merge 1 commit into base: master

ddl-ebrown
Contributor

@ddl-ebrown ddl-ebrown commented Jun 14, 2024

@ddl-rliu did most of the work on this one - making this an upstream PR as it resolved a real issue for us.

Tracking issue

Why are the changes needed?

  • Typically Flyte is configured so that each project / domain has its
    own Kubernetes namespace.

    Certain environments may change this behavior by using the Flyteadmin
    namespace_mapping setting to put all executions into fewer Kubernetes
    namespaces, or into a single one. This is problematic because it can
    lead to collisions in the naming of the CR that flyteadmin generates.

What changes were proposed in this pull request?

  • This patch fixes two things needed to make this work properly inside
    of Flyte:

    • It adds a random element to the CR name, so that a CR created by
      flyteadmin is named by the execution plus a unique suffix.

      Without this change, an execution Foo in project A prevents an
      execution Foo in project B from launching, because the name of the
      CR that is generated in Kubernetes assumes that the CRs for
      project A and project B are put into different namespaces.

      When namespace_mapping is set to a single value, that assumption
      is wrong.

    • It makes sure that when flytepropeller cleans up the CR, it uses
      Kubernetes labels to find the correct CR. Instead of assuming that
      it can use the execution name, it uses the project, domain and
      execution labels.
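
The collision described above can be reproduced in a few lines. The following is a minimal sketch, not Flyte code: `crKey`, `namespaceFor` and `createCR` are hypothetical names, and the in-memory map only stands in for the Kubernetes API server, which identifies an object by namespace plus name.

```go
package main

import "fmt"

// crKey identifies a custom resource the way Kubernetes does:
// by namespace plus object name.
type crKey struct {
	Namespace string
	Name      string
}

// namespaceFor is a hypothetical stand-in for flyteadmin's namespace mapping.
// An empty shared value mimics a per-project/domain mapping; a non-empty
// value mimics namespace_mapping set to one fixed namespace.
func namespaceFor(project, domain, shared string) string {
	if shared != "" {
		return shared
	}
	return project + "-" + domain
}

// createCR records a CR and returns an error if the name is already taken
// in that namespace.
func createCR(store map[crKey]bool, ns, name string) error {
	key := crKey{Namespace: ns, Name: name}
	if store[key] {
		return fmt.Errorf("CR %q already exists in namespace %q", name, ns)
	}
	store[key] = true
	return nil
}

func main() {
	// One namespace per project/domain: both executions named "foo" fit.
	perProject := map[crKey]bool{}
	fmt.Println(createCR(perProject, namespaceFor("a", "dev", ""), "foo"))
	fmt.Println(createCR(perProject, namespaceFor("b", "dev", ""), "foo"))

	// One shared namespace: the second "foo" is rejected.
	shared := map[crKey]bool{}
	fmt.Println(createCR(shared, namespaceFor("a", "dev", "flyte"), "foo"))
	fmt.Println(createCR(shared, namespaceFor("b", "dev", "flyte"), "foo"))
}
```

The second pair of calls is the failure mode the PR targets: the CR name carries no project or domain information, so uniqueness depends entirely on the namespace.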

How was this patch tested?

This change is deployed in a live Flyte setup where we have automated tests. After deploying it, we observed that the CR names were unique and the initial collision no longer occurred.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@ddl-ebrown ddl-ebrown force-pushed the change-flyte-CR-naming-scheme branch from 806c40b to 438dd97 Compare June 14, 2024 22:39

codecov bot commented Jun 14, 2024

Codecov Report

Attention: Patch coverage is 90.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 61.00%. Comparing base (bba8c11) to head (d994304).
Report is 2 commits behind head on master.

Files Patch % Lines
flyteadmin/pkg/workflowengine/impl/k8s_executor.go 85.71% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5480   +/-   ##
=======================================
  Coverage   60.99%   61.00%           
=======================================
  Files         793      793           
  Lines       51325    51366   +41     
=======================================
+ Hits        31305    31334   +29     
- Misses      17136    17146   +10     
- Partials     2884     2886    +2     
Flag Coverage Δ
unittests-datacatalog 69.31% <ø> (ø)
unittests-flyteadmin 58.73% <85.71%> (+0.02%) ⬆️
unittests-flytecopilot 17.79% <ø> (ø)
unittests-flytectl 67.97% <ø> (ø)
unittests-flyteidl 79.04% <ø> (ø)
unittests-flyteplugins 61.84% <ø> (+0.02%) ⬆️
unittests-flytepropeller 57.32% <100.00%> (-0.01%) ⬇️
unittests-flytestdlib 65.80% <ø> (-0.03%) ⬇️

@ddl-ebrown ddl-ebrown force-pushed the change-flyte-CR-naming-scheme branch from 438dd97 to 532888b Compare June 14, 2024 22:44
ctx,
v1.DeleteOptions{PropagationPolicy: &deletePropagationBackground},
v1.ListOptions{
LabelSelector: v1.FormatLabelSelector(executionLabelSelector(data.ExecutionID)),
Contributor Author

Even though new executions will have different CR names, this deletion mechanism is fully backwards compatible - thanks for a good solution @ddl-rliu !
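
To illustrate why deleting by labels is indifferent to the CR name, here is a minimal, self-contained sketch. It does not use the Kubernetes client: `cr`, `executionLabels`, `matches` and `deleteBySelector` are hypothetical helpers, and the label keys are assumptions rather than the keys Flyte actually sets.

```go
package main

import "fmt"

// cr is a minimal stand-in for a FlyteWorkflow custom resource.
type cr struct {
	Name   string
	Labels map[string]string
}

// executionLabels builds the label set identifying one execution.
// The keys are illustrative, not necessarily the ones Flyte uses.
func executionLabels(project, domain, execution string) map[string]string {
	return map[string]string{
		"project":      project,
		"domain":       domain,
		"execution-id": execution,
	}
}

// matches reports whether every selector label is present on the object
// with the same value (equality-based selection).
func matches(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// deleteBySelector drops every CR whose labels match and returns the rest.
func deleteBySelector(crs []cr, selector map[string]string) []cr {
	kept := make([]cr, 0, len(crs))
	for _, c := range crs {
		if !matches(selector, c.Labels) {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	crs := []cr{
		// Old-style name: identical to the execution name.
		{Name: "foo", Labels: executionLabels("a", "dev", "foo")},
		// New-style name: execution name plus a suffix.
		{Name: "foo-x7k2", Labels: executionLabels("b", "dev", "foo")},
	}
	remaining := deleteBySelector(crs, executionLabels("a", "dev", "foo"))
	for _, c := range remaining {
		fmt.Println(c.Name)
	}
}
```

Because the selector never looks at `Name`, the same lookup finds CRs created before and after a naming change, provided both carry the labels.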

@ddl-ebrown ddl-ebrown force-pushed the change-flyte-CR-naming-scheme branch 2 times, most recently from 33e6e7d to f97c77a Compare June 14, 2024 23:27
Signed-off-by: ddl-ebrown <[email protected]>
@ddl-ebrown ddl-ebrown force-pushed the change-flyte-CR-naming-scheme branch from f97c77a to d994304 Compare June 14, 2024 23:52
@@ -165,6 +165,10 @@ func TestDynamic(t *testing.T) {
Name: "name",
},
"namespace")
// make sure real CR has randomized suffix
Contributor Author

This is probably the simplest way to ensure existing tests continue to pass

rand.Seed(seed)
// K8s has a limitation of 63 chars
name = name[:minInt(63-ExecutionIDSuffixLength, len(name))]
execName := name + "-" + rand.String(ExecutionIDSuffixLength-1)
Contributor Author

Without this randomization, use of namespace_mapping in the config

const (
namespaceMappingKey = "namespace_mapping"
defaultTemplate = "{{ project }}-{{ domain }}"
)
var namespaceMappingConfig = config.MustRegisterSection(namespaceMappingKey, &interfaces.NamespaceMappingConfig{
Template: defaultTemplate,
})
with a value like foo causes problems when executions have the same names across projects

Contributor

@hamersaw hamersaw left a comment

.

@@ -159,6 +161,17 @@ func generateName(wfID *core.Identifier, execID *core.WorkflowExecutionIdentifie
}
}

const ExecutionIDSuffixLength = 21
Contributor

Can we make this a configurable value, and if it is set to 0 (default?), have it disabled (so no random characters are appended)? My concern is that if anything relies on the CR name, this will break it.

Contributor Author

I think we found / updated the spots where the name is a "contract" -- but if we want to be extra super careful we could make this configurable.

That said, I think the tradeoff we have to consider is:

  • backward compatibility vs
  • adding extra config / managing different behaviors

I would probably vote for not introducing an extra config and keeping this behavior not configurable (I'd argue prior behavior was a bug), but admit to not knowing the potential blast radius beyond core Flyte (i.e. plugins and such).

I'm happy to go either way since it's not my project :)

Contributor

Yeah, I understand the config bloat all too well :). As you suggest, my main concern is breaking backwards compatibility here. I know there are Flyte users that rely on the FlyteWorkflow CR being named identically to the execution ID, which this would break. For me to be comfortable merging this, I feel it should default to the current behavior. cc @eapolinario thoughts?

Contributor Author

Ah thanks!

If you know there are users depending on the existing CR naming scheme somehow, then making this behavior configurable seems like the only thing to do right now. I can update my PR.

Not sure if you track potential breaking change tickets anywhere, but maybe file one away to remove the option to configure this on the next major release boundary?

@kumare3
Contributor

kumare3 commented Jun 25, 2024

I am not in favor of this, as randomness will lead to leaky workflows and duplicates. We should use the project id itself or generate a consistent hash to increase inter-project execution entropy.

@ddl-ebrown
Contributor Author

> I am not in favor of this, as randomness will lead to leaky workflows and duplicates. We should use the project id itself or generate a consistent hash to increase inter project execution entropy

Ah thanks @kumare3 for the heads up! We clearly didn't realize there was something internal to Flyte that depends on deterministic naming for CRs -- will make some updates taking that into account as well

@ddl-ebrown
Contributor Author

Also, should mention @kumare3 that if by "leaky" you meant "CR might not be deleted from the cluster", the deletion process is robust because this uses the actual key of the workflow in conjunction with CR labels to perform deletes, rather than the CR name.

If there are dupe CRs for the same workflow though, that's clearly an issue regardless :)

@EngHabu
Contributor

EngHabu commented Jun 25, 2024

@ddl-ebrown I agree with not introducing randomization... especially since the name already starts with a random string :-)

Instead, I would update this call to use something like project-domain-rand(10) and hash that and that becomes the execution name...

I would also make the length of the execution name configurable in flyteadmin, so in your deployment you can make it longer and get better entropy...
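
The suggestion above, hashing a "project-domain-rand" string into an execution name of configurable length, might be sketched as follows. This is an interpretation, not the author's implementation: `hashedExecutionName` is hypothetical, SHA-256 with hex encoding is an arbitrary choice, and the random part is passed in as a parameter so the example stays reproducible.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

const maxNameLength = 63

// hashedExecutionName hashes "project-domain-randomPart" and keeps the
// first `length` hex characters. A larger length keeps more of the hash
// and therefore lowers the chance of collisions across projects.
func hashedExecutionName(project, domain, randomPart string, length int) string {
	// Clamp the configured length to a usable range.
	if length < 1 {
		length = 1
	}
	if length > maxNameLength {
		length = maxNameLength
	}
	sum := sha256.Sum256([]byte(project + "-" + domain + "-" + randomPart))
	return hex.EncodeToString(sum[:])[:length]
}

func main() {
	// "k3v9x2m7qa" stands in for a 10-character random string that would
	// be generated once per execution.
	fmt.Println(hashedExecutionName("a", "dev", "k3v9x2m7qa", 20))
	fmt.Println(hashedExecutionName("b", "dev", "k3v9x2m7qa", 20))
}
```

Since the randomness is consumed once when the execution name is chosen, the CR name can stay equal to the execution name, which preserves the contract that reviewers wanted to keep.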


5 participants