[ENHANCEMENT] add dry-run mode for ANP/BANP #230

Open
npinaeva opened this issue May 21, 2024 · 18 comments

Comments

@npinaeva
Member

npinaeva commented May 21, 2024

Is your enhancement request related to a problem? Please describe.
Users may want to ensure no unexpected connections will be denied by a new (B)ANP.
A dry-run mode should not affect any connections, and allow the networking plugin to provide feedback (e.g. via logging or observability tools) to see which connections will be dropped/allowed once this (B)ANP is enforced.

Describe the solution you'd like
A potential solution is to have a dry-run flag that would make sure the (B)ANP is not actually enforced.
This would allow network plugins to add logging/observability on top of the flag, while also making sure that the base behaviour (not enforcing the (B)ANP) is the same for all plugins.
While this flag is most useful in combination with plugin-specific logging/observability, network plugins don't have to provide anything on top of it.

The workflow should go something like this:

  • design network policies for the cluster (exactly as you would do without any dry-run mode)
  • apply them in dry-run mode to check if there are any unexpected effects of the designed network policies.
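For illustration, a minimal sketch of how such a flag could be surfaced on the (B)ANP API. The field and value names below are hypothetical, not an agreed design:

```go
// Hypothetical sketch only: field and value names are illustrative,
// not part of the actual AdminNetworkPolicy API.
package v1alpha1

// EnforcementMode controls whether the policy's rules are enforced.
type EnforcementMode string

const (
	// EnforcementModeEnforce applies the policy's rules to traffic (default).
	EnforcementModeEnforce EnforcementMode = "Enforce"
	// EnforcementModeDryRun guarantees the rules have no effect on traffic;
	// implementations may additionally log or export which connections
	// would have been allowed or denied.
	EnforcementModeDryRun EnforcementMode = "DryRun"
)

// AdminNetworkPolicySpec shows only the hypothetical addition.
type AdminNetworkPolicySpec struct {
	// ... existing fields (Priority, Subject, Ingress, Egress) ...

	// Enforcement defaults to "Enforce" when unset.
	// +optional
	Enforcement EnforcementMode `json:"enforcement,omitempty"`
}
```

The only behaviour upstream would specify for "DryRun" is that traffic is left untouched; any logging or observability layered on top stays implementation-specific.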

Similar features:

Describe alternatives you've considered

Additional context
First discussed on May 21, 2024 in the SIG meeting.
Don't forget to provide recommendations for existing connections that may not be affected by the new netpol.

@huntergregory
Contributor

Users may want to ensure no unexpected connections will be denied by a new (B)ANP.

This is definitely an important scenario for user tooling. In March at KubeCon (recording, slides), we proposed that upstream tooling would solve this scenario, and there was good excitement and feedback which was the catalyst for putting #221 (same scenario) on the roadmap for Policy Assistant.

I think we should solve this scenario in upstream tooling (Policy Assistant) as opposed to a dry-run mode downstream. A few reasons:

  • Our upstream tooling handles this and more. It is already a “dry-run” policy engine.
  • Benefits of upstream vs. downstream tooling:
    • Developing new features will be quicker/easier upstream.
    • One set of changes will benefit all users regardless of implementation.
    • Less work for implementations.
  • I think upstream should provide a consistent story around user tooling. If upstream suggests using an upstream tool for one scenario, and a downstream tool/configuration for another scenario, there is a disjointed experience for users, and mixed messaging.

There were two concerns I recall from yesterday’s discussion around Policy Assistant’s solution to this scenario:

  • How could Policy Assistant know about connections to determine denied connections?
    • Only Pods and policies are actually needed. The CLI can get all Pods and policies in the cluster and calculate which Pods are relevant to the new policy. Then, for all theoretical traffic to/from these Pods, it can determine which protocols and port ranges were allowed/denied before the policy was added vs. after, and present a diff of before vs. after (a rough sketch of this computation follows this list).
  • Will Policy Assistant be efficient enough?
    • There are some optimizations we can make if needed.
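A rough sketch of that before-vs-after computation, using hypothetical types rather than Policy Assistant's actual code:

```go
// Rough sketch only: hypothetical types, not Policy Assistant's actual code.
package policydiff

type Verdict string

const (
	Allowed Verdict = "Allowed"
	Denied  Verdict = "Denied"
)

// Connection is one theoretical flow between two workloads.
type Connection struct {
	From, To string
	Port     int
	Protocol string
}

// Policy is a stand-in for a parsed (B)ANP / NetworkPolicy.
type Policy struct{ Name string }

// Engine is a stand-in for the simulation logic that decides a verdict
// for a connection under a given set of policies.
type Engine interface {
	Verdict(policies []Policy, c Connection) Verdict
}

// Diff returns the connections whose verdict changes once the candidate
// policy is added to the existing set, mapped to [before, after].
func Diff(e Engine, existing []Policy, candidate Policy, conns []Connection) map[Connection][2]Verdict {
	withCandidate := append(append([]Policy{}, existing...), candidate)
	changed := map[Connection][2]Verdict{}
	for _, c := range conns {
		before, after := e.Verdict(existing, c), e.Verdict(withCandidate, c)
		if before != after {
			changed[c] = [2]Verdict{before, after}
		}
	}
	return changed
}
```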

Taking a Step Back

I think there is a larger problem we are trying to solve.

A user wants to add a new policy to a cluster. There are a few considerations:

  • What does this policy specify?
  • What happens when I add this policy?
    • Does this ANP have an allow rule conflicting with a deny rule of another ANP with the same priority? If so, the policy should NOT be added, because it introduces vendor lock-in (traffic verdicts are left up to the fate of the implementation). A sketch of such a same-priority conflict check is included at the end of this comment.
    • Is any traffic now allowed because of this policy?
    • Is any traffic now dropped?

The dry-run mode is meant to address only the last question. Policy Assistant can provide a complete, easy-to-parse answer to that question, and it can also answer the other questions.

If dry-run mode aims to ease concerns that an implementation might have unexpected behavior, then I’d say that implementation is not conformant? We should strongly discourage users from creating policies which have “undefined” behavior delegated to implementations (discourage via Policy Assistant)?
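As referenced above, a sketch of what a same-priority Allow/Deny conflict check could look like. The types are simplified and hypothetical (overlap detection between rules is abstracted away), not the real AdminNetworkPolicy API:

```go
// Simplified, hypothetical representation; not the real AdminNetworkPolicy types.
package lint

type Action string

const (
	Allow Action = "Allow"
	Deny  Action = "Deny"
)

// Rule is a flattened (B)ANP rule; Overlaps reports whether two rules can
// match the same traffic (subject, peers, ports).
type Rule struct {
	Action   Action
	Overlaps func(other Rule) bool
}

// ANP is a stand-in for a parsed AdminNetworkPolicy.
type ANP struct {
	Name     string
	Priority int
	Rules    []Rule
}

func hasAllowDenyConflict(a, b ANP) bool {
	for _, ra := range a.Rules {
		for _, rb := range b.Rules {
			allowVsDeny := (ra.Action == Allow && rb.Action == Deny) ||
				(ra.Action == Deny && rb.Action == Allow)
			if allowVsDeny && ra.Overlaps(rb) {
				return true
			}
		}
	}
	return false
}

// SamePriorityConflicts reports pairs of ANPs that share a priority and have
// overlapping Allow/Deny rules, i.e. traffic whose verdict would be left up
// to the implementation.
func SamePriorityConflicts(policies []ANP) [][2]string {
	var conflicts [][2]string
	for i := 0; i < len(policies); i++ {
		for j := i + 1; j < len(policies); j++ {
			if policies[i].Priority == policies[j].Priority &&
				hasAllowDenyConflict(policies[i], policies[j]) {
				conflicts = append(conflicts, [2]string{policies[i].Name, policies[j].Name})
			}
		}
	}
	return conflicts
}
```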

@npinaeva
Member Author

Great summary @huntergregory!
I think one of the extra questions around "How could Policy Assistant know about connections to determine denied connections?" is how to figure out which connections actually exist in the cluster, rather than theoretically (which egress connections are initiated from the pod and which ingress connections are initiated to the pod, which may be answered e.g. with a one-day dry-run in a cluster), and then to see how these existing connections will be affected. I can imagine such data being collected via some observability tools and then given to the assistant, but that sounds a bit complicated. Maybe you have some other ideas on how to figure this out with Policy Assistant?

@huntergregory
Contributor

Aren't theoretical connections (i.e. a summary of all possible connections) more informative? For instance, what if a denied connection will happen in the future but the downstream tooling does not observe it?

@npinaeva
Member Author

I think these two tools just give different kinds of information. For example, I am convinced that I allowed everything my pod needs, but forgot to allow ingress from monitoring. In the static analysis, everything will look fine, because it will just tell me the same thing as my ANP, but if I run it on a real cluster I will find that denied connection from the monitoring namespace and figure out I forgot it. Another example is using FQDN: statically you don't know to which IP it will be resolved, so you can't actually say that the higher-priority deny rule will drop traffic that should be allowed by a lower-priority FQDN rule. In the real cluster with real traffic I will be able to see that.
So I think static analysis helps you check that the config aligns with your intent, but dry-run helps you check that the intent itself is right.

@huntergregory
Contributor

I'd like to understand the argument for downstream tooling a bit more in Tuesday's meeting. I believe Policy Assistant's dry-run is by nature more informative than a CNI's dry-run, and it'd be great to get your opinion and others' opinions on how to best display that information.

The first example is not quite accurate:

In the static analysis, everything will look fine, because it will just tell me the same thing as my ANP, but if I run it on a real cluster I will find that denied connection from the monitoring namespace and figure out I forgot it.

Policy Assistant could account for / resolve FQDNs?

Another example is using FQDN: statically you don't know to which IP it will be resolved

@huntergregory
Contributor

huntergregory commented Jun 4, 2024

Discussed extensively today (6/4/24) in the SIG meeting.

In the static analysis, everything will look fine, because it will just tell me the same thing as my ANP, but if I run it on a real cluster I will find that denied connection from the monitoring namespace and figure out I forgot it.

I see now what Nadia means: while Policy Assistant explains the connections which are now allowed/denied by a policy, the CNI dry-run mode would clearly highlight real connections being dropped.

So CNI dry-run mode would hold the admin's hand a bit more.

@huntergregory
Contributor

We'll want to prevent folks from shooting themselves in the foot with a CNI dry-run mode. A few pitfalls discussed today:

  • What if some connections will be dropped, but they occur outside the audit period?
  • What if some Pods (like a CronJob) come up outside the audit period?
  • User must check CNI logs on every Node since drops can occur on any Node.

@nathanjsweet

I agree with much of @huntergregory's assessment, and even more implementation hurdles could be discussed (such as distributed logging in large clusters not being scalable, etc.), which I am happy to do if this comment is unsuccessful. However, I want to focus on one crucial point:

This feature goes beyond proposing policy enforcement syntax that requires CNIs to enforce network connection rules within a proverbial "black box". It requires CNIs to audit policies for users in a standardized way. This feature considerably expands the scope of what this SIG defines for CNIs to implement from a security perspective. While a standardized logging mechanism from the CNI is only alluded to in this proposal, something like it is necessary for this feature. If that is the case, then this feature proposal violates SIG Network's charter against specifying the CNI (which is outside of the Kubernetes project). This proposal is beyond this SIG's traditional ambit and its authority.

@npinaeva
Member Author

npinaeva commented Jun 7, 2024

We'll want to prevent folks from shooting themselves in the foot with a CNI dry-run mode. A few pitfalls discussed today:

* What if some connections will be dropped, but they occur outside the audit period?
* What if some Pods (like a CronJob) come up outside the audit period?
* User must check CNI logs on every Node since drops can occur on any Node.

These are valid points that we should definitely outline if we ever implement this feature, but I think there is a little misunderstanding of how the dry-run mode should be used.
It is not an authoritative source of truth for the network policy spec, but a way to verify the netpol spec won't break the cluster.
The workflow should go something like this:

  1. design network policies for the cluster (exactly as you would do without any dry-run mode)
  2. apply them in dry-run mode to check if there are any unexpected effects of the designed network policies.

Getting back to the mentioned pitfalls:

  • if some connection that should be allowed is dropped by the netpol
    • without dry-run: it is just dropped, potential disruption
    • with dry-run: it will be logged if the connection happens during the dry-run period, saving the cluster from problems (one could argue that the cluster admin should have an idea of how long the dry-run period should be to ensure the most important connections succeed)
    • I see how dry-run mode can give users too much confidence about their netpol correctness and be misused somehow, but I don't think it is a good enough reason to not consider this feature
  • how to consume the "logs" mostly goes to Nathan's comment about the logging or observability for netpols in general, but if the logging is possible, this problem is most likely solved.

While a standardized logging mechanism from the CNI is only alluded to in this proposal, something like it is necessary for this feature.

This feature is definitely most useful when logging is possible, but one of the reasons that makes me think it can be considered for the API is that this field (or a well-known label?) should change the behaviour of the network policy to "log" (or do nothing if the plugin doesn't support logging). This functionality can't be implemented just downstream on top of the upstream API, because it will break the semantics of a network policy (spec says allow, but nothing actually happens). So the main effect of this feature in the API is that the policy in dry-run mode won't affect the traffic. Sounds useless by itself, but this will allow implementations to enable logging on top of it and provide a useful tool without the need to create downstream CRDs.
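To make that split concrete, a hypothetical sketch of the plugin side (illustrative names, heavily simplified evaluation): the only upstream-required behaviour is that a dry-run policy never changes the verdict, while the observer hook stands in for whatever logging/observability an implementation chooses to add.

```go
// Hypothetical sketch; not any real plugin's code, and (B)ANP evaluation
// semantics are heavily simplified.
package dataplane

type Verdict string

const (
	Allow Verdict = "Allow"
	Deny  Verdict = "Deny"
)

type Connection struct {
	From, To string
	Port     int
}

type Policy struct {
	Name   string
	DryRun bool // the proposed upstream field: "do not enforce"
	// Decide reports whether the policy matches the connection and, if so,
	// the verdict its rules would produce.
	Decide func(Connection) (Verdict, bool)
}

// Observer is an optional, implementation-specific hook (logging, metrics,
// flow export); upstream would not specify it.
type Observer interface {
	WouldApply(p Policy, c Connection, v Verdict)
}

// evaluate walks policies in priority order (assumed pre-sorted).
func evaluate(policies []Policy, c Connection, obs Observer) Verdict {
	for _, p := range policies {
		v, matched := p.Decide(c)
		if !matched {
			continue
		}
		if p.DryRun {
			if obs != nil {
				obs.WouldApply(p, c, v) // optional: report what would have happened
			}
			continue // the only upstream requirement: traffic is unaffected
		}
		return v // first enforced match wins (simplified)
	}
	return Allow // simplified default
}
```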

@nathanjsweet

It is not an authoritative source of truth for the network policy spec, but a way to verify the netpol spec won't break the cluster.

You do not get to decide how users see a feature. Many of them will see it as authoritative (which is not a problem for any other feature in SIG-Network-API) and shoot themselves in the foot with this thing.

Sounds useless by itself, but this will allow implementations to enable logging on top of it and provide a useful tool without the need to create downstream CRDs.

I see two problems with this thinking:

  1. What if CNIs don't even do the bare minimum to no-op a policy with this field present, and a user applies a policy with the field without realizing that they're about to apply a real policy to their cluster? (This is actually a problem even if we end up doing this feature.)
  2. Assuming that I'm right about specifying logging as being prohibited, what is the point of the API if you cannot specify a logging solution? That negates the usefulness of making this a specification. This SIG would be saying, "Here is this API field, but the method of observability is implementation specific, so check with your CNI to see if they support this and (if they do) how to actually get the logs, UI, or whatever." I understand that there are other unimplemented features in this SIG, but this is the only one that, when used with a CNI that has not implemented it, can cause a catastrophe rather than the user simply realizing that the feature is unsupported (if a CNI does not support ANPs, a user will quickly realize it without any adverse effects).

if the logging is possible, this problem is most likely solved.

This feature is most useful in larger clusters with numerous disparate policies (where admins might be confused about the effect of adding a policy). That means that every node affected by the policy (and maybe all of them, implementation-dependent) will have to send logs to some central place (the kube-apiserver?). I think there are legitimate scalability questions for such a solution, but we cannot really review them until we have a full proposal.

@npinaeva I think our path forward is this:

  1. We need to clarify what we can and cannot propose for CNIs to do vis-a-vis logging and/or observability.
  2. Given our full understanding of what we can do observability-wise, we need to extend this proposal to include that solution (if we can) so that we can review it properly.

@shaneutt
Member

We talked about this in the sync today, I think as a next step it could only help to share the context with some CNI folks to get their thoughts and considerations.

@npinaeva
Member Author

What if CNIs don't even do the bare minimum to no-op a policy with this field present, and a user applies a policy with the field without realizing that they're about to apply a real policy to their cluster? (This is actually a problem even if we end up doing this feature.)

That is the only requirement for this field, and the only reason to make it an API field: it will be required to be considered conformant. I don't think ignoring a netpol is a difficult-to-implement feature for any plugin.
Then on top of this field some plugins may enable logging or whatever else they find a useful addition to the dry-run mode. If the CNI doesn't do anything on top of it, there is no catastrophe; the behaviour will be the same, just no extra logs/observability. As discussed in the meeting, some CNIs may decide to enforce the rules in the dry-run mode and e.g. log real matched connections, while others may want to skip any datapath-related actions and do some static analysis.

There is no intention for this feature upstream to specify anything more than the fact that the netpol rules should not be applied in dry-run mode. If we don't think this feature is useful without specifying logging/observability side-effects, then we can just reject it.

@danwinship
Contributor

While a standardized logging mechanism from the CNI is only alluded to in this proposal, something like it is necessary for this feature. If that is the case, then this feature proposal violates SIG Network's charter against specifying the CNI (which is outside of the Kubernetes project).

That is incorrect. The charter says (1) SIG Network does not own the CNI specification, and (2) SIG Network does not own any specific network plugin. (Furthermore, these are just clarifications, not "rules".)

SIG Network absolutely owns the definition of what functionality is required in a conforming Kubernetes pod network (and that already includes functionality which is not part of CNI; CNI does not have anything to say about node-to-node traffic, for example).

(Not saying that this feature is necessarily a good idea, or that any of your other objections to it are wrong, just that this particular objection is incorrect.)

@danwinship
Contributor

A potential solution is to have a dry-run flag that would turn all (B)ANP actions into logs.

It seems really really weird to define this without already having (standardized) support for logging.

Also, from an API perspective, objects describe state, not actions, so "dry-run mode" isn't quite the right model. The linked Calico feature with "staged" policies is more semantically correct; it's not that you're instructing the plugin to audit a particular policy, it's that you're creating a policy whose semantics are inherently "audit these rules" rather than "enforce these rules".
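To illustrate the modelling difference with two hypothetical shapes (neither is a concrete proposal):

```go
// Both shapes are hypothetical, shown only to contrast the two models.
package modeling

// (a) Action-flavoured: an instruction bolted onto an otherwise-enforced
// policy ("dry-run this for me").
type SpecWithFlag struct {
	// ... rules ...
	DryRun bool `json:"dryRun,omitempty"`
}

// (b) State-flavoured, closer to Calico's "staged" policies: the policy's
// declared semantics are "audit these rules" rather than "enforce them".
type Mode string

const (
	ModeEnforce Mode = "Enforce"
	ModeAudit   Mode = "Audit"
)

type SpecWithMode struct {
	// ... rules ...
	Mode Mode `json:"mode,omitempty"`
}
```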

@danwinship
Contributor

danwinship commented Jun 27, 2024

It seems really really weird to define this without already having (standardized) support for logging.

Alternatively, the feature could be implemented by updating the policy's status (or creating Events). Obviously you'd only be able to log a very limited amount of data in that case, but if the expectation is that "this Deny rule shouldn't actually be denying anything when the cluster is working correctly", then you don't need a lot of logging; just seeing the first handful of packets that hit the rule should be enough to see what mistake you made.
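A hypothetical sketch of what such a status-based report could look like (illustrative field names, not a real proposal), keeping only the first few hits per rule:

```go
// Illustrative only: hypothetical field names, not a real API proposal.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DryRunRuleHit records one connection that matched a dry-run (audited) rule.
type DryRunRuleHit struct {
	Rule          string      `json:"rule"`    // rule name within the policy
	From          string      `json:"from"`    // e.g. "monitoring/prometheus-0"
	To            string      `json:"to"`
	Port          int32       `json:"port"`
	WouldBe       string      `json:"wouldBe"` // "Allowed" or "Denied"
	FirstObserved metav1.Time `json:"firstObserved"`
}

type AdminNetworkPolicyStatus struct {
	// ... existing Conditions ...

	// DryRunHits is deliberately capped (e.g. the first few distinct
	// connections per rule); status is not the place for unbounded flow logs.
	// +optional
	DryRunHits []DryRunRuleHit `json:"dryRunHits,omitempty"`
}
```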

@nathanjsweet

@npinaeva

That is the only requirement for this field, and the only reason to make it an API field: it will be required to be considered conformant. I don't think ignoring a netpol is a difficult-to-implement feature for any plugin.
...
There is no intention for this feature upstream to specify anything more than the fact that the netpol rules should not be applied in dry-run mode.

It's not difficult but it is a breaking change in the API.

@danwinship

That is incorrect. The charter says (1) SIG Network does not own the CNI specification, and (2) SIG Network does not own any specific network plugin. (Furthermore, these are just clarifications, not "rules".)

SIG Network absolutely owns the definition of what functionality is required in a conforming Kubernetes pod network (and that already includes functionality which is not part of CNI; CNI does not have anything to say about node-to-node traffic, for example).
...
It seems really really weird to define this without already having (standardized) support for logging.

Also, from an API perspective, objects describe state, not actions, so "dry-run mode" isn't quite the right model.

This is what I'm talking about. Specifying action rather than state, especially a specific type of output (logging, in this case) that is unrelated to the API this group is meant to work on, seems out of scope and off the charter. CNI logs are within the purview of the CNI specification and are even mentioned in the project spec.

Alternatively, the feature could be implemented by updating the policy's status (or creating Events). Obviously you'd only be able to log a very limited amount of data in that case, but if the expectation is that "this Deny rule shouldn't actually be denying anything when the cluster is working correctly", then you don't need a lot of logging; just seeing the first handful of packets that hit the rule should be enough to see what mistake you made.

I find this dubious. In a large cluster, with potentially thousands of endpoints affected by the new dry-run policy, how is the CNI meant to determine which of the logged packets/flows to apply to the status field of one policy? Especially with ANPs, the affected policy might be downstream of the dry-run policy.

@danwinship
Contributor

Specifying action rather than state, especially a specific type of output (logging, in this case) that is unrelated to the API this group is meant to work on, seems out of scope and off the charter.

It is related to the API this group is working on though. It's an ANP feature.

CNI logs are within the purview of the CNI specification

This feature is not talking about CNI at all. Yes, people keep saying "CNI", but they don't mean it; people just misuse the term "CNI" to refer to the sum of all of the pod-network-related components in a cluster, including, in this case in particular, the NP/ANP implementation.


The ANP spec currently requires an ANP implementation to have certain powers like "being able to drop packets traveling between pods" and "being able to drop packets going from a pod to an external IP". There are other powers that ANP implementations are not currently required to have, but which we have discussed requiring, like "being able to drop packets traveling from one node to another node" or "being able to distinguish packets from different hostNetwork pods on the same node".

This NPEP suggests adding the requirement that ANP implementations have to have the power of "being able to log messages in a certain way in response to certain packets traveling between pods". That falls entirely within the ANP spec, and is within the scope of the NP API WG.

(It may happen that some implementations of Kubernetes pod networking are written in such a way that adding this feature would require interactions between some component which might be called "a NetworkPolicy implementation" and some component which might be called "a CNI plugin", but the fact that a particular implementation splits networking features up in a particular way does not mean that the upstream APIs must reflect that same split.)

(And again, I am not saying we should add this feature, I'm just saying that we have to actually decide on the merits of the feature, because it is in scope as a possible ANP feature.)

(And "the merits of the feature" absolutely includes the extent to which people are able/willing to implement it, since we can't get to GA without multiple compatible implementations of the feature.)

@npinaeva
Member Author

A potential solution is to have a dry-run flag that would turn all (B)ANP actions into logs.

It seems really really weird to define this without already having (standardized) support for logging.

Right, I have not updated the description for a while, fixed now. I think the easiest solution for upstream is to say that the dry-run mode action is "ignore" instead of "enforce", which allows converting it to "audit" downstream, as audit doesn't change netpol behaviour, but adds a side-effect on top.
