fix(aws): aws check and metadata fixes #4251
```diff
@@ -1,4 +1,5 @@
 import json
+import time
 from enum import Enum
 from typing import Optional
```
```diff
@@ -145,6 +146,7 @@ def __describe_instance_information__(self, regional_client):
                         id=resource_id,
                         region=regional_client.region,
                     )
+            time.sleep(0.1)
         except Exception as error:
             logger.error(
```

Reviewer: I'm not sure about including this here; it doesn't seem deterministic. If there are rate-limiting errors, we should handle them explicitly.

Author: The boto3 paginator should be handling the throttling exceptions and retrying, but it doesn't work consistently for `ssm:DescribeInstanceInformation` and exhausts its retries. We escalated this to AWS support, and their recommendation was to "Reduce the frequency of the API calls", which is why I've been using a short sleep between pages like this, and it has been working. The rate limits for this action are not documented and seem to behave differently from other AWS actions, and the problem is difficult to test for because it only shows up when large numbers of instances are running, in our case ephemeral Spot processing jobs. When the problem occurs, it generates large numbers of false positives: the SSM instance metadata is not captured, so `ec2_instance_managed_by_ssm` fails for most of the instances. We have not seen that happen after adding the sleep.

Reviewer: Could you add this message as a comment above the `time.sleep`? It'd be great to have this. Thanks for your analysis 👏

Author: Good idea, done.
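For context, here is a minimal, self-contained sketch of the pattern under discussion: pacing the SSM `DescribeInstanceInformation` paginator with a short sleep between pages, with the rationale recorded as a code comment as the reviewer requested. The function name, return value, and error handling here are illustrative, not the actual Prowler service code.

```python
import time

import boto3
from botocore.exceptions import ClientError


def describe_instance_information(regional_client):
    """Collect SSM-managed instance info, pacing requests to avoid throttling.

    Illustrative sketch: the real Prowler service stores results on the
    service object and logs errors with region context instead of returning.
    """
    instances = {}
    try:
        paginator = regional_client.get_paginator("describe_instance_information")
        for page in paginator.paginate():
            for info in page.get("InstanceInformationList", []):
                instances[info["InstanceId"]] = info
            # boto3's built-in retry logic does not work consistently for
            # ssm:DescribeInstanceInformation and can exhaust its retries at
            # scale. Per AWS support's advice to reduce the call frequency,
            # sleep briefly between pages.
            time.sleep(0.1)
    except ClientError as error:
        print(f"{regional_client.meta.region_name} -- {error}")
    return instances


if __name__ == "__main__":
    ssm = boto3.client("ssm", region_name="us-east-1")
    print(describe_instance_information(ssm))
```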
Reviewer: Could you add some tests to handle this new behaviour? One for each new state. Thanks!
Author: New tests have been added for the "running", "stopped", and "terminated" states. Moto doesn't appear to support mocking transient statuses like "pending" in a way that the Prowler services can read.
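For reference, a sketch of how moto can exercise the three instance states the new tests cover. This is a simplified stand-alone test, not the actual Prowler test suite; it assumes moto >= 5 (the `mock_aws` decorator) and uses a placeholder AMI ID. It also illustrates the limitation mentioned above: moto reports a freshly launched instance as "running" on the next describe call, so transient states like "pending" are not observable.

```python
import boto3
from moto import mock_aws


def _instance_state(ec2, instance_id):
    """Return the current state name of a single instance."""
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    return reservations[0]["Instances"][0]["State"]["Name"]


@mock_aws
def test_instance_states_visible_to_checks():
    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Placeholder AMI ID; moto accepts non-existent AMIs by default.
    instance_id = ec2.run_instances(
        ImageId="ami-12345678", MinCount=1, MaxCount=1
    )["Instances"][0]["InstanceId"]

    # moto skips "pending" and reports the instance as running immediately.
    assert _instance_state(ec2, instance_id) == "running"

    ec2.stop_instances(InstanceIds=[instance_id])
    assert _instance_state(ec2, instance_id) == "stopped"

    ec2.terminate_instances(InstanceIds=[instance_id])
    assert _instance_state(ec2, instance_id) == "terminated"
```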