Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@FifoQueueListener retry does not work when FIFO queue has messages of same message group id #369

Open
johnnycashash opened this issue Aug 27, 2021 · 4 comments

Comments

@johnnycashash
Copy link

Using @FifoQueueListener with concurrency of 10 in spring boot application.

SQS FIFO queue has configured visibility timeout of 10 seconds and configured DLQ with max receives count 3.

First: Putting 2 messages in FIFO queue with same message group id and different deduplication id.
Second: Starting Spring boot consumer @FifoQueueListener application.

3 attempts are done on 1st message dropped in queue as exception is thrown from the process. But not noticing any attempt done on 2nd message and sometimes 3 attempts(APPROXIMATE_RECEIVE_COUNT -1,2,3) done on 1st message and only one attempt done on 2nd message(APPROXIMATE_RECEIVE_COUNT - 3).

And finally both messages end up in DLQ.

Expectation should be 3 attempts done on both the messages before putting messages in DLQ.

Noticed that on application start, both messages move to inflight(means both are read in 1st poll).
Maybe the poller for @FifoQueueListener keeps on increasing the receive count of 2nd message while first message attempts are going on.

Also tried increasing the visibility on sqs to 1 hour and overriding visibility to 5 seconds in @FifoQueueListener.

@johnnycashash johnnycashash changed the title @FifoQueueListener does not work when FIFO queue has messages of same message group id @FifoQueueListener retry does not work when FIFO queue has messages of same message group id Aug 27, 2021
@johnnycashash
Copy link
Author

johnnycashash commented Aug 27, 2021

Attaching debug logs:

Started MessageRetriever
c.j.s.r.b.BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 0
c.j.s.b.c.ConcurrentMessageBroker : Changing concurrency from 0 to 10
c.j.s.r.b.BatchingMessageRetriever : Requesting 10 messages
c.j.s.r.b.BatchingMessageRetriever : Downloaded 2 messages
c.j.s.r.b.BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 10
c.j.s.r.b.BatchingMessageRetriever : Requesting 10 messages

print logs for first message attempt 1

c.j.s.b.grouping.GroupingMessageBroker : Error processing message com.jashmore.sqs.processor.MessageProcessingException

c.j.s.r.b.BatchingMessageRetriever : Downloaded 2 messages
c.j.s.r.b.BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 11
c.j.s.r.b.BatchingMessageRetriever : Requesting 10 messages

print logs for first message attempt 2

c.j.s.b.grouping.GroupingMessageBroker : Error processing message com.jashmore.sqs.processor.MessageProcessingException

c.j.s.r.b.BatchingMessageRetriever : Downloaded 2 messages
c.j.s.r.b.BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 12
c.j.s.r.b.BatchingMessageRetriever : Requesting 10 messages

print logs for first message attempt 3

c.j.s.b.grouping.GroupingMessageBroker : Error processing message com.jashmore.sqs.processor.MessageProcessingException

At every attempt the receive count of 2nd message keeps incrementing without processing and both messages end up in DLQ

Maybe it an expected scenario but if both messages are consumed in single poll, then 2nd message will go to DLQ without processing and if 2nd message is read in 2nd retry poll, 2nd message is attempted twice and if 2nd message is read in some poll when all retry are finished for 1st message then 2nd message is retried 3 times.

Can there be consistency in such scenarios or this is expected/recommended from SQS FIFO queue?

@JaidenAshmore
Copy link
Owner

Hey, yeah so from memory if the first message fails to be processed it will not attempt the second message because that breaks the FIFO ordering. E.g in this scenario:

  • attempts first message...fails, put back on the queue for a retry
  • attempts second message...succeds
  • attempts first message again...succeeds

You can see in this scenario that you have broken the FIFO ordering if you allow the second message to be processed after the first has failed.

Are you seeing cases where the first message is being placed into the DLQ but then it does attempt the second message afterwards?

@johnnycashash
Copy link
Author

johnnycashash commented Aug 31, 2021

Hi, Below are the scenarios: (Let me know if the consumer understanding is correct)

firstmessage - m1, secondmessage - m2 . Both are part of same message group
Sent back - Becomes visible after timeout

Scenario 1: (when m1 is successfully sent to DLQ and then m2 is placed in queue)

  • Put m1 into the queue.
  • Consumer picks m1 and starts attempt 1. It fails. Sent back to queue.
  • Consumer picks m1 and starts attempt 2. It fails. Sent back to queue.
  • Consumer picks m1 and starts attempt 3. It fails. Sent back to DLQ after 3 receives (So that it does not block consumers from picking messages of same group id).
  • Put m2 into queue.
  • Consumer picks m2 and starts attempt 1. It fails. Sent back to queue.
  • Consumer picks m2 and starts attempt 2. It fails. Sent back to queue.
  • Consumer picks m2 and starts attempt 3. It fails. Sent back to DLQ after 3 receives.

Scenario 2: (when m1 is retried 1st time and then m2 is placed in queue)

  • Put m1 into the queue.
  • Consumer picks m1 and starts attempt 1. It fails. Sent back to queue.
  • Put m2 into queue.
  • Consumer picks m1 and m2 and starts attempt 2 on m1 and receive count on m2 increases to 1 but m2 is not processed. m1 fails. m1 Sent back to queue.
  • Consumer picks m1 and m2 and starts attempt 3 on m1 and receive count on m2 increases to 2 but m2 is not processed. m1 fails. m1 Sent back to DLQ after 3 receives
  • Consumer picks m2 and starts attempt 3 as m2 is being received from queue 3rd time. m2 fails. m2 Sent back to DLQ after 3 receives.

Attaching logs for scenario 2 for clarity:
Scenario 2 logs:

BatchingMessageRetriever : Started MessageRetriever
BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 10
BatchingMessageRetriever : Requesting 10 messages
BatchingMessageRetriever : Downloaded 1 messages (m1)
BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 10
BatchingMessageRetriever : Requesting 10 messages
Received Message with attempt::1 (m1)

Exception for m1

BatchingMessageRetriever : Downloaded 2 messages (m1,m2)
BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 11
BatchingMessageRetriever : Requesting 10 messages
Received Message with attempt::2 (m1)

Exception for m1

BatchingMessageRetriever : Downloaded 2 messages (m1,m2)
BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 12
BatchingMessageRetriever : Requesting 10 messages
Received Message with attempt::3 (m1)

Exception for m1
POINT - details mentioned below
BatchingMessageRetriever : Downloaded 1 messages (m2)
BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 13
BatchingMessageRetriever : Requesting 10 messages
Received Message with attempt::3 (m2)

Exception for m2

So if both m1 and m2 are received first time in same poll, then m2 is never retried if m1 keeps on failing and both goes to DLQ.
But if m1 is received alone in 1st poll and m2 is received in next poll then m2 is retried at least once even if m1 keeps on failing and is still not put in DLQ.

Adding some thought:
Maybe at POINT in log, m1 got placed in DLQ, so consumer is free to read and process next message m2

Thanks

@JaidenAshmore
Copy link
Owner

Hmm yeah I have been thinking about this and unsure what the best approach for this is as this is what the SQS FIFO queue allows us to do.

If you want to make sure that the second scenario cannot happen and it follows similar logic as scenario one, then I would recommend to set maximumMessagesInMessageGroup to be 1. In this case we would never pickup m1 and m2 at the same time and therefore would be practically the same as that first case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants