[Azure Batch]: TimeoutException when Nextflow attempts to read files following task completion for large-scale analyses in Azure #5067
Comments
I forgot to add, but while most of the above
Here, there's more information in the stack trace, including that the
That is, here is `nextflow/modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy` (lines 420 to 424 in e6a5e17):
And here is `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystem.groovy` (lines 201 to 207 in e6a5e17):
In this case, the exception appears to occur when getting the properties of the file (which might be why the stack trace is different).
Thanks for the detailed reporting. You are right, there's no retry on the Nextflow side since it's expected to be retried by the underlying Azure SDK. It looks like more of an issue with the SDK implementation. Not sure how to address it.
Thanks so much for the response @pditommaso 😄 We are still working out the exact cause of the issue and a solution for it, but here is an update in case what we've tried is useful to others.

Increased Azure HTTP logging

We increased the Azure HTTP logging by setting …
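(The specific setting got lost in formatting above, so take this as a hedged sketch only, not necessarily what we actually set: one common way to raise the Azure SDK's HTTP log level is the `AZURE_HTTP_LOG_DETAIL_LEVEL` environment variable, or `HttpLogOptions` on the client builder, for example:)

```groovy
// Sketch only: enabling verbose HTTP logging on an Azure blob client.
// The endpoint is a placeholder; an SLF4J binding is needed for the logs to appear.
import com.azure.core.http.policy.HttpLogDetailLevel
import com.azure.core.http.policy.HttpLogOptions
import com.azure.storage.blob.BlobServiceClientBuilder

def client = new BlobServiceClientBuilder()
        .endpoint('https://<account>.blob.core.windows.net')   // placeholder account
        .httpLogOptions(new HttpLogOptions().setLogLevel(HttpLogDetailLevel.HEADERS))
        .buildClient()
```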
The increased logging suggests the timeout is happening when running the code in `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystem.groovy` (lines 201 to 209 in e6a5e17).
When we look into recorded requests to the … Also, the … `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchService.groovy` (lines 917 to 924 in e6a5e17).
But, whatever request this was, it makes it through after a few attempts.

Increased number of retries

We are currently attempting to increase the number of retries (by adjusting the …).
The Azure Blob SDK library was updated in the latest Nextflow Edge version, 24.05.0-edge. You may want to give it a try. Another possibility would be to use the Batch task exit code via the Batch API to determine the task status, but I was not able to make it work: `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchTaskHandler.groovy` (line 118 in 9709067)
Thanks so much @pditommaso 😄. We are actually able to get it to work, but it required some changes to the … Trying to increase the maximum number of tries via the … However, by modifying the code in `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzHelper.groovy` (lines 215 to 218 in f3a86de):
If this is changed to the below, adding a `RequestRetryOptions` as part of the configuration, then the retry options for connections can be adjusted:

```groovy
return new BlobServiceClientBuilder()
    .credential(credential)
    .endpoint(endpoint)
    .retryOptions(new RequestRetryOptions(RetryPolicyType.EXPONENTIAL, maxTries, tryTimeoutInSeconds, null, null, null))
    .buildClient()
```
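(For reference, and assuming I'm reading the SDK signature correctly, the `RequestRetryOptions` constructor arguments are the retry policy type, the maximum number of tries, the per-try timeout in seconds, the retry delay, the maximum retry delay, and a secondary host. A sketch with purely illustrative values, not necessarily the ones we used:)

```groovy
import com.azure.storage.common.policy.RequestRetryOptions
import com.azure.storage.common.policy.RetryPolicyType

// Illustrative values only: up to 10 tries, each try allowed 10 minutes,
// with the SDK's default backoff delays and no secondary host.
def retryOptions = new RequestRetryOptions(
        RetryPolicyType.EXPONENTIAL,
        10,     // maxTries
        600,    // tryTimeoutInSeconds
        null,   // retryDelayInMs (SDK default)
        null,   // maxRetryDelayInMs (SDK default)
        null )  // secondaryHost
```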
You can see all the changes I made here: apetkau/nextflow@v24.04.2...nf-azure-1.6.0-nmlpatch0 (everything else is additional logging statements to help us debug). I am wondering if … Please let me know if you'd prefer I made a new feature request issue for this.
That's great feedback, thank you so much. Tagging @bentsherman and @adamrtalbot for visibility.
@apetkau if you put in a feature request we can see if we can help implement it.
Thanks @vsmalladi, I've made a feature request here: #5097
Bug report
We are using Nextflow with Azure Batch to process collections of microbial genomes (whole-genome sequence data). We have begun testing out processing larger collections of genomes and have been encountering issues with some of the tasks run by Nextflow that cause Nextflow to fail the task with a `java.util.concurrent.TimeoutException`. This primarily occurs when attempting to read the `.exitcode` of a task from blob storage, which causes Nextflow to fail the task and return `Integer.MAX_INTEGER` as the exit code. For example (see below for more context in the error message):

This behavior occurs only when scaling up to many genomes, and only impacts some of our runs. It also seems to impact random tasks/processes (in the above case, it is when running `QUAST`, but it occurs for random processes in the full pipeline). The pipeline we are using is https://github.com/phac-nml/mikrokondo/. I have observed it in other pipelines, but much less frequently. I believe it occurs in this pipeline since it does a lot of processing and may take up to 2 hours to process a genome.

Expected behavior and actual behavior
I would expect all of our pipeline executions to complete successfully and for there to be no `TimeoutException`s when reading outputs of a task (e.g., the `.exitcode` file in Azure blob storage).

Steps to reproduce the problem
As this is an issue that occurs mainly with large-scale analysis of genomes within Azure, and does not happen every time, it is a bit more difficult to provide a specific set of steps to reproduce the issue. However, here is a rough sketch:
Program output
I am unfortunately unable to share the full nextflow.log, but here is the relevant section.
What I have observed is that the below exception prints out 3 times, and then on the fourth time the task fails:
Note that the default timeout is 60,000 milliseconds (1 minute), but we modified it to be 600,000 milliseconds (10 minutes) by adjusting the `AZURE_REQUEST_RESPONSE_TIMEOUT` environment variable as described here: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/core/azure-core/README.md#http-timeouts. This didn't fix the problem though, it just increased the time before failure.

Environment
Additional context
We have spent a bit of time trying to identify the cause of this issue, but haven't been able to isolate and address it. However, we have some additional information on the locations in code where this is happening. I'm providing this information in case it is useful, but please feel free to skip the below (I might be wrong in some of this too).
1. TimeoutException failing a task
The `TimeoutException` that finally triggers failing a task is logged in the following line:

This occurs in this section of Nextflow code (the `AzBatchTaskHandler`): `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchTaskHandler.groovy` (lines 170 to 178 in e6a5e17)
That is, it's attempting to read the exit status via the `exitFile.text` property. The `exitFile` is of type `AzPath`, which is of type `Path`.

2. Reading from Path.text in Azure
I am not as familiar with how Groovy ultimately handles `Path.text` for reading and returning the contents of a file as text, but in Nextflow with Azure I think this ultimately runs this bit of code in `AzFileSystem` to open up a stream to the file in blob storage: `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystem.groovy` (lines 201 to 217 in e6a5e17)
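To make that chain a bit more concrete, here is a rough sketch (not the actual Nextflow code, and I may be simplifying how Groovy resolves `Path.text`) of what reading the exit file amounts to:

```groovy
import java.nio.file.Files
import java.nio.file.Path

// Sketch only: roughly what exitFile.text boils down to for any Path.
// For an AzPath, opening the stream dispatches to the nf-azure
// FileSystemProvider / AzFileSystem code referenced above.
String readExitText(Path exitFile) {
    Files.newInputStream(exitFile).withCloseable { stream ->
        stream.text   // a TimeoutException from the Azure SDK would surface here
    }
}
```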
3. BlobClient.openInputStream()
Line 208 above runs `client.openInputStream()`. The `client` object is of type `BlobClient`, which extends from `BlobClientBase`, and `openInputStream()` is defined here in the Azure Java SDK: https://github.com/Azure/azure-sdk-for-java/blob/421555531b3e83a5df3ca605653c46f8c9c7d6de/sdk/storage/azure-storage-blob/src/main/java/com/azure/storage/blob/specialized/BlobClientBase.java#L292-L300
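For context, here is a minimal standalone sketch of that call using the Azure Storage SDK (the endpoint, container, and blob names are placeholders, and this is not the Nextflow code itself):

```groovy
import com.azure.storage.blob.BlobServiceClientBuilder

// Placeholder endpoint/container/blob names, for illustration only.
def blobClient = new BlobServiceClientBuilder()
        .endpoint('https://<account>.blob.core.windows.net')
        .buildClient()
        .getBlobContainerClient('work')
        .getBlobClient('ab/123456/.exitcode')

// openInputStream() is the BlobClientBase method linked above; a stalled
// response at this point is where the TimeoutException would come from.
blobClient.openInputStream().withCloseable { stream ->
    println stream.text
}
```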
4. Azure Java SDK HTTP Pipeline
The Azure Java SDK seems to use an HTTP Pipeline for defining different steps to take when handling an API request to Azure: https://learn.microsoft.com/en-us/azure/developer/java/sdk/http-client-pipeline#http-pipeline
In particular, there are configurable retry policies that are part of the Azure SDK: https://learn.microsoft.com/en-us/azure/developer/java/sdk/http-client-pipeline#common-http-pipeline-policies
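As a rough sketch of that concept (illustrative values only, not what the SDK or Nextflow configures by default), a retry policy is just one of the policies composed into the HTTP pipeline:

```groovy
import java.time.Duration
import com.azure.core.http.HttpPipelineBuilder
import com.azure.core.http.policy.ExponentialBackoff
import com.azure.core.http.policy.RetryPolicy

// Illustrative only: an HTTP pipeline whose retry policy allows up to 10 retries,
// with exponential backoff between 1 second and 1 minute.
def pipeline = new HttpPipelineBuilder()
        .policies(new RetryPolicy(new ExponentialBackoff(10, Duration.ofSeconds(1), Duration.ofMinutes(1))))
        .build()
```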
5. Azure/HTTP Pipeline retry policies
This brings us to the other part of the exception in the Nextflow log files:
This appears to be triggered by this code:
https://github.com/Azure/azure-sdk-for-java/blob/421555531b3e83a5df3ca605653c46f8c9c7d6de/sdk/core/azure-core-http-netty/src/main/java/com/azure/core/http/netty/implementation/AzureSdkHandler.java#L199-L207
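That handler is internal to azure-core-http-netty, but the pattern is essentially a per-request response timeout. Here is a small, self-contained Reactor illustration (assuming reactor-core on the classpath; this is not the SDK code) of how such a timeout surfaces as a `java.util.concurrent.TimeoutException`:

```groovy
import java.time.Duration
import reactor.core.publisher.Mono

// Illustration only: a "response" that takes 5 seconds against a 1-second timeout
// fails with a java.util.concurrent.TimeoutException, which is the exception that
// ultimately bubbles up to Nextflow.
Mono.just('response body')
    .delayElement(Duration.ofSeconds(5))   // simulate a slow blob-storage response
    .timeout(Duration.ofSeconds(1))        // analogous to the request/response timeout
    .doOnError { e -> println "Failed with: ${e.class.name}" }
    .onErrorResume { Mono.empty() }
    .block()
```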
6. Nextflow retry policies
I do know there are `azure.retryPolicy.*` parameters that can be adjusted in the Nextflow config: https://www.nextflow.io/docs/latest/config.html#config-azure. We have tried adjusting them, but I'm guessing that in this case, when reading the `.exitcode` file from Azure blob storage, these policies aren't being applied, and so it's defaulting to the retry policies that are configured as default within the Azure SDK. I'm not sure if this is expected to cause any issues?

Also, some of the above description of the code might be incorrect. I have been running up against my own lack of knowledge of how everything works, which makes it hard to debug.
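For reference, these are the settings I mean; a minimal `nextflow.config` sketch with illustrative values (the actual defaults and accepted values are in the Nextflow docs linked above):

```groovy
// nextflow.config — illustrative values only
azure {
    retryPolicy {
        maxAttempts = 10      // maximum number of retry attempts
        delay       = '500ms' // initial delay between retries
        maxDelay    = '90s'   // upper bound on the retry delay
        jitter      = 0.25    // jitter applied to the delay
    }
}
```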