[Azure Batch]: TimeoutException when Nextflow attempts to read files following task completion for large-scale analyses in Azure #5067
Comments
I forgot to add, but while most of the above
Here, there's more information in the stack trace, including that the
That is, here is `nextflow/modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy` (lines 420 to 424 in e6a5e17):
And here is `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystem.groovy` (lines 201 to 207 in e6a5e17):
In this case, the exception appears to occur when getting the properties of the file (which might be why the stack trace is different).
Thanks for the detailed reporting. You are right, there's no retry on the Nextflow side since it's expected to be retried by the underlying Azure SDK. It looks like more of an issue with the SDK implementation. Not sure how to address it.
Thanks so much for the response @pditommaso 😄 We are still working out the exact cause of the issue and a solution for it, but here is an update in case what we've tried is useful to others.

Increased Azure HTTP logging

We increased the Azure HTTP logging by setting …
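(The specific setting got lost in formatting above, so take this as a hedged sketch only, not necessarily what we actually set: one common way to raise the Azure SDK's HTTP log level is the `AZURE_HTTP_LOG_DETAIL_LEVEL` environment variable, or `HttpLogOptions` on the client builder, for example:)

```groovy
// Sketch only: enabling verbose HTTP logging on an Azure blob client.
// The endpoint is a placeholder; an SLF4J binding is needed for the logs to appear.
import com.azure.core.http.policy.HttpLogDetailLevel
import com.azure.core.http.policy.HttpLogOptions
import com.azure.storage.blob.BlobServiceClientBuilder

def client = new BlobServiceClientBuilder()
        .endpoint('https://<account>.blob.core.windows.net')   // placeholder account
        .httpLogOptions(new HttpLogOptions().setLogLevel(HttpLogDetailLevel.HEADERS))
        .buildClient()
```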
The increased logging suggests the timeout is happening when running the code in `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystem.groovy` (lines 201 to 209 in e6a5e17).
When we look into recorded requests to the … Also, the … `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchService.groovy` (lines 917 to 924 in e6a5e17).
But, whatever request this was, it makes it through after a few attempts.

Increased number of retries

We are currently attempting to increase the number of retries (by adjusting the …).
The Azure Blob SDK library was updated in the latest Nextflow Edge version, 24.05.0-edge. You may want to give it a try. Another possibility would be to use the Batch task exit code via the Batch API to determine the task status, but I was not able to make it work: `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchTaskHandler.groovy` (line 118 in 9709067)
Thanks so much @pditommaso 😄. We are actually able to get it to work, but it required some changes to the … Trying to increase the maximum number of tries via the … However, by modifying the code in `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzHelper.groovy` (lines 215 to 218 in f3a86de):
If this is changed to the below, adding a `RequestRetryOptions` as part of the configuration, then the retry options for connections can be adjusted:

```groovy
return new BlobServiceClientBuilder()
    .credential(credential)
    .endpoint(endpoint)
    .retryOptions(new RequestRetryOptions(RetryPolicyType.EXPONENTIAL, maxTries, tryTimeoutInSeconds, null, null, null))
    .buildClient()
```
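(For reference, and assuming I'm reading the SDK signature correctly, the `RequestRetryOptions` constructor arguments are the retry policy type, the maximum number of tries, the per-try timeout in seconds, the retry delay, the maximum retry delay, and a secondary host. A sketch with purely illustrative values, not necessarily the ones we used:)

```groovy
import com.azure.storage.common.policy.RequestRetryOptions
import com.azure.storage.common.policy.RetryPolicyType

// Illustrative values only: up to 10 tries, each try allowed 10 minutes,
// with the SDK's default backoff delays and no secondary host.
def retryOptions = new RequestRetryOptions(
        RetryPolicyType.EXPONENTIAL,
        10,     // maxTries
        600,    // tryTimeoutInSeconds
        null,   // retryDelayInMs (SDK default)
        null,   // maxRetryDelayInMs (SDK default)
        null )  // secondaryHost
```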
You can see all the changes I made here: apetkau/nextflow@v24.04.2...nf-azure-1.6.0-nmlpatch0 (everything else is additional logging statements to help us debug). I am wondering if … Please let me know if you'd prefer I made a new feature request issue for this.
That's great feedback, thank you so much. Tagging @bentsherman and @adamrtalbot for visibility.
@apetkau if you put in a feature request we can see if we can help implement it.
Thanks @vsmalladi, I've made a feature request here: #5097
Bug report
We are using Nextflow with Azure Batch to process collections of microbial genomes (whole-genome sequence data). We have begun testing out processing larger collections of genomes and have been encountering issues with some of the tasks run by Nextflow that cause Nextflow to fail the task with a `java.util.concurrent.TimeoutException`. This primarily occurs when attempting to read the `.exitcode` of a task from blob storage, which causes Nextflow to fail the task and return `Integer.MAX_INTEGER` as the exit code. For example (see below for more context in the error message):

This behavior occurs only when scaling up to many genomes, and only impacts some of our runs. It also seems to impact random tasks/processes (in the above case, it is when running `QUAST`, but it occurs for random processes in the full pipeline). The pipeline we are using is https://github.com/phac-nml/mikrokondo/. I have observed it in other pipelines, but much less frequently. I believe it occurs in this pipeline since it does a lot of processing and may take up to 2 hours to process a genome.

Expected behavior and actual behavior
I would expect all of our pipeline executions to complete successfully and for there to be no `TimeoutException`s when reading outputs of a task (e.g., the `.exitcode` file in Azure blob storage).

Steps to reproduce the problem
As this is an issue that occurs mainly with large-scale analysis of genomes within Azure, and does not happen every time, it is a bit more difficult to provide a specific set of steps to reproduce the issue. However, here is a rough sketch:
Program output
I am unfortunately unable to share the full nextflow.log, but here is the relevant section.
What I have observed is that the below exception prints out 3 times, and then on the fourth time the task fails:
Note that the default timeout is 60,000 milliseconds (1 minute), but we modified it to be 600,000 milliseconds (10 minutes) by adjusting the `AZURE_REQUEST_RESPONSE_TIMEOUT` environment variable as described here: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/core/azure-core/README.md#http-timeouts. This didn't fix the problem though, it just increased the time before failure.

Environment
Additional context
We have spent a bit of time trying to identify the cause of this issue, but haven't been able to isolate and address it. However, we have some additional information on the locations in code where this is happening. I'm providing this information in case it is useful, but please feel free to skip the below (I might be wrong in some of this too).
1. TimeoutException failing a task
The `TimeoutException` that finally triggers failing a task is logged in the following line:

This occurs in this section of Nextflow code (the `AzBatchTaskHandler`): `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchTaskHandler.groovy` (lines 170 to 178 in e6a5e17)
That is, it's attempting to read the exit status via the `exitFile.text` property. The `exitFile` is of type `AzPath`, which is of type `Path`.

2. Reading from Path.text in Azure
I am not as familiar with how Groovy ultimately handles `Path.text` for reading and returning the contents of a file as text, but in Nextflow with Azure I think this ultimately runs this bit of code in `AzFileSystem` to open up a stream to the file in blob storage: `nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystem.groovy` (lines 201 to 217 in e6a5e17)
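To make that chain a bit more concrete, here is a rough sketch (not the actual Nextflow code, and I may be simplifying how Groovy resolves `Path.text`) of what reading the exit file amounts to:

```groovy
import java.nio.file.Files
import java.nio.file.Path

// Sketch only: roughly what exitFile.text boils down to for any Path.
// For an AzPath, opening the stream dispatches to the nf-azure
// FileSystemProvider / AzFileSystem code referenced above.
String readExitText(Path exitFile) {
    Files.newInputStream(exitFile).withCloseable { stream ->
        stream.text   // a TimeoutException from the Azure SDK would surface here
    }
}
```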
3. BlobClient.openInputStream()
Line 208 above runs `client.openInputStream()`. The `client` object is of type `BlobClient`, which extends from `BlobClientBase`, and `openInputStream()` is defined here in the Azure Java SDK: https://github.com/Azure/azure-sdk-for-java/blob/421555531b3e83a5df3ca605653c46f8c9c7d6de/sdk/storage/azure-storage-blob/src/main/java/com/azure/storage/blob/specialized/BlobClientBase.java#L292-L300
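For context, here is a minimal standalone sketch of that call using the Azure Storage SDK (the endpoint, container, and blob names are placeholders, and this is not the Nextflow code itself):

```groovy
import com.azure.storage.blob.BlobServiceClientBuilder

// Placeholder endpoint/container/blob names, for illustration only.
def blobClient = new BlobServiceClientBuilder()
        .endpoint('https://<account>.blob.core.windows.net')
        .buildClient()
        .getBlobContainerClient('work')
        .getBlobClient('ab/123456/.exitcode')

// openInputStream() is the BlobClientBase method linked above; a stalled
// response at this point is where the TimeoutException would come from.
blobClient.openInputStream().withCloseable { stream ->
    println stream.text
}
```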
4. Azure Java SDK HTTP Pipeline
The Azure Java SDK seems to use an HTTP Pipeline for defining different steps to take when handling an API request to Azure: https://learn.microsoft.com/en-us/azure/developer/java/sdk/http-client-pipeline#http-pipeline
In particular, there are configurable retry policies that are part of the Azure SDK: https://learn.microsoft.com/en-us/azure/developer/java/sdk/http-client-pipeline#common-http-pipeline-policies
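As a rough sketch of that concept (illustrative values only, not what the SDK or Nextflow configures by default), a retry policy is just one of the policies composed into the HTTP pipeline:

```groovy
import java.time.Duration
import com.azure.core.http.HttpPipelineBuilder
import com.azure.core.http.policy.ExponentialBackoff
import com.azure.core.http.policy.RetryPolicy

// Illustrative only: an HTTP pipeline whose retry policy allows up to 10 retries,
// with exponential backoff between 1 second and 1 minute.
def pipeline = new HttpPipelineBuilder()
        .policies(new RetryPolicy(new ExponentialBackoff(10, Duration.ofSeconds(1), Duration.ofMinutes(1))))
        .build()
```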
5. Azure/HTTP Pipeline retry policies
This brings us to the other part of the exception in the Nextflow log files:
This appears to be triggered by this code:
https://github.com/Azure/azure-sdk-for-java/blob/421555531b3e83a5df3ca605653c46f8c9c7d6de/sdk/core/azure-core-http-netty/src/main/java/com/azure/core/http/netty/implementation/AzureSdkHandler.java#L199-L207
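That handler is internal to azure-core-http-netty, but the pattern is essentially a per-request response timeout. Here is a small, self-contained Reactor illustration (assuming reactor-core on the classpath; this is not the SDK code) of how such a timeout surfaces as a `java.util.concurrent.TimeoutException`:

```groovy
import java.time.Duration
import reactor.core.publisher.Mono

// Illustration only: a "response" that takes 5 seconds against a 1-second timeout
// fails with a java.util.concurrent.TimeoutException, which is the exception that
// ultimately bubbles up to Nextflow.
Mono.just('response body')
    .delayElement(Duration.ofSeconds(5))   // simulate a slow blob-storage response
    .timeout(Duration.ofSeconds(1))        // analogous to the request/response timeout
    .doOnError { e -> println "Failed with: ${e.class.name}" }
    .onErrorResume { Mono.empty() }
    .block()
```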
6. Nextflow retry policies
I do know there are `azure.retryPolicy.*` parameters that can be adjusted in the Nextflow config: https://www.nextflow.io/docs/latest/config.html#config-azure. We have tried adjusting them, but I'm guessing that in this case, when reading the `.exitcode` file from Azure blob storage, these policies aren't being applied, and so it's defaulting to the retry policies that are configured as default within the Azure SDK. I'm not sure if this is expected to cause any issues?

Also, some of the above description of the code might be incorrect. I have been running up against my own lack of knowledge of how everything works, which makes it hard to debug.
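For reference, these are the settings I mean; a minimal `nextflow.config` sketch with illustrative values (the actual defaults and accepted values are in the Nextflow docs linked above):

```groovy
// nextflow.config — illustrative values only
azure {
    retryPolicy {
        maxAttempts = 10      // maximum number of retry attempts
        delay       = '500ms' // initial delay between retries
        maxDelay    = '90s'   // upper bound on the retry delay
        jitter      = 0.25    // jitter applied to the delay
    }
}
```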