
Failed config reloads in Logstash cause resource leaks #16202

Closed
edmocosta opened this issue Jun 6, 2024 · 6 comments · Fixed by #16264
Comments

edmocosta (Contributor) commented Jun 6, 2024

Logstash information:

  1. Logstash version: 8.14.0
  2. Logstash installation source: zip archive
  3. How is Logstash being run: Via command line

Description of the problem including expected versus actual behavior:
Logstash is not properly handling failed configuration reloads.

When Logstash tries to reload the pipeline configuration and the reload fails (due to some issue, e.g. the database being down), subsequent reloads cause resource leaks, as no attempt is made to stop the input plugins. Logstash also keeps the old pipeline's plugins in the metric store, filling up memory and returning outdated resources from the stats API.

The issue seems to be in the java_pipeline.rb#shutdown logic. When the reload fails, finished_execution? is true, so the stop_inputs, wait_for_shutdown and clear_pipeline_metrics methods are skipped, leaking those resources; the leak grows linearly with every failed reload.
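
For reference, the guard in question looks roughly like this (abridged from java_pipeline.rb; the same lines appear in the diff further down this thread):

def shutdown
  return if finished_execution?
  # shutdown can only start once the pipeline has completed its startup.
  # avoid potential race condition between the startup sequence and this
  # shutdown method which can be called from another thread at any time
  sleep(0.1) while !ready?

  # TODO: should we also check against calling shutdown multiple times concurrently?
  stop_inputs
  wait_for_shutdown
  clear_pipeline_metrics
end # def shutdown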

For example, given the pipeline:

input {
  jdbc {
    id => "db_1"
    type => "jdbc"
    ...
  }

  jdbc {
    id => "db_2"
    type => "jdbc"
    ...
  }
}

If the pipeline configuration reload fails due to db_2 being unavailable, the db_1 connections won't be closed, and the resource usage (number of connections) grows with every failed reload attempt. Logstash will also keep both the "old" and the "new" pipelines' plugins in the metric store, increasing the number of elements on every retry.

Steps to reproduce:

  1. Configure a pipeline with 2 JDBC inputs
  2. Start Logstash with auto reload enabled
  3. Check the number of DB connections (e.g. MySQL's SHOW PROCESSLIST) and the plugin metrics:
    curl -s http://localhost:9600/_node/stats | jq -r '.pipelines."<pipeline_name>".plugins.codecs | length'
  4. Change the pipeline in a way that makes one plugin fail (a wrong username, for example).
  5. Change the pipeline again (an empty space, a newline, etc.).
  6. Repeat step 3 and compare the results.
yaauie (Member) commented Jun 11, 2024

Fly-by thoughts:

JavaPipeline#clear_pipeline_metrics is only called from JavaPipeline#shutdown, and does not occur when a pipeline fails to start. IIRC we want a pipeline to leave its metrics in place when it stops "normally" or exceptionally, but we want them to be cleared when the pipeline has been shut down.

I think that the "easiest" way to handle this would be to also clear them as part of starting up, but we would need to be careful to ensure that if a pipeline overlaps itself during a restart, we keep the ordering correct.
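
A very rough sketch of that "clear on startup" idea (where exactly this would sit inside JavaPipeline#start is an assumption, not the actual source):

def start
  # assumption: clear any metrics left behind by a previous incarnation of this
  # pipeline (e.g. one that failed to reload) before the new plugins register,
  # so stale entries don't accumulate in the metric store
  clear_pipeline_metrics
  # ... existing startup sequence ...
end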

Alternatively, we could change the default id generation algorithm to somehow fingerprint the source of the plugin.

edmocosta (Contributor, Author) commented:

Hey @yaauie, thank you for the input! I'm wondering if changing the JavaPipeline#shutdown method to take a reloading parameter that avoids the early return would solve this issue. E.g.:

def shutdown(reloading: false)
  return if finished_execution? && !reloading
  # ... rest of the existing shutdown logic ...
end

Then on the PipelineAction::Reload#execute:

old_pipeline.shutdown(reloading: true)

My initial tests seem to work with this code change, but I'm still unsure whether it could have any other impact on the reloading process. WDYT?

yaauie (Member) commented Jun 11, 2024

🤔 The comments lead me to believe that we guard on JavaPipeline#finished_execution? to ensure we don't have a race with two threads starting and stopping the pipeline simultaneously, and that we have a chance to block until a running pipeline has fully shut down. But because it bails out of the shutdown method entirely, it fails to do the cleanup that needs to happen whether or not the to-be-shut-down pipeline has already completed execution.

I think that if we change the scope of the guard a little, we can get the behaviour we want in both cases without needing to add the reloading control variable:

diff --git a/logstash-core/lib/logstash/java_pipeline.rb b/logstash-core/lib/logstash/java_pipeline.rb
index 85f182fdb..0e4bd5c44 100644
--- a/logstash-core/lib/logstash/java_pipeline.rb
+++ b/logstash-core/lib/logstash/java_pipeline.rb
@@ -446,15 +446,16 @@ module LogStash; class JavaPipeline < AbstractPipeline
   # this method is intended to be called from outside the pipeline thread
   # and will block until the pipeline has successfully shut down.
   def shutdown
-    return if finished_execution?
-    # shutdown can only start once the pipeline has completed its startup.
-    # avoid potential race condition between the startup sequence and this
-    # shutdown method which can be called from another thread at any time
-    sleep(0.1) while !ready?
-
-    # TODO: should we also check against calling shutdown multiple times concurrently?
-    stop_inputs
-    wait_for_shutdown
+    unless finished_execution?
+      # shutdown can only start once the pipeline has completed its startup.
+      # avoid potential race condition between the startup sequence and this
+      # shutdown method which can be called from another thread at any time
+      sleep(0.1) while !ready?
+
+      # TODO: should we also check against calling shutdown multiple times concurrently?
+      stop_inputs
+      wait_for_shutdown
+    end
     clear_pipeline_metrics
   end # def shutdown
 

edmocosta (Contributor, Author) commented Jun 17, 2024

Hi @yaauie, if I'm not mistaken, that would only solve the pipeline metrics issue, since clear_pipeline_metrics was moved out of the guard, but the resource leak problem would still happen, as successfully started input plugins wouldn't be closed on subsequent failed reloads.

For example:

Let's assume the following pipeline started successfully and both plugins are running:

input {
  jdbc {
    id => "db_1"
    type => "jdbc"
    ...
  }

  jdbc {
    id => "db_2"
    type => "jdbc"
    ...
  }
}

Then, the user changes the jdbc { id => "db_2" } settings, making it invalid (wrong password, for example).
On the first reload attempt, the shutdown method gets finished_execution? => false, cleaning the resources and metrics properly.

But then the user changes it again, and the configuration is still invalid. This time, when Logstash tries to reload the pipeline configuration, the shutdown method gets finished_execution? => true, as the pipeline being reloaded/shut down is the same one that previously failed to start. The resource cleanup is skipped, leaving the jdbc { id => "db_1" } connections open (as stop_inputs is never invoked).

jsvd (Member) commented Jun 25, 2024

The scenario I was able to reproduce consistently has to do with resources created during the register method of input plugins. It didn't reproduce when the resources were created by filters, or when they were created after register, during start.

To reproduce, I took simple inputs like heartbeat and generator, added a resource creation (a connection) during register, and added a conditional exception to one of them. With this I could trigger the resource pile-up.
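
A minimal sketch of that kind of modified test input (the plugin name, the fail_on_register option, and FakeConnection are all hypothetical, just to illustrate a register-time resource plus a conditional failure):

require "logstash/inputs/base"

# hypothetical test input: acquires a resource in register and can be told to fail,
# mimicking the jdbc inputs from the original report
class LogStash::Inputs::LeakyDemo < LogStash::Inputs::Base
  config_name "leaky_demo"

  # when true, simulates the "db_2 is unavailable" case
  config :fail_on_register, :validate => :boolean, :default => false

  def register
    # resource acquired during register: this is what piles up when a sibling
    # input raises and the failed reload never stops/closes this plugin
    @connection = FakeConnection.open
    raise "simulated register failure" if @fail_on_register
  end

  def run(queue)
    # idle until shutdown is requested; the leak is unrelated to run
    sleep(1) until stop?
  end

  def stop
    @connection.close if @connection
  end
end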

The proposed solution fixes this scenario, and I don't see any problems it could create. In this case we're sort of abusing stop_inputs to perform the cleanup, even though inputs never ended up starting in the first place, but it works. A more structured fix would be to differentiate this situation and call close on each input (as opposed to stop).
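
For illustration, that more structured variant could look something like the sketch below, assuming the pipeline keeps a reference to its inputs and reuses the existing plugin-level do_close hook (whether a close_inputs helper like this exists, and where it would be called from, is an assumption):

# hypothetical helper on the pipeline: release resources held by inputs that
# were registered but never started, instead of going through stop_inputs
def close_inputs
  inputs.each(&:do_close)
end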

jsvd (Member) commented Jun 25, 2024

In the meantime, the fix proposed for cleaning up the metrics resources is still valid and can be applied regardless of the fix for the input register leaks.
