LogStash fails to acquire lock causing LockException. #16173

Open
sasikiranvaddi opened this issue May 21, 2024 · 6 comments
@sasikiranvaddi

sasikiranvaddi commented May 21, 2024

We observe a LockException while the Logstash process is running. According to the logs, before the LockException occurred, logstash.agent was trying to fetch the pipelines count but could not get it, causing a Java NullPointerException.

[logstash.agent] Failed to execute action {:action=>LogStash::PipelineAction::Reload/pipeline_id:logstash, :exception=>'Java::JavaLang::NullPointerException', :message=>'', :backtrace=>['org.jruby.runtime.scopes.DynamicScope5.getValue(Unknown Source)', 'org.jruby.ir.interpreter.InterpreterEngine.retrieveOp(InterpreterEngine.java:594)', 'org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:348)', 'org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66)', 'org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116)', 'org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:136)', 'org.jruby.runtime.IRBlockBody.yieldSpecific(IRBlockBody.java:76)', 'org.jruby.runtime.Block.yieldSpecific(Block.java:158)', 'org.jruby.ext.monitor.Monitor.synchronize(Monitor.java:82)', 'org.jruby.ext.monitor.Monitor$INVOKER$i$0$0$synchronize.call(Monitor$INVOKER$i$0$0$synchronize.gen)', 'org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroBlock.call(JavaMethod.java:561)', 'org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:90)', 'org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:103)', 'org.jruby.ir.instructions.CallBase.interpret(CallBase.java:545)', 'org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:367)', 'org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66)', 'org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:82)', 'org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:201)', 'org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:188)', 'org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:220)', 'org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:242)', 'org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:318)', 
'org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66)', 'org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:82)', 'org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:201)', 'org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:188)', 'org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:257)', 'org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:270)', 'org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:341)', 'org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66)', 'org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:88)', 'org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:238)', 'org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:225)', 'org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:228)', 'opt.logstash.logstash_minus_core.lib.logstash.agent.RUBY$block$converge_state$2(/opt/logstash/logstash-core/lib/logstash/agent.rb:386)', 'org.jruby.runtime.CompiledIRBlockBody.callDirect(CompiledIRBlockBody.java:141)', 'org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:64)', 'org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:58)', 'org.jruby.runtime.Block.call(Block.java:144)', 'org.jruby.RubyProc.call(RubyProc.java:352)', 'org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:111)', 'java.base/java.lang.Thread.run(Thread.java:829)']}

According to this traceback, Logstash was trying to execute the pipeline reload action:

begin
  logger.debug("Executing action", :action => action)
  action_result = action.execute(self, @pipelines_registry)
  converge_result.add(action, action_result)
  unless action_result.successful?
    logger.error("Failed to execute action",
      :id => action.pipeline_id,
      :action_type => action_result.class,
      :message => action_result.message,
      :backtrace => action_result.backtrace
    )
  end

Tracing further back, the lines of code below probably return null when executed, which leaves the lock in place instead of removing it.

elastic/logstash/blob/main/logstash-core/lib/logstash/pipeline_action/reload.rb#L39-L42
def execute(agent, pipelines_registry)
  old_pipeline = pipelines_registry.get_pipeline(pipeline_id)
  if old_pipeline.nil?

Could you please let us know in which scenarios the NullPointerException and the LockException can occur? If the transaction fails, should the rescue path release the lock so that the next transaction can complete successfully?

@roaksoax
Contributor

Hi @sasikiranvaddi

Thanks for filing the issue. Could you please provide the information requested on the issue template?

That is:

  • Logstash Version
  • Installation Source with Steps followed to install
  • Pipeline sample causing the issue
  • Steps to reproduce the issue
  • Logs (providing the traceback is not enough).

Thank you!

@sasikiranvaddi
Author

sasikiranvaddi commented May 22, 2024

Hi @roaksoax,

Thank you for acknowledging the issue. Please find below the requested information.

Logstash Version:
8.11.3

Installation Source with Steps followed to install:
Built from source code.

Pipeline sample causing the issue:

pipelines.yml:
----
- pipeline.id: logstash
  queue.type: persisted
  queue.max_bytes: 1024mb
  path.config: "/opt/logstash/resource/logstash.conf"
- pipeline.id: opensearch
  queue.type: persisted
  queue.max_bytes: 1024mb
  path.config: "/opt/logstash/resource/searchengine.conf"

Steps to reproduce the issue:
Unknown. We do not know what causes the Java NPE or why the lock is not released.

Logs (providing the traceback is not enough).
Attached logs.txt

@yaauie
Member

yaauie commented May 22, 2024

The backtrace from the issue description appears to be the jruby runtime hitting an NPE in the course of interpreting and running our code (as opposed to our code failing to acquire a lock).

This is not a normal scenario, and is certainly caused by a bug in Jruby.

The NPE appears to have occurred while starting the pipeline, after the Logstash code had acquired the PQ's lock (which is two-fold; when opening the queue, we first ensure that no other process has the queue open using an on-disk lock file, and then ensure that the current process only opens it once; it appears that both levels of locks had been acquired prior to jruby throwing the NPE).

The Agent's config state converger prevented the exception from crashing the Logstash process, but the queue's lock was left in a locked state. Because the lock is supposed to live beyond the starting of the pipeline, there is no implicit auto-close handling.

Since the lock had been acquired and was not released, subsequent reloads of the pipeline cannot acquire it. The only way to get the pipeline running again is to stop the process, manually remove the offending queue's lock file, and restart the process.
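
The failure sequence described above can be sketched in plain Ruby. This is a simplified illustration, not Logstash's code: `SketchQueue`, `SketchLockError`, and `run_action` are hypothetical names, the lock is modelled as the mere existence of a `.lock` file (an assumption made for brevity), and the second, in-process level of locking is left out.

```ruby
require "tmpdir"

# Hypothetical stand-in for a lock failure; not Logstash's LockException class.
class SketchLockError < StandardError; end

# Hypothetical queue that guards its directory with an on-disk ".lock" file.
class SketchQueue
  def initialize(dir)
    @lock_path = File.join(dir, ".lock")
  end

  # Acquire the on-disk lock, then run the rest of startup (the block).
  # If startup raises, nothing in this method removes the lock file.
  def open
    begin
      # CREAT | EXCL fails with EEXIST when the lock file already exists.
      File.open(@lock_path, File::WRONLY | File::CREAT | File::EXCL) do |f|
        f.write(Process.pid.to_s)
      end
    rescue Errno::EEXIST
      raise SketchLockError, "lock already held: #{@lock_path}"
    end
    yield if block_given? # stands in for the remaining pipeline startup
    self
  end

  def locked?
    File.exist?(@lock_path)
  end
end

# Converger-style wrapper: reports a failure instead of letting the
# exception crash the process.
def run_action
  yield
  :ok
rescue StandardError => e
  e.class
end

def demo_stuck_lock
  Dir.mktmpdir do |dir|
    queue = SketchQueue.new(dir)
    # The first start fails after the lock was taken.
    first = run_action { queue.open { raise "simulated runtime failure" } }
    # The reload attempt then cannot acquire the leftover lock.
    second = run_action { SketchQueue.new(dir).open }
    [first, queue.locked?, second]
  end
end
```

In this sketch the process survives the first failure, but every later attempt to open the same queue directory fails until the lock file is removed by hand.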


Logstash 8.11.3's distribution from Elastic is bundled with Jruby 9.4.5.0 and Adoptium's JDK 17.0.9p9, but since you have built from source there are a number of additional variables at play.

Since you built from source it would also be helpful to know:

  • whether you are running the assembled components directly ("${SOURCE}/bin/logstash") or from an assembled artifact (such as a tarball, rpm, or deb)
  • whether your build correctly vendored jruby 9.4.5.0 or later (I can see similar issues resolved in the jruby project as recently as 9.4.4.0)
  • which commit you built from (git rev-parse HEAD); if that commit is not on elastic/logstash, then the diff between your head and a base commit that is on elastic/logstash would be necessary.
  • which JDK you are using to build logstash
  • which JDK you are using to run the logstash process

@sasikiranvaddi
Author

Hi @yaauie ,

Thank you for sharing the detailed analysis.
Just a quick question: the Agent's config state converger does not release the lock when an exception occurs, and this is on purpose. Is my understanding correct? Is the only way forward to stop the Logstash process manually, delete the lock file, and start the process again?

Please find the requested information below.

  1. whether you are running the assembled components directly ("${SOURCE}/bin/logstash") or from an assembled artifact (such as a tarball, rpm, or deb)
    whether your build correctly vendored jruby 9.4.5.0 or later (I can see similar issues resolved in the jruby project as recently as 9.4.4.0)
    The build uses JRuby 9.3.10 and JDK 11. We mirror the GitHub repository and use it from there.
  2. which commit you built from (git rev-parse HEAD); if that commit is not on elastic/logstash, then the diff between your head and a base commit that is on elastic/logstash would be necessary.
    $ git rev-parse HEAD
    45f6dce
  3. which JDK you are using to build logstash
    openjdk version "11" 2018-09-25
    OpenJDK Runtime Environment 18.9 (build 11+28)
    OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
  4. which JDK you are using to run the logstash process
    openjdk 11.0.22 2024-01-16
    OpenJDK Runtime Environment (build 11.0.22+7-suse-150000.3.110.1-x8664)
    OpenJDK 64-Bit Server VM (build 11.0.22+7-suse-150000.3.110.1-x8664, mixed mode)

@sasikiranvaddi
Author

Hi @yaauie,

Just to add, we have observed the LockException multiple times, and the pattern is that it is always accompanied by some other error. Some of them are:

  1. NPE
  2. Key Not Found
  3. failed to coerce rubyobj.LogStash.Instrument.MetricType.Gauge to org.logstash.instrument.metrics.counter.LongCounter

Proposal:
The two-fold locking is done in https://github.com/elastic/logstash/blob/main/logstash-core/src/main/java/org/logstash/ackedqueue/Queue.java#L167-L192. If an exception occurs in this part of the execution, https://github.com/elastic/logstash/blob/main/logstash-core/src/main/java/org/logstash/ackedqueue/Queue.java#L171-L177, the thread lock is released in the finally block, but the on-disk lock still exists.

Could the on-disk lock be released in this part of the code, https://github.com/elastic/logstash/blob/main/logstash-core/src/main/java/org/logstash/ackedqueue/Queue.java#L175-L177, by invoking the method releaseLockAndSwallow()?
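
To make the proposal concrete, here is a minimal Ruby sketch of the idea: release the on-disk lock when opening fails after the lock was acquired. It is not the Queue.java code. `ProposalQueue`, `ProposalLockError`, and `release_lock_quietly` are hypothetical names (the last one only loosely mirrors what releaseLockAndSwallow() is assumed to do), and the lock is modelled as the existence of a `.lock` file.

```ruby
require "tmpdir"

# Hypothetical stand-in for a lock failure; not Logstash's LockException class.
class ProposalLockError < StandardError; end

# Hypothetical queue whose open() cleans up its on-disk lock on failure.
class ProposalQueue
  def initialize(dir)
    @lock_path = File.join(dir, ".lock")
  end

  def open
    acquire_lock
    opened = false
    begin
      yield if block_given? # stands in for the rest of the open sequence
      opened = true
    ensure
      # Runs for any exception type: drop the lock unless open completed.
      release_lock_quietly unless opened
    end
    self
  end

  def locked?
    File.exist?(@lock_path)
  end

  private

  def acquire_lock
    # CREAT | EXCL fails with EEXIST when the lock file already exists.
    File.open(@lock_path, File::WRONLY | File::CREAT | File::EXCL) do |f|
      f.write(Process.pid.to_s)
    end
  rescue Errno::EEXIST
    raise ProposalLockError, "lock already held: #{@lock_path}"
  end

  # Remove the lock file and ignore errors from the removal itself,
  # so the original exception is the one that propagates.
  def release_lock_quietly
    File.delete(@lock_path)
  rescue SystemCallError
    nil
  end
end

def demo_release_on_failure
  Dir.mktmpdir do |dir|
    queue = ProposalQueue.new(dir)
    failed = begin
      queue.open { raise "simulated failure during open" }
      false
    rescue RuntimeError
      true
    end
    locked_after_failure = queue.locked?
    queue.open # the retry can acquire the lock again
    [failed, locked_after_failure, queue.locked?]
  end
end
```

The sketch uses `ensure` with a success flag rather than a `rescue` clause so that the cleanup does not depend on the class of the exception; the original error still propagates to the caller.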

@sasikiranvaddi
Author

Hi,
Could you please check my latest comment?
