Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding DynamoDB high availability features #675

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

mcowgill-stripe
Copy link
Contributor

@mcowgill-stripe mcowgill-stripe commented May 23, 2024

Context

The main purpose of this change is to enable the ability for external consumers to leverage high availability features without needing ZooKeeper and to be able to do that with configuration instead of needing to modify source code or re-implement logic that is available here today.

The first part of this is creating an extension to use DynamoDB for leader election and leader monitoring. We had submitted prior work from @crioux-stripe and myself to add a KV Store for DynamoDB. That started to divide the extension for a "store" project. Looking at what we'd have to add to separate the store from the leader election/monitoring it makes more sense to create an extension for just DynamoDB and include the store, leader election, and leader monitoring. In theory the only difference is the size of the dependencies and any of the three bring with it roughly the same size. However, if the maintainers have different opinions I'm happy to pivot.

Looking at the code today and projects/docs. The idea seems to be that mantis-control-plane-core is a common dependency between the server and client. This does have exceptions, but seems like it's the intent. With that in mind I designed to have the DynamoDB extension only takea dependency on core and I deprecated and moved things into core to enable that. If that's not the goal/intent of the design today, I'm happy to move all that back.

In order to finish this goal of providing a way to inject dependencies for DynamoDB there are additional changes that need to happen in MasterMain.java for mantis server and HighAvailabilityServices inside of the client configurations. The latter one is more blocking for our internal use case today, since it's harder to unwind the different client paths than it is to re-implement MasterMain.

In order to make sure I don't go down paths that the maintainers won't accept, I'm opening the PR with the initial DynamoDB code for leader election and leader monitoring. The minimal deprecations and refactoring needed to manage the dependency flow. I am only commenting near the code blocks to add injection. Should I continue in this PR for those sections or should I try to finish and merge this and return to refactor the design for those two code paths (MasterMain and HighAvailabilityServicesUtil).

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

@Andyz26
Copy link
Collaborator

Andyz26 commented May 24, 2024

@mcowgill-stripe overll lgtm. I see you still have some todo/comments so ping me when you think this is ready for merge.

Copy link

github-actions bot commented May 24, 2024

Test Results

533 tests  +13   527 ✅ +13   7m 52s ⏱️ -15s
139 suites + 4     6 💤 ± 0 
139 files   + 4     0 ❌ ± 0 

Results for commit 8744c4e. ± Comparison against base commit d773089.

This pull request removes 1 and adds 14 tests. Note that renamed tests count towards both.
io.mantisrx.server.master.store.KeyValueStoreTest ‑ testUpsertOrdered
io.mantisrx.extensions.dynamodb.DynamoDBClientSingletonTest ‑ highAvailabilityServices
io.mantisrx.extensions.dynamodb.DynamoDBLeaderElectorTest ‑ becomesCurrentLeader
io.mantisrx.extensions.dynamodb.DynamoDBLeaderElectorTest ‑ respectsExistingLock
io.mantisrx.extensions.dynamodb.DynamoDBLeaderElectorTest ‑ restartsOnFailures
io.mantisrx.extensions.dynamodb.DynamoDBLeaderElectorTest ‑ shutdownAsLeader
io.mantisrx.extensions.dynamodb.DynamoDBLeaderElectorTest ‑ shutdownNotAsLeader
io.mantisrx.extensions.dynamodb.DynamoDBMasterMonitorTest ‑ getCurrentLeader
io.mantisrx.extensions.dynamodb.DynamoDBMasterMonitorTest ‑ runShutdown
io.mantisrx.extensions.dynamodb.DynamoDBStoreTest ‑ testInsertAndDelete
io.mantisrx.extensions.dynamodb.DynamoDBStoreTest ‑ testInsertAndDeleteMoreThan25
…

♻️ This comment has been updated with latest results.

@mcowgill-stripe
Copy link
Contributor Author

@Andyz26 - I've broken something in container tests it appears. It's not obvious to me what it is yet. I am trying to slowly walk things backwards until I can make this test pass again. If you have a suggestion that would probably accelerate me. If not I'll keep trying to bisect the change.

Regarding the TODO's this PR had reached a size where I wanted to pause and check in. We could merge these changes and open a new PR to complete those TODOs (assuming I fix the test issue above). Or I could finish proposing the change to enable injection for the high availability services.

Do you have a preference on the continuing this PR or trying to reach a merge point and separating into a new PR?

@mcowgill-stripe
Copy link
Contributor Author

I am not sure if I ran into a class erasure issue with the naming or not. After finding that this commit passed locally, I chose to rename the KeyValueStore to IKeyValueStore to reduce that possibility.

@Andyz26
Copy link
Collaborator

Andyz26 commented May 28, 2024

@Andyz26 - I've broken something in container tests it appears. It's not obvious to me what it is yet. I am trying to slowly walk things backwards until I can make this test pass again. If you have a suggestion that would probably accelerate me. If not I'll keep trying to bisect the change.

Regarding the TODO's this PR had reached a size where I wanted to pause and check in. We could merge these changes and open a new PR to complete those TODOs (assuming I fix the test issue above). Or I could finish proposing the change to enable injection for the high availability services.

Do you have a preference on the continuing this PR or trying to reach a merge point and separating into a new PR?

IMO, as long as nothing is broken, it's up to you whether you prefer one PR or multiple.
For test errors, the test container can be debugged if you put a breakpoint or something and examine the error in your local Docker app, too.

@@ -78,4 +77,12 @@ public interface CoreConfiguration {
@Config("mantis.asyncHttpClient.readTimeoutMs")
@Default("10000")
int getAsyncHttpClientReadTimeoutMs();

@Config("mantis.leader.elector.factory")
@Default("io.mantisrx.server.core.master.LocalLeaderFactory")
Copy link
Collaborator

@Andyz26 Andyz26 May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few issues I see here:

  • This leaderElector is only for control plane so it doesn't need to be in coreConfig (and all the workerConfig related classes) for worker/agents.
  • We are actually trying to deprecate/move away from config-injected objects. In this case if the electorFactory is only used in control-plane startup I assume it can be just a string in masterConfig directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to remove this part, I noticed it today as well.

I'm not a huge fan of the DI with Skife today (and it's a little out of date). So I'm happy to hear that position.

What are the current thoughts on how we should think about extending Mantis then? I started with the view point I should only take a dependency on core, this way client or server could later bundle my jar. The only option I could find was the object-injection with Skife today. Should I be looking for another pattern here?

Potentially we could separate each component (@crioux-stripe and I started there with store). That felt pretty heavy weight for each store and we duplicated a lot of DynamoDB config, so this time I went with just an extension for DynamoDB in a gradle project. We could separate each feature into subprojects in the Gradle project and choose different dependencies for each subproject:

mantis-ext-dynamodb
  * common (common dynamodb config each subject depends on common)
  * server [subproject only depends on mantis-control-plane-server] (contains store and leader elector) 
  * leader-monitor [subproject depends on mantis-control-plane-core] (contains only monitor)

The above feels pretty heavy weight, but perhaps more precise? We'd have to document this somewhere as well and it's sort of straightforward, but definitely not clear when you look at the current source code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started with the view point I should only take a dependency on core, this way client or server could later bundle my jar. The only option I could find was the object-injection with Skife today. Should I be looking for another pattern here?
Are you saying you don't take a dependency on mantis-control-plane-server, so you don't have access to the MasterConfiguration?

For skife injection, we don't have a full plan yet on how to replace it but we are trying to remove existing usage and avoid adding new ones. Reflection with config string is an ok workaround for now IMO, or have some builder/ctor pass the instance from your wrapper module.
I am open to a zoom call etc. too if you want to further discuss this.

Copy link
Contributor Author

@mcowgill-stripe mcowgill-stripe May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying you don't take a dependency on mantis-control-plane-server, so you don't have access to the MasterConfiguration?

There is no issue with that access that I see because that is working in my branch today (unless you know something I don't)

import rx.subjects.BehaviorSubject;

@Slf4j
public class DynamoDBMasterMonitor extends BaseService implements MasterMonitor {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This design is a simple polling design to monitor the key in DynamoDB. After I finished this code, I know it can be vastly improved. I chose to not push further because there is potential to extract this. Leader monitoring could be completely separate from leader election. In the current implementation ZK and DynamoDB leader monitoring is very knowledgable about how leader is elected. If we change the design to the leader publishes when elected, and heartbeats regularly on a common messaging framework, then monitoring can be separated. This would push requirements to leader election, but I think we could actually push those to leadership manager. That refactor I thought should be a completely separate PR and this implementation should meet the current need.

import software.amazon.awssdk.services.dynamodb.model.*;

@Slf4j
public class DynamoDBStore implements IKeyValueStore {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename and not modified.

@@ -0,0 +1,8 @@
# mantis.ext.dynamodb.store.table=mantis-dynamodb-kv
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example file, but since the config framework has defaults I left them commented out.

}

/**
* The use of global state here and the fact the helper rules all start and stop servers around
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to yank this out or is my trade off a good choice?

public static final String V1 = "value1";
public static final String V4 = "value4";

@ClassRule public static DynamoDBLocalRule dynamoDb = new DynamoDBLocalRule();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved with only minor modifications to fit the Rule patterns.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ZookeeperLeaderElector extends BaseService {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original purpose was name change only, but that did grow as the CuratorService point of view required modifications to the constructor. I removed the builder model as I felt the factory eliminated the need for it. I did not see wide usage of that pattern and I left LeaderElector if users still want it.

import io.mantisrx.server.master.config.MasterConfiguration;
import io.mantisrx.shaded.org.apache.curator.utils.ZKPaths;

public class ZookeeperLeadershipFactory extends ZookeeperLeaderMonitorFactory implements ILeaderElectorFactory {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This felt awkward to write, but was required since the interface is split and core needed a monitor factory.

@@ -103,17 +96,14 @@ public class MasterMain implements Service {
private final ServiceLifecycle mantisServices = new ServiceLifecycle();
private final AtomicBoolean shutdownInitiated = new AtomicBoolean(false);
private KeyValueBasedPersistenceProvider storageProvider;
private CountDownLatch blockUntilShutdown = new CountDownLatch(1);
private volatile CuratorService curatorService = null;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This becomes a singleton in the factory. I believe that is safe, but worth noting.

private MasterConfiguration config;
private ILeadershipManager leadershipManager;
private final MantisPropertiesLoader dynamicPropertiesLoader;
Copy link
Contributor Author

@mcowgill-stripe mcowgill-stripe Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are cleanups where I removed things that appear out of IDE suggested best practice. I am happy to start reverting these.

// set up leader election
final ILeaderElectorFactory leaderFactory;
final MasterMonitor monitor;
if(!config.isLocalMode() && config.getLeaderElectorFactory() instanceof LocalLeaderFactory) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical injection point is here. The if/else is only necessary because I wanted to support backwards compatibility with configs that exist today. I thought if I broke configs as they are today we would need a major version bump. I tested with the old config and then modified the properties files in the container test to make sure it worked. I left the source code updated for the new property configuration.

@@ -224,53 +228,6 @@ public MasterMain(
}
}

private static Properties loadProperties(String propFile) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to config utils for common reuse.

@@ -346,20 +286,6 @@ public MasterConfiguration getConfig() {
return config;
}

public String getDescriptionJson() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these safe to delete? I didn't find issues in testing.

@@ -329,13 +276,6 @@ public void shutdown() {
logger.info("Shutting down Mantis Master");
mantisServices.shutdown();
logger.info("mantis services shutdown complete");
boolean shutdownCuratorEnabled = ConfigurationProvider.getConfig().getShutdownCuratorServiceEnabled();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is moved into the Zookeeper Leader Elector.

@mcowgill-stripe
Copy link
Contributor Author

@Andyz26 - I've removed the class based injection and reverted to a simple string. Locally, I have the build passing. Are there other places I need to make changes?

@Andyz26
Copy link
Collaborator

Andyz26 commented Jun 11, 2024

@Andyz26 - I've removed the class based injection and reverted to a simple string. Locally, I have the build passing. Are there other places I need to make changes?

hi @mcowgill-stripe did you get a chance to look at the discussion thread from @sundargates. He made some proposals regarding these interfaces.

@crioux-stripe
Copy link
Collaborator

@Andyz26 @sundargates How would we feel about merging this, and we can have someone take an immediate task to do the interface refactor? This is the last blocker for us going to production before the end of Q2. We can work on the prod rollout and the refactor at the same time.

@Andyz26
Copy link
Collaborator

Andyz26 commented Jun 13, 2024

@Andyz26 @sundargates How would we feel about merging this, and we can have someone take an immediate task to do the interface refactor? This is the last blocker for us going to production before the end of Q2. We can work on the prod rollout and the refactor at the same time.

sure that's fine. Let me try a snapshot build to do some validation internally.


@Config("mantis.leader.monitor.factory")
@Default("io.mantisrx.server.core.master.LocalLeaderFactory")
String getLeaderMonitorFactory();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: getLeaderMonitorFactoryName

@crioux-stripe
Copy link
Collaborator

The snapshot error is interesting:

https://github.com/Netflix/mantis/blob/master/build.gradle#L163

[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.

FAILURE: Build completed with 36 failures.

1: Task failed with an exception.
-----------
* Where:
Build file '/home/runner/work/mantis/mantis/build.gradle' line: 163

* What went wrong:
Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

* Try:
> Run with --debug option to get more log output.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.

Looking into this.

@Andyz26
Copy link
Collaborator

Andyz26 commented Jun 17, 2024

The snapshot error is interesting:

https://github.com/Netflix/mantis/blob/master/build.gradle#L163

[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.

FAILURE: Build completed with 36 failures.

1: Task failed with an exception.
-----------
* Where:
Build file '/home/runner/work/mantis/mantis/build.gradle' line: 163

* What went wrong:
Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

* Try:
> Run with --debug option to get more log output.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.

Looking into this.

this seems to be only happening on forked repo's branch. I've added both of you to the collaborator list so you can try push ing the branch in the main repo and snapshot build from there.

@kmg-stripe
Copy link
Collaborator

The snapshot error is interesting:
https://github.com/Netflix/mantis/blob/master/build.gradle#L163

[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.

FAILURE: Build completed with 36 failures.

1: Task failed with an exception.
-----------
* Where:
Build file '/home/runner/work/mantis/mantis/build.gradle' line: 163

* What went wrong:
Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

* Try:
> Run with --debug option to get more log output.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.

Looking into this.

this seems to be only happening on forked repo's branch. I've added both of you to the collaborator list so you can try push ing the branch in the main repo and snapshot build from there.

@Andyz26 - @mcowgill-stripe and @crioux-stripe are out for the next few weeks. to ensure we make progress on this, i am stepping in to help out here. could you please add me to the collaborator list? thanks!

@Andyz26
Copy link
Collaborator

Andyz26 commented Jun 18, 2024

The snapshot error is interesting:
https://github.com/Netflix/mantis/blob/master/build.gradle#L163

[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.

FAILURE: Build completed with 36 failures.

1: Task failed with an exception.
-----------
* Where:
Build file '/home/runner/work/mantis/mantis/build.gradle' line: 163

* What went wrong:
Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

* Try:
> Run with --debug option to get more log output.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.

Looking into this.

this seems to be only happening on forked repo's branch. I've added both of you to the collaborator list so you can try push ing the branch in the main repo and snapshot build from there.

@Andyz26 - @mcowgill-stripe and @crioux-stripe are out for the next few weeks. to ensure we make progress on this, i am stepping in to help out here. could you please add me to the collaborator list? thanks!

@kmg-stripe could you ask cody to send me an email to verify your github id? Sorry just to be cautious as the write permission is pretty powerful.

@kmg-stripe
Copy link
Collaborator

The snapshot error is interesting:
https://github.com/Netflix/mantis/blob/master/build.gradle#L163

[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.

FAILURE: Build completed with 36 failures.

1: Task failed with an exception.
-----------
* Where:
Build file '/home/runner/work/mantis/mantis/build.gradle' line: 163

* What went wrong:
Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

* Try:
> Run with --debug option to get more log output.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.

Looking into this.

this seems to be only happening on forked repo's branch. I've added both of you to the collaborator list so you can try push ing the branch in the main repo and snapshot build from there.

@Andyz26 - @mcowgill-stripe and @crioux-stripe are out for the next few weeks. to ensure we make progress on this, i am stepping in to help out here. could you please add me to the collaborator list? thanks!

@kmg-stripe could you ask cody to send me an email to verify your github id? Sorry just to be cautious as the write permission is pretty powerful.

@Andyz26 understood. i just sent cody a message. he is OOO for a few weeks, so hoping he happens to see it while he is out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants