-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding DynamoDB high availability features #675
base: master
Are you sure you want to change the base?
Adding DynamoDB high availability features #675
Conversation
...trol-plane-store-dynamodb/src/test/java/io/mantisrx/server/master/store/DynamoStoreTest.java
Outdated
Show resolved
Hide resolved
@mcowgill-stripe overll lgtm. I see you still have some todo/comments so ping me when you think this is ready for merge. |
Test Results533 tests +13 527 ✅ +13 7m 52s ⏱️ -15s Results for commit 8744c4e. ± Comparison against base commit d773089. This pull request removes 1 and adds 14 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results. |
@Andyz26 - I've broken something in container tests it appears. It's not obvious to me what it is yet. I am trying to slowly walk things backwards until I can make this test pass again. If you have a suggestion that would probably accelerate me. If not I'll keep trying to bisect the change. Regarding the TODO's this PR had reached a size where I wanted to pause and check in. We could merge these changes and open a new PR to complete those TODOs (assuming I fix the test issue above). Or I could finish proposing the change to enable injection for the high availability services. Do you have a preference on the continuing this PR or trying to reach a merge point and separating into a new PR? |
I am not sure if I ran into a class erasure issue with the naming or not. After finding that this commit passed locally, I chose to rename the KeyValueStore to IKeyValueStore to reduce that possibility. |
IMO, as long as nothing is broken, it's up to you whether you prefer one PR or multiple. |
@@ -78,4 +77,12 @@ public interface CoreConfiguration { | |||
@Config("mantis.asyncHttpClient.readTimeoutMs") | |||
@Default("10000") | |||
int getAsyncHttpClientReadTimeoutMs(); | |||
|
|||
@Config("mantis.leader.elector.factory") | |||
@Default("io.mantisrx.server.core.master.LocalLeaderFactory") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few issues I see here:
- This leaderElector is only for control plane so it doesn't need to be in coreConfig (and all the workerConfig related classes) for worker/agents.
- We are actually trying to deprecate/move away from config-injected objects. In this case if the electorFactory is only used in control-plane startup I assume it can be just a string in masterConfig directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to remove this part, I noticed it today as well.
I'm not a huge fan of the DI with Skife today (and it's a little out of date). So I'm happy to hear that position.
What are the current thoughts on how we should think about extending Mantis then? I started with the view point I should only take a dependency on core, this way client or server could later bundle my jar. The only option I could find was the object-injection with Skife today. Should I be looking for another pattern here?
Potentially we could separate each component (@crioux-stripe and I started there with store). That felt pretty heavy weight for each store and we duplicated a lot of DynamoDB config, so this time I went with just an extension for DynamoDB in a gradle project. We could separate each feature into subprojects in the Gradle project and choose different dependencies for each subproject:
mantis-ext-dynamodb
* common (common dynamodb config each subject depends on common)
* server [subproject only depends on mantis-control-plane-server] (contains store and leader elector)
* leader-monitor [subproject depends on mantis-control-plane-core] (contains only monitor)
The above feels pretty heavy weight, but perhaps more precise? We'd have to document this somewhere as well and it's sort of straightforward, but definitely not clear when you look at the current source code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started with the view point I should only take a dependency on core, this way client or server could later bundle my jar. The only option I could find was the object-injection with Skife today. Should I be looking for another pattern here?
Are you saying you don't take a dependency on mantis-control-plane-server, so you don't have access to the MasterConfiguration?
For skife injection, we don't have a full plan yet on how to replace it but we are trying to remove existing usage and avoid adding new ones. Reflection with config string is an ok workaround for now IMO, or have some builder/ctor pass the instance from your wrapper module.
I am open to a zoom call etc. too if you want to further discuss this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying you don't take a dependency on mantis-control-plane-server, so you don't have access to the MasterConfiguration?
There is no issue with that access that I see because that is working in my branch today (unless you know something I don't)
import rx.subjects.BehaviorSubject; | ||
|
||
@Slf4j | ||
public class DynamoDBMasterMonitor extends BaseService implements MasterMonitor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This design is a simple polling design to monitor the key in DynamoDB. After I finished this code, I know it can be vastly improved. I chose to not push further because there is potential to extract this. Leader monitoring could be completely separate from leader election. In the current implementation ZK and DynamoDB leader monitoring is very knowledgable about how leader is elected. If we change the design to the leader publishes when elected, and heartbeats regularly on a common messaging framework, then monitoring can be separated. This would push requirements to leader election, but I think we could actually push those to leadership manager. That refactor I thought should be a completely separate PR and this implementation should meet the current need.
import software.amazon.awssdk.services.dynamodb.model.*; | ||
|
||
@Slf4j | ||
public class DynamoDBStore implements IKeyValueStore { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename and not modified.
@@ -0,0 +1,8 @@ | |||
# mantis.ext.dynamodb.store.table=mantis-dynamodb-kv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example file, but since the config framework has defaults I left them commented out.
} | ||
|
||
/** | ||
* The use of global state here and the fact the helper rules all start and stop servers around |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to yank this out or is my trade off a good choice?
public static final String V1 = "value1"; | ||
public static final String V4 = "value4"; | ||
|
||
@ClassRule public static DynamoDBLocalRule dynamoDb = new DynamoDBLocalRule(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved with only minor modifications to fit the Rule patterns.
import org.slf4j.Logger; | ||
import org.slf4j.LoggerFactory; | ||
|
||
public class ZookeeperLeaderElector extends BaseService { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original purpose was name change only, but that did grow as the CuratorService
point of view required modifications to the constructor. I removed the builder model as I felt the factory eliminated the need for it. I did not see wide usage of that pattern and I left LeaderElector
if users still want it.
import io.mantisrx.server.master.config.MasterConfiguration; | ||
import io.mantisrx.shaded.org.apache.curator.utils.ZKPaths; | ||
|
||
public class ZookeeperLeadershipFactory extends ZookeeperLeaderMonitorFactory implements ILeaderElectorFactory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This felt awkward to write, but was required since the interface is split and core needed a monitor factory.
@@ -103,17 +96,14 @@ public class MasterMain implements Service { | |||
private final ServiceLifecycle mantisServices = new ServiceLifecycle(); | |||
private final AtomicBoolean shutdownInitiated = new AtomicBoolean(false); | |||
private KeyValueBasedPersistenceProvider storageProvider; | |||
private CountDownLatch blockUntilShutdown = new CountDownLatch(1); | |||
private volatile CuratorService curatorService = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This becomes a singleton in the factory. I believe that is safe, but worth noting.
private MasterConfiguration config; | ||
private ILeadershipManager leadershipManager; | ||
private final MantisPropertiesLoader dynamicPropertiesLoader; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are cleanups where I removed things that appear out of IDE suggested best practice. I am happy to start reverting these.
// set up leader election | ||
final ILeaderElectorFactory leaderFactory; | ||
final MasterMonitor monitor; | ||
if(!config.isLocalMode() && config.getLeaderElectorFactory() instanceof LocalLeaderFactory) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical injection point is here. The if/else is only necessary because I wanted to support backwards compatibility with configs that exist today. I thought if I broke configs as they are today we would need a major version bump. I tested with the old config and then modified the properties files in the container test to make sure it worked. I left the source code updated for the new property configuration.
@@ -224,53 +228,6 @@ public MasterMain( | |||
} | |||
} | |||
|
|||
private static Properties loadProperties(String propFile) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to config utils for common reuse.
@@ -346,20 +286,6 @@ public MasterConfiguration getConfig() { | |||
return config; | |||
} | |||
|
|||
public String getDescriptionJson() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these safe to delete? I didn't find issues in testing.
@@ -329,13 +276,6 @@ public void shutdown() { | |||
logger.info("Shutting down Mantis Master"); | |||
mantisServices.shutdown(); | |||
logger.info("mantis services shutdown complete"); | |||
boolean shutdownCuratorEnabled = ConfigurationProvider.getConfig().getShutdownCuratorServiceEnabled(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is moved into the Zookeeper Leader Elector.
@Andyz26 - I've removed the class based injection and reverted to a simple string. Locally, I have the build passing. Are there other places I need to make changes? |
hi @mcowgill-stripe did you get a chance to look at the discussion thread from @sundargates. He made some proposals regarding these interfaces. |
@Andyz26 @sundargates How would we feel about merging this, and we can have someone take an immediate task to do the interface refactor? This is the last blocker for us going to production before the end of Q2. We can work on the prod rollout and the refactor at the same time. |
sure that's fine. Let me try a snapshot build to do some validation internally. |
|
||
@Config("mantis.leader.monitor.factory") | ||
@Default("io.mantisrx.server.core.master.LocalLeaderFactory") | ||
String getLeaderMonitorFactory(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: getLeaderMonitorFactoryName
The snapshot error is interesting: https://github.com/Netflix/mantis/blob/master/build.gradle#L163
Looking into this. |
this seems to be only happening on forked repo's branch. I've added both of you to the collaborator list so you can try push ing the branch in the main repo and snapshot build from there. |
@Andyz26 - @mcowgill-stripe and @crioux-stripe are out for the next few weeks. to ensure we make progress on this, i am stepping in to help out here. could you please add me to the collaborator list? thanks! |
@kmg-stripe could you ask cody to send me an email to verify your github id? Sorry just to be cautious as the write permission is pretty powerful. |
@Andyz26 understood. i just sent cody a message. he is OOO for a few weeks, so hoping he happens to see it while he is out. |
Context
The main purpose of this change is to enable the ability for external consumers to leverage high availability features without needing ZooKeeper and to be able to do that with configuration instead of needing to modify source code or re-implement logic that is available here today.
The first part of this is creating an extension to use DynamoDB for leader election and leader monitoring. We had submitted prior work from @crioux-stripe and myself to add a KV Store for DynamoDB. That started to divide the extension for a "store" project. Looking at what we'd have to add to separate the store from the leader election/monitoring it makes more sense to create an extension for just DynamoDB and include the store, leader election, and leader monitoring. In theory the only difference is the size of the dependencies and any of the three bring with it roughly the same size. However, if the maintainers have different opinions I'm happy to pivot.
Looking at the code today and projects/docs. The idea seems to be that mantis-control-plane-core is a common dependency between the server and client. This does have exceptions, but seems like it's the intent. With that in mind I designed to have the DynamoDB extension only takea dependency on core and I deprecated and moved things into core to enable that. If that's not the goal/intent of the design today, I'm happy to move all that back.
In order to finish this goal of providing a way to inject dependencies for DynamoDB there are additional changes that need to happen in MasterMain.java for mantis server and
HighAvailabilityServices
inside of the client configurations. The latter one is more blocking for our internal use case today, since it's harder to unwind the different client paths than it is to re-implement MasterMain.In order to make sure I don't go down paths that the maintainers won't accept, I'm opening the PR with the initial DynamoDB code for leader election and leader monitoring. The minimal deprecations and refactoring needed to manage the dependency flow. I am only commenting near the code blocks to add injection. Should I continue in this PR for those sections or should I try to finish and merge this and return to refactor the design for those two code paths (MasterMain and HighAvailabilityServicesUtil).
Checklist
./gradlew build
compiles code correctly./gradlew test
passes all tests