Keras 3 gives incorrect output from evaluate/fit in distributed context #19891

Open
drasmuss opened this issue Jun 20, 2024 · 12 comments

drasmuss commented Jun 20, 2024

In Keras 3, changing the number of replicas during distributed training/evaluation changes the output of the model:

import tensorflow as tf

import keras
# import tf_keras as keras

keras.utils.set_random_seed(0)

n_replicas = 4

gpus = tf.config.list_physical_devices("GPU")
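# split the single physical GPU into n_replicas logical devices so that
# MirroredStrategy runs one replica per logical device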
tf.config.set_logical_device_configuration(
    gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1000)] * n_replicas
)

batch_size = 12
x = tf.random.uniform((batch_size, 1), -1, 1, seed=0)
y = tf.random.uniform((batch_size, 10), -1, 1, seed=1)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inp = keras.Input(shape=(1,))
    layer = keras.layers.Dense(10)
    model = keras.Model(inp, layer(inp))
    model.compile(loss="mse", optimizer="sgd")

    gt = keras.losses.mean_squared_error(y, model.predict(x, batch_size=batch_size))
    eval = model.evaluate(x, y, batch_size=batch_size)
    model.fit(x, y, batch_size=batch_size, epochs=1)
    post_gt = keras.losses.mean_squared_error(
        y, model.predict(x, batch_size=batch_size)
    )
    print(f"ground truth: {tf.reduce_mean(gt)}")
    print(f"evaluate: {eval}")
    print(f"post-fit output: {tf.reduce_mean(post_gt)}")

This gives output:

  • n_replicas=1:
ground truth: 0.43009480834007263
evaluate: 0.43009480834007263
post-fit output: 0.4297996461391449
  • n_replicas=2:
ground truth: 0.43009480834007263
evaluate: 0.5054659843444824
post-fit output: 0.4298612177371979
  • n_replicas=4:
ground truth: 0.43009480834007263
evaluate: 0.5540136098861694
post-fit output: 0.4299061596393585

We can see that the ground truth is invariant to the number of replicas, as expected, but the loss value calculated by evaluate is incorrect for all n_replicas > 1. And this doesn't just affect evaluation: fit produces a different change in the model output as we change the number of replicas.

If we switch to tf-keras, then we get the expected output regardless of the number of replicas:

  • n_replicas=1:
ground truth: 0.43009480834007263
evaluate: 0.43009480834007263
post-fit output: 0.4297996461391449
  • n_replicas=2:
ground truth: 0.43009480834007263
evaluate: 0.43009480834007263
post-fit output: 0.4297996461391449
  • n_replicas=4:
ground truth: 0.43009480834007263
evaluate: 0.43009480834007263
post-fit output: 0.4297996461391449
drasmuss changed the title from "Keras 3 gives inconsistent output from evaluate/fit in distributed context" to "Keras 3 gives incorrect output from evaluate/fit in distributed context" on Jun 20, 2024
drasmuss (Author) commented Jun 21, 2024

With a bit more investigation, I figured out what's going on: evaluate is only reporting the loss from the first replica and ignoring the rest. Here's an updated example demonstrating this:

import tensorflow as tf
import keras
# import tf_keras as keras

keras.utils.set_random_seed(0)

n_replicas = 2

gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1000)] * n_replicas
)

batch_size = 12
x = tf.random.uniform((batch_size, 1), -1, 1, seed=0)
y = tf.random.uniform((batch_size, 10), -1, 1, seed=1)

with tf.distribute.MirroredStrategy().scope():
    inp = keras.Input(shape=(1,))
    layer = keras.layers.Dense(10)
    model = keras.Model(inp, layer(inp))
    model.compile(loss="mse", optimizer="sgd")

    gt = keras.losses.mean_squared_error(y, model.predict(x, batch_size=batch_size))
    eval = model.evaluate(x, y, batch_size=batch_size)
    model.fit(x, y, batch_size=batch_size, epochs=1)
    print(f"ground truth: {tf.reduce_mean(gt)}")
    print(f"loss from first replica: {tf.reduce_mean(gt[:batch_size//n_replicas])}")
    print(f"evaluate: {eval}")

Which gives output:

  • n_replicas=1:
ground truth: 0.43009480834007263
loss from first replica: 0.43009480834007263
evaluate: 0.43009480834007263
  • n_replicas=2:
ground truth: 0.43009480834007263
loss from first replica: 0.5054659843444824
evaluate: 0.5054659843444824
  • n_replicas=4:
ground truth: 0.43009480834007263
loss from first replica: 0.5540136098861694
evaluate: 0.5540136098861694

drasmuss (Author) commented:

One more piece of investigation. I believe the above issue with evaluate is mainly a display issue. The model is computing the loss value correctly in each replica, but only the value from the first replica is being returned from evaluate.
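As a rough sanity check (a minimal sketch, assuming the variables from the previous example are still in scope), sharding the batch by hand and averaging the per-shard losses recovers the full-batch value, which is what evaluate would be expected to report:

local_bs = batch_size // n_replicas
shard_losses = [
    tf.reduce_mean(
        keras.losses.mean_squared_error(
            y[i * local_bs : (i + 1) * local_bs],
            model.predict(x[i * local_bs : (i + 1) * local_bs], verbose=0),
        )
    )
    for i in range(n_replicas)
]
# with equal-sized shards, the mean of the per-shard MSE values equals the
# full-batch MSE, which is the value evaluate would be expected to report
print(f"mean over shards: {tf.reduce_mean(tf.stack(shard_losses))}")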

I think that the reason the fit output changes as we change the number of replicas is that the weight updates are not being synced between replicas. For example:

import tensorflow as tf
import keras
# import tf_keras as keras

keras.utils.set_random_seed(0)

n_replicas = 2

gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1000)] * n_replicas
)

batch_size = 12
local_batch_size = batch_size // n_replicas
x = tf.random.uniform((batch_size, 1), -1, 1, seed=0)
y = tf.random.uniform((batch_size, 1), -1, 1, seed=1)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inp = keras.Input(shape=(1,))
    layer = keras.layers.Dense(
        1, use_bias=False, kernel_initializer=keras.initializers.constant(1)
    )
    model = keras.Model(inp, layer(inp))
    model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1.0))

    model.fit(x, y, batch_size=batch_size, epochs=1)
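    # read back the kernel value on each replica to check whether the copies stayed in sync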
    weights = strategy.run(lambda: layer.kernel.value).values
    print(f"per-replica weights: {[w.numpy() for w in weights]}")

We can see that each replica is maintaining independent weights:

  • n_replicas=1:
per-replica weights: [array([[0.6677284]], dtype=float32)]
  • n_replicas=2:
per-replica weights: [array([[0.82471704]], dtype=float32), array([[0.5107398]], dtype=float32)]
  • n_replicas=4:
per-replica weights: [array([[0.4283927]], dtype=float32), array([[1.2210413]], dtype=float32), array([[0.92960465]], dtype=float32), array([[0.09187502]], dtype=float32)]

If we switch to tf-keras, then the weights are synced across replicas, as expected:

  • n_replicas=1:
per-replica weights: [array([0.6677284], dtype=float32)]
  • n_replicas=2:
per-replica weights: [array([[0.6677284]], dtype=float32), array([[0.6677284]], dtype=float32)]
  • n_replicas=4:
per-replica weights: [array([[0.6677284]], dtype=float32), array([[0.6677284]], dtype=float32), array([[0.6677284]], dtype=float32), array([[0.6677284]], dtype=float32)]

sachinprasadhs added the keras-team-review-pending (Pending review by a Keras team member) and type:Bug labels on Jun 21, 2024
sachinprasadhs (Collaborator) commented:

Thank you for providing a detailed investigation of the issue. We will look into it.

drasmuss (Author) commented:

I think I was able to get past this issue, but then I ran into bug #19246, so I can't really tell whether things are working correctly or not.

SamanehSaadat (Member) commented:

Hi @drasmuss!

Based on this comment, it seems you have been able to resolve the problem raised in this issue, and there is already an open issue for the outstanding bug, so I'm closing this issue.

If possible, it would be great if you could add more information about how you were able to resolve the problem discussed in this issue. Thanks!


SamanehSaadat removed the keras-team-review-pending (Pending review by a Keras team member) label on Jun 28, 2024
drasmuss (Author) commented:

No, the issue is not resolved. I had been working on a fix locally but was unable to verify it due to that other bug. This issue itself is still present: Keras gives incorrect output from fit and evaluate in a distributed context.

SamanehSaadat (Member) commented:

I see! Thanks for clarifying! We're looking into the other bug that you linked here. Once #19246 is resolved, please let us know if this issue persists!

drasmuss (Author) commented:

This issue will still require a pull request (or two) of its own to fix; it definitely won't be resolved on its own after #19246 is fixed.

SamanehSaadat reopened this on Jun 28, 2024
SamanehSaadat (Member) commented:

I see! I've re-opened the issue. Is it the display issue you mentioned in #19891 (comment)?

drasmuss (Author) commented:

I believe it's actually two separate issues (both requiring fixes). One is the wrong value being returned from evaluate. The other is that gradient aggregation is not happening, so the distributed replicas are not sharing information at all during training (which essentially means you get no benefit from distributed training).
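For reference, here is a minimal sketch (illustrative only, using plain tf.distribute rather than Keras internals) of the aggregation step that appears to be missing: each replica computes a local gradient, the gradients are averaged across replicas, and only then is the mirrored weight updated, so every copy stays identical.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    w = tf.Variable(1.0)  # mirrored: one copy per replica

def replica_step(x, y):
    # runs once per replica; in real training each replica would see its own
    # shard of the global batch
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((w * x - y) ** 2)
    return tape.gradient(loss, w)

per_replica_grads = strategy.run(replica_step, args=(tf.constant(1.0), tf.constant(2.0)))
# average the per-replica gradients before touching the weights...
grad = strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_grads, axis=None)
# ...so that every copy of the mirrored variable receives the same update
w.assign_sub(0.1 * grad)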

SamanehSaadat (Member) commented:

Thanks for clarifying! I'll look into this!
