Keras 3 gives incorrect output from evaluate/fit in distributed context #19891

Open
drasmuss opened this issue Jun 20, 2024 · 12 comments

drasmuss commented Jun 20, 2024

In Keras 3, changing the number of replicas during distributed training/evaluation changes the output of the model:

import tensorflow as tf

import keras
# import tf_keras as keras

keras.utils.set_random_seed(0)

n_replicas = 4

gpus = tf.config.list_physical_devices("GPU")
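# split the single physical GPU into n_replicas logical devices so that
# MirroredStrategy runs one replica per logical device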
tf.config.set_logical_device_configuration(
    gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1000)] * n_replicas
)

batch_size = 12
x = tf.random.uniform((batch_size, 1), -1, 1, seed=0)
y = tf.random.uniform((batch_size, 10), -1, 1, seed=1)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inp = keras.Input(shape=(1,))
    layer = keras.layers.Dense(10)
    model = keras.Model(inp, layer(inp))
    model.compile(loss="mse", optimizer="sgd")

    gt = keras.losses.mean_squared_error(y, model.predict(x, batch_size=batch_size))
    eval = model.evaluate(x, y, batch_size=batch_size)
    model.fit(x, y, batch_size=batch_size, epochs=1)
    post_gt = keras.losses.mean_squared_error(
        y, model.predict(x, batch_size=batch_size)
    )
    print(f"ground truth: {tf.reduce_mean(gt)}")
    print(f"evaluate: {eval}")
    print(f"post-fit output: {tf.reduce_mean(post_gt)}")

This gives output:

  • n_replicas=1:
ground truth: 0.43009480834007263
evaluate: 0.43009480834007263
post-fit output: 0.4297996461391449
  • n_replicas=2:
ground truth: 0.43009480834007263
evaluate: 0.5054659843444824
post-fit output: 0.4298612177371979
  • n_replicas=4:
ground truth: 0.43009480834007263
evaluate: 0.5540136098861694
post-fit output: 0.4299061596393585

We can see that the ground truth is invariant to the number of replicas, as expected, but the loss value calculated by evaluate is incorrect for all n_replicas > 1. And this doesn't just affect evaluation: fit produces a different change in the model output as we change the number of replicas.

If we switch to tf-keras, then we get the expected output regardless of the number of replicas:

  • n_replicas=1:
ground truth: 0.43009480834007263
evaluate: 0.43009480834007263
post-fit output: 0.4297996461391449
  • n_replicas=2:
ground truth: 0.43009480834007263
evaluate: 0.43009480834007263
post-fit output: 0.4297996461391449
  • n_replicas=4:
ground truth: 0.43009480834007263
evaluate: 0.43009480834007263
post-fit output: 0.4297996461391449
drasmuss changed the title from "Keras 3 gives inconsistent output from evaluate/fit in distributed context" to "Keras 3 gives incorrect output from evaluate/fit in distributed context" on Jun 20, 2024
drasmuss (Author) commented Jun 21, 2024

With a bit more investigation, I figured out what's going on: evaluate is only reporting the loss from the first replica and ignoring the rest. Here's an updated example demonstrating this:

import tensorflow as tf
import keras
# import tf_keras as keras

keras.utils.set_random_seed(0)

n_replicas = 2

gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1000)] * n_replicas
)

batch_size = 12
x = tf.random.uniform((batch_size, 1), -1, 1, seed=0)
y = tf.random.uniform((batch_size, 10), -1, 1, seed=1)

with tf.distribute.MirroredStrategy().scope():
    inp = keras.Input(shape=(1,))
    layer = keras.layers.Dense(10)
    model = keras.Model(inp, layer(inp))
    model.compile(loss="mse", optimizer="sgd")

    gt = keras.losses.mean_squared_error(y, model.predict(x, batch_size=batch_size))
    eval = model.evaluate(x, y, batch_size=batch_size)
    model.fit(x, y, batch_size=batch_size, epochs=1)
    print(f"ground truth: {tf.reduce_mean(gt)}")
    print(f"loss from first replica: {tf.reduce_mean(gt[:batch_size//n_replicas])}")
    print(f"evaluate: {eval}")

Which gives output:

  • n_replicas=1:
ground truth: 0.43009480834007263
loss from first replica: 0.43009480834007263
evaluate: 0.43009480834007263
  • n_replicas=2:
ground truth: 0.43009480834007263
loss from first replica: 0.5054659843444824
evaluate: 0.5054659843444824
  • n_replicas=4:
ground truth: 0.43009480834007263
loss from first replica: 0.5540136098861694
evaluate: 0.5540136098861694

drasmuss (Author) commented:

One more piece of investigation. I believe the above issue with evaluate is mainly a display issue. The model is computing the loss value correctly in each replica, but only the value from the first replica is being returned from evaluate.
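As a rough sanity check (a minimal sketch, assuming the variables from the previous example are still in scope), sharding the batch by hand and averaging the per-shard losses recovers the full-batch value, which is what evaluate would be expected to report:

local_bs = batch_size // n_replicas
shard_losses = [
    tf.reduce_mean(
        keras.losses.mean_squared_error(
            y[i * local_bs : (i + 1) * local_bs],
            model.predict(x[i * local_bs : (i + 1) * local_bs], verbose=0),
        )
    )
    for i in range(n_replicas)
]
# with equal-sized shards, the mean of the per-shard MSE values equals the
# full-batch MSE, which is the value evaluate would be expected to report
print(f"mean over shards: {tf.reduce_mean(tf.stack(shard_losses))}")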

I think that the reason the fit output changes as we change the number of replicas is that the weight updates are not being synced between replicas. For example:

import tensorflow as tf
import keras
# import tf_keras as keras

keras.utils.set_random_seed(0)

n_replicas = 2

gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1000)] * n_replicas
)

batch_size = 12
local_batch_size = batch_size // n_replicas
x = tf.random.uniform((batch_size, 1), -1, 1, seed=0)
y = tf.random.uniform((batch_size, 1), -1, 1, seed=1)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inp = keras.Input(shape=(1,))
    layer = keras.layers.Dense(
        1, use_bias=False, kernel_initializer=keras.initializers.constant(1)
    )
    model = keras.Model(inp, layer(inp))
    model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1.0))

    model.fit(x, y, batch_size=batch_size, epochs=1)
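    # read back the kernel value on each replica to check whether the copies stayed in sync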
    weights = strategy.run(lambda: layer.kernel.value).values
    print(f"per-replica weights: {[w.numpy() for w in weights]}")

We can see that each replica is maintaining independent weights:

  • n_replicas=1:
per-replica weights: [array([[0.6677284]], dtype=float32)]
  • n_replicas=2:
per-replica weights: [array([[0.82471704]], dtype=float32), array([[0.5107398]], dtype=float32)]
  • n_replicas=4:
per-replica weights: [array([[0.4283927]], dtype=float32), array([[1.2210413]], dtype=float32), array([[0.92960465]], dtype=float32), array([[0.09187502]], dtype=float32)]

If we switch to tf-keras, then the weights are synced across replicas, as expected:

  • n_replicas=1:
per-replica weights: [array([0.6677284], dtype=float32)]
  • n_replicas=2:
per-replica weights: [array([[0.6677284]], dtype=float32), array([[0.6677284]], dtype=float32)]
  • n_replicas=4:
per-replica weights: [array([[0.6677284]], dtype=float32), array([[0.6677284]], dtype=float32), array([[0.6677284]], dtype=float32), array([[0.6677284]], dtype=float32)]

sachinprasadhs added the keras-team-review-pending (Pending review by a Keras team member) and type:Bug labels on Jun 21, 2024
sachinprasadhs (Collaborator) commented:

Thank you for providing a detailed investigation of the issue. We will look into it.

drasmuss (Author) commented:

I think I was able to get past this issue, but then I ran into bug #19246, so I can't really tell whether things are working correctly or not.

SamanehSaadat (Member) commented:

Hi @drasmuss!

Based on this comment, it seems you have been able to resolve the problem raised in this issue, and there is already an open issue for the outstanding bug, so I'm closing this issue.

If possible, it would be great if you could add more information about how you were able to resolve the problem discussed in this issue. Thanks!


SamanehSaadat removed the keras-team-review-pending (Pending review by a Keras team member) label on Jun 28, 2024
drasmuss (Author) commented:

No, the issue is not resolved. I had been working on a fix locally but was unable to verify it due to that other bug. This issue itself is still present: Keras gives incorrect output from fit and evaluate in a distributed context.

SamanehSaadat (Member) commented:

I see! Thanks for clarifying! We're looking into the other bug that you linked here. Once #19246 is resolved, please let us know if this issue persists!

drasmuss (Author) commented:

This issue will still require a pull request (or two) of its own to fix; it definitely won't be resolved on its own after #19246 is fixed.

SamanehSaadat reopened this on Jun 28, 2024
SamanehSaadat (Member) commented:

I see! I've re-opened the issue. Is it the display issue you mentioned in #19891 (comment)?

drasmuss (Author) commented:

I believe it's actually two separate issues (both requiring fixes). One is the wrong value being returned from evaluate. The other is that gradient aggregation is not happening, so the distributed replicas are not sharing information at all during training (which essentially means you get no benefit from distributed training).
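For reference, here is a minimal sketch (illustrative only, using plain tf.distribute rather than Keras internals) of the aggregation step that appears to be missing: each replica computes a local gradient, the gradients are averaged across replicas, and only then is the mirrored weight updated, so every copy stays identical.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    w = tf.Variable(1.0)  # mirrored: one copy per replica

def replica_step(x, y):
    # runs once per replica; in real training each replica would see its own
    # shard of the global batch
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((w * x - y) ** 2)
    return tape.gradient(loss, w)

per_replica_grads = strategy.run(replica_step, args=(tf.constant(1.0), tf.constant(2.0)))
# average the per-replica gradients before touching the weights...
grad = strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_grads, axis=None)
# ...so that every copy of the mirrored variable receives the same update
w.assign_sub(0.1 * grad)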

SamanehSaadat (Member) commented:

Thanks for clarifying! I'll look into this!
