From f96a9512736575d8c3d7c7c963700c4158629746 Mon Sep 17 00:00:00 2001
From: Jannis Klinkenberg <j.klinkenberg@itc.rwth-aachen.de>
Date: Fri, 6 Dec 2024 10:55:27 +0100
Subject: [PATCH] updated README.md

---
 tensorflow/cifar10_distributed/README.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/tensorflow/cifar10_distributed/README.md b/tensorflow/cifar10_distributed/README.md
index a54fcc9..2368c26 100644
--- a/tensorflow/cifar10_distributed/README.md
+++ b/tensorflow/cifar10_distributed/README.md
@@ -1,9 +1,15 @@
 # TensorFlow - Distributed Training
 
 This folder contains the following 3 example versions for distributed training:
-- **Version 1: (`submit_job_container_single-node.sh`)** A TensorFlow native version that is constraint to a single compute node with multiple GPUs. A single process is serving multiple GPUs with a `tf.distribute.MirroredStrategy`.
-- **Version 2: (`submit_job_container.sh`)** A TensorFlow native version that utilizes multiple processes (1 process per GPU) that work together using a `tf.distribute.MultiWorkerMirroredStrategy`. Although this is not constraint to a single node, it requires a bit more preparation to setup the distributed environment (via `TF_CONFIG` environment variable)
-- **Version 3: (`submit_job_container_horovod.sh`)** A version that is using Horovod ontop of TensorFlow to perform the distributed training and communication of e.g. model weights/updates. Typically, these calls also use 1 process per GPU.
+
+## Version 1: (`submit_job_container_single-node.sh`)
+A TensorFlow native version that is constrained to a single compute node with multiple GPUs. A single process serves multiple GPUs with a `tf.distribute.MirroredStrategy`.
+
+## Version 2: (`submit_job_container.sh`)
+A TensorFlow native version that uses multiple processes (1 process per GPU) that work together using a `tf.distribute.MultiWorkerMirroredStrategy`. Although this is not constrained to a single node, it requires a bit more preparation to set up the distributed environment (via the `TF_CONFIG` environment variable).
+
+## Version 3: (`submit_job_container_horovod.sh`)
+A version that uses Horovod on top of TensorFlow to perform the distributed training and the communication of e.g. model weights/updates. Typically, these calls also use 1 process per GPU.
 
 More information and examples concerning Horovod can be found under:
 - https://horovod.readthedocs.io/en/stable/tensorflow.html
-- 
GitLab
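
For Version 2, `TF_CONFIG` is a JSON document that lists all worker processes and identifies the current one. The following is only a minimal sketch of how such a value could be assembled per process. The helper name `build_tf_config`, the host names, and the ports are hypothetical and are not taken from the scripts in this folder; how the worker list is actually obtained depends on the cluster and the job script.

```python
import json
import os


def build_tf_config(worker_addresses, task_index):
    """Return the TF_CONFIG JSON string for one worker process.

    worker_addresses: list of "host:port" strings, one per process (GPU).
    task_index: position of the current process in that list.
    """
    if not 0 <= task_index < len(worker_addresses):
        raise ValueError("task_index must refer to an entry of worker_addresses")
    config = {
        # Every process receives the same cluster description ...
        "cluster": {"worker": list(worker_addresses)},
        # ... but its own task index.
        "task": {"type": "worker", "index": task_index},
    }
    return json.dumps(config)


if __name__ == "__main__":
    # Hypothetical layout: 2 nodes with 2 GPUs each, i.e. 4 worker processes.
    workers = ["node1:12345", "node1:12346", "node2:12345", "node2:12346"]
    # The variable has to be set before the strategy object is created.
    os.environ["TF_CONFIG"] = build_tf_config(workers, task_index=0)
    print(os.environ["TF_CONFIG"])
```

All processes share the same worker list and differ only in the task index, which `tf.distribute.MultiWorkerMirroredStrategy` reads from the environment when it is constructed.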