Commit f96a9512 authored by Jannis Klinkenberg

updated README.md

parent 7e52a5f6
# TensorFlow - Distributed Training
This folder contains the following 3 example versions for distributed training:
- **Version 1: (`submit_job_container_single-node.sh`)** A TensorFlow-native version that is constrained to a single compute node with multiple GPUs. A single process drives all GPUs of the node using `tf.distribute.MirroredStrategy`.
- **Version 2: (`submit_job_container.sh`)** A TensorFlow-native version that uses multiple processes (1 process per GPU) working together via `tf.distribute.MultiWorkerMirroredStrategy`. Although this version is not constrained to a single node, it requires a bit more preparation to set up the distributed environment (via the `TF_CONFIG` environment variable).
- **Version 3: (`submit_job_container_horovod.sh`)** A version that uses Horovod on top of TensorFlow to perform the distributed training and the communication of, e.g., model weights/updates. Typically, Horovod jobs also use 1 process per GPU.
## Version 1: (`submit_job_container_single-node.sh`)
A TensorFlow-native version that is constrained to a single compute node with multiple GPUs. A single process drives all GPUs of the node using `tf.distribute.MirroredStrategy`.
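To illustrate the single-process, multi-GPU pattern, here is a minimal sketch. It is not taken from the scripts in this folder: the toy model, the random data, and the helper names `global_batch_size` and `train_single_node` are assumptions made for illustration only. TensorFlow is imported inside the training function so that the small helper can be used on its own.

```python
def global_batch_size(per_replica_batch_size: int, num_replicas: int) -> int:
    """Batch size seen by the whole job when each replica (GPU) gets a fixed share."""
    return per_replica_batch_size * num_replicas


def train_single_node(per_replica_batch_size: int = 64, epochs: int = 1) -> None:
    """Hypothetical example: train a toy model on all GPUs visible to this one process."""
    import tensorflow as tf  # imported lazily; requires a TensorFlow installation

    # One process, one replica per visible GPU (falls back to CPU if none is found).
    strategy = tf.distribute.MirroredStrategy()
    batch_size = global_batch_size(per_replica_batch_size, strategy.num_replicas_in_sync)

    # Variables must be created inside the scope so they are mirrored across replicas.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(32,)),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )

    # Random stand-in data; a real job would load its own dataset here.
    x = tf.random.normal((1024, 32))
    y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
    dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size)

    model.fit(dataset, epochs=epochs)
```

Calling `train_single_node()` inside the job would distribute each global batch across the GPUs of the node; the batch is scaled with the number of replicas so that each GPU keeps the same per-replica workload.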
## Version 2: (`submit_job_container.sh`)
A TensorFlow-native version that uses multiple processes (1 process per GPU) working together via `tf.distribute.MultiWorkerMirroredStrategy`. Although this version is not constrained to a single node, it requires a bit more preparation to set up the distributed environment (via the `TF_CONFIG` environment variable).
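As a rough sketch of what this preparation can look like, the following helper assembles a `TF_CONFIG` value for a job with one worker process per GPU. The function name `build_tf_config`, the host names, the base port, and the way ports are assigned per local process are all assumptions for illustration; the scripts in this folder may construct the variable differently.

```python
import json


def build_tf_config(hosts, task_index, procs_per_node=1, base_port=12345):
    """Return a TF_CONFIG JSON string for a worker-only cluster.

    hosts          -- list of node host names (hypothetical values in this example)
    task_index     -- global index of the calling process among all workers
    procs_per_node -- number of worker processes (e.g. GPUs) per node
    base_port      -- first port; each local process on a node gets its own port
    """
    workers = [
        f"{host}:{base_port + local_rank}"
        for host in hosts
        for local_rank in range(procs_per_node)
    ]
    if not 0 <= task_index < len(workers):
        raise ValueError(f"task_index {task_index} is outside 0..{len(workers) - 1}")
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": task_index},
    })
```

Each process would export its own value, e.g. `os.environ["TF_CONFIG"] = build_tf_config(hosts, rank, procs_per_node=2)`, before creating `tf.distribute.MultiWorkerMirroredStrategy()`. All processes share the same `cluster` entry and differ only in the `task` index.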
## Version 3: (`submit_job_container_horovod.sh`)
A version that uses Horovod on top of TensorFlow to perform the distributed training and the communication of, e.g., model weights/updates. Typically, Horovod jobs also use 1 process per GPU.
More information and examples concerning Horovod can be found under:
- https://horovod.readthedocs.io/en/stable/tensorflow.html
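For orientation, a minimal sketch of the usual Horovod pattern with Keras is shown here. It is an illustration under assumptions, not the code used by `submit_job_container_horovod.sh`: the toy model, the random data, and the helper names `scaled_learning_rate` and `train_with_horovod` are hypothetical, and scaling the learning rate with the number of processes is a common convention rather than a requirement. TensorFlow and Horovod are imported inside the training function so that the helper can be used without them.

```python
def scaled_learning_rate(base_lr: float, world_size: int) -> float:
    """Scale the learning rate linearly with the number of training processes."""
    return base_lr * world_size


def train_with_horovod(base_lr: float = 0.001, epochs: int = 1) -> None:
    """Hypothetical example: one process per GPU, started by e.g. horovodrun or mpirun."""
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd  # requires a Horovod installation

    hvd.init()

    # Pin each process to a single GPU based on its rank within the node.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(10),
    ])

    # The wrapped optimizer averages gradients across all processes.
    optimizer = hvd.DistributedOptimizer(
        tf.keras.optimizers.SGD(learning_rate=scaled_learning_rate(base_lr, hvd.size()))
    )
    model.compile(
        optimizer=optimizer,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

    # Random stand-in data; a real job would shard its dataset by hvd.rank().
    x = tf.random.normal((1024, 32))
    y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)

    model.fit(
        x, y,
        batch_size=64,
        epochs=epochs,
        # Start all processes from the same initial weights (those of rank 0).
        callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
        verbose=1 if hvd.rank() == 0 else 0,  # only rank 0 prints progress
    )
```

In contrast to Version 2, no `TF_CONFIG` is needed here, since the launcher provides the rank and size of each process to Horovod.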