# TensorFlow - Distributed Training
This folder contains the following 3 example versions for distributed training:
- **Version 1 (`submit_job_container_single-node.sh`):** A TensorFlow-native version that is constrained to a single compute node with multiple GPUs. A single process serves all GPUs via a `tf.distribute.MirroredStrategy` (see the first sketch after this list).
- **Version 2 (`submit_job_container.sh`):** A TensorFlow-native version that uses multiple processes (one process per GPU) that cooperate via a `tf.distribute.MultiWorkerMirroredStrategy`. While this is not constrained to a single node, it requires a bit more preparation to set up the distributed environment (via the `TF_CONFIG` environment variable; see the second sketch after this list).
- **Version 3 (`submit_job_container_horovod.sh`):** A version that uses Horovod on top of TensorFlow to perform the distributed training and the communication of, e.g., model weights and gradient updates. This version also typically uses one process per GPU (see the sketch after the Horovod links below).
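
The following is a minimal sketch of the Version 1 approach: a single process drives all local GPUs through `tf.distribute.MirroredStrategy`. The model, dataset, and batch size are illustrative placeholders and not necessarily those used in the example scripts.

```python
import tensorflow as tf

# One process mirrors variables across all GPUs visible on this node.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas (GPUs):", strategy.num_replicas_in_sync)

# Model and optimizer must be created inside the strategy scope so that
# their variables are mirrored on every replica.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Keras splits each global batch across the replicas automatically.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=2,
          batch_size=256 * strategy.num_replicas_in_sync)
```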
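
The next sketch illustrates what each worker process in Version 2 would run with `tf.distribute.MultiWorkerMirroredStrategy`. Every process needs a `TF_CONFIG` environment variable describing the whole cluster before the strategy is created; in the example job scripts this is derived from the job allocation, while the host names, ports, and worker index below are purely illustrative.

```python
import json
import os

import tensorflow as tf

# Illustrative TF_CONFIG for worker 0 of a 2-worker job. Normally the submit
# script sets this before launching the processes (one per GPU).
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {"worker": ["node001:12345", "node002:12345"]},
    "task": {"type": "worker", "index": 0},
}))

# The strategy reads TF_CONFIG and connects to the other workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=2, batch_size=256)
```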
More information and examples concerning Horovod can be found at:
- https://horovod.readthedocs.io/en/stable/tensorflow.html
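
For Version 3, the sketch below shows the typical Horovod-on-Keras pattern with one process per GPU (launched e.g. with `horovodrun` or the job scheduler). The model, learning rate, and dataset are placeholders, not the repository's actual training setup.

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

# Pin each process to a single local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by the number of workers and wrap the optimizer so
# that gradients are all-reduced across all ranks.
opt = tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Broadcast the initial variables from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=2, batch_size=256,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```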