How do I run TensorFlow on an AWS cluster?

Date: 2022-12-22 00:55:39

I'm trying to run distributed TensorFlow on an EMR/EC2 cluster, but I don't know how to assign parts of the code to specific instances in the cluster.

The documentation uses tf.device("/gpu:0") to select a GPU. But what if I have a master CPU instance and 5 separate slave GPU instances running in an EMR cluster, and I want some of the code to run on those GPUs? I can't pass the instances' public DNS names to tf.device(), because it throws an error saying the name cannot be resolved.

1 answer

#1

Since you asked this question, AWS has released some code that makes it easier to run distributed TensorFlow on an EC2 cluster.

See this GitHub repository. Everything is described in the README.md, but in short, it creates an AWS stack with:

  • Security Groups
  • An Elastic File System (EFS)
  • EC2 instances running the AWS Deep Learning AMI, with the EFS mounted on them
  • Preconfigured EC2 instances, so that you can start a distributed TensorFlow training run with a single command on the master node (see the "Running Distributed Training on TensorFlow" section)
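
As for the original question: tf.device() does not accept host names. In distributed TensorFlow, the instance addresses go into a cluster specification, and tf.device() then refers to a machine by job name and task index, for example "/job:worker/task:0/device:GPU:0". Below is a minimal sketch of that naming scheme using only the Python standard library. The IP addresses, the port, the job names "ps" and "worker", and the helper function names are all assumptions made for illustration; they do not come from the AWS repository.

```python
import json

# Hypothetical private addresses of the instances; replace with your own.
# Port 2222 is an arbitrary choice and must be reachable between the nodes.
MASTER = "10.0.0.10:2222"
WORKERS = ["10.0.0.%d:2222" % i for i in range(11, 16)]  # 5 GPU instances


def build_cluster(master, workers):
    """Map job names to the host:port addresses of their tasks."""
    return {"ps": [master], "worker": list(workers)}


def device_for(job, task, device="GPU:0"):
    """Device string that addresses one device on one task of the cluster."""
    return "/job:%s/task:%d/device:%s" % (job, task, device)


def tf_config(cluster, job, task):
    """JSON for the TF_CONFIG environment variable of a single process."""
    return json.dumps({"cluster": cluster, "task": {"type": job, "index": task}})


cluster = build_cluster(MASTER, WORKERS)
print(device_for("worker", 3))  # /job:worker/task:3/device:GPU:0
print(tf_config(cluster, "worker", 3))
```

In TensorFlow 1.x, a dictionary like `cluster` would typically be passed to tf.train.ClusterSpec, each instance would start a tf.train.Server with its own job name and task index, and the device string would be used in `with tf.device(...)`. Newer tf.distribute strategies read a similar structure from the TF_CONFIG environment variable. Check the documentation for your TensorFlow version before relying on either.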
