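Deep Learning Notes: How to Train Neural Networks on Amazon Web Services GPUs
In Udacity's Deep Learning Nanodegree, at least 3 of the 5 hands-on projects need a GPU to train the model. The course comes with $100 of Amazon Web Services (AWS) credit, and these notes describe how to use AWS to train the models.
Registering an Account
First, sign up for a free Amazon AWS account: Amazon Web Services Cloud.
The projects use Elastic Compute Cloud (EC2), which can launch GPU-backed virtual servers; the specific instance type is p2.xlarge.
We will use this AMI (Amazon Machine Image) to define the required environment. Before using it, choose the AWS region closest to you: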
- EU (Ireland)
- Asia Pacific (Seoul)
- Asia Pacific (Tokyo)
- Asia Pacific (Sydney)
- US East (N. Virginia)
- US East (Ohio)
- US West (Oregon)
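After choosing a region, open the EC2 Service Limit report and find the "Running On-Demand p2.xlarge instances" entry.
If the limit is 0, click the "Request limit increase" link on the right. Raising the limit costs nothing; you are only charged while an instance is running.
The limit increase form asks for:
- Region: the AWS region chosen in the previous step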
- Primary Instance Type: p2.xlarge
- Limit: Instance Limit
- New Limit Value: 1 (more if you like)
- Use Case Description: I would like to use GPU instances for deep learning.
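If you have never launched an AWS service before, you may receive a confirmation email.
Enter the promotion code provided by Udacity on the Billing Management Console page.
Running an Instance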
Launch an Instance
Go to the EC2 Management Console and click "Launch Instance".
Select the AMI (Amazon Machine Image)
Go to AWS Marketplace and search for Deep Learning AMI with Source Code (CUDA 8, Ubuntu).
Select the Instance Type
In Step 2: Choose an Instance Type:
- Filter the instance list to only show “GPU compute”
- Select the p2.xlarge instance type
- Review and Launch
Configure the Security Group
In Step 7: Review Instance Launch, click "Edit security groups".
On the "Configure Security Group" page:
- Select "Create a new security group"
- Set the "Security group name" (e.g. "Jupyter")
- Click "Add Rule"
- Set a "Custom TCP Rule"
- Set the "Port Range" to "8888"
- Select "Anywhere" as the "Source"
- Click "Review and Launch" (again)
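Create an Authentication Key Pair
Select "Create a new key pair" and click the "Download Key Pair" button. Download the .pem file and keep it safe; you will need it later to log in to the instance.
Once the download finishes, click the "Launch Instances" button.
Setting Up Billing Alerts
From this point on, once the EC2 instance is launched, AWS starts charging. Prices are listed on the EC2 On-Demand Pricing page.
p2.xlarge: $0.90 per hour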
Most importantly, remember to “stop” (i.e. shutdown) your instances when you are not using them. Otherwise, your instances might run for a day, week, month, or longer without you remembering, and you’ll wind up with a large bill!
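To get a feel for how quickly the credit can run out, here is a minimal sketch that divides the course credit by the hourly price quoted above. The helper name `hours_for_credit` is made up for illustration, and the numbers are only the ones mentioned in these notes; actual prices depend on region and change over time.

```shell
#!/usr/bin/env bash
# Rough budget check: how many whole hours of compute does a credit buy?
# Figures are the ones quoted in these notes, not live AWS prices.

# hours_for_credit CREDIT_USD RATE_USD_PER_HOUR  (hypothetical helper)
hours_for_credit() {
  awk -v credit="$1" -v rate="$2" 'BEGIN { printf "%d\n", int(credit / rate) }'
}

hours=$(hours_for_credit 100 0.9)
echo "A \$100 credit covers about ${hours} hours of p2.xlarge time."
echo "Left running around the clock, that is roughly $((hours / 24)) full days."
```

With these inputs the credit lasts about 111 hours, which is less than five days if the instance is never stopped.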
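Logging In to the Cloud Server
Once the instance is running, open a terminal, change to the directory where the .pem file is saved, and run the following command (X.X.X.X is the IP shown in the console, which changes on every launch):
ssh -i DLND.pem ubuntu@X.X.X.X
At this point an error appears: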
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for 'DLND.pem' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "DLND.pem": bad permissions
ubuntu@X.X.X.X: Permission denied (publickey).
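The fix is described in Troubleshooting connecting to your instance - Amazon Elastic Compute Cloud:
Your key must not be publicly viewable for SSH to work. To fix this error, run the following command: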
chmod 400 DLND.pem
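The same check can be scripted so the permissions are verified before every connection. This is only a sketch: `file_mode` and `ensure_private_key` are hypothetical helper names, and the demo runs on a throwaway temp file rather than the real key.

```shell
#!/usr/bin/env bash
# Make sure a private key is readable only by its owner before handing it to ssh.

# file_mode PATH: print the octal permission bits
# (GNU stat first, BSD/macOS stat as a fallback).
file_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1"
}

# ensure_private_key PATH (hypothetical helper): tighten permissions to 400 if needed,
# then print the resulting mode.
ensure_private_key() {
  local key="$1"
  if [ "$(file_mode "$key")" != "400" ]; then
    chmod 400 "$key"
  fi
  file_mode "$key"
}

# Demo on a scratch file so the real key is not touched.
demo_key=$(mktemp)
chmod 644 "$demo_key"
echo "before: $(file_mode "$demo_key")"
echo "after:  $(ensure_private_key "$demo_key")"
rm -f "$demo_key"
```

For the real key the call would be `ensure_private_key DLND.pem`, after which the ssh command should no longer ignore the key.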
Configuring Jupyter Notebook
After connecting to the server, run the following command to create the Jupyter notebook configuration file:
jupyter notebook --generate-config
The server responds:
Writing default config to: /home/ubuntu/.jupyter/jupyter_notebook_config.py
Then change the notebook's IP address setting so that it accepts connections from any address:
sed -i "s/#c.NotebookApp.ip = 'localhost'/c.NotebookApp.ip = '*'/g" ~/.jupyter/jupyter_notebook_config.py
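The goal of this edit is to turn the commented-out default into an active setting that listens on all interfaces. Below is a minimal sketch that applies such a substitution to a scratch file and prints the result, assuming the generated default line reads `#c.NotebookApp.ip = 'localhost'`. The helper name `enable_remote_ip` is hypothetical.

```shell
#!/usr/bin/env bash
# enable_remote_ip CONFIG (hypothetical helper):
# uncomment the NotebookApp.ip setting and set it to '*'.
enable_remote_ip() {
  local config="$1"
  # Write to a temp file and move it back; this sidesteps the different
  # `sed -i` syntax of GNU and BSD sed.
  sed "s/^#[[:space:]]*c\.NotebookApp\.ip = 'localhost'/c.NotebookApp.ip = '*'/" \
    "$config" > "$config.tmp" && mv "$config.tmp" "$config"
}

# Demo on a scratch file that contains only the assumed default line.
demo=$(mktemp)
printf "%s\n" "#c.NotebookApp.ip = 'localhost'" > "$demo"
enable_remote_ip "$demo"
cat "$demo"
rm -f "$demo"
```

If the printed line still starts with `#`, the setting is still a comment and Jupyter will keep listening on localhost only.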
Testing the Instance
On the EC2 instance
- Clone a GitHub repository
git clone https://github.com/udacity/aind2-cnn.git
- Enter the repo directory
cd aind2-cnn
- Install the requirements
sudo python3 -m pip install -r requirements/requirements-gpu.txt
- Start Jupyter notebook
jupyter notebook --ip=0.0.0.0 --no-browser
From your local machine
You will need the token generated by your Jupyter notebook to access it. On your instance terminal, there will be the following line:
Copy/paste this URL into your browser when you connect for the first time, to login with a token:
http://13.115.162.209:8888/?token=94e72e170ca3fdbe1cd7c58a3fd898e9533e740beb6070fa
Copy everything starting with :8888/?token=
Access the Jupyter notebook index from your web browser by visiting: X.X.X.X:8888/?token=... (where X.X.X.X is the IP address of your EC2 instance and everything starting with :8888/?token= is what you just copied).
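Splicing the copied token onto the instance IP by hand is easy to get wrong, so here is a small sketch that does it with shell parameter expansion. The function name `notebook_url`, the address 203.0.113.10 (a documentation-only IP), and the token `abc123` are all made up for illustration.

```shell
#!/usr/bin/env bash
# notebook_url INSTANCE_IP PRINTED_URL (hypothetical helper)
# Keeps only the token from the URL Jupyter printed and rebuilds the
# address with the public IP of the instance.
notebook_url() {
  local ip="$1" printed="$2"
  local token="${printed##*token=}"   # drop everything up to and including "token="
  echo "http://${ip}:8888/?token=${token}"
}

# Placeholder IP and token, not real values.
notebook_url 203.0.113.10 "http://localhost:8888/?token=abc123"
```

The sketch assumes the notebook runs on port 8888, the port opened in the security group earlier.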
Click on "mnist_mlp" to enter the folder, and select the "mnist_mlp.ipynb" notebook.
Run each cell in the notebook.
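When you finish experimenting, remember to stop the instance.
Creating a New Environment
Reference: "Deep Learning Prerequisites and FAQ" - DLND: Deep Learning Nanodegree - Udacity China forum
Install conda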
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
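Create the environment (the TensorFlow wheel installed below is built for Python 3.6, so pin that version)
conda create -n dlnd python=3.6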
Activate the environment
source activate dlnd
Install TensorFlow
pip install --ignore-installed https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.1.0-cp36-cp36m-linux_x86_64.whl
Returning to the Environment Next Time
- Start the instance from the EC2 Management Console
- On your local machine, run
ssh -i DLND.pem ubuntu@X.X.X.X
- After connecting to the server, activate the environment
source activate dlnd
- Start Jupyter notebook
jupyter notebook --ip=0.0.0.0
- Open in a browser: http://52.197.226.169:8888/?token=a55e1cfbc162df6d3358e3553d220b4d269e2789df6e5ddd
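Because the public IP changes each time the instance is started, it can help to build the login command from the IP instead of editing it by hand. The sketch below only prints the command. The helper name `ssh_command` is hypothetical, the user name `ubuntu` is an assumption based on the /home/ubuntu path shown earlier, and 203.0.113.10 is a placeholder address.

```shell
#!/usr/bin/env bash
# ssh_command KEY_FILE INSTANCE_IP (hypothetical helper):
# print the ssh command for the current session.
ssh_command() {
  local key="$1" ip="$2"
  # "ubuntu" is assumed to be the login user of the Ubuntu-based AMI.
  echo "ssh -i ${key} ubuntu@${ip}"
}

# Print rather than run, so the command can be checked first.
ssh_command DLND.pem 203.0.113.10
```

Once the printed command looks right, it can be pasted into the terminal or run with `eval "$(ssh_command DLND.pem X.X.X.X)"`, using the IP from the console.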