How to control thermal and performance settings for multiple Nvidia GPUs on Ubuntu Linux
Such compute.
very co$t
Wow
The price of a pair of GTX 1060 GPUs has gone up about 50% since I built my deep-learning rig a few weeks ago, and that's if you can even find them in stock. There's been a wee bit of a gold rush surrounding cryptocurrencies lately as many new miners have been setting up systems. Ultimately, I think this will benefit more than just crypto as the demand for fast and efficient cards pushes graphics card makers to innovate more efficient and powerful cards, just like high performance computing for scientific purposes has traditionally piggy-backed on demand for better gaming. It also means increased awareness and adoption of cryptography and cryptocurrencies, which I consider a good thing as it should help stabilize the ecosystem.
In any case, whether your aim is mining ether or back-propagation, you may want to get as much performance out of the GPUs you do have during the current shortage. This means tuning the card to optimize for your needs of performance and/or efficiency. For Nvidia cards on Ubuntu this comes with a slight difficulty in that normally you can only tune a GPU running a display, but with a few tricks it's possible to overclock multiple GPUs without hooking up a monitor to each one. This took me a while to figure out, so I thought it may be helpful to others.
Note: changing the cool-bits flag lets you bypass thermal safeguards, may affect your warranty, etc., so be conservative in your changes and monitor GPU temperatures and watch for errors. I typically run the fans at a higher intensity than they would normally operate and keep the temperature well below 70 °C.
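A minimal sketch of a temperature check, assuming nvidia-smi is on the PATH and accepts the index and temperature.gpu query fields. The helper name check_temps is hypothetical, and the default limit of 70 simply mirrors the ceiling mentioned above.

```shell
#!/bin/sh
# Read "index, temperature" lines on stdin and report any GPU at or above
# a limit. Returns non-zero if at least one GPU is too hot.
check_temps() {
    local limit="${1:-70}" idx temp hot=0
    while IFS=', ' read -r idx temp; do
        # Skip blank or non-numeric readings (e.g. "N/A").
        case "$temp" in ''|*[!0-9]*) continue ;; esac
        if [ "$temp" -ge "$limit" ]; then
            echo "GPU $idx is at ${temp}C (limit ${limit}C)"
            hot=1
        fi
    done
    return "$hot"
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits |
        check_temps 70 || echo "Consider raising fan speed or lowering clocks."
fi
```

Running this from cron or a loop while you tune is one option; it only reports and does not change any settings.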
In short, it was the order that mattered. Setting cool-bits, with or without the flag to allow empty configurations, before editing the config file always left me with control over just one GPU after rebooting :-/ Instead, I had to modify the config file first, and only then allow empty configurations and set the cool-bits flag. I'll assume you've got your drivers set up and your cards are working, so all that's left is to gain control through nvidia-settings.
Here's what worked for me:
First, edit your Xorg config file. Duplicate the monitor/device/screen declarations while incrementing the names, e.g. "Device0" becomes "Device1". Each device entry also needs the BusID of its own card. Do this once for each of your GPUs; I have two cards, so I ended up with two screen/monitor/device entries. The text for my config file is at the end of this post.
sudo nano /etc/X11/xorg.conf
(Replace nano with your text editor of choice.)
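When filling in the BusID lines, note that xorg.conf expects decimal numbers, while nvidia-smi reports bus IDs in hexadecimal with a leading PCI domain (for example 00000000:0A:00.0). The sketch below converts one form to the other. The helper name pci_to_xorg is hypothetical, and it assumes the domain:bus:device.function layout.

```shell
#!/bin/sh
# Convert an nvidia-smi style bus ID (hex, domain:bus:device.function)
# into the decimal "PCI:bus:device:function" form used in xorg.conf.
pci_to_xorg() {
    local id bus dev fn
    id="${1#*:}"                       # drop the PCI domain -> 0A:00.0
    bus="${id%%:*}"                    # bus                 -> 0A
    dev="${id#*:}"; dev="${dev%%.*}"   # device              -> 00
    fn="${id##*.}"                     # function            -> 0
    printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
}

# Print one BusID value per installed GPU, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader |
        while read -r id; do pci_to_xorg "$id"; done
fi
```

For cards in the first few slots the hex and decimal values are the same (01 is 1), so the conversion only matters once a bus number reaches 0A or higher.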
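Then set the --cool-bits flag and allow empty configurations. The cool-bits value is a bitmask: setting it to 28 (4 + 8 + 16) enables fan control and clock offsets, and also allows you to change GPU voltages, which I don't currently use or recommend. 12 (4 + 8) leaves out the voltage bit and should also work for our needs. 5 should give you fan control but, as far as I can tell from the driver README, not clock offsets on recent cards.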
sudo nvidia-xconfig -a --cool-bits=28 --allow-empty-initial-configuration
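To check what a given cool-bits value switches on before writing it to the config, it can be decoded bit by bit. This is a sketch based on the bit meanings as I understand them from the Nvidia driver README; the helper name decode_coolbits is hypothetical, and the descriptions are paraphrased.

```shell
#!/bin/sh
# Print the features enabled by a Coolbits value, one per set bit.
decode_coolbits() {
    local value="$1"
    [ $(( value & 1 ))  -ne 0 ] && echo "1: clock control on older (pre-Fermi) GPUs"
    [ $(( value & 2 ))  -ne 0 ] && echo "2: SLI across GPUs with different memory sizes"
    [ $(( value & 4 ))  -ne 0 ] && echo "4: manual fan control"
    [ $(( value & 8 ))  -ne 0 ] && echo "8: clock offsets on the PowerMizer page"
    [ $(( value & 16 )) -ne 0 ] && echo "16: overvoltage"
    return 0
}

decode_coolbits 28   # lists the 4, 8 and 16 entries
```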
And that's it. You should be able to reboot and start over/underclocking your GPUs. Check in on the temperature and power usage with nvidia-smi on the command line. You can modify the thermal and performance ("PowerMizer") settings with the GUI by just typing nvidia-settings, or you can use commands like these:
nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[3]=400"
nvidia-settings -a "[gpu:0]/GPUGraphicsClockOffset[3]=40"
nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=65"
Or, to go back to normal:
nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[3]=0"
nvidia-settings -a "[gpu:0]/GPUGraphicsClockOffset[3]=0"
nvidia-settings -a "[gpu:0]/GPUFanControlState=0"
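With more than one card, the same assignments have to be repeated for gpu:1, fan:1 and so on. Here is a sketch of a loop that does this. The function name tune_all_gpus is hypothetical, and it assumes that fan N belongs to GPU N, which may not hold for cards with several fans. By default it only prints the commands.

```shell
#!/bin/sh
# Print (or run) the same nvidia-settings assignments for GPUs 0..count-1.
# Arguments: GPU count, memory offset, graphics clock offset, fan speed (%),
# and an optional runner ("echo" = dry run, "command" = really execute).
tune_all_gpus() {
    local count="$1" mem="$2" clock="$3" fan="$4" run="${5:-echo}"
    local i=0 setting
    while [ "$i" -lt "$count" ]; do
        for setting in \
            "[gpu:$i]/GPUMemoryTransferRateOffset[3]=$mem" \
            "[gpu:$i]/GPUGraphicsClockOffset[3]=$clock" \
            "[gpu:$i]/GPUFanControlState=1" \
            "[fan:$i]/GPUTargetFanSpeed=$fan"
        do
            "$run" nvidia-settings -a "$setting"
        done
        i=$((i + 1))
    done
}

# Dry run for two cards: prints eight commands without touching the GPUs.
tune_all_gpus 2 400 40 65
```

Once the printed commands look right, passing command as the fifth argument (tune_all_gpus 2 400 40 65 command) executes them instead of printing them.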
Tune the settings in small increments and just change one setting at a time until you get closer to optimizing your chosen metric, then adjust the next setting and repeat as necessary ("walking the settings"). There are many more, much better guides out there for the actual overclocking for performance or underclocking for efficiency, and I suggest you check them out. Hopefully one or two other people had the same problem as I did with the order of setting cool-bits and modifying the xorg.conf file and this short post will be useful to some fellow human, somewhere, sometime. Thanks!
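"Walking the settings" can be sketched as a loop over a small, fixed step. The helper name offset_steps and the 0 to 400 range in steps of 100 are illustrative assumptions, not recommended values, and the loop prints the commands rather than applying them.

```shell
#!/bin/sh
# Emit the offsets to try, from a start value up to a limit in fixed steps.
offset_steps() {
    local v="$1" limit="$2" step="$3"
    [ "$step" -gt 0 ] || return 1      # refuse a step that would never finish
    while [ "$v" -le "$limit" ]; do
        echo "$v"
        v=$((v + step))
    done
}

# Walk the memory offset on GPU 0, one step at a time.
for offset in $(offset_steps 0 400 100); do
    echo nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[3]=$offset"
    # Run your benchmark here, check temperatures and errors, and stop at
    # the first sign of instability before moving on to the next setting.
done
```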
I am using Ubuntu 16.04 with version 375.66 of the Nvidia drivers.
xorg.conf example:
# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig: version 375.66 (buildmeister@swio-display-x86-rhel47-06) Mon May 1 15:45:32 PDT 2017
Section "ServerLayout"
Identifier "Layout0"
Screen 0 "Screen0"
Screen 1 "Screen1" RightOf "Screen0"
InputDevice "Keyboard0" "CoreKeyboard"
InputDevice "Mouse0" "CorePointer"
EndSection
Section "Files"
EndSection
Section "InputDevice"
# generated from default
Identifier "Mouse0"
Driver "mouse"
Option "Protocol" "auto"
Option "Device" "/dev/psaux"
Option "Emulate3Buttons" "no"
Option "ZAxisMapping" "4 5"
EndSection
Section "InputDevice"
# generated from default
Identifier "Keyboard0"
Driver "kbd"
EndSection
Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
HorizSync 28.0 - 33.0
VertRefresh 43.0 - 72.0
Option "DPMS"
EndSection
Section "Monitor"
Identifier "Monitor1"
VendorName "Unknown"
ModelName "Unknown"
HorizSync 28.0 - 33.0
VertRefresh 43.0 - 72.0
Option "DPMS"
EndSection
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "GeForce GTX 1060 6GB"
BusID "PCI:1:0:0"
EndSection
Section "Device"
Identifier "Device1"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "GeForce GTX 1060 6GB"
BusID "PCI:2:0:0"
EndSection
Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Monitor0"
DefaultDepth 24
Option "AllowEmptyInitialConfiguration" "True"
Option "Coolbits" "28"
SubSection "Display"
Depth 24
EndSubSection
EndSection
Section "Screen"
Identifier "Screen1"
Device "Device1"
Monitor "Monitor1"
DefaultDepth 24
Option "AllowEmptyInitialConfiguration" "True"
Option "Coolbits" "28"
SubSection "Display"
Depth 24
EndSubSection
EndSection
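After rebooting, it can be worth confirming that every Screen section still carries its Coolbits option, since a missing one would explain losing control of a card. The sketch below counts both in a config file. The helper name check_coolbits is hypothetical, and it assumes one Coolbits line per Screen section, as in the example above.

```shell
#!/bin/sh
# Compare the number of Screen sections with the number of Coolbits options.
# Succeeds only when there is at least one screen and the counts match.
check_coolbits() {
    local conf="${1:-/etc/X11/xorg.conf}" screens bits
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 2; }
    screens=$(grep -c '^[[:space:]]*Section[[:space:]]*"Screen"' "$conf" || true)
    bits=$(grep -ci '^[[:space:]]*Option[[:space:]]*"Coolbits"' "$conf" || true)
    echo "screens=$screens coolbits=$bits"
    [ "$screens" -gt 0 ] && [ "$screens" -eq "$bits" ]
}

# Check the live config only if it exists on this machine.
if [ -r /etc/X11/xorg.conf ]; then
    check_coolbits || echo "Some Screen sections are missing Coolbits."
fi
```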
Write good
Not working on Nvidia 390.x drivers
I tried your instructions (sudo nvidia-xconfig...) to set cool-bits to 28 and then rebooted. I think it auto-rebooted once more.
When I then tried setting the fan speed, I still got the same error (Unknown error).
My guess is that when it auto-rebooted the second time (not sure why or how that happened), it recreated the xorg.conf file and overwrote the Coolbits 28 that I had added. I know the option had been written, because I checked the updated .conf file after using the CLI to add it.
Looks like Nvidia changed or disabled Coolbits updates to the .conf file in their newer drivers.
Good article