Tag Archives: hpc

ORCA & OpenMPI: File Descriptor Limits

I’ve just been tinkering around with an interesting issue: ORCA, the computational chemistry program I’ve been using because I can’t afford Gaussian, crashes during geometry optimisation of a (moderately) complex molecule because of OpenMPI.

OpenMPI is complaining about running out of file descriptors. Eh? Seriously? OK…

Turns out that Ubuntu 16.04 (even the server version) sets the open file limit at what is frankly a little on the low side – 1024 open files. That sounds like a lot, until you think that when running something via MPI it can be crunching across a lot of temporary files and so on… and it suddenly doesn’t seem so many. Interestingly, I never had this problem before because I was running Ubuntu 14.04 previously, which (from what the internet says) had a limit of 4096. I checked with the latest release (14.04.5), which had a limit of 1024, so I’ll assume for now that the 4096 limit was in an older release…

I’ll be honest, since this is the first time I’ve encountered this issue, I’ve never actually checked previously…

Anyway, there appear to be two fixes that work, one on a per-user basis, one on a system-wide basis. Pick your poison.

The user-level fix is super easy, add the following to your .bashrc:

if [ "$USER" = "paramagnetic" ]; then
    ulimit -n 32768    # pick a suitably big number
fi

This can also be added to /etc/profile to apply it to every user’s login shell.
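To see what you’re currently working with, ulimit can report both the soft and hard limits directly:

```shell
# Show the current open-file limits for this shell.
ulimit -Sn    # soft limit -- the one processes actually hit
ulimit -Hn    # hard limit -- the ceiling the soft limit can be raised to
```

On a stock Ubuntu 16.04 install, the soft limit is where the 1024 shows up.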

At the system level, it’s a little more difficult, but still totally doable. Edit /etc/security/limits.conf

* soft nofile 32768
* hard nofile 32768
root soft nofile 32768
root hard nofile 32768

Then add:

session required pam_limits.so

To /etc/pam.d/common-session and /etc/pam.d/common-session-noninteractive

And reboot in all cases. If you have another form of system management software running, it might overrule you, which is annoying, but outside of the scope of this post.
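One gotcha worth knowing: ulimit only reports the shell you run it in. To check a process that is already running (say, a long ORCA/MPI job), read its /proc entry instead:

```shell
# Inspect the open-file limit of a running process via /proc.
# $$ is this shell's own PID -- substitute the PID of your ORCA/MPI job.
grep 'Max open files' /proc/$$/limits
```

Handy for confirming that the new limits actually took effect for the job, not just for your login shell.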

NWChem: Headaches

Been struggling to get NWChem to compile today. I think I’ve nearly got it, but I’m calling it a day for now because otherwise I’ll still be working on it at 0500…

For the record, it compiles fine if I don’t want CUDA or OpenBLAS (the internal BLAS libraries are horribly slow, by the developers’ own admission), but it’s getting CUDA and a faster BLAS like OpenBLAS in place that is causing me grief.

Basically, I’m hoping that it can provide me some speedup over a CPU-only computational chemistry package – I’ve had the chance to test TeraChem, which I actually think is awesome… it’s just that it doesn’t like Pascal generation GPUs, so it’s fine on my laptop with a 980M (well, if you call the GPU sitting at frighteningly high temperatures “fine”) but doesn’t want to know about the twin GTX 1080s I’ve got in a sort-of-but-not-quite-workstation.

So, anyway, as soon as I’ve figured out what compile flags I need to get it working properly, I’ll update this post.

Torque PBS & Ubuntu 16.04/Mint 18

There are some programs that like MPI. There are others that are… kind of single threaded, but work pretty well with a PBS (Portable Batch System) to actually queue up tasks and generally speed up execution.

The I-TASSER suite, for protein structure prediction, is one of the latter.

If you’re in academia, I-TASSER is free, so it’s a useful tool to have even if it’s not used very often.

But getting Ubuntu to play nice with a PBS can be something of a trick… partly because the version included with Ubuntu is now old. Very old.

And the newer versions are still free – it only costs money for Torque if you want to use the more powerful schedulers like Maui. Which I don’t, because I’m usually the only person actually logging in to the boxes I administer. This may change in the future, but for now, I don’t need a complex PBS.

Anyway, to get Torque working without using the version included in the repos (because it’s ancient) requires relatively little work in the grand scheme of things…

The first job is to get the basic requirements for Torque installed:

sudo apt-get install libboost-all-dev libssl-dev libxml2-dev

Boost pulls in a ton of things, so it may or may not be worth adding --no-install-recommends to the end of that apt-get command. I didn’t, but I’m not short on space.

If you’ve not got a C compiler installed, now is the time for that as well. Fortunately, Torque doesn’t need anything fancy like cmake to build, just good ol’ ./configure, make, make install.
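If the compiler is missing, the stock toolchain from the repos is all Torque needs (package name assumed from the standard Ubuntu archive):

```shell
# Pulls in gcc, g++, make and friends -- enough for a plain
# ./configure && make build like Torque's.
sudo apt-get install build-essential
```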

Now they’re installed, you can go download the Torque source code from Adaptive Computing. Now, annoyingly, the most recent release (6.1.1.1 as of writing) screws up for me for reasons I can’t figure out. I know from prior experience that 6.0.2 works 100%, so I’ll stick to that. It’s still newer than what is in the Ubuntu repos…

Extract the source somewhere sensible, like ~/bin using tar xzvf [torque.tgz] and run ./configure, then watch for any errors – there shouldn’t be any. When it’s all done, type make. You can use make -j [number of CPU cores] to speed things up a bit. Once that is done, switch to root with either sudo bash or su -, and type make install.
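Put together, the whole build looks something like this (the tarball name here is an assumption – match it to whatever you actually downloaded):

```shell
# Sketch of the full Torque build, assuming the 6.0.2 tarball.
tar xzvf torque-6.0.2.tar.gz -C ~/bin
cd ~/bin/torque-6.0.2
./configure          # watch the output for missing dependencies
make -j"$(nproc)"    # parallel build across all CPU cores
sudo make install    # installs under /usr/local by default
```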

Now comes some fun bits.

There is a nice script called torque.setup in the folder you just built Torque in, but that’s not everything you need.

The first thing to check is that you have your hostname listed appropriately in /etc/hosts. Now, here is where static IP addresses really make your life easier: if you are using DHCP and your router decides to change your IP, Torque will stop working. Very frustrating.

Anyway, while lots of things need 127.0.0.1 to point to localhost, Torque also needs it to point to the server name. I name mine after elements of the periodic table, but you can do whatever you want.

Here’s what my /etc/hosts file looks like:

127.0.0.1 localhost
127.0.0.1 hydrogen
169.254.1.100 hydrogen
169.254.1.101 helium
169.254.1.102 lithium
169.254.1.103 beryllium

Without this extra 127.0.0.1 entry, Torque doesn’t work. It also works to put localhost and the hostname together on the same 127.0.0.1 line.

Now you can run ./torque.setup [username] and answer y at the prompt.

Now tell the system where the Torque libraries are:

echo '/usr/local/lib' > /etc/ld.so.conf.d/torque.conf
ldconfig

Then tell Torque about the nodes (and how many CPUs each node has), and tell the pbs_mom which server it’s running on:

echo "hydrogen np=32" > /var/spool/torque/server_priv/nodes
echo "hydrogen" > /var/spool/torque/mom_priv/config

With Torque 6, you also need trqauthd running to authorise client connections to the server.

Get the server running again with pbs_server, pbs_sched and trqauthd (as root) at the command line.

Then check that it’s working with qmgr -c 'p s' (the space is important).

Finally, check that it works by starting an interactive PBS session with qsub -I as a normal user (you can’t run this as root).
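If the interactive session works, a tiny batch script is a reasonable next sanity check (the script name and resource line here are just assumptions – adjust to taste):

```shell
# Minimal sketch of a batch test job for the freshly built Torque.
cat > hello.pbs <<'EOF'
#!/bin/bash
#PBS -N hello
#PBS -l nodes=1:ppn=1
echo "Hello from $(hostname)"
EOF

qsub hello.pbs    # prints the new job ID
qstat             # the job should appear, run, then complete
```

The job’s stdout lands in a hello.o[jobid] file in the directory you submitted from.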

Should all work OK now!