I’m trying to configure/use autoscaling on an Azure HPC Slurm cluster, but on the scheduler VM azslurm scale fails with a missing Python module, and it also looks like the expected NFS share(s) are not mounted.
Environment:
- Node: compular-scheduler (Slurm scheduler VM on Azure)
- Running as root via sudo -i
- Disk and mounts look like this:
root@compular-scheduler:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        62G   36G   27G  58% /
tmpfs           7.9G     0  7.9G   0% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
tmpfs           1.6G  4.0K  1.6G   1% /run/user/1000
I was expecting to see one or more NFS-mounted filesystems for the shared storage (e.g. /shared, /apps, or similar), but they do not appear in df -h on the scheduler.
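For reference, these are the generic NFS diagnostics I have been using to narrow this down (nothing Azure-specific; <nfs-server> is a placeholder, since I don't know which host is supposed to export the shares):

# check fstab for NFS entries that should have been mounted at boot
root@compular-scheduler:~# grep -i nfs /etc/fstab

# check for systemd mount/automount units that might manage the shares
root@compular-scheduler:~# systemctl list-units --type=mount --type=automount

# list currently mounted NFS filesystems (should match df -h)
root@compular-scheduler:~# mount -t nfs,nfs4

# if the export host were known, list what it exports
root@compular-scheduler:~# showmount -e <nfs-server>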
When I run azslurm scale I get: <private data>
The azslurm script points to a virtual environment under /opt/azurehpc/slurm/venv and imports slurmcc:
$ head /opt/azurehpc/slurm/venv/bin/azslurm
#!/opt/azurehpc/slurm/venv/bin/python
import os
if "SCALELIB_LOG_USER" not in os.environ:
    os.environ["SCALELIB_LOG_USER"] = "slurm"
if "SCALELIB_LOG_GROUP" not in os.environ:
    os.environ["SCALELIB_LOG_GROUP"] = "slurm"
from slurmcc.cli import main
Python in that venv is:
$ which python
/opt/azurehpc/slurm/venv/bin/python
But slurmcc does not seem to be installed there:
$ python -m pip list | grep -i slurm
# (no output)
$ python -m pip show slurmcc
WARNING: Package(s) not found: slurmcc
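To confirm the problem is in the venv itself rather than my shell environment, I also tried importing the module with the venv's interpreter directly, which fails with the same missing-module error as azslurm:

$ /opt/azurehpc/slurm/venv/bin/python -c "import slurmcc"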
So the azslurm entry-point script is present, but the underlying slurmcc package is missing from /opt/azurehpc/slurm/venv. Combined with the absence of any NFS-mounted shared storage in df -h, this makes me suspect the Slurm/Azure integration or the node provisioning did not complete correctly.
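In case it matters for the answer: my current (untested) idea is to reinstall slurmcc into the existing venv from source. I believe azslurm/slurmcc come from the Azure/cyclecloud-slurm project on GitHub, but that is an assumption on my part, and I don't know whether doing this by hand is safe on an already-provisioned scheduler, hence the question. Roughly:

# ASSUMPTION: slurmcc ships with the Azure/cyclecloud-slurm project
$ git clone https://github.com/Azure/cyclecloud-slurm.git
# <path-to-slurmcc-package> is a placeholder -- I have not confirmed
# where the Python package lives inside that repo
$ /opt/azurehpc/slurm/venv/bin/python -m pip install <path-to-slurmcc-package>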
My questions:
- What is the correct way to (re)install or repair the slurmcc package and the azslurm environment on an Azure HPC Slurm scheduler VM?
- Is there an official script/extension or documented procedure to re-run the Azure Slurm connector / autoscaling installation on an existing scheduler without breaking the cluster?
- Should the scheduler normally have NFS-mounted shared storage visible in df -h (e.g. /shared, /apps, or similar)? If yes, what is the recommended way to verify and/or re-mount the expected NFS shares on the scheduler node? I have important data on the NFS disk, so I want to avoid anything destructive (see the remount sketch after this list).
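For the last question, this is the kind of cautious remount I would attempt once the correct export is known; every value in angle brackets is a placeholder, and I'd appreciate confirmation before running anything, given the data on the share:

# mount read-only first, so nothing touches the data until it looks right
root@compular-scheduler:~# mkdir -p /shared
root@compular-scheduler:~# mount -t nfs -o ro <nfs-server>:<export-path> /shared
root@compular-scheduler:~# ls /shared
# if the contents look correct, remount read-write
root@compular-scheduler:~# mount -o remount,rw /shared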
Any guidance on restoring a working azslurm scale command and ensuring the scheduler's NFS mounts are correctly configured would be appreciated.