Troubleshooting
===============
This page describes common problems and possible solutions.
Processing drives are full
--------------------------
When a node is down because its processing drives are full, run the following to remove all files in processing that have not been accessed in the last 60 days (set ``DIR`` to the processing directory first):

.. code-block:: bash

   find "$DIR" -type f -atime +60 -delete
   find "$DIR" -type d -empty -delete
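Before deleting anything, it can help to preview the candidates by dropping ``-delete``. The sketch below runs against a throwaway directory created with ``mktemp``; on a real node, ``DIR`` would instead point at the processing directory.

.. code-block:: bash

   # Sketch: $DIR is a throwaway directory standing in for the real
   # processing directory on the node.
   DIR=$(mktemp -d)
   touch "$DIR/fresh.dat"                  # accessed just now, so it is kept
   find "$DIR" -type f -atime +60          # preview: list files idle 60+ days
   find "$DIR" -type f -atime +60 -delete  # the actual cleanup
   find "$DIR" -type d -empty -delete      # prune now-empty directories
   rm -rf "$DIR"                           # tear down the demo directory

Because the demo file was just created, the preview prints nothing and the file survives the cleanup; only files untouched for 60 or more days match.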
I/O error when submitting a batch job
-------------------------------------
If you get the following error when submitting a batch job, the logfile storage is likely full:

.. code-block:: text

   sbatch: error: Batch job submission failed: I/O error writing script/environment to file
You can fix this by using ``rsync`` to copy all logs to ``/mnt/archive/logs`` and then deleting all logs on atlas except for today's.
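The archive-then-clean step above can be sketched as follows. This demo uses throwaway directories so it is safe to run anywhere; in production, ``SRC`` would be the log directory on atlas (its actual path is not given here) and ``DST`` would be ``/mnt/archive/logs``.

.. code-block:: bash

   # Sketch of the archive-then-clean pattern, using throwaway directories.
   # SRC stands in for the (unspecified) log directory on atlas,
   # DST stands in for /mnt/archive/logs.
   SRC=$(mktemp -d)
   DST=$(mktemp -d)
   touch "$SRC/job-0001.log"
   rsync -a "$SRC"/ "$DST"/               # copy everything to the archive
   find "$SRC" -type f -mtime +0 -delete  # remove logs not modified today
   rm -rf "$SRC" "$DST"                   # tear down the demo directories

``-mtime +0`` matches files last modified more than 24 hours ago, which is how "except for today" is approximated here.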
Node in drain
-------------

When a node is in drain, you can use the following command to get it out of drain:

.. code-block:: bash

   sudo scontrol update NodeName=<nodename> state=RESUME
Homefolder is exceeding quota
-----------------------------

Each homefolder has a 300G quota that covers both the homefolder itself and its snapshots, so snapshots may be taking up too much space. You can list the snapshots on rhea by running:

.. code-block:: bash

   zfs list -t snapshot project-pool/network_homes/<username>
You can prune snapshots with the `zfs-prune-snapshots <https://github.com/bahamas10/zfs-prune-snapshots/blob/master/zfs-prune-snapshots>`_ script. For example, to delete snapshots whose names start with ``monthly`` and that are older than 6 months in ``project-pool/network_homes``, run:

.. code-block:: bash

   zfs-prune-snapshots -p monthly 6M project-pool/network_homes
Use ``zfs send`` and ``zfs receive`` to move the snapshots to another location. For example:

.. code-block:: bash

   zfs send project-pool/network_homes/${USER}@monthly-2024-02-01 > /mnt/to_tesla/${USER}
See also ``write_to_tape.sh`` on rhea. To rebuild the filesystem from a snapshot stream, use:

.. code-block:: bash

   sudo zfs receive project-pool/projects/eric-zfs-receive-new < /project-pool/projects/eric-zfs-send/snap