Training

Training#

After installing the software to train a model you can run the following to train a model you can use the direct train command.

To train on a single machine run the following code block in your linux machine:

direct train <experiment_directory> --num-gpus <number_of_gpus> --cfg <path_or_url_to_yaml_file> \
        [--training-root <training_data_root> --validation-root <validation_data_root>]  [--other-flags]

To train on multiple machines run the following code (one command on each machine):

(machine0)$ direct train <experiment_directory> --num-gpus <number_of_gpus> --cfg <path_or_url_to_yaml_file> \
                --machine-rank 0 --num-machines 2 --dist-url <URL> \
                [--training-root <training_data_root> --validation-root <validation_data_root>]  [--other-flags]
(machine1)$ direct train <experiment_directory> --num-gpus <number_of_gpus> --cfg <path_or_url_to_yaml_file> \
                --machine-rank 1 --num-machines 2 --dist-url <URL> \
                [--training-root <training_data_root> --validation-root <validation_data_root>]  [--other-flags]

The above command will start the training and will create an experiment directory in <experiment_directory>/base_<experiment_name>. If you are performing an experiment on a CPU (not recommended) replace --num-gpus <number_of_gpus> with --device 'cpu:0'.

In <experiment_directory>/base_<experiment_name> there will be stored the logs of the experiment, model checkpoints (e.g model_<checkpoint_number>.pt), training and validation metrics, and a config.yaml file which includes all the configuration parameters of the experiment (as stated in the yaml file <path_or_url_to_yaml_file>).

Some Datasets (e.g. the FastMRI or Calgary-Campinas datasets) require to download the respective the training and validation data. Assuming that the data are stored in training_data_root and validation_data_root, you can use the --training-root and --validation-root flags to pass these arguments to the command line too. For instance:

direct train <experiment_directory> --training-root <training_data_root> --validation-root <validation_data_root> \
        --num-gpus <number_of_gpus> --cfg <path_or_url_to_yaml_file> [--other-flags]

Training model configurations can be found in the projects folder.

During training, training loss, validation metrics and validation image predictions are logged. Additionally, Tensorboard allows for visualization of the above.