ahcore.data package#

Submodules#

ahcore.data.dataset module#

Utilities to construct datasets and DataModules from manifests.

class ahcore.data.dataset.DlupDataModule(data_description: DataDescription, pre_transform: Callable[[bool], Callable[[dict[str, Any]], dict[str, Any]]], batch_size: int = 32, validate_batch_size: int | None = None, num_workers: int = 16, persistent_workers: bool = False, pin_memory: bool = False)[source]#

Bases: LightningDataModule

Datamodule for the Ahcore framework. This datamodule is based on dlup.

Construct a DataModule based on a manifest.

Parameters:
data_description : DataDescription

See ahcore.utils.data.DataDescription for more information.

pre_transform : Callable

A pre-transform is a callable which is applied directly to the output of the dataset, before collation in the dataloader. Such transforms typically convert the image in the output to a tensor, convert the WsiAnnotations to a mask, and so on.

batch_size : int

The batch size of the data loader.

validate_batch_size : int, optional

The batch size used for validation, which can often be larger than the training batch size. If set, this value is also used for prediction.

num_workers : int

The number of workers used to fetch tiles.

persistent_workers : bool

Whether to use persistent workers. See the PyTorch documentation for more information.

pin_memory : bool

Whether to use CUDA pinned memory. See the PyTorch documentation for more information.
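
A minimal construction sketch (illustrative only): the DataDescription fields are elided, and the boolean argument to the pre_transform factory is an assumption based on the signature above, not documented behavior.

from ahcore.data.dataset import DlupDataModule
from ahcore.utils.data import DataDescription


def pre_transform(requires_target: bool):
    # Factory per the signature: takes a bool, returns a callable that maps
    # a sample dict to a sample dict (e.g. image -> tensor, annotations -> mask).
    def _transform(sample: dict) -> dict:
        ...  # convert the image to a tensor, the WsiAnnotations to a mask, etc.
        return sample

    return _transform


data_description = DataDescription(...)  # fill in your manifest and paths

datamodule = DlupDataModule(
    data_description=data_description,
    pre_transform=pre_transform,
    batch_size=32,
    validate_batch_size=64,  # optional larger batch size for validation/prediction
    num_workers=8,
    persistent_workers=True,
    pin_memory=True,
)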

property data_manager: DataManager#

predict_dataloader() → DataLoader[dict[str, Any]] | None[source]#

An iterable or collection of iterables specifying prediction samples.

For more information about multiple dataloaders, see the PyTorch Lightning documentation.

It’s recommended that all data downloads and preparation happen in prepare_data().

Note:

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying prediction samples.
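
A typical way this dataloader is consumed is via the standard Lightning predict loop; trainer.predict() calls predict_dataloader() internally (generic Lightning usage, not ahcore-specific; model stands for your LightningModule).

import pytorch_lightning as pl

trainer = pl.Trainer(accelerator="auto", devices=1)
# The trainer fetches the dataloader from the datamodule and runs prediction.
predictions = trainer.predict(model, datamodule=datamodule)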

setup(stage: str) → None[source]#

Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.

Args:

stage: either 'fit', 'validate', 'test', or 'predict'

Example:

class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()

        # don't do this: state assigned here is not shared across processes
        self.something = some_state

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)

teardown(stage: str | None = None) → None[source]#

Called at the end of fit (train + validate), validate, test, or predict.

Args:

stage: either 'fit', 'validate', 'test', or 'predict'

test_dataloader() → DataLoader[dict[str, Any]] | None[source]#

An iterable or collection of iterables specifying test samples.

For more information about multiple dataloaders, see the PyTorch Lightning documentation.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.
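
As a sketch of this pattern in a generic LightningDataModule (illustrative only; download_data and load_data are hypothetical helpers):

import pytorch_lightning as pl
from torch.utils.data import DataLoader


class MyDataModule(pl.LightningDataModule):
    def prepare_data(self):
        # Runs once, on a single process: download and preprocess, assign no state.
        download_data()

    def setup(self, stage: str):
        # Runs on every process: build and split datasets, keep them as state.
        self.test_set = load_data()  # splitting logic elided

    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size=32)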

Warning

Do not assign state in prepare_data().

Note:

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note:

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

train_dataloader() → DataLoader[dict[str, Any]] | None[source]#

An iterable or collection of iterables specifying training samples.

For more information about multiple dataloaders, see the PyTorch Lightning documentation.

The dataloader you return will not be reloaded unless you set reload_dataloaders_every_n_epochs on the Trainer to a positive integer.
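
For example, to rebuild the training dataloader every epoch (a standard Trainer flag):

import pytorch_lightning as pl

# Re-invoke train_dataloader() each epoch instead of caching it once.
trainer = pl.Trainer(reload_dataloaders_every_n_epochs=1)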

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

Do not assign state in prepare_data().

Note:

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

property uuid: UUID#

This property is used to create a unique cache file for each dataset. The construction of the underlying datasets is completely determined by the data description, including the pre_transforms. Therefore, we can use the data description to create a UUID that is unique for each datamodule.

The UUID is computed by hashing the data description with the basemodel_to_uuid function, which takes a sha256 hash of the pickled object and converts it to a UUID. As pickles can change between Python versions, this UUID will differ across Python versions.

Returns:
UUID

A unique identifier for this datamodule.
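
A sketch of the hashing scheme described above (illustrative only; how basemodel_to_uuid derives the 128 UUID bits from the sha256 digest is an assumption):

import hashlib
import pickle
import uuid


def object_to_uuid(obj) -> uuid.UUID:
    # sha256 over the pickled object; a UUID needs exactly 16 bytes.
    digest = hashlib.sha256(pickle.dumps(obj)).digest()
    return uuid.UUID(bytes=digest[:16])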

val_dataloader() → DataLoader[dict[str, Any]] | None[source]#

An iterable or collection of iterables specifying validation samples.

For more information about multiple dataloaders, see the PyTorch Lightning documentation.

The dataloader you return will not be reloaded unless you set reload_dataloaders_every_n_epochs on the Trainer to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data(). The related Trainer entry points and hooks are:

  • fit()

  • validate()

  • prepare_data()

  • setup()

Note:

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note:

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

ahcore.data.samplers module#

Module implementing the samplers. These are used, for instance, to create batches of tiles from the same WSI.

class ahcore.data.samplers.WsiBatchSampler(dataset: ConcatDataset[TiledROIsSlideImageDataset], batch_size: int)[source]#

Bases: Sampler[List[int]]

class ahcore.data.samplers.WsiBatchSamplerPredict(sampler: SequentialSampler | None = None, batch_size: int | None = None, drop_last: bool = False, dataset: ConcatDataset[TiledROIsSlideImageDataset] | None = None)[source]#

Bases: Sampler[List[int]]

This sampler is identical to WsiBatchSampler, but its signature is changed for compatibility with the predict phase of Lightning.
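
A usage sketch (illustrative only; concat_dataset stands for a ConcatDataset of TiledROIsSlideImageDataset instances, and the wiring follows the standard torch batch_sampler pattern):

from torch.utils.data import DataLoader

from ahcore.data.samplers import WsiBatchSampler

# Group tiles so that each batch is drawn from a single WSI.
batch_sampler = WsiBatchSampler(dataset=concat_dataset, batch_size=16)
loader = DataLoader(concat_dataset, batch_sampler=batch_sampler, num_workers=4)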

Module contents#

General module for datasets, samplers and lightning modules.

  • A generic dataset generated from a manifest, which can handle classification, detection, and segmentation.

  • Samplers that, for instance, perform adaptive sampling or define different weights per sample.