ModelParallelStrategy
class lightning.pytorch.strategies.ModelParallelStrategy(data_parallel_size='auto', tensor_parallel_size='auto', save_distributed_checkpoint=True, process_group_backend=None, timeout=datetime.timedelta(seconds=1800))
Bases: ParallelStrategy

Enables user-defined parallelism applied to a model.

Warning: This is an experimental feature.

Currently supports up to 2D parallelism. Specifically, it supports the combination of Fully Sharded Data-Parallel 2 (FSDP2) with Tensor Parallelism (DTensor). These PyTorch APIs are currently still experimental (see https://pytorch.org/docs/stable/distributed.tensor.parallel.html). Requires PyTorch 2.4 or newer. A usage sketch follows the parameter list below.

Parameters:
- data_parallel_size (Union[Literal['auto'], int]) – The number of devices within a data-parallel group. Defaults to "auto", which sets this size to the number of nodes in the cluster.
- tensor_parallel_size (Union[Literal['auto'], int]) – The number of devices within a tensor-parallel group. Defaults to "auto", which sets this size to the number of GPUs in a single node.
- save_distributed_checkpoint (bool) – If True, each rank saves its shard of weights and optimizer states to a file. The checkpoint is a folder with as many files as the world size. If False, the full weights and optimizer states get assembled on rank 0 and saved to a single file.
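A minimal usage sketch, assuming a user-defined LightningModule named `MyModel` (hypothetical, not shown) that applies its own parallelization, for example inside its `configure_model()` hook. The device counts below are placeholders for a single-node, 8-GPU layout:

```python
import lightning as L
from lightning.pytorch.strategies import ModelParallelStrategy

# Hypothetical layout: 2 FSDP2 (data-parallel) groups, each spanning a
# 4-way tensor-parallel group, for a total of 8 GPUs on one node.
strategy = ModelParallelStrategy(
    data_parallel_size=2,
    tensor_parallel_size=4,
)

trainer = L.Trainer(accelerator="gpu", devices=8, strategy=strategy)
trainer.fit(MyModel())  # MyModel: user-defined LightningModule (assumed)
```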
 
barrier(name=None)
Synchronizes all processes, blocking them until the whole group enters this function.
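For illustration, a hedged sketch of calling the barrier from user code through an existing Trainer configured with this strategy; `prepare_dataset_on_disk` is a hypothetical helper:

```python
# Let rank 0 prepare data on disk while all other ranks wait.
if trainer.is_global_zero:
    prepare_dataset_on_disk()  # hypothetical helper, runs only on rank 0
trainer.strategy.barrier()  # every rank blocks here until all ranks have arrived
```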
lightning_module_state_dict()
Collects the state dict of the model. Only returns a non-empty state dict on rank 0 if save_distributed_checkpoint=False.
optimizer_state(optimizer)
Collects the state of the given optimizer. Only returns a non-empty state dict on rank 0 if save_distributed_checkpoint=False.
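A sketch of how the two collection methods above behave, assuming save_distributed_checkpoint=False and an already-running Trainer; the output path is a placeholder:

```python
import torch

# Hedged sketch (assumes save_distributed_checkpoint=False): only rank 0
# receives the assembled, non-empty state dicts; other ranks get empty dicts.
model_state = trainer.strategy.lightning_module_state_dict()
optim_state = trainer.strategy.optimizer_state(trainer.strategy.optimizers[0])

if trainer.is_global_zero:
    torch.save({"state_dict": model_state, "optimizer": optim_state}, "full_states.pt")
```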
reduce(tensor, group=None, reduce_op='mean')
Reduces the given tensor (e.g. across GPUs/processes).
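As an illustration, a small hedged sketch that averages a per-rank scalar across all processes; the metric value and variable names are placeholders:

```python
import torch

# Each rank computes a local value; reduce() averages it across all processes.
local_metric = torch.tensor(0.123, device=trainer.strategy.root_device)
global_metric = trainer.strategy.reduce(local_metric, reduce_op="mean")
```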
save_checkpoint(checkpoint, filepath, storage_options=None)
Saves model/training states as a checkpoint through a state dump and file write.
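A sketch of the resulting on-disk layout when saving through the Trainer; the path is a placeholder, and the folder-vs-file behavior follows the save_distributed_checkpoint flag described above:

```python
trainer.save_checkpoint("checkpoints/step-1000.ckpt")  # placeholder path
# With save_distributed_checkpoint=True (the default), the target becomes a folder
# holding one shard file per rank; with False, a single consolidated file is written.
```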
setup(trainer)
Sets up the accelerator and plugins, and initializes the optimizers (if needed).
setup_environment()
Sets up any processes or distributed connections. This is called before the LightningModule/DataModule setup hook, which allows the user to access the accelerator environment before setup is complete.
Return type: None
 
teardown()
This method is called to tear down the training process. It is the right place to release memory and free other resources.
Return type: None
 
property lightning_restore_optimizer: bool
Override to disable Lightning restoring optimizers/schedulers. This is useful for strategies that manage restoring optimizers/schedulers themselves.