Strategy
- class paddle.distributed.Strategy(config=None) [source]
The Strategy object is used to configure the parallelization and optimization strategies for static graph mode. It currently supports configuring sharding, fused_passes, gradient_merge and pipeline. More strategies will be supported in the future.

- sharding is used to configure the sharding states of the optimizer, which saves GPU memory.
- fused_passes is used to configure the fusion of the computation in the model.
- gradient_merge is used to configure the gradient merge strategy in training.
- pipeline is used to configure the pipeline parallelism strategy.

Parameters
config (dict|None, optional) – The user-defined configurations. If config is None, use default configurations. If it is a dict, the items inside the dict will be used to set the configurations, and the others remain the default values.
Examples
>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()

>>> strategy.sharding.enable = True
>>> strategy.sharding.stage = 2
>>> strategy.sharding.degree = 2

>>> strategy.gradient_merge.enable = True
>>> strategy.gradient_merge.k_steps = 2
>>> strategy.gradient_merge.avg = False

>>> strategy.pipeline.enable = True
>>> strategy.pipeline.schedule_mode = "1F1B"  # default is "1F1B"
>>> strategy.pipeline.micro_batch_size = 2
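The config argument can also be passed as a dict, as described under Parameters. The following is a minimal sketch, assuming the dict is nested by strategy name and then by config name, as suggested by the attribute access above; the exact schema is an assumption, not confirmed by this page:

>>> import paddle.distributed as dist

>>> # assumed nesting: {strategy name: {config name: value}}
>>> config = {
...     "sharding": {"enable": True, "stage": 2, "degree": 2},
...     "gradient_merge": {"enable": True, "k_steps": 2},
... }
>>> strategy = dist.Strategy(config)
>>> # items not present in the dict keep their default values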
- property sharding [source]
sharding is used to configure the sharding states of the optimizer, containing the following configs:

- enable (bool): whether to enable sharding. Default: False.
- stage (int): can be set to 1, 2 or 3. 1 indicates optimizer state segmentation, 2 indicates optimizer state and gradient segmentation, and 3 indicates the segmentation of optimizer states, gradients and parameters. Default: 1.
- degree (int): the number of segmentation pieces. Default: 8.

Examples
>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()
>>> strategy.sharding.enable = True
>>> strategy.sharding.stage = 2
>>> strategy.sharding.degree = 2
- property gradient_merge
gradient_merge is used to configure the gradient merge strategy in training, containing the following configs:

- enable (bool): whether to enable gradient merge. Default: False.
- k_steps (int): the number of steps for merging gradients. Default: 1.
- avg (bool): whether to average the gradients of each step. Default: True.

Examples
>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()
>>> strategy.gradient_merge.enable = True
>>> strategy.gradient_merge.k_steps = 2
>>> strategy.gradient_merge.avg = True
- property fused_passes
fused_passes is used to configure the fusion of the computation in the model, containing the following configs:

- enable (bool): whether to enable fused passes. Default: False.
- gemm_epilogue (bool): whether to fuse the matmul and add computation in the Linear layer. Default: False.
- dropout_add (bool): whether to fuse the dropout and add computation. Default: False.

Examples
>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()
>>> strategy.fused_passes.enable = True
>>> strategy.fused_passes.gemm_epilogue = True
>>> strategy.fused_passes.dropout_add = True
- property pipeline
pipeline is used to configure the pipeline parallelism, containing the following configs:

- enable (bool): whether to enable pipeline parallelism. Default: False.
- schedule_mode (str): the scheduling mode of pipeline parallelism. Default: "1F1B".
- micro_batch_size (int): the size of each micro-batch inside a mini-batch. Default: 1.
- accumulate_steps (int): the number of accumulation steps. Default: 1.

Examples
>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()
>>> strategy.pipeline.enable = True
>>> strategy.pipeline.micro_batch_size = 2
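For completeness, a sketch that also sets the remaining documented pipeline configs, schedule_mode and accumulate_steps; the values are illustrative, not recommendations:

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()
>>> strategy.pipeline.enable = True
>>> strategy.pipeline.schedule_mode = "1F1B"
>>> strategy.pipeline.micro_batch_size = 2
>>> strategy.pipeline.accumulate_steps = 4  # illustrative value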
- property amp
amp is used to configure automatic mixed precision (AMP) training, containing the following configs:

- enable (bool): whether to enable AMP. Default: False.
- dtype (str): the data type of AMP. Default: "float16".
- level (str): the level of AMP. Default: "O1".
- init_loss_scaling (float): the initial value of loss scaling. Default: 32768.0.
- incr_every_n_steps (int): the number of steps for increasing loss scaling. Default: 1000.
- decr_every_n_nan_or_inf (int): the number of steps with nan or inf before decreasing loss scaling. Default: 2.
- incr_ratio (float): the ratio for increasing loss scaling. Default: 2.0.
- decr_ratio (float): the ratio for decreasing loss scaling. Default: 2.0.
- use_dynamic_loss_scaling (bool): whether to use dynamic loss scaling. Default: False.
- custom_white_list (list): the list of op names for which AMP will be applied. Default: [].
- custom_black_list (list): the list of op names for which AMP will not be applied. Default: [].
- custom_black_varnames (list): the list of variable names for which AMP will not be applied. Default: [].
- use_fp16_guard (bool): whether to use fp16 guard. Default: False.
- use_bf16_guard (bool): whether to use bf16 guard. Default: False.
- use_master_grad (bool): whether to use master grad. Default: False.

Examples
>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()
>>> strategy.amp.enable = True
>>> strategy.amp.dtype = "float16"
>>> strategy.amp.level = "O2"
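A sketch that additionally enables dynamic loss scaling, using only the config names documented above; the values are illustrative, not recommendations:

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()
>>> strategy.amp.enable = True
>>> strategy.amp.dtype = "float16"
>>> strategy.amp.level = "O2"
>>> # dynamic loss scaling, using the documented defaults as example values
>>> strategy.amp.use_dynamic_loss_scaling = True
>>> strategy.amp.init_loss_scaling = 32768.0
>>> strategy.amp.incr_every_n_steps = 1000
>>> strategy.amp.decr_every_n_nan_or_inf = 2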