group_sharded_parallel
- paddle.distributed.sharding.group_sharded_parallel(model: Layer, optimizer: Optimizer, level: Literal['os', 'os_g', 'p_g_os'], scaler: GradScaler | None = None, group: Group | None = None, offload: bool = False, sync_buffers: bool = False, buffer_max_size: int = 8388608, segment_size: int = 1048576, sync_comm: bool = False, dp_group: Group | None = None, exclude_layer: Sequence[str | int] | None = None ) → tuple[Layer, Optimizer, GradScaler] [source]
group_sharded_parallel applies a group sharded configuration to the model, optimizer and GradScaler. level accepts three string options, 'os', 'os_g' and 'p_g_os', corresponding to three usage scenarios: optimizer state segmentation, optimizer state + gradient segmentation, and parameter + gradient + optimizer state segmentation. In practice, optimizer state + gradient segmentation is a further optimization of optimizer state segmentation, so 'os_g' can be used wherever optimizer state segmentation is needed.
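As a quick, non-authoritative illustration, the level string alone selects the sharding scenario; model and optimizer are assumed to be already constructed and the collective environment already initialized:

>>> # 'os'     -> shard optimizer state only (stage1)
>>> # 'os_g'   -> shard optimizer state + gradients (stage2)
>>> # 'p_g_os' -> shard parameters + gradients + optimizer state (stage3)
>>> model, optimizer, scaler = group_sharded_parallel(model, optimizer, level="os_g")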
- Parameters
model (Layer) – The layer to be wrapped with group_sharded_parallel.
optimizer (Optimizer) – The optimizer to be wrapped with group_sharded_parallel.
level (str) – The sharding level to use: 'os', 'os_g' or 'p_g_os'.
scaler (GradScaler|None, optional) – If AMP is used, you need to pass GradScaler. Defaults to None, indicating that GradScaler is not used.
group (Group|None, optional) – The group instance. Defaults to None, indicating that the default environment group is used.
offload (bool, optional) – Whether to use the offload function. Defaults to False, which means that the offload function is not used.
sync_buffers (bool, optional) – Whether to broadcast model buffers. It is generally needed when the model has registered buffers. Defaults to False, indicating that model buffers are not broadcast.
buffer_max_size (int, optional) – The maximum size of the buffer used to fuse gradients in os_g. The larger the size, the more GPU memory is used. Defaults to 2**23.
segment_size (int, optional) – The smallest size of a parameter to be sharded in p_g_os. Defaults to 2**20, meaning the minimum segmented parameter size is 2**20.
sync_comm (bool, optional) – Whether to use synchronous communication; only used in p_g_os. Defaults to False, indicating that asynchronous communication is used.
dp_group (Group|None, optional) – The data parallel communication group, which allows combining stage2 or stage3 sharding with dp hybrid communication. Defaults to None.
exclude_layer (list|None, optional) – Layers to exclude from slicing in sharding stage3. Each entry must be either a layer type name or the id of a specific layer, for example, exclude_layer=["GroupNorm", id(model.gpt.linear)]. Defaults to None. A usage sketch follows this parameter list.
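The sketch below is illustrative only and shows how several of these optional arguments fit together in a stage3 ('p_g_os') configuration; it assumes fleet.init(is_collective=True) has already run and that model and optimizer are built as in the Examples section below.

>>> # Illustrative sketch: stage3 sharding with AMP and GroupNorm layers kept unsliced
>>> scaler = paddle.amp.GradScaler(init_loss_scaling=32768)
>>> model, optimizer, scaler = group_sharded_parallel(
...     model,
...     optimizer,
...     level="p_g_os",
...     scaler=scaler,                # required when AMP is enabled
...     segment_size=2**20,           # minimum parameter segment size for p_g_os
...     exclude_layer=["GroupNorm"],  # keep GroupNorm layers unsliced in stage3
... )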
- Returns
model: A wrapper for the given model with group sharded applied.
optimizer: A wrapper for the given optimizer with group sharded applied.
scaler: A wrapper for the given scaler with group sharded applied.
- Return type
tuple[Layer, Optimizer, GradScaler]
Examples
>>> # type: ignore
>>> import paddle
>>> from paddle.nn import Linear
>>> from paddle.distributed import fleet
>>> from paddle.distributed.sharding import group_sharded_parallel

>>> fleet.init(is_collective=True)
>>> group = paddle.distributed.new_group([0, 1])
>>> model = Linear(1000, 1000)

>>> clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0)
>>> optimizer = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters(), weight_decay=0.00001, grad_clip=clip)
>>> scaler = paddle.amp.GradScaler(init_loss_scaling=32768)

>>> # wrap sharding model, optimizer and scaler
>>> model, optimizer, scaler = group_sharded_parallel(model, optimizer, "p_g_os", scaler=scaler)

>>> # `data` is a batch (img, label) taken from your data loader
>>> img, label = data
>>> label.stop_gradient = True
>>> img.stop_gradient = True

>>> out = model(img)
>>> loss = paddle.nn.functional.cross_entropy(input=out, label=label)

>>> loss.backward()
>>> optimizer.step()
>>> optimizer.clear_grad()
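If the wrapped scaler is actually used for AMP training, the backward and update steps typically go through the scaler rather than plain loss.backward() and optimizer.step(). This is a hedged sketch, reusing the wrapped model, optimizer and scaler from the example above:

>>> # Sketch only: an AMP step with the wrapped scaler
>>> with paddle.amp.auto_cast():
...     out = model(img)
...     loss = paddle.nn.functional.cross_entropy(input=out, label=label)
>>> scaler.scale(loss).backward()  # scale the loss, then backward
>>> scaler.step(optimizer)         # unscale gradients and apply the optimizer step
>>> scaler.update()                # update the loss scaling factor
>>> optimizer.clear_grad()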