group_sharded_parallel
paddle.distributed.sharding.group_sharded_parallel(model, optimizer, level, scaler=None, group=None, offload=False, sync_buffers=False, buffer_max_size=8388608, segment_size=1048576, sync_comm=False, dp_group=None, exclude_layer=None)
group_sharded_parallel applies group sharded configuration to the model, optimizer and GradScaler. level accepts three string options, 'os', 'os_g' and 'p_g_os', corresponding to three usage scenarios: sharding of optimizer states, sharding of optimizer states plus gradients, and sharding of parameters plus gradients plus optimizer states. Because optimizer state + gradient sharding is in practice a further optimization of plain optimizer state sharding, 'os_g' can be used wherever optimizer state sharding alone is wanted.
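For orientation, here is a minimal sketch of how the three level strings are passed (an illustration only, assuming fleet.init(is_collective=True) has already been called and model and optimizer are already built); pick exactly one level per model/optimizer pair:

>>> # shard optimizer states only
>>> model, optimizer, _ = group_sharded_parallel(model, optimizer, level="os")
>>> # or: shard optimizer states + gradients (stage 2)
>>> # model, optimizer, _ = group_sharded_parallel(model, optimizer, level="os_g")
>>> # or: shard parameters + gradients + optimizer states (stage 3)
>>> # model, optimizer, _ = group_sharded_parallel(model, optimizer, level="p_g_os")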
Parameters
model (Layer) – The layer to be wrapped with group_sharded_parallel.
optimizer (Optimizer) – The optimizer to be wrapped with group_sharded_parallel.
level (str) – The sharding level to apply: 'os', 'os_g' or 'p_g_os'.
scaler (GradScaler, optional) – If AMP is used, you need to pass GradScaler. Defaults to None, indicating that GradScaler is not used.
group (Group, optional) – The group instance. Defaults to None, indicating that the default environment group is used.
offload (bool, optional) – Whether to use the offload function. Defaults to False, which means that the offload function is not used.
sync_buffers (bool, optional) – Whether to broadcast model buffers across ranks. Generally needed when the model has registered buffers. Defaults to False, indicating that model buffers are not synchronized.
buffer_max_size (int, optional) – The maximum size of the buffer used to fuse gradients in os_g. The larger the buffer, the more GPU memory is used. Defaults to 2**23 (8388608).
segment_size (int, optional) – The smallest parameter size to be segmented for sharding in p_g_os. Defaults to 2**20 (1048576).
sync_comm (bool, optional) – Whether to use synchronous communication; only used with p_g_os. Defaults to False, indicating that asynchronous communication is used.
dp_group (Group, optional) – The data parallel communication group; allows combining stage2 or stage3 sharding with data parallel hybrid communication. Defaults to None.
exclude_layer (list, optional) – Layers to exclude from sharding in stage3 (p_g_os), for example exclude_layer=["GroupNorm", id(model.gpt.linear)]. Each entry must be either a layer class name or a specific layer's id. See the sketch after this parameter list for an example of dp_group and exclude_layer.
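The following is a hedged sketch of how dp_group and exclude_layer might be combined with stage 3 sharding. The group objects and the excluded GroupNorm layer are illustrative assumptions, not requirements of this API:

>>> # a minimal sketch; sharding_group and dp_group are assumed to be created by
>>> # your own hybrid-parallel setup (hypothetical groups, shown for illustration)
>>> model, optimizer, _ = group_sharded_parallel(
...     model, optimizer, level="p_g_os",
...     group=sharding_group,            # sharding communication group
...     dp_group=dp_group,               # combine stage3 sharding with data parallel
...     exclude_layer=["GroupNorm"],     # keep GroupNorm layers unsharded (stage3 only)
... )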
Returns

model: A wrapper for the group sharded model.
optimizer: A wrapper for the group sharded optimizer.
scaler: A wrapper for the group sharded scaler.

Return type

tuple of (model, optimizer, scaler)
Examples
>>> import paddle
>>> from paddle.nn import Linear
>>> from paddle.distributed import fleet
>>> from paddle.distributed.sharding import group_sharded_parallel

>>> fleet.init(is_collective=True)
>>> group = paddle.distributed.new_group([0, 1])
>>> model = Linear(1000, 1000)

>>> clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0)
>>> optimizer = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters(), weight_decay=0.00001, grad_clip=clip)
>>> scaler = paddle.amp.GradScaler(init_loss_scaling=32768)  # required here because scaler is passed below

>>> # wrap sharding model, optimizer and scaler
>>> model, optimizer, scaler = group_sharded_parallel(model, optimizer, "p_g_os", scaler=scaler)

>>> img, label = data  # a batch from your data loader
>>> label.stop_gradient = True
>>> img.stop_gradient = True

>>> out = model(img)
>>> loss = paddle.nn.functional.cross_entropy(input=out, label=label)

>>> loss.backward()
>>> optimizer.step()
>>> optimizer.clear_grad()
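For completeness, a hedged variant of the training step when the returned scaler is actually used for AMP. The use of paddle.amp.auto_cast and the scale/minimize pattern are illustrative, not mandated by this API:

>>> # a minimal AMP sketch, reusing the model, optimizer and scaler wrapped above
>>> with paddle.amp.auto_cast():
...     out = model(img)
...     loss = paddle.nn.functional.cross_entropy(input=out, label=label)
>>> scaled = scaler.scale(loss)         # scale the loss before backward
>>> scaled.backward()
>>> scaler.minimize(optimizer, scaled)  # unscale gradients and run the optimizer step
>>> optimizer.clear_grad()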