3.0 Beta Release Note¶
Overview of PaddlePaddle 3.0 Beta¶
The core features of this version mainly include new technologies such as dynamic-static unified auto parallel and automatic optimization via the neural network compiler, aiming to address the new challenges in the current deep learning field. PaddlePaddle Framework 3.0 Beta extends the design concepts of 2.x, such as dynamic-static unity and integrated training and inference, and its development interface is fully compatible with the 2.x versions. This means that code developed for version 2.x can, in most cases, run directly on version 3.x without modification. Several key features are detailed as follows:
Dynamic-static graph unified auto parallel: To make parallel training of large models easier to program, PaddlePaddle has optimized its semi-automatic parallel programming paradigm so that it is unified across dynamic and static graphs. Developers no longer need to master the complex concepts and APIs required for manual parallel programming; a small amount of tensor sharding annotation is enough to construct hybrid parallelism for a large model. The framework automatically derives distributed sharding states and inserts communication operators, and also supports one-click dynamic-to-static distributed training, dramatically simplifying the development of hybrid parallel training code. On the dynamic-static unity side, PaddlePaddle has comprehensively upgraded its dynamic-to-static training capability with bytecode-based dynamic-static conversion technology that supports adaptive graph construction. This has been verified on more than 700 PaddlePaddle industrial-grade models, achieving a 100% success rate for one-click dynamic-to-static training.
Automatic optimization via the neural network compiler: The PaddlePaddle Compiler Infrastructure for Neural Networks (CINN) is integrated with the framework and supports efficient training and dynamic-shape inference for generative models, scientific computing models, and others, providing a good balance between computational flexibility and high performance. Through automatic operator fusion and code generation technology, the inference performance of the Llama2 and Stable Diffusion models has been improved by 30%.
High-order automatic differentiation: To better support scientific computing scenarios, PaddlePaddle Framework designs and implements high-order automatic differentiation based on a combinatorial operator mechanism, combined with the automatic optimization technology of the neural network compiler. We have tested more than 40 differential equations in scientific computing scenarios, and the solution speed is 70% ahead of similar products in the industry.
Highly scalable intermediate representation: To improve the scalability of the PaddlePaddle framework, we have developed the highly scalable Paddle Intermediate Representation (PIR). This representation systematically abstracts the underlying core concepts and provides flexible and efficient components. PIR serves as the infrastructure supporting technologies such as dynamic-to-static conversion, automatic differentiation, auto parallel, combinatorial operators, and graph optimization; it is widely used in scenarios such as distributed training, model compression, and inference deployment. With the Declarative Rewrite Rule (DRR) mechanism provided by PIR, the development cost of a Pass can be reduced by 60%. We have tested over 900 model configurations; the results show that overall inference performance improves by more than 10% after switching to PIR.
Multi-hardware adaptation: PaddlePaddle provides a well-functioning and low-cost solution for adapting hardware to large models. New hardware only needs to implement around 30 interfaces to support training, compression, and inference of large models. Meanwhile, PaddlePaddle provides a compiler-based hardware access mode: hardware vendors only need to implement the compiler's code-generation back-end in the form of a plug-in to achieve efficient adaptation with the PaddlePaddle framework. This release additionally supports daily-release packages for four hardware platforms: Kunlun XPU, Ascend NPU, Hygon DCU, and Cambricon MLU.
This version also continues to improve some existing features of the 2.x framework, while the new features bring significant improvements in user experience, performance, ease of secondary development, and hardware adaptability. Beyond the core features above, this version continues to enrich and enhance the API surface to cover more scenarios, optimizes distributed parallel strategies and inference capabilities for large-model scenarios, thoroughly improves the ease of compilation and installation, upgrades the installation methods and versions of dependency packages, comprehensively strengthens system security, and carries out full error-correction checks on the product documentation. We have also cleaned up some deprecated code to keep the architecture simple. Without the new features enabled, PaddlePaddle 3.0 Beta remains mature and stable, and each new feature provides a switch for flexible control, making it easy for users to understand and compare the related features.
User Experience Upgrade¶
Incompatibility Upgrade¶
PaddlePaddle APIs now support type promotion. In the most common calculations such as addition, subtraction, multiplication, and division, if the two inputs have different data types, the data type of the output must be determined. Historically, PaddlePaddle supported this only partially, and the actual rules were unclear. In practice there were dynamic-static inconsistencies, inconsistencies between APIs and operator overloading, violations of commutativity, and hard-to-fix unexpected problems, especially for large models that mix bf16/fp16 with fp32 across a wide range of calculations. Starting from 3.0 Beta, PaddlePaddle has clarified its type promotion rules, defining in detail the result types of Tensor-vs-Tensor and Tensor-vs-scalar computations, ensuring that computation obeys the commutative law, that operator overloading is consistent with the binary APIs, and that dynamic-graph results are consistent with static-graph results. This is more in line with user expectations and industry practice; a short sketch of the new behavior follows. #60638, #63842, #60011
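A minimal sketch of the promotion rules described above (output dtypes follow the documented rules; values are illustrative):

```python
import paddle

a = paddle.ones([2], dtype="float16")
b = paddle.ones([2], dtype="float32")

# Tensor vs Tensor: the result is promoted to the wider type,
# identically for operator overloading and the binary API.
print((a + b).dtype)           # paddle.float32
print(paddle.add(b, a).dtype)  # paddle.float32 (commutative)

# Tensor vs scalar: a Python float scalar does not widen a
# low-precision tensor.
print((a * 2.0).dtype)         # paddle.float16
```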
Deprecated Features¶
0-dimensional Tensors have now been stably supported for two release cycles. This version removes the switch FLAGS_set_to_1d, which in some cases converted a 0-dimensional Tensor into a 1-dimensional Tensor with a single element. This switch existed only for compatibility with kits that incorrectly used a 1-element 1-dimensional Tensor to represent a 0-dimensional Tensor. That is, PaddlePaddle now fully distinguishes the semantics of a 0-dimensional Tensor from those of a 1-dimensional Tensor with a single element; the two are not equivalent, as illustrated below. #61227
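For example, with the switch removed, the two shapes are always kept distinct:

```python
import paddle

x = paddle.to_tensor(3.14)    # 0-dimensional Tensor (scalar semantics)
y = paddle.to_tensor([3.14])  # 1-dimensional Tensor with one element

print(x.shape, x.ndim)  # [] 0
print(y.shape, y.ndim)  # [1] 1

# Full reductions likewise return 0-dimensional Tensors:
print(paddle.ones([2, 3]).sum().shape)  # []
```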
New API Features¶
Compared with the previous version, this version adds 126 new APIs with richer functionality to better support the needs of large models and scientific computing. The details are as follows (a short usage sketch follows the list):
Add Tensor computation APIs: paddle.gammaln, paddle.gammainc, paddle.gammaincc, paddle.sinc, paddle.pdist, paddle.histogramdd, paddle.signbit, paddle.copysign, paddle.bitwise_right_shift/bitwise_left_shift, paddle.isposinf/isneginf/isreal, paddle.isin, paddle.hsplit/dsplit, paddle.column_stack/row_stack/dstack/hstack/vstack, paddle.slice_scatter, paddle.masked_scatter. #60553, #59311, #59357, #63521, #57869, #57880, #57882, #60150, #57785, #58092, #63523, #64001, #58917, #59127, #59973, #59383
Add probability distribution APIs: paddle.distribution.ContinuousBernoulli, paddle.distribution.MultivariateNormal, paddle.distribution.Exponential, paddle.distribution.Gamma, paddle.distribution.Binomial, paddle.distribution.Poisson. #58004, #57899, #57856
Add optimizer APIs: paddle.optimizer.ASGD, paddle.optimizer.NAdam, paddle.optimizer.RAdam, paddle.optimizer.Rprop. #58834, #63671, #58851
Add a linear algebra API: paddle.linalg.matrix_exp. #59715
Add other APIs: paddle.bernoulli_, paddle.nn.ZeroPad1D/ZeroPad3D, paddle.nn.AdaptiveLogSoftmaxWithLoss, paddle.Tensor.apply. #64252, #59690, #63728, #63302, #59374, #63227
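A short, hedged sketch exercising a few of the additions (signatures as suggested by the linked PRs; check the API docs for exact parameter names):

```python
import paddle

x = paddle.to_tensor([-1.5, 0.0, 2.0])
print(paddle.signbit(x))                           # [True, False, False]
print(paddle.copysign(x, paddle.to_tensor(-1.0)))  # magnitudes of x, sign of -1

# One of the new probability distributions (parameters per #57899):
gamma = paddle.distribution.Gamma(
    concentration=paddle.to_tensor(2.0),
    rate=paddle.to_tensor(1.0),
)
print(gamma.sample([3]).shape)  # [3]

# One of the new optimizers:
linear = paddle.nn.Linear(4, 4)
opt = paddle.optimizer.NAdam(learning_rate=1e-3,
                             parameters=linear.parameters())
```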
Some API Enhancements¶
Enhance about 30 APIs to support complex number computation, such as paddle.log, paddle.log1p, paddle.square, and paddle.reciprocal, extending support for more scientific computing scenarios (see the sketch after this list). #62448, #60821, #60897, #62764, #59536, #59529, #63207, #62237, #64684
Enhance 46 APIs to make existing APIs easier to use and easier to migrate code to, including but not limited to adding API parameters, extending the data types supported by the APIs, and fixing existing unreasonable designs. #59890, #63513, #59674, #62778, #64110, #63222, #64331, #64715, #61155, #60070, #61974, #62407, #62672, #62722, #62876, #63284, #63860, #60466, #63690, #63953, #63901, #62624, #59857, #60084, #60766, #62788, #62937, #63134, #62966, #63648, #63881, #64358, #60503, #63604, #62338
Enhance the unit-test infrastructure for higher-order differentiation, making it easier to add test cases for higher-order differentiation. #62074
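For instance, the complex-number enhancements cover elementwise math such as the following (a sketch, assuming complex64 inputs as described in the linked PRs):

```python
import paddle

z = paddle.to_tensor([1 + 1j, -2 + 0.5j], dtype="complex64")

print(paddle.log(z))         # complex logarithm, elementwise
print(paddle.square(z))      # z * z
print(paddle.reciprocal(z))  # 1 / z
```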
API Performance Improvements¶
Bug Fixing¶
Fix errors in paddle.optimizer.LBFGS caused by using non-Tensor computations. #60219
Fix non-reproducible random numbers in paddle.optimizer.LBFGS. #60591
Fix the incorrect gradient calculation of the set_value operator. #59034
Fix problems in adapting Tensor basic indexing to PIR. #60259, #61103
Fix the Tensor combined-index assignment problem. #60447
Fix the problem when reading values with Tensor combined indexing. #61922
Fix the paddle.flatten stride calculation error, and add paddle.flatten_. #63084
Fix the result inconsistency between paddle.index_fill and paddle.index_fill_. #59863
Fix the paddle.masked_scatter error report issue. #60835
Fix the paddle.histogramdd CPU error report issue. #61891
Fix the bug that consecutive use of paddle.cast_ on CPU leads to incorrect results. #60054
Fix the paddle.put_along_axis bug when the input size is very large. #60551
Fix the paddle.nanmedian CPU error report issue. #63221
Fix the bug that paddle.median does not support non-floating-point inputs in the min branch. #64444
Fix the dataloader issue in distributed scenarios. #62696, #63378
Fix the format issue under GLOG_v>=6. #63345
Basic Execution Architecture¶
PIR's basic functions have been comprehensively upgraded and its maturity has greatly improved. With PIR, the design of the PaddlePaddle infrastructure is more reasonable, ensuring excellent performance and good scalability of the framework. In this version, we completed the verification of PIR in multiple scenarios: for the single-machine scenario, we completed the switch to the PIR back-end in dynamic-to-static scenarios; for the inference scenario, we completed verification on all existing models, with 84.2% of models gaining 10%+; and we completed the verification of distributed scenarios based on PIR. Meanwhile, based on PIR, we completed the development and validation of core modules such as control flow, backward logic, save/load, and OneDNN adaptation, laying a solid foundation for switching PaddlePaddle to PIR by default. The functional completeness, execution efficiency, and stability of the PaddlePaddle operator system are further improved, bringing a better usage and development experience to developers.
Function Optimization¶
Improve the basic functions of PIR, including basic type system enhancement, debugging, printing, Pass development, and AMP support, to enhance the development efficiency of PIR. #60723, #60677, #60783, #60798, #61053, #61366, #61446, #60024, #59939, #63376, #61853, #63914, #60170, #60678, #64093, #64065, #62451, #59784, #60136, #63336, #62108, #60860, #60536, #60590, #60752, #61435, #62977, #62139, #60432, #61452, #61978, #62262, #62422, #60359, #62989, #61297, #61399, #61871, #61496, #62413
Optimize the execution logic of the PaddlePaddle executor, improve the Pass system, and enhance training and inference performance, to better support the execution of distributed parallel logic. #60182, #60516, #63573, #60181, #59792, #62025, #61160, #61188, #61277, #61669, #60823, #61310, #60892, #60578, #61657, #62638, #63960, #64234
PIR New Features¶
Implement backward logic based on PIR, generating the backward computation graph directly and supporting higher-order differentiation. #60174, #60328, #60818, #61352, #61661, #61927, #62772, #60360, #60866, #60970, #60810, #64696, #59844, #59999, #60262, #60338, #59935, #59982, #60221, #62621, #60044, #59790, #60529, #61378, #61584
Implement control flow logic based on PIR to improve the expressive ability of PIR and better support multi-scenario services such as training and inference. #61396, #64045, #60953, #61091, #61304, #62093, #64710, #60668, #60433, #60963, #61192, #60895, #60017, #60369, #60330, #60364, #61416, #60460, #60703, #61027
Implement save/load logic based on PIR, connecting PIR with upstream and downstream training and inference services. #63438, #63574, #64281, #64327, #63622, #64507, #63389, #63539, #63749, #63957, #64044, #64121, #64239, #63818, #63910, #63380, #63275, #63663, #64692, #63958
Complete the development and validation of OneDNN-related basic functions, to prepare for the full switch to OneDNN. #60680, #60665, #63162, #59917, #62901, #59918, #60257, #60502, #61062, #61170, #61474, #60874, #61495, #61664, #61649, #61592, #61667, #61137, #60952, #61651, #62126, #62187, #61307, #62734, #60974, #61451, #61011, #61218, #61623, #61893, #61876, #61892, #62085, #62220, #62244, #62265, #60754, #60896, #61868, #61659, #62241, #62471, #61165, #64441, #63141, #63145, #63592, #63617, #63518, #63726, #63853, #63812, #63811, #64524, #62993, #63516, #62998, #63151, #64661, #64433, #64448, #63201, #63230, #63233, #63281, #64671, #63274
Implement Sparse related logic based on PIR, including basic Type and operator expression, and complete the verification of Sparse key functions. #62868, #63015, #62894
Dynamic-to-static Function Optimization¶
Optimize the dynamic-to-static basic capability, adapt to the dynamic dimension in SOT training scenarios, and support Python 3.12.
Complete the PIR adaptation in dynamic-to-static scenarios. #60988, #61936, #59929, #61790, #64323, #62030, #61143, #62680, #63309, #63311, #62199
SOT adapts to Python 3.12 bytecode, and the dynamic-to-static SOT function can be used in Python 3.12. #61414, #59562, #61031, #61272, #61412, #61305, #61964, #62008, #62028, #61995, #62073, #62120, #62218, #62155
SOT completes adaptation to dynamic dimensions in training scenarios, avoiding repeated graph capture when dimensions change and improving execution efficiency. #64278, #64435, #64499, #64500, #62080
Operator Mechanisms¶
To address incomplete kernel implementations and inefficient computation logic, we have improved and optimized some operator implementations and internal mechanisms of the framework, fixed some known problems, and supported some new features.
For XPU kernels, optimize the data type support of numel, concat, and slice, and the mixed-precision training support of the AdamW optimizer. #63715, #61617, #61694, #64542, #63644, #61340, #63108
Improve the function and performance of some operators. #59413, #60295, #64304, #60979, #63556, #63061, #62533
Improve the mechanism of composite operators, and optimize composite logic for some operators. #59448, #60505, #59891, #63161, #63245, #63782, #64346, #63156, #63171, #61315, #61701, #61874, #61873, #62059, #61912, #62112, #63011, #63009, #64714
Bug Fixing¶
Fix bugs related to PIR, the executor, and dynamic-to-static. #64442, #60443, #60122, #60625, #60607, #60705, #61110, #61278, #61448, #61491, #61692, #62100, #62239, #62365, #62758, #63395, #64272, #62165, #64151, #64204, #64815, #63757, #61972, #64806, #60010, #60461, #60310, #62006, #61591, #60327, #60720, #64656, #60236, #60684, #60790, #60944, #62056, #62891, #64676, #60271, #60634, #60663, #60827, #60845, #60905, #60945, #60949, #61107, #61111, #61117, #61158, #61177, #61355, #61593, #61666, #61934, #62216, #62491, #62515, #62594, #62605, #62895, #62913, #64413, #59947, #60264, #60721, #63113, #63629, #64300, #64450, #64532, #64561, #64625, #64731, #60059, #60487, #60423, #61599, #62032, #62686, #64055, #60751, #61646, #60454, #62530, #62821, #64454, #64754, #59860, #60280, #60357, #60363, #60900, #61185, #61505, #61644, #62256, #62396, #63040, #63409, #63764, #59571, #59894, #59569, #59896, #60015, #60081, #60164, #60200, #60211, #60267, #60458, #60395, #60907, #60707, #60993, #61401, #61433, #61450, #61577, #61575, #61703, #61711, #61883, #61822, #62012, #61858, #62176, #62257, #62470, #62536, #62606, #62808, #62854, #62879, #62864, #63063, #62958, #63397, #63805, #63694, #64168, #64184, #64174, #64315, #64362, #64400, #64475, #64458, #64548, #59858, #61132, #62010, #62069, #62707, #62921, #63085, #63321, #63351, #63549, #64567, #59936, #60269, #60879, #61314, #61391, #61479, #61789, #61832, #61864, #61917, #62052, #62068, #62293, #62479, #62506, #59948, #64118, #64126, #64195, #64307, #64314, #64276, #64312, #64350, #64319, #64463, #64457, #64455, #64487, #64645, #63155, #59893, #63332, #64786, #60515, #60627, #60863, #60854, #61447, #61440, #61932, #62131, #62252, #62283, #62358, #62411, #62424, #62810, #62811, #62896, #62947, #63182, #63190, #63294, #63306, #63352, #63404, #63474, #64013, #64674, #60055, #62050, #62770, #63234, #63374, #64277, #63420, #60312, #63810, #64631, #63970, #63708, #62062, #60898, #62373, #59878
Fix some bugs in operator mechanism, operator implementation logic and related unit tests. #63792, #60570, #61572, #59971, #61336, #63276, #63251, #63697, #63706, #64685, #64009, #62461, #61568, #63912, #60475, #60222, #63961, #63593
Developer Content¶
Developer related contents include PIR switching, unit test start, function verification and other PR. #60621, #59703, #59694, #59717, #59729, #59730, #60216, #60238, #60246, #60343, #60302, #60870, #59956, #60795, #62528, #59932, #59636, #59959, #59734, #60287, #60347, #60335, #60332, #59631, #60255, #60329, #60401, #60522, #60792, #59617, #60277, #60584, #60911, #61322, #60838, #60602, #61458, #61607, #61960, #60484, #61662, #62263, #62270, #62469, #62416, #62443, #62412, #62541, #62634, #62369, #60805, #62644, #62494, #62767, #62735, #62802, #62801, #62783, #62579, #62833, #62668, #62972, #62505, #63005, #62900, #60577, #60877, #61076, #61038, #61112, #61120, #61582, #61119, #61036, #61289, #60695, #61039, #61963, #62118, #62797, #62807, #62887, #62830, #62849, #62750, #62965, #59742, #59867, #60836, #60902, #61228, #60037, #60079, #60173, #60373, #60380, #60381, #60750, #61065, #61122, #61074, #61204, #61191, #61182, #61219, #61296, #61503, #61484, #61513, #61476, #61510, #61511, #61526, #61524, #61525, #61466, #61497, #61538, #61533, #61530, #61468, #61527, #61535, #61512, #61531, #61539, #61532, #61521, #61517, #61518, #61550, #61545, #61548, #61519, #61549, #61574, #61585, #61581, #61553, #61504, #61603, #61534, #61567, #61523, #61565, #61564, #61707, #61560, #61684, #61706, #61724, #61719, #61729, #61763, #61755, #61737, #61750, #61753, #61756, #61777, #61758, #61731, #61771, #61739, #61559, #61717, #61733, #61563, #61546, #61566, #61562, #61793, #61902, #61905, #61904, #62227, #62332, #62653, #62681, #62709, #62794, #62938, #63185, #63754, #63769, #63793, #63830, #63939, #64340, #64657, #62527, #64088, #60203, #60372, #60685, #60815, #60791, #60864, #60851, #60844, #60694, #60855, #60869, #60948, #61042, #61455, #61580, #61589, #61609, #61616, #61715, #61716, #61759, #61555, #61492, #61805, #61712, #61615, #61713, #62129, #59294, #59865, #60270, #60547, #60698, #60762, #60753, #60966, #60976, #61100, #61203, #61210, #61424, #61213, #61275, #61276, #61279, #61292, #61295, #61298, #61299, #61301, #61302, #61329, #61804, #62745, #62909, #64247, #64308, #60690, #61149, #61145, #61193, #61207, #61229, #61236, #61244, #61242, #61263, #61370, #61410, #61480, #61522, #61540, #61520, #61625, #61700, #61708, #61736, #61889, #61952, #62033, #62637, #62777, #62779, #63226, #63287, #63398, #63431, #64000, #64058, #64059, #64063, #64066, #64089, #64170, #64235, #64237, #64243, #64242, #64286, #64322, #64317, #64490, #60138, #62384, #59702, #60341, #60636, #60714, #60716, #60700, #60702, #60704, #60715, #60713, #60711, #60724, #60803, #61331, #63286, #60473, #61046, #61859, #60675, #60719, #62863, #63013, #61293, #62781, #62935, #63014, #64203, #63349, #59572, #59911, #59861, #60014, #59913, #58889, #60114, #59928, #60180, #60168, #60166, #60250, #60247, #60172, #59661, #58880, #60291, #58881, #58955, #58684, #58708, #60323, #58762, #60048, #60345, #60325, #59627, #60416, #60434, #59801, #60619, #60445, #60666, #60353, #60733, #60693, #60350, #61096, #61121, #61164, #62054, #62136, #62508, #62988, #63472, #60193, #60197, #60198, #60346, #60318, #60645, #60650, #60660, #60706, #60799, #60837, #60817, #60820, #60894, #61079, #61087, #61073, #61072, #61127, #61097, #61365, #61456, #61846, #62217, #62519, #62881, #62880, #59723, #59722, #59797, #59960, #59761, #59996, #60009, #58896, #60051, #60410, #60420, #60548, #60575, #60726, #60809, #61346, #61222, #61099, #62254, #62269, #62362
Improve the underlying error checking mechanism of PaddlePaddle to facilitate developers' debugging. #62571, #62602, #60903, #64695, #59907, #62018, #62839, #60651, #61488, #64064, #63192, #63525
Vulnerability Fixing¶
Compiler Infrastructure for Neural Networks (CINN)¶
In version 3.0, the compiler architecture has been significantly upgraded. Based on the Shape Dialect, we build a symbolic automatic derivation and simplification system that supports symbolic expressions and constraint construction, and supports end-to-end execution of the compiler under dynamic shapes. Meanwhile, CINN has upgraded its automatic subgraph fusion and Pass Pipeline mechanisms, merged the core modules for dynamic and static shapes, and merged their iteration paths, making the architecture clear and unified. In this version, important back-end modules of the compiler, such as AST Compute, the Schedule strategy, and Tiling, have been refactored, improving the compiler's general optimization capability; training and inference correctness and speedups under dynamic shapes have been verified on subgraphs of PaddlePaddle Industry Suite models and on typical large models such as Llama2-7B and Stable Diffusion.
New Features¶
Upgrade the new automatic subgraph fusion mechanism, and innovatively propose the TrivialOp and ReduceOp fusion theory, supporting a wider range of vertical fusion and horizontal fusion, ensuring the correctness and robustness of subgraph fusion, and giving full play to the fusion potential of the neural network compiler.(#63340、#63913、#63579、#63605、#60769、#62088、#63124、#63658、#64557、#63318、#62545)
Add the symbol derivation function of dynamic shapes. Based on the Shape Dialect, realize the dynamic symbol construction, automatic derivation, constraint expression, symbol simplification and other mechanisms, introduce the DimExpr concept, upgrade the support for the PaddlePaddle framework of the InferSymbolicShape logic of the 150 + typical primitive operators, and provide more information for training and inference with compiler support for dynamic shapes.(#60843、#62662、#63790、#60098、#60511、#61232、#61939、#62798、#62955、#63029、#60572、#61035、#61224、#61587、#61937、#62314、#62394、#62569、#62495、#62844、#63000、#63016、#64222、#60129、#60899、#61342、#61439、#62766、#61133、#61430、#61498、#61680、#63367、#62151、#62665、#61407、#61502、#61655、#64115、#61791、#62141、#63422、#63577、#63978、#63576、#63947、#64332、#63990)
Add the Pass Pipeline function, including PdToCinn, CinnPreprocess, BuildGroupOp, GroupClusterOp, CinnLowering, Accuracy Check and other Pass strategies, to support the Lowering and execution of subgraphs in dynamic and static shapes, with a clear architecture.(#61611、#62612、#64354、#61848、#62316、#64152、#61619、#62318、#61977、#62211、#63972、#63686、#64505)
Add support for the BucketLower and DyShapeSchedule functions, to realize automatic bucketed compilation and optimization according to the range of dynamic shapes; and adapt and upgrade the logic of the CodeGen module to support the generation of the InferShape function and the dispatch of conditional branches in the Host function, so as to accelerate training and inference of large models under dynamic shapes.(#62730、#61115、#59941、#62207、#64318、#64345、#60519、#62584、#60828、#60533、#61436、#62071、#63971、#61656、#63083、#64405、#63047、#64655、#63095、#63829、#63572)
Add support for compilation caching strategy, to automatically recognize, merge and reuse compilation results of the same subgraph structure, improve compilation efficiency by using multi-threading, so as to enhance the user experience.(#62952、#63269、#64718、#61367、#63305、#63750、#63871、#64893)
Add support for GenerateShape mechanism, add corresponding AST Compute operator definitions, support automatic resolution of dynamic symbols, and automatic generation of ShapeOp in the Lowering stage.(#64167、#64636、#61993、#64843、#62587)
Function Optimization¶
Optimize BuildCinnPass logic, upgrade the compiler’s perception strategy for black and white list operators, and improve the robustness of Pass logic.(#62372、#61081、#61225、#58863)
Optimize the OpLoweringGroup data structure, remove unnecessary interfaces and members, and reduce the coupling between upstream and downstream modules.(#62339)
Optimize the component design of the compiler on the architecture Arch, to abstract the concept of hardware, and reduce the cost of adapting to domestic hardware.(#63530、#64347、#64506、#64587)
Upgrade the AST Compute module of the compiler’s back-end operator, to adapt to support the computing logic of dynamic Shape.(#62488、#63581、#63687、#63654、#64217)
Performance Optimization¶
Optimize the Schedule logic of AST IR, restructure core modules such as Vectorize, Unroll, AxisBind, and ComputeAt, and merge the iteration paths of dynamic and static shapes, reducing development and maintenance costs.(#60449、#60155、#60342、#60498、#60538、#60190、#61197、#63140、#61156)
Optimize the Tiling strategy and the temp Buffer function, support warp-level contiguous memory reads and the cache_read/cache_write functions, and improve subgraph execution performance.(#64240、#60562、#64711、#62856、#61576、#61901、#62581、#61987、#60190、#63138、#62517)
Support automatic search function of Schedule configuration and AOT offline saving mechanism to accelerate the performance of subgraph Kernel.(#64271、#64588、#64694、#64620、#64702、#63086)
Support OptimizeReductionTactic optimization strategy to improve kernel performance in Reduce scenarios.(#6066、#61363、#60881、#63859)
Enhance DCE Pass function, remove redundant If/For branch codes and improve execution efficiency.(#61682)
Add support for FuseParallelMatmulPass Pass, integrate multiple Matmul operators to achieve acceleration.(#63623)
Bug Fixing¶
Fix the bug when Lowering some special operators to the compiler, to improve the end-to-end user experience.(#60800、#64720、#62593、#62661、#64626、#63320、#64581、#61608、#64135、#64659、#62391、#62490、#63891、#64529)
Fix a bug in the symbolic derivation logic of some operators.(#62141、#62376、#62941、#63322、#64672、#64407、#60241、#60440、#62503、#62997、#63169、#61098、#63973、#62248、#62321、#63755、#63917、#63903、#64173、#64525、#64615、#62247、#62455、#62898、#62867、#63608、#63789、#64085、#64136、#64181)
Fix the problems of compiler execution errors under dynamic and static shapes, to improve the robustness of the framework mechanism.(#60813、#61877、#61909、#62954、#63614、#60339、#60623、#60658、#60669、#58823、#62483、#62742、#61797、#63411、#64077、#62736、#62390、#63689)
Deprecated Features¶
Auto-Parallel Architecture¶
In order to further enhance the usability of the auto parallel architecture in large model training scenarios, PaddlePaddle has improved the auto parallel functionality for both dynamic and static graphs, including newly added parallel strategies such as sharding parallelism and interleaved pipeline parallelism, as well as support for lazy initialization of parameters. We add and enhance the SPMD derivation rules for some operators. The auto-parallel architecture has been comprehensively verified on a number of mainstream large language models. Meanwhile, to build the new 3.0 architecture, the static-graph auto parallel architecture has been comprehensively upgraded based on PIR, the new-generation intermediate representation of PaddlePaddle. It introduces DistDialect for distributed components, natively supports DistAttr and DistTensor in the computation-graph representation, and smooths the transformation between static and dynamic graphs, further unifying auto parallel usage across dynamic- and static-graph modes. Finally, a number of performance optimization technologies have been added and improved, including a zero-bubble pipeline scheduling strategy, achieving equal or better end-to-end training performance compared with manual parallelism on typical large models such as Llama-2 13B/70B. A minimal sketch of the sharding-annotation style follows.
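A minimal sketch of the annotation style (launch with paddle.distributed.launch on two devices; consult the 3.0 auto parallel docs for full usage):

```python
import paddle
import paddle.distributed as dist

# A 1-D process mesh over two devices.
mesh = dist.ProcessMesh([0, 1], dim_names=["dp"])

w = paddle.randn([8, 8])
# One annotation is enough: shard dim 0 of w across the "dp" axis.
# The framework derives the distributed states of downstream ops
# and inserts the needed communication operators automatically.
w_dist = dist.shard_tensor(w, mesh, [dist.Shard(0)])
print(w_dist.placements)
```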
Function Improvements¶
Add the dtensor_from_local interface for creating a DistTensor from a local tensor after sharding (correspondingly, shard_tensor creates a DistTensor from a global tensor before sharding); see the sketch after this list. #60206
Add the unshard_tensor interface for converting a DistTensor back to a global tensor, the inverse operation of shard_tensor. #60272
To reduce the GPU memory usage during training, add Sharding parallelism, and support stage1, stage2 and stage3 modes. #61926, #62711, #62486, #62230
To solve the problem of insufficient GPU memory when parameters are initialized before being sharded, add the LazyInit function, which supports sharding parameters first and then initializing them; see the sketch after this list. #60316, #60441, #60563, #61792
To reduce pipeline-parallel bubbles, add interleaved pipeline parallelism, with support for automatically converting a user network's pipeline parallelism to interleaved pipeline parallelism through configuration, so that users do not need to perform complicated marking in the network. #59751, #60050, #60467, #60868, #60187, #62884, #60560, #61541
Add the SPMD derivation rules for stack, gather, scatter_grad, cumsum, unbind, swiglu, and fused_linear_param_grad_add. Improve and optimize the sharding derivation rules of fused_rope, reshape, flatten, fused_rms_norm, slice, tile, flash_attn, cross_entropy and other operators, to solve incompatibility problems in some model networking scenarios. #62720, #64202, #63361, #63290, #61460, #59986, #61184, #60144, #62525, #62053, #60709, #60111, #63681, #62180, #60794, #60632, #62439
Improve the distributed checkpoint save and load functions, support the master_weights strategy, and fix a random hang problem. #60027, #59872
In order to support the auto parallel of arbitrary shape tensor, add the non-uniform tensor sharding feature. #62611, #61432
To support users using customized operators in auto parallel networking, support registering customized SPMD derivation rules for such operators outside the framework. #60509
Improve the slice SPMD rule, and support the transition from any state to replicate and from replicate state to any state. #60281, #59869
Add MoE expert parallelism (experimental). Currently, only dynamic graph auto parallel is supported. #63904
Fix some adaptation problems among auto parallel, dynamic-graph execution, and dynamic-to-static conversion. #60214, #60546, #62082, #61313, #61840, #60614, #60234, #64813, #61606, #63405, #64334, #60504
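A rough sketch of the dtensor_from_local / unshard_tensor pair described above (the import path is an assumption based on the interface names in #60206 and #60272; check your Paddle build):

```python
import paddle
import paddle.distributed as dist
# Import path assumed from the interface names in these notes.
from paddle.distributed.auto_parallel.api import (
    dtensor_from_local, unshard_tensor)

mesh = dist.ProcessMesh([0, 1], dim_names=["x"])

local = paddle.randn([2, 8])  # this rank's shard of a [4, 8] tensor
dt = dtensor_from_local(local, mesh, [dist.Shard(0)])

full = unshard_tensor(dt)     # back to a global tensor
print(full.shape)             # [4, 8]
```

And a sketch of the LazyInit flow (shard first, then materialize; the initialize() hook name follows the lazy-parameter design in #60316 and is an assumption):

```python
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([0, 1], dim_names=["mp"])

# Build the layer without allocating parameter memory.
with paddle.LazyGuard():
    linear = paddle.nn.Linear(4096, 4096)

for p in linear.parameters():
    # Shard the (still unallocated) parameter first...
    dist.shard_tensor(p, mesh, [dist.Shard(0)])
    # ...then materialize only the local shard.
    p.initialize()  # assumed materialization hook, see #60316
```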
Performance Optimization¶
To reduce pipeline-parallel bubbles, support splitting the backward computation of parameters and activations, and add the zero-bubble pipeline scheduling strategy to improve training performance. #62865, #62737, #64534
To improve the performance of sequence parallelism, fuse related communication and computation operations, and optimize redundant transpose operations. #64807, #63948, #64316, #64119
Optimize the time consumption of auto parallel graph optimization for static graphs, to reduce the delay from the start of training to the completion of the first step. #59912, #61817, #60022, #60125
Optimize the time consumption of related communication operations in hybrid parallel scenarios. #62157, #61622
Optimize the redundant GPU memory consumption of parameters under auto parallel dynamic-to-static. #62746
Improve the mixed-precision training function of auto parallel, support setting local auto_cast and black/white lists, support the master grad function, and adapt to different parallel strategies. #60158, #59987, #62629, #60385, #62015, #60514, #61221, #60779, #63228
Optimize unnecessary casts caused by type promotion and AMP, to improve performance. #63293, #63228
Upgrade Static Graph Auto Parallel Architecture¶
Based on the new-generation intermediate representation (PIR), add the new DistDialect, natively supporting DistAttr and DistTensor in the computation-graph representation and realizing the direct binding of distributed attributes to tensors and operators, which makes the auto-parallel architecture simpler and more unified. #63828, #64299, #63870, #64144, #62524, #62630, #62897, #60478, #60574, #63876, #63798, #62560, #63676
Improve APIs such as shard_tensor, reshard, and to_static, to support users converting a dynamic-graph model network directly into a PIR static computation graph for better performance. #62945, #62356, #60175, #62654, #63347
Optimize the auto-parallel graph optimization compilation process, and reduce the compilation and optimization time of static graphs by refactoring and optimizing the procedure of computation graph parallelization and communication resolution. #64137, #62201, #64143, #62560
Optimize the procedure of the SPMD derivation in static graphs to achieve the consistency results under dynamic-static graphs, which improves the unity and stability of the architecture. #62659, #62547, #63117, #63434, #63770, #64361, #63073
Upgrade the implementation of Reshard conversion in static graphs, and use consistent conversion rules under dynamic-static graphs to ensure the consistency of the execution logic and results of tensor reshard conversion in dynamic-static graphs, so as to improve user experience. #62718, #62694, #60215, #63362, #63072, #63962, #64223, #61796, #64465, #64623, #64418
Automatic Search and Tuning of Training Strategies¶
To improve the ease of use of the training strategy automatic search and tuning tool (AutoTuner), support user-defined search items, setting the priority of search items, and user-configured invalid strategy combinations; comprehensively enhance the error reporting information at runtime and in post-run logs; and support AutoTuner on NPU devices. #60101, #60294, #61898, #60248, #60417, #60954, #61499, #62724, #63693, #62853, #62984
Cuda Training Performance Optimization¶
This upgrade improves large-model training efficiency from multiple perspectives, such as operator computation efficiency, distributed communication optimization, and GPU memory optimization.
Function Improvements¶
Enhance the FlashAttention operator function, including support for NVIDIA SM90 GPU compilation, support for Group Query Attention, support for cuDNN access, support for QKV-packed form inputs, and so on. #59820,#60776,#58680,#63289
In the Repeat_interleave operator, add support for BFloat16 data type. #61854
For the issues of the many interface parameters of ResNet-like fusion operators such as fused_scale_bias_add_relu, fused_scale_bias_relu_conv_bn, and fused_dconv_drelu_dbn, and to improve operator ease of use, add the fuse_resunit pass to automatically fuse the above operators, achieving generic performance optimization. (#59771)
Performance Improvement¶
To address the large GPU memory consumption of the SwiGLU activation module in Llama models, add a fused SwiGLU operator that saves the memory of intermediate variables, reducing memory overhead during large-model training and reducing recomputation to improve performance; the performance of the Llama-70B model is improved by 9% (see the sketch after this list). #61508
To address the high proportion of communication time in sequence parallelism, overlap the backward communication of sequence parallelism with Matmul computation, saving end-to-end time and improving the end-to-end performance of large-model training by 1%~2%. #62284, #63531
For the slow training caused by needing to divide by nranks after sharding backward communications, fuse the backward communication with the division by nranks and support the ReduceScatter Average mode, improving large-model training performance. #62623
For the training-speed jitter caused by input-data broadcasting in tensor model parallelism, fix unnecessary synchronization between CPU and GPU in data broadcasting, ensuring a stable training speed. #60816
For the slow training caused by long P2P communication time in pipeline parallelism, overlap P2P communication with forward-backward computation, improving the end-to-end training performance of large models by 2%~3%. #61935, #62051
For the inefficient bias gradient computation in the fused_linear_param_grad_add operator, optimize its computation efficiency, improving end-to-end large-model training performance by 0.2%. #63114
For the long parameter-broadcasting time after sharding backward computation, overlap parameter broadcasting with the next step's computation, improving end-to-end large-model training performance by more than 2%. #63945
To address gradients occupying too much GPU memory during pipeline-parallel training, which slowed training by introducing extra recomputation, implement a gradient dynamic-release technique, improving end-to-end large-model training performance by 3.4%. #59739
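A sketch of calling the fused SwiGLU described above (assuming the operator is exposed as paddle.incubate.nn.functional.swiglu, per the incubate namespace convention):

```python
import paddle
import paddle.incubate.nn.functional as F

x = paddle.randn([2, 1024, 11008])
gate = paddle.randn([2, 1024, 11008])

# Computes silu(x) * gate in one fused kernel, avoiding the
# intermediate activation tensor of the unfused composition.
out = F.swiglu(x, gate)

# Reference (unfused) composition for comparison:
ref = paddle.nn.functional.silu(x) * gate
```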
Bug Fixing¶
Fix the StreamSafeCUDAAllocator CUDA Event resource leak that slowed down large-model training. #64621
Fix the backward calculation error of the fused_rotary_position_embedding operator. #60217
Fix the bug that customized operators cannot control calculation precision through black/white lists in AMP scenarios. #60052
Fix the bug that operators natively supporting mixed-data-type operations, such as add_ and divide_, produce unexpected type promotion when type promotion occurs. #64302
Distributed Strategy Enhancements¶
Focus on strengthening the functional experience of PaddlePaddle dynamic-graph distributed computing, with various functional improvements to parallel strategies such as AutoTuner, pipeline parallelism, and sharding, enhancing the flexibility of large-model training. Add features such as the Flash Attention mask, which significantly reduce the GPU memory usage of large-model training, especially long-sequence training, improve training performance, and provide stronger capability support for large-model training. In addition, several bugs and potential security risks have been fixed, significantly improving overall system stability.
Function Optimization¶
Optimize the search space of AutoTuner, which significantly improves search performance. #62608
For the pipeline-parallel problem that training may fail due to send-type checking during eval, add a training configuration to skip the redundant receive check of pipeline sends, offering higher flexibility and better performance. #63001
In dynamic-graph pipeline parallelism, add checks on the size and type of sent and received data, together with error messages, improving robustness and debuggability. #59405
Support setting multiple loss functions that return multiple losses in dynamic-graph pipelines, improving pipeline flexibility. #63167
In dynamic-graph pipelines, add a configuration option for clearing pipeline caches, to clear send/receive caches in time and better support dynamic-batch-size training. #62277
For the problem that the sharding stage3 strategy could not be aligned bit by bit, replace the unordered set with an OrderedSet to avoid errors caused by accumulation order; after the fix, results align bit by bit. #60085
To further reduce GPU memory usage in sequence parallelism, add a method that recomputes allgather, reducing the activation memory held by allgather. #64244
New Features for Dynamic Graphs¶
In the AutoTuner search space, add a new search dimension for refined recompute, making search results more accurate and lowering the threshold for model tuning. #62430
For the batch-size limitation in virtual pipeline parallelism, modify the pipeline scheduling method so that the batch size can be set flexibly. #61561, #60314
When using flash attention with a mask, the mask's memory usage was quadratic in the sequence length and performance was low. By using a sparse mask, the mask's memory complexity is reduced from quadratic to linear in the sequence length, which reduces the number of memory accesses; meanwhile, shared memory is used to accelerate memory access, greatly improving performance. #62029
In the dynamic-graph sharding parallel strategy, improve the overlap of communication and computation, to improve training performance. #60455
Communication Library Function Optimization¶
Bug Fixing¶
Fix the dbias_out memory allocation problem of the fused_linear_param_grad_add_kernel operator, and add gradient address checking logic to make error messages easier to debug. #363433, #64460
Fix the problem that the sharding strategy does not scale the gradient when comm_overlap is turned off with the reduce_avg operation supported. #62702
Fix a fusion-related bug in the calculation order of main grad in Stage2. #59142
Fix the bug that the switch attribute cannot be found when the reduce_avg communication operation is turned on under the sharding strategy. #62502
Fix the problem of setting stop_gradient=True for some parameters when Sharding stage1 training supports non-training parameters. #62616
Fix the message printing bug when TCP is turned off, to avoid misleading users. #62631
Fix the DataParallel training problem where multi-card training reported errors and a segmentation fault occurred when some gradients were not initialized. #62299
For the scenario of turning on sequence parallel, fix the bug caused by weight freezing in some models. #63596
Fix some bugs for autotuner scenarios with single dp. #60757
Fix aadiff bug of streaming parallel strategy. (#64716)
Remove some distributed unit tests. (#62762)
Parameter Server¶
This update mainly fixes several bugs in the use of the parameter server, as well as compilation and installation issues.
Bug Fixing¶
For the out-of-bounds read/write problem of the unique operator, fix the incorrect length set during its computation, to ensure the correctness of the unique operator. #60840
In response to the missing save/load functions and compilation errors in the PGLBox training process, fix several bugs in PGLBox save/load and compilation, to ensure the correctness of PGLBox functionality. #63905
In response to the problem that the CPUPS training process triggered GPUPS logic and caused training to crash, fix the setting of use_ps_gpu in CPUPS, to ensure the correctness of the CPUPS training process. #61406
For the cudaErrorInvalidResourceHandle error in GPUPS training on CUDA 12.3, add a device-id switching mechanism, to ensure that resource operations are carried out on the correct device. #63391
For the garbled output in the PGLBox Embedding Dump process, fix the improper use of C++ std::string, to ensure the correctness of Embedding Dump results. #65179
Inference Deployment¶
The inference framework upgrades its PASSes based on PIR for GPU, XPU, and CPU hardware, significantly reducing the number of lines of code compared with the previous version and improving development efficiency. The underlying executor is upgraded to a new asynchronous executor, improving inference performance on most models. Adaptation for inference acceleration based on the CINN compiler is completed. Switches are added for these features so that users can turn them on through settings. In addition, Paddle Inference natively supports directly loading optimized serialized models for mixed inference with TensorRT subgraphs, reducing startup time. For Paddle-TensorRT, interfaces are added to flexibly control the computation precision of nodes and whether a subgraph enters TensorRT computation, which is convenient for debugging. For performance optimization, more Transformer and LLM acceleration fusion operators are added for GPU, XPU, and CPU, such as the group attention mechanism fusion operator, the GQA structure, and WINT4, with support for automatic matching by PASSes.
New Features¶
Paddle-TensorRT
Upgrade the underlying TensorRT API calls of Paddle-TensorRT: when the TensorRT version is later than 8.5, the EnqueueV2 API (which will be deprecated in the future) is replaced by the EnqueueV3 API. #60807
Add config.exp_disable_tensorrt_subgraph() to set some subgraphs not to enter TensorRT; a combined usage sketch follows this list. #61967
Add config.exp_disable_tensorrt_dynamic_shape_ops() to set that operators with dynamic-shape inputs do not enter TensorRT. The default value is False. #62352
Add config.exp_specify_tensorrt_subgraph_precision() to set nodes to run with different precision types. #62402
In inference, add a switch for the CINN compiler: when configuring the inference config, turn on CINN through config.enable_cinn(). #61949
Upgrades to the PIR mechanism in inference
In the config, add the enable_new_ir() interface to enable PIR; see the sketch after this list. #61968
In the config, add the set_optimization_level() interface to set different optimization levels. #61968
In the PIR mechanism, the PASS function supports custom C++ PASSes. #62468
The inference library exposes PIR-related implementation header files, supporting users' secondary development based on PIR, such as custom Pass development. #61863, #62293
The PIR mechanism supports registering input and output Hooks on the Predictor. #63101
The multi-layer Transformer fusion operator fused_multi_transformer_op supports GQA calculation. #64125
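A combined usage sketch of the new Paddle-TensorRT switches (model paths and the argument forms of the exp_* interfaces are illustrative assumptions; see the linked PRs for exact signatures):

```python
import paddle.inference as paddle_infer

config = paddle_infer.Config("model.pdmodel", "model.pdiparams")
config.enable_use_gpu(256, 0)
config.enable_tensorrt_engine(
    precision_mode=paddle_infer.PrecisionType.Half)

# Keep selected subgraphs out of TensorRT (argument form illustrative):
config.exp_disable_tensorrt_subgraph(["subgraph_0"])
# Keep operators with dynamic-shape inputs out of TensorRT:
config.exp_disable_tensorrt_dynamic_shape_ops(True)
```

And a sketch of enabling the PIR path in inference (the optimization level value is illustrative):

```python
import paddle.inference as paddle_infer

config = paddle_infer.Config("model.pdmodel", "model.pdiparams")
config.enable_new_ir(True)        # run on the PIR execution path
config.set_optimization_level(2)  # illustrative level
predictor = paddle_infer.create_predictor(config)
```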
Function Improvements¶
Inference supports loading optimized models directly, making it possible to skip IR optimization altogether; deployment in this way minimizes framework overhead. #61598
Support re-specifying the shape-range information file when loading a model saved after IR PASS optimization. #60457
Collect shape information within the subgraphs of control-flow operators, supporting Paddle-TensorRT inference acceleration. #60451, #59588
The mixed-precision PASS (auto_mixed_precision_pass) for GPU-native inference supports the handling of sparse Tensor. #62656
XPU hardware related function
Paddle TensorRT INT8 computation mode supports tile operator into TensorRT computation, to improve INT8 performance of some models. #60189
Model Compression¶
Fix bugs and optimize functions mainly for Post Training Quantization (PTQ) and Quantization Aware Training (QAT).
Performance Optimization¶
Upgrade the inference executor to reduce GPU memory usage at runtime while keeping performance unchanged. This can be enabled through config.enable_use_executor(True). #57920, #58452, #63350, #64466
Upgrade the oneDNN version of Paddle Inference to v3.4, improving overall performance compared with v3.3. #64661
Upgrade the CUTLASS-based support for matrix multiplication and activation fusion calculation. (#61925)
Add generic PASS in PIR mechanism¶
Add identity_op_clean_pass and matmul_scale_fuse_pass. #59840
Add fused_flash_attn_pass. The pass can call flash_attention to replace the original attentions computation. #64213,#64707,#63304
In the inference PIR new architecture, upgrade layout adjustment algorithm, support the NHWC inference of conv class and norm class. The performance tested on SD models is significantly improved. #63628,#64634,#64658,#64708,#64830,#64896
Add remove_redundant_transpose PASS. #63357
Enable CSE PASS in inference to improve inference performance. #64523
GPU Performance Optimizations¶
Include new fusion operators and new PASS under PIR mechanism.
Optimize the performance of sparse convolution operator (sparse conv) to improve the inference performance of BEV and other models. #63067
Add the fusion PASS based on flash attention. #63220
The inference supports elementwise_add+group_norm+silu activated operator fusion pattern and its corresponding fusion kernel. #64199
Matrix multiplication supports groupwise Weight Only INT4 computation. #60422, #63212, #60204
The group attention mechanism fusion operator block_multi_head_attention supports KV Cache quantization. #59951
Inference uses an upgraded CUTLASS-based conv fusion operator implementation that supports PASS automatic fusion, with bias and activation. Compared with the original cuDNN implementation, the new operator delivers significant performance acceleration. It is enabled through config.exp_enable_use_cutlass(True). #64201, #64641
Add the blha_get_max_len operator and remove the repeated calls to get_max_len in block_multihead_attention; this is used for dynamic inference acceleration of large models. #64246
Data layout optimization: PASS prohibits the conv fusion operator from computing in NHWC mode with the FP32 precision type, because cuDNN degrades performance under this condition. #63400
GPU peak memory optimization: upgrade the underlying TryShrinkMemory interface to support GPU places, supporting the release of idle memory in the memory pool. In certain scenarios, peak GPU memory usage can be cut significantly. #61319
CPU performance optimization¶
Include new fusion operators, add PASSes under the PIR mechanism, and optimize some kernels.
Add scale_matmul_fuse_pass. #63313
Add CPU implementation in fused_bias_residual_layernorm and fused_rms_norm to improve inference speed. #63196、#63165
Add the cache optimization for Deconvolution kernel, to greatly improve the execution speed of this operator. #60922
In PIR, add depthwise_conv fusion PASS, to convert the depthwise_conv operator to conv2d, thus using the onednn conv2d kernel optimization to improve the inference speed of this operator. #63051
In PIR, add Conv and Activation Fusion PASS (conv_activation_mkldnn_fuse_pass), to support the fusion of conv and 13 kinds of activation functions, thus greatly improving the inference speed of conv-related operators. #63145
In PIR, add the fusion PASS (operator_unsqueeze_onednn_fuse_pass) between multiple operators and unsqueeze, to improve inference speed. #63592
In PIR, add PASS (operator_reshape_onednn_fuse_pass) to fuse reshape into multiple operators. #63812
In PIR, add scale fusion PASS (operator_scale_onednn_fuse_pass). #63811
In PIR, add PASS (conv2d_transpose_bias operator) that fuses conv and bias. #62241
In PIR, add onednn_placement_pass, which supports 151 operators to convert from Phi operators to oneDNN operators, so that the oneDNN high-performance library can be used for optimization, to improve the inference speed. #63982
In PIR, add the fusion between Elementwise type operators and 13 activation functions, to greatly improve the inference speed of enabling Onednn on the CPU. #63516
In PIR, add the fusion of multiple conv + concat + activation functions and fused_conv + concat + activation functions, to greatly improve the inference speed when there are concat and activation functions in conv. #62993、 #62713
In PIR, add matmul+add operator fusion PASS (matmul_elementwise_add_fuse_pass). #62715
In PIR, add the scale parameter to fold PASS (scale_matmul_fuse_pass). #63313
In PIR, add the fusion PASS (softplus_activation_fuse_pass) between softplus and 12 activation functions. #63617
In PIR, add fc operator conversion PASS (fc_onednn_enable_pass). #63518
In PIR, add self-attention operator fusion PASS (self_attention_fuse_pass). #63726
In PIR, add fusion PASS (fc_activation_fuse_pass) between fc and 12 activation functions. #63853
In PIR, add BatchNorm folded PASS (conv2d_bn_onednn_fuse_pass) to amplify the fusion probability of subsequent PASS. #64524
In PIR, add the fusion PASS (matmul_activation_fuse_pass) between matmul and 12 activation functions. #62901
In PIR, add reshape + transpose + reshape fusion PASS (shuffle_channel_detect_pass), which is fused into a shuffle_channel operator under specific conditions. #64053
In PIR, add reshape + transpose + matmul fusion PASS (reshape_transpose_matmul_fuse_pass). #62998
In PIR, add matmul + transpose + reshape fusion PASS (matmul_transpose_reshape_fuse_pass), to significantly improve performance in some scenarios. #63151
New fusion PASS optimizations for XPU hardware:
Add qk_qkv_attention_xpu_fuse_pass and qkv_attention_xpu_kernel in XPU hardware. #60089
Add rotary position encoded fusion operator, to support elementwise_mul + strided_slice + sin/cos+ stack fusion to 1 operator in XPU hardware. #60025
Add group_norm_silu_xpu_fuse_pass. #62689
Add weight_only_linear_xpu_pass. #64185
Add block_multihead_attention operator and PASS, to support large model inference for LLaMA2 models in XPU devices. #65036
Support float16 type for squeeze_excitation_block_xpu_kernel. #61023
Bug Fixing¶
Fix mixed-precision conversions in models such as faster_rcnn_swin_tiny_fpn_1x_coco, and solve the mixed_precision_pass error. #64673
Prevent the fused_conv2d_add_act pass from taking effect when the activation function is sigmoid (fusing conv2d and sigmoid causes performance degradation with cuDNN versions 8.0 to 8.7). #64717
Fix compilation issues with self_dp_attention and fused_layer_norm_avx_kernel in Clang12. #63414
Fix the issue that scale and zeroPoints in the qdq operator of some models are deleted prematurely in the IR/Pass stage. #62225
Fix the issue that causes an error to be reported when both Config.UseOptimizedModel() and config.EnableMemoryOptim() are turned on. #62501
Add constraint on matmul_scale_fuse_pass, where input w must be a weight or the pass will not be matched. #62850
Ensure that the ordering of inference model output keys stays the same as when the dynamic-graph model was exported. #63791
Fix the subgraph error in the constant folding PASS when the folded op and its inputs and outputs are not in the same subgraph. #62148
Fix several runtime problems in Paddle-TensorRT mode, including the failure to generate the quantization calibration table caused by the yolo_box operator in INT8 mode, and errors caused by incorrect handling of the dim attribute data type in reduce operators. #61596
Fix some runtime errors in mixed-precision inference mode, including: errors caused by sharing weights among fused conv2d operators without correctly converting the weight layout; the fused conv2d operator back-end not being properly selected as cuDNN; the fused conv2d operator incorrectly handling the bias dimension under NHWC; and incorrect handling of the input data type of norm-class operators. #60955、#60076、#63007、#63988
Fix the problem that config.delete_pass function does not take effect. #61056
Fix the GC mechanism of While control flow in PIR to recycle unwanted inputs in advance and reduce the peak memory, for example, 2GB memory reduction in LLaMA 7B model. #63062
Fix the OneDNN mean kernel rollback error. #64676
Add strong constraints to conv_bias_fuse_pass (e.g., the shape of the bias cannot be 1) to ensure that the pass produces stable inference results. #64412
Add strong constraints to conv_elementwise_add_onednn_fuse_pass (e.g., conv2d_out and residual_param must have the same size) to ensure that the pass produces stable inference results. #64448
Fix the problem that quantize/dequantize operators are inserted repeatedly under certain circumstances. #63082
Hardware Adaptation¶
Adaptation Scheme (Custom Device)¶
For PaddlePaddle hardware access, this release adds daily-release support for four kinds of hardware: Kunlun XPU, Ascend NPU, Hygon DCU, and Cambricon MLU. Problems found in distributed communication during large-model training and inference deployment have been fixed, and performance has been optimized through device memory optimization and the overlap of computation with communication. In addition, each hardware gains support for a large number of BFloat16 operators, along with many operator fusion Passes and fused operators. Through hardware-software co-design, large Transformer operator libraries are integrated on each hardware to fully improve large-model performance.
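From the user's perspective, an adapted device is selected like any built-in backend once its plugin package is installed. A minimal sketch (the device name "npu" is an assumed example registered by the corresponding plugin):

```python
import paddle

# Device types registered by installed custom-device plugins.
print(paddle.device.get_all_custom_device_type())

# Select the custom device; subsequent ops run on it transparently.
paddle.set_device("npu:0")

x = paddle.randn([2, 3])
y = paddle.nn.functional.relu(x)
```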
New Features¶
Add the support for distributed policy sharding stage1 v2. #61500
Support the BF16 data type in the distributed communication module, and add BF16 support to operators such as empty, shape, etc. (a minimal usage sketch follows this list). #60768,#62140,#62604
Add support for the get_comm_name interface, for the memory statistics function, and for the Profiler to record memory timing. #62556,#61030,#62292
Add support for some fusion strategies and operators, including silu_fuse_pass, conv_elementwise_add_act_fuse_pass, and generator offset. #60595,#60708,#60616
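As a usage illustration of the BF16 communication support mentioned above, a minimal sketch, assuming the script is launched with paddle.distributed.launch on a device whose plugin supports BF16:

```python
import paddle
import paddle.distributed as dist

dist.init_parallel_env()

# BF16 tensors can now flow through communication ops directly.
t = paddle.ones([2, 2], dtype="bfloat16")
dist.all_reduce(t)  # sums the tensor across all ranks
print(t)
```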
Performance Optimization¶
The distributed communication strategy Sharding now uses an asynchronous strategy when broadcasting parameters, improving the overlap between computation and communication. #59745
Add support for operators with STRIDED layout to improve operator performance. #62532,#62697,#62649
Optimize the memory usage of the elementwise_mul operator. #62377
Bug Fixing¶
Fix bugs in the distributed strategy Sharding. #61942,#62236,#62305,#62535,#62572,#61601
Fix the problem that the c_embedding operator cannot be registered because it is not under the PHI namespace. #60774
Fix the xccl_comm release issue. #60465
Fix the data address error caused by the index_put operator falling back to CPU. #61842
Fix stream_safe_custom_device_allocator issue. #63369
Fix the distributed worker port conflict issue. #61409
Fix comm data type to improve device compatibility. #62306
Unify the use of comm data type to phi::DataType. #62464,#62562
Fix the problem of missing precision parameter in PD_ConfigEnableCustomDevice. #63702
Kunlun XPU¶
New Features¶
Add BF16 data type support for more operators (a BF16 auto-cast usage sketch follows this list), including compare_kernel and reduce_all_kernel (#63602), empty (#60212), hybrid_parallel_optimizer (#60213), reduce_max/reduce_min (#60453), all_reduce/concat/split (#62364), tile/tile_grad (#63075), accuracy (#63863), swiglu/set_value (#64070), amp_master_grad (#63865), c_concat (#63403), flatten (#63997), compare_op (#64473), moment1/moment2 (#62688), fused_rope (#60064), c_softmax_with_cross_entropy (#60472), elementwise_pow/square/sin/cos (#60402), strided_slice (#60382), tile/sigmoid_grad (#60119), elementwise_sub/elementwise_div (#60386), softmax_with_cross_entropy (#63759)
Add the support for INT8 data types for some operators, including multi_encoder_xpu (#61212), qkv_attention (#63105)
Update Kunlun SDK versions, including BKCL, XHPC, XCCL, etc. #59895, #59888, #63624, #60305, #62076, #62646, #63520, #64163, #64326, #60617, #60377, #60421, #60598, #61199
Add the support for memory stat function. #61116
Add multi-stream support, to assign default l3/gm buffer size to each stream. #62729
Add the nonzero operator, supporting the simulator XPUSIM_SKIP_RUN mode. #60224, #60388
Add stride_slice and stride_slice_grad operators, to support strides < 0. #62749
Add rotary_embedding, to support use_neox_rotary_style == True. #64090
Add fusion Pass and fusion operators including cross_attention (#63203), fused_bias_act (#62232), fused_layernorm (#62228), group_norm_silu_xpu_fuse_pass (#63342)
Add the support for distributed policy sharding stage3. #57457
Add the support for tf32 fc quantization mode. #62273
Add the flash attention operator. #60065
Add the roformer relative embedding pass & kernel and support multi_encoder_xpu. #62089
Add the support for pp + sharding strategy. #63640
Upgrade the XPU communication library architecture to support unified dynamic-static communication library functionality. #63817
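As a usage reference for the BF16 kernel coverage listed above, a minimal auto-cast sketch on an XPU device (the model and shapes are arbitrary examples, not taken from this release):

```python
import paddle

paddle.set_device("xpu")  # assumes an XPU build of Paddle

model = paddle.nn.Linear(16, 16)
x = paddle.randn([4, 16])

# bfloat16 auto-cast can now cover many more XPU kernels
# (reduce, tile, swiglu, fused_rope, ...).
with paddle.amp.auto_cast(dtype="bfloat16"):
    out = model(x)
```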
Performance Optimization¶
Add XHPC buffer manager to improve the performance of Paddle and XHPC memory collaboration. #63924
Enhance TensorSetConstantXPU performance and support BF16 data type. #63920,#61818
Fuse multiple group_norm + silu + conv modules to reduce device memory usage. #62892
Optimize XPU memory allocation in comm manager. #64139
Optimize operator performance, including mean_all_grad (#61148), dropout_v2 (#61029), fused_rotary_position_embedding (#62846), cross_entropy (#63159), elementwise_add (#64289), fused_gemm_epilogue (#61350), check_nan_or_inf (#60853)
Bug Fixing¶
Fix the tile operator support for 0-dimensional Tensor. #64279
Fix the group_norm_silu_fuse_pass. #63449
Fix the distributed strategy Sharding stage1 v2 bug. #64209
Fix the XPU constant issue. #60763
Fix some operator issues, including AdamW (#62251), dropout_v3 (#62726), softmax (#63780), fused rope embedding (#62143), elementwise_add (#60252), resnet_basic_block (#62914)
Fix XPU runtime and installation related issues. #60028,#61970
Fix XPU compilation bugs. #63307
Fix device-side memory bugs when initializing the XPU communication library. #64396
Environment Updates¶
In this PaddlePaddle version, we synchronize the releases and updates of the basic dependency libraries and remove old dependency libraries that are no longer maintained. We complete a number of optimizations to improve compilation efficiency and compatibility, and improve the CI pipeline monitoring functions to enhance the installation experience. We also fix several known compilation problems, improve Paddle's compilation system, and add some new features. Through these optimizations, the compilation and installation experience of the PaddlePaddle framework is further improved, bringing developers a better use and development experience.
New Support¶
Support installing Paddle without relying on a local CUDA and cuDNN installation, improving the installation experience (a verification sketch follows this list). #60841,#61973,#61862,#61235,#61209,#61653,#64083
Fully support CUDA 12.3, and complete the retirement of CUDA 10.2. #63356,#60299,#64171,#62189,#63392,#64228,#62498,#64298
Fully support Python 3.12, bringing more powerful language features and performance optimizations, and complete the retirement of Python 3.7. #59875,#59877,#59876
Upgrade other third-party libraries that Paddle depends on. #63741,#64447,#60195,#60110,#61509
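After installing a wheel that no longer depends on a local CUDA/cuDNN toolkit, the installation can be verified as follows; a minimal sketch using standard Paddle utilities:

```python
import paddle

paddle.utils.run_check()        # smoke-tests the runtime, GPU included if visible
print(paddle.version.cuda())    # CUDA version the wheel was built against
print(paddle.version.cudnn())   # cuDNN version the wheel expects
```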
Compilation Optimizations¶
Optimize Paddle's CMake code, significantly improving compilation efficiency and experience. #59995,#60167,#61052,#59607,#63093,#63887,#62969,#64007,#59811,#63045,#60235,#60240,#61411,#61944,#61961,#59990,#59478,#61501,#60066,#64133,#64231,#60087,#60348,#60737,#61364,#63214,#62454,#62473,#63692,#63950
Support linking C++ unit tests against dynamic libraries under Linux and Windows, greatly reducing the size of the C++ unit tests and of the entire build directory. #60008,#60960,#60961,#60831,#60832,#60833,#61372,#60834,#61374,#61463,#61376,#60830,#61373,#61672,#61375,#61676,#62036,#61945,#61675,#61674,#62773,#61238,#59988,#60307,#59612,#59942,#59968,#59978,#60121,#60149,#60161,#60160,#60230,#60154,#60356,#60392,#60517,#61131,#60959
Add support for the Clang compiler. Users can now compile with Clang, enjoying faster compilation and clearer error messages. #63382,#63133,#61705,#63152,#63373
CI Pipeline Improvements¶
Improve the monitoring mechanism for merged code in the CI pipeline to ensure higher code quality and stability. Add a monitoring module that tracks various indicators of the CI pipeline in real time, ensuring smooth execution of each stage and enabling issues to be identified and resolved promptly. #61384,#62190,#60758,#60399,#58623,#62177,#62361,#62893,#63705,#64476,#64752,#64733,#61914
Code Cleanup¶
Bug Fixing¶
Fix several compilation issues of the Paddle framework. #63297,#62994,#62651,#64408,#60934,#62899,#60528,#63158,#64549,#62351,#61259,#61281,#62304,#60736,#60811,#63949,#59892,#60767,#60856,#61286,#61638,#62079,#62142,#62823,#62814,#62425,#62619,#60207,#60765,#61870,#61923,#62144,#62426,#63848,#60682,#61369,#62882,#63944,#64812,#60654,#60887,#62058,#64639,#60115,#61940,#62614,#59914,#63762,#60145,#60285,#60378,#60393,#61057,#61058,#61151,#61347,#61554,#61844,#62915,#61852,#61704,#61991,#62264,#62762,#63820,#63864,#65017,#61183,#59866,#61171,#61290,#61725,#61614,#61721,#61494,#61556,#61689
Others¶
Changes not visible to users, including cleanup of deprecated code, removal of unused unit tests, and debugging or upgrades of the monitoring mechanisms. #63377,#64106,#64220,#64293,#64464,#64944,#63638,#63732,#63735,#63826,#63982,#63737,#64471,#64574,#64494,#62775,#63601,#62564,#63772,#64719,#61640,#63459,#64062,#63480,#63833,#63673,#63672,#64131,#64156,#64155,#64159,#63902,#64230,#64229,#64236,#64260,#64175,#64250,#64269,#64238,#64349,#64394,#64402,#64401,#64388,#64329,#64502,#64501,#64515,#64503,#64514,#64601,#64564,#64012,#64697,#64682,#64051,#63267,#63426,#63626,#63257,#63266,#63468,#63262,#63248,#63241,#63252,#63258,#63235,#63399,#63488,#63487,#63466,#63464,#63483,#63486,#63475,#63489,#63470,#63457,#63493,#63561,#63584,#63587,#63586,#63569,#63559,#63558,#63555,#63543,#63589,#63583,#63565,#63564,#63265,#63562,#63591,#63460,#63238,#63631,#63707,#63714,#63854,#63929,#63532,#59628,#62209,#63742,#60518,#62078,#62684,#62723,#64141,#60404,#64212,#60652,#64545,#64477,#64556,#63160,#63796,#64693,#64484,#64677,#64461,#63189,#63855,#63896,#63193,#63200,#63406,#61283,#63607,#64486,#64004,#63132,#63553,#63572,#63794,#63919,#63980,#62917,#64451,#63541,#63703,#64536,#63264,#63335,#63841,#64628,#63419,#62210,#63557,#63064,#61442,#63537,#63839,#60927,#60566,#60842,#64612,#60047,#63898,#60415,#60474,#60439,#60565,#64414,#62526,#54183,#64096,#61325,#60629,#61051,#62103,#63594,#60968,#64613,#64073,#63816,#64416,#62499,#64531,#63827,#59885,#59949,#63428,#63218,#63538,#64497,#63082,#64395,#60183,#63691,#64428,#64648,#64650,#59926,#59750,#60080,#60208,#64124,#64187,#64166,#64284,#64253,#64555,#59878,#64081