2.6.0 Release Note¶
1. Important Updates¶
Paddle New generation IR (PIR): To further improve the scalability of the PaddlePaddle framework, we have developed a new-generation intermediate representation. It abstracts the underlying core concepts of the PaddlePaddle framework, such as Operation, Attribute and Type, providing developers with flexible and efficient basic components. By introducing the Dialect mechanism, PIR can comprehensively and hierarchically satisfy the needs of each module for intermediate representations, greatly enhancing the scalability of the framework. PIR strictly follows the Static Single Assignment (SSA) principle, ensuring a unified top-level structure and the coexistence of “operator sequentiality” and “computational graph semantics”. In addition, PIR provides a more concise and low-cost Pass development process, with a series of rich, full-featured built-in Pass optimization strategies. It provides technical support for the ultimate performance optimization of large-scale models.
Static graph construction and compiler optimization architecture: To further improve the performance of the framework, PaddlePaddle’s dynamic-to-static training capability has been comprehensively upgraded to support adaptive graph construction. This has been tested on more than 700 PaddlePaddle industry-level models, with a 100% success rate for starting static training through a one-line code conversion. Meanwhile, the Compiler Infrastructure for Neural Networks (CINN) has been merged into the main PaddlePaddle repository, integrating the compiler more tightly with the framework. CINN’s architecture has been optimized and its extensibility improved, increasing system stability. Based on the PIR framework, it is much easier to bind dynamic-to-static, primitive operators, the executor and the compiler together, leaving more room to boost the overall performance of the PaddlePaddle framework.
Enhanced dynamic graph distributed capability: Large models place higher demands on the distributed training performance of the framework. PaddlePaddle has been comprehensively optimized across the communication library, graph analysis, distributed strategies and task enable/disable, enhancing the distributed computing capability of PaddlePaddle’s dynamic graph and supporting efficient training of large models. In terms of performance, training is further accelerated by reducing GPU memory usage in pipeline parallelism, adopting TensorFusion technology, overlapping communication with computation, and reducing non-essential data synchronization copies. Meanwhile, hybrid-parallel debugging is made more flexible through environment variables that control the Optimizer. In addition, system stability is significantly improved by fixing related bugs.
Auto parallel architecture with dynamic-static unification: To further reduce the difficulty of programming and optimizing large models, PaddlePaddle has fully optimized the Semi-Auto Parallel programming paradigm with dynamic-static unification, simplifying programming for developers. Developers do not need a deep understanding of the complex concepts and APIs of the manual parallel programming paradigm, such as row-parallel and column-parallel. They only need a small number of tensor distribution annotations to implement hybrid parallelism. The distribution specification is propagated to all tensors and operators automatically, and the framework handles the communication and synchronization needed by distributed training. Meanwhile, dynamic-to-static distributed training is supported by adding only one extra line of code, allowing developers to efficiently implement any hybrid parallelism strategy and greatly simplifying the development of hybrid-parallel training.
Hardware Integration Solution (CustomDevice): With increased demand for parallel training on new hardware in large model scenarios, PaddlePaddle has added support for advanced distributed policies, custom operators, and custom fusion policies. The distributed communication library is upgraded, with newly added support for many advanced distributed policies such as MP, GroupShared, PP, SP and MOE. Moreover, vendors can flexibly integrate Transformer operator libraries of different granularities and modify the computation graph through Fusion Passes for performance acceleration.
Installation and development experience: Modular compilation optimizes the logic of the CMake code and improves the efficiency of both full and incremental compilation of PaddlePaddle, which in turn improves developer productivity. This version supports Python 3.12, CUDA 12 and compilation for the Hopper architecture, and introduces Clang and other tools to fully optimize code formatting. In addition, C++ is changed from linking static libraries to linking dynamic libraries to reduce the size of compiled artifacts. These optimizations provide users with a smoother and more efficient installation and development experience.
2. Incompatible Upgrade¶
To avoid misuse, we removed the 0-dimensional Tensor compatibility switch, so that API behavior matches mainstream industry conventions. The previous version already supported 0-dimensional Tensors, but added a compatibility switch to avoid errors in some models as far as possible. That is, in scenarios where a model suite was used frequently and had not yet been updated, a 1-dimensional Tensor with a single element was still used by default in place of the 0-dimensional Tensor. In this version, the compatibility switch is removed, so a 1-dimensional Tensor with a single element is no longer substituted for a 0-dimensional Tensor in any scenario. The behavior of 376 APIs that should support 0-dimensional Tensors has been corrected and unified, completing support for 0-dimensional Tensors. #57036, #54581, #54500
To improve API usability, paddle.nn.functional.diag_embed has been streamlined to paddle.diag_embed, and can also be called as Tensor.diag_embed. #58223
To fix incorrect gradient computation caused by Tensor index writes (e.g., tensor[0] = 10) under static graphs, and to comply with static graph specifications, this version introduces the paddle.static.setitem API. In static graph environments, this API is recommended for indexed writes to tensors, instead of the subscript operator. This change does not affect dynamic graph environments, where index writes using the subscript operator are still allowed. #53682
The paddle.fluid API is completely retired in this version. In this update, we removed all paddle.fluid APIs and deleted the fluid directory. Meanwhile, a small number of PaddlePaddle underlying public components have been consolidated into the paddle.base directory. PaddlePaddle users no longer need to pay attention to fluid-related concepts and APIs, which further simplifies the PaddlePaddle API system and improves readability. #56576, #54424, #54829, #53992, #54806, #55754, #55986, #55345, #56099, #51717, #54152, #55522, #55757, #58521, #54936, #55007, #55661, #55970
3. Training Framework (including Distributed)¶
Python API¶
Upgrade Tensor indexing mechanism¶
This version comprehensively optimizes the basic index, advanced index and joint index functions of Tensor, to better comply with industry standards and user habits. Specifically, we added support for views in basic indexing, fixed some incorrect behaviors in advanced indexing, and implemented reads for joint indexing. In addition, we moved index parsing down to the C++ level, improved the performance of advanced indexing operators, and removed redundant computation in bool indexing. With these optimizations, the performance of Tensor’s basic, advanced and joint indexing has improved across the board. #56893, #58643, #57986, #56272, #58856, #55211, #57023, #56613, #55602, #59281, #57737
Upgrade Inplace mechanism¶
In earlier versions, to ensure the correctness of backward gradient computation, PaddlePaddle avoided Inplace operations whenever the backward computation of an API depended on its forward input data, because an Inplace operation could overwrite the original input. This mechanism simplified the implementation, but it also prevented many APIs from offering Inplace functionality, which could affect user experience. In this version, PaddlePaddle has fully upgraded the Inplace mechanism. It automatically detects whether the backward computation depends on forward inputs and saves the input data when needed, so more Inplace operations are supported. This not only improves memory usage efficiency, but also extends the functionality of the API. In addition, we have added 109 new APIs that support Inplace operations, including paddle.abs_, paddle.sin_/cos_/tan_, comparison operations such as paddle.greater_than_/less_than_/equal_, logical operations such as paddle.logical_and_/logical_or_/logical_not_, paddle.neg_ and paddle.log_. These additions enrich the feature set of PaddlePaddle and make numerical computation and deep learning tasks more efficient and convenient. #54683, #55078, #55576, #56888, #55509, #57093
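The idea of saving a forward input only when the backward computation needs it can be sketched in plain Python. This is a toy illustration under stated assumptions, not PaddlePaddle's implementation: the `Var` class and this standalone `sin_` helper are hypothetical names introduced only for the example.

```python
import math


class Var:
    """Toy tensor: a list of floats plus an optional backward closure (hypothetical)."""

    def __init__(self, data, requires_grad=False):
        self.data = list(data)
        self.requires_grad = requires_grad
        self.backward_fn = None  # maps an upstream gradient to the input gradient


def sin_(x):
    """In-place sine. d/dx sin(x) = cos(x) depends on the forward input,
    so the input is copied only when a gradient is actually required."""
    saved = list(x.data) if x.requires_grad else None  # save on demand
    for i, v in enumerate(x.data):
        x.data[i] = math.sin(v)  # overwrite the input buffer in place
    if saved is not None:
        x.backward_fn = lambda grad: [g * math.cos(s) for g, s in zip(grad, saved)]
    return x


x = Var([0.0, math.pi / 2], requires_grad=True)
buf = x.data                       # remember the buffer to show it is reused
sin_(x)
grads = x.backward_fn([1.0, 1.0])  # uses the saved input, not the overwritten one
```

When `requires_grad` is false, no copy is made, which is where the memory saving of an in-place operation comes from.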
Other new APIs¶
Added paddle.nn.functional.scaled_dot_product_attention. This significantly improves computational efficiency of the attention mechanism in large models, and meets demand for high-performance computation in large-scale deep learning models. #55242
Added a series of new scientific computing-related APIs, including paddle.cummax and paddle.cummin for cumulative maximum and minimum computation, paddle.index_fill and paddle.masked_fill for filling a tensor by index or mask, paddle.linalg.pca_lowrank for low-rank principal component analysis, paddle.hypot for calculating the length of the hypotenuse of a right triangle, and paddle.atleast_1d, paddle.atleast_2d, and paddle.atleast_3d to ensure a tensor has at least one, two, or three dimensions. We also provide paddle.select_scatter and paddle.diagonal_scatter for more flexible selection and scattering of tensor data, and paddle.multigammaln for computing the natural logarithm of the multivariate gamma function. In addition, new optimizer-related APIs are added in this version: paddle.optimizer.lr.LinearLR and paddle.optimizer.lr.CosineAnnealingWarmRestarts for learning rate scheduling, and paddle.io.SubsetRandomSampler for random sampling from a subset of data. These APIs further enhance the flexibility and efficiency of PaddlePaddle in various application scenarios. #57416, #53546, #53743, #57295, #57726, #58764, #58323, #57720, #58209, #58214, #57792, #51395, #57724, #57355, #57744, #58244, #57599, #59343, #57879
New Generation of Paddle Intermediate Representation (PIR)¶
PIR systematically abstracts underlying core concepts such as Operation, Attribute and Type, to build a set of flexible and powerful base components for developers. In addition, by introducing the concept of Dialect, PaddlePaddle can comprehensively and hierarchically manage the requirements of each module on the Intermediate Representation (IR), and developers can customize and extend Dialects according to specific needs, significantly improving the scalability and adaptability of the framework. In terms of design, PIR strictly follows the Static Single Assignment (SSA) principle, unifies the top-level structure, and makes “operator sequentiality” compatible with “computational graph semantics”. This provides a clear and consistent view of complex computation processes. To further optimize the performance of large models, PIR also provides a more concise and low-cost Pass development process, including Declarative Rewrite Rule (DRR) and Pattern Rewriter. In addition, a series of rich and full-featured Pass optimization strategies are built in and tuned to the characteristics of large models, providing strong support for their ultimate performance. Through these designs and optimization methods, PIR lays a solid foundation for efficient operation and continuous expansion of the PaddlePaddle framework.
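How an ordered operator list and a computational graph can coexist under SSA can be sketched with a few plain Python classes. This is only a conceptual illustration, not PIR's actual C++ data structures: the classes `Value`, `Operation` and `Program` below borrow the concept names from the text, but every method and field is an assumption made for the example.

```python
class Value:
    """An SSA value: defined exactly once, by one operation or as a program input."""

    def __init__(self, name, defining_op=None):
        self.name = name
        self.defining_op = defining_op  # None for program inputs


class Operation:
    def __init__(self, name, operands, attributes):
        self.name = name
        self.operands = list(operands)  # Values consumed by this operation
        self.attributes = dict(attributes)
        self.results = []               # Values produced by this operation


class Program:
    """Keeps operations in execution order; graph edges follow from use-def links."""

    def __init__(self):
        self.ops = []
        self._defined = set()

    def _check_unique(self, name):
        if name in self._defined:
            raise ValueError(f"SSA violation: {name} is already defined")
        self._defined.add(name)

    def add_input(self, name):
        self._check_unique(name)
        return Value(name)

    def add_op(self, op_name, operands, result_name, **attributes):
        self._check_unique(result_name)
        op = Operation(op_name, operands, attributes)
        result = Value(result_name, defining_op=op)
        op.results.append(result)
        self.ops.append(op)
        return result

    def users(self, value):
        """Graph view: every operation that consumes the given value."""
        return [op for op in self.ops if any(v is value for v in op.operands)]


prog = Program()
x = prog.add_input("x")
w = prog.add_input("w")
y = prog.add_op("matmul", [x, w], "y")
z = prog.add_op("relu", [y], "z")
```

Iterating `prog.ops` gives the sequential view, while `value.defining_op` and `prog.users(value)` give the graph view over the same objects, so no separate graph structure needs to be kept in sync.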
New features¶
Abstracted core concepts of IR bottom layer and provided developers with flexible base components, such as Operation, Attribute, Value, Type, Trait, and Interface. #56354,#57106,#57349,#54844,#54984,#54565,#54562,#57249,#57550,#59278,#54875,#55041,#54987,#55903,#57582,#57580,#58052,#55322,#57418,#57635,#55328,#57463,#59791,#59821,#59115,#57461,#59392,#57373,#59118
Added the Dialect mechanism to support comprehensive and hierarchical management of the intermediate representation requirements of each module of the framework. Besides the built-in Builtin Dialect, developers can customize and extend Dialects according to their needs. #56325,#57539,#54682,#55381,#56156,#56431,#56615,#57103,#57209
Normalized PaddlePaddle static graph operator system. Added OperatorDialect and KernelDialect. Managed conceptual differences of operators in the form of Dialect during compilation and execution, making Architecture clearer. #56284,#54469,#58660,#58975,#56680,#54790,#54826,#54840,#55699,#55648,#55880,#56101,#56754,#54944,#56836,#57185,#58757,#56243,#56436,#57741,#59124,#57054,#56984,#57403,#57904,#58031,#56924,#59270,#55343,#56557,#55693,#54428
Added ShapeDialect with built-in rich shape operation operators for constructing dynamic shape constraints and expressions for AI compilers. #56727,#59254,#58368,#57069,#57337,#56351,#57029,#58036,#59032,#57961,#56427,#57459
Unified top-level structure of Framework Program, supporting compatible representation of “operator sequentiality” and “computational graph semantics”, decoupling dependency on ir::Graph, and strictly following the principle of Static Single Assignment (SSA). #59369,#54563,#57051,#57306,#57857
Added IrPrinter and IrParser components to support serialization and deserialization of PIR Programs, providing a friendly debugging experience for PIR development. #55695,#59449,#54369,#54499,#55518,#55784,#57180,#57471,#54859,#54968,#55209,#57314,#57969
Built a new, simple and low-cost Pass development system based on Declarative Rewrite Rule (DRR) and Pattern Rewriter, with a series of rich and full-featured built-in Pass optimization strategies, to accelerate training and inference. #54385,#54738,#55859,#56638,#57090,#58673,#59415,#56729,#58655
Added the ProgramTranslator component, to support one-step conversion from ProgramDesc to the new generation IR of PaddlePaddle, with easy-to-use C++ and Python interfaces. #55433,#54470,#58044,#58390,#58100,#55403,#55406,#54719,#56550,#55448,#55453,#56294,#56308,#56842,#58517
With the help of automatic code generation, the full set of static graph operator representations for the PaddlePaddle framework can be generated in one step. Moved static graph networking logic down to the C++ side and bound it to the _C_ops module. This greatly streamlines API code on the Python side, realizes dynamic-static unification of PaddlePaddle APIs, and upgrades a large number of Python APIs to support static graph networking with the new IR. #56570,#55745,#56955,#57298,#57946,#57248,#56080,#54396,#54551,#56520,#55002,#57067,#59320,#59348,#57164,#57267,#59064,#54340,#54895,#55004,#56196,#56862,#58991,#55428,#55909,#56241,#56526,#56571,#56518,#57016,#56653,#56809,#57158,#55422,#55458,#55432,#55467,#55483,#55419,#55517,#55500,#56674,#57693,#55008,#57166,#57157,#57159,#57175,#57325,#57330,#57415,#57122,#57393,#57344,#57667,#57348,#57700,#58093,#58005,#58081,#58094,#58137,#58287,#58352,#58340,#58363,#58331,#58343,#58317,#58450,#58377,#58466,#58470,#58491,#58546,#58587,#58453,#58634,#58604,#58605,#58593,#58675,#58699,#58384,#58629,#58579,#58695,#58548,#58688,#58792,#58843,#58840,#58718,#58883,#58785,#58608,#58781,#58783,#58429,#58685,#58696,#58690,#58831,#58929,#58740,#58937,#58782,#58833,#58882,#58935,#58931,#59041,#59040,#58877,#58888,#59042,#58780,#58682,#58815,#58676,#58678,#58446,#59077,#59091,#58661,#58832,#58642,#58698,#59313,#59371,#58700,#58953,#58879,#59469,#59573,#59481,#59419,#59509,#58735,#59616,#59582,#59420,#59500,#58911,#59535,#54891,#56794,#57477,#57929,#57765,#58693,#58603,#56291,#57123,#57317,#57341,#57020,#57324,#57761,#57762,#57907,#57909,#58099,#58110,#58114,#58139,#58144,#58165,#58194,#58138,#58113,#58245,#58318,#58105,#58348,#58235,#58354,#58341,#58445,#58418,#58239,#58473,#58239,#58391,#58501,#58519,#58416,#58588,#58531,#58730,#58773,#58862,#58946,#58500,#56585,#57480,#57433,#58498
Function optimization¶
Upgraded the static graph executor with more Kernel Instruction types, and supported loading PIR and scheduling its execution efficiently. This brings significant GPU memory savings and performance gains in training and inference. #54570,#58665,#57291,#54452,#57431,#54692,#55112,#55210,#55401,#55772,#55828,#56148,#54763,#56886,#57284,#57268,#57791,#56789,#56704,#57594,#58397,#58337,#58756,#58371
Rebuilt the auto-differentiation module for PIR, and migrated and adapted the high-order auto-differentiation function. Optimized the Stop Gradient propagation mechanism, so that the logic is clearer and the function is more robust. #55660,#57084,#56890,#58942,#59373,#57206,#58145,#55235,#57255,#56925,#55957,#56163,#56316,#57294,#57449,#59520,#59565,#56265,#56512,#56650,#57183,#57956,#59100
Optimized design and representation of control flow forward and reverse operators, introduced ControlFlow Dialect, and supported conversion and execution from control flow operators to PIR under ProgramDesc. #58729,#57364,#58625,#57475,#57265,#56799,#59033,#57342,#57801,#57958,#57949,#57937,#59231,#59496,#59321,#58088,#58198,#58024,#58089,#58086,#59175,#59423,#59567,#58098,#58163,#58250,#58277,#58355,#59020,#59200,#59585,#58109
Upgraded the dynamic to static execution flow to support PIR, optimized the dynamic to static subgraph Pass mechanism, and allowed users to try out functions of the PIR system under @to_static. #57566,#55620,#56791,#57357,#59152,#59312,#58630,#56035,#59447,#57361,#59261,#59774
Upgraded the combination operator function by introducing the concept of Backend, to manage the combination operator logic of dynamic and static graphs in a hierarchical way. Moved the necessary components and operator splitting rules down into C++, to dramatically reduce maintenance costs. #58153,#56391,#56614,#57030,#57554,#58018,#58130,#58581,#58679,#59054,#55480,#58451,#55647,#56342,#56798,#57561,#58023,#57722
Performance optimization¶
Added operator-level and structure-level optimization Passes for PIR Programs, such as DCE and constant_folding_pass. #54935,#59430,#58753,#58732
Added operator fusion Passes, such as fused_attention, fused_dropout_add, fused_gemm_epilogue_pass, fused_linear_param_grad_add_pass, fused_weight_only_linear_pass, and fused_softmax_mask_upper_triangle, to improve training and inference performance. #57557,#58272,#58188,#58401,#59366,#57655,#57360,#56672,#58537,#56247,#59391,#58897,#54933
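To make the two generic Passes named above concrete, here is a minimal sketch of constant folding followed by dead code elimination (DCE) on a toy SSA-style program. It assumes a made-up representation in which each operation is a `(result, op, args)` tuple; it shows the general techniques only and says nothing about how PIR's own Passes are written.

```python
import operator

# Toy set of operations that can be evaluated at compile time (an assumption).
FOLDABLE = {"add": operator.add, "mul": operator.mul}


def constant_folding(ops):
    """Evaluate ops whose operands are all constants and inline the results."""
    consts, out = {}, []
    for result, op, args in ops:
        # Replace references to known constants by their literal value.
        args = [consts.get(a, a) if isinstance(a, str) else a for a in args]
        if op == "const":
            consts[result] = args[0]
            out.append((result, op, args))
        elif op in FOLDABLE and all(not isinstance(a, str) for a in args):
            value = FOLDABLE[op](*args)
            consts[result] = value
            out.append((result, "const", [value]))
        else:
            out.append((result, op, args))
    return out


def dead_code_elimination(ops, outputs):
    """Walk backwards, keeping only ops whose result is (transitively) used."""
    live, kept = set(outputs), []
    for result, op, args in reversed(ops):
        if result in live:
            kept.append((result, op, args))
            live.update(a for a in args if isinstance(a, str))
    return kept[::-1]


program = [
    ("a", "const", [2.0]),
    ("b", "const", [3.0]),
    ("c", "mul", ["a", "b"]),       # foldable: both operands are constants
    ("d", "add", ["x", "c"]),       # depends on the runtime input x
    ("unused", "mul", ["x", "x"]),  # result never reaches the output
]
optimized = dead_code_elimination(constant_folding(program), outputs=["d"])
```

Running folding first matters here: once `c` is inlined as a literal, the `const` operations that fed it have no remaining users, so DCE can drop them together with `unused`.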
Dynamic to static capability enhancement¶
Dynamic to static graph conversion is a key technology in deep learning frameworks. It allows developers to find the best balance between flexibility and training efficiency. This version of PaddlePaddle has fully upgraded core Dynamic to Static functionality. Success rate of dynamic to static training is up to 100% among 700+ models in PaddlePaddle industry-grade model library.
New features¶
Adopted Python Eval Frame and VM simulated execution to implement an adaptive Graph Break mechanism, designed especially for control flow scenarios. By introducing the CallLayer mechanism, it makes full use of PaddlePaddle’s dynamic-static unification and supports a hybrid mode of Abstract Syntax Tree (AST) transcription and bytecode simulation. This captures control flow operators efficiently and dramatically improves how much of a computational graph can be converted to static form. At the cache level, advanced optimizations such as common sub-expression elimination are applied to significantly improve the execution efficiency of Guards. These optimizations not only reduce redundant computation, but also improve overall execution speed. To enhance robustness, a simple and efficient intermediate data layer is designed that supports correct recovery of SideEffects, ensuring stability and reliability in complex environments. In addition, it is compatible with mainstream interpreter versions from Python 3.8 to 3.11. #57824,#55887,#58155,#56107,#57490,#58829,#57240,#57588,#58117,#59823,#56077,#58956,#57653,#59855,#59017,#58424,#58187,#57793,#59698,#59747,#59710,#59297,#58423,#56262,#58103,#58538,#58771,#59191,#57754,#59439,#59816,#59035
Added dynamic to static syntax transcription parsing for PyLayer functions, making PyLayer’s conversion between dynamic and static graphs smoother. Users can now seamlessly carry out dynamic to static training on PyLayer, to easily export inference models. #56108,#56531,#57066,#57633
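The role a Guard plays in a tracing cache can be sketched generically: a previously captured graph is reused only while the properties it was specialized on still hold, and a guard miss triggers a new capture. The sketch below is a simplified illustration of that general idea and not PaddlePaddle's implementation. `GuardedCache` and `guard_key` are hypothetical names, and the "compiled" artifact is just the original Python function.

```python
def guard_key(args):
    """Summarise the input properties the cached code depends on
    (here, as an assumption: the type and length of each argument)."""
    return tuple(
        (type(a).__name__, len(a) if hasattr(a, "__len__") else None) for a in args
    )


class GuardedCache:
    """Caches one 'compiled' callable per guard key and counts re-traces."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = {}
        self.trace_count = 0

    def _trace(self, args):
        # Stand-in for capturing and compiling a static graph for these inputs.
        self.trace_count += 1
        return self.fn

    def __call__(self, *args):
        key = guard_key(args)
        if key not in self.cache:  # guard miss: capture again for the new inputs
            self.cache[key] = self._trace(args)
        return self.cache[key](*args)


@GuardedCache
def scale(xs, factor):
    return [x * factor for x in xs]


first = scale([1, 2, 3], 2)
second = scale([4, 5, 6], 3)  # same guard key, so the cached entry is reused
third = scale([1, 2], 2)      # different length, so the guard fails and a re-trace happens
```

Because the guard check runs on every call, making it cheap is what keeps the cached path fast, which is why the text singles out Guard execution efficiency.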
Bug Fix¶
Fixed abnormal GPU memory usage in some dynamic to static scenarios in is_test=True mode. #58350
Fixed an issue when a function decorated by @to_static is exported with jit.save in scenarios like foo(x,x,y). #55963
Fixed the issue that dynamic and static logic of some API behaviors is not uniform. This improves success rate and user experience of dynamic to static graph conversion. #56092
Enhanced distributed dynamic graph capability¶
To meet the needs of large models, this version focuses on improving the distributed computing capability of PaddlePaddle’s dynamic graph. Various improvements have been made to the communication library, graph analysis, distributed policies and task enable/disable, to provide comprehensive support for large model training. In terms of performance, we further improved training performance by reducing GPU memory usage in pipeline parallelism, adopting TensorFusion technology, overlapping communication with computation, and reducing non-essential data synchronization copies. Meanwhile, hybrid-parallel debugging is made more flexible through environment variables that control the Optimizer. In addition, system stability is further improved by fixing related bugs.
New features¶
Added the TraceHang function to the communication library, to quickly locate the faulty node when cluster training hangs. #59217
To improve training efficiency and reduce memory usage, the dynamic graph now supports the stride mechanism. #55156,#54762,#55850,#59190,#57005,#57331,#58033,#58303,#57835,#57189
Enhanced paddleviz function to facilitate analysis of computational graphs. #56837,#57626
In distributed Sharding strategies (Stage1,2,3), added main_grad function to support higher precision gradient accumulation, and reduce precision loss caused by low precision accumulation. #57972,#57934,#57473,#57537,#59611,#57960
In Sharding Stage1 strategy, added a switch variable to control whether to perform fusion calculation on Optimizer. #58790
In Recompute function, added support for Tuple input parameters, enhancing calling ability of Recompute interface. #56793
Enhanced Launch function, allowing distributed training without specifying endpoints in dynamic graphs. #54636
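The precision loss that motivates higher-precision gradient accumulation (the main_grad item above) can be reproduced with the standard library alone. The sketch below is an illustration and not PaddlePaddle code: it rounds values to IEEE 754 half precision with `struct` and uses a Python float as a stand-in for the higher-precision accumulation buffer. The function names and the choice of 10,000 gradients of 1e-4 are assumptions made for the demonstration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def accumulate(grads, low_precision):
    """Sum fp16 gradients into either an fp16 buffer or a higher-precision one."""
    total = 0.0
    for g in grads:
        g = to_fp16(g)  # each incoming gradient is stored in fp16
        if low_precision:
            total = to_fp16(total + g)  # the running sum is rounded to fp16 too
        else:
            total = total + g           # the running sum keeps higher precision
    return total


grads = [1e-4] * 10000  # the exact sum is 1.0
fp16_sum = accumulate(grads, low_precision=True)
main_grad_sum = accumulate(grads, low_precision=False)
```

Once the fp16 running sum grows large enough that a single gradient is smaller than half the spacing between adjacent representable values, further additions round away entirely and the sum stops increasing; the higher-precision buffer does not suffer from this at these magnitudes.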
Function optimization¶
Implemented new communication library with dynamic-static unification. Communication operators are fully adapted to PHI operator system, reducing development and maintenance costs to better support dynamic graphs and auto parallel architecture upgrade. #54417,#57768,#57897,#55537,#56604,#57519,#56088,#57153,#57161,#57252,#57251,#57208,#57305,#57424,#57548,#57560,#57564,#57233,#55726,#58073
TCPStore is changed to a single instance to support dynamic graphs and auto parallel more flexibly. #55956
Improved maintainability and flexibility of distributed policies such as MP/PP/SP, including addition of printing warning and error messages, structural cleanup of code files, and optimization of PP restrictions on inputs. #54448,#59762,#55462,#54788,#54664,#56456,#55540
In PP strategy, added support for P2P communication in computation flow, making communication mode more flexible. #54747
Sharding strategy supports reduce Operation on gradient. #58842,#57967,#55495
Performance optimization¶
Implemented timely release of the last layer’s output in PP strategy, to save GPU memory. #54505
In MP strategy Tensor fusion, supported passing in parameter groups to enhance the Tensor fusion function. Improved allreduce asynchronous communication performance, and enhanced training performance through overlap of computation and communication. #57690,#55662
In Sharding strategy, overlapped reverse computation with gradient communication to improve training performance. For Sharding stage1, added Tensor fusion and fused grad clip and optimizer, to improve computational efficiency. Supported overlap between VPP and DP/Sharding Stage1, to improve communication and computation parallelism. Optimized performance of Sharding Stage1 under FP16: in the check finite stage, only the gradients owned by the current sharding rank are checked, reducing computation overhead. Added environment variables to control whether the Optimizer step is performed, which saves GPU memory and allows model training to be debugged with fewer resources. #55598,#55427,#56063,#55766,#59848
In Hybrid Parallel strategy, moved Tensor fusion under PP/VPP to a pre-run step, to avoid the extra GPU memory overhead of fusing at runtime. Improved model training performance by reducing non-essential synchronous memcpy. #54403,#57215
Bug Fix¶
Fixed 13 bugs in PP, Launch function, MP strategy, and fuse_rope, to enhance stability of distributed strategies. At mechanism level, fixed errors of inplace and tensor reference to improve stability. #55116,#55782,#59609,#57394,#55864,#58482,#54571,#55896,#54648,#58307,#55679,#58133,#58408,#59707,#55342,#54703,#54869,#55568,#55233,#56418,#56428,#56892,#57192,#59161,#59340,#57006,#57353,#57352,#59088
Fixed the bug that PP strategy can’t release single-layer output in time. Fixed the bug that the initialization process may hang. #54624,#58844,#54673,#58376
Fixed the bug that calculation is wrong when input data types are not uniform under MP strategy. Fixed the bug of parameter synchronization under MP strategy. Fixed the bug that the user input config is not used correctly. #58858,#57918,#58037
Unified judgment method of dygraph and dynamic mode. #54633
Fixed the bug that the shape of sin and cos in fuse_rope is not correct. #56132
Fixed the bug that a task fails due to overly long endpoints in Launch distributed scenarios. Fixed the bug that endpoints may be out of order. #55011,#55478
Fixed the bug that the MEA function may cause a segmentation fault. #55408
Auto parallel¶
This release fully optimizes the Auto Parallel programming paradigm with dynamic-static unification to simplify programming for developers. Developers do not need to understand the complex concepts and APIs of the manual parallel programming paradigm, such as row-parallel, column-parallel, and so on. Only a small number of tensor distribution annotations are required to build a hybrid parallel model. The framework derives the distribution states of all tensors and operators, and adds the appropriate communication operators. Meanwhile, dynamic to static distributed training is supported by changing just one extra line of code, enabling developers to efficiently and easily implement any hybrid parallel strategy. This can significantly reduce the development cost of hybrid parallel training code.
Improved auto parallel core functions¶
Implemented auto parallel core APIs such as process_mesh, placement, shard_tensor, reshard, dtensor_from_fn, unshard_dtensor, shard_layer, to_static, and so on. #55494,#59059,#56561,#54425,#59557,#59682,#56565,#59862,#59856,#59342,#59575,#57604,#57293,#57278
Implemented Sharding derivation rules based on Einsum expressions, and completed 20+ classes of operator Sharding derivation rules, which cover LLaMA, GPT and other transformer-like large language models. #55196,#53863,#56257,#55394,#54810,#55508,#56257,#57813,#58149,#58506,#58563,#58360,#58920,#59050,#58760,#59083,#59236,#59350,#59411,#59260,#54373,#54991,#55397,#55350,#55177,#56443,#58097,#56509,#56502,#56504,#56506,#56507,#56505,#57176,#57374,#57573,#57545,#57875,#57866,#58854,#59109,#59185,#58913,#59547,#58296,#59545,#59039,#59002,#58087,#56367,#57877,#56839,#59003,#57269,#55130,#58474,#57197,#57467,#57259,#57280,#56508
Implemented distributed checkpoint storage and loading with dynamic-static unification. A checkpoint stored in one Sharding state can be loaded into any other Sharding state, with ReShard applied during loading. #59659,#59843,#60033,#60034
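An Einsum-style Sharding derivation rule can be illustrated with a small, self-contained function. This is a deliberately simplified sketch and not PaddlePaddle's rule implementation: the function name `infer_einsum_sharding` is hypothetical, a dims mapping is assumed to be a list giving the mesh axis each tensor dimension is sharded on (-1 meaning replicated), and conflicting annotations simply raise an error where a real rule would have to resolve them.

```python
def infer_einsum_sharding(notation, input_dims_mappings):
    """Derive the output sharding of an operator described by einsum notation.

    Returns (output_dims_mapping, partial_axes). A mesh axis is 'partial' when
    it shards a contracted dimension, so each rank holds only a partial sum
    that still needs a reduction across that axis.
    """
    inputs, output = notation.split("->")
    letter_to_axis = {}
    for letters, mapping in zip(inputs.split(","), input_dims_mappings):
        for letter, axis in zip(letters, mapping):
            if axis == -1:
                continue  # replicated dimension: imposes no constraint
            previous = letter_to_axis.setdefault(letter, axis)
            if previous != axis:
                raise ValueError(f"conflicting sharding for dimension '{letter}'")
    output_mapping = [letter_to_axis.get(letter, -1) for letter in output]
    partial_axes = sorted(
        axis for letter, axis in letter_to_axis.items() if letter not in output
    )
    return output_mapping, partial_axes


# Column-parallel matmul: the weight is sharded on its output dimension n.
column = infer_einsum_sharding("mk,kn->mn", [[-1, -1], [-1, 0]])
# Row-parallel matmul: the contracted dimension k is sharded on mesh axis 0.
row = infer_einsum_sharding("mk,kn->mn", [[-1, 0], [0, -1]])
```

In the column-parallel case the output inherits the sharding of `n`, while in the row-parallel case the output is replicated in shape but partial on mesh axis 0, which is the situation where a communication operator such as an all-reduce would be needed.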
Enhanced semi-auto parallel capability of dynamic graph¶
Basic data structure supplementation: Added DistTensor, Placements and other distributed specific basic data structures on C++ end, and exposed to Python end. Supports debugging and printing of related attributes and values. #58930,#59068,#55436,#56449,#59683,#55593,#58032,#56368,#59086
Added SPMD derivation and Reshard generation logic in execution flow for all operators, and adapted to multiple types of inputs and outputs such as vector and optional, as well as special mechanisms such as cpu fallback and multi-kernel selection. #56602,#57321,#57092,#56831,#57119,#58819,#58254,#55698,#59241,#59328,#58644,#56202,#59159,#58573,#59246,#59133,#59186,#57505,#57241,#58928
Adapted auto parallel execution logic for special types of operators, such as custom operators. Supports automatic conversion of DistTensor and DenseTensor as mixed inputs. #57774,#59108,#58436,#59523,#59136,#59352,#59062,#58434,#59148,#58553,#58716,#58369,#59061,#58841,#59139,#59141,#58837,#59137,#59143
Optimized dynamic graph execution system: Adapted the Autograd execution process. Supports the dynamic graph’s backward gradient aggregation, AMP, Hook, PyLayer, View, custom operators, and other surrounding mechanisms. #58437,#58769,#58796,#58339,#58409,#58772,#58380,#58447,#58706,#58656,#58172,#59401,#58727,#58238,#59243,#58469,#58442,#58487,#58476,#59706
Added support for Pipeline Parallelism, Sequence Parallelism and other distributed parallelism. #58126,#59766,#59060,#59841,#58609,#59688,#58449,#59598
Added various Reshard strategies and support tensor conversions between different distributed states. #58592,#59138,#59367,#59621,#59758,#59777,#56975,#58550,#58703,#57210,#58734,#56833,#59292,#57432,#57568,#56553,#58284,#56039,#55552,#56149
Enhanced semi-auto parallel for static graphs¶
Added Sequence Parallelism; added FThenB, Interleaved 1F1B, Eager 1F1B, VPP and other scheduling modes for Pipeline Parallelism, and supported hybrid parallelism between these new modes and the original parallelism. Supported visualization of pipeline scheduling. Upgraded the gradient synchronization mechanism, which now supports gradient synchronization when data is sharded on any broadcast dimension. #57605,#54727,#54409,#54787,#58313,#59179,#59416,#59719,#59822,#59057,#59522,#57061
Adapted the executor to PIR, and supported PIR optimization Passes. In distributed scenarios, supports fusions such as fuse_linear to improve performance. #58459,#58528,#55555,#59757,#59102,#57917
Upgraded underlying architecture: upgraded the executor to reuse the results of data-flow dependency analysis and static kernel selection; upgraded entire graph based sharding completion mechanism, to switch to new sharding derivation rules and support some long-tailed cases; optimized the support of control flow under distributed static graph to adapt to more scenarios; reduced the graph compilation time and refined error message format to improve user experience. #55389,#55650,#54938,#57447,#57751,#57742,#59524,#59526,#58669,#57616,#56511,#55727,#58906,#56016,#54897
Optimized GPU memory usage in static graph mode, and added a refined recomputing strategy; optimized the auto mixed precision pass, allowed users to manually specify the auto-cast region, and fixed some bugs; supported parallel computation of cross-entropy; supported fusion operators such as scaled_dot_product_attention and fuse_rope; performed scheduling optimization to support better overlap between communication and computation in tensor parallelism and pipeline parallelism. #58421,#58533,#59498,#59187,#59188,#58172,#58628,#56185,#56696,#59497,#58304,#58977
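The difference between the FThenB and 1F1B scheduling modes mentioned above can be shown by generating the per-stage order of forward (F) and backward (B) micro-batch steps. This is a textbook-style sketch with hypothetical function names; it does not model Interleaved 1F1B, Eager 1F1B or VPP, and it is not PaddlePaddle's scheduler.

```python
def f_then_b(num_micro):
    """FThenB: run every forward step first, then every backward step."""
    return [("F", i) for i in range(num_micro)] + [("B", i) for i in range(num_micro)]


def one_f_one_b(stage, num_stages, num_micro):
    """1F1B: a short forward warm-up, then alternate one forward and one backward."""
    warmup = min(num_stages - stage - 1, num_micro)
    schedule = [("F", i) for i in range(warmup)]
    fwd, bwd = warmup, 0
    while fwd < num_micro:  # steady state: one forward, then one backward
        schedule.append(("F", fwd))
        fwd += 1
        schedule.append(("B", bwd))
        bwd += 1
    while bwd < num_micro:  # cool-down: drain the remaining backward steps
        schedule.append(("B", bwd))
        bwd += 1
    return schedule


def peak_in_flight(schedule):
    """Largest number of micro-batches whose activations are alive at once."""
    live = peak = 0
    for kind, _ in schedule:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


fthenb_peak = peak_in_flight(f_then_b(8))
first_stage = one_f_one_b(stage=0, num_stages=4, num_micro=8)
last_stage = one_f_one_b(stage=3, num_stages=4, num_micro=8)
```

Under FThenB a stage holds activations for all micro-batches before any backward step frees them, whereas under 1F1B the number held is bounded by the pipeline depth remaining after that stage, which is the main reason 1F1B-style schedules use less activation memory.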
AutoTuner¶
This release implements a profiling based automatic search and tuning tool named AutoTuner for parallel strategies, to automatically combine parallel and optimization strategies. Users can select effective combination configurations for experiments, and AutoTuner will search for the optimal configuration for large model training and inference given the model and hardware specification. In addition, AutoTuner implements a variety of pruning methods, including gpu memory modelling based pruning, so the search space and search time can be significantly reduced. #54460,#54668,#59794,#59727,#59782,#54834,#58127,#56968,#55466,#56939,#58183,#58314,#55499,#59748
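The search-and-prune idea described above can be sketched as follows. This is a simplified illustration, not AutoTuner's implementation: the function names and the toy memory model are assumptions made for the example, and the real tool profiles the surviving candidates rather than stopping at pruning.

```python
from itertools import product
from typing import Dict, List


def candidate_configs(num_gpus: int) -> List[Dict[str, int]]:
    """Enumerate (dp, mp, pp) degrees whose product uses every GPU."""
    degrees = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    return [
        {"dp": dp, "mp": mp, "pp": pp}
        for dp, mp, pp in product(degrees, repeat=3)
        if dp * mp * pp == num_gpus
    ]


def estimated_memory_gb(cfg: Dict[str, int], model_gb: float, activation_gb: float) -> float:
    """Toy memory model (an assumption): model state is split by mp * pp,
    activations are split by mp only."""
    return model_gb / (cfg["mp"] * cfg["pp"]) + activation_gb / cfg["mp"]


def prune_by_memory(
    configs: List[Dict[str, int]],
    model_gb: float,
    activation_gb: float,
    gpu_memory_gb: float,
) -> List[Dict[str, int]]:
    """Drop configurations that the memory model predicts will not fit."""
    return [
        c for c in configs
        if estimated_memory_gb(c, model_gb, activation_gb) <= gpu_memory_gb
    ]
```

Pruning before profiling matters because every surviving candidate costs a real trial run; removing configurations that cannot fit in GPU memory shrinks the search time without losing any viable option.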
Operator library¶
Incompatible upgrade¶
In order to improve maintainability of PaddlePaddle framework, some deprecated operators in the framework (e.g. diag_v1, isfinite_v1, pad2d_v1, etc.) have been removed, and models using these operators saved through the PaddlePaddle 1.x training will not be able to infer on new version of PaddlePaddle. #57895,#57892,#57898,#57730,#57732,#57810,#57884,#57794,#57926,#57925,#57807,#57808
Operator library enhancements¶
The complex kernels of PaddlePaddle PHI operator library have been further enhanced, and a total of 40+ complex kernels have been added. #55380, #56349, #56412, #56323, #56723, #56457, #56903, #56914, #57116, #56048, #57244, #57639, #57638, #57540, #58545, #58336, #58532, #58839, #59079, #59277, #59122, #57058
Optimized and added XPU kernels for some operators, and enhanced the support for data types such as bfloat16 on XPU kernel. #54478, #57740, #58346, #58456, #58662, #59066, #59263, #59375, #59505, #59653, #55001, #57272, #56169, #59454, #59480, #55914, #54758, #54827, #58364, #58419, #58982, #57216, #59166, #55033, #55375, #58805, #59389, #57077, #55166, #56773
Added some operators for optimizing large model training and inference performance. #55758, #54998, #55400, #54630, #55969, #55026, #58986
Improved mechanism of Tensor Strided in the operator library. #59422, #59325, #56863, #56882, #56947
Optimized function implementation and template function in some kernels to reduce size of compiled library package. #57083, #57299, #57261, #57290, #57118, #57551, #57509, #57558, #57064, #57365, #57327, #57603, #57671, #57672, #57631, #57082, #57721, #57823, #57821, #57815, #57822, #57541, #57817, #57838
CUDA¶
New features¶
Added debugging API paddle.amp.debugging.check_numerics, which computes and returns the number of outliers (NaN, Inf) and zero elements in a Tensor. #54301
Added fused_rope fusion operator to accelerate training of LLaMA-class large models. #54351
Updated CUDNN Frontend API version to v0.9.1 and added fused_scale_bias_add_relu fusion operator to accelerate ResNet networks. Note that this feature is experimental and is disabled by default. #58367, #54949, #58504
Based on Flash-Attention v2, added Tensor-like Mask function support. The backward operator supports deterministic computation for debugging. #57276, #56363
Modified sparse conv3d backend implementation to support 2d shapes, avoiding front-end reshape overhead. #54707
Added matmul_int8 operator. #55228
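The statistics reported by the numerics debugging API listed above (counts of NaN, Inf and zero elements) can be pictured with a small pure-Python sketch. It does not call PaddlePaddle, and the function name and return layout here are hypothetical; the real API's signature and output format are not reproduced.

```python
import math
from typing import Iterable, Tuple


def count_numerics(values: Iterable[float]) -> Tuple[int, int, int]:
    """Return (num_nan, num_inf, num_zero) for a flat sequence of floats."""
    num_nan = num_inf = num_zero = 0
    for v in values:
        if math.isnan(v):
            num_nan += 1
        elif math.isinf(v):
            num_inf += 1
        elif v == 0.0:  # also matches -0.0
            num_zero += 1
    return num_nan, num_inf, num_zero
```

A sudden jump in the NaN or Inf count between two training steps is typically the first place to look when mixed-precision training diverges.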
Function optimization¶
Optimized CUDA Graph’s support for random number operators. #58310
Enhanced automatic mixed-precision training default functionality, including:
Optimized the experience of using automatic mixed precision training interface. #58152,#55364,#57903
Added matrix computation class operators such as fused_attention, fused_feedforward, and fused_gemm_epilogue to framework’s default whitelist, and unified default black and white list settings for dynamic and static graphs. #55373, #55713
The argsort, dist, erfinv, nanmedian, poisson operators and lamb optimizer operators support FP16 and BF16 low precision computing. #51662, #55105, #55287, #55824, #56056, #56184, #55641
Fixed elementwise_max operator low-precision implementation. It now uses FP32 for numerical computing to reduce precision loss. #54799
Changed temporary result Tensor needed for Reduce class operator to FP32 type, to avoid precision loss caused by converting intermediate result to low precision. #55709
Optimized GPU codes for flip, roll & roll_grad, index_put & index_put_grad, etc. Removed unnecessary C++ templates to optimize compilation time and reduce compiled binary size without performance degradation. #57309, #57525
Added a validity check on the input probabilities of the bernoulli operator. #59174
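The reason for keeping the temporary result of Reduce class operators in a wider type, as described above, can be demonstrated with Python's standard `struct` module, whose `"e"` format rounds to IEEE 754 half precision. This is a sketch of the numerical effect only, not of the GPU kernels; the helper names are hypothetical, and Python's FP64 float stands in for the FP32 temporary.

```python
import struct
from typing import Iterable


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sum_fp16_accumulator(values: Iterable[float]) -> float:
    """Keep the running sum in FP16, rounding after every addition."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


def sum_wide_accumulator(values: Iterable[float]) -> float:
    """Accumulate in a wider type and round to FP16 only once at the end."""
    total = 0.0  # Python floats are FP64; this stands in for the FP32 temporary.
    for v in values:
        total += to_fp16(v)
    return to_fp16(total)
```

Summing 4096 ones shows the problem: the FP16 accumulator stalls at 2048.0 because 2049 is not representable in half precision, while the wide accumulator returns 4096.0.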
Performance optimization¶
Optimized BroadcastKernel’s support for large Tensors. Large Tensors are now split into shards and the INT32 implementation is called multiple times, improving operator performance by 7.27x. #57313, #57996
Optimized performance of the Tensor save interface by copying the Tensor to CPU and then converting it to numpy, avoiding the overhead of automatically converting a non-contiguous Tensor to a contiguous one. #57040
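The large-Tensor handling in BroadcastKernel described above amounts to cutting the element range into pieces that a 32-bit index can address. The sketch below shows only that index-range splitting; how PaddlePaddle actually chooses the shard boundaries and handles broadcast dimensions is not shown, and the function name is hypothetical.

```python
from typing import Iterator, Tuple

INT32_MAX = 2**31 - 1


def split_for_int32(numel: int, max_chunk: int = INT32_MAX) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover numel elements, each length <= max_chunk."""
    offset = 0
    while offset < numel:
        length = min(max_chunk, numel - offset)
        yield offset, length
        offset += length
```

Each yielded range can then be processed by a kernel that uses 32-bit indexing, which is generally cheaper on GPUs than 64-bit index arithmetic.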
Bug Fix¶
Fixed bug of memory_efficient_attention operator when supporting sm_90. #58070
Fixed the NaN problem of softmax operator when axis=-1 and length is greater than 100000. #57851
Fixed bug of GPU access error in some cases for set_constant operator. #59905
Fixed GPU storage read/write contention issue in fast implementation version of layer_norm operator. #56435
Expanded Compiler Infrastructure for Neural Networks (CINN)¶
In this update, PaddlePaddle CINN focuses on optimization of architecture and comprehensive expansion of its capabilities. In view of increasing demand for dynamic shapes for large models, effective operation and optimization strategies of compiler under dynamic shapes are initially explored and implemented. At the architectural level, Python DSL is introduced, significantly improving CINN’s development convenience and Debug capability and enabling developers to write and debug codes more efficiently. Meanwhile, logic of Schedule has been refactored to be dominated by GroupSchedule, enabling more general and stable optimization strategies at operator Group level. In order to enhance stability of CINN, a strong constraint component is explored and introduced. This can effectively reduce uncertainties and potential errors in the system. In addition, historical tool classes and software structure of CINN are systematically organized, optimized and improved, to further enhance readability and maintainability of codes. In terms of integration with other PaddlePaddle components, tight integration of CINN with PIR and Paddle has been further strengthened, making compiler more coherent with overall PaddlePaddle framework. This improvement not only enhances performance of the compiler, but also provides developers with a smoother and more unified development experience.
Compatibility upgrade¶
Modification deprecation¶
New features¶
Added CINN paddle.framework.core.is_run_with_cinn() API on the PaddlePaddle side. #54355
Added CINN related operator logics, including various combinatorial operator’s disassembly logic. #56072,#58210,#58502, #58591, #58981, #59135, #59274, #59306, #59202, #59176, #59534, #59713, #59798; Supports bf16, amp and other forms #54399, #54368, #54608; Supports operator zero-dimensional capability #54892, #54919, #54907, #54966
Supported joint execution of CINN with PaddlePaddle PIR and combinator operators, integrating the new PIR with CINN. #54732, #56074, #58216, #55680, #56302, #59037, #55186, #58641
Added strongly constrained components to stabilize CINN changes. #58719, #59309, #58993
Added Group Schedule related CINN architecture process. #58399, #56444
Added CUTLASS, error handling, and NVRTC Cubin Fmad options to CINN architecture functions preliminarily. #58079, #57198, #58794
Added Python interface language for CINN. #57731, #57515, #57644, #57981, #58009
Added dynamic shape functionality for CINN: ASTGen now generates dynamic shape symbols, replacing ISL #56360, #57207, #57454; Added Bucket Conditional Compilation functionality #59165; Added Schedule, Device, and IR level support for dynamic shape #58988, #59493, #58717, #58602, #59196
Supports CINN Group Schedule operator – at Group level, perform more general and stable Schedule optimization. #56122, #57777, #57569
Function optimization¶
Enriched or improved operator functionality, including backward fixes, FP16 support, InferShape, operator unit tests, etc. #56320, #56845, #54939,#54378,#55321,#55336,#55337,#55442,#55470,#55489,#55510,#55547,#55505,#55563,#54280,#59650,#54862,#55135,#55292,#55333,#55316,#55379,#55326
Improved joint execution of CINN with PaddlePaddle, PIR and combinator operators, including mutual support between CINN and PIR and its executor interface. #59170,#58766,#59255,#59203,#59024,#57829,#58135,#58193,#58207,#58606,#59437,#59759,#55075,#56805,#57764,#58620,#59769,#58702,#58749,#59025,#58820,#58908,#58169
There are strongly constrained components to stabilize CINN changes. #55090,#55705,#57587,#59501
Improved CINN IR and related tool codes. #55145,#55955,#56307,#55519,#56958,#57019,#57230,#57531,#57532,#57524,#58770,#59337,#59096,#56274,#56350,#57312,#55171
Supports CINN Group Schedule operator – at Group level, perform more general and stable Schedule optimization. #54982,#57963,#58220,#55484,#55935,#55590,#56530,#58344,#59810
CINN architectural improvements, including parallel compilation, low-level storage allocation method, print information, Group structure, Pass structure, etc. #56282, #59014,#59209,#52660,#54749,#58694,#58940,#59504,#56123
Improved CINN codegen, jit instruction, dim args, and host kernel to support dynamic shape. #58825,#59395,#59398,#59540,#59470,#59640
Improved cleanup of CINN codes, including CI, file paths, C++17, Flags, third-party libraries, Docker, etc. #55018,#55121,#55009,#55888,#56168,#56192,#56896,#53861,#55208
Fixed bug¶
Fixed operator-related bugs. #56280,#57767,#58406,#54406,#54494,#54751,#55674,#55684,#55683,#57798,#57816,#57687,#56719,#59756,#59770,#58811
Fixed process architecture-related bugs. #54899,#59737,#59356,#56105,#56662,#58146,#58910,#58121,#58943,#58886,#59642,#56164,#56338,#56966,#59112,#55820,#56660,#57307,#57530,#58236,#55190,#55043,#55667
Other bugs. #57239,#55530,#56605,#58243,#58197,#58197,#56086,#56065,#58775,#54750,#58595,#58873
4. Deployment Direction (Paddle Inference)¶
General inference optimization¶
This version of the upgrade improves performance and ease-of-use of the inference engine on GPU and CPU, reducing user cost and application cost of online inference. On GPU: A high-performance multi-threaded asynchronous executor is supported, and inference performance of each model is improved by 5%~10%. The new version of TensorRT and BF16 inference capabilities are also supported, and TensorRT inference performance and ease of use are further improved. On CPU: The latest version of OneDNN high-performance inference is supported. SwinTransformer, FastRCNN and other series of models have greatly improved performance.
matmul supports transpose and broadcast operations. #56827
TruncatedNormal and Assign support FP64 data types. #57507
Added conv_fuse_pass. Support conv + bn fusion. The conv2d_fusion is renamed fused_conv2d_add_act. #58724,#55374,#54477,#59431
Mixed precision inference supports OP whitelisting. #56535
OneDNN optimization is enabled by default. Supports SwinTransformer, FastRCNN and other inference optimizations. #58560,#59394,#59421,#58435,#58488,#59259,#56303,#56782,#57598,#58361,#59641,#59527,#59663,#59744
Added share_data to support passing in specified data. #57933
Large model inference optimized¶
The fine-grained fusion inference optimization of generative large models is realized. Optimization solution ensures high-performance inference capability and excellent expandability. Users can flexibly utilize various fine-grained fusion operators and PaddlePaddle native operators to build a network structure of generative large models in free combinations as required, thus achieving efficient and low-cost inference. In addition, our solution also supports mainstream generative large model structure, significantly reducing deployment cost of inference for such models and strongly supports efficient and low-cost implementation of generative large models.
Supports the FMHA/MMHA for CacheKV division block scheduling. #59462
RoPE encoding fusion operator supports input sin/cos values. #55415
Added fine-grained fusion operators. Supports high-performance inference optimization of generative large models. Added operators such as quant_linear, weight_quantize, and linear_compress to support quantized inference of large models. #57852,#55128,#59090,#56706,#59951,#55490,#59291,#59441,#59778,#59651,#55301,#58637,#56673,#56401
Supports variable length inference series API. #57948
Added masked multihead attention. Supports high performance MMHA inference. #55344,#56411,#58134,#57936
weight_quantize/weight_only_linear supports the Volta architecture. #58082
Added weight_only_linear_grad to support gradient backpropagation for weight-only quantization of large models. #57685
Fixed large model dynamic to static bug. Optimized communication initialization logic between static graph cards. #56390,#57169,#56688,#56592,#58868
Optimized top_p_sampling random number generation logic. #59494
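The weight-only quantization operators mentioned in this list store weights in low precision and keep activations in floating point. The following pure-Python sketch shows one common scheme, symmetric per-output-channel INT8 quantization with on-the-fly dequantization. It is an assumption-laden illustration: the function names are hypothetical, and the actual data layout, signatures and kernels of weight_quantize and weight_only_linear are not reproduced here.

```python
from typing import List, Tuple


def quantize_rows_int8(weight: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Quantize each output row to int8 with its own scale (symmetric, per-channel)."""
    quantized, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric int8 range.
        quantized.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales


def linear_with_int8_weight(
    x: List[float], quantized: List[List[int]], scales: List[float]
) -> List[float]:
    """Dequantize on the fly: y[j] = scale[j] * sum_i x[i] * q[j][i]."""
    return [
        s * sum(xi * qi for xi, qi in zip(x, row))
        for row, s in zip(quantized, scales)
    ]
```

Because only the weights are quantized, the memory footprint of the weight matrix drops to roughly a quarter of FP32 while the output stays close to the full-precision result.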
Paddle-TensorRT Inference Optimization¶
elementwise_add fusion supports NHWC format. #56795
conv2d supports filter as input. #55246
Added MarkTrtEngineOutputs API. Users can specify TensorRT Engine outputs. #56858,#56188,#57407
Customized OP can generate TensorRT Plugin automatically. #58976,#56037
TensorRT inference allows users to specify input hook to optimize shape collection process. #59466,#54841,#57498,#54861,#54432,#55503
TensorRT Inference supports inference model after saving Tuning. #55893,#56952,#57031
Supports variable length Transformer model PromptTuning. #57034
Added operators such as bitwise_and, bitwise_or, bitwise_not, cumsum, einsum, lookup_table, assign, flip, size, scatter, solve, unbind, reduce, and argsort. Optimized support of existing operators. #59214,#59293,#54882,#54097,#54860,#55426,#54372,#55688,#56069,#59563,#59317,#59424,#55476,#56043,#58549,#57326,#59409
TensorRT enables video memory sharing by default. #59495,#58251
PrelnResidualBiasPluginDynamic supports 4D input. #56304
Added support for FlashAttention for Paddle-TRT inference for architectures below SM80. #56492
Modification deprecation¶
Bug Fix¶
Fixed “Inference so” link flags conflict issue. #59755
Fixed constant_folding pass execution error. #55556
Fixed softmax forward speed bug and backward accuracy bug. #56036,#57858,#57538
Fixed customized OP while error and export bug. #58898,#59318
Fixed CUDA 12.0 compilation problem on Windows platform. #59852
Fixed errors in some inference operators when the TensorRT version is later than 8.6. #54379,#54679,#54251
Fixed and removed inference fusion Pass. #54846,#54887,#55573,#56434,#56326,#56753,#57491,#56909,#54536,#55073,#55081,#55240,#56439,#59009
Fixed error of multi-stream inference context switching. #57629,#58048,#54994
5. Hardware Support¶
Hardware Integration Solution (Custom Device)¶
This update adds support for advanced distributed strategies, custom operators and custom fusion strategies. With the upgraded distributed communication library, CustomDevice supports MP, GroupShared, PP, SP, MOE and other advanced distributed strategies. Meanwhile, vendors can flexibly access Transformer operator libraries of different granularities, and modify the computation graph through Fusion Pass for performance acceleration.
New features¶
Upgraded CustomDevice to support Paddle’s latest distributed communication library CommContext. Added a variety of advanced distributed strategies such as GroupShared and MOE. #56301,#54671,#57957,#56669,#54384,#54572,#54573,#54676
Upgraded CustomDevice to support CustomOP. Users can register undefined operators in Paddle PHI operator library. CustomDevice can support CustomOP via CAPI. #57038,#55532,#56755,#55532,#55533,#55659
Added CustomDevice’s support for CustomPass function. Modified the computation graph IR through Python API. #55511,#55728
Added CustomDevice’s support for Paddle run_check. #56318
Added CustomDevice’s support for StreamSafeAllocator. #55393,#56380,#56536,#58035
Added CustomDevice’s support for DataTransform. #56627
Function optimization¶
Added CustomDevice’s support for more PaddlePaddle APIs such as Variable.set_value, adamw, share_external_data, mp_allreduce_sum, tensor.numpy, get_paddle_place, and GeneratorState. #55272, #56386, #57253, #56927,#56189,#55225,#55247
Modified CustomDevice dynamic library loading method from RTLD_NOW to RTLD_LAZY, to facilitate subsequent checking of compatibility of CustomDevice related software stack version. #57544
Added CustomDevice’s detection function for FP16 operator under mixed precision training. #56053,#56176
Bug Fix¶
Fixed some problems in CustomDevice’s support for distributed communication libraries. #55293,#58038,#59800
Fixed some problems in CustomDevice on some operators, including c_softmax_with_cross_entropy, data loader, SplitDenseTensor, grad accumulation, and atan2 grad. #56486,#55541,#55615,#56052,#56067
Fixed some problems of device management in CustomDevice, including device exceptions (#56556,#58639,#55173), exception events (#56745,#58059), video memory exception (#56977,#59247,#54606), device initialization (#57099,#57994), device release (#54932,#55351,#55783), and device resource pooling, etc.(#55229,#56580)
Fixed CustomDevice compilation-related issues. #56760,#56766
Kunlunxin XPU¶
New features¶
Added XPTI (XPU Profiling Tool Interface) to support collection and analysis function of runtime performance data. #54685,#54690,#54800
Supports Paddle’s latest distributed communication library CommContext. #59418
Added XPU fusion operators, for example, fast_where. #55628
Added support for XPU Plugin function, making it easier for users to develop XPU customized operators through XTDK programming. #55101,#59326
Added XPU’s support for AutoGrowthAllocator. #54121
Added operator support list of Kunlun3. #57683
Function optimization¶
Upgraded XPU Inference API. #54342
Optimized performance of some XPU operators. Added support for bf16 in some XPU operators, including unique/index_put, squeeze/unsqueeze kernels, swish/swish_grad, scatter_nd_add_grad/slice, rsqrt/bitwise_or/arange_tensor, where, and collective. #56582,#58161,#58440,#58580,#58950,#58616,#59273
Optimized XPU memory management to avoid memory leakage. #59334,#54847
Supports INT8 inference. #57258
Added support for FP16 series inference operators. #55642,#54410
Supports share_external_memory interface to pass input and output. #55170
Supports open source quantization model XPU inference. #58568
Added context_gm_size configuration, instead of allocating global memory in Pass. #54674
Supports fusion of fast_layernorm + leaky_relu. #57113
Supports elementwise_min/max/floordiv/where inference in KL1 and KL2 precision. #58422
Supports autotune configuration of fc and conv2d operator. #58801
Supports conv and fc dynamic quantization. #59307
fc + act fusion support for sigmoid, swish and relu6. #54486
elementwise_sub/elementwise_div supports int data type. #55920
Hygon DCU¶
Bug Fix¶
Fixed some operator bugs of Hygon DCU, including rnn, concat/split, fft, and so on. #59402,#55821,#56340
Fixed issues related to communication library of Hygon DCU. #57110
Fixed compilation-related problems of Hygon DCU. #59775,#55507,#55612,#54952,#55076,#56079,#54874
Fixed support issue of Hygon DCU for BF16 data type. #56517
6. Environment Adaptation¶
Adopted modular compilation to optimize CMake code, improving compilation efficiency of PaddlePaddle and the efficiency of local development for R&D engineers. Meanwhile, added support for compilation with Python 3.12, CUDA 12 and the Hopper architecture, and used the Clang tool to comprehensively optimize code formats. In addition, C++ unit tests are changed from linking static libraries to linking dynamic libraries to reduce compilation size. These improvements provide users with a smoother and more efficient installation and development experience.
CMake code optimization: stratify directories into independent static libraries, to improve incremental compilation efficiency. #59095, #58960,#56591,#58484
CMake compilation stratification: to realize compilation layering of PaddlePaddle architecture from bottom-up and improve compilation efficiency. #56442,#54729,#55733,#56352,#55109,#54992,#57698,#55147,#55113,#56691,#58618,#58899,#59140,#59129,#59222,#59105,#59711
Offline compilation of third-party libraries: Third-party dependent libraries are compiled offline, so CI/CE system does not need to download third-party libraries repeatedly in every compilation, improving operation efficiency of the CI/CE system. #54344,#54370,#54466,#54438,#54388,#54436,#54392,#54646,#54380,#55501,#55136,#54451,#55631,#55549,#56165,#54391,#54614,#54522,#54764,#54400,#54322
Using Clang tool to optimize source codes and improve code quality. #59626,#55895,#56632,#54449,#54523,#54796,#55847,#55807,#56261,#57522,#57868,#57809,#55658,#58285,#55491,#55506,#55279,#55741,#55894,#55704,#55800,#55799,#55983,#55954,#55764,#56246,#56219,#56217,#56216,#56208,#56134,#56253,#56255,#56693,#56692,#56637,#56636,#56647,#56218,#56640,#56635,#55675,#56601,#56485,#56648,#56747,#56676,#56649,#56895,#56994,#56904,#56744,#56954,#57114,#57343,#57483,#57871,#57861,#58028,#57627,#59072
C++ unit tests have changed from linking static libraries to linking dynamic libraries, reducing compilation size and improving compilation efficiency. #59477,#56630,#57789,#54257,#59620,#59384,#59619,#58583,#58821,#58710,#58619
Fixed bug related to source code compilation, improving compilation efficiency. #56617,#58195,#56136,#54540,#57172,#54429,#55603,#54807,#56102,#56829,#56951,#56555,#57781,#57836,#58807,#54535,#54946,#54437,#54411,#54411,#54391,#54466,#54480,#54480,#54724,#59193,#54735,#54812,#56430,#56655,#56684,#56774,#56936,#56949,#56974,#57171,#57712,#56617,#58181,#58253,#58268,#59051,#59048,#59081,#59076,#59155,#59253,#59347,#58957,#59443,#58998,#57574,#55889,#59078,#55762,#56252,#56715,#54905,#56978,#57032,#57179,#57179,#58996,#59915,#54883,#56746,#57674,#60117,#55627,#54568,#54450,#54513,#54615,#54913,#54916,#55148,#55125,#55479,#55723,#55831,#55904,#56085,#56259,#56366,#56366,#56546,#56679,#57222,#57387,#57993,#59556,#57931,#58112,#54228,#56913,#56993,#55042,#55305,#55286,#56634,#57778,#58374,#58640,#58822,#59055,#59303,#59487,#58400,#59283,#54791,#59134,#56206,#56199,#56670,#58923
Thanks to Our Contributors¶
Azure-Tang, zhaoyinglia, From00, JZ-LIANG, xysheng-baidu, SylarTiaNII, kuizhiqing, zhiqiu, FeixLiu, liuzhenhai93, GhostScreaming, pangengzheng, xiaoyewww, wanghuancoder, ForFishes, hitywt, danleifeng, tianshuo78520a, ykkk2333, houj04, lj970926, XiaociZhang, HarperCy, cqulilujia, runzhech, RuohengMa, Caozhou1995, kangguangli, heavyrain-lzy, zyfncg, SigureMo, YuanRisheng, lchdl, LiYuRio, AndSonder, Wennie396, zhangbo9674, liudongxue01, risemeup1, phlrain, winter-wang, yuanlehome, NALLEIN, Liujie0926, yuguo-Jack, gitliuyf, zh794390558, Aurelius84, 6clc, GGBond8488, xiaoguoguo626807, Wong4j, iosmers, xiaoxiaohehe001, LielinJiang, carryyu, Difers, yangxiaoyu14, xuxinyi389, cxxly, gongshaotian, jjyaoao, lijialin03, lxd-cumt, cyber-pioneer, HydrogenSulfate, MayYouBeProsperous, Charles-hit, Patrick-Star125, ScottWong98, huangjiyi, DrRyanHuang, jinyouzhi, BeingGod, Wanglongzhi2001, yangguohao, zyt1024, longranger2, 2742195759, megemini, thisjiang, kevincheng2, zhoutianzi666, Wangzheee, ming1753, tianhaodongbd, freeliuzc, zhenyun-li, MARD1NO, RichardWooSJTU, eee4017, leo0519, csy0225, wwbitejotunn, bukejiyu, jiweibo, iamsonderr, ckl117, ronny1996, zhanglirong1999, LLee233, ZHUI, wangxn12138, zhwesky2010, Courtesy-Xs, zoooo0820, llyyxx0413, Asthestarsfalll, zxcd, pkuzyc, idontkonwher, sneaxiy, hong19860320, ZibinGuo, leolishaohao, MuShangCC, zhupengyang, shentanyue, Travis-Lee, wz1qqx, frank-oops, newway, QingshuChen, zhangyk0314, HandSomeLEEw, Shixiaowei02, zhangyuqin1998, Xing-lil, zhhsplendid, jiahy0825, xinyu-intel, MarioLulab, 0x45f, Tom-Zheng, xingmingyyj, zhangbopd, gouzil, zeroRains, BiynXu, WintersMontagne10335, wuhuachaocoding, GreatV, chenwhql, deepllz, parap1uie-s, ozogxyz, FisherWY, changeyoung98, zhiboniu, YangQun1 dynamicheart, Xreki, liugddx, Lylinnnnn, YSF-A, zzjjay, YanhuiDua, lishicheng1996, USTCKAY, abenmao, cocoshe, HermitSun, ccsuzzh, sanbuphy, enkilee, RedContritio, Liyulingyue, zrr1999, chen2016013, Galaxy1458, chalsliu, mrcangye, XieYunshen, 
zhiheng-liu, haohongxiang, ZzSean, JamesLim-sy, yuehuayingxueluo, niuliling123, umiswing, sijunhe, littsk, SecretXV, zhurou603, zhangjun, caizejun, yangjianfengo1, vivienfanghuagood, Xinyu302, lizexu123, yghstill, Li-fAngyU, VigiZhang, co63oc, dhanush-2501, ooooo-create, PommesPeter, zeus2x7, akshatvishu, jzhang533, Sekiro-x, gumblex, BernieHuang2008, YibinLiu666, qiuwenbogdut, XavierZXY, MqLeet, zhangting2020, mingxu1067, Ainavo, SSKlearns, yuchen202, silverling, zade23, wenxiaohahaha, NKNaN, Tsaiyue, fsczz, Tomoko-hjf, rhmaaa, zbt78, Hhankyangg, wangzhen38, zhengqiwen1997, engineer1109, onepick, qili93, Rane2021, nemonameless, DesmonDay, RachelXu7, ceci3, lyuwenyu, liuruyan, LokeZhou, shiyutang, lanxianghit, feifei-111, Sahala08, sunzhongkai588, Kaedeharai, Candy2Tang, liyongchao911, whisky-12, InsaneOnion, yoyoIcy, KongAKun, linzeyang, MuhammadNizamani, eltociear, Ligoml, LUZY0726, Windfarer, FlyingQianMM, jeng1220, junelotus, zlsh80826, Vvsmile, Frida-a, TonibMw, guoshengCS, zhink, ZhangYulongg, AlbertVan, fengxin-hello, mjp9527, entired, DanGuge.
2.5.0 Release Note¶
1. Highlights¶
New dynamic-static unification architecture: Implement a new dynamic-to-static plus compiler execution model in combination with the basic operator, and complete the whole dynamic-to-static, combinator and neural network compiler optimization and acceleration process on the ResNet50&Bert model. For the dynamic-to-static, complete the whole graph fallback core function development, and support the fallback to dynamic graph training execution in case of dynamic-to-static failure. For the combinator, design a set of basic operator systems containing more than 150 basic operators, to achieve the python layer forward operator splitting mechanism and the reverse operator splitting mechanism of static graphs, to realize splitting of more than 70 commonly used forward and reverse operators. For the CINN compiler, fix the correctness bug, develop the key Pass, add manual schedule rules, achieve automatic generation of kernel codes, and improve performance of ResNet50 model by 12% and Bert model by 10%.
Operator architecture unification of PHI operator library: Unify all remaining 350+ operator kernels under the original operator system into PHI operator Library. Unify the way of defining operator in the original operator system into the operator definition form of PHI operator library (configuration of operator definition based on YAML), enhancing unity of the architecture, and reducing comprehension cost of framework development. Decouple all the Fluid header files that the PHI operator library depends on and compile them independently as dynamic link libraries to provide a lighter reuse of the operator library for secondary development of the framework. Continue to standardize and adjust unspecified operators, as well as operator kernels in the PaddlePaddle framework. It is easy for developers to understand and reduce the cost of accessing the hardware.
Full go-live of new executor for static graph: The new executor for static graph implements a number of functions and performance optimizations, and completes unification and replacement of the original multiple sets of old executors. The new executor becomes the back-end default execution engine for the static graph single card and distributed training python side entrance, as well as dynamic-to-static, control flow, CINN, etc. This significantly improves scheduling performance of the framework, and the functional architecture is clearer. Secondary development capability is significantly enhanced.
Python API supporting 0-dimensional tensor: clear semantics are defined between tensor of shape [1,] and tensor of shape [], and many API behaviors are fixed to support tensor of shape [], such as paddle.sum.
New environment adaptation: Adapt to CUDA 12. Compilation with gcc12 is supported.
2. Incompatibility Upgrade¶
PaddlePaddle API supports 0-dimensional tensor. PaddlePaddle previously used a 1-dimensional tensor with a shape of [1] instead of a 0-dimensional tensor, which is different from current mainstream habits. It increases development and debugging cost of the model, and sometimes leads to unintended errors. This release fixes 376 APIs that need to support 0-dimensional tensor, and implements tools widely used by the community such as EinOps. For example, in previous cases, output loss in model training was a 1-dimensional tensor. To take out or print the loss, it was often necessary to use code like loss.numpy()[0]. After this modification, output loss in model training is a 0-dimensional tensor. When using loss.numpy(), users can take out or print the loss. The code is short, easy to understand, and in line with the industry’s habit.
paddle.fluid API is fully decommissioned. According to the plan that has been previewed in the last version, 1116 paddle.fluid APIs and related internal interfaces have been decommissioned, and the remaining few related internal interfaces will be cleaned up in the next version. fluid API belongs to the historical APIs that PaddlePaddle 2.0 had planned to remove, but delayed the cleanup in consideration of compatibility and other factors. This decommissioning cleanup will not affect programs developed based on PaddlePaddle 2.0, and the PaddlePaddle API system will be more concise and easier to understand.
Complete code cleanup at the old version of the dynamic graph Python side. So far, the Python side only uses the new version of dynamic graph to call the C++ core logic.
In order to unify the training method of data parallel for static graph model, original single-process multi-card training method is abandoned, including paddle.static.ParallelExecutor and paddle.static.CompiledProgram().with_data_parallel() APIs, because this set of APIs only supports single-computer multi-card, does not support multi-computer multi-card, and the underlying execution performance is poor. It is recommended to use the multi-process multi-card training method uniformly, i.e., paddle.distributed.launch API for distributed training with data parallel. This upgrade affects only static graphs, and does not affect dynamic graphs and dynamic-to-static training. If you use the decommissioned API, please refer to the documentation on data parallel to modify model code. #50351,#50501,#51240,#51701,#51616,#51369,#52671
Remove the original adaptation code of Ascend NPU and Cambricon MLU in the framework, upgrade all to CustomDevice plug-in adaptation, and migrate the adaptation code of Ascend NPU and Cambricon MLU to PaddleCustomDevice warehouse.
3. Training Framework (Including Distributed)¶
Python API¶
API supporting 0-dimensional tensor¶
API input supports 0-dimensional tensor, involving paddle.reshape, paddle.trace, paddle.linalg.norm and other 286 APIs. #53208, #53592, #47074, #53186, #47677, #49357, #50237, #46555, #47219, #47501, #47858, #47961, #48058, #48007, #49755, #51024, #51566, #51899, #49813, #47812, #47849, #47251, #53125, #53828, #51265, #47689, #48452, #49072, #48638, #49175, #49279, #50857, #49805, #47734, #45992, #49616, #49959, #50536, #49544, #49842, #46909, #49361, #50169, #48314, #48735, #49122, #49122, #49177, #49501, #49562, #49340, #49550, #49596, #49730, #49667, #49692, #49854, #49845, #49803, #49889, #49904, #49518, #49884, #49880, #49862, #49921, #49260, #49929, #49570, #49882, #50213, #49780, #50271, #50289, #50293, #49735, #50433, #49847, #50635, #50950, #50947, #49460, #53087, #51687, #52185, #54649
API output supports 0-dimensional tensor, involving paddle.sum, paddle.min/max, paddle.any/all and other 90 APIs. #52891, #52861, #52775, #52850, #52843, #52857, #51721, #53051, #53192, #52739, #52741, #53175, #51889, #53199, #53242, #53421
In addition to supporting 0-dimensional tensors, fix the original non-standard code, and provide hints and compatibility handling for non-standard usage in model code. #51562, #51586, #51757, #52197, #54117
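The practical effect of 0-dimensional output support is on shapes: a reduction over all axes yields a tensor of shape [] rather than [1]. The following is a minimal pure-Python sketch of that shape rule, not Paddle's implementation; the function name reduced_shape and its parameters are hypothetical and chosen only for illustration.

```python
def reduced_shape(shape, axis=None, keepdim=False):
    """Return the output shape of a reduction such as sum/min/max.

    `shape` is the input shape as a list of ints. `axis` is None (reduce
    over all axes), an int, or a list of ints. Hypothetical helper that
    models the 0-D convention: a full reduction gives [], not [1].
    """
    ndim = len(shape)
    if ndim == 0:
        # A 0-D input stays 0-D whatever is reduced.
        return []
    if axis is None:
        axes = set(range(ndim))
    else:
        axis_list = [axis] if isinstance(axis, int) else list(axis)
        # Normalize negative axes such as -1.
        axes = {a % ndim for a in axis_list}
    if keepdim:
        return [1 if i in axes else d for i, d in enumerate(shape)]
    return [d for i, d in enumerate(shape) if i not in axes]


print(reduced_shape([2, 3]))                         # full reduction
print(reduced_shape([2, 3], axis=0))                 # partial reduction
print(reduced_shape([2, 3], axis=-1, keepdim=True))  # keep reduced axis
```

Model code that indexed the result of a full reduction with [0] is the kind of non-standard usage that needs adjusting once the output becomes 0-dimensional.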
new API¶
Add paddle.autograd.jacobian and paddle.autograd.hessian APIs for scientific computing. #53331
Add sparse computing APIs, for example paddle.sparse.reshape, paddle.sparse.sum and paddle.sparse.slice. #46694, #51513, #53794, #51406
Add other APIs, for example paddle.optimizer.LBFGS, paddle.index_put and paddle.logaddexp. #53314, #51912, #52886, #50843, #47282, #52284
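As an illustration of what a logaddexp-style API computes, the sketch below evaluates log(exp(x) + exp(y)) for Python floats in a numerically stable way. It is a scalar sketch of the mathematical definition only, assuming the usual max-shift formulation; it is not Paddle's kernel and does not operate on tensors.

```python
import math


def logaddexp(x, y):
    """Numerically stable log(exp(x) + exp(y)) for Python floats."""
    if x == y:
        # Also covers equal infinities, where x - y would be NaN.
        return x + math.log(2.0)
    hi, lo = (x, y) if x > y else (y, x)
    # exp(lo - hi) is at most 1, so it cannot overflow.
    return hi + math.log1p(math.exp(lo - hi))


# A naive math.log(math.exp(1000) + math.exp(1000)) overflows;
# the shifted form does not.
print(logaddexp(1000.0, 1000.0))
```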
Dynamic graphs¶
New features¶
Add paddle.nn.utils.clip_grad_norm_ for gradient clipping support and paddle.Tensor.data_ptr for getting the address of the Tensor data’s memory/GPU memory. PR49935, PR48235, PR49173
Add the saved_tensors_hooks mechanism, for temporary storage and retrieval of forward Tensor used in backward computation. PR45763, PR46215, PR48124
Tensor supports pickle, for serialization of Tensor. PR47025, PR48179
Add debug logs, to print forward Python stacks when nan/inf appears in backward computation. PR53217, PR52639, PR52729
Add higher-order differentiation support for expand_v2, tile, concat, assign, and slice. PR45941, PR45942, PR45940, PR45879, PR45960
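Gradient clipping by norm, as offered by an API such as paddle.nn.utils.clip_grad_norm_, rescales all gradients by one common factor so that their global norm does not exceed a threshold. The sketch below shows that arithmetic on plain Python lists. It is not Paddle's implementation: the function name clip_grad_norm, the eps term, and the list-of-lists gradient layout are assumptions made for illustration.

```python
import math


def clip_grad_norm(grads, max_norm, norm_type=2.0, eps=1e-6):
    """Scale gradients in place so their global norm is at most max_norm.

    `grads` is a list of lists of floats (one inner list per parameter).
    Returns the global norm measured before clipping.
    """
    flat = [abs(g) for grad in grads for g in grad]
    if not flat:
        return 0.0
    if math.isinf(norm_type):
        total = max(flat)
    else:
        total = sum(v ** norm_type for v in flat) ** (1.0 / norm_type)
    # Never scale gradients up, only down.
    coef = min(1.0, max_norm / (total + eps))
    for grad in grads:
        for i in range(len(grad)):
            grad[i] *= coef
    return total


grads = [[3.0], [4.0]]           # global L2 norm is 5
norm_before = clip_grad_norm(grads, max_norm=1.0)
print(norm_before, grads)
```

Using one factor for all parameters preserves the direction of the overall gradient, which is why the norm is computed globally rather than per parameter.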
Improvements¶
bug fix¶
Fix bugs in some operators, including batch_norm, slice, set_value, scale, multinomial, adam, conv, transpose2_grad, conv2d_transpose_double_grad. PR47802, PR47634, PR47349, PR46124, PR46147, PR50388, PR48626, PR48519, PR50386, PR48432, PR51851
Fix some PyLayer bugs. PR51740, PR47154, PR47323, PR54041, PR48533
Ensure sync_batch_norm executes in order during backward computation, to avoid hangs or precision errors caused by misordering. PR52268, PR52860, PR52779
Fix a bug of linspace under AMP. PR46088
Fix Python C API’s incorrect call that causes Windows to crash. PR46833
Fix the bug that DataLoader may fail to delete /dev/shm. PR48511
Fix some bugs of paddle.grad. PR47151
Add error message for operators that do not support higher order differentiation. PR47231
Add numpy array support for Python operators. PR48229
Delete one of the two element_size APIs. PR49631
Fix the crash that occurs when enabling VLOG in the old dynamic graph. PR47115
For XPU, change to d2h+h2d in case of d2d, to solve the multi-threading problem. PR48373
Performance optimization¶
Move Python operators down to C++ implementations to improve API performance. APIs of this class gain a 3x to 6x performance improvement after the move. PR45811, PR46326, PR46329, PR46520, PR46542, PR46565, PR47060, PR47077, PR47174, PR47315
Optimize the Optimizer CPU scheduling performance to reduce the GPU gap caused by the Optimizer phase. PR49787, PR50188, PR51340, PR49864, PR50158, PR50335
Move API logic that can be implemented in C++ down to C++, to improve API performance. PR46412, PR46190
Optimize unnecessary call logic on Python side under dynamic graph, to improve API performance. PR46221, PR49473, PR49574, PR49589, PR49612, PR49717, PR49733, PR49823, PR49508, PR46840
Optimize use of Allocator to improve dynamic graph API scheduling performance. PR47125, PR48548, PR50995, PR47731
Optimize fused_attention operator performance. PR48902
For optimizer’s _add_accumulator, if device is CPU and under dynamic graphs, use full to initialize var directly. PR48189
Prune unnecessarily executed subgraphs in backward graphs to improve performance. PR47827
Optimize performance of initializers. PR46033
Add fused dropout add operator to improve computation performance when dropout and add are used together. #52903
Static graphs¶
The new static graph executor is now fully live.¶
The new executor for static graphs implements a number of functions and performance optimizations, and completes unification and replacement of the multiple sets of old executors. The new executor becomes the default back-end execution engine for the static graph single-card and distributed training Python-side entry points, as well as for dynamic-to-static, control flow, CINN, etc. This significantly improves scheduling performance of the framework, makes the functional architecture clearer, and significantly enhances secondary development capability. #45913,#46025,#48911,#50239,#45696,#46092,#48158,#51389,#49708,#49275,#48789,#49939,#51149,#52652
Operator library¶
Enhance functions of customized operators¶
Add new capabilities to the custom extension mechanism: C++ extension functions can be bound to the Python side, further enhancing the framework’s secondary development capabilities. The extension supports custom hardware using the custom operator mechanism, to meet hardware manufacturers’ need to implement operations that Paddle does not already provide. The extension supports high-level mechanisms in custom operators, such as inplace, vector<Tensor> output, and optional<Tensor> input. Optimize scheduling performance of custom operators in dynamic graph mode, with a 25.4% performance improvement for operators with multiple input parameters. Add new commonly used operators and APIs for custom operator Tensor extensions. Support chained calls and simplify code writing. Optimize the operator kernel selection mechanism. Improve the logic of some operator kernels, enhance supported data types and optimize performance. Add and improve 100+ XPU kernels. Fix 170+ bugs. #49222, #51773, #51923, #53080, #50731, #50563, #50840, #50983, #51713, #48733, #50558, #50764, #51973, #52216, #51027, #50745, #50756, #50886, #50813, #50869, #51085, #51646, #51620, #51844, #52421, #52872, #52597, #50582, #52114, #52915, #50928, #48272, #48702, #52191, #52191, #47374, #47375, #47378, #54126, #47638, #47661, #50606, #53528, #50599, #51727, #50825, #50773, #50979, #53336, #53555, #53716, #53753, #53981, #53977, #53980, #54043, #54066, #52866, #53043, #53325, #54323, #54367, #51353, #53749, #50013, #47570, #50997, #51241, #49537
Unification of operator architecture¶
Unify all remaining 350+ operator kernels under the original operator system into the PHI operator library. Unify the way operators are defined in the original operator system into the operator definition form of the PHI operator library (operator definitions configured in YAML), enhancing unity of the architecture and reducing the comprehension cost of framework development. Decouple all Fluid header files that the PHI operator library depends on and compile the library independently as a dynamic link library, to provide lighter reuse of the operator library for secondary development of the framework. Continue to standardize and adjust non-standard operators and operator kernels in the PaddlePaddle framework, making them easier for developers to understand and reducing the cost of hardware integration. #47856, #49328, #49138, #52014, #52044, #52116, #52486, #52101, #52882, #53003, #53034, #51914, #49116, #52626, #52878, #52879, #52880, #52875, #51600, #51601, #51590, #51887, #51891, #52036, #52130, #52134, #51951, #51886, #52274, #52263, #51913, #52145, #52347, #52370, #52437, #52424, #52231, #52522, #52529, #52802, #52799, #52855, #52711, #52940, #53309, #47817, #48001, #48063, #48049, #48168, #48415, #48696, #48970, #50183, #50407, #50498, #50419, #50282, #50870, #50911, #50865, #51288, #53735, #47248, #47787, #52202, #47579, #49444, #45772, #51264, #51634, #51631, #47385, #46342, #47510, #47532, #47702, #47860, #49470, #50358, #49121, #50190, #52374, #52372, #52375, #52371
Dynamic-to-static plus combinator¶
New features¶
Add the combination rules for combinators such as dropout, silu, stack, relu, expand, unsqueeze, pow, squeeze, meshgrid, batch_norm, layer_norm, group_norm, instance_norm, full_like, split, split_with_num, gelu, mean, flatten, rsqrt, hardswish. #50497, #50838, #50861, #50819, #50810, #51527, #51070, #51539, #51061, #49894, #50422, #51874, #51341, #50295, #50298, #50672, #51432, #51003
Add the vjp rule for combinators such as gather_nd, reduce_max, group_norm, relu, reduce_max, gather, topk, sqrt, elementwise_pow, softmax, batch_norm, prod, multiply, expand, div, relu, slice, cumsum, sigmoid, layer_norm, sin, cos, roll, instance_norm, abs, assign, tile, scatter_nd_add, erf, floor, log, silu, leaky_relu, pad #50966, #51653, #52663, #51742, #52203, #50794, #50305, #50786, #50679, #51045, #51230, #51474, #51283, #51238, #49831, #51838, #50771, #50565, #51768, #51750, #51748, #52532, #52935, #50963, #51430, #53141, #52469, #50436, #51059, #51296, #52533, #53374
Add the second-order differentiation rule for combinators such as matmul, tanh, and elementwise #50452, #52192, #53014
Add the bf16 datatype support for combinators such as exp, reduce_mean, softmax, divide, cast, layer_norm, prod, meshgrid, expand_as, dropout, concat, gather_nd, elementwise_max, elementwise_pow, reduce_max #54263, #54236, #53865, #54175, #54399
Add support for assigning semantics to containers in control flow in dynamic-to-static. #51248
For to_static, add full graph fallback function. When dynamic-to-static conversion fails, the whole graph can fall back to the dynamic graph mode of execution. For the fallback mechanism, add the set_eval_frame API. #50111, #52006
For to_static, support the combinator mechanism. Support the scenario of using register_hook under to_static decoration; #49836, #52948, #53572
Add a backend parameter to the to_static API. It can be specified as CINN or None. When the parameter is specified as CINN, the CINN compiler will be used to accelerate training and inference. #52596
Add the code automatic generation function for the primitive API. Based on operator definitions in ops.yaml and legacy_ops.yaml, automatically generate code for the primitive API. Automatically generate the Tensor computation API. #50315, #49654, #50642
Add the function of forward combination of operators. By registering the combination rules of forward operators, it can split forward operators into base operators. #49605
Add the combinator switch. You can set environmental variables in shell to split operators in different ways. #50309
Add OpTest combination test function to guarantee accuracy of operators. Add elementwise class base operator unit test. Add batch_norm CINN unit test. #50509, #50807, #52815
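The forward combination mechanism described above registers a rule per operator that rewrites it as a sequence of base operators. The sketch below illustrates the idea with a tiny registry and a softmax rule over plain Python lists. Every name in it (PRIMITIVES, COMPOSITE_RULES, register_composite, softmax_rule, run) is hypothetical and invented for this illustration; it is not the framework's registration API.

```python
import math

# Hypothetical base ("primitive") operators on flat lists of floats.
PRIMITIVES = {
    "max": lambda xs: max(xs),
    "sum": lambda xs: sum(xs),
    "exp": lambda xs: [math.exp(x) for x in xs],
    "sub_scalar": lambda xs, s: [x - s for x in xs],
    "div_scalar": lambda xs, s: [x / s for x in xs],
}

# Hypothetical registry: composite operator name -> decomposition rule.
COMPOSITE_RULES = {}


def register_composite(name):
    """Decorator that records a decomposition rule under `name`."""
    def decorator(fn):
        COMPOSITE_RULES[name] = fn
        return fn
    return decorator


@register_composite("softmax")
def softmax_rule(xs, trace):
    """Express softmax using only the base operators above."""
    def call(op, *args):
        trace.append(op)  # record which base operator ran
        return PRIMITIVES[op](*args)

    shifted = call("sub_scalar", xs, call("max", xs))
    exps = call("exp", shifted)
    return call("div_scalar", exps, call("sum", exps))


def run(op_name, xs):
    """Run a composite operator through its rule; return output and trace."""
    trace = []
    out = COMPOSITE_RULES[op_name](xs, trace)
    return out, trace


out, trace = run("softmax", [1.0, 2.0, 3.0])
print(trace)
```

Once an operator is expressed this way, a backend or compiler only has to support the base operators, which is the motivation for splitting forward operators.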
Improvements¶
Add combinator to support FP16 operation and AMP O1 operation. Add AMP logic for softmax and layer_norm operators. #52397, #52598, #51473
Simplify combination rules and vjp rules of the combinator batch_norm. #54012, #51827, #51933
Optimize combination rules for combinators, and improve performance of combination rules that contain scalars. Optimize log printing for combinators. #51960, #50160
Combinator supports the jit.save API. Add custom VJP rule API. #52344, #50885
Remove the overwrite parameter from combinator gather_grad. #52707
Clean up dynamic-to-static code style, optimize error message, and standardize logs. #48637, #46128, #52527, #46800,#46415
For dynamic-to-static, call append backward to get the grad var name, to fix the error in high-order gradient computation. #53250
Upgrade the dynamic-to-static function, and clean up the temporary directory of to_static to speed up code conversion. Enhance to_static to automatically skip internal API. Support use of to_static decorator in the program. #47102, #50596, #45768
For dynamic-to-static, optimize print function conversion to support printing Tensor parameters at the networking stage. Upgrade the parameter collection mechanism. #48672, #50336
bug fix¶
For the combinator, fix cmake compilation errors. Fix cuda 12 test errors. Fix bugs of operators such as meshgrid, expand_as, concat, conv, and arange. #49643, #54622, #53951, #53951, #53350, #51486, #52764
For the combinator, fix bugs in a number of scenarios such as rank=1, shape=-1, amp, and multi-process. #51413, #51435, #50518, #47301
For the combinator, fix bugs in automatic code generation of composite grad maker and static prim api. Fix bugs that op creation attributes are missing, and some combination rules do not take effect. #50854, #51445, #50780, #52120
Fix some other bugs for combinators #50086, #51208, #51577, #53598, #47500, #52119, #50397, #50527, #50788, #51014, #52154, #52752
For dynamic-to-static, fix the bugs of dataloader, cond input dict, transformer import, T5 model memory leak, and grad var name parsing error. #49821, #47299, #50776, #50883, #51100, #51464, #51966, #52110, #52821
For dynamic-to-static, fix the bugs of Lazy initialization, Windows training, is_paddle_func failure, and recurrent op failure to delete pass. #50785, #52580, #51585, #51763, #51763
Distributed training¶
Dynamic graph distributed training¶
Remove the distributed sharding API in the old dynamic graphs. #49334
Upgrade fleet to distributed directory. #50834
Optimize log printing for distributed strategies. #47761
For re-computation, support hook mode, inplace function, and stop_gradient mode. Support more flexible use. #48471, #47985
Data parallel
For data parallel, support no_sync API for blocking parameter gradient communications. Support the parameter synchronization function. Add scale API to scale parameters. #47536,#51895,#47519
Fix the problem of video memory leakage under data parallel. #47369,#47444,#48668
Support sparse parameter gradient synchronization. #52785
Pipeline parallel
Optimize pipeline performance, and remove communication wait. Optimize scheduling and communication overlap. #46209,#54003,#54312,#53384,#54310,#46399,#46483,#46780,#46116
Support custom sharding, log printing, random seed setting, and timer elapsed time printing. #53344, #47670,#47336,#52656,#53831
Optimize video memory release logic in pipeline scheduling, and release intermediate variables and data in advance. #54557, #47199,#47497,#48045,#54672
Support VPP mode and model saving for pipeline parallel. #54196, #52927,#47801,#45922,#47242
Grouping sharding parallel
sharding stage2 parallel supports the quantization function, hybrid parallel training, gradient accumulation, XPU hardware, BF16 low precision computation, optimizer learning rate setting, offload function, and data parallel. #47169,#47535, #46795,#47711,#48310,#46846,#48857,#49196,#49931,#47114,#49767
Optimize sharding stage2 performance. Support the communication computation overlap. #46495,#46894
sharding stage3 supports shared parameters and untrainable parameters. #48695,#48577
Tensor model parallel
Optimize tensor model parallel performance to reduce performance impact of stream sharding. #47715,#51617
Support parameter, optimizer shapes, gradient synchronization. #51428,#53254, #53335,#45803,#46303,#52293
Optimize tensor model parallel operators such as c_embedding, softmax_with_cross_entropy. #53197,#53547,#53541,#52789,#46491,#52742,#53419
Launch
Communication library
Add custom mixed parallel communication groups, topology information printing, and custom communication topology order. #47021,#54000,#51781
Remove communication library dependency on Place information #47857
Add GLOO operator support to the communication library, including send/recv/gather. #52221, #52334,#49084
Disable backward computation of communication operators. #47636
Add communication library static shape check, to help determine whether communication volume is matched. #48256,#48915,#48646
Support communication python object type, BF16 type, alltoall, reduce, allgather, group call, global gather, broadcast, and scatter communication methods. Support XPU device communications. #51765,#45844,#48059,#48115, #48339,#49252,#49451,#50085,#50701,#48208,#48736,#51762,#52495,#53514,#48232,#49896,#49941,#45584
Add support for communications between computational streams. #46182,#46023,#46295,#46761,#47481,#47740,#47976,#48163,#48396,#48308,#47110,#53089
Optimize communication library TCP linking time. #49810,#47184
Automatic parallel¶
Improve semi-automatic parallel for static graphs:
Add FLOPs computation function for multiple operators, and add computation Cost modelling based on FLOPs. #48083,#47978,#47595,#48083,#48084,#47816
Improve API ease-of-use. Improve the DistAttr, Process Mesh, Engine API, information printing, and input and output modules. Implement the new Engine cost API, which can be used to theoretically analyze model running time and video memory overhead. #47503,#46416,#46554, #46633,#49214,#53848,#46552, #47043, #49665, #52912, #45776, #47263
Optimize the generality and ease of use of Pass. Support more scenarios, and reduce time spent on Pass pre-analysis. #46519,#47358,#46391, #51035
Enhance debugging capabilities with distributed randomness control mechanisms and hybrid parallel precision alignment tools. #52903,#49865
Support automatic sharding of inference generation task networking. Adapt special usage of control flow and conditional block in the generation model. #46771, #54067
Improve grad_clip to support load balancing in data parallel scenarios. #49510, #49249
Semi-automatic parallel performance improvement for static graphs:
Add the Sharding Pass automated communication Fuse and multi-streams communication functions, with throughput performance improved by 26% on two machines for GPT 6.7B model. #48604, #47180,#46180
Add Recompute optimization strategy tuning function. Select optimal recompute checkpoint settings based on video memory and model size. #48608,#47846,#49010
For the pipeline parallel, add 1F1B scheduling optimization Pass #54260, #45915
Optimize data parallel. Support optimizations such as fused communication and communication computation overlap, with performance improved by 5% in GPT 1.3B model. #48092,#45643,#49744, #47578
Optimize Reshard module concat performance. Reduce the number of concat operations in some scenarios. #47809
Upgrade performance of the mixed precision optimization Pass, support BF16 low precision, and adapt automatic hybrid parallel to while-loop control flow. #51285,#51147, #49219, #49079
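The 1F1B scheduling mentioned in the list above refers to the widely used "one forward, one backward" pipeline ordering. The sketch below generates that ordering for a single pipeline stage. It shows the textbook schedule only and says nothing about how the Pass is implemented; the function name one_f_one_b and its parameters are hypothetical.

```python
def one_f_one_b(stage, num_stages, num_microbatches):
    """Return the ("F", i) / ("B", i) steps run by one pipeline stage.

    `stage` is 0-based. Earlier stages run more warm-up forwards so that
    the last stage can start its first backward as early as possible.
    """
    warmup = min(num_stages - stage - 1, num_microbatches)
    steps = [("F", i) for i in range(warmup)]
    fwd, bwd = warmup, 0
    # Steady state: alternate one forward with one backward.
    while fwd < num_microbatches:
        steps.append(("F", fwd))
        fwd += 1
        steps.append(("B", bwd))
        bwd += 1
    # Cool-down: drain the remaining backwards.
    while bwd < num_microbatches:
        steps.append(("B", bwd))
        bwd += 1
    return steps


def peak_in_flight(steps):
    """Largest number of micro-batches whose activations are alive at once."""
    live = peak = 0
    for kind, _ in steps:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


for s in range(4):
    sched = one_f_one_b(s, num_stages=4, num_microbatches=8)
    print(s, peak_in_flight(sched), sched[:6])
```

In this ordering the number of in-flight micro-batches on a stage is bounded by the number of stages after it plus one, rather than by the total number of micro-batches, which is the source of the memory saving.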
Improve function of fully automatic parallel for static graphs:
Parameter server¶
Clean up the all list in ps directory, in which API is not exposed #51289
Clean up cvm operator #48989
For GPUPS, add support for AFS. #46611
Lower the PGLBOX2.0 log level, fix the stuck issue of dense parameters, fix the bug that barrier does not take effect, and add the get_epoch_finish Python-side interface. #49946,#50166,#50349
Support switching GPUPS runs to a specified mode. #51115
Fix the GPUPS optimizer selection bug, fix reader reading problem, and fix RPC compilation problem. #47026,#47192,#49878, #46356,#46575,#49389,#46258,#50136
Add rocksdb compilation method. #46074
CUDA¶
New features¶
Improvements¶
Add mixed precision strategy and optimize precision:
Add and optimize FP16 and BF16 data type support for more than 200 operators in the framework, including logsumexp, reduce_max, cumprod, sync_batch_norm, comparison OPs, etc. Carry out precision optimization and unit tests for all FP16 and BF16 operators. Improve the unit test framework for low-precision operators, to ensure there is no loss of accuracy in the process of large-model training. (#51193, #51114, #45817, #52862, #52919, #52921, #46413, #48205, #54193, #48041, #48121, #46364, #51153, #53023, #53079, #53137, #46212, #50908, #52555, #51582, #47897, #45601, #53522, #52666, #50101, #48315, #50847, #50905, #50906, #50909, #50916, #50917, #50920, #50919, #50904, #50918, #50938, #50858, #50933, #50945, #50936, #51168, #51493, #50924, #50923, #50926, #50925, #50930, #53284, #53286, #53285, #50976, #50915, #50915, #48192, #50993, #50998, #51380, #51137, #51106, #51197, #51159, #51552, #51151, #51005, #51565, #51036, #51185, #51791, #51083, #51694, #51689, #51009, #51051, #51532, #51978, #51903, #51888, #52016, #52035, #52184, #52018, #51787, #51640, #52172, #52193, #51160, #51809, #51678, #52158, #51015, #52240, #52276, #52233, #52220, #52107, #52282, #52311, #52315, #52357, #52256, #51649, #52413, #52369, #51837, #52112, #51819, #52388, #52411, #52521, #51300, #51117, #52380, #52317, #51263, #52668, #52259, #50999, #52407, #52288, #52845, #50953, #52667, #52582, #52426, #51884, #52630, #52136, #52604, #51615, #51275, #52898, #52918, #52572, #52683, #52956, #52963, #52954, #52444, #52314, #52887, #52195, #53100, #52961, #52953, #53111, #53549, #53736, #52920, #53195, #53535, #53876, #53785, #53722, #54285, #54232, #53922, #47277, #50811, #54571, #50129, #50340, #50848, #50849, #50868, #50878, #50929, #50939, #50973, #50913, #51145, #51090, #51098, #51094, #51216, #51736, #51684, #51925, #54030, #50700, #52264, #51069, #51101, #51286, #53582, #49869)
AMP optimization: Comprehensively upgrade and optimize ease of use, accuracy stability and debuggability of AMP training, to better support acceleration of large model training. In terms of ease of use, unify the API for dynamic and static graphs. Add new conversion interfaces such as model.float(), model.float16() and model.bfloat16(). In terms of accuracy stability, enhance automatic adjustment of the strategy for BF16 type. Optimize blacklist settings. Enhance support of the multi_precision function by optimizer operators Adagrad, Adamax, Adadelta, and RMSProp. In the O2 mode, improve master grad mechanism, add type promotion mechanism and a new parameter for the specific module to use float32 computation to guarantee accuracy. In terms of debuggability, add the paddle.amp.debugging module to provide operator statistics, outlier detection, and accuracy comparison. ( #50132, #50078, #50131, #49705, #52936, #52871, #53289, #53362, #54240, #53768, #48041, #47672, #48843, #49391, #51635, #45541, #53742, #51020, #51063, #52514, #50940, #52936, #53439, #53712, #48238, #52215, #53012, #52918, #54571)
For GroupNorm operator, add support for NHWC data format. (#47533)
For index_put operator, add support for mixed data types of bool and int. (#54195)
Add sparse.is_nan API for determining whether a sparse tensor contains a NaN element. (#51513)
bug fix¶
Fix computation errors and stack overflow in several operators such as trace, roll, dropout_nd, and log_softmax, and fix some unit test errors. (#50243, #52012, #53795, #53149, #53654, #51054, #49373, #53038)
Fix the problem that conv operator exhaustive search does not work in some scenarios. (#47065)
Fix timeout problem of collective_reduce_scatter and other operators on A100. (#54513)
Fix the problem of attribute error in FusedLinear unit test. (#50359)
Fix the OOM problem that may occur when using Profiler. (#46089)
Performance optimization¶
Further optimize GPU Kernel and eigen implementations of the framework’s large number of operators, including max_pool3d, dropout, adaptive_pooling, depthwise_conv2d, transpose, eigh, broadcast class computations, reduce class computations, prelu, logsumexp, and sparse, to achieve better performance in more configuration scenarios. (#45820, #45959, #45934, #46332, #46287, #47233, #48855, #48560, #49419, #49748, #50348, #52401, #51131, #51141, #51479, #51835, #52509, #52482, #52700, #53112, #53659, #53658, #53154, #54071, #53622, #52952, #46046, #46119, #45946, #47212, #47791, #47454, #45230, #48899, #33051, #49040, #48992, #49086, #50808, #46431, #50931, #48056, #46071, #49231, #38660, #50287, #46111, #46997, #45854, #47738, #48635, #50353, #50362, #51934, #54045, #46679, #52093, #52969)
Provide more fusion implementations and related fusion passes, such as fused_feed_forward, gather-gemm-scatter, matmul + bias, layernorm_shift_partition + element_add, and elementwise class fusion, to further improve performance of models that use these patterns. ( #50423, #50091, #50364, #53017, #50755, #50050, #47099, #48848, #49383, #50809, #52361, #52028, #48439, #49009, #51427, #52731, #51805)
Intermediate Representation¶
In order to guarantee stability and reduce R&D cost of the IR system, we have developed a new IR system for PaddlePaddle. Complete basic data structure definition, operator definition generation, and execution system adaptation. In order to better support higher-order requirements of scientific computing scenarios, complete higher-order adaptation of operators such as silu and cast.
Complete the definition of IR data structure, including type system and operator definition. Implement execution adaptation with phi kernel. #51112, #51992, #50412, #53557, #53953, #50959, #54250, #54197, #54289, #51636, #52846, #53988, #54143, #54035, #54052, #54340, #54356, #54068, #53894, #53707, #54185, #54031, #54220, #54275, #54281, #54186, #54259, #54124, #54292, #48068, #53978
Improve the basic pass setup, including basic pass definition, pass registration management. #54023,#54170, #54170, #54308, #54348, #54385
Improve adaptation of higher-order operators, including modification of the basic module and adaptation of the silu and cast operators. #52005, #53425, #53417, #53417, #53498, #53171, #53632, #53605, #53746, #53874, #54164, #45888, #46024, #46446, #46960
CINN compiler¶
New features¶
Add CINN support for 0D-Tensor. At present, in order to cooperate with the upgrade of the main framework, it is supported by adding pass temporarily. We will replace and upgrade the solution later. (#53382, #53955, #54064, #54118, #54216, #53454)
Add CINN support for int8/uint8/int16/uint16/bf16 data types. (#50566, #53637)
Add support for the CINN expand operator. (#46776)
Add CINN support for PaddleInference. (#45009)
Improvements¶
For CINN compiler, pass skip_gc_vars attribute to CINN subgraph. CINN adds fetch operator for skip_gc_vars. #49471, #49553
For CINN compiler, conv2d and conv2d_grad do not use cinn operator by default. #51645
Add build_cinn_pass to BuildStrategy for use in dynamic-to-static (#49496)
Add reshape operator to perform unit test under combinator mechanism. (#51276)
Change version of the main framework binding CINN from fixed commit to develop. (#49775)
Set default Target parameter for CINN. (#50182)
bug fix¶
Fix the problem of inconsistent operator order after topology sorting during CINN symbolization. (#52556)
Fix some operator computation errors, accuracy degradation, and unit test related problems. (#53859, #54261, #46801, #53676, #53772)
Fix the problem of CINN support for float16 type. (#48249)
Fix the problem in build_cinn_pass. (#46843)
Fix the problem of no data area due to incorrect GC when CINN is turned on during combinator + dynamic-to-static. (#50116)
Fix the problems of compiler dropout amp error, combinator resnet error, and inplace variable not found #51688, #52813, #51769
Hardware support¶
CustomDevice¶
Add support for the distributed strategies MP/Sharding/PP/MoE and for recompute on the training side. Add support for the distributed strategy MP on the inference side. Hardware accessed through CustomDevice, such as Ascend NPU and Cambricon MLU, automatically inherits all new distributed strategies added by CustomDevice without any code changes. #52872, #54384, #53220, #54572, #54573, #54676, #53044, #53719, #53701, #53702, #53703
Add API paddle.device.is_compiled_with_custom_device. It is convenient for users to judge whether the current environment supports the plug-in device backend of a certain hardware. #49271
Add the environment variable CUSTOM_DEVICE_BLACK_LIST, so that blacklisted operators automatically fall back to running on the CPU. #50409, #50666
Optimize CustomDevice performance by reducing number of calls to get_device_count interface in runtime. #46963
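The blacklist behaviour of CUSTOM_DEVICE_BLACK_LIST can be pictured as a lookup performed before an operator is dispatched. The sketch below is illustrative only: the comma-separated value format, the helper names parse_black_list and select_backend, and the example operator names are all assumptions, and the real dispatch happens inside the framework rather than in user code.

```python
import os


def parse_black_list(env=os.environ):
    """Read operator names from CUSTOM_DEVICE_BLACK_LIST.

    Assumes a comma-separated list; surrounding spaces and empty
    entries are ignored.
    """
    raw = env.get("CUSTOM_DEVICE_BLACK_LIST", "")
    return {name.strip() for name in raw.split(",") if name.strip()}


def select_backend(op_name, device, black_list):
    """Pick "cpu" for blacklisted operators, else the custom device."""
    return "cpu" if op_name in black_list else device


# Hypothetical setting, passed as a dict instead of the real environment.
fake_env = {"CUSTOM_DEVICE_BLACK_LIST": "softmax, top_k"}
black_list = parse_black_list(fake_env)
for op in ("matmul", "softmax", "top_k"):
    print(op, "->", select_backend(op, "npu", black_list))
```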
KUNLUNXIN XPU¶
For the training side, use the new version of dynamic graph, and add support for the distributed strategies MP/Sharding/PP, the recompute function, and the communication library. For the inference side, add support for the distributed strategy MP and for the XPU FasterTransformer operator acceleration library. #49531, #49815, #48897, #50717, #51082, #49757, #51399, #50329, #48369, #47838, #48076, #47882, #48961, #49043, #49749, #49806, #53427, #48470, #49207, #52296, #51785, #47168, #47445, #50200, #49934, #50792, #52228, #53337, #53389, #53496, #53609, #53697, #53496, #53720, #53734, #54172, #46227
4. Deployment Direction(Paddle Inference)¶
New features¶
In Paddle TensorRT, support sharing video memory among the TensorRT engines of multiple subgraphs, or among the TensorRT engines of different Predictors, to save video memory. #45842, #47631
For the C++ API, add APIs to obtain the shape and data type of input Tensors and of output Tensors. For the C API, add SetExecStream, EnableMkldnnInt8 and other APIs that already exist in C++, for service deployment. #49758
Add the paddle.inference.Predictor.register_output_hook() API. It supports printing the output of each layer under GPU inference for debugging, and can be used in control flow models such as While. Note that the API does not support Paddle-TensorRT. #54433, #47050, #54254
Paddle Inference Predictor API supports paddle::Tensor as input and output, so users can directly reuse the PaddlePaddle dynamic graph for pre-inference and post-inference processing. (#50445)
Enhance Paddle TensorRT’s dynamic shape capability: when the config.enable_tuned_tensorrt_dynamic_shape() API is called without any parameters, the TensorRT Engine is built at runtime, and shape information no longer needs to be collected before running. To avoid rebuilding at runtime, the first several runs should cover the minimum and maximum shapes. #52162
Paddle-TensorRT supports model input in NHWC format. #49633
Extend the config.Exp_DisableTensorRtOPs API to disable access to TensorRT by specifying the name of the Tensor variable. #49497
Improvements¶
Enhance GPU mixed-precision inference (non-Paddle TensorRT scenarios). Config.enable_use_gpu is enhanced so that the precision type can be set. #47993
Support double type input for inference. #51786
TensorRT operators do not support the INT64 type, so models containing INT64 data failed to run. Paddle-TensorRT is enhanced to convert such models automatically and run them in INT32 when the model contains INT64 data. #45547
Paddle-TensorRT supports more operators into TensorRT inference, including:
expand_v2, gather_nd, rsqrt, sign, not, onehot, arg_min, temporal_shift, expand_as_v2, setvalue, index_select, round, acosh, square, reduce_max, not_equal, reduce_min, reduce_prod, grid_sampler, elementwise_mod, pad3d, greater_equal, bitwise, cumsum, matmul_v2, reciprocal, where, bmm, take_along_axis, less_than, greater_than, logical_or, logical_xor, logical_and, less_equal, range, reduce_all, reduce_any, fill_any_like, pow
#47002, #47589, #48223, #48557, #48655, #49113, #51207, #51028, #50341, #51498, #48534, #48684, #49393, #49615, #50934, #50974, #50986, #52000, #51971, #52518, #44918, #48230, #47820, #46877, #48358, #48592, #48697, #53088, #47974, #53462
Enhance Paddle-TensorRT mapping operators strided_slice, instance_norm, prelu, argmax, cast, nearest_interp_v2, elementwise, bilinear. #46819 ,#47998 ,#48043 ,#48998 , #49675 , #47495
Some Paddle-TensorRT operators (scale, square, sum, swish, expand_as_v2, prelu, gelu, hard_swish, hard_sigmoid, leaky_relu, softmax, stack, clip, cast, flatten_contiguous_range, unary, equal, elementwise_op) support 0-dimensional Tensor. #53660, #53627, #53634, #53714, #53729, #53769, #53506, #53704
Support compilation for versions earlier than GCC12 + CUDA 12.0. #50106
Paddle-TensorRT’s DeformableConv plugin supports dynamic Shape input. #50698
For Paddle-TensorRT, add plugin support for lookup_table operator. #46613
Add config.enable_low_precision_io() API to support low-precision type input in Paddle-TensorRT scenario. #52485
Paddle-TensorRT’s LayerNorm plugin supports FP16 computation. #45043
Predictor’s input data paddle_infer::Tensor supports bool type. #49388
Enhance Paddle-TensorRT’s Convolution implementation to use ConvolutionNd. #47653
conv2d_fusion operator supports NHWC format. #49047
Adjust the directory structure related to Phi operators under C++ inference library. #53091
Support rebuilding TensorRT Engine instead of reporting errors when TensorRT serialization and loading versions do not match. #50775 。
Optimize the log messages printed by the Paddle-TensorRT runtime. #50181
Support elementwise 0-dimensional Tensor inputs for oneDNN-based CPU inference. #51656
Clean up and normalize support for Paddle-TensorRT’s FC, matmul, matmul_v2 operators, and unify and upgrade to use TensorRT’s IMatrixMultiplyLayer for support. #52222
Performance optimization¶
Support multiple lookup_table operators in Paddle-TensorRT’s Embedding+Eltwise+LayerNorm fusion. #46243, #46230
Add MoE fusion Phi operator to improve inference performance of MoE model. #48703
In the scenario of INT8 quantized inference, Paddle-TensorRT plugin can fall back to FP16 computation, instead of FP32 computation. #50554
Optimize memory and video memory usage during inference. #49051, #49046, #53930
Optimize Layout and enhance Pass. #52997
Support caching of operator Shape inferences to improve model inference performance. #48312
Optimize bias+add+relu fusion using half2 instructions. #49048
Optimize Concat Kernel for multiple inputs using vectorization operations. #49540
Implement Convolution, Depthwise Convolution and related fusion operators based on CUTLASS to improve inference speed. #47989, #50603, #51792
Paddle-TensorRT supports FlashAttention’s plugin, to improve inference speed of models such as StableDiffusion. #49438
Add Transpose+LayerNorm fusion PASS, to improve inference speed of models such as StableDiffusion. #50082
Add Elementwise+Transpose fusion. #50081
Optimize Paddle-TensorRT Group Norm plugin implementation. #49160
Add the use_cuda_graph parameter to the Config.EnableTensorRtEngine() API to enable CUDA Graph, which reduces runtime overhead. Note that the model input shape must remain unchanged while CUDA Graph is in use. #53406
Support inplace operation of Reshape, to reduce copying time of the model at runtime. #49146
Optimize LayerNorm kernel implementation based on oneDNN. #47782
Support fusion of quantize+transpose and transpose+dequantize based on oneDNN. #49509
When MKLDNN is turned on in CPU inference, FC-related fusion pass is enabled by default, to improve performance. #45704
CPU OneDNN inference supports squeeze2 + transpose2 fusion. #47592
XPU inference enhancement and performance optimization¶
Add ExpRunWithRuntimeConfig API and XpuRuntimeConfig, to allow settings of parameters such as external streams and L3 cache during inference. GetExecStream API supports obtaining Kunlun external stream objects. Input and output support Kunlun device memory, to reduce D2H and H2D overheads. #53334, #52466, #53240
Add multi-encoder, fused_multi_transformer and fusion pass, to improve performance of ERNIE and Transformer class models. #50570, #51346, #50499, #53982, #50759, #51571, #53144, #53306
Optimize BeamSearch performance. Transform, remove and fuse fine-grained operators such as write_read_array and gather, to improve model performance when beam_size=1. #53130
Transform multiple stack operators with the same input into unsqueeze operators that support broadcast. Unsqueeze/squeeze supports inplace computation. #52099
Add support for exporting multi-card inference models for Kunlunxin. #50490
Add embedding_with_eltwise_add fusion pass and operator phi kernel, to reduce video memory usage and improve inference performance. #50590
interpolate class operator phi kernel supports FP16. #52358
argmax operator supports INT32 type output. #51303
Fix the error that only the model file was saved when saving a serialized model with mixed-precision inference mode turned on. #52994
Fix the segmentation fault of instance_norm when scale and bias are empty. #52627
conv_transpose operator supports FP16. #53626
Add yolo_box_xpu fusion pass and operator phi kernel, to optimize YOLO model generic substructure. #54163
Add conv2d_xpu fusion pass and operator phi kernel, and support FP16 inference, to reduce convolution inference time. #52247, #53626
Add the generic sigmoid_elementmul fusion pass, which fuses the pattern into a swish operator so that it matches the conv2d_fusion pass, improving YOLO model inference performance. #53580
Add act_add fusion pass and operator phi kernel to improve inference performance. #53965
Add fold_interp_outsize fusion pass, to improve inference performance. #54245
Solve the problem of incorrect results due to duplicate fusion when there is shared weight in FC. #51108, #51039
Remove op_device attribute where operator is only used for training, to prevent wrong choice of place for training during inference. #51029
Support saving of optimized models, allowing PASS optimization to be skipped in case of re-inference, to reduce first time inference time. #53696
Solve the problem of computation error caused by the CPUPlace input of operator Kernel being forced to copy to XPU. #51306
subblock supports early copying of H2D parameters to improve inference performance. #51876
Fix scale memory size of the output activation of Kunlunxin 2nd generation chip. #53505
In new executor Kunlunxin D2D copy, support asynchronous execution. #51876
Remove concat operator with only one input. #52304
lookup_table_v2 supports FP16 to remove redundant cast operator. #52888
Control flow While operator supports caching scope, to reduce overhead of creating new scope every time. #52628
Scatter newly supports FP16, to remove redundant cast operators and elementwise_mul operators with an input of 1. #52831
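The sigmoid_elementmul fusion listed above relies on the identity swish(x) = x * sigmoid(x): a sigmoid followed by an elementwise multiply with its own input can be replaced by a single swish operation. The following is a minimal plain-Python sketch of that equivalence. The function names are illustrative only and are not Paddle or XPU kernel APIs.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def unfused(x: float) -> float:
    """Two-step pattern: a sigmoid followed by an elementwise multiply."""
    s = sigmoid(x)
    return x * s


def swish(x: float) -> float:
    """Single fused operation: swish(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


# Both forms agree up to floating-point rounding.
for value in (-2.0, 0.0, 0.5, 3.0):
    print(value, unfused(value), swish(value))
```

Fusing the two steps avoids materializing the intermediate sigmoid output, which is where the saving in such a pass typically comes from.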
Model quantization¶
Upgrade of dynamic graph quantization function.
Add a new API for quantization training of dynamic graph models: paddle.quantization.QAT. Support passing quantization-related parameters through configuration, simplifying quantization training process and difficulty of secondary development. (#49398)
Add a new offline quantization API: paddle.quantization.PTQ. Support exporting quantization model to model format supported by inference. (#50107)
Add STUB operator to simulate actual quantization operation during training process. (#50510)
Support quantization training model to load parameters of offline quantization model. Support more operators for quantization, including matmul, scale, and conv1d. #47892, #45911, #48912
Support hybrid parallel training of static graph quantization training. #52219
Fix the problem in the process of dynamic graph quantization:
5. Environment Adaptation¶
Improve the efficiency of source code compilation, and promote the setuptools + ninja compilation method to increase development efficiency: In CPU scenarios, the full compilation time is reduced by 20 min, and compilation speed is increased by 24.52%. In GPU scenarios, the full compilation time is reduced by 22 min, and compilation speed is increased by 29.31%. In order to adapt to mainstream development environments, PaddlePaddle supports gcc12 compilation and C++17 in the source code, and adapts to the latest CUDA12. In terms of code quality, compilation warnings have been completely cleaned up to improve the compilation experience. At the third-party dependency level, we have upgraded the version of the underlying protobuf to reduce dependencies, cleaned up deprecated attributes of some earlier versions of dependency libraries and old code formats, and removed support for Python 2.x.
ninja compilation adaptation to improve compilation speed. #52433,#48932,#49420,#48435,#49303,#49448,#49838,#50067,#52796,#50431,#49181,#48867,#48490,#48211,#49499,#53076
setuptools compilation and package all-in-one adaptation. #48770,#46957,#49583,#47602,#48301,#50800,#42575,#49826,#49002,#51443,#51528,#52621,#52465
gcc12 support. #52960,#52265,#46546,#52318,#46808,#47466,#52083,#48176,#49423,#49452,#51037,#52007,#52441,#52085,#50817,#52646,#50777,#53288,#54009
c++17 standard support. #53345,#53892,#54282,#49017,#47635,#54258
Compilation Warning is removed. #47163,#47216,#47309,#47252,#47341,#47399,#47513,#47558,#47706,#52717,#51203,#51336,#51608,#51633,#46644,#53092,#53185,#53246,#53650,#53683,#53687,#53886,#53689,#53679,#53681,#53532,#47137,#47045,#52186,#52490,#53924,#53938,#53945,#53851,#53847,#53818,#53931
Support protobuf upgrade. #49875,#48495,#49673,#52499,#51161,#49168
Support offline compilation of third-party libraries. #54326,#54370,#54335,#54346,#53744,#54319,#53915
Phi independent compilation header file dependency decoupling. #50456,#47088,#52573,#52651
Python2.x decommissioning. #48685
6. Security¶
Fix bugs such as null pointer usage, illegal address access, memory out of bounds, divide by 0, and Python IndexError. PR49976, PR49993, PR49942, PR49965, PR50000, PR50005, PR49953, PR49995, PR49974, PR50015, PR50010, PR49979, PR49994, PR49977, PR49968, PR49984, PR49958, PR50008, PR51714, PR51847, PR51034, PR51088, PR51091, PR51092, PR49966, PR49656, PR52161, PR49548, PR49546, PR49547, PR49549, PR51850
Thanks to our Contributors¶
This release contains contributions from: 1want2sleep, 201716010711, 404988613, 5u13, 6clc, Ackeraa, Aganlengzi, ahahahahahaha, Ainavo, Allen Guo, andyj, Asthestarsfalll, Aurelius84, Ayuan, BellaZYL, Bjmw3, Bo Zhang, bukejiyu, caozhou, carryyu, Ccc, ccrrong, ceci3, chalsliu, Chang Xu, CHANGer, Charles-hit, Chen Weihang, chenjian, Chenxiao Niu, chenxiao120660, chenxujun, Chitsing KUI, cifar10, co63oc, CollaborativeFiltering, csy0225, cxxly, cyber-pioneer, cyberslack_lee, czr-gc, Dandelight, danleifeng, Danyang Zhang, dasen, denglianbin, Difer, dongfangshenzhu, DrowFish19, duanboqiang, duanyanhui, engineer, engineer1109, Epsilon Luoo, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, Fisher, FlyingQianMM, Frank Lin, Galaxy1458, GaoYuYang, gaoziyuan, gem5, GGBond8488, Ghost Screaming, gongenlei, gouzil, Guanghua Yu, Guo Sheng, Guoxia Wang, Hamid Zare, Hanchiao, handiz, Haohongxiang, haosicheng, haozi, Happyd99, heliqi, hellockx, hellolllw, heyanru, hg-1099255210, hh-qiao, hjyp, hong, HongyuJia, houj04, hua-zi, Huang Jiyi, Huang Zhengjie, huangjiyi, huangjun12, Hui Zhang, Huihuang Zheng, Hulek, hwa, HydrogenSulfate, Ikko Eltociear Ashimine, iLeGend, Infinity_lee, Infrared1029, Jacek Czaja, jakpiase, james, jameszhang, Jiabin Yang, jiahongyu, jiangcheng, jiangfan06, Jianghai, jiaqianjing, jingsongliu, JingZhuangzhuang, jjyaoao, joanna.wozna.intel, junxiu777, Jx-qi, JYChen, JZ-LIANG, jzhang533, Kai Song, Kai Xing, Kaipeng Deng, Kang Zhao, kangguangli, Kevin Wu Jiawen , Kim, Kim Yann, knamg, kuizhiqing, lanxianghit, Leding Li, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, Ligoml, lijialin03, lijin23, limingshu, Lin Manhui, LinearTemporalLogic, Linjie Chen, lishicheng1996, Little-chick, littleforest, liu zhengxi, liulinduo, liuruyan, liuzhenhai93, LiYuRio, lj970926, LokeZhou, LoneRanger, lubiu, Lucas, lugimzzz, Lux et Veritas, lxsbupt, LyndonKong, lzy, lzydev, Mahmoud Ashraf, Manan Goel, Maple Xie, Matsumoto Ruko, mayang002, MayYouBeProsperous, megemini, mengziheng, Meteor 
Liu, mhy, mhy-666, Ming-Xu Huang, ming1753, minghaoBD, mjxs, Moqim, Mountagha, Mr.Juice, mrcangye, NetPunk, Netpunk, nihao, niuliling123, Nyakku Shigure, OccupyMars2025, Ouyang Chao, pangengzheng, pangyoki, parap1uie-s, Paulina Gacek, Piotr Paturej, PommesPeter, PPGitub, PPPPzhang, PuQing, Qi Li, Qi Shao, QingshuChen, qipengh, qizhaoaoe, Rayman, RedContritio, RichardWooSJTU, risemeup1, Roc, ronnywang, Ruibiao Chen, Ruibin Cheung, RuohengMa, Ryan, SaltFish11, Sanbu, Scotty, scotty, seemingwang, Shaojie WANG, ShenLiang, shentanyue, Shijie, Shuangchi He, Siming Dai, Sing_chan, sneaxiy, Sonder, sprouteer, Sqhttwl, sunli, superwinner1, supplyout, SylarTiaNII, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao Luo, Taylor-Layrose, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, Tian, Tian Zheng, tiancaishaonvjituizi, tianshuo78520a, tifa, Tinson Lai, Tomasz Socha, Tony Cao, ucsk, umiswing, ustiniankw, Vegetable dog, Vigi Zhang, Vvsmile, Wang Bojun, Wang Xin, Wang Xinyu, wangfengsheng1999, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, wangshengxiang, wangxiaoning, wangxinxin08, Wangzheee, WangZhen, wangzhen38, wasupandceacar, wawltor, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, wentao yu, wenzhe.wang, westfish, whisky-12, whs, Wilber, will-jl944, winter-wang, Winters Montagne, WJJ1995, wuhuachaocoding, wuyefeilin, wz1qqx, XiangGao, xiaoguoguo626807, xiaohemaikoo, xiaoluomi, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiaoyuanzi914, Xinger, Xinyu Chen, xiongkun, xjmxyt, xu98bin, xysheng-baidu, yangguohao, yangjianfengo1, YangQun, YangZhou, yeliang2258, YepKong, Yichen Zhang, yikaikkk, Yiqun Liu, yjphhw, ykkk2333, Young-Flash, yu wentao, Yuang Liu, Yuanle Liu, YuanRisheng, yuchen202, yuehuayingxueluo, YuhangLi, Yulong Ao, YUNSHEN XIE, yunyaoXYY, YuRonan, zachary sun, ZeKai Zhou, Zenghui Yuan, zengshao0622, Zero Rains, Zhan Rongrui, Zhang Jun, Zhang Na, Zhang Ting, Zhang Zheng, zhangbo9674, ZhangDY-6483, zhangkaihuo, zhangxin81, zhangyikun02, 
zhangyingying520, zhangyuqin1998, zhaocaibei123, zhaoyingli, Zhen Wang, Zheng-Bicheng, Zhenghai Zhang, Zheng_Bicheng, zhenyun, Zhibao Li, zhiboniu, Zhong Hui, Zhou Wei, ZhouMengLei1999, zhoutianzi666, zhouzj, zhupengyang, zhurou603, zhuyipin, zhwesky2010, ziyoujiyi, zlsh80826, Zman, zmxdream, zqw_1997, Zuza Gawrysiak, zxcd, zyfncg, ZZK, zzk0, Ding Yi, Fu Jianhan, Liu Ge Gu Tou, Lu Lin, Zhou Zhouzhou, Jiang Yongyong, Xue Zhawu, Zhang Chunqiao, Zhang Zhenghai, Ning Meng Wei, Wang Mingdong, Shi Xiaowei, Chao Ji Ma Niu, Chen Cangye, Qi Ma Xiao Mao
2.4.2 Release Note¶
V2.4.2 fixed known bugs, and added a tiny set of features.
Training Framework (distributed included)¶
Fix the problem that occurred when using the paddle.utils.dlpack.to_dlpack API to create dlpack objects multiple times in a for loop, and fix the reference counting bug that caused the memory pointed to by a dlpack object to be destructed unexpectedly. #50138
Fixed the issue of out-of-bounds memory access when the input tensor is multi-dimensional in paddle.multiplex API. #49368
Fix the occasional compilation error caused by incorrect referencing of the Eigen header file. #48157
Fixed the bug that the output value of the backward operator may be None when the output gradient parameter order of the custom operator is not continuous. #48656
Add cutlass and implement the fusion kernel of gather+gemm+scatter; Optimize training and inference performance of sparse convolution; Optimize inference performance of batch_norm under 1D input data. #50118
Fix compilation failure in gcc54 environment caused by using constexpr. #50421
Move the sum op kernel to PHI and fix the bug that the correct dims of SelectedRows cannot be obtained when running infermeta. #49342
Fixed the issue that the fold operator accesses memory out of bounds with large batch size input. #49491
Fix the problem that a Layer without parameters cannot call backward in dynamic-to-static mode. #49812
Fix the compilation problem with CUDA 11.8 on the Windows platform. #50205
Fix the unsupported error for FusedDropoutActBiasGrad on H100. #47285
Add the debug_graphviz_path option into build_strategy. #46531
Fix the not closed popen object. #47053
Deployment Direction (Paddle Inference)¶
Improve the functionality and stability of mixed-precision inference. Reconstruct the implementation of interface convert_to_mixed_precision and add parameter precision to interface enable_use_gpu. #49077, #49239, #49477
Support compilation under jetson ampere architecture. #49364
Fixed fc kernel diff. #49781
Fixed the error of trt workspace parameter type under CAPI. #48350
Fixed the error caused by arg_max/arg_min without flatten dtype parameter in Paddle 1.x version. #49771
Fixed the bug of missing information about lod logic after split infermeta’s refactoring. #49745
Fixed the bug of the constant-folding pass, which causes the conv2d weight to be non-persistent after folding and not enter the TensorRT engine. #50105
2.4.1 Release Note¶
Remove Paddle’s dependence on python.so, and fix the bug that execution fails because python.so cannot be found in specific environments, including conda.
2.4.0 Release Note¶
1. Important Updates¶
New dynamic graph architecture is officially effective: The new dynamic graph framework has significantly improved the scheduling performance. The scheduling performance of more than 90% APIs is improved by over 50%, and the model performance of more than 50% kits is improved by over 5%. The functional architecture is clearer, and the secondary development capability and experience are significantly enhanced.
Comprehensive improvement of the dynamic-static unification ability of PaddlePaddle: The dynamic-to-static function provides richer Python syntax support, and the Python syntax coverage of PaddlePaddle reaches 90%. The syntax transcription logic has been optimized to completely support control flow syntax, providing a smooth one-click dynamic-to-static experience. With the newly upgraded static graph executor, dynamic-to-static training has better acceleration capability, and tests on key models show that it is close to the best level of the static graph. The dynamic-to-static scalability is improved, with new support for multi-function merge export and inference. Users can use the PHI operator library for secondary development and flexible deployment. This can effectively support the custom decoding of U2++ featured models in the speech domain.
Add sparse computing APIs: Add 55 sparse APIs under paddle.sparse.* and support mainstream sparse computing scenarios. The APIs have been applied to sparse training and inference deployment for 3D point cloud target detection, Sparse Transformers, and other tasks, with a speedup of 105.75% compared to DenseTensor in high sparse scenarios. In contrast to similar products, the speed of sparse computing is increased by 4.01%-58.55%. Support the computing of a variety of sparse Tensors (SparseCoo and SparseCsr) to save as much video memory as possible. Meanwhile, it maintains a consistent usage experience, with the same usage method as the dense Tensor API.
Large-scale graph neural network GPU training engine: Through the heterogeneous hierarchical storage technology of SSD, memory, and video memory, it breaks through the video memory bottleneck and supports all-GPU storage and training of super-large-scale graphs. It realizes the all-GPU integrated solution of walk, sampling and training. This can increase the training speed by more than 10x under the same costs, compared to the traditional distributed CPU solution.
Environment adaptation: Add pre-compiled installer adapted to CUDA version 11.7. It newly supports the running in Ubuntu 22.04 or later.
Forward-looking forecast¶
PaddlePaddle Framework will deprecate support for python 3.6 in version 2.5.
The PaddlePaddle framework will gradually deprecate the API under the paddle.fluid namespace on the python side, and some of the APIs under this namespace will be directly removed in version 2.5.
2. Incompatibility upgrade¶
The pre-compiled installer for CUDA version 10.1 is cancelled.
The Tensor.clear_gradient(bool set_to_zero) interface no longer accepts the value passed via kwargs; the bool value of set_to_zero must be passed as a positional argument.
In order to improve the utilization efficiency of video memory, only the gradients of forward leaf node variables, such as the gradients of network parameters in training, are retained in the dynamic graph by default, instead of the gradients of non-leaf nodes. If you need to preserve the gradient of a specific Tensor, you can call the Tensor.retain_grads() interface before the backward pass.
paddle.autograd.PyLayer no longer supports tuple inputs; pass in a list of Tensors if you want to pass a group of them.
3. Training framework (including the distributed feature)¶
(1)New APIs and enhanced API functions¶
Add the sparse computing class API: paddle.sparse
Add 55 sparse APIs and support mainstream sparse computing scenarios. The APIs have been applied to sparse training and inference deployment for 3D point cloud target detection, Sparse Transformers, and other tasks, with a speedup of 105.75% compared to DenseTensor in high sparse scenarios. In contrast to similar products, the speed of sparse computing is increased by 4.01%-58.55%. Support the computing of a variety of sparse Tensors (SparseCoo and SparseCsr) to save as much video memory as possible. Meanwhile, it maintains a consistent usage experience, with the same usage method as the dense Tensor API. #45849, #46694, #45086, #41857, #42935, #43475, #43668, #43966, #44022, #44346, #44432, #44451, #44743, #42013, #43520, #41434, #42130, #41276, #41356
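To make the two sparse layouts mentioned above concrete, the following plain-Python sketch stores the same small matrix in COO form (row indices, column indices, values) and in CSR form (compressed row pointers, column indices, values). It is an illustration of the storage formats only and does not call paddle.sparse. The helper names are hypothetical, and the name crows for the row-pointer array is an assumption about naming.

```python
from typing import List, Tuple


def dense_to_coo(dense: List[List[float]]) -> Tuple[List[int], List[int], List[float]]:
    """Return (rows, cols, values) of the non-zero entries in row-major order."""
    rows, cols, values = [], [], []
    for r, row in enumerate(dense):
        for c, v in enumerate(row):
            if v != 0:
                rows.append(r)
                cols.append(c)
                values.append(v)
    return rows, cols, values


def coo_to_csr(rows: List[int], cols: List[int], values: List[float], num_rows: int):
    """Compress row indices into row pointers; assumes row-major sorted COO input."""
    crows = [0] * (num_rows + 1)
    for r in rows:
        crows[r + 1] += 1          # count the non-zeros in each row
    for i in range(num_rows):
        crows[i + 1] += crows[i]   # prefix sum turns counts into offsets
    return crows, list(cols), list(values)


dense = [[0, 1, 0],
         [2, 0, 3],
         [0, 0, 0]]
rows, cols, values = dense_to_coo(dense)
crows, csr_cols, csr_values = coo_to_csr(rows, cols, values, num_rows=len(dense))
print(rows, cols, values)   # COO: one (row, col, value) triple per non-zero
print(crows)                # CSR: row i owns entries crows[i]:crows[i + 1]
```

Only the three non-zero entries are stored in either layout, which is the source of the memory saving in highly sparse scenarios.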
Add the audio field API: paddle.audio
Add the feature extraction APIs such as MFCC, Spectrogram, and LogMelSpectrogram. Support the GPU computing. The performance increases by more than 15x compared to the CPU. This can significantly improve the GPU utilization in speech model training. #45424
Add the feature extraction basic APIs such as Window Function and Discrete Cosine Transform. This makes it easier for users to customize the speech feature extraction. #45424
Add the speech I/O module. It provides 2 types of audio I/O backend and supports 6 types of codecs for convenient loading of speech data. #45939
Add TESS and ESC50 speech classification datasets. This makes it convenient for users to build classical speech classification models. #45939
Add the graph learning domain API: paddle.geometric
Graph learning is gradually becoming a key technology in the field of machine learning. The new paddle.geometric module of PaddlePaddle provides a better modeling and training development experience of graph learning.
Message passing: The message passing mechanism of the graph learning is the basis of graph modeling. We add 7 graph learning message passing APIs to make it more convenient to complete the modeling of the graph learning. Among them, 3 newly added message passing fusion operators can significantly reduce the GPU memory consumption in the GNN model training. In the dense graph scenarios, more than 50% of GPU memory can be saved in the models of GCN series, and the training speed can increase by more than 20%. #44848, #44580, #43174, #44970
Graph sampling: Graph sampling is the performance bottleneck of GNN model training. This newly added high-performance graph sampling operator supports high concurrent graph sampling. It can increase the sampling speed of GraphSage by more than 32 times and the model training speed by more than 12 times. #44970
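As an illustration of the message passing step described above, the sketch below gathers the feature of each edge's source node and sum-reduces it into the destination node. It is a plain-Python reference for the semantics only, written under the assumption that this matches what a fused send-and-receive operator (for example paddle.geometric.send_u_recv with a sum reduction) computes. The function name send_u_recv_sum is hypothetical.

```python
from typing import List


def send_u_recv_sum(x: List[List[float]],
                    src_index: List[int],
                    dst_index: List[int],
                    num_nodes: int) -> List[List[float]]:
    """For every edge (src, dst), add the feature row of src into the row of dst."""
    dim = len(x[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for s, d in zip(src_index, dst_index):
        for k in range(dim):
            out[d][k] += x[s][k]
    return out


# Three nodes with 3-dimensional features and four directed edges.
features = [[0.0, 2.0, 3.0],
            [1.0, 4.0, 5.0],
            [2.0, 6.0, 7.0]]
src = [0, 1, 2, 0]
dst = [1, 2, 1, 0]
print(send_u_recv_sum(features, src, dst, num_nodes=3))
```

A fused operator performs the gather and the reduction in one kernel, so the per-edge message tensor never has to be materialized. That is the kind of saving the GPU memory figures above refer to.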
Add the vision domain API
-
Add other API
Add the iinfo(#45321), count_nonzero(#44169), nanmedian(#42385), remainder_ (#45266), take(#44741), triu_indices(#45168), sgn(#44568), bucketize(#44195), nanquantile(#41343), frac(#41226), logcumsumexp(#42267), pairwise_distance(#44161), heaviside(#41872), logspace(#41261), corrcoef(#40690)
Add the RReLU(#41823), CyclicLR(#40698), OneCycleLR(#41825), Softmax2D(#40910), SoftMarginLoss(#42364), MultiLabelSoftMarginLoss(#41183), TripletMarginLoss(#40487), TripletMarginWithDistanceLoss(#40545), CosineEmbeddingLoss and cosine_embedding_loss(#41680), PixelUnshuffle(#40728), ChannelShuffle(#40743)
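For readers unfamiliar with some of the numeric APIs listed above, the following scalar reference functions sketch the commonly documented semantics of frac, heaviside, bucketize, and logcumsumexp in plain Python. They are assumptions about the element-wise behaviour on finite inputs, not the Paddle implementations. Consult the API reference for tensor shapes, dtypes, and edge cases.

```python
import bisect
import math
from typing import List, Sequence


def frac(x: float) -> float:
    """Fractional part that keeps the sign of x: x - trunc(x)."""
    return x - math.trunc(x)


def heaviside(x: float, y: float) -> float:
    """Step function: 0 for x < 0, y for x == 0, 1 for x > 0."""
    if x < 0:
        return 0.0
    if x > 0:
        return 1.0
    return y


def bucketize(x: float, sorted_sequence: Sequence[float], right: bool = False) -> int:
    """Index of the bucket x falls into, given ascending bucket boundaries."""
    if right:
        return bisect.bisect_right(sorted_sequence, x)
    return bisect.bisect_left(sorted_sequence, x)


def logcumsumexp(xs: Sequence[float]) -> List[float]:
    """Running log(sum(exp(x_0 .. x_i))), kept stable by factoring out the maximum."""
    out = []
    running = -math.inf
    for x in xs:
        m = max(running, x)
        running = m + math.log(math.exp(running - m) + math.exp(x - m))
        out.append(running)
    return out


print(frac(-1.5), heaviside(0.0, 0.5), bucketize(4, [2, 4, 8, 16]))
print(logcumsumexp([0.0, 0.0, 0.0]))
```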
Enhanced API functions
Add the large batch_size calculation function of BatchNorm1D #43072
Optimize the collective communications distributed training API
Optimize the fleet.init function, and add the log_level parameter to facilitate users to view logs during operation. #45909
Add the paddle.distributed.fleet.recompute_sequential and paddle.distributed.fleet.recompute_hybrid interfaces. It is convenient for users to use the recompute function. #45348
Add the paddle.distributed.fleet.layers.mpu package. It is convenient for users to use tensor parallel function. #45803
Add the communication APIs paddle.distributed.destroy_process_group, paddle.distributed.isend, paddle.distributed.irecv, and paddle.distributed.all_to_all_single. It improves the completeness and ease of use of communication. #43918
Add the paddle.distributed.stream package. The performance is increased by 5% to 10% compared to the base version. #46023, #45282
The communication API is added with the support of multiple data types such as Char/Byte/Bool. It improves the completeness and ease of use of communication. #45574, #45440
The communication API asynchronous parameter is changed from use_calc_stream to sync_op. It enhances the semantic readability of the interface. #46493
Enhanced high-level API
(2)New functions and important upgrades¶
The new dynamic graph architecture is officially launched: Compared with the original architecture, the scheduling performance of the new dynamic graph framework is significantly enhanced. The scheduling performance of more than 90% of APIs is improved by over 50%, and the model performance of more than 50% of kits is improved by over 5%. The new dynamic graph architecture is clear, with low coupling. The learning and development costs of extension modules such as Hook and PyLayer are significantly reduced based on the new architecture. #37550, #37574, #37813, #37926, #39192, #37599, #37406, #37466, #40945, #39989
High-order auto-differentiation mechanism: In order to better support scientific computing and other scenarios, the PaddlePaddle framework has been further improved and optimized for higher-order auto-differentiation capabilities. At present, the paddle.incubate.autograd directory has provided relevant trial functions and APIs for forward/reverse higher-order auto-differentiation (Currently they are in incubation, and related functions and API signatures may change). If you intend to implement related models and explore the auto-differentiation mechanism by yourself, please read the usage and limitations of higher-order auto-differentiation carefully. Specific upgrades include:
Static graph higher-order differentiation mechanism upgrade. Through the base operator system and program transformation, it supports higher-order forward and reverse differentiation, with the availability of the compiler and distributed functions. #41919, #41201
Add the forward and reverse higher-order auto-differentiation API, paddle.incubate.autograd.forward_grad, paddle.incubate.autograd.grad. #43354
Add 18 higher-order auto-differentiation operators: sin, cos, exp, erf, abs, log, cast, where, equal, not_equal, greater_than, greater_equal, elementwise_pow, square, elementwise_max, gelu, reduce_mean, size. #46184, #46024, #45888, #45338, #44345
Fix the existing bugs of the operators such as elementwise_div, reduce_sum, p_norm. #46514, #46184
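The idea behind forward-mode differentiation can be sketched with dual numbers, where each value carries its derivative (tangent) and every primitive operator propagates both. The plain-Python example below is a first-order illustration of that idea only. It is not how paddle.incubate.autograd is implemented (the notes above describe a base operator system with program transformation), and all names in it are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Dual:
    """A value together with its derivative with respect to one chosen input."""
    val: float
    tan: float

    def __add__(self, other: "Dual") -> "Dual":
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other: "Dual") -> "Dual":
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)


def dsin(d: Dual) -> Dual:
    """sin with its derivative rule cos(u) * u'."""
    return Dual(math.sin(d.val), math.cos(d.val) * d.tan)


def dexp(d: Dual) -> Dual:
    """exp with its derivative rule exp(u) * u'."""
    e = math.exp(d.val)
    return Dual(e, e * d.tan)


def forward_derivative(f, x: float) -> float:
    """Derivative of f at x, obtained by seeding the input tangent with 1."""
    return f(Dual(x, 1.0)).tan


def f(x: Dual) -> Dual:
    return dsin(x) * dexp(x) + x * x


print(forward_derivative(f, 0.5))
```

Each primitive needs its own derivative rule, which is why support is listed operator by operator above.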
Generic heterogeneous parameter server architecture:
Parameter server GPUGraph infrastructure upgraded to meet the implementation needs of large-scale applications: The storage and training of large-scale graph neural networks on traditional CPUs suffer from high cost, low stability, and low performance. To overcome these problems, we have built a pure GPU graph training engine (PGLBox). Through the heterogeneous hierarchical storage technology of SSD, memory and video memory, it supports the training of ultra-large scale graph models. The training performance is improved by more than 10x compared with the CPU graph training engine at equal cost. The task failure rate is extremely low. #44594
Large-scale federation parameter server architecture: For large-scale personalized recommendation scenarios, the large-scale federation parameter server training is developed based on the heterogeneous PS infrastructure, to support horizontal and vertical federation under hundreds of billions of parameters. It includes two features: User private parameters updated locally and public parameters updated remotely. Users can flexibly configure the slicing policy for private and public parameters. A new central scheduling node Coordinator is added. Users can perform secondary development from the base class to customize the Client selection policy. #42682 , #44864 , #44327
Adaptive parallel
Design and launch a complete automatic parallelism interface system: Support automatic dynamic-to-static distributed training, automatic distributed data loading, automatic distributed saving and loading, automatic parameter conversion, custom slice marker and custom execution process. Users can easily obtain the automatic distributed training capability based on a single machine networking. It supports data parallel, model parallel, pipeline parallel, and hybrid parallel. #45776 ,#46552 , #44202 , #45840 , #45518 , #40528, #42838, #43093, #43312, #45053.
Improve the underlying adaptive parallel mechanism, including the upgrade of the distributed costmodel design and implementation, to provide better evaluation of the slice policy. Add the native distributed properties to ProgramIR and enrich the Cluster functions. #40457 , #42601 , #42727 , #42874 , #43114 , #44095 , #44146 , #44701 , #44973 , #45002 , #45118 , #45237 , #42576 , #41722 , #44150 , #44989, #44951, #44963 .
Add the Sharding stage 1/2/3 AutoTuning feature under data parallel. It automatically selects the Sharding stage policy with the highest throughput while ensuring that the video memory constraints are met. #43782
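The selection rule described for Sharding stage AutoTuning (highest throughput among the policies that fit in video memory) can be sketched as a constrained maximum. The code below is a hypothetical illustration: the function name, the dictionary fields, and the profile numbers are all made up for the example and do not come from the framework.

```python
from typing import Dict, List, Optional


def pick_sharding_stage(candidates: List[Dict], memory_limit_gb: float) -> Optional[Dict]:
    """Return the feasible candidate with the highest throughput, or None if none fits."""
    feasible = [c for c in candidates if c["peak_memory_gb"] <= memory_limit_gb]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["throughput"])


# Hypothetical profiling results; the values are illustrative, not measurements.
profiles = [
    {"stage": 1, "peak_memory_gb": 38.0, "throughput": 1200.0},
    {"stage": 2, "peak_memory_gb": 30.0, "throughput": 1100.0},
    {"stage": 3, "peak_memory_gb": 22.0, "throughput": 900.0},
]

choice = pick_sharding_stage(profiles, memory_limit_gb=32.0)
print(choice)
```

With a 32 GB limit in this made-up table, stage 1 is excluded by the memory constraint, so the faster of the two remaining stages is chosen.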
Training hardware access - Plug-in solutions: Add custom Runtime/Kernel/CCL/Graph/Pass solutions. Hardware vendors can choose which modules to implement on demand based on hardware characteristics.
ONNX format export
(3)Function optimization¶
Comprehensive increase of dynamic-to-static analysis conversion & extension capabilities
In order to improve the success rate and experience of model dynamic-to-static conversion, the transcription logic of control flow syntax is reconstructed. The core syntax has been upgraded to the JIT (just-in-time) paradigm to achieve transcription equivalent to the Python code. Support for syntax such as break, return and continue is improved. #43666, #43846, #43848, #43880, #43957, #43328, #43348, #43998, #44465, #44504, #43713, #43864, #43967, #44155, #44487, #44527, #45105, #45900
In order to support flexible deployment of custom speech decoding, the jit.save/load interface is extended to support merging and exporting multiple user functions. A new JITLayer component is added to support the invocation of class functions. Meanwhile, the custom inference deployment function is implemented with the PHI operator library C++ API. #44283, #41783, #43607, #43754, #43758, #43798, #44010, #44351, #44465, #44504, #44597, #44738, #44984, #46249
In order to unify API dynamic and static behaviors, 20 operators are upgraded to support variable attribute information of Op in static graphs, to ensure consistent dynamic and static behaviors and improve the success rate of dynamic-to-static conversion of models. These include pad2d, depthwise_conv2d_transpose, conv2d_transpose, adaptive_avg_pool2d, reverse, bincount, multinomial, reduce_sum, reduce_mean, reduce_prod, reduce_min, reduce_max, uniform, squeeze, max_unpool2d, dropout, cumsum, eye, argmin, argmax. #44737, #45084, #45189, #45391, #45417, #45427, #45514, #45525, #45543, #45660, #46352, #46433, #45078, #45342, #45372, #45453, #45522, #45620
In order to solve the problem of occasional loss of error reporting stack for user dynamic-to-static, the logic of the error reporting module is optimized to improve the readability of the error reporting stack and the user debugging experience. #44054, #44083, #44781, #44996
Add the TypeHint syntax recognition and transcription module to fully support Python Type Hint syntax. #47121
PHI operator library covers the full amount of arithmetic class operators:Continuously build the highly reusable operator library PHI. The remaining PaddlePaddle 2.x arithmetic class PythonAPI-associated operators and related kernels are migrated to the PHI operators library and rewritten as functional expression. Add about 180 forward/reverse operator CPU&GPU kernels, and 170 Kunlun-specific arithmetic kernels. This further enhances the kernel function sets that can be reused when new operators are added. In addition, add more than 100 C++ arithmetic class APIs. These APIs can be used in the custom operators, further enhancing the ease of use for external extension development based on the PaddlePaddle. #44577, #44631, #44434, #44605, #44676, #44742, #44436 , #45887, #45851, #45623, #45397, #45863
Normalized operator definitions that significantly simplify models: The historical operator definitions of PaddlePaddle 1.x contain many redundant parameters, which makes them costly to understand and adapt. The redundant parameters of about 150 high-frequency operators are cleaned up centrally, removing essentially all mathematically irrelevant parameters. After the cleanup, the amount of information in the inference model stored by PaddlePaddle is significantly reduced. Generally, about 40% of the attribute variables are removed, which improves the clarity of the PaddlePaddle operator definitions and the experience of model analysis and debugging. Meanwhile, the size of the inference model stored by PaddlePaddle is reduced by more than 70%, making PaddlePaddle models significantly more lightweight. #44310 , #45613 , #45684 , #45708 , #45758 , #45786 , #45772 , #45845 , #45984 , #46218 , #46553
(4)Performance optimization¶
AMP performance and accuracy optimization
More operators are added with the support of FP16 data types, including elementwise series operators, compare series operators, strided_slice, set_value, uniform_random, etc. (#45504 #44405 #45496 #46641, #46906 )
Optimize the implementation scheme of the hard_swish operator FP16 Kernel to avoid accuracy loss. (#35386 )
More operators are added with the support of BF16 data types, including fused_linear, empty, selu, pow, adam, clip, embedding, gelu, pad3d, pixel_shuffle, tile, where, etc. #46364, #47177
AutoTuning of single machine training performance
Transpose OP supports automatic Kernel selection mechanism. This allows the automatic search for the best Kernel implementation for different model configurations, improving the model performance. #43310 (Transpose Op access AutoTuning function)
AMP Layout auto-switching supports the new dynamic graph mode. For the ResNet50, TSM, and DeepLabV3 models, the performance increases by 9%-21% by Layout AutoTuning in the new dynamic graph. (#45409, #45751, #45826, #46880)
Generic performance optimization of GPU single machine training
Optimize the Cache scheme of the Conv operator cuDNN algorithm and Cache the results in all algorithm acquisition methods. This can significantly reduce the CPU overhead of the operator.(#41891 #47197 )
Further optimize the GPU Kernel and Python side performance of multiple operators, including dist, poisson, depthwise_conv2d, transpose, eigh, broadcast computation, reduce computation, layer_norm, cross_entropy, etc. This can achieve better performance in more configuration scenarios. (#44946, #45057, #45160, #42491, #42704, #42853, #46287, #46362, #46490, #46412, #46623, #40051 )
Performance optimization of distributed training for collective communications
To improve pipeline parallel scheduling efficiency, support the dynamic graph Interleaving1F1B scheduling policy. In the GPT-3 model, the performance is improved by 3%-4%. #45797 , #45869 , #45922 , #46209 , #45402 , #45444 , #45497 , #46399 , #46483 , #46876 , #47242 , #47249 , #47497 , #47517
To improve the distributed training performance of the MLPerfBERT model, the DistributedFusedLamb distributed optimizer supports hierarchical AllReduce. It improves MLPerfBERT performance by 17% on 1024 DCU cards. #44821 , #44843
To optimize the video memory footprint when using DataParallel, the Buffer Lazy initialization policy for Tensor Fusion is supported, thus reducing the video memory footprint by an amount equal to the number of model parameters. #45631.
Distributed parallel policies DataParallel and Sharding support BF16 training. #46846 , #47246
To support the Sequence Parallel policy, the Distributed Pipeline Parallel supports the enable_partial_send_recv policy, and supports transmitting tensors that have been sliced by sequence parallel. #46992 , #47083
To improve the performance of sharding stage 2 policy, implement the overlap of sharding stage 2 optimizer broadcast parameters with next step forward and use multi-CUDA Stream for communication. In the GPT 6.7B model, the 16-card training performance is improved by 11%. #46495 , #46656 , #47061
(5)Bug fix¶
Dynamic-to-static
Fix the bug of reporting an error in dynamic-to-static of the model in a Parameter no-gradient scenario during multi-card training. #44485
Fix the bug where redundant frame logs are mistakenly output by the terminal in the dynamic-to-static. #45754, #46800
Fix the bug of reporting an error in the dynamic-to-static training when the control flow in the model contains a Tensor that does not require a gradient. #43034
Fix the bug of incorrect computation value during gradient aggregation in the dynamic-to-static training. #44893
Fix the bug of reporting an error in the dynamic-to-static when the function is decorated with @staticmethod. #44983, #45268, #45277
Fix the bug of excessive video memory footprint in some dynamic-to-static training scenarios. #45380
Fix the bug of reporting an error of dynamic-to-static shape derivation in the networking phase when the model contains a complex control flow. #45916, #46020
Fix the error report mechanism
Distributed training in collective communications
Fix several bugs in communication library initialization and communication process, and enhance the system operation stability. #44964 #45100 #44758
Fix the bug of frequent occurrences of hang in pipeline parallel, and enhance the ease of use of the policy #47201; enhance the pipeline function to support unbalanced input. #47199
Fix the bug that the performance of the new dynamic graph MP/PP policy is lower than the old dynamic graph. #47071
Fix the bug that the sharding stage2 policy incorrectly maintains the parameter trainable property. #47240
Fix the bug that tensor numel is greater than INT32_MAX in a series of OPs. #45711, #45741, #45897, #46158, #46767, #47191, #46045, #46160
Fix the bug of too much video memory footprint in FusedAttention and FusedFeedForward OP. #47236, #47235
Fix the bug of incorrect parameter update in multi_tensor_adam and multi_tensor_momentum OP when the parameters passed in are list of dict. #47352, #47372
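The INT32_MAX item in the list above concerns tensors whose element count no longer fits in a signed 32-bit index. The following is a minimal sketch of that failure mode in plain Python, not the actual kernel code; the helper names numel and to_int32 are hypothetical and only simulate 32-bit wraparound.

```python
INT32_MAX = 2**31 - 1

def numel(shape):
    """Number of elements of a tensor with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n

def to_int32(n):
    """Simulate storing n in a signed 32-bit integer (two's complement wraparound)."""
    return (n + 2**31) % 2**32 - 2**31

shape = (65536, 32768)            # 2**16 * 2**15 = 2**31 elements
count = numel(shape)
print(count > INT32_MAX)          # True: the count does not fit in int32
print(to_int32(count))            # -2147483648: a 32-bit index wraps to a negative value
```

A kernel that computes offsets with 32-bit integers would see the wrapped value, which is why such kernels typically need 64-bit indexing for large tensors.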
4. Deployment direction (Paddle Inference)¶
(1)New features¶
Optimize the back-end graph engine integration scheme
In order to reduce Paddle-TensorRT plugin code development and reduce the number of Paddle-TensorRT subgraphs, thereby reducing resource usage, a generic plugin mechanism has been developed to automatically provide a unified TensorRT plugin interface for the rich PHI operators in the framework. As a result, the video memory footprint can be effectively reduced in most scenarios. #46970, #46179, #46580
In order to make it easier for users to customize operators in the framework and have Paddle-TensorRT perform efficient inference, the function is upgraded to support custom Paddle-TensorRT plugins in the framework. #46970
Optimize the Inference library build system. The size can be pruned on demand
Pre-compiled installer supports TensorRT by default: The pre-compiled installer for training and the pre-compiled installer for deployment (Paddle Inference) are unified into one pre-compiled installer. The build system is optimized so that the pre-compiled installer supports TensorRT by default, reducing the switching cost for users using PaddleTensorRT. #46008, #45824, #46058
The size can be pruned on demand: Pruned according to the model operator. #47033 , #47049 , #47047
Inference supports native AMP
In order to make full use of GPU Tensor Core computation capability and improve the model inference performance, a model precision conversion tool has been developed. Inference on GPU natively supports the inference of the mixed precision model. For the usages, refer to the documentation. #43814, #43881, #44057, #44307, #44457, #44866, #45050, #45346, #45379, #45406, #45882
In order to improve the inference performance of the mixed precision model, FP16 kernels are added for high-frequency operators that previously did not support FP16 computation, thus reducing the possibility of inserting the cast operator due to input precision mismatch. The inference performance is improved. #44642, #45061, #44653, #45504, #44969, #44558, #44710, #43871, #44792
Upgrade the compression and inference engine
Upgrade the quantization model storage format. The new format supports three deployment methods: Paddle Inference, Paddle Lite and Paddle2ONNX. The supported chips include X86 CPU, NVIDIA GPU, and Arm CPU. (#46305, #462832, #46022 )
Add the INT8 full quantization function compatible with SoC/NPU chips. This can ensure the output INT8 quantization model has the best inference acceleration and precision on SoC/NPU chips.
(2)Underlying optimization¶
GPU performance optimization
Add the TensorRT mapping for operators such as matmul_v2, LSTM, reshape, fill_constant, swish, multiclass_nms3, bilinear_interp_v2, split, silu, and shuffle_channel. Optimize the support for the dynamic shape. Performance improved by 7% to 90% for multi-class focused models. (#46177, #44678, #44314, #44561, #45166, #44411, #43424, #44516)
Add constant folding PASS for inference performance optimization, to improve the performance of SwinTransformer, HifiGAN, FastSpeech2, and other models.(#45494)
Add cache of conv_fusion workspace size, to improve the computation performance of conv_fusion. (#45902)
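The constant folding PASS mentioned in the list above evaluates subexpressions whose inputs are all known before execution, so that they are computed once instead of at every inference call. The sketch below shows the general idea on a Python expression tree by using the standard ast module (Python 3.9+ for ast.unparse). It is only an illustration of the technique; the ConstantFolder class and fold helper are hypothetical and are not the Paddle Inference pass.

```python
import ast
import operator

# Binary operators this sketch knows how to fold.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def _is_number(node):
    return isinstance(node, ast.Constant) and isinstance(node.value, (int, float))

class ConstantFolder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the operands first (bottom-up)
        op = _BIN_OPS.get(type(node.op))
        if op is not None and _is_number(node.left) and _is_number(node.right):
            folded = ast.Constant(value=op(node.left.value, node.right.value))
            return ast.copy_location(folded, node)
        return node  # at least one operand is only known at run time

def fold(expr):
    """Return the expression source with constant subexpressions precomputed."""
    tree = ConstantFolder().visit(ast.parse(expr, mode="eval"))
    return ast.unparse(ast.fix_missing_locations(tree))

print(fold("x * (2 + 3) + 4 * 5"))  # x * 5 + 20
```

Subexpressions that depend on the runtime input x are left untouched; only the parts built purely from constants are replaced.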
Vision ViT model optimization
Inference performance optimization of large model
To improve the inference speed of very large generative models and save the video memory, add INT8 implementation (fused_multi_transformer_int8_op) to the multi-layer Transformer fusion operator (fused_multi_transformer_op), and support quantized inference of generative models. Matrix multiplication algorithm selection and quantize/de-quantize kernel fusion are used for performance optimization. #46169
Add Pass for automatic matching fusion in order to improve the ease of use of fused_multi_transformer fusion for large model inference.
CPU performance optimization
(3)Bug fix¶
TensorRT workspace size supports int64. (#44469 )
In Paddle-TRT, fully support Op’s input as weight.(#45545 )
In Paddle-TRT, support conv2d_transpose/conv3d_transpose to have the output_padding attribute.(#45004 )
In Paddle-TRT, enhance the strided_slice support for dynamic shape. (#46819 )
In Paddle-TRT, optimize the video memory footprint of context when running in multi-thread scenarios.(#45468 )
In Paddle-TRT, fix the bug of repeatedly generating serialization files in case of change of initialization sequences when multiple models run in the same process.(#43942 )
Fix the bug of occasional crash when Predictor is initialized to run for multiple times in the same process.(#45203 )
Fix the bug of abnormal inference accuracy of quantization models such as MobileNetV3_large, ERNIE 3.0-Medium and bert (#45416, #46283, #45920 #47573)
5. Environment adaptation¶
The pre-compiled installer for training and the pre-compiled installer for deployment (Paddle Inference) are unified into one pre-compiled installer. The build system is optimized so that the pre-compiled installer supports TensorRT by default.
The pre-compiled installer for CUDA version 10.1 is cancelled.
Add the pre-compiled installer for CUDA 11.7.
Decrease of source code compilation time: Reduce inter-module dependencies, improve parallelism, and optimize the compilation speed of some modules. The full compilation time is reduced by about 20 minutes in total.
Support running PaddlePaddle on Windows 11, CentOS 8, Ubuntu 22.04, and Jetson 5.02 system environments. Support running the PaddlePaddle Linux installer on Windows by using the WSL 2 tool.
Fix the running error bug of the PaddlePaddle in glibc2.34+ environment.
Optimize the code style of C++, Python, CMake in the whole code repository. Introduce or upgrade the following code style checking tools.
pre-commit is upgraded from 1.10.4 to 2.17.0: #43103
pylint is changed from default version to specify as: #43103
remove-crlf is upgraded from 1.0.1 to 1.1.14 : #43103
cpplint is pinned to version 1.6.0 instead of the default version : #43175, #43978, #43673, #43679, #43695, #43733, #43740
clang-format is upgraded from 3.8 to 13.0 : #42840, #43248, #43329, #43333, #43633, #43678
Introduce the black tool for python code style checking :#46014
Introduce the cmakelint tool for cmake file code checking. Version is 1.4.2 : #43222, #43406, #43414, #43428
Introduce cmake-format for automatic formatting of cmake files. Version is 0.6.13 : #43057
6. Hardware adaptation¶
Hygon DCU¶
Add the Profiler function on DCU, to collect, count and display performance data of model running process on DCU, and support DCU occupancy display at kernel level.
Kunlunxin Chip¶
Add Profiler function on Kunlunxin 2 generation chip, which can collect, count and display the performance data of model running process on Kunlunxin 2 generation chip, and support occupancy display of Kunlunxin 2 generation chip at kernel level.
Training/inference support for Kunlunxin 2 generation chips (Kunlunxin AI accelerator cards R200, R300, R200-8F, R200-8FS, RG800): A total of 51 models such as PPYOLOE, PP-OCR, ERNIE3.0, PP-TSM, PP-TTS, DLRM, PPO, etc. have been verified. Support the static graph + dynamic graph training. Support mixed precision training. Support the single machine single card and single machine multi-card training. The verified models cover 5 fields: intelligent vision, natural language processing, intelligent speech, intelligent recommendation, and reinforcement learning.
Cambricon¶
Support the training/inference of Cambricon MLU chip (MLU370 series of boards): The ResNet50, BERT, YoloV3, OCR-DB, Deeplabv3 and many other models are verified. Support the static graph + dynamic graph training. Support mixed precision training. Support the single machine single card and single machine multi-card training.
Graphcore¶
Support the training/inference of Graphcore IPU chip (including IPU Mk2 GC200 and Bow IPU). Support ResNet50, BERT and other models. Support the static graph and dynamic-to-static graph mode training. Support the single chip, single machine, and multi-machine distributed training.
Add the support of more operators
Upgrade to Poplar SDK v3.0.0 #46892
Support the training models by using the dynamic-to-static graph mode. Add a new paddle.incubate.identity_loss op to assist with composition #43770
Support the Paddle native distributed training API: paddle.distributed.launch #43311
Support the training models with the mixed precision #41733
Paddle Inference supports custom operators by using PopART #45235
Intel¶
Migrate oneDNN operators: transpose2_grad(#46139), relu6_grad(#46501), gaussian_random(#46747, #45481), sgd and stack(#46374), concat+grad, expand+grad, fill_constant(#45863), slice, slice_grad, split, pad and pad3d(#46101), softmax_grad(#46257), Shape(#46051), Sum(#46239), Cast, clip+grad and pool+grad(#45775), Reduce sum+grad, mean+grad, min and max(#45536), Relu and abs(#45397), Gelu(#45596), Scale(#45537)
Optimize kernels of fill_constant, fc, conv, and a number of operators
Add several Pass fusion optimizations
Optimize the Adam-W CPU FP32 optimizer (#42522)
Optimize pad3d fp32 onednn operator kernel implementation (#43990)
Optimize the concurrent execution of matmul, FC and lookup_v2 kernels (#44023, #44078, #44640, #44744, #45249)
FC onednn operator kernel supports bf16 ( #42758, #43154, #43109)
Add the fusion of matrix multiplication and activation functions (#43519, #43198)
Support convolution operator int8 parameter production IR passes ( #44680, #42625)
Add pool/avg quantization and scales correction (#44186)
Add the matmul and elementwise onednn operator kernel fusion (#45077)
Migrate 42 oneDNN operator kernels to PHI operator library (#46374, #46101, #45989, #45863, #45775, #45626, #45536, #46501, #46257, #45596, #45537, #45481, #45397, #46239, #46139, #46051)
Quantize the elementwise_sub and shape operator kernels (#42854, #44124)
Thanks to our Contributors¶
This release contains contributions from:
0x45f, Aganlengzi, Ainavo, Allen Guo, Asthestarsfalll, Aurelius84, Baibaifan, baoachun, BiynXu, Bo Zhang, BrilliantYuKaimin, cambriconhsq, caozhou, carryyu, ccrrong, ceci3, chalsliu, Chang Xu, Charles-hit, Chen Long, Chen Weihang, chenjian, chentianyu03, Chenxiao Niu, cifar10, crystal, csy0225, danleifeng, David Nicolas, dc-cheny, denglin-github, dongfangshenzhu, duanboqiang, duanyanhui, engineer, enzodechine, Fan Zhang, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, FlyingQianMM, freeliuzc, furnace, fuyou765, fwenguang, Ghost Screaming, gongweibao, Guanghua Yu, guguguzi, Guoxia Wang, Haipeng Wang, handiz, Haohongxiang, haosicheng, helen88, heliqi, hong, HongyuJia, houj04, huangxu96, Hui Zhang, Huihuang Zheng, huzhiqiang, Jacek Czaja, Jack Zhou, jack603047588, Jackwaterveg, jakpiase, james, Jiabin Yang, jiangcheng, Jiaqi Liu, JingZhuangzhuang, joanna.wozna.intel, JYChen, JZ-LIANG, Kaipeng Deng, kangguangli, kuizhiqing, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, lidanqing, LielinJiang, Ligoml, Lijunhui, lilong12, limingshu, Lin Manhui, Linjie Chen, liqitong-a, littletomatodonkey, liu zhengxi, Liu-xiandong, liutiexing, Liyulingyue, LiYuRio, Lux et Veritas, lyq, Matsumoto Ruko, MayYouBeProsperous, mengqingchun02, Ming-Xu Huang, ming1753, minghaoBD, moyan, mrcangye, Netpunk, niuliling123, Nyakku Shigure, OccupyMars2025, onecatcn, pangyoki, parap1uie-s, peachlcy, piotrekobi, Qi Li, QingshuChen, qipengh, Rayman, Regan Yue, RichardWooSJTU, risemeup1, Roc, ronnywang, Rui Li, Ruibiao Chen, seemingwang, Shang Zhizhou, shangliang Xu, ShenLiang, shentanyue, Shijie, ShiningZhang, shixingbo, shiyutang, Shuangchi He, Siming Dai, Sing_chan, Skr Bang, SmirnovKol, sneaxiy, sprouteer, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao CHANG, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, tiancaishaonvjituizi, tianshuo78520a, Tomasz Socha, TTerror, USTCKAY, Vigi Zhang, Walter, Wang Bojun, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, WangXi, wangxinxin08, Wangzheee, 
WangZhen, wangzhen38, wawltor, wbn, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, whs, Wilber, WJJ1995, wuhuachaocoding, wuhuanzhou, wuyefeilin, XiaoguangHu, xiaoguoguo626807, xiaohemaikoo, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiayanming, Xingyuan Zhang, xiongkun, yang131313, yangguohao, YangZhou, Yanxing Shi, Yao Zihang, yaoxuefeng, yaozhixin, yeliang2258, Yilingyelu, Yiqun Liu, ykkk2333, Yuang Liu, Yuanle Liu, YuanRisheng, yuguo, Yulong Ao, Yulv-git, YUNSHEN XIE, Zhang Jun, Zhang Ting, Zhang Zheng, zhangbo9674, zhangbopd, zhangchunle, Zhangjingyu06, zhangkaihuo, zhangxiaoci, zhangyikun02, zhangzhenguo, Zhanlue Yang, zhaocaibei123, zhaoying9105, zhaoyingli, Zhen Wang, Zhengyang Song, zhiboniu, Zhong Hui, Zhou Wei, zhoutianzi666, zhupengyang, ziyoujiyi, zlsh80826, zmxdream, zn, Zuza Gawrysiak, zyfncg, 傅剑寒, 六个骨头, 津, 熊峻峰, 王明冬, 石晓伟
2.3.1 Release Note¶
1. Important Updates¶
V2.3.1 is built on V2.3. It fixes known issues and releases a precompiled binary that supports CUDA 11.6.
2. Training Framework (distributed included)¶
(1) Function Optimization¶
API¶
Modify two initialization modes of paddle.nn.initializer.KaimingUniform and paddle.nn.initializer.KaimingNormal, to support multiple types of activation functions. (#43721, #43827)
Optimize the data pre-fetching function of paddle.io.DataLoader, so that prefetch_factor can be set to control the cache size of pre-fetched data. This can avoid IO blocking when reading large blocks of data. (#43674)
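For the Kaiming initializer item above, the reason the activation type matters is that the recommended scale (gain) of the initial weights depends on the nonlinearity that follows the layer. The sketch below computes the resulting bounds in plain Python under the commonly used gain conventions (sqrt(2) for ReLU, sqrt(2 / (1 + slope^2)) for leaky ReLU, 5/3 for tanh). The function names are illustrative assumptions, not the Paddle implementation.

```python
import math

def calculate_gain(nonlinearity, negative_slope=0.01):
    """Commonly used gain values for Kaiming-style initialization."""
    if nonlinearity in ("linear", "sigmoid"):
        return 1.0
    if nonlinearity == "tanh":
        return 5.0 / 3.0
    if nonlinearity == "relu":
        return math.sqrt(2.0)
    if nonlinearity == "leaky_relu":
        return math.sqrt(2.0 / (1.0 + negative_slope ** 2))
    raise ValueError(f"unsupported nonlinearity: {nonlinearity}")

def kaiming_normal_std(fan_in, nonlinearity="relu", negative_slope=0.01):
    """Standard deviation of the normal distribution: gain / sqrt(fan_in)."""
    return calculate_gain(nonlinearity, negative_slope) / math.sqrt(fan_in)

def kaiming_uniform_bound(fan_in, nonlinearity="relu", negative_slope=0.01):
    """Weights are drawn from U(-bound, bound) with bound = sqrt(3) * std."""
    return math.sqrt(3.0) * kaiming_normal_std(fan_in, nonlinearity, negative_slope)

for act in ("relu", "leaky_relu", "tanh"):
    print(act, kaiming_uniform_bound(fan_in=256, nonlinearity=act))
```

With the same fan_in, a tanh layer gets a smaller bound than a ReLU layer only when its gain is smaller, so choosing the matching nonlinearity keeps the activation variance roughly stable across layers.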
New dynamic graph execution mechanism¶
Modify the initialization method of optional type Tensor in the new dynamic graph API logic to prevent data exceptions caused by early destruction. (#42561)
New static graph executor¶
Defer initialization of the thread pools in the executor, to avoid creating thread pools for programs that execute only once (e.g., save, load, startup_program, etc.). (#43768)
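The deferred initialization described above can be pictured with a small standalone sketch. It assumes a simplified policy in which the first run executes tasks inline and a thread pool is only created from the second run on; the LazyPoolExecutor class is hypothetical and the real executor may use a different policy.

```python
from concurrent.futures import ThreadPoolExecutor

class LazyPoolExecutor:
    """Runs a list of zero-argument tasks; creates its thread pool lazily."""

    def __init__(self, num_threads=4):
        self._num_threads = num_threads
        self._pool = None          # not created until it is actually needed
        self._run_count = 0

    @property
    def pool_created(self):
        return self._pool is not None

    def run(self, tasks):
        self._run_count += 1
        if self._run_count == 1:
            # A program that runs only once never pays for thread creation.
            return [task() for task in tasks]
        if self._pool is None:
            self._pool = ThreadPoolExecutor(max_workers=self._num_threads)
        return list(self._pool.map(lambda task: task(), tasks))

    def close(self):
        if self._pool is not None:
            self._pool.shutdown()

executor = LazyPoolExecutor()
print(executor.run([lambda: 1, lambda: 2]), executor.pool_created)  # [1, 2] False
print(executor.run([lambda: 3, lambda: 4]), executor.pool_created)  # [3, 4] True
executor.close()
```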
Distributed training¶
Enable tensor parallelism in paddle.incubate.nn.functional.fused_attention and paddle.incubate.nn.functional.fused_feedforward. (#43505)
Others¶
Adjust print format of the framework operator kernels to facilitate automated splitting and parsing. (#42931)
Update the model quantization API to support rounding to nearest with ties to even, and support quantization in the range [-128, 127]. (#43829)
Support AMP mixed precision training in quantization-aware training. (#43689)
Add a progress bar at the beginning of quantization-aware training, so that it is easy to check the progress of quantization initialization. Skip the scale op when counting out_threshold to speed up the initialization process. (#43454)
Support conv and bn fusion in the dynamic graph quantization training. Support the settings of skip_tensor_list in the static graph offline quantization, to skip some layers without quantization. (#43301)
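The rounding mode and the [-128, 127] range from the quantization item above can be illustrated with a few lines of plain Python. This is a minimal sketch that assumes a simple symmetric scheme with a single scale; the quantize_int8 helper is hypothetical and is not the Paddle quantization API. Python's built-in round already rounds ties to the nearest even integer.

```python
def quantize_int8(values, scale):
    """Quantize floats to INT8 with round-half-to-even and clipping to [-128, 127]."""
    quantized = []
    for v in values:
        q = round(v / scale)                 # ties go to the even neighbour: 0.5 -> 0, 1.5 -> 2
        quantized.append(max(-128, min(127, q)))
    return quantized

print(quantize_int8([0.5, 1.5, 2.5, -0.5, 200.0, -200.0], scale=1.0))
# [0, 2, 2, 0, 127, -128]
```

Compared with always rounding halves away from zero, ties-to-even does not systematically push values in one direction, which reduces accumulated rounding bias.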
(2) Performance Optimization¶
Optimize the paddle.incubate.nn.functional.fused_attention and paddle.incubate.nn.functional.fused_feedforward operators. Add the add_residual property to control whether to perform the add-residual operation in the last step. The performance of CAE model is improved by 7.7%. (#43719)
Optimize the linspace operator. Initialize the three input Tensors start, stop and num on CPU, to avoid GPU->CPU copy in the operator. This can speed up SOLOv2 model performance by 6%. (#43746)
(3) Bug Fix¶
API¶
Fix the error reported by paddle.io.DataLoader when return_list=True due to multi-thread conflict. (#43691)
Fix the error that the to method reports NoneType does not have the device attribute when a paddle.nn.Layer has a parameter of None type. (#43597)
Fix the bug that the calculation result of cumsum op is wrong in some shape settings. (#42500, #43777)
Fix the bug that the output result dimension of Tensor.__getitem__ is 0 in the networking stage when using bool index in the static graph. (#43246)
Fix the bug that occurs when paddle.slice and paddle.strided_slice handle negative parameters. (#43432)
Fix the bug that the assignment result of set_value op is abnormal when the slice step is negative. (#43694)
Fix the bug that the copy interface in C++ cannot copy between multiple cards. (#43728)
Fix the bug in inference stage caused by attribute naming in paddle.incubate.nn.functional.fused_attention and paddle.incubate.nn.functional.fused_feedforward. (#43505)
Fix an exception in ConditionalBlockGrad op when processing Tensor that does not require grad. (#43034)
Fix the bug of device memory increase caused by einsum op in the speed optimization of backward computation. By default, this optimization is enabled. (#43397)
Fix the bug that data read by multi-process paddle.io.DataLoader is not reproducible under a single card even when the random seed is fixed. (#43702)
Fix the bug that softmax op triggers CUDNN_STATUS_NOT_SUPPORT when the Tensor exceeds 2G. (#43719)
Fix the bug that the trace op Event strings are indistinguishable among different operators, which makes performance analysis inconvenient. (#42789)
Others¶
Fix the bug of overflowing device memory caused by multiple deepcopy and saving in case of dynamic-to-static. (#43141)
Fix the bug that the device id introduced by the upgrade of PlaceType used in the custom operator is wrong in the multi-card scenario. (#43830)
Optimize the paddle.profiler.Profiler timeline visualization logic, move events customized in python scripts from C++ folding display to python folding display. (#42790)
3. Deployment Direction (Paddle Inference)¶
(1) New Features¶
(2) Underlying Optimization¶
(3) Bug Fixing¶
Framework and API fixing¶
Backend capability fixing¶
Fix the bug that the elementwise_mul and matmul ops in MKLDNN crash during quantized inference. (#43725)
Fix a bug where TensorRT subgraph serialization files are repeatedly generated for the same model during inference. (#42945, #42633)
Fix a conflict between the ONNX Runtime backend and the external use of protobuf. (#43159, #43742)
Fix an error reported by python prediction library when using ONNX Runtime backend in case of multiple inputs. (#43621)
2.3.0 Release Note¶
1. Important Updates¶
We are excited to release the PaddlePaddle Framework V2.3.0. This version contains the following highlights.
API¶
Added more than 100 new APIs, covering automatic differentiation, linear algebra, probability distribution, sparse tensor, framework performance analysis, hardware device management, vision domain, etc.
Added 4 new automatic differentiation APIs, 11 new linear algebra APIs, and 21 new probability distribution APIs to better support use cases in scientific computing, reinforcement learning, and other application areas.
Added 11 new Sparse Tensor APIs including basic functions of sparse tensor construction and conversion. The COO and CSR formats are supported.
Added 9 new framework performance analysis APIs. The new performance profiling APIs, centered around paddle.profiler.Profiler, help users collect and analyze performance statistics during training and inference.
Added 7 APIs for device management, facilitating hardware information acquisition.
Added several visual and text domain APIs to facilitate reusability of MobileNetV3, ResNeXt and other backbone networks, to achieve fast networking.
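The COO and CSR formats named in the Sparse Tensor item above store the same non-zero values with different index layouts: COO keeps one (row, column) pair per value, while CSR compresses the row indices into row offsets. The sketch below converts a small dense matrix in plain Python. The helper names are hypothetical and do not correspond to the Paddle sparse API.

```python
def dense_to_coo(dense):
    """Return (rows, cols, values) of the non-zero entries in row-major order."""
    rows, cols, values = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                values.append(v)
    return rows, cols, values

def coo_to_csr(rows, cols, values, num_rows):
    """Convert row-major sorted COO to CSR (crows, cols, values)."""
    crows = [0] * (num_rows + 1)
    for r in rows:
        crows[r + 1] += 1              # count non-zeros per row
    for i in range(num_rows):
        crows[i + 1] += crows[i]       # prefix sum gives the row offsets
    return crows, list(cols), list(values)

dense = [[0, 1, 0],
         [2, 0, 3],
         [0, 0, 0]]
rows, cols, values = dense_to_coo(dense)
print(rows, cols, values)                      # [0, 1, 1] [1, 0, 2] [1, 2, 3]
print(coo_to_csr(rows, cols, values, 3)[0])    # [0, 1, 3, 3]
```

Row i of the CSR form occupies positions crows[i] to crows[i + 1] of the column and value arrays, so an empty row such as the last one shows up as two equal offsets.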
Paddle HIgh reusability operator library¶
We announce PHI as the new Paddle HIgh reusability operator library. PHI provides Primitive API, enabling kernel reuse for operator development. As a refactored functional operator library, PHI aims to solve legacy problems that harm the framework’s performance and reusability, in particular on the operator development. Such problems include inefficient ways of cross using operators, unclear operator interfaces and lacking direct calls to the operator library in C++. With PHI, new operators can be easily implemented by composing functions available in the functional library. The library provides over 200 C++ operator class APIs and nearly 500 kernels. Composing new operators through these built-in functions can greatly reduce the user’s development effort. PHI supports different types of hardware (e.g., GPU and XPU). In addition, PHI is extensible with plugins for accommodating third party accelerators (such as NPU) in a low cost and reusable fashion. In short, PHI supports low level operator composability, the reuse of kernels through Primitives, and accelerators through plugins.
Distributed Training¶
Fully upgrade the adaptive distributed training architecture, including multiple modules such as elastic resource management, asynchronous pipelined executor, heterogeneous communication, and automatic parallelism, and support hardware-aware distributed training and inference under a variety of heterogeneous hardware.
Add MoE parallel strategy, GroupSharded parallel strategy, and Pure FP16 under dynamic graph hybrid Parallelism, which further supports the efficient distributed training of large models under the dynamic graph.
Comprehensively upgrade and optimize the architecture of general heterogeneous parameter server, and simplify each module, such as communication and storage, to improve the secondary development experience of parameter server. The performance of GPU parameter server is improved by 2.38 times under 100 billion parameters and 10 billion data.
Compile and Install¶
From version 2.3.0, PaddlePaddle upgrades GPU architectures supported.
Inference Deployment¶
Add the Java API and ONNX Runtime CPU backend.
Support the TensorRT 8.0 / 8.2 and structured sparsity, with deep performance optimization for ERNIE-like structural models.
Hardware Backend Extension¶
Add custom device support: provide a plug-in way to extend PaddlePaddle hardware backend.
Add training/inference support for multiple heterogeneous chips such as HUAWEI Ascend 910 / GraphCore IPU / Cambricon MLU / KUNLUNXIN 2.
Framework Architecture¶
In this version, we did a lot of work on the framework executor. For details, please see New Dynamic Graph Execution Mechanism and New Static Graph Executor.
2. Incompatibility Upgrade¶
Due to limitation of the binary size, sm35 CUDA ARCH is dropped in pre-compiled binaries. (#41754)
When paddle.to_tensor converts a python int scalar to a Tensor, the default data type on Windows changes from int32 to int64, to align with Linux/Mac. (#39662)
To keep consistency with division behavior under python3, the division symbol / has been changed from “rounding divide” to “true divide”, and the data type of the computed output has been switched from int to float. (#40890)
2.2 | 2.3.0 |
---|---|
|
|
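The difference between the two division semantics can be seen with plain Python integers, which follow the same python3 convention the note refers to. This is only an illustration on scalars, not Paddle Tensor code; floor division is used as a stand-in for the old integer result, and for the non-negative operands used here floor and truncation give the same answer.

```python
def rounding_divide(a, b):
    """Old-style result: the quotient stays an integer."""
    return a // b

def true_divide(a, b):
    """New-style result: the quotient is a float, as with / in python3."""
    return a / b

print(rounding_divide(7, 2), type(rounding_divide(7, 2)).__name__)  # 3 int
print(true_divide(7, 2), type(true_divide(7, 2)).__name__)          # 3.5 float
```

Code that relied on the integer result needs an explicit floor division or cast after the upgrade.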
Revise the ELU’s formula. The computing method in case of alpha <0 aligns with the original paper, thus fixing a small number of cases where the results are incorrectly calculated. Meanwhile, elu_ will report an error in case of alpha <0, because it is not mathematically possible to compute the inverse gradient from the output only at alpha <0. (#37316)
2.2 | 2.3.0 |
---|---|
|
|
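The reason the in-place variant cannot support alpha < 0 can be checked numerically. The sketch below uses the ELU definition from the original paper, ELU(x) = x for x > 0 and alpha * (exp(x) - 1) otherwise, written in plain Python as an illustration rather than the Paddle kernel. With a negative alpha, a negative input and a positive input can produce the same output while having different gradients, so the gradient cannot be recovered from the output alone.

```python
import math

def elu(x, alpha=1.0):
    """ELU as defined in the original paper."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    """Derivative of ELU with respect to its input x."""
    return 1.0 if x > 0 else alpha * math.exp(x)

alpha = -1.0
x_neg = -1.0
y = elu(x_neg, alpha)            # positive output produced by a negative input
x_pos = y                        # a positive input that maps to the same output
print(elu(x_pos, alpha) == y)    # True: two different inputs, one output
print(elu_grad(x_neg, alpha), elu_grad(x_pos, alpha))  # different gradients
```

An in-place operator overwrites x with the output, so at backward time it only sees y and cannot tell which of the two gradients applies.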
3. Training Framework (with the distributed function)¶
(1) New functions¶
API¶
Add 4 new automatic differentiation APIs to support scientific computing, as listed below: (#40692)
paddle.incubate.autograd.vjp, compute vector-Jacobian product.
paddle.incubate.autograd.jvp, compute Jacobian-vector product.
paddle.incubate.autograd.Jacobian, compute Jacobian matrix.
paddle.incubate.autograd.Hessian, compute Hessian matrix.
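To make the relationship between the vjp, jvp and Jacobian entries above concrete, the sketch below builds a Jacobian numerically with central differences and multiplies it from either side. It is a plain-Python illustration under the assumption of a small list-valued function; real automatic differentiation computes these products without materializing the full Jacobian, and the function names here only mirror the concepts, not the Paddle signatures.

```python
def jacobian(f, x, eps=1e-6):
    """Numerical Jacobian J[i][j] = d f_i / d x_j by central differences."""
    num_outputs = len(f(x))
    J = [[0.0] * len(x) for _ in range(num_outputs)]
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        y_plus, y_minus = f(x_plus), f(x_minus)
        for i in range(num_outputs):
            J[i][j] = (y_plus[i] - y_minus[i]) / (2 * eps)
    return J

def jvp(f, x, v):
    """Jacobian-vector product J @ v (forward-mode quantity)."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in jacobian(f, x)]

def vjp(f, x, u):
    """Vector-Jacobian product u^T @ J (reverse-mode quantity)."""
    J = jacobian(f, x)
    return [sum(u[i] * J[i][j] for i in range(len(u))) for j in range(len(x))]

def f(x):
    return [x[0] * x[1], x[0] + x[1] ** 2]

x = [2.0, 3.0]                 # exact Jacobian at x is [[3, 2], [1, 6]]
print(jvp(f, x, [1.0, 0.0]))   # approximately [3.0, 1.0], the first column of J
print(vjp(f, x, [1.0, 0.0]))   # approximately [3.0, 2.0], the first row of J
```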
Add linear algebra class API
Add paddle.linalg.triangular_solve, to solve a system of linear equations with a unique solution whose coefficient matrix is triangular. (#36714)
Add paddle.linalg.eig, to compute the eigendecomposition of a general square matrix. (#35764)
Add paddle.linalg.solve, to compute solutions to systems of linear equations. (#35715)
Add paddle.linalg.lstsq, to compute least-squares solutions to systems of linear equations. (#38585, #38621)
Add paddle.linalg.qr, to compute QR decomposition of matrix. (#35742, #38824)
Add paddle.inner, to compute inner product of a matrix. (#37706)
Add paddle.outer, to compute outer product of a matrix. (#37706)
Add paddle.linalg.cov, to compute covariance between vectors. (#38392)
Add paddle.linalg.cholesky_solve, to compute the cholesky solution of the equation. (#38167)
Add paddle.linalg.lu and paddle.linalg.lu_unpack, to compute matrix lu decomposition, and decompress lu matrix. (#38617, #38559, #38616)
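As background for the triangular solve entry in the list above, a system with an upper-triangular coefficient matrix can be solved by back substitution, starting from the last equation. The sketch below is a plain-Python illustration of that technique; the solve_upper_triangular helper is hypothetical and is not how the Paddle operator is implemented.

```python
def solve_upper_triangular(U, b):
    """Solve U x = b for an upper-triangular U with non-zero diagonal."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        # Everything to the right of the diagonal is already known.
        known = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - known) / U[i][i]
    return x

U = [[2.0, 1.0],
     [0.0, 4.0]]
b = [5.0, 8.0]
print(solve_upper_triangular(U, b))  # [1.5, 2.0]
```

The solution is unique exactly when every diagonal entry is non-zero, which is the case the list item describes.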
Add 21 new probability distribution class APIs for reinforcement learning, variational inference, scientific computing, and other scenarios. They include 6 random variable distributions, 13 random variable transformations, and 2 KL divergence computing APIs, as listed below: (#40536, #38820, #38558, #38445, #38244, #38047)
paddle.distribution.ExponentialFamily, exponential distribution family base class.
paddle.distribution.Beta, Beta distribution.
paddle.distribution.Dirichlet, Dirichlet distribution.
paddle.distribution.Independent, Independent distribution, used to create higher-order distributions.
paddle.distribution.TransformedDistribution, Transform distribution, used to generate higher-order distributions through the base distribution and a series of transformations.
paddle.distribution.Multinomial, a multinomial distribution.
paddle.distribution.Transform, base class for transforming random variables.
paddle.distribution.AbsTransform, absolute value transform.
paddle.distribution.AffineTransform, affine transform.
paddle.distribution.ChainTransform, chain combination of transforms.
paddle.distribution.ExpTransform, exponential transform.
paddle.distribution.IndependentTransform, independent transform, used to extend the event_dim of the transform's domain.
paddle.distribution.PowerTransform, power transform.
paddle.distribution.ReshapeTransform, reshape transform.
paddle.distribution.SigmoidTransform, sigmoid transform.
paddle.distribution.SoftmaxTransform, softmax transform.
paddle.distribution.StackTransform, stack transform, used to combine multiple transforms in a stack method.
paddle.distribution.StickBreakingTransform, stickbreaking transform.
paddle.distribution.TanhTransform, tanh transform.
paddle.distribution.kl_divergence, to compute KL divergence.
paddle.distribution.register_kl, to register a user-defined KL divergence calculation function.
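The relationship between kl_divergence and register_kl can be illustrated with a minimal pure-Python sketch of type-pair dispatch. Everything here is an assumption for illustration: the Normal class, the registry dictionary, and the exact-type lookup are hypothetical stand-ins, not the paddle.distribution implementation.

```python
import math


class Normal:
    """Hypothetical stand-in for a distribution class (not a Paddle type)."""

    def __init__(self, loc, scale):
        self.loc = loc
        self.scale = scale


# Maps a pair of distribution types to the function computing KL(p || q).
_KL_REGISTRY = {}


def register_kl(cls_p, cls_q):
    """Decorator that registers a KL function for the pair (cls_p, cls_q)."""
    def decorator(fn):
        _KL_REGISTRY[(cls_p, cls_q)] = fn
        return fn
    return decorator


def kl_divergence(p, q):
    """Dispatch on the exact argument types and call the registered function."""
    fn = _KL_REGISTRY.get((type(p), type(q)))
    if fn is None:
        raise NotImplementedError(
            f"no KL registered for {type(p).__name__}, {type(q).__name__}")
    return fn(p, q)


@register_kl(Normal, Normal)
def _kl_normal_normal(p, q):
    # Closed form: log(s_q / s_p) + (s_p^2 + (mu_p - mu_q)^2) / (2 s_q^2) - 1/2
    var_ratio = (p.scale / q.scale) ** 2
    mean_term = ((p.loc - q.loc) / q.scale) ** 2
    return 0.5 * (var_ratio + mean_term - 1.0 - math.log(var_ratio))
```

Calling kl_divergence(Normal(0.0, 1.0), Normal(1.0, 1.0)) looks up the function registered for the (Normal, Normal) pair; a real implementation may also resolve subclasses, which this sketch does not attempt.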
Add high-level API
Add paddle.vision.models.AlexNet and paddle.vision.models.alexnet, to use AlexNet models directly. (#36058)
Add paddle.vision.models.DenseNet, paddle.vision.models.densenet121, paddle.vision.models.densenet161, paddle.vision.models.densenet169, paddle.vision.models.densenet201, and paddle.vision.models.densenet264, to use DenseNet models directly. (#36069)
Add paddle.vision.models.GoogLeNet and paddle.vision.models.googlenet, to use GoogLeNet models directly. (#36034)
Add paddle.vision.models.InceptionV3 and paddle.vision.models.inception_v3, to use InceptionV3 models directly. (#36064)
Add paddle.vision.models.MobileNetV3Small, paddle.vision.models.MobileNetV3Large, paddle.vision.models.mobilenet_v3_small, and paddle.vision.models.mobilenet_v3_large, to use MobileNetV3 models directly. (#38653)
Add paddle.vision.models.resnext50_32x4d, paddle.vision.models.resnext50_64x4d, paddle.vision.models.resnext101_32x4d, paddle.vision.models.resnext101_64x4d, paddle.vision.models.resnext152_32x4d, and paddle.vision.models.resnext152_64x4d, to use ResNeXt models directly. (#36070)
Add paddle.vision.models.ShuffleNetV2, paddle.vision.models.shufflenet_v2_x0_25, paddle.vision.models.shufflenet_v2_x0_33, paddle.vision.models.shufflenet_v2_x0_5, paddle.vision.models.shufflenet_v2_x1_0, paddle.vision.models.shufflenet_v2_x1_5, paddle.vision.models.shufflenet_v2_x2_0, and paddle.vision.models.shufflenet_v2_swish, to use ShuffleNetV2 models directly. (#36067)
Add paddle.vision.models.SqueezeNet, paddle.vision.models.squeezenet1_0, and paddle.vision.models.squeezenet1_1, to use SqueezeNet models directly. (#36066)
Add paddle.vision.models.wide_resnet50_2 and paddle.vision.models.wide_resnet101_2, to use WideResNet models directly. (#36952)
Add paddle.vision.ops.nms API, to support single-category and multi-category non-maximum suppression (NMS) algorithms for accelerating object detection prediction tasks. (#40962)
Add paddle.vision.ops.roi_pool and paddle.vision.ops.RoIPool, to support RoI region pooling operations in detection tasks. (#36154)
Add paddle.vision.ops.roi_align and paddle.vision.ops.RoIAlign, to support RoI Align operations in detection tasks. (#35102)
Add the paddle.text.ViterbiDecoder and paddle.text.viterbi_decode Viterbi decoding APIs, mainly for sequence tagging model prediction. (#35778)
Add 11 Sparse class APIs, to support basic functions such as creating Sparse Tensors in COO and CSR formats and converting to and from dense Tensors in C++.
paddle.sparse.sparse_coo_tensor, to create a Sparse Tensor in COO format. (#40780)
paddle.sparse.sparse_csr_tensor, to create a Sparse Tensor in CSR format. (#40780)
paddle.sparse.ReLU, to support the ReLU activation layer for SparseCooTensor. (#40959)
paddle.sparse.functional.relu, to support the ReLU function for SparseCooTensor. (#40959)
Tensor.values(), a C++ method to get the non-zero elements of a SparseCooTensor or SparseCsrTensor. (#40608)
Tensor.indices(), a C++ method to get the coordinate information of a SparseCooTensor. (#40608)
Tensor.crows(), a C++ method to get the compressed row information of a SparseCsrTensor. (#40608)
Tensor.cols(), a C++ method to get the column information of a SparseCsrTensor. (#40608)
Tensor.to_sparse_coo(), a C++ method to convert a DenseTensor or SparseCsrTensor to a SparseCooTensor. (#40780)
Tensor.to_sparse_csr(), a C++ method to convert a DenseTensor or SparseCooTensor to a SparseCsrTensor. (#40780)
Tensor.to_dense(), a C++ method to convert a SparseCooTensor or SparseCsrTensor to a DenseTensor. (#40780)
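To make the COO and CSR terminology above concrete, here is a small pure-Python sketch of converting COO triplets into CSR arrays and back to a dense matrix. The function names coo_to_csr and csr_to_dense are hypothetical helpers written for this illustration; they only mirror the roles of indices, crows, cols, and values, and are not the Paddle kernels.

```python
def coo_to_csr(rows, cols, values, num_rows):
    """Convert COO triplets into CSR arrays (crows, cols, values)."""
    # Sort non-zero entries by (row, column) so each row is contiguous.
    order = sorted(range(len(values)), key=lambda i: (rows[i], cols[i]))
    # crows[r + 1] first counts the non-zeros of row r ...
    crows = [0] * (num_rows + 1)
    for r in rows:
        crows[r + 1] += 1
    # ... and a prefix sum turns the counts into row start offsets.
    for r in range(num_rows):
        crows[r + 1] += crows[r]
    return crows, [cols[i] for i in order], [values[i] for i in order]


def csr_to_dense(crows, cols, values, num_cols):
    """Expand CSR arrays into a dense list-of-lists matrix."""
    dense = []
    for r in range(len(crows) - 1):
        row = [0] * num_cols
        # Entries crows[r] .. crows[r + 1] - 1 belong to row r.
        for k in range(crows[r], crows[r + 1]):
            row[cols[k]] = values[k]
        dense.append(row)
    return dense


# A 3x3 matrix with three non-zeros, given in unsorted COO order.
crows, csr_cols, csr_values = coo_to_csr(
    rows=[1, 0, 1], cols=[2, 1, 0], values=[3, 1, 2], num_rows=3)
dense = csr_to_dense(crows, csr_cols, csr_values, num_cols=3)
```

The compressed row array has one more entry than the number of rows, which is why an empty last row still appears as a repeated offset.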
Add hardware related APIs
Add four GPU memory monitoring related APIs: paddle.device.cuda.max_memory_allocated, paddle.device.cuda.max_memory_reserved, paddle.device.cuda.memory_allocated, and paddle.device.cuda.memory_reserved, to view and analyze the GPU memory usage in real-time. (#38657)
Add paddle.device.cuda.get_device_properties, to return the properties of the GPU device. (#35661)
Add paddle.device.cuda.get_device_name and paddle.device.cuda.get_device_capability, to return the name and compute capability of the GPU device. (#35672)
Add Tensor operation API
Add paddle.nansum, to sum the input Tensor along axis, ignoring NaN values. (#38137)
Add paddle.nanmean, to average the input Tensor along axis, ignoring NaN values. (#40472)
Add paddle.clone, to return a copy of the input Tensor that supports gradient calculation. (#38020)
Add paddle.Tensor.element_size, to return the number of bytes allocated for a single element in a Tensor. (#38020)
Add paddle.Tensor.to_uva_tensor, to convert numpy objects into CUDA-accessible objects with virtual addresses, which are physically stored in CPU memory. (#39146, #38950)
Add paddle.rot90, to rotate the n-dimensional Tensor by 90 degrees along the plane specified by axes. (#37634)
Add paddle.logit and paddle.Tensor.logit, to compute the logit function values for the input Tensor. (#37844)
Add paddle.repeat_interleave, to repeat elements of the input along the specified axis and return a new Tensor. (#37981)
Add paddle.renorm, to split the Tensor into multiple pieces along the specified axis and then perform p-norm operations on each piece separately. (#38130, #38459)
Add paddle.mode and paddle.Tensor.mode, to find the mode values and their indices in the input Tensor along the specified axis. (#38446)
Add paddle.quantile and paddle.Tensor.quantile, to compute the q-quantile of a Tensor along the specified axis. (#38567)
Add paddle.kthvalue and paddle.Tensor.kthvalue, to find the kth smallest values and their indices along the specified axis. (#38386)
Add paddle.is_floating_point and paddle.Tensor.is_floating_point, to determine if the input Tensor is of a floating point type. (#37885)
Add paddle.erfinv and paddle.Tensor.erfinv, to compute the inverse error function of the input Tensor. (#38295)
Add paddle.lerp and paddle.Tensor.lerp, to compute linear interpolation between the input Tensors based on the given weights. (#37253)
Add paddle.angle, to compute the phase angle of a complex Tensor. (#37689)
Add paddle.rad2deg and paddle.Tensor.rad2deg, to convert each element of the input from radians to degrees. (#37598)
Add paddle.deg2rad and paddle.Tensor.deg2rad, to convert each element of the input from degrees to radians. (#37598)
Add paddle.gcd and paddle.Tensor.gcd, to compute the element-wise greatest common divisor of the absolute values of two inputs. (#37819)
Add paddle.lcm and paddle.Tensor.lcm, to compute the element-wise least common multiple of the absolute values of two inputs. (#37819)
Add paddle.amax and paddle.Tensor.amax, to get the maximum value of Tensor elements along the specified dimension. (#38417)
Add paddle.amin and paddle.Tensor.amin, to get the minimum value of Tensor elements along the specified dimension. (#38417)
Add paddle.isclose, to determine if each element of two Tensors is close to each other. (#37135)
Add paddle.put_along_axis and paddle.take_along_axis, for placing or extracting elements at the specified indices. (#38608)
Add paddle.bincount and paddle.Tensor.bincount, for counting the number of occurrences of each element in a Tensor. (#36317)
Add paddle.fmax and paddle.fmin, to extend the max/min functions to handle NaN values in the two input Tensors. If exactly one of the two values at a position is NaN, the non-NaN value is returned; if both values are NaN, NaN is returned. (#37826)
Add paddle.diff, for computing the nth forward difference along a given dimension. It currently supports n=1. (#37441)
Add inverse hyperbolic functions: paddle.asinh, paddle.acosh, and paddle.atanh. (#37076)
Add paddle.as_real and paddle.as_complex, for conversion between real Tensor and complex Tensor. (#37784)
Add paddle.complex, for constructing a complex Tensor with the given real and imaginary parts. (#37918, #38272)
Add paddle.det and paddle.slogdet, to compute the determinant of a matrix, and the sign and natural logarithm of the absolute value of the determinant. (#34992)
Add paddle.nn.utils.parameters_to_vector, to flatten parameters to a 1-D Tensor. (#38020)
Add paddle.nn.utils.vector_to_parameters, to transform a Tensor with 1-D shape to the parameters. (#38020)
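The NaN handling described for paddle.fmax, paddle.fmin, and paddle.nansum in the list above can be sketched on plain Python floats. This is only an illustration of the element-wise rule stated there; the scalar helpers below are hypothetical and do not call Paddle.

```python
import math


def fmax(a, b):
    """Maximum of two floats that prefers the non-NaN operand."""
    if math.isnan(a):
        return b          # b may itself be NaN, in which case NaN is returned
    if math.isnan(b):
        return a
    return max(a, b)


def fmin(a, b):
    """Minimum of two floats that prefers the non-NaN operand."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)


def nansum(values):
    """Sum of the values, skipping NaN entries."""
    return sum(v for v in values if not math.isnan(v))


nan = float("nan")
pairs = [(1.0, nan), (nan, 2.0), (3.0, 4.0)]
elementwise_max = [fmax(a, b) for a, b in pairs]
```

For comparison, Python's built-in max gives order-dependent results when NaN is involved, which is the behavior these NaN-aware variants avoid.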
Add networking class APIs
Add paddle.nn.Fold and paddle.nn.functional.fold, to combine the sliding local blocks of each Tensor in a batch into a larger Tensor. (#38613)
Add paddle.nn.CELU and paddle.nn.functional.celu, to support the CELU activation layer. (#36088)
Add paddle.nn.HingeEmbeddingLoss, to compute the hinge embedding loss. It is usually used for nonlinear embedding or semi-supervised learning. (#37540)
Add paddle.nn.ZeroPad2D API, for zero-padding according to the padding parameter. (#37151)
Add paddle.nn.MaxUnPool3D and paddle.nn.MaxUnPool1D, for computing 3D max unpooling and 1D max unpooling. (#38716)
Add paddle.incubate.graph_khop_sampler, paddle.incubate.graph_sample_neighbors, and paddle.incubate.graph_reindex APIs, to support graph multi-order neighbor sampling and graph reindexing operations. They are mainly used for graph neural network model training. (#39146, #40809)
Add random number class APIs
Add paddle.poisson, to generate a Tensor that follows a Poisson distribution with the given lambda parameter. (#38117)
Add paddle.randint_like API, to generate a new Tensor that follows a uniform distribution in the range [low, high), with the shape of the output matching the shape of the input. (#36169)
Add paddle.Tensor.exponential_. It is an inplace style API that populates the input Tensor with exponentially distributed random numbers. (#38256)
Add parameter initialization class APIs
Add paddle.nn.initializer.Dirac, to initialize 3D/4D/5D parameters with Dirac delta functions. It is commonly used for initialization of Conv1D/Conv2D/Conv3D parameters in the convolution layer. (#37389)
Add paddle.nn.initializer.Orthogonal, for orthogonal matrix initialization. The initialized parameter is a (semi-)orthogonal matrix. (#37163)
Add paddle.nn.initializer.calculate_gain, to get the recommended gain value for the activation function. The gain value can be passed to certain initialization APIs to adjust the initialization range. (#37163)
Add learning rate class API
Add paddle.optimizer.lr.MultiplicativeDecay, which multiplies the learning rate by the factor returned from a user-provided lambda function. (#38250)
Add distributed-related APIs
Add new optimizer-related APIs (#40710)
paddle.incubate.optimizer.functional.minimize_bfgs, to add the second-order optimizer BFGS.
paddle.incubate.optimizer.functional.minimize_lbfgs, to add the second-order optimizer L-BFGS.
Add paddle.incubate.multiprocessing module, to provide Tensor (CPU/GPU) data transfer between python processes. (#37302, #41339)
Add paddle.incubate.autotune.set_config API, to support multi-version Kernel auto-selection, mixed precision data layout auto-conversion, and num_workers auto-selection for DataLoader to automatically improve model performance. (#42301)
Add paddle.incubate.nn.FusedMultiTransformer and paddle.incubate.nn.functional.fused_multi_transformer API, to fuse multiple layers of transformers into a single op to improve model inference performance. It should be noted that only forward is supported. (#42311)
Add einsum_v2 operators for consistent interface between dynamic graph mode and static graph mode. It is compatible with the paddle.einsum implementation at the original python side, while supporting dynamic to static export and more complete Infershape inference. (#42495, #42327, #42397, #42105)
IR(Intermediate Representation)¶
Dynamic graph to static graph
In the StaticAnalysis module for variable types, add support for type tagging of statements such as a, b = paddle.shape(x). (#39245)
Add a computed field, supporting InputSpec.name as the Program cache hash key. (#38273)
Add support for the dict['key'] = x.shape syntax. (#40611)
Add support for Pure FP16 training. (#36944)
Add support for the for i in [x,y,z] syntax. (#37259)
Add support for the type hint syntax of python3. (#36544)
Pass development
Add forward and backward fusion for FC + [relu|gelu] based on NVIDIA cuBlasLt Epilogue. (#39437)
Kernel Primitive API
Add KP operators on GPU platform, including cast, scale, clip, bce_loss, abs_grad, reduce_sum_grad, reduce_mean_grad, full, full_like, distribution, random, masked_select_kernel, where_index, masked_select_grad, dropout, sigmoid, and where. (#36203, #36423, #39390, #39734, #38500, #38959, #39197, #39563, #39666, #40517, #40617, #40766, #39898, #39609)
Add the support for XPU2 source code compilation mode. (#37254, #40397, #38455)
Add the support for KP operator reuse on XPU2 and GPU, including reduce, broadcast, elementwise_add, exp, log, relu, sigmoid, leaky_relu, softplus, hard_swish, and reciprocal. (#36904, #37226, #38918, #40560, #39787, #39917, #40002, #40364)
Add unit tests of KP operators on the XPU2 platform, including brelu, ceil, celu, elu, floor, hard_shrink, hard_sigmoid, log1p, logsigmoid, relu6, silu, soft_relu, softsign, sqrt, square, swish, thresholded_relu, and softshrink. (#40448, #40524)
Add the support for XPU2 KP models, including resnet50, deepfm, wide_deep, yolov3-darknet53, det_mv3_db, bert, transformer, mobilenet_v3, and GPT2.
Mixed Precision Training¶
Split the paddle.amp.GradScaler.unscale_ method from the minimize of the mixed precision training paddle.amp.GradScaler, to provide a separate interface for unscaling the gradients. (#35825)
Add the FP16 support for paddle.nn.ClipByGlobalNorm dynamic graph mode. Add FP16 Kernel for clip op to enable clip-related operations to support FP16 compute. (#36198, #36577)
Support the case that the optimizer parameter passed to paddle.amp.decorate is None. (#37541)
For the merged_momentum op, add support for multiple input learning rates, the use_nesterov policy computation, and the regularization computation. (#37527)
Add multi_tensor policy to paddle.optimizer.Momentum optimizer. Add set_to_zero branch to clear_grad of Optimizer class. (#37564)
Add multi_tensor policy to paddle.optimizer.Adam. (#38010)
Add multi_precision policy to paddle.optimizer.SGD optimizer. (#38231)
Add the storage of the master weight parameter to the optimizer state_dict method. (#39121)
Add support for op CUDA bfloat16 mixed precision training. Support for O1 and O2 modes. Enable the above training modes via paddle.amp.auto_cast. (#39029, #39815)
Add bfloat16 CUDA Kernel for the following ops: matmul, concat, split, dropout, reshape, slice, squeeze, stack, transpose, unbind, elementwise_max, elementwise_add, elementwise_mul, elementwise_sub, scale, sum, layer_norm, p_norm, reduce_sum, softmax, log_softmax, sigmoid, sqrt, softplus, square, gaussian_random, fill_constant, and fill_any_like. (#39485, #39380, #39395, #39402, #39457, #39461, #39602, #39716, #39683, #39843, #39999, #40004, #40027)
Add bfloat16 CPU Kernel for the following ops: dropout, reshape, slice, squeeze, unsqueeze, stack, transpose, unbind, elementwise_max, elementwise_mul, elementwise_sub, and gather. (#39380, #39395, #39402, #39457, #39461, #39602, #39716, #39683)
Support printing of Tensor with data of bfloat16. (#39375, #39370)
Add support for FP16 computation for p_norm, elementwise_max, fill_constant_batch_size_like, and scatter. (#35888, #39907, #38136, #38499)
Add support for int16_t for the following ops: cumsum, less_than, less_equal, greater_than, greater_equal, equal, not_equal, fill_any_like, gather_nd, reduce_sum, where_index, reshape, and unsqueeze. (#39636)
Add support for int16_t label type for cross_entropy op. (#39409)
Add support for int16_t id type for embedding op. (#39381)
Add support for FP16 type for reduce_mean op. (#38289)
Add support for FP16 type for elementwise_min op. (#38123)
Update bfloat16 AMP oneDNN default support list. (#39304)
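One reason a separate unscale step is useful is that gradients can be inspected or clipped at their true magnitude before the optimizer update. The sketch below shows that flow with a toy scaler on plain Python floats. ToyGradScaler and train_step are hypothetical names written for this illustration and are not the paddle.amp.GradScaler implementation; the clipping rule and the skip-on-overflow behavior are likewise assumptions.

```python
import math


class ToyGradScaler:
    """Toy loss scaler; a hypothetical stand-in, not paddle.amp.GradScaler."""

    def __init__(self, init_loss_scaling=1024.0):
        self.loss_scaling = init_loss_scaling
        self._unscaled = False

    def scale(self, loss):
        """Multiply the loss so that small gradients survive low precision."""
        return loss * self.loss_scaling

    def unscale_(self, grads):
        """Divide gradients by the scaling factor in place, at most once."""
        if not self._unscaled:
            for i, g in enumerate(grads):
                grads[i] = g / self.loss_scaling
            self._unscaled = True
        return all(math.isfinite(g) for g in grads)

    def step(self, params, grads, lr):
        """Apply an SGD update unless a gradient overflowed; reset state."""
        finite = self.unscale_(grads)
        if finite:
            for i, g in enumerate(grads):
                params[i] -= lr * g
        self._unscaled = False
        return finite


def train_step(w, lr=0.1, max_abs_grad=1.0):
    """One step on loss = w * w with clipping between unscale and update."""
    scaler = ToyGradScaler()
    # Gradient of the scaled loss: d(scale * w^2)/dw = scale * 2w.
    grads = [2.0 * w * scaler.loss_scaling]
    scaler.unscale_(grads)            # gradients are now at true magnitude
    for i, g in enumerate(grads):     # clip in place by absolute value
        grads[i] = max(-max_abs_grad, min(max_abs_grad, g))
    params = [w]
    scaler.step(params, grads, lr)    # does not unscale a second time
    return params[0]
```

In this toy version, step skips the update when any gradient is non-finite, which mimics how loss-scaling schemes typically react to overflow.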
Paddle HIgh reusability operator library¶
We announce PHI as the new Paddle HIgh reusability operator library. PHI provides a Primitive API, enabling kernel reuse for operator development. As a refactored functional operator library, PHI aims to solve legacy problems that harm the framework’s performance and reusability, particularly in operator development. Such problems include inefficient cross-use of operators, unclear operator interfaces, and the lack of direct calls to the operator library in C++. With PHI, new operators can be easily implemented by composing functions available in the functional library. The library provides over 200 C++ operator class APIs and nearly 500 kernels. Composing new operators through these built-in functions can greatly reduce the user’s development effort. PHI supports different types of hardware (e.g., GPU and XPU). In addition, PHI is extensible with plugins for accommodating third-party accelerators (such as NPU) in a low-cost and reusable fashion. In short, PHI supports low-level operator composability, the reuse of kernels through Primitives, and accelerators through plugins. The main contents include six parts as below:
The implementation of the operator library infrastructure, core components and mechanisms: Plan the directory structure of the new operator library, and design and implement its common base data structures, the new functional InferMeta and Kernel development paradigm, and the corresponding registration and management components. Support the automated generation of compilation objects and compilation dependencies for Kernel files, allowing developers to focus only on the functional Kernel implementation, and making the development paradigm clear and concise. (#34425, #37107, #36946, #36948, #37876, #37916, #37977, #38078, #38861, #39123, #39131, #39748, #39790, #39941, #40239, #40635, #41091, #37409, #37942, #39002, #38109, #37881, #37517, #39870, #40975, #39475, #37304, #36910, #37120, #37146, #37215, #37255, #37369, #38258, #38257, #38355, #38853, #38937, #38977, #38946, #39085, #39153, #39228, #38301, #38275, #38506, #38607, #38473, #38632, #38811, #38880, #38996, #38914, #39101)
Operator library C++ API system construction: design and implement yaml configuration file-based operator definition paradigm, to automatically generate more than 200 C++ operator class APIs for internal and external developers to reuse. This reduces the cost of repeated development of basic operators. (#37668, #36938, #38172, #38182, #38311, #38438, #39057, #39229, #39281, #39263, #39408, #39436, #39482, #39497, #39651, #39521, #39760, #40060, #40196, #40218, #40640, #40732, #40729, #40840, #40867, #41025, #41368)
Operator library compatible with various execution systems: Implement new InferMeta and Kernel to access the original dynamic and static graph execution system. Support the safe removal of the original OpKernel registration and migration to the new Kernel form. (#34425, #38825, #38837, #38842, #38976, #39134, #39140, #39135, #39252, #39222, #39351)
Decouple the underlying data structures and tool functions of the operator library from the framework: Relieve PHI’s dependence on the framework for core data structures, lay the foundation for subsequent independent compilation of PHI, and support infrt, custom Kernel, and a series of Phi-based construction work (#38583, #39188, #39560, #39931, #39169, #38951, #38898, #38873, #38696, #38651, #39359, #39305, #39234, #39098, #39120, #38979, #38899, #38844, #39714, #39729, #39889, #39587, #39558, #39514, #39502, #39300, #39246, #39124)
Integration between custom operator mechanism and Phi with improvement: support for calling over 200 C++ operator class APIs automatically generated by PHI when writing custom operators. This reduces custom operator development costs. A series of bugs are fixed. (#37122, #37276, #37281, #37262, #37415, #37423, #37583, #38776, #39353, #41072)
Operator scale migration and refactoring: migrate about 250 high-frequency forward and backward operator Kernel to the new operator library and refactor them as a single function. Achieve the high-performance operator by encapsulating multiple base Kernel functions on the C++ side for the fast combination. Meanwhile, add the corresponding yaml operator definition, and access to the new dynamic graph execution system to improve the python API scheduling performance. The migrated and refactored operators include:
sqrt (#40727)
square(#40727)
sin (#40175)
sinh (#40175)
elementwise_fmax(#40140)
elementwise_fmin(#40140)
p_norm (#40819)
fill_constant_batch_size_like (#40784)
conv2d(#39354)
conv3d(#39354)
mish(#40727)
gather (#40500)
sgd (#40045)
momentum (#41319)
rmsprop(#40994)
adam (#40351)
layer_norm(#40193)
adagrad(#40994)
adamax (#40173)
adadelta (#40173)
ceil (#40913)
cos (#40175)
atan (#40175)
cosh (#40175)
erf(#40388)
asin (#40175)
acos (#40175)
scale (#39278)
elementwise_pow (#40993)
round (#40913)
floor (#40913)
pow (#40913)
elementwise_floordiv (#40993)
reciprocal(#40727)
log1p (#40785)
allclose (#40469)
mul (#40833)
elementwise_max (#40590)
elementwise_min (#40590)
elementwise_mod (#40590)
fill_any_like (#39807)
dot(#38359)
sum (#40873)
diag_v2 (#39914)
one_hot_v2 (#39876)
bce_loss (#39868)
argsort (#40151)
arg_max (#40222)
arg_min (#40222)
segment_pool (#40099)
dist (#40178)
isnan_v2 (#40076)
logical_and (#39942)
logical_not (#39942)
isfinite_v2 (#40076)
logical_or (#39942)
isinf_v2 (#40076)
is_empty (#39919)
logical_xor (#39942)
less_than(#39970)
not_equal(#39970)
equal(#39970)
less_equal(#39970)
equal_all(#39970)
uniform_random (#39937)
randperm (#41265)
unbind (#39789)
bernoulli (#39590)
where (#39811)
log10 (#40785)
log2 (#40785)
expm1(#40727)
atan2 (#39806)
empty (#38334)
tan (#40175)
bitwise_and (#40031)
bitwise_not(#40031)
bitwise_or(#40031)
poisson(#39814)
cholesky_solve(#40387)
bitwise_xor(#40031)
triangular_solve(#40417)
sigmoid (#40626)
atanh (#40175)
softsign(#40727)
thresholded_relu (#40385)
tanh_shrink (#40565)
stanh(#40727)
reduce_mean (#37559)
reduce_max(#40225)
reduce_min (#40374)
reduce_all (#40374)
reduce_any (#40374)
logsumexp (#40790)
softshrink(#40565)
stack(#40581)
tile (#40371)
unique(#40581)
unstack(#40581)
slice(#40736)
transpose2(#39327)
unsqueeze2 (#40596)
squeeze2 (#40596)
strided_slice (#40708)
softmax (#39547)
leaky_relu (#40385)
gelu (#40393)
prelu (#40393)
log_softmax (#40393)
elu (#40565)
logsigmoid (#40626)
kthvalue(#40575)
mode (#40571)
yolo_box(#40112)
yolov3_loss (#40944)
temporal_shift(#40727)
depthwise_conv2d(#39354)
pad3d (#40701)
pad (#40012)
greater_equal(#39970)
kldiv_loss (#39770)
isclose (#39770)
silu (#40565)
unfold (#39778)
batch_norm (#39347)
norm(#39324)
label_smooth (#39796)
grid_sampler (#40585)
greater_than(#39970)
nearest_interp_v2 (#40855)
bilinear_interp_v2 (#40855)
softmax_with_cross_entropy (#40832)
rnn (#41007)
reverse (#40791)
trace (#39510)
kron(#40427)
accuracy(#39982)
dropout(#40148)
bincount (#39947)
assign_value (#40967)
assign (#40022)
cast (#37610)
where_index (#40255)
cumprod (#39770)
shard_index (#40254)
lookup_table_v2(#39901)
adamw (#40351)
tanh (#40385)
cross (#39829)
split (#39060)
linspace (#40124)
huber_loss (#39761)
hierarchical_sigmoid(#40553)
nll_loss (#39936)
exp(#40727)
rsqrt(#40727)
viterbi_decode (#40186)
conj (#38247)
lgamma (#39770)
relu (#40175)
log (#40785)
bilinear_tensor_product(#39903)
logit (#37844)
broadcast_tensors(#40047)
gumbel_softmax(#39873)
diagonal (#39575)
multi_dot (#40038)
matrix_power (#40231)
digamma(#39240)
masked_select(#39193)
determinant (#40539)
eigh (#40213)
shape (#40248)
reduce_prod (#39844)
histogram(#39496)
meshgrid (#41411)
brelu (#40385)
hard_swish (#40913)
hard_shrink (#40565)
selu (#39819)
expand_v2 (#39471)
top_k_v2(#40064)
expand_as_v2(#40373)
swish (#40913)
hard_sigmoid (#40626)
exp, det, assign, gaussian_random, matrix_rank, eye, and deformable_conv. (#41755, #41737)
New Dynamic Graph Execution Mechanism¶
To improve the scheduling performance and custom development capability of PaddlePaddle’s dynamic graph execution mechanism, we have reconstructed the underlying execution mechanism of the dynamic graph. With the new execution method, the PHI operator library can be used for efficient runtime execution. For the operators supported by the PHI operator library, switching to the new dynamic graph mode yields a significant improvement in scheduling performance. However, because upgrading the overall framework execution mechanism is a huge workload and this work is closely coupled with the PHI operator library, this execution method is still not used by default in this version. If you want to try it, you can switch to it by setting the environment variable FLAGS_enable_eager_mode=1. The details are as follows:
Implementation of dynamic graph execution infrastructure, core components and mechanism: By staticizing the dynamic graph-related execution code, the original homogeneous operator construction is converted into specific calls to different PHI APIs, which greatly reduces the scheduling overhead. (#36059, #37323, #37556, #37555, #37478, #37458, #37479, #37599, #37659, #37654, #39200, #39309, #39319, #39414, #39504, #39526, #39878, #39963)
New dynamic graph execution mechanism sub-function development and adaptation: support more flexible and complete dynamic graph sub-functions such as hook, pylayer, double_grad, inplace, amp, etc. (#41396, #40400, #40695, #41043, #40915, #41104, #41350, #41209, #40830, #40891, #36814, #37377, #37193, #36965, #37810, #36837, #38488, #39282, #39449, #39531, #39638, #39674, #39893, #40170, #40693, #40937, #41016, #41051, #41121, #41198, #41287, #41380, #41306, #41387, #40623, #40945, #39282, #39449, #38488)
Automatic code generation mechanism for new dynamic graph execution: When we are trying to split the computation and scheduling logic of a large number of homogeneous operators into different specific scheduling logics, we find that it is a huge workload. So we introduce a new automatic code generation logic to generate code and thus simplify the runtime logic of dynamic graphs. Meanwhile, in order to adapt to the various types of runtime logic in the previous framework, we also use some complicated compilation techniques to obtain information at runtime to generate more accurate scheduling code. (#37574, #37575, #37639, #37723, #37753, #37812, #37837, #37910, #37943, #37992, #37959, #38017, #37969, #38160, #38085, #38562, #38573, #39192, #39215, #39355, #39358, #39328, #39233, #39628, #39767, #39743, #39897, #39797, #39997, #40058, #40080, #40107, #39962, #40132, #40276, #40266, #40480, #40482, #40368, #40650, #40815, #40907, #40935, #41089)
Integration of the new dynamic graph execution mechanism into the main framework and integration tests: we currently use some environment variables to distinguish between static graph mode and dynamic graph mode (including new dynamic graph and old dynamic graph mode). We have adapted most of the dynamic graph logic in these modes. However, a number of problems are still being fixed. (#37638, #37643, #37653, #38314, #38337, #38338, #39164, #39326, #40391, #40201, #40854, #40887)
Update some judgment logics under dynamic graphs, to support fast execution paths for dynamic graphs in compatible forms: (#40786)
Non-static graph mode (current transition scheme): _non_static_mode().
Determined as new dynamic graph in dynamic graph mode (recommended judgment logic): _in_dygraph_mode().
Determined as old dynamic graph in dynamic graph mode (not recommended; it will be deprecated in future versions): _in_legacy_dygraph().
Enable old dynamic graph and disable new dynamic graph in dynamic graph mode: _enable_legacy_dygraph() or exit _test_eager_guard().
Enable new dynamic graph and disable old dynamic graph in dynamic graph mode: _disable_legacy_dygraph() or with _test_eager_guard().
Determined as new dynamic graph in static or dynamic graph mode: _in_eager_without_dygraph_check().
Support inplace after dynamic graph reconstruction: input and output are the same Tensor.
Adapt the inplace strategy for dynamic graph reconstruction intermediate states. (#40400)
Adapt the inplace strategy to the final state of the dynamic graph reconstruction. (#40695)
Add inplace strategy to PyLayer function after dynamical graph reconstruction. (#41043)
Add inplace strategy for Tensor’s setitem function after dynamical graph reconstruction. (#40915)
Add _reset_grad_inplace_version interface after dynamic graph reconstruction, to set the inplace version of the Tensor’s gradient to 0. (#41101)
If the value of the forward Tensor is not needed during the backward computation (no need buffer property), the inplace version detection operation is not needed for that Tensor. For Tensor with no_need_buffer, skip the inplace version check. (#41350)
Unify error messages for inplace version checks after and before reconstruction of dynamic graphs. (#41209)
Support view strategy after dynamical graph reconstruction: input and output Tensor share underlying data.
Add support for weakref on the python side of the new dynamic graph eager Tensor. (#41797)
Enhance the new dynamic graph DoubleGrad function to support the basic DoubleGrad feature. (#41893, #41894, #41895)
Add core.eager.StringTensor interface, to support the construction of StringTensor on python side and the use of the StringTensor related APIs. (#41039)
Add _grad_name and _grad_value to core.eager.Tensor, to return the name and value of a gradient. (#41990)
Add the processing of the no_need_buffer attribute for dynamic graph intermediate state. The Tensor with the no_need_buffer attribute is skipped in the inplace backward check operation. (#41720)
New Static Graph Executor¶
The original static graph executor of PaddlePaddle does not schedule well enough in some scenarios and makes it hard to use multiple streams. To solve this, we have implemented a new static graph executor with superior performance, which makes it easy to take advantage of the asynchronous scheduling capabilities of multi-streams and multi-threads. The new executor is a compatible upgrade of the original executor. At present, it is used by default in single-card scenarios, so users do not need to make any changes to their training code. We also provide an interface to switch back to the original executor, by setting the environment variable FLAGS_USE_STANDALONE_EXECUTOR=false. (#41179) The main contents are as follows.
Basic components: a high-performance thread pool for multi-threaded scheduling in the executor (#35470, #35930, #36030, #36480, #36688, #36740, #38335, #40770), a thread co-op component (#38779, #40876, #40912), timely memory recovery after operator execution (#37642, #39617, #40859), and a new dependency analysis algorithm for the parallel executor (#37231), etc.
Scheduling logic: Optimize the scheduling method of operators in the executor. Support the multi-stream, multi-threaded asynchronous scheduling mechanism. Turn transforms such as data type, device, and layout into operators for scheduling to improve performance. Support caching the selection of operator Kernel. Support the selection of new PHI operators. (#35024, #34922, #35711, #35928, #39458, #36899)
Interface compatibility: Compatible with the user interface and functionality of the original executor, such as alignment with python interface Executor.run(), support for managing Tensor in Scope, etc. This ensures that users can switch to the new executor without perception. (#37278, #37379, #37445, #37510, #40955, #41778, #41058, #38584, #37957, #37672, #37474, #37085, #37061, #36945)
Enhance debugging and error reporting in multi-threaded scenarios by capturing error reports from sub-threads and throwing them uniformly in the main thread. This can improve user experience. (#36692,#36802)
Fix the bug that the new executor's communication stream resets stream cache information in the allocator, to reduce RecordStream overhead in cross-stream scenarios. After this optimization, DeepFM model performance improves by about 8%. (#42046)
Optimize the dependency analysis method between new executor operators to improve runtime performance. Establish correct dependencies for send/recv communication operators to support pipeline parallel. (#42009)
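The switch back to the original executor mentioned above can be sketched in Python as follows. This is a minimal illustration that assumes the flag is read from the environment when Paddle initializes, which is why it is set before the import; the commented-out import stands in for an otherwise unchanged training script.

```python
import os

# Ask Paddle to use the original static graph executor instead of the new one.
# Assumption: flags are read from the environment at initialization, so the
# variable has to be set before paddle is imported.
os.environ["FLAGS_USE_STANDALONE_EXECUTOR"] = "false"

# import paddle  # import after setting the flag; the training code is unchanged

executor_flag = os.environ["FLAGS_USE_STANDALONE_EXECUTOR"]
print(f"FLAGS_USE_STANDALONE_EXECUTOR={executor_flag}")
```

Setting the same variable in the shell before launching the script has the same effect and avoids touching the code.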
Distributed Training¶
Basic functions of multi-machine multi-card parallel training based on collective communication
Add support for elastic training, which enables scaling the number of workers up and down and resuming training after a node failure, to improve the fault tolerance of distributed training. (#36684, #37177, #37781)
Refactor the launch startup module, adding master collaboration and the node number nnodes definition, to improve the ease of use of distributed startup. (#40086, #40568, #40782, #40844, #40936, #41190, #41314)
Add support for GPU/NPU/XPU multi-hardware heterogeneous training. (#37613, #37998)
Add fleet_executor asynchronous pipeline executor. (#36966, #37049, #37087, #37126, #37150, #37203, #37167, #37282, #37319, #37462, #37507, #37533, #37576, #37605, #37691, #37742, #37783, #37809, #37862, #37882, #37934, #38024, #38083, #38164, #38261, #38290, #40607, #37093, #37106, #37143, #37338, #37376, #37485, #37531, #37623, #37693, #37755, #37807, #37889, #38420, #38539, #36892, #37084, #37158, #37361, #37509, #37603, #37703, #37824, #38114, #38322, #38535, #38650, #38709, #38799, #38839, #38904)
Add distributed inference function for large-scale model. (#38795, #39012, #39032, #39076, #39194, #39207, #39241, #39603, #39758, #39992).
Dynamic graph hybrid parallelism
Reconstruct paddle.distributed.fleet.utils.recompute, to support new dynamic computational graph. (#41396)
Add pure FP16 training to support data parallelism. (#36420)
Add MoE (Mixture of Experts) parallel strategy, to support large-scale MoE model training. (#41092, #40895, #40850, #39224)
Add GroupSharded parallel strategy. It supports stage1, stage2, and stage3 with both synchronous and asynchronous communication, and can be combined with basic functions such as Recompute, AMP O1/O2, Offload, GroupShardedClipGrad, and GroupShardedScaler. (#37489, #37568, #37707, #37836, #37947, #38151, #38407, #38052, #39112, #38989, #39171, #39285, #39334, #39397, #39581, #39668, #40129, #40396, #40488, #40601, #37725, #37904, #38064)
Static graph hybrid parallelism
Add scale_gradient flag bit to gradient_scale_configs to control the position where the gradient aggregation operation averages the gradients under pipeline parallelism. (#36384)
Under tensor parallelism, the dropout op supports the settings of deterministic random seed generators, to ensure random consistency for non-distributed variables and randomness of distributed variables. (#36228)
NPU hybrid parallelism supports Offload, saving 40% of NPU memory. (#37224)
Add force_cpu optional parameter to the seed op, to allow dropout to read seed values directly from CPU. (#35820)
Improve the Automatic Sparsity (ASP) sharding strategy and support the selection of sharding strategy according to the program. (#40028)
Automatic parallel
Add the process restart (relaunch) after automatic mapping between logical processes and physical devices. (#37523, #37326)
Improve the underlying mechanism and interface for automatic parallel to facilitate the unification of modules and add the optimized pass. (#36617, #38132)
Add unified resource representation, to support automatic mapping between logical processes and physical devices. (#37091, #37482, #37094)
Improve the distributed attribute completion for the backward and update parts of the computation graph. (#36744)
Add data slicing function. (#36055)
Add tensor resharding function to reshard the tensor according to the distributed properties of the tensor and operator. (#40865, #41106)
Add the automatic conversion pass of distributed parameters when the number of resources or parallel policy changes. (#40434)
Add GradientMerge pass to reduce the number of communications and improve training efficiency. (#38259, #40737)
Add Recompute pass to reduce the activation memory storage. (#38920)
Add Sharding optimization pass, to support p-g-os 3 stage optimization. (#38502)
Add fused QKV parallelization for Transformer class model. (#39080)
Improve the sharding propagation for while op to ensure convergence of the fix-point algorithm. (#39939, #39086, #39014)
Support training and inference for sub-block and while op control flow. (#39612, #39895, #40077)
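The idea behind the GradientMerge pass listed above can be sketched in plain Python. This is not Paddle's pass, only an illustration of the technique: gradients from several micro-batches are accumulated locally and applied in a single update, which is where one communication round would replace several. The function name, the toy linear model, and the choice to average the merged gradient are assumptions made for the example.

```python
def train_with_gradient_merge(weight, micro_batches, k_steps, lr):
    """Update `weight` once every `k_steps` micro-batches.

    Toy model: prediction = weight * x, loss = 0.5 * (prediction - y) ** 2,
    so d(loss)/d(weight) = (weight * x - y) * x.
    """
    merged_grad = 0.0
    updates = 0
    for step, (x, y) in enumerate(micro_batches, start=1):
        grad = (weight * x - y) * x
        merged_grad += grad  # accumulate locally, no communication yet
        if step % k_steps == 0:
            # In data-parallel training, a single all-reduce would happen
            # here instead of one per micro-batch.
            weight -= lr * (merged_grad / k_steps)  # averaging is an assumption
            merged_grad = 0.0
            updates += 1
    return weight, updates


data = [(1.0, 2.0)] * 8
final_weight, num_updates = train_with_gradient_merge(0.0, data, k_steps=4, lr=0.1)
print(final_weight, num_updates)
```

With eight micro-batches and k_steps=4, the parameter is updated twice rather than eight times.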
Parameter Server
Add NaN/Inf value checking tool under GPUPS. (#38131)
Under GPUPS, add set_date interface to adapt incremental training. (#36194)
Under GPUPS, add asynchronous release dataset function. (#37790)
Under GPUPS, support dumping parameters and intermediate layers. (#36157)
Under GPUPS, support the optimizer parameter configuration. (#39783, #39849)
Under the Unified Parameter Server, refactor the base classes of each module such as communication and storage, to improve the ease of secondary development of each module. (#41207, #41022, #40702, #39341, #39377, #39191, #39064)
Add evaluation metrics module under the Unified Parameter Server, to support AUC/WuAUC/MaskAUC and other evaluation metrics calculation and customizable extensions. (#38789)
Supports XPU parameter server training on KUNLUNXIN 2. (#41917, #42266, #41916)
Profiler¶
Add the performance analysis module paddle.profiler in the Python layer: Provide the ability to collect, export, and summarize performance data during training. (#40065, #40357, #40888)
paddle.profiler.Profiler: performance analyzer, the interface for user interaction. (#41029, #41524, #41157, #40249, #40111, #39964, #40133)
paddle.profiler.RecordEvent: provide custom markers to record time. (#39693, #39694, #39695, #39675, #41445, #41132)
paddle.profiler.ProfilerTarget: specify the target device for performance analysis.
paddle.profiler.ProfilerState: indicate the state of the performance analyzer.
paddle.profiler.SortedKeys: specify the sorting method of the data within the statistics form.
paddle.profiler.make_scheduler: generate the scheduler that produces the performance analyzer state, implementing periodic control of the collection scope.
paddle.profiler.export_chrome_tracing: save performance data to a Google Chrome tracing file viewable by the chrome://tracing plugin. (#39316, #39984, #41029)
paddle.profiler.export_protobuf: save performance data to a protobuf file represented by internal structure. (#39519, #39109, #39474)
paddle.profiler.load_profiler_result: load the performance data saved to a protobuf file.
paddle.profiler.Profiler generates statistics for data reading, step overhead, and throughput for model training when the timer_only parameter is specified. (#40386)
Refactor Profiler underlying infrastructure in C++ layer
Modify the name and type of logging for op under new dynamic graph. (#41771)
Add Kernel running statistics into profilers’ summarization and optimize the summarization. (#41989)
Remove the side effect on forward computing performance when Profiler is off. (#42142)
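To make the "custom time marker" and "Chrome tracing file" ideas above concrete, here is a small standard-library sketch. It does not use paddle.profiler; the helper names record_event and export_trace are hypothetical, and the only external fact relied on is the Chrome Trace Event Format, in which a complete event has "ph": "X" with a start time "ts" and a duration "dur" in microseconds.

```python
import json
import os
import tempfile
import time
from contextlib import contextmanager

trace_events = []  # events collected in Chrome Trace Event Format


@contextmanager
def record_event(name):
    """Hypothetical stand-in for a custom time marker around a code region."""
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        trace_events.append({
            "name": name,
            "ph": "X",                  # "complete" event: start plus duration
            "ts": start * 1e6,          # microseconds
            "dur": (end - start) * 1e6,
            "pid": os.getpid(),
            "tid": 0,
        })


def export_trace(path):
    """Write the collected events to a JSON file in Chrome tracing layout."""
    with open(path, "w") as f:
        json.dump({"traceEvents": trace_events}, f)


for step in range(3):
    with record_event(f"train_step_{step}"):
        sum(i * i for i in range(10_000))  # placeholder workload

trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
export_trace(trace_path)
print(trace_path)
```

A file written this way is in the layout that chrome://tracing reads, which is the same viewer the export described above targets.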
CINN compiler adoption¶
With the recent development of PaddlePaddle’s compiler, CINN (PaddlePaddle/CINN: Compiler Infrastructure for Neural Networks), the Paddle framework has been changed to adapt to CINN’s features. These changes include the subgraph management functions for the Paddle-CINN runtime, memory and speed optimizations, and bug fixes made during development.
Functions developed:
Subgraph op related functions:
Add the function to find and generate CINN subgraphs from computational graphs. (#36345)
Add cinn_launch op as a runtime entry point to CINN. It is responsible for scheduling CINN to compile the subgraph, to initialize the data, and to execute the generated kernels. (#36600)
Add a helper class CinnLaunchContext to the kernel implementation of cinn_launch op to manage the intermediate data for compiling and running subgraphs, to improve scalability and code readability. (#37938)
Add additional fetch nodes to CINN subgraphs, thus ensuring that CINN external nodes can fetch the values of variables. (#37172, #37190)
Add the function to symbolize a CINN subgraph, which is used to topologically sort the subgraphs and return the CINN execution sequence. (#36417)
Add CinnCompiler class for invoking subgraphs in the CINN compiled graph that can be replaced by using CINN operators. (#36562, #36975)
Add the interface to CINN symbolization class to get the names of subgraph fetched variables to prevent fetched variables from being eliminated in compilation optimizations. (#37218)
Checking, debugging, and API changes related:
Synchronize the update of NetBuilder API name changes in CINN. (#40392)
Add necessary log information to Paddle-CINN for better debugging. (#36867)
Add the bidirectional conversion function between Paddle desc and CINN desc. (#36100)
The operator implemented in CINN may not use some input variables compared to Paddle. Therefore, remove the check that the input variables must be used in the cinn_launch op. (#37119)
Add cinn_instruction_run op for invoking CINN to execute a single generated instruction, facilitating the construction of scheduling run subgraphs on the Paddle side. (#39435, #39576)
Add control macros to Paddle for CUDA/CUBLAS/MKL/CINN pass application required to compile CINN. (#37066, #36660)
Add two control flags FLAGS_allow_cinn_ops and FLAGS_deny_cinn_ops to control the categories of CINN operators used to replace native operators during Paddle training. (#36842)
Performance optimization:
Speed optimization
Optimize the computational time consumed by CinnCacheKey. (#37786, #37317)
Cache variable scope for CINN compiled subgraphs to reduce runtime parameter construction overhead. (#37983)
Utilize CINN’s auto-tuning during subgraph compilation (it can be enabled by a flag) for further tuning of training performance. (#41795)
Refactor the correctness check of compilation results in case of subgraph compilation to avoid repeated checks at runtime and reduce the scheduling overhead. (#41777)
Enable TransposeFolding and GemmRewriter optimization passes by default in Paddle-CINN training. (#41084)
Pass the CUDA stream created in Paddle into CINN so that Paddle and CINN use the same CUDA stream in CUDA computing. (#37337)
Move CINN optimization pass application logic from Paddle to CINN. (#42047, #42070)
Device memory optimization
Add NoNeedBufferVars to cinn_launch op to declare a list of input variables that do not require a buffer, so that the memory can be freed in advance. (#38367)
Pass in reference count information for external variables to the subgraph, so that subgraphs within cinn_launch can reuse memory optimization passes and reduce the memory overhead in using CINN. (#39209, #39622)
Add the function to convert a collection of executable instructions generated by CINN compilation to a Paddle Graph, supporting reuse of the Paddle scheduler and memory optimization pass, further reducing the memory overhead in using CINN. (#39724, #39911)
Add Kernel of cinn_instruction_run op, to support dynamic device memory requests based on data types inferred from compilation results. (#40920)
Bug fixing:
Fix and optimize the generation logic of CINN subgraphs. (#36503)
Fix the bug that Paddle-CINN does not support no-input subgraphs. (#40814)
Fix an error reported due to CINN not being able to handle useless outputs in operators such as batch_norm. (#36996)
Fix several bugs in CINN subgraph partitioning and symbolization, and solve problems with Paddle training accessing the CINN. (#36739, #36698)
CINN does not yet support control flow. Add logic to skip control flow when encountered. (#40812)
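The symbolization step described earlier in this section topologically sorts a subgraph to obtain an execution sequence. The sketch below shows that general technique with Python's standard graphlib module. It is not CINN code, and the operator names and dependency table are made up for the illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical subgraph: each op maps to the set of ops whose outputs it consumes.
subgraph_deps = {
    "matmul": set(),
    "bias_add": {"matmul"},
    "relu": {"bias_add"},
    "scale": {"matmul"},
    "add": {"relu", "scale"},
}

# static_order() yields every op only after all of its dependencies.
execution_order = list(TopologicalSorter(subgraph_deps).static_order())
print(execution_order)
```

Any order produced this way is valid to execute; ops without a dependency between them (here relu and scale) may appear in either order.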
Other¶
Model quantization
Upgrade quantization storage format to unify quantization formats for dynamic and static graphs. (#41041)
Add new post-training quantization (PTQ) methods: EMD and Adaround. (#40421, #38460)
Support quantizing more operations in PTQ and QAT, such as crop, split, ab, unsqueeze etc. (#40083)
Support to quantize operators in control flow. (#37498)
Support quantization of matmul_v2 operator. (#36469)
Add support for quantized matmul_v2 inference on TensorRT. (#36594)
CUDA memory optimization
Implement multi-stream safe Allocator to support safe and efficient use of CUDA memory in asynchronous computing scenarios. (#37290)
Add new APIs (paddle.device.cuda.max_memory_allocated, paddle.device.cuda.max_memory_reserved, paddle.device.cuda.memory_allocated and paddle.device.cuda.memory_reserved) for GPU memory monitoring in runtime. (#38657)
Support allocating CUDA Managed Memory to train super large models in memory-constrained scenarios. (#39075)
Add GetBasePtr interface in C++ to get device address created with cudaMalloc. (#37978)
Reduce the number of free blocks in AutoGrowth Allocator to improve memory allocation performance. (#35732)
Remove redundant Float32 temporary tensor and cast operation for tensor with data type FP16 in initializer.Normal and initializer.Constant to save 2x memory. (#38818)
High-order derivative testing for models in dynamic graphs.
Custom op: Support custom ops on the ROCm (HIP) platform. (#36771)
Cost Model: Add basic Cost Model based on profiling information. (#35774)
Add a function that lets users register their own layer and its corresponding pruning method with ASP. (#40253)
Add string tensor data structure, allowing the framework to have the ability to represent and process string. (#39830, #40992)
Add or upgrade oneDNN FP32/int8/bfloat16 Kernel, including:
ELU (#37149)
exp (#38624)
stack (#37002)
softplus (#36382)
round (#39653)
shape (#36033)
flatten and flatten2 (#35892)
slice (#37630)
elementwise_mul (#40546)
elementwise_add (#38176)
elementwise_div (#36158)
elementwise_sub (#35662)
roi_align (#37848)
assembly optimized Adam (#39158)
logsoftmax (#39793)
activation (#40721)
mul (#38552)
mean (#37104)
relu (#36265)
pool2d (#37081)
concat (#35889)
LayerNorm (#40418)
Add the 3-stage storage graph retrieval engine based on SSD - host memory - GPU device memory, to support large-scale graph neural network training. (#42472, #42321, #42027)
Add heterogeneous multi-cloud training communication module switch, implement the Send/Recv interface function, and support multiple heterogeneous cloud communication. (#40965, #40911)
(2) Function optimization¶
API¶
Add backward implementation of paddle.linalg.det. (#36013)
Add support for mixed precision training O2 mode for paddle.Model, i.e., support for Pure FP16 training mode of the original dynamic/static graphs. (#36441)
Support for self chain calls for paddle.nn.Layer. (#36609)
Add settings of is_distributed property for the to method of paddle.nn.Layer to ensure that the distributed properties remain consistent before and after network parameter transform. (#36221)
Improve the parameter conversion logic of the to method of paddle.nn.Layer, to reduce the peak memory consumption of the conversion process and improve the conversion success rate. (#36862)
Support settings of the shape of the output Tensor for paddle.incubate.graph_send_recv to reduce the memory usage during the actual computation. (#40509)
Add the support of int32 and int64 data types for paddle.incubate.segment_sum, segment_mean, segment_max, and segment_min. (#40577)
Add the support of the bool type for transpose op. (#35886)
Switch the paddle.mm underlying operator from matmul to matmul_v2. (#35770)
Support static graph mode and support the unknown shape for paddle.einsum. (#40360)
Support data parallelism for paddle.nn.functional.margin_cross_entropy and paddle.nn.functional.class_center_sample. (#39852)
Support input of shape [1] for paddle.nn.functional.grid_sample. (#36183)
Support NHWC data format for paddle.nn.PRelu. (#37019)
Support the fixed random state using paddle.seed for paddle.nn.functional.class_center_sample. (#38248)
Add ROCM backend support for all APIs under paddle.fft, and optimize CUFFT backend error messages. (#36415, #36114)
Support slicing when the slicing dimension is 0, that is, allow slicing index results to be empty. (#37313)
Support int and bool type Tensor with bool index for Tensor.setitem. (#37761)
Support nearest mode for paddle.nn.functional.interpolate when the input shape is 5D. (#38868)
Add the support of int16 for paddle.nn.Embedding and paddle.gather. (#40964, #40052)
Support data parallelism on a single machine on the CPU platform in paddle.distributed.spawn. (#35745, #36758, #36637)
Add depthwise_conv2d MKLDNN operator. (#38484)
Add complex types check in the static graph model for API paddle.abs, paddle.transpose, paddle.squeeze, paddle.unsqueeze, paddle.matmul, and paddle.full. (#40113)
Support tuple and list type arguments for paddle.autograd.PyLayer. (#38146)
Add a check of whether a tensor is inplace and leaf when calculating gradients. (#37931)
Support HIP library for paddle.autograd.PyLayer. (#38184)
Support more size inputs for paddle.take_along_axis and paddle.put_along_axis, and allow index matrix shape size to be larger than array matrix shape size. (#39072)
Optimize the error report message of API paddle.nn.Pad2D when replicate is 0. (#36510)
Support pad input in tuple format for API paddle.nn.Pad2D. (#35985)
Add tdm_sample API in paddle.distributed.InMemoryDataset to support sampling operations in TDM algorithms. (#37044)
Add Pre-saving Hooks mechanism for paddle.jit.save. (#38186)
Add new higher-order differentiation-related APIs.
elementwise_add: add third-order Kernel, to support computation of third-order differentiation. (#36508, #36618)
matmul_v2: add third-order Kernel, to support computation of third-order differentiation. (#36459)
elementwise_mul: add third-order Kernel, to support computation of third-order differentiation. (#37152)
Improve the logic of the paddle.amp.GradScaler to call check_finite_and_unscale op, to eliminate the cudaMemcpy introduced by the creation of the bool variable. (#37770)
Add check for unstack and unique op in case of input Tensor with 0 elements. (#36021)
Add new multi-layer, bi-directional LSTM function that supports KUNLUNXIN 2, to improve RNN forward/backward ops, and support the use of temporal model training. (#42076)
Add bce_loss forward/backward ops for KUNLUNXIN 2. (#41610)
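For readers unfamiliar with the segment operations that gained int32 and int64 support in the list above, the following plain-Python reference shows what a segment sum computes on integer data. It is a sketch of the semantics only, not Paddle's kernel, and it assumes the segment ids are non-negative and sorted in non-decreasing order, one id per row.

```python
def segment_sum(data, segment_ids):
    """Sum the rows of `data` that share the same id in `segment_ids`.

    Assumes `segment_ids` is sorted in non-decreasing order, non-negative,
    and has one entry per row of `data`.
    """
    if not data:
        return []
    num_segments = segment_ids[-1] + 1
    width = len(data[0])
    out = [[0] * width for _ in range(num_segments)]
    for row, seg in zip(data, segment_ids):
        for col, value in enumerate(row):
            out[seg][col] += value
    return out


data = [[1, 2, 3], [3, 2, 1], [4, 5, 6]]
segment_ids = [0, 0, 1]
result = segment_sum(data, segment_ids)
print(result)
```

Rows 0 and 1 belong to segment 0 and are added together, while row 2 forms segment 1 on its own; segment_mean, segment_max, and segment_min differ only in the reduction applied per segment.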
IR(Intermediate Representation)¶
Dynamic Graphs to Static Graphs
Optimize the behavior of the ProgramCache.last interface for dynamic graph to static graph so that it returns the most recently used Program instead of the final generated Program. (#39541)
Optimize the error report message for the paddle.reshape API for dynamic graph to static graph, and add a new recommended usage hint. (#40599)
Optimize the type of exception catch in the is_api_in_module function when transcribing dynamic code to static code. (#40243)
Optimize the hint of error message for dynamic graph to static graph, hide warning information by default. (#39730)
Add the support of type hint syntax for dynamic graph to static graph to improve the accuracy of variable type analysis. (#39572)
Optimize the paddle.cond function to accept equal values for basic types such as bool and int. (#37888)
Optimize the decorate function @to_static to allow the switch of the train/eval mode. (#37383)
Optimize the stack of error report for dynamic graph to static graph, to highlight user-related codes and reduce the framework redundant error stack. (#36741)
Remove no_value placeholder from the return value of paddle.cond. (#36513, #36826)
Adapt the run_program op to the new dynamic graph mode. (#40198, #40355)
Add check for zip syntax. (#37846)
Fix the dynamic graph to static graph failure due to the error of dimension and type judgment in the paddle.signal.frame, paddle.signal.stft and paddle.signal.istft. (#40113)
Add registration of complex type Kernel for mean, pad3d ops. (#40113)
Mixed Precision Training¶
Distributed Training¶
Basic functions of the distributed training
Optimize Fleet API and DistributedStrategy configuration to use dynamic graph parallel function conveniently. (#40408)
Optimize Dynamic Graph mixed parallel HybridParallelClipGrad strategy, support 4D hybrid parallel and Pure FP16 training. (#36237, #36555)
Restructure dynamic graph data parallel strategy, to support new dynamic graph and communication. (#40389, #40593, #40836, #41119, #41413, #39987)
Support distributed tensor model parallel for fused_attention op. (#40101)
Support the distributed tensor model parallel for fused_feedforward op. (#40160)
Graph retrieval engine
Optimize the data format returned by the graph sampling interface of the graph engine, with a 3x improvement of the sampling speed. (#37315)
Reduce the amount of graph engine threads to improve performance. (#37098)
Optimize graph engine data transfer to improve performance. (#37341)
Optimize the merge logic of embedding op to improve performance by exploiting the topological relationship of embedding op in the model. (#35942)
Communication library: restructure the communication library to improve the scalability and development of the communication library, and support heterogeneous communication. (#41398, #39720, #40911, #40579, #40629, #40437, #40430, #40228, #40181, #40100, #40097, #39892, #39384, #39737, #40040)
Support the publication of MoE-related interfaces in paddle.incubate.distributed.models.moe (moe.GShardGate, moe.BaseGate, moe.SwitchGate, moe.MoELayer, and moe.ClipGradForMOEByGlobalNorm). (#42300)
Fix the error report in the use of recomputing in paddle.incubate.distributed.models.moe.MoELayer. (#42128)
Fix the error report in the new dynamic graph pipeline parallel caused by different data types. (#41937, #42053)
Fix the error report in the new dynamic graph tensor model parallel due to different data types. (#41960)
Custom operator¶
Enhance the C++ custom operator mechanism for writing second-order gradient operators, to support adding suffixes to the gradient input variables of second-order gradient operators for use as outputs. (#41781)
Remove the use of the deprecated enumeration type PlaceType from the Tensor API member methods, make it compatible, and add a deprecation warning. (#41882)
Add deprecated warning for a number of deprecated interfaces of the original Tensor API, including the incomplete constructor, reshape, mutable_data, and copy_to methods. (#41882)
Other¶
Error report and debugging optimization
Optimize the error message of the label boundary check for the cross_entropy op. (#40001)
Add profile record for infer_shape and compute methods of op execution of dynamic graphs, show their cost in timeline. (#39023)
Replace pybind::index_error error hint on Windows for unknown exceptions. (#40538)
Add the error message in the out-of-bounds checks for user scatter op. (#37429)
Download tool: To fix slow decompression of directories with multiple files in paddle.utils.download.get_path_from_url, replace the original approach (traversing the directory in a loop and decompressing files one by one) with a single extractall call on the directory, which greatly improves the decompression speed. (#37311)
Speed up the quantization training for fake_quantize_range_abs_max, fake_quantize_abs_max, fake_quantize_dequantize_abs_max, fake_quantize_moving_average_abs_max, etc. (#40491)
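The download tool change described above swaps a per-file extraction loop for one extractall call. The sketch below shows the two call shapes using Python's standard zipfile module; it is not the code of paddle.utils.download, the function names and the archive contents are made up, and the speedup reported in the note applies to Paddle's own code path rather than to this toy example.

```python
import os
import tempfile
import zipfile


def extract_one_by_one(archive_path, dest_dir):
    """Original style: a separate extract() call per archive member."""
    with zipfile.ZipFile(archive_path) as zf:
        for member in zf.namelist():
            zf.extract(member, dest_dir)


def extract_all_at_once(archive_path, dest_dir):
    """Replacement style: a single extractall() call for the whole archive."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)


# Build a small archive with several files under one directory.
work_dir = tempfile.mkdtemp()
archive_path = os.path.join(work_dir, "model.zip")
with zipfile.ZipFile(archive_path, "w") as zf:
    for i in range(5):
        zf.writestr(f"model/part_{i}.txt", f"weights {i}")

dest_dir = os.path.join(work_dir, "out")
extract_all_at_once(archive_path, dest_dir)
extracted = sorted(os.listdir(os.path.join(dest_dir, "model")))
print(extracted)
```

Both functions produce the same files on disk; the difference is only in how many calls are issued against the archive.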
(3) Performance optimization¶
Distributed Training¶
Hybrid parallel optimizer sharding_optimizer supports optimize_cast optimization, which moves the parameter cast from the forward and backward stages to the optimizer stage. This improves performance by 7%. (#35878)
GPUPS optimization: support for gradient fuse allreduce training. This improves training performance by 20%. (#35131)
GPUPS optimization: dump CPU optimization speed improves by 3.21x. (#40068)
CPU parameter server streaming training optimization: support for automatic collection of sparse parameter statistics, incremental saving of sparse parameters, etc. The training performance improves by 20%. (#36465, #36601, #36734, #36909, #36943, #37181, #37194, #37515, #37626, #37995, #38582, #39250, #40762, #41234, #41320, #41400)
Auto-tuning¶
Add hardware-aware automatic performance tuning for the full training process, with performance improvements of about 3% to 50% or more on image classification, segmentation, detection, and image generation tasks compared to the model’s default configuration. The auto-tuning status is set via the paddle.incubate.autotune.set_config API. By default, it is currently disabled. Auto-tuning has three specific levels:
Add the auto-tuning function to paddle.io.DataLoader, to select the best num_workers based on training data and device resources. (#42004)
Add mixed-precision training data layout auto-tuning feature, to select the best data layout based on device type and data type, and automatically convert it at runtime. (#41964)
Add the automatic tuning of the required workspace size threshold for Conv, which is automatically set based on the GPU’s currently available requested device memory resources. Add the automatic selection of Conv cuDNN algorithms based on the generic AlgorithmCache design and Kernel timing component, which supports models with variable-length data. (#41833)
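A configuration for the three tuning levels above might be assembled as in the sketch below. The top-level keys (kernel, layout, dataloader) and the tuning_range field are assumptions based on how this API is commonly documented, so check the set_config documentation of your Paddle version before relying on them. The helper enable_autotune is hypothetical and only calls Paddle if it is installed.

```python
import json
import os
import tempfile

# Assumption: one entry per tuning level described above. The key names and
# the "tuning_range" field should be verified against the set_config docs.
autotune_config = {
    "kernel": {"enable": True, "tuning_range": [1, 5]},
    "layout": {"enable": True},
    "dataloader": {"enable": True},
}


def enable_autotune(config):
    """Apply the config if Paddle is installed; return whether it was applied."""
    try:
        import paddle
    except ImportError:
        return False
    paddle.incubate.autotune.set_config(config)
    return True


# Keeping the config in a JSON file makes it easy to version with the
# training script.
config_path = os.path.join(tempfile.mkdtemp(), "autotune.json")
with open(config_path, "w") as f:
    json.dump(autotune_config, f, indent=2)
print(config_path)
```

Because auto-tuning is disabled by default, nothing changes for existing training scripts unless a configuration like this is applied explicitly.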
Operator Optimization¶
Optimize FasterTokenizer performance, with a 10% performance improvement compared to pre-optimization. (#36701)
Optimize index_select backward computation, with 3.7~25.2x performance improvement over pre-optimization. (#37055)
Optimize the performance of paddle.nn.ClipGradByGlobalNorm. Taking a 10*10 paddle.nn.Linear as an example, performance improves by about 30% over pre-optimization. (#38209)
Optimize the performance of pnorm with very large or very small axis dimensions, with 31-96x improvement in forward speed and 1.1-19x improvement in backward speed. (#37685, #38215, #39011)
Optimize softmax forward and backward performance, with a speedup ratio of about 2x for the axis!=-1 configuration. (#38602, #38609, #32387, #37927)
Optimize log_softmax forward and backward performance, with a speedup ratio of about 6x to 20x for axis!=-1 configurations. (#38992, #40612)
Optimize softmax_with_cross_entropy forward and backward performance, with a speedup ratio of about 1.3x for the hard_label configuration. (#39553, #40424, #40643)
Optimize top_k performance, with a speedup ratio of more than 22x for one-dimension and larger k (k=5000) configuration. (#40941)
Optimize elementwise_mul backward computation, with 1.85~12.16x performance improvement over pre-optimization. (#37728)
Optimize elementwise_min and elementwise_max backward computation, to equalize or improve performance by 1.05x to 18.75x over pre-optimization. (#38236, #37906)
Optimize nearest_interp forward and backward computation, with forward performance improvement by 1.5x to 2.3x over pre-optimization, and backward performance improvement by 60% to 1.8x over pre-optimization. (#38528, #39067)
Optimize bilinear_interp forward and backward computation, with forward performance improvement by 0.4x to 2.3x over pre-optimization, and backward performance improvement by 10%-30% over pre-optimization. (#39243, #39423)
Optimize dropout forward and backward computation, with performance improvement by about 20%. (#39795, #38859, #38279, #40053)
Optimize grid_sampler forward and backward computation, with forward performance improvement by 10% to 30% over pre-optimization, and backward performance improvement by 10% to 60% over pre-optimization. (#39751)
Optimize group_norm forward and backward computation, with the forward performance improvement by 1.04x to 2.35x, and backward performance improvement by 1.12x to 1.18x. (#39944, #40657, #39596)
Optimize conv1d forward and backward computation, with the forward performance improvement by 1.00x to 2.01x, and backward performance improvement by 1.01x to 474.56x. (#38425)
Optimize elementwise_div backward computation, with the backward performance improvement by 1.02x to 29.25x. (#38044)
Optimize gelu forward and backward computation, with the forward performance improvement by 1.13x to 1.43x, and backward performance improvement by 1.10x to 1.55x. (#38188, #38263)
Optimize elementwise_sub backward computation, with the backward performance improvement by 1.04x to 15.64x. (#37754)
Optimize flip's forward performance on one-dimensional data input, with the performance improvement by 100%. (#37825)
Optimize layer_norm forward and backward computation, with the forward performance improvement by 2x to 5x over pre-optimization, and backward performance improvement by 20% to 50% over pre-optimization. (#39167, #39247)
Optimize embedding forward and backward computation, with a maximum improvement of 1.51x in forward performance and 1.03x to 7.79x in backward performance. (#39856, #39886)
Optimize gelu FP16 forward and backward calculations, with forward performance improvement by 9% to 12% over pre-optimization, and backward performance improvement by 2% to 9% over pre-optimization. (#38980)
Remove CPU -> GPU explicit data transfer operation in gather_nd forward and backward operators, and remove the explicit synchronous operation in index_select forward and backward operators. Change GPU -> GPU data transfer in scatter_nd from synchronous operation to asynchronous operation. (#40933)
Optimize Lars optimizer computation, with the training performance improvement of Resnet50 FP16 model by 5.1% over pre-optimization. (#35652, #35476)
Optimize AvgPool2dGrad computation, with the performance improvement by 2.6x over pre-optimization. (#35389)
Optimize Elementwise computation for multivariate output, improving performance by up to 15% over pre-optimization. (#38329, #38410)
Optimize the probs computation of Categorical, simplify the computation logic, and improve the performance by 4x to 5x. (#42178)
Optimize the paddle.sum performance, with performance improvement by about 20%. (#42309)
Remove CudaStreamSync operation from paddle.nn.ClipGradByGlobalNorm to reduce scheduling overhead during execution, with 5% performance improvement on ptb models. (#42170)
Optimize a series of underlying data structures and detailed implementations in the original dynamic graph execution system to improve the scheduling performance of the original dynamic graph. (#42010, #42171, #42224, #42256, #42306, #42329, #42340, #42368, #42425)
Simplify the probs calculation logics of paddle.distribution.Categorical, to improve performance by 4x to 5x. (#42178)
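Several items above concern clipping gradients by their global norm. As background, the computation can be sketched in plain Python: one norm is taken over all gradients together, and every gradient is scaled by the same factor. This is an illustration of the standard formula, not Paddle's implementation, and the function name and list-of-lists gradient layout are assumptions of the example.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by clip_norm / max(global_norm, clip_norm)."""
    # A single reduction over every element of every gradient.
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm


grads = [[3.0, 0.0], [0.0, 4.0]]  # global norm = sqrt(9 + 16) = 5
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
print(norm, clipped)
```

Because the scale depends on a reduction across all parameters, the operation is a natural synchronization point, which is why removing an unnecessary stream sync there can reduce scheduling overhead.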
(4) Bug fixing¶
API¶
Fix the output type error with paddle.sum when the input parameter type and output parameter type do not match and the number of reduce elements on the axis is 1. (#36123)
Fix an AttributeError in paddle.flops when the layer output type is tuple. (#38850)
Fix paddle.diag failing to propagate gradients because there is no backward kernel. (#40447)
Fix an error in paddle.sort when sorting input with NaN values. (#41070)
Fix the error when the input of paddle.full_like contains an INF value. (#40232)
Fix the bug in paddle.strided_slice where the result is not consistent with slice when the data in the input of starts is less than -rank. (#39066)
Fix the bug in the max_pool family of operators where infer_shape is calculated incorrectly when index is returned. This affects the APIs: paddle.nn.functional.max_pool1d/2d/3d, paddle.nn.functional.adaptive_max_pool1d/2d/3d, paddle.nn.MaxPool1D/2D/3D, paddle.nn.AdaptiveMaxPool1D/2D/3D. (#40139)
Fix an issue where the dtype of pooling_mask returned by the max_pool family of operators is incorrect. Now the dtype of pooling_mask is int32. The affected APIs are paddle.nn.functional.max_pool1d/2d/3d, paddle.nn.functional.adaptive_max_pool1d/2d/3d, paddle.nn.MaxPool1D/2D/3D, paddle.nn.AdaptiveMaxPool1D/2D/3D. (#39314)
Fix the bug with paddle.shape where the backward gradient by default causes a computation error. (#37340)
Fix the bug in the to method of paddle.nn.Layer when converting both dtype and place at the same time. (#37007)
Fix the bug that paddle.amp.decorate fails to rewrite the parameters of non-leaf network layers to FP16. (#38402)
Fix the bug that paddle.amp.decorate rewrites the non-input parameter in paddle.nn.BatchNorm1D, paddle.nn.BatchNorm2D, and paddle.nn.BatchNorm3D to FP16. (#38541)
Fix the bug that paddle.amp.decorate rewrites the non-input parameter in paddle.nn.SyncBatchNorm to FP16. (#40943)
Fix redundant warnings in paddle.nn.Layer.to. (#36700)
Fix the bug in paddle.nn.RNN when being used inside control flow. (#41162)
Fix the bug that paddle.to_tensor fails to specify the CUDAPlace of the Tensor. (#39662)
Fix the issue that paddle.nn.Identity is not exposed. (#39615)
Fix the bug where the output values of the fill_ and zero_ inplace APIs are incorrect when the input is on a CUDAPinned Place after dynamic graph reconstruction. (#41229)
After refactoring the dynamic graph, fix the bug of incorrect inplace version value of the output Tensor when calling assign op using the append op. Change it to call assign op using the _C_ops. (#41118)
Remove unreasonable code in the third-order kernel of elementwise_add, and fix an uninitialized issue in the network creation process. (#36618)
Fix the missing attribute bug in conv2d execution of cuDNN Kernel. (#38827)
Fix an issue where multiclass_nms3 output shape is incorrect. (#40059)
Fix an issue with yolo_box outputting incorrect shape. (#40056)
Fix an issue where the higher-order differentiation gradients interface does not take effect as expected when target_grad is specified. (#40940)
Fix an issue that the network parameter type is incorrect when the default_dtype is modified in the op _BatchNormBase base class in the dynamic graph mode. The affected APIs are paddle.nn.BatchNorm1D, paddle.nn.BatchNorm2D, paddle.nn.BatchNorm3D, and paddle.nn.SyncBatchNorm. Specific reason: when get_default_dtype() == 'float16', the default parameter data type is modified by set_default_dtype('float32'). The parameter type in dynamic graph mode is created by default_dtype; therefore, the change of the default parameter type causes the subsequent networking Parameter type error. (#36376)
Fix the bug of the undefined intermediate variable in the backward op of the batchnorm op when the data type is FP32 and the data dimension is dims = 2 and data_layout = NHWC. (#37020)
Fix the bug that the shape of weights is incorrect when using paddle.static.nn.prelu in static graph mode with input format NHWC and mode==channel. (#38310)
Fix the bug of paddle.nn.functional.class_center_sample: CUDA seed setting issue in multi-machine case. (#38815)
Fix the bug of failing to report an error when the input of paddle.nn.functional.one_hot is incorrect. (#41335)
Fix an issue where a callback to reclaim device memory on a DCU device is not triggered in time, resulting in an OOM of the device memory. (#40445)
Fix the bugs of abnormal backward gradients and abnormal inplace logic handling in setitem in some dynamic graph scenarios. (#37023, #38298)
Fix the bug of abnormal indexing when a Tensor array is indexed with Slice in dynamic to static scenarios. (#39251)
Fix the bug of memory or device memory leaks caused by some temporary variables not being correctly destructed when the paddle.Tensor.register_hook interface is used. (#40716)
Fix the bug that Tensor.getitem cannot get the value when the index is a bool Tensor with all False. (#41297)
Fix the bug that Tensor.getitem cannot get the value when the index is a bool scalar Tensor. (#40829)
Fix the bug in paddle.index_select when index is a 0-shape Tensor. (#41383)
Fix the bug when the number of GPU threads requested by paddle.index_select and paddle.index_sample exceeds the limited machine resources. (#41127, #37816, #39736, #41563)
Fix the bug when ReduceConfig, elemwise_grad, gather, gather_nd, and scatter ops request more GPU threads than the limited machine resources. (#40813, #41127)
Fix the bug that the memory access is out of boundary when NX != 1 in ReadData, ReadDataBc, and ReadDataReduce in Kernel Primitive API. (#36373)
Fix the bug of the computation result abnormal due to data overflow caused by the IndexRandom data type error. (#39867, #39891)
Fix the bug of the returned computing result error of reduce op when reduce_num = 1. (#38771)
Fix the bug of the memory access out-of-bound of reduce op in the middle dimension of reduce in HIP environments. (#41273)
Fix the bug where the Kernel fails to release properly when the matmul op computes two FP16 one-dimensional vectors.
Fix the bug caused by CUDA integer computation overflow for some operators, including: bernoulli, gaussian_random, gumbel_softmax, multinomial, truncated_gaussian_random, uniform_random_inplace, and uniform_random ops. (#37670)
Fix the bug where paddle.nn.Sequential reports a KeyError when traversing sublayers in a for loop. (#39372)
Fix the bug of the check shape error in paddle.nn.functional.unfold when compiling in static graphs. (#38907, #38819)
Fix the bug of reporting an error if axis is specified when using dropout for static graphs. (#37223)
Migrate the matmul operator in paddle.nn.MultiHeadAttention to the matmul_v2 operator. (#36222)
Fix the bug of throwing an FPE when an empty Tensor is used in paddle.nn.functional.label_smooth. (#35861)
Fix the deformation bug of reshape op when input is an empty Tensor. Support reshaping an empty Tensor to [-1]. (#36087)
Fix the bug where modified values incorrectly override other rows when the input parameter offset of fill_diagonal is non-zero. (#36212)
Modify stop_gradient returned by the range op to be set to True in dynamic graph mode. (#37486)
Fix the bug where Lamb optimizer is updated incorrectly when Beta1Pow and Beta2Pow are on the GPU. (#38518)
Fix the bug where the conv2d operator does not respect FLAGS_cudnn_deterministic. (#37173)
Fix the bug caused by an earlier version of cufft that does not define CUFFT_VERSION. (#37312)
Fix the computing error of paddle.ifftshift and paddle.fftshift. (#36834, #36748)
Fix the axis computation error in the paddle.fft series of APIs. (#36321)
Fix an output data type registration bug of batch_norm_grad op in case of FP16 data type. This bug causes compilation failure in some scenarios and also affects FP16 computational precision. (#42461)
Fix the incorrect Infershape information bug in the paddle.nn.functional.pad API when the padding is Tensor in dynamic to static conversion. (#42414)
Fix an exception in paddle.distribution.StickBreakingTransform when the input dimension exceeds 2. (#41762)
Fix a nan/inf bug calculated with QK^T in fused_attention op. (#42032)
Fix a nan/inf bug calculated in fused_attention op with FusedResidualDropoutBias on V100. (#42398)
Fix a redundant data transform bug introduced by the full_like op during execution. (#41973)
Fix a problem with p_norm op calculating nan on GPU environments. (#41804)
Fix a section error of split op when the sections parameter has a size of 0. (#41755)
Fix the bug of reporting not supporting Place (gpu:0) in multi-card training when broadcast is required in 6 elementwise ops (pow, complex, divide_double, multiply_double, fmax, and fmin). (#42332)
Fix the bug that a deprecated-interface warning is reported on import paddle due to a PIL version update. (#42307)
Fix the bug that paddle.linalg.matrix_rank does not support tol as FP64 Tensor under static graph. (#42085)
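The _BatchNormBase item in the list above describes a default-dtype change that leaks into parameters created later. The following is a minimal sketch of that failure mode and of one way to scope such a change. It uses a toy module-level default rather than Paddle itself: `get_default_dtype` and `set_default_dtype` here are stand-ins that only borrow the names used in the note, and `create_parameter` and `default_dtype_guard` are hypothetical helpers. It is not a description of the fix made in #36376.

```python
from contextlib import contextmanager

# Toy stand-in for a framework-wide default dtype (not Paddle's real state).
_default_dtype = "float32"


def get_default_dtype():
    return _default_dtype


def set_default_dtype(dtype):
    global _default_dtype
    _default_dtype = dtype


def create_parameter(name):
    # Hypothetical helper: a parameter takes the default dtype at creation time.
    return {"name": name, "dtype": get_default_dtype()}


@contextmanager
def default_dtype_guard(dtype):
    """Temporarily switch the default dtype and always restore the old one."""
    previous = get_default_dtype()
    set_default_dtype(dtype)
    try:
        yield
    finally:
        set_default_dtype(previous)


def build_norm_leaky():
    # Leaky pattern: the default is changed and never restored.
    if get_default_dtype() == "float16":
        set_default_dtype("float32")
    return create_parameter("bn.weight")


def build_norm_scoped():
    # Scoped pattern: the change is visible only while the layer is built.
    with default_dtype_guard("float32"):
        return create_parameter("bn.weight")


set_default_dtype("float16")
bn_leaky = build_norm_leaky()
fc_after_leaky = create_parameter("fc.weight")   # picks up the leaked float32

set_default_dtype("float16")
bn_scoped = build_norm_scoped()
fc_after_scoped = create_parameter("fc.weight")  # keeps the user's float16
```

In the leaky variant, every layer built after the normalization layer silently changes type, which matches the "subsequent networking Parameter type error" described in the note.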
IR(Intermediate Representation)¶
Dynamic to static graphs
Fix a type derivation error in reverse gradient accumulation when the tensor_array is used with the control flow. (#39585, #39689)
Fix an issue where the parameter gradient type is not set correctly during dynamic to static AMP training. (#40938)
Fix an issue of reporting an error in the dynamic to static transcription when there are misplaced annotations in the codes. (#39035, #38003)
Fix an issue where Tensor is not properly converted to Variable when calling a non-forward function in dynamic to static codes. (#37296, #38540)
Fix an issue where paddle is incorrectly passed as a variable during dynamic to static transcription. (#37999)
Fix an issue where model parameters are incorrectly counted when calling paddle.flops after model dynamic to static conversion. (#36852)
Fix an issue where GPU memory will keep growing in train mode and no_grad contexts after loading models using the paddle.jit.save/load interface. (#36434)
Add a warning in the convert_call function when converting a generator function. (#35369)
Fix the run_program op dependency analysis bug. (#38470)
Fix the code conversion bug when returning a single value in control flow For. (#40683)
Fix the bug when generating a reverse op when the input to conditional_block op contains LoDTensorArray. (#39585)
Fix the bug that paddle.jit.save loses the forward_pre_hook and forward_post_hook of the top Layer when exporting in dynamic-to-static graph mode. (#42273)
Fix the dynamic to static conversion error reported when the shape parameter in paddle.expand contains a Tensor. (#41973)
Distributed Training¶
Distributed training basic functions
Fix the bug of a port reporting error in the distributed multi-machine training. (#37274)
Fix the brpc compilation dependency bug. (#37064)
Fix an occupied port issue due to tcp self-connections when Fleet starts. (#38174)
Fix the precision degradation bug under data parallel due to inconsistent initialization of FP16 parameters under multiple cards. (#38838, #38563, #38405)
Fix the precision degradation under data parallel due to FP16 gradient synchronization without dividing by the number of cards. (#38378)
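The last item above concerns gradients that were summed across cards but not divided by the number of cards. The sketch below simulates that in plain Python to show why the division matters. It is an illustration only: `all_reduce_sum` and `sync_gradients` are hypothetical helpers that mimic an all-reduce on lists, not Paddle's communication API.

```python
def all_reduce_sum(per_card_grads):
    """Simulate all-reduce(sum): every card receives the element-wise sum."""
    summed = [sum(values) for values in zip(*per_card_grads)]
    return [list(summed) for _ in per_card_grads]


def sync_gradients(per_card_grads, average=True):
    """Synchronize gradients across cards, optionally averaging them."""
    reduced = all_reduce_sum(per_card_grads)
    if not average:
        # Missing division: the gradient magnitude grows with the card count.
        return reduced
    num_cards = len(per_card_grads)
    return [[g / num_cards for g in grads] for grads in reduced]


# Two cards, each holding gradients for the same two parameters.
grads = [[0.5, -1.0], [1.5, 3.0]]
summed_only = sync_gradients(grads, average=False)
averaged = sync_gradients(grads)
```

Without the division, the effective learning rate scales with the number of cards, and in FP16 the larger summed values are also closer to the representable limit.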
Dynamic graph mixing parallel
Fix the bug where parameters are not updated in FP16 mode under mixed parallel by using the new update interface. (#36017)
Static graph mixing parallel
Fix an issue where grad merge is not compatible with ClipGradientByGlobalNorm in distributed dp mode. (#36334)
Fix an issue under hybrid parallelism where the non-distributed parameters of tensor model parallelism are not broadcast during the initialization phase, resulting in inconsistent non-distributed parameters across cards. (#36186)
Fix the issue that sharding’s save_persistables interface does not save FP16 parameters and offload persistent variables when sharding is enabled with offload. (#40477)
Fix the bug where ema parameters are not saved on non-0 cards when sharding is enabled for training. (#39860)
Fix an issue where FC incorrectly calculates gradients according to column cuts. (#38724)
Fix the bug reported when DistributedStrategy is set to without_graph_optimizer when used with rnn. (#36176)
GPUPS Parameter Server Training
Fix the CPU branch compilation bug triggered by the GPUPS macro definition. (#37248)
Fix an occasional error raised when saving delta and pullsparse concurrency during GPUPS streamline training. (#37233)
Fix a download error issue caused by HDFSClient querying a directory without returning the full path. (#36590)
Fix the bug with pulling old parameters in GPUPS streamline training. (#36512)
Fix a GPUPS multi-stream allocation issue. (#37476)
Fix the bug of the GPUPS pybind out of core. (#37287)
Other¶
Fix the clip_extra issue when saving models for dynamic graph quantization training. (#38323)
Fix an issue with abs_max scale initialization for dynamic graph quantization training. (#39307)
Fix an issue of exceptions in saving model in dynamic graph quantization training. (#38102, #38012)
Fix the offline quantization flatten op output error. (#37722)
Fix the non-matching dimension bug in case of inverse quantization matmul op. (#36982)
Fix the bug of adding quantization op when quantizing matmul_v2 without weights. (#36593)
Fix the error of saving the quant_axis attribute in the conv op channel-wise quantization when saving the models. (#39054)
Fix the slow training of channel-wise quantization. (#40772)
Fix the bug of quantization training when dividing by tensor(initialized as 0) leads to nan. (#36762)
Fix incorrect settings of amp_level for mixed precision in multi-threaded scenarios. (#39198)
Fix an issue where mixed precision is not set correctly for PyLayer and Recompute when mixed precision training is used with them. (#39950, #40042)
Fix an issue where D_GLIBCXX_USE_CXX11_ABI does not take effect when compiling custom operators under Mac. (#37878)
Fix the bug of inconsistent dynamic and static behaviors in the initializer-related APIs when block=None. (#37827)
Fix the bug in python 3.6 where there is no fluid module. (#35862)
Fix the bug where the optimizer paddle.optimizer.AdamW incorrectly calls adam op. (#36028)
Fix a logic error when the regularizer property of a paddle.optimizer.Momentum optimizer parameter is None under the multi tensor policy. (#38344)
Fix the bug that the paddle.optimizer.Momentum and paddle.optimizer.Adam optimizers modify the multi_precision property under the multi tensor policy. (#38991)
Fix the code compilation error when using final-state API amp in combination with optional Tensor. (#40980)
Fix the bug where paddle+lite+xpu prediction library would report an error when calling lite CPU prediction, and fix the bug where paddle+lite(without NNAdapter) would report an error when compiling. (#37449)
Fix the bug in Debug compile mode where LoDTensorArray crashes due to inconsistent Pybind11 bindings. (#37954)
Fix the bug that prevents correct construction of Tensor in the extreme case where the shape parameter is a list of Tensor mix with int. (#38284)
Fix a compatibility issue with the paddle.optimizer.AdamW API. (#37905)
Fix the bug in _InstanceNormBase where the return value of extra_repr is incorrect. (#38537)
Fix the bug that Paddle Inference lacks the symbol paddle::distributed::TensorTable when -DWITH_DISTRIBUTED is used. (#41128)
Fix the bug that matmul_v2 op reports an error when there is a 0 value in the shape. (#35791)
Fix the repeated printing of the no-gradient-input hint message by recompute in dynamic graphs. The message is now printed only once, as a warning. (#38293)
Fix the low accuracy bug on the validation set in later epoch training in visual models in the gelu op. (#38450)
Fix adamw op error in numerical computation. (#37746)
Add the parameters in the sparse_momentum _C_ops interface. (#39969)
Fix the bug where there is no distributed module in python 3.6. (#35848)
Fix the eigh unit test data initialization problem. (#39568)
Fix the eigvalsh unit test data initialization problem. (#39841)
Fix the bug of not working properly due to excessive register usage on V100 by segment op. (#38113)
Fix the bug with conv-related op sparsification incorrectly set dimension. (#36054)
Provide aliases for the static graph-related Automatic SParsity training functions under paddle.static.sparsity. (#36525)
Fix the bug where divide op’s integer division is still an integer. (#40890)
Fix the crash bug of paddle.multiplex when input Tensor value is 0. (#34972)
Fix a speed anomaly when the reduction parameter is set in paddle.nn.functional.kl_div. (#37283)
Fix the data source unsorted bug in loading the Cifar dataset. (#37272)
Fix the conversion of loss from uint16 to float in the ProgressBar class. (#39231)
Fix the ShareBufferWith shared data type problem. (#37464, #37247)
Fix the performance issue when paddle.io.DataLoader uses IterableDataset and num_workers>0. (#40541)
Fix the bug where paddle.vision.ops.yolo_loss returns incomplete values in dynamic graph. (#40185)
Remove the restriction that the input parameter dataset of paddle.io.BatchSampler needs to be the paddle.io.Dataset type, to expand the support for user-defined datasets. (#40184)
Fix the bug of paddle.summary reporting that op_flops does not exist. (#36489)
Fix the formula error of lars_momentum op when lars_weight_decay=0. (#40892)
Fix the bug that optimizer-offload cannot save persistable variables. (#36433)
Fix an issue where optimizer-offload does not support adamw op type. (#36432)
Fix an issue where enable_program_desc_tracing_data in Tracer is not safe in multi-threaded scenarios. (#39776)
Fix an issue where the model file size is not initialized when the model is read. (#40518)
Fix the logic bug of the Expand op. When the dimension of the input Tensor X is smaller than the shape to be expanded, it may result in the incorrect Out.Shape. (#38677)
Fix the dynamic to static transcription error when the Expand_As op takes only y.shape without Y variable entered. (#38677)
Fix the logic error when Expand_As op computes the output shape. (#38677)
Fix the bug that the variables of the core.VarDesc.VarType.STRINGS type report error when getting the lod_level property and setting its lod_level to None. (#39077)
Fix an issue where the framework function Pylayer does not support different dtypes. (#37974)
Fix the bug of division by zero of the learning rate decay API paddle.optimizer.lr.PolynomialDecay. (#38782)
Fix the issue where some logs remained after calling the DisableGlogInfo() interface. (#36356)
Fix an error in backward of multi-layer RNN (when dropout is set to 0) in the training of SimpleRNN, GRU and LSTM API CPU. (#37080)
Add cache for fft on the backend of cufft and hipfft. (#36646)
Enable the shifts parameter of paddle.roll to accept a Tensor. (#36727)
Add onemkl to fft as an optional computation backend. (#36414)
Fix the precision bug in the bfloat16 type in the matmul_v2 and elementwise_div ops. (#42479)
Fix a possible error in the next step caused by LoDTensorArray clearing only the internal Tensor and not clearing the Array during device memory recycling. (#42398)
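Several items above concern how the Expand op derives its output shape when the input has fewer dimensions than the target shape. The sketch below shows one common way such shape inference works, under stated assumptions: input dimensions are aligned to the right of the target shape, missing leading dimensions are treated as 1, and a target entry of -1 keeps the input dimension. `expand_out_shape` is a hypothetical helper written for illustration, not Paddle's InferShape code.

```python
def expand_out_shape(x_shape, target_shape):
    """Infer the output shape of a broadcast-style expand (illustrative rules)."""
    pad = len(target_shape) - len(x_shape)
    if pad < 0:
        raise ValueError("target rank must be >= input rank")

    # Align the input to the right by treating missing leading dims as 1.
    padded = [1] * pad + list(x_shape)

    out = []
    for i, (x_dim, t_dim) in enumerate(zip(padded, target_shape)):
        if t_dim == -1:
            # -1 means "keep the input dimension"; it needs a real input dim.
            if i < pad:
                raise ValueError("-1 is not allowed for a newly added dimension")
            out.append(x_dim)
        elif x_dim == 1 or x_dim == t_dim:
            out.append(t_dim)
        else:
            raise ValueError(f"cannot expand dimension {x_dim} to {t_dim}")
    return out


shape_a = expand_out_shape([3], [2, 3])      # lower-rank input
shape_b = expand_out_shape([1, 3], [4, -1])  # -1 keeps the input dimension
```

Skipping the right-alignment step is the kind of mistake that yields a wrong output shape only when the input rank is smaller than the target rank, which is the case the note calls out.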
4. Deployment Direction (Paddle Inference)¶
(1) New features¶
New APIs¶
Add the Java API so that Java developers can implement high performance inference on the server and in the cloud through a simple and flexible interface. (#37162)
Add GetTrtCompileVersion and GetTrtRuntimeVersion interfaces for getting TensorRT version information. (#36429)
Add the ShareExternalData interface to avoid memory copy of input data during inference. (#39809)
New functions¶
Add ONNX Runtime backend support. Currently it supports only CPU in the integrated version. (#39988, #40561)
Add support for Ascend 310 inference based on the Paddle Lite subgraph approach. (#35226)
Add the native GPU FP16 inference. (#40531)
For the switch_ir_debug interface, add the dump model function. (#36581)
Add the configuration interface for TensorRT config: void UpdateConfigInterleaved(paddle_infer::Config* c, bool with_interleaved) for special data layout in int8 quantization inference. (#38884)
Add TensorRT inspector output information to the log. It is valid only for TensorRT 8.2 or later. (#38362, #38200)
Add the support of the TensorRT ASP sparse inference. (#36413)
(2) Underlying optimization¶
CPU performance optimization¶
Optimize the caching mechanism of MKLDNN. (#38336, #36980, #36695)
Add matmul_scale_fuse pass. (#37962)
Add MKLDNN reshape_transpose_matmul_v2_mkldnn_fuse_pass. (#37847, #40948)
Add MKLDNN conv_hard_sigmoid_mkldnn_fuse_pass. (#36869)
Add MKLDNN matmul_v2_transpose_reshape_fuse_pass. (#36481)
Add MKLDNN softplus_activation_mkldnn_fuse_pass. (#36657)
Add MKLDNN elt_act_mkldnn_fuse_pass. (#36541)
Add MKLDNN mish operator and conv_mish_mkldnn_fuse_pass. (#38623)
GPU performance optimization¶
Change the inference default video memory allocation policy from naive_best_fit to auto_growth, to solve the problem of some models filling up the GPU video memory. (#41491)
Support gelu and FC+gelu ops using TensorRT inference. (#38399)
Support deformable_conv inference using TensorRT under static shape. (#36612, #36850, #37345)
Support nearest_interp_v2 op using TensorRT inference. (#34126)
Add yolo_box TensorRT plugin to support input parameters iou_aware and iou_aware_factor so that the IoU computed by inference is used as a factor for confidence. (#34128)
Support elementwise_sub and elementwise_div using TensorRT inference. (#40806, #41253)
Support multiclass_nms3 using TensorRT inference. (#41181, #41344)
Support flatten_contiguous_range op using TensorRT inference. (#38922)
Support the pool2d attribute padding using TensorRT inference when dimension is 4, and global_pooling and ceil_mode are True. (#39545)
Support batch_norm and elementwise_add using TensorRT inference when dimension is 5. (#36446)
Add support for the int32 and float types of reduce using TensorRT inference. Add reduce_mean GPU operator int32 and int64 registration. (#39088)
Modify MatmulV2ToMul pass. Modify the qualifier (no support for broadcast) and op_teller mapping condition. (#36652)
Add the support for TensorRT plugin interface AddPluginV2IOExt. (#36493)
Add the aligned attribute in roi_align op and support for TensorRT inference. (#38905)
Add the support for TensorRT inference with concat attribute axis = -1. (#39096)
Add TensorRT plugin: preln_emb_eltwise_layernorm, preln_skip_la, and rnorm ops, for ERNIE-like model performance optimization. (#39570)
Add TensorRT fuse pass: preln_embedding_eltwise_layernorm_fuse_pass, preln_skip_layernorm_fuse_pass, for ERNIE-like model performance optimization. (#39508)
Split matmul fusion-related passes based on different backends (GPU, CPU, TensorRT), to support transpose function for FC weights. (#39369)
Add the support to TensorRT by roll, strided_slice, and slice op in case of dynamic shapes. (#41913, #41573, #41467)
Add div op support for TensorRT. (#41243)
Quantization support
For the PostTrainingQuantization API, add the support for paddle.io.DataLoader object or Python Generator input. (#38686)
ERNIE full quantization model inference supports interleaved data layout. (#39424)
Support for PaddleSlim new quantile model format inference. (#41049)
Add matmul int8 quantization inference op converter and plugin. (#37285)
Add pass to determine if all ops in the model can support int8 quantization. (#36042)
Support quantization inference for the FC part of the multihead attention of the non-variable-length branch. (#39660)
(3) Bug fixing¶
Framework and API fixing¶
Fix the bug of model clipping when saving static graphs. (#37579)
For the C API, add wrapper PD_Cstr for strings, and provide construction and destruction methods so that users do not need to use the C runtime library to destruct strings directly. (#38667)
Fix the logic bug with memory reuse at prediction time. (#37324)
Fix memory reuse error reporting in multi-threading. (#37894)
Allow passing empty strings for inference when no weight file is available. (#38579)
Fix an issue of clone not being supported when TensorRT dynamic shape is enabled. (#38520)
Fix multi-threaded clone error after TensorRT dynamic shape is enabled. (#40067)
For the lite xpu interface, fix an issue where the xpu card cannot be selected. (#36610)
Add a file existence check to the interface that automatically generates TensorRT dynamic shape parameters. (#36628)
Fix the bug that the MKLDNN does not support conv3d. (#42055)
Backend Capability Fixing¶
Fix cuDNN default algorithm selection configuration for prediction, with using non-deterministic policies. (#41491)
Fix the bug with deformable_conv op in TensorRT plugin resource recovery handling error. (#38374)
Fix a serialization error in the TensorRT plugin for deformable_conv op. (#38057)
Adapt the new refactor engine and serialization API of TensorRT 8.0. (#36769)
Fix the bug that the Flatten2MatmulFusePass, Squeeze2MatmulFusePass, and Reshape2MatmulFusePass do not take effect. (#37644)
Fix the bug with TensorRT input data reporting errors. (#37427)
Add error message when input dimension is wrong. (#38962)
Fix the bug with EmbEltwiseLayernorm output type error. (#40015)
Remove conv_affine_channel_fuse_pass and the corresponding unit test. (#39817)
Fix an issue where the adaptive_pool2d pass incorrectly replaces the pool attribute. (#39600)
Fix the bug that shuffle_channel_detect_pass incorrectly generates shuffle_channel op. (#39242)
Fix transpose parameter error. (#39006)
Fix the crash bug when nearest_interp_v2 input scale dimension is less than 1. (#38725)
Fix the bug that the prelu does not support one-dimensional input in dynamic shape. (#39389)
Fix the bug in the kernel function of slice’s special_slice_plugin. (#39875)
Temporarily disable int8 branch under skip_layernorm variable length to prevent accuracy degradation. (#39991)
Fix some bugs regarding support for preln_ernie models. (#39733)
Fix the bug that slice may exceed threads limit in ERNIE. Fix the bug that the special_slice is incorrectly triggered. (#39096)
Fix the bug that the elementwise does not support broadcast when the dimension is the same. (#37908)
Fix the problem that, when align_corners is True in the nearest_interp op, the TensorRT layer results differ from the native op because the underlying implementations are different. (#37525)
Fix qkv_plugin: Kernel function computation error. (#37096)
Fix the bug with inference pass for dynamic quantization. (#35879)
Reuse directly when Tensor requests less memory than the allocated size. (#37880)
Fix the hang bug when ERNIE fixed-length model is enabled with TensorRT. (#37839)
Fix the crash bug when TensorRT int8 lacks dynamic range information. (#36900)
Fix the bug with slice deserialization code. (#36588)
Fix yolo box calculation formula error. (#36240)
Fix the crash bug when the earlier version model uses a later version of roi_align. (#38788) External Developers
Fix the bug of a large performance difference of softmax between python and C++. (#37130)
Fix matmul inference failure on static shape 2-dimensional input and dynamic shape 3-dimensional input. (#36849)
Fix reshape_transpose_matmul_mkldnn_fuse_pass mishandling of shapes. (#36731)
Fix an issue where TensorRT gets 4 dimensions when the input is 2 dimensions. (#36614)
Fix the error reported when the scale attribute of the interpolate_v2 MKLDNN operator is null. (#36623)
Fix poor performance of the recurrent operator in multi-threaded scenarios. (#36052)
Remove restrictions of relu, sigmoid, tanh, relu6, batch_norm, clip, concat, gelu, hard_sigmoid, prelu, softmax, split, and swish on TensorRT 2-dimensional inputs. (#37097)
Fix reshape op to use TensorRT inference. (#41090)
Fix matmul related pass, which is compatible with matmul_v2. (#36424)
Support VALID and SAME attributes in the padding method of the conv2d operator when TensorRT is enabled. (#38999)
Fix MKLDNN multi-input operator quantization problem. (#39593, #39346, #40717)
Fix scale error of conv+activation in MKLDNN quantization scenarios. (#38331)
Fix the bug in MKLDNN quantization without parameters where the quantization of subsequent operators is handled differently. (#39342)
Fix a data type related issue in MKLDNN cpu_bfloat16_placement_pass. (#38702)
Fix a split operator execution issue in MKLDNN bfloat16 inference. (#39548)
Fix the bug with MKLDNN matmul_v2 operator not supporting 6 dimensions. (#36342, #38665)
Fix MKLDNN DeviceContext error in MKLDNN matmul_v2_transpose_reshape. (#38554)
Fix incorrectly calculated results for segmentation models in MKLDNN inference scenarios. (#37310)
Fix MKLDNN bfloat16 placement operator list and add the missing operator. (#36291)
Fix the format bug of MKLDNN operators, including: FC, conv_transpose, 6-dimensional Tensor error reporting, and wrong output format of conv to NHWC input. (#38890, #37344, #37175, #38553, #40049, #39097)
Fix MKLDNN multi-threaded inference scenario error due to cache mechanism. (#36290, #35884)
Fix MKLDNN quantization model accuracy anomaly caused by matmul and FC. (#38023, #37618)
Fix the abnormal quantization model accuracy issue in MKLDNN quantization conversion scripts caused by missing passes. (#37619, #40542,#38912)
Fix the crash bug in MKLDNN enabling volume op due to data type mismatch. (#38133)
Fix an issue where some MKLDNN ops need to change back to the original layout after modifying the layout. (#39422)
Fix the bug of Python API error report due to conflict with Ascend software stack, because the GIL lock is not released in the Ascend 910 inference scenario. (#38605)
5. Environment Adaptation¶
Compile and Install¶
From version 2.3.0, PaddlePaddle has adjusted and upgraded the types of GPU architectures supported by the framework. (For more information, please refer to: GPU architectures supported by PaddlePaddle)
Notes:
PIP source installation means downloading the installation package and dependency libraries from the official PIP website using pip install paddlepaddle or pip install paddlepaddle-gpu. Compared with the BOS source, it supports fewer architecture types, has a lighter installation package, and provides only one CUDA version of the installation package.
Prior to version 2.3, the PIP source installer (CUDA10.2) supports the following GPU architectures: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, and 7.5.
From version 2.3 onward, the PIP source installer (CUDA11.0) supports the following GPU architectures: 6.0, 6.1, 7.0, 7.5, and 8.0.
The BOS source is a way to download the installation package and dependency libraries from the official website of PaddlePaddle. The download source is in China and is much faster. Compared with the PIP source, it supports more GPU architectures and provides multiple CUDA versions of installation packages.
Prior to version 2.3, the GPU architectures supported by the BOS source installer on the PaddlePaddle website:
CUDA10: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5;
CUDA11: 5.2, 6.0, 6.1, 7.0, 7.5, 8.0.
From version 2.3 onward, the GPU architectures supported by the BOS source installer on the PaddlePaddle website:
CUDA10: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5;
CUDA11: 3.5, 5.0, 6.0, 6.1, 7.0, 7.5, 8.0.
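The architecture lists in the notes above can be turned into a small lookup for checking whether a given GPU is covered by an installer. This is a sketch only: the table copies the version 2.3 figures exactly as stated in these notes, and the `SUPPORTED_ARCHS` name, the key layout, and the `is_supported` helper are assumptions made for illustration, not part of PaddlePaddle.

```python
# GPU architectures (compute capabilities) per installer for version 2.3,
# copied from the notes above. Keys are (download source, CUDA version).
SUPPORTED_ARCHS = {
    ("pip", "CUDA11.0"): ["6.0", "6.1", "7.0", "7.5", "8.0"],
    ("bos", "CUDA10"): ["3.5", "5.0", "5.2", "6.0", "6.1", "7.0", "7.5"],
    ("bos", "CUDA11"): ["3.5", "5.0", "6.0", "6.1", "7.0", "7.5", "8.0"],
}


def is_supported(source, cuda, arch):
    """Return True if the installer lists the given compute capability."""
    return arch in SUPPORTED_ARCHS.get((source.lower(), cuda), [])


def installers_for(arch):
    """List every (source, CUDA) installer that covers the given architecture."""
    return sorted(key for key, archs in SUPPORTED_ARCHS.items() if arch in archs)


ampere_on_pip = is_supported("pip", "CUDA11.0", "8.0")
maxwell_on_pip = is_supported("pip", "CUDA11.0", "5.2")
kepler_options = installers_for("3.5")
```

Architectures are kept as strings so that values such as "6.0" and "6.1" compare exactly, without floating-point surprises.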
Support Python 3.10. Fix compilation bugs caused by some PythonC API changes on Windows. (#41180)
The Windows platform supports the compilation through Visual Studio 2019. (#38719)
Eliminate various warnings when compiling on the Windows platform. (#38034, #37890, #37442, #37439, #36857)
Fix jetson compilation issues introduced by the underlying data structure upgrade. (#39669, #39441)
New Hardware Backend Extension¶
Custom device support: provide a plug-in way to extend PaddlePaddle hardware backend. With this function, developers do not need to modify PaddlePaddle codes for specific hardware, but simply implement the standard interface and compile it into a dynamic link library that can be called by PaddlePaddle as a plug-in. This reduces the development effort of adding a new hardware backend to PaddlePaddle. Currently it supports custom Runtime and custom Kernel.
Support Huawei NPU chip (Ascend910) training/inference. Support ResNet50, YoloV3, BERT, Transformer and many other models. Support static + dynamic graph and auto-mixed precision training. Support single card, and distributed training across multiple cards, multiple machines.
Support Graphcore IPU chip (including IPU Mk2 GC200 and Bow IPU) training/inference. Support ResNet50, BERT and other models. Support static graph training. Support single card, and distributed training across multiple cards, multiple machines.
Support Cambricon MLU chip (MLU370x4) training/inference. Support models such as ResNet50. Support static graph + dynamic graph training. Support auto-mixed precision training. Support single card, and distributed training across multiple cards, multiple machines.
Support KUNLUNXIN 2 chips (KUNLUNXIN AI acceleration cards R200, R300) training/inference. Support ResNet50, YoloV3, OCR-DB, SSD, MobileNetV3, UNet, BERT, Transformer, GPT-2, Wide&Deep, and DeepFM. Support static graph + dynamic graph training. Support auto-mixed precision training. Support single card, and distributed training across multiple cards, multiple machines.
Thanks to our Contributors¶
This release contains contributions from the project core team as well as:
Adam Osewski, Allen Guo, arlesniak, chenenquan, chenyanlann, fengkuangxiaxia, fuqianya, fwenguang, guguguzi, helen88, houj04, Jacek Czaja, jakpiase, jianghaicheng, joanna.wozna.intel, joeqiao12, Leo Chen, Leo Guo, Li-fAngyU, lidanqing, Liyulingyue, Matsumoto GAO, maxhuiy, Ming-Xu Huang, Nyakku Shigure, piotrekobi, piotrekobiIntel, QingshuChen, qipengh, Skr Bang, Sylwester Fraczek, Sławomir Siwek, taixiurong, tanzhipeng, Tomasz Socha, TTerror, Webbley, yaozhixin, ykkk2333, yujun, Zhangjingyu06, zhangxiaoci, zhangyikun02, zhangyk0314, zlsh80826, zn, Zuza.