Abstract: To enhance the exploration capability and stability of agents during policy optimization, and to address the problems of local optima and reward-function design in reinforcement learning, a PPO algorithm based on maximum entropy correction and GAIL is proposed. Within the PPO framework, a maximum entropy correction term is introduced that optimizes the entropy of the policy, encouraging the agent to explore among multiple potentially suboptimal policies, assess the environment more comprehensively, and thereby discover better strategies. Meanwhile, to address the degraded training results caused by poorly designed reward functions, the idea of GAIL is incorporated so that expert demonstrations guide the agent's learning process. Experimental results show that the PPO algorithm combining maximum entropy correction with GAIL achieves strong performance on reinforcement learning tasks, improving learning speed and stability while mitigating the performance degradation caused by ill-suited reward designs. The algorithm offers a new approach in reinforcement learning and is relevant to challenging continuous control problems.
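As a point of reference, the following is a minimal sketch, not the authors' implementation, of how a clipped PPO surrogate with an entropy bonus can be combined with a GAIL-style discriminator reward. The coefficient names (clip_eps, ent_coef), the discriminator architecture, and the -log(1 - D) reward form are assumptions, since the abstract does not specify these details.

# Sketch: PPO-clip surrogate with a maximum-entropy term, plus a GAIL-style
# reward derived from a discriminator trained on expert demonstrations.
# All hyperparameter values and network shapes below are illustrative only.
import torch
import torch.nn as nn

def ppo_entropy_loss(new_logp, old_logp, advantages, entropy,
                     clip_eps=0.2, ent_coef=0.01):
    """Clipped PPO surrogate with an entropy bonus (maximum entropy correction)."""
    ratio = torch.exp(new_logp - old_logp)                 # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Subtracting the entropy term rewards more stochastic policies,
    # which encourages broader exploration among candidate policies.
    return policy_loss - ent_coef * entropy.mean()

class Discriminator(nn.Module):
    """Simple MLP distinguishing expert (state, action) pairs from agent pairs."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)                                  # raw logits

def gail_reward(discriminator, states, actions):
    """GAIL-style shaped reward: larger when the discriminator believes the
    transition came from the expert data, used in place of (or alongside)
    the hand-crafted environment reward."""
    with torch.no_grad():
        d = torch.sigmoid(discriminator(torch.cat([states, actions], dim=-1)))
    return -torch.log(1.0 - d + 1e-8)                       # common GAIL reward form

In such a scheme, the discriminator is trained with a binary cross-entropy loss on expert versus agent transitions, the GAIL reward replaces or supplements the environment reward when computing advantages, and the entropy-augmented surrogate above is optimized in the usual PPO minibatch loop.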