Abstract: Test and control systems that remain powered on for long periods require high-reliability design. The current dual-redundancy scheme does not take the system's software and hardware status into account, nor the possibility that the redundancy itself fails during long-term operation. This dissertation proposes a system fault diagnosis and fault tolerance method. Sources of system faults, including anomalies in task status, CPU temperature, CPU utilization, disk space, I/O operation status, and other indicators, are comprehensively studied and analyzed. Key techniques such as real-time task monitoring, the least squares method, and a hash algorithm are used to implement fault diagnosis and fault-tolerant processing. Verification on an actual model project shows that the method meets the system's high-reliability requirements.