Abstract: When raw data is processed in a laboratory system, real-world application scenarios include special cases of high sampling rate and high data skewness. A two-order partitioning algorithm cannot handle these cases effectively when balancing the load on the Reducer nodes in a homogeneous environment. Therefore, MapReduce parallel processing is introduced to improve the utilization of sampled data in the laboratory system, and the ICSC (Improved Cluster Split Combination) partition scheduling algorithm is adopted to address data skewness and high sampling rates. Experiments show that the MapReduce load balancing algorithm based on two-tier partitioning effectively reduces the idle time of Mapper and Reducer nodes. As data skewness increases, the execution time of the algorithm remains essentially unchanged; that is, data skewness has little impact on execution time. In addition, as the data sampling rate increases, the ICSC partition scheduling algorithm maintains the lowest time cost among the compared models. The MapReduce load balancing algorithm based on two-tier partitioning thus weakens the dependency between Reducer nodes and improves the execution efficiency and fault tolerance of MapReduce tasks, effectively achieving load balancing of data processing in the laboratory system under the MapReduce framework.