Abstract: Weibo is a rich source of data that is well suited to public opinion analysis. However, the official API provided by Sina limits the rate of data collection, and web crawlers that rely on simulated login are relatively complicated and less efficient. To address these problems, a login-free crawler for Weibo is designed. Experiments show that this crawler collects Weibo data completely, stably, and more quickly. As the demand for data grows, a single crawler can no longer meet requirements. The login-free crawler is therefore combined with the Hadoop distributed computing platform to design a distributed web crawler system based on MapReduce. Using a cluster of multiple computers, the system can capture massive amounts of Weibo data in a short period of time. Experiments show that the crawler system can stably capture nearly 10 million microblog posts per day.
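
As a rough illustration of how a crawl can be phrased as a MapReduce job, the sketch below runs the map, shuffle, and reduce phases locally in plain Python. It is not the system described here: the names `map_crawl`, `reduce_dedup`, `run_job`, and `fake_fetch` are hypothetical, the fetch function is a stub that makes no network requests, and deduplicating posts by ID in the reducer is an assumed design choice.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Post = Dict[str, str]
Fetch = Callable[[str], List[Post]]


def map_crawl(user_id: str, fetch: Fetch) -> Iterator[Tuple[str, Post]]:
    """Map step: crawl one user's posts and emit (post_id, post) pairs."""
    for post in fetch(user_id):
        yield post["id"], post


def reduce_dedup(post_id: str, posts: List[Post]) -> Post:
    """Reduce step: keep a single copy of the posts that share one ID."""
    return posts[0]


def run_job(user_ids: Iterable[str], fetch: Fetch) -> Dict[str, Post]:
    """Local stand-in for a cluster run: group map output by key, then reduce."""
    grouped: Dict[str, List[Post]] = defaultdict(list)
    for user_id in user_ids:
        for key, post in map_crawl(user_id, fetch):
            grouped[key].append(post)
    return {key: reduce_dedup(key, posts) for key, posts in grouped.items()}


def fake_fetch(user_id: str) -> List[Post]:
    """Hypothetical stub for a login-free page request; returns canned posts."""
    own = {"id": f"{user_id}-1", "user": user_id, "text": "hello"}
    shared = {"id": "shared-1", "user": "u0", "text": "seen on every timeline"}
    return [own, shared]


if __name__ == "__main__":
    result = run_job(["u1", "u2"], fake_fetch)
    print(sorted(result))
```

On a real cluster, the list of user IDs would be split across nodes so that each mapper crawls its own share, and the framework would perform the grouping that `run_job` imitates here.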