Abstract

To improve the quality of web data mining algorithms, this paper summarizes the advantages and disadvantages of several web data source models: the web log, the application server log, the client-side log, the packet sniffer, and the 5-gram united events model (UEM5). Based on this analysis, a new 4-gram united events model (UEM4) is proposed. Simulation experiments were conducted to compare the performance of UEM4 with that of the web log and UEM5. The experimental results show that the web log has the worst session identification performance. UEM5 has high accuracy and the best online and offline performance, but it requires the application system to support session identification. UEM4 does not require the application system to support session identification, yet it still achieves good session identification accuracy and performance. UEM4 can therefore be used in e-commerce to provide high-quality data sources for web mining algorithms and to improve the quality of intelligent services.