Incremental Web Search: Tracking Changes in the Web
Candidate: Ziyang Wang
Third, we model the changes of web documents using survival analysis. Modeling web changes is useful for web crawler scheduling and web caching. Current approaches model changes to a web page as a Poisson process and use a necessarily incomplete detection history to estimate the true change frequency. However, other features that could be used to predict change frequency have not previously been studied. Our analysis shows that the PageRank value is a good predictor: statistically, the change frequency is proportional to $\exp[0.36\cdot (\ln(PageRank)+C)]$. We further study the problem of combining this predictor and the change history in a unified framework. We present an improved estimator of change frequency, which reduces the error by 27.3% when the change history is short.
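To make the two modeling ingredients above concrete, the following is a minimal Python sketch, not the estimator developed in this work. It assumes that $C$ is a constant, so the stated relation is a power law in PageRank with exponent 0.36 and $C$ cancels when two pages are compared. It also illustrates why a detection history is incomplete under a Poisson model: if a page is polled at a fixed interval, each poll reveals only whether at least one change occurred, so counting detections underestimates the rate. The function names `relative_change_rate` and `poisson_rate_from_polls` and the fixed-interval polling setup are assumptions made for illustration.

```python
import math

# Exponent reported in the abstract for the PageRank predictor.
PAGERANK_EXPONENT = 0.36


def relative_change_rate(pagerank_a, pagerank_b, exponent=PAGERANK_EXPONENT):
    """Ratio of predicted change frequencies of page A to page B.

    Assumes frequency is proportional to exp[exponent * (ln(PageRank) + C)]
    with C constant, so C cancels and the ratio is (PR_a / PR_b) ** exponent.
    """
    if pagerank_a <= 0 or pagerank_b <= 0:
        raise ValueError("PageRank values must be positive")
    return (pagerank_a / pagerank_b) ** exponent


def poisson_rate_from_polls(n_polls, n_detected, interval):
    """Maximum-likelihood Poisson change rate from fixed-interval polling.

    Hypothetical setup: the page is polled n_polls times, `interval` time
    units apart, and n_detected polls found the page changed. A poll detects
    a change with probability 1 - exp(-rate * interval), which gives
    rate = -ln(1 - n_detected / n_polls) / interval.
    """
    if n_polls <= 0 or interval <= 0:
        raise ValueError("n_polls and interval must be positive")
    if not 0 <= n_detected < n_polls:
        # If every poll detects a change, the likelihood has no finite maximum.
        raise ValueError("need 0 <= n_detected < n_polls")
    return -math.log(1.0 - n_detected / n_polls) / interval


if __name__ == "__main__":
    # A page with 10x the PageRank of another is predicted to change
    # about 10 ** 0.36 times as often.
    print(relative_change_rate(10.0, 1.0))

    # 5 detections in 10 daily polls: the naive count gives 0.5 per day,
    # while the estimate that accounts for missed changes is ln(2) per day.
    print(poisson_rate_from_polls(10, 5, 1.0))
```

Under these assumptions, a tenfold increase in PageRank corresponds to roughly a 2.3-fold increase in predicted change frequency. The polling estimate is always at least as large as the naive ratio of detections to elapsed time, because several changes between two polls are observed as one.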