We’ve just changed the way our custom crawler detects duplicate and near-duplicate web pages to better serve you. Our previous code produced good results, but it could fail on large crawls (those larger than about 85,000 pages) and sometimes took excessively long (on the order of weeks) to finish.
Now that the change is live, you’ll see some great improvements and a few changes:
Results will come in faster (up to an hour faster on small crawls and days faster on larger crawls)
More accurate duplicate removal, resulting in fewer duplicates in your crawl results
This post gives a high-level look at the motivations behind our decision to change how our custom crawl detects duplicate and near-duplicate web pages. Enjoy!
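The post doesn’t spell out the technique we use, but for readers curious about how near-duplicate detection can work in general, here is a minimal illustrative sketch of one common approach: breaking each page into word shingles and comparing MinHash signatures, whose agreement rate approximates the Jaccard similarity of the pages. All function names and parameters here are hypothetical, not our production code.

```python
import hashlib
import re

def shingles(text, k=3):
    """Split text into overlapping k-word shingles (sets of phrases)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over all items; similar sets yield similar signatures."""
    return [
        min(int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items)
        for seed in range(num_hashes)
    ]

def estimated_similarity(sig_a, sig_b):
    """The fraction of matching signature positions approximates the
    Jaccard similarity of the underlying shingle sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Two near-duplicate pages and one unrelated page (toy data).
page_a = "the quick brown fox jumps over the lazy dog near the river bank"
page_b = "the quick brown fox jumps over the lazy dog near the river bed"
page_c = "completely unrelated content about web crawler scheduling policies"

sig_a = minhash_signature(shingles(page_a))
sig_b = minhash_signature(shingles(page_b))
sig_c = minhash_signature(shingles(page_c))

print(estimated_similarity(sig_a, sig_b))  # high: near duplicates
print(estimated_similarity(sig_a, sig_c))  # low: unrelated pages
```

Comparing compact fixed-size signatures instead of full page texts is what lets approaches like this scale to large crawls, since pairwise comparisons become cheap and signatures can be bucketed for further speedups.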