Visualizing Duplicate Web Pages | SEOmoz

We’ve just changed the way we detect duplicate or near-duplicate web pages in our custom crawler to better serve you. Our previous code produced good results, but it could fall apart on large crawls (ones larger than about 85,000 pages) and could take an excessively long time (sometimes on the order of weeks) to finish.


Now that the change is live, you’ll see some great improvements and a few changes:


Results will come in faster (up to an hour faster on small crawls and literally days faster on larger crawls)

More accurate duplicate removal, resulting in fewer duplicates in your crawl results

This post provides a high-level look at the motivations behind our decision to change the way our custom crawl detects duplicate and near-duplicate web pages. Enjoy!

via Visualizing Duplicate Web Pages | SEOmoz.


About Bob Warfield

