Finding Similar Webpages by Text Mining

While creating this website, I thought it would be useful to implement a related-webpages feature: each webpage would display a few other webpages that are as closely related to it as possible. Because this website is small, I wanted to keep things simple, but rather than rely on a keyword search I wanted a more flexible way to score related pages. I decided to implement the feature by defining distances between webpages, which lets me calculate nearest neighbors and, if need be, clusters. The implementation fell into two phases.

Feature Extraction

Extracting features for data mining can be a complex process. This case is relatively simple because the features are just words. I could have mined the complete text of each webpage, but to keep things simple I focused on the page title and the additional tags added to each article. To maintain consistency and remove variation due to plurality, tense, and so on, each word was 'stemmed' using a stemming algorithm, which reduces a word to its root form. Should one mine the full text, a weighting scheme could weigh words by location: the title might weigh more than the article text, and the article text more than 'invisible' text (meta tags, alt descriptions, etc.).
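As a rough illustration of this step, here is a minimal sketch in Python (the site itself uses PHP, so the language, the function names, and the suffix list are all assumptions made for the example). The naive_stem function is a deliberately crude stand-in for a real stemming algorithm such as Porter's, which handles far more cases.

```python
import re

# Hypothetical suffix list; a real stemmer (e.g. Porter) covers many more rules.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_features(title, tags):
    """Lowercase the title and tags of one page, split into words, and stem each."""
    text = " ".join([title, *tags]).lower()
    words = re.findall(r"[a-z0-9]+", text)
    return [naive_stem(w) for w in words]

features = extract_features(
    "Finding Similar Webpages by Text Mining",
    ["clustering", "distances"],
)
```

With this sketch, 'webpages' and 'webpage' both reduce to the same feature, which is the point of stemming. The results are rougher than a proper stemmer would give ('mining' becomes 'min', for instance).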

Identifying Similarities

Rather than calculate everything on the fly, which could become computationally intensive as a website grows, each page is indexed for its features (words) when it is created or edited, and those features are tallied. Here, again to keep things simple, the tally is a plain word count without weights.
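The indexing idea can be sketched as follows. This is only an illustration in Python: the in-memory dictionary stands in for the database table, and the names index and reindex_page are hypothetical.

```python
from collections import Counter

# In-memory stand-in for the database table of per-page word counts.
index = {}

def reindex_page(page_id, features):
    """Replace the stored word counts for one page (call on create or edit)."""
    index[page_id] = Counter(features)

reindex_page("page-1", ["text", "mine", "text"])
reindex_page("page-2", ["cluster", "text"])

# Editing a page simply overwrites its previous tally.
reindex_page("page-2", ["cluster", "cluster", "distance"])
```

Only the page that changed is re-tallied, so the cost of an edit does not depend on how many pages the site has. The distances involving that page still need to be refreshed afterwards.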

To identify similar pages, a distance matrix is then built. The word counts for every page are collected into a matrix in which rows represent pages and columns represent words. The distance between each pair of pages (rows) is then calculated. In this case I used Euclidean distance: the page nearest to the query page is the one with the smallest distance to it. These distances are then stored in a database so that they can be easily queried and sorted.

         Word 1   Word 2   Word 3
Page 1   10       2        0
Page 2   0        2        3
Page 3   7        0        0

Example Word Counts

         Page 1   Page 2
Page 2   10.4
Page 3   3.6      7.9

Example distance matrix
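The example can be checked with a short script. This is a sketch in Python rather than the PHP used on the site, and the names counts, euclidean, and distances are chosen for illustration only. The word counts are those of the example table, read as Page 1 = (10, 2, 0), Page 2 = (0, 2, 3), and Page 3 = (7, 0, 0).

```python
import math
from itertools import combinations

# Word counts from the example table: one row per page, one column per word.
counts = {
    "Page 1": [10, 2, 0],
    "Page 2": [0, 2, 3],
    "Page 3": [7, 0, 0],
}

def euclidean(a, b):
    """Square root of the summed squared differences between two rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Distances are symmetric, so each unordered pair is computed only once.
distances = {
    (p, q): euclidean(counts[p], counts[q])
    for p, q in combinations(counts, 2)
}

for (p, q), d in distances.items():
    print(f"{p} - {q}: {d:.2f}")
```

For these counts the script prints 10.44 for Pages 1 and 2, 3.61 for Pages 1 and 3, and 7.87 for Pages 2 and 3, so Page 3 is the nearest neighbor of Page 1.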

In the end, each page I edit or create is re-indexed on the fly using PHP and SQL. Finding similar pages then takes only a simple database query that selects the other pages and orders them by distance. Although this is relatively fast at the moment, I have added categories which can be used to speed up indexing: distances would then only be calculated between webpages within the same category.
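The lookup might resemble the sketch below. It uses Python's built-in sqlite3 module with an in-memory database in place of the site's PHP and SQL setup. The table name page_distance, its columns, and the function similar_pages are all assumptions, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_distance (page_a TEXT, page_b TEXT, distance REAL)"
)

# Precomputed distances (hypothetical page ids and illustrative values).
rows = [
    ("page-1", "page-2", 10.44),
    ("page-1", "page-3", 3.61),
    ("page-2", "page-3", 7.87),
]
# Each pair is stored in both directions so a lookup filters on one column only.
conn.executemany(
    "INSERT INTO page_distance VALUES (?, ?, ?)",
    rows + [(b, a, d) for a, b, d in rows],
)

def similar_pages(page_id, limit=3):
    """Return (page, distance) rows for the pages closest to page_id, nearest first."""
    cursor = conn.execute(
        "SELECT page_b, distance FROM page_distance "
        "WHERE page_a = ? ORDER BY distance ASC LIMIT ?",
        (page_id, limit),
    )
    return cursor.fetchall()

related = similar_pages("page-1")
```

Storing each pair twice doubles the table size but keeps the query to a single indexed column, which seems a reasonable trade for a small site.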

While a bit simplistic, this implementation serves its purpose for a small website such as this one. If this technique were implemented for a much larger website, more advanced techniques might need to be incorporated, such as nested classifications or parallel computing (Hadoop/MapReduce).




© 2008-2017 Greg Cope