Tuesday, May 24, 2016

PHP Script To Parse Out Common Phrases In A Document

Today I will walk through a function written in PHP that lets you parse out the most common phrases in a document.
function getPhraseCount($string, $numWords = 1, $limit = 0) {
    // make case-insensitive
    $string = strtolower($string);
    // get all words. Assume any run of 1 or more letters, numbers or ' is a word
    preg_match_all("~[a-z0-9']+~", $string, $words);
    $words = $words[0];
    // foreach word...
    foreach ($words as $k => $v) {
        // remove single quotes that are by themselves or wrapped around the word
        $words[$k] = trim($v, "'");
    } // end foreach $words
    // remove any empty elements produced from trimming
    // (the 'strlen' callback keeps the word "0", which a bare array_filter would drop)
    $words = array_filter($words, 'strlen');
    // reset array keys
    $words = array_values($words);
    // start with an empty list so a text with too few words returns an empty array
    $phrases = array();
    // foreach word...
    foreach ($words as $k => $word) {
        // if there are enough words, counting the current one, to make a $numWords length phrase...
        if (isset($words[$k + $numWords - 1])) {
            // add the phrase to list of phrases
            $phrases[] = implode(' ', array_slice($words, $k, $numWords));
        } // end if isset
    } // end foreach $words
    // create an array of phrases => count
    $x = array_count_values($phrases);
    // reverse sort it (preserving keys, since the keys are the phrases)
    arsort($x);
    // if limit is specified, return only $limit phrases. otherwise, return all of them
    // (4th argument true keeps numeric words such as "2016" as keys instead of renumbering them)
    return ($limit > 0) ? array_slice($x, 0, $limit, true) : $x;
} // end getPhraseCount

//examples:

getPhraseCount($string); // return full list of single keyword count
getPhraseCount($string,2); // return full list of 2 word phrase count
getPhraseCount($string,2,10); // return top 10 list of 2 word phrase count
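
Since the return value is an ordinary associative array, it can be looped over directly. Here is a minimal sketch of printing such a result. The formatPhraseCounts() helper and the sample numbers are made up for illustration and are not part of the function above; the sample array is simply written by hand in the same phrase => count shape.

```php
<?php
// Hypothetical helper: turns a phrase => count array into printable lines.
function formatPhraseCounts(array $counts) {
    $lines = array();
    foreach ($counts as $phrase => $count) {
        // numeric words such as "2016" are stored as integer keys, so cast to string
        $lines[] = sprintf('%-20s %d', (string) $phrase, $count);
    }
    return $lines;
}

// Hand-written sample in the shape a 2 word, top 3 call would return
$sample = array('the quick' => 4, 'quick brown' => 3, 'brown fox' => 3);
echo implode("\n", formatPhraseCounts($sample)), "\n";
```

Each line shows the phrase padded to 20 characters, followed by its count.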

Description:
Okay, so basically this function takes the string and returns a phrase => count associative array. If you pass it only the string, it counts individual words and returns all of them in descending order of count. The optional 2nd argument lets you specify how many words make up a phrase. So if you pass 2 as the 2nd argument, it goes through the text and, for each word, takes that word plus the word after it and counts how many times that 2 word phrase occurs, returning the list in descending order. If the optional 3rd argument is used, only the top x phrases are returned, so 10 would return the 10 most frequent phrases.
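
To make the "take the word and the word after it" step concrete, here is a small sketch of just the sliding window part, run on an already split list of five words. The slidingPhrases() helper name is an assumption made up for this example; it is not taken from the function above, it only mirrors the idea.

```php
<?php
// Hypothetical helper: lists every run of $n consecutive words in $words.
function slidingPhrases(array $words, $n) {
    $phrases = array();
    // the last window starts at index count($words) - $n
    for ($i = 0; $i + $n <= count($words); $i++) {
        $phrases[] = implode(' ', array_slice($words, $i, $n));
    }
    return $phrases;
}

$words = array('the', 'cat', 'saw', 'the', 'cat');
$phrases = slidingPhrases($words, 2);
// $phrases now holds: "the cat", "cat saw", "saw the", "the cat"

$counts = array_count_values($phrases);
arsort($counts);
print_r($counts);
// "the cat" => 2 comes first; "cat saw" and "saw the" follow with 1 each
```

Five words give four 2 word phrases, because the window can start at every word except the last one.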

Limitations:

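  • hyphenated words are not matched as one word; the hyphen splits them into separate words.
  • case-insensitive: the text is lowercased first, so "The" and "the" are counted as the same word.
  • assumes $string is "human" readable text.  In other words, if you were to pass the file_get_contents() of some webpage to it, you should probably run strip_tags() on it first, as well as do some regex to remove the stuff between script tags, PHP tags, etc...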
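
For the last point, a rough sketch of that cleanup step could look like the following. The htmlToPlainText() name is hypothetical and the regexes are an assumption about what "stuff between script tags" should cover; real world HTML can be messier than this handles.

```php
<?php
// Hypothetical helper: reduces an HTML page to plain text before counting phrases.
function htmlToPlainText($html) {
    // drop script and style elements together with their contents
    $text = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~is', ' ', $html);
    // drop embedded PHP blocks (everything between the PHP open and close tags)
    $text = preg_replace('~<\?.*?\?>~s', ' ', $text);
    // put a space before every remaining tag so neighbouring words do not fuse
    $text = str_replace('<', ' <', $text);
    // remove the remaining tags and decode entities such as &amp;
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
    // collapse runs of whitespace into single spaces
    return trim(preg_replace('~\s+~', ' ', $text));
}

$html = '<p>Hello</p><p>World &amp; all</p><script>var x = 1;</script>';
echo htmlToPlainText($html), "\n"; // prints "Hello World & all"
```

The cleaned string can then be passed to getPhraseCount() like any other text.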