Today I'll walk you through a PHP function that lets you parse out the most common phrases in a document.
Description:
Okay, so basically this function takes a string and returns a phrase => count associative array. If you only pass it the string, it defaults to counting individual words and returns all of them in descending order of count. The optional 2nd argument lets you specify how many words make up a phrase. So if you pass 2 as the 2nd argument, it goes through the text and, for each word, takes that word plus the word after it and counts how many times that 2 word phrase occurs, again returning the list in descending order. The optional 3rd argument limits the result to the top x phrases, so 10 returns the 10 most frequent phrases.
function getPhraseCount($string, $numWords=1, $limit=0) {
// make case-insensitive
$string = strtolower($string);
// get all words. Assume any run of 1 or more letters, numbers or single quotes is a word
preg_match_all("~[a-z0-9']+~",$string,$words);
$words = $words[0];
// foreach word...
foreach($words as $k => $v) {
// remove single quotes that are by themselves or wrapped around the word
$words[$k] = trim($words[$k],"'");
} // end foreach $words
// remove any empty elements produced from trimming (strlen keeps a literal "0", which a bare array_filter would drop)
$words = array_filter($words,'strlen');
// reset array keys
$words = array_values($words);
// start with an empty list so text with no words doesn't break array_count_values
$phrases = array();
// foreach word...
foreach ($words as $k => $word) {
// if there are enough words from the current word onward to make a $numWords length phrase...
if (isset($words[$k+$numWords-1])) {
// add the phrase to list of phrases
$phrases[] = implode(' ',array_slice($words,$k,$numWords));
} // end if isset
} // end foreach $words
// create an array of phrases => count
$x = array_count_values($phrases);
// reverse sort it (preserving keys, since the keys are the phrases)
arsort($x);
// if limit is specified, return only $limit phrases (keeping keys, so all-digit phrases aren't renumbered). otherwise, return all of them
return ($limit > 0) ? array_slice($x,0,$limit,true) : $x;
} // end getPhraseCount
//examples:
getPhraseCount($string); // return full list of single keyword count
getPhraseCount($string,2); // return full list of 2 word phrase count
getPhraseCount($string,2,10); // return top 10 list of 2 word phrase count
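To make the 2 word case concrete, here is a minimal, self-contained sketch of the same sliding-window idea on a short sentence. The helper name countPhrasesSketch and the sample text are made up for illustration, and it skips the quote handling, so treat it as a demo of the technique rather than a replacement for getPhraseCount.

```php
<?php
// Hypothetical helper for illustration only: a stripped-down version of the
// sliding-window count described above (no quote handling, no limit).
function countPhrasesSketch($text, $numWords = 1) {
    // lowercase, then grab runs of letters and digits as words
    preg_match_all('~[a-z0-9]+~', strtolower($text), $m);
    $words = $m[0];
    $phrases = array();
    // slide a $numWords-wide window over the word list
    for ($i = 0; $i + $numWords <= count($words); $i++) {
        $phrases[] = implode(' ', array_slice($words, $i, $numWords));
    }
    $counts = array_count_values($phrases);
    arsort($counts); // highest count first, keys (the phrases) preserved
    return $counts;
}

// sample text, made up for this example
$counts = countPhrasesSketch('The cat sat. The cat ran. The dog sat.', 2);
print_r(array_slice($counts, 0, 1, true)); // top entry: [the cat] => 2
```

Note that punctuation is simply skipped, so a phrase like "sat the" gets counted even though it spans two sentences. The same is true of the full function.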
Limitations:
- hyphenated words are not matched as one word, so "well-known" gets counted as "well" and "known".
- case-insensitive: everything is lowercased first, so "PHP" and "php" count as the same word.
- assumes $string is "human" readable text. In other words, if you pass it the file_get_contents() of some webpage, you should probably strip_tags() first, and also use some regex to remove the stuff between php or script tags, etc.
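For the last point, the cleanup could look something like the sketch below. The helper name htmlToPlainText is hypothetical, and the regex only handles script and style blocks, so real-world pages may need more than this. It is one possible way to do the cleanup, assuming UTF-8 input.

```php
<?php
// Hypothetical helper (not part of the original function): reduce an HTML
// page to roughly "human" readable text before counting phrases.
function htmlToPlainText($html) {
    // drop <script> and <style> blocks, including everything between the tags
    $html = preg_replace('~<(script|style)\b[^>]*>.*?</\1\s*>~is', ' ', $html);
    // put a space before each remaining tag so neighbouring words don't fuse
    // together once the tags are gone
    $html = str_replace('<', ' <', $html);
    $text = strip_tags($html);
    // turn entities such as &amp; back into characters
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // collapse runs of whitespace into single spaces
    return trim(preg_replace('~\s+~', ' ', $text));
}

// sample markup, made up for this example
$page = '<p>Tom &amp; Jerry</p><script>var the = 1;</script><p>run fast</p>';
echo htmlToPlainText($page); // Tom & Jerry run fast
```

The result can then be passed straight to getPhraseCount() in place of the raw page source.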