I have recently been configuring FeedApi to import a YouTube feed and turned to the scraper to help me pick out the video description. Everything worked great when testing with a regex of /(.+)<\/span>/isU; however, when I refreshed my feed I noticed that my descriptions were getting cut off. I also noticed that the cuts were happening at accented Latin characters, e.g. "I love this café downtown" was just giving me "I love this caf".
After pulling my hair out changing parsers, parser order, etc., I realized that YouTube serves up the video descriptions raw and the scraper was decoding HTML as part of the parsing.
feedapi_scraper.module, line 682:
function _feedapi_scraper_regex_parser($expression, $raw) {
  $matches = array();
  if (preg_match_all($expression, $raw, $matches) && isset($matches[1])) {
    // Quick fix to clean up stripped HTML in feed fields
    $replace = str_replace('amp;', '&', $matches[1][0], $count);
    if ($count > 0) {
      $matches[1][0] = html_entity_decode($replace);
    }
    return $matches[1][0];
  }
  return '';
}
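For reference, here is a small standalone sketch of how the charset argument changes what html_entity_decode returns. It is not taken from the module; the sample string and variable names are made up for illustration, and it assumes the mbstring extension is available. Before PHP 5.4 the default charset for html_entity_decode was ISO-8859-1, so a call without the third argument could hand back bytes that are not valid UTF-8. Whether that is what actually truncates the feed text here is only a guess.

```php
<?php
// Hypothetical sample input: an accented character written as an HTML entity.
$encoded = 'I love this caf&eacute; downtown';

// Decode the same string with two explicit target charsets.
$as_utf8   = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
$as_latin1 = html_entity_decode($encoded, ENT_QUOTES, 'ISO-8859-1');

// UTF-8 output stores the accented e as two bytes (0xC3 0xA9).
var_dump($as_utf8 === "I love this caf\xC3\xA9 downtown");

// ISO-8859-1 output stores it as one byte (0xE9) ...
var_dump($as_latin1 === "I love this caf\xE9 downtown");

// ... and that single byte is not a valid UTF-8 sequence.
var_dump(mb_check_encoding($as_latin1, 'UTF-8'));
```

If the decode step turns out to be involved, passing the charset explicitly, as in the first call, might be a smaller change than re-encoding the whole input.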
I think the problem is that the raw Latin characters get stripped when html_entity_decode is called, so to fix this I simply edited the function as follows:
**Note that I am new to Drupal and I am sure that this IS NOT the proper way to override a module.**
function _feedapi_scraper_regex_parser($expression, $raw) {
  $matches = array();
  // edit
  $raw = utf8_encode($raw);
  // end edit
  if (preg_match_all($expression, $raw, $matches) && isset($matches[1])) {
    // Quick fix to clean up stripped HTML in feed fields
    $replace = str_replace('amp;', '&', $matches[1][0], $count);
    if ($count > 0) {
      $matches[1][0] = html_entity_decode($replace);
    }
    return $matches[1][0];
  }
  return '';
}
This, for the moment, is working nicely. But I would like to know what the proper way to override this function is, and whether anyone knows of a downside to encoding the input as UTF-8 before parsing it with the scraper.
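One downside I can see: utf8_encode always treats its input as ISO-8859-1, so a feed that is already UTF-8 would be encoded twice and "café" would come out as "cafÃ©". The function is also deprecated as of PHP 8.2. A minimal sketch of a more defensive conversion follows. The helper name example_ensure_utf8 and its $assumed_source parameter are hypothetical, not part of feedapi_scraper, and the sketch assumes the mbstring extension is available and that any input that is not valid UTF-8 is ISO-8859-1.

```php
<?php
// Hypothetical helper, not part of feedapi_scraper: convert to UTF-8 only
// when the input is not already valid UTF-8.
function example_ensure_utf8($raw, $assumed_source = 'ISO-8859-1') {
  // Valid UTF-8 passes through untouched, so it is never encoded twice.
  if (mb_check_encoding($raw, 'UTF-8')) {
    return $raw;
  }
  // Assumption: anything that is not valid UTF-8 is in $assumed_source.
  return mb_convert_encoding($raw, 'UTF-8', $assumed_source);
}

$latin1 = "I love this caf\xE9 downtown";      // accented e as one ISO-8859-1 byte
$utf8   = "I love this caf\xC3\xA9 downtown";  // accented e as two UTF-8 bytes

// Both inputs end up as the same UTF-8 string.
var_dump(example_ensure_utf8($latin1) === $utf8);
var_dump(example_ensure_utf8($utf8) === $utf8);

// An unconditional ISO-8859-1 to UTF-8 conversion (what utf8_encode does)
// double-encodes input that was already UTF-8.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
var_dump($double === "I love this caf\xC3\x83\xC2\xA9 downtown");
```

In the edited function, the `$raw = utf8_encode($raw);` line could then call a helper like this instead, so feeds that are already UTF-8 are left alone.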