
unable to scrape image from html

I have followed the very useful guide at http://groups.drupal.org/node/24472#comment-84916 without success: at the test stage, I am unable to extract an image from a feed.

A sample feed item is:

<item>
      <guid>http://www.facebook.com/posted.php?id=742515120&amp;share_id=107130567870#s107130567870</guid>
      <title>Annabel Crabb | Peter Garrett</title>
      <link>http://www.facebook.com/posted.php?id=742515120&amp;share_id=107130567870#s107130567870</link>
      <description>&lt;div class=&quot;ext_media clearfix has_extra has_thumb&quot;&gt;&lt;div class=&quot;title&quot;&gt;&lt;a href=&quot;http://www.smh.com.au/opinion/the-brutal-thriving-industry-that-is-the-modern-garrett-hunt-20090717-do7z.html?page=-1&quot; title=&quot;http://www.smh.com.au/opinion/the-brutal-thriving-industry-that-is-the-modern-garrett-hunt-20090717-do7z.html?page=-1&quot; target=&quot;_blank&quot;&gt;Annabel Crabb | Peter Garrett&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;url&quot;&gt;Source: www.smh.com.au&lt;/div&gt;&lt;div class=&quot;story_posted_item clearfix&quot;&gt;&lt;div class=&quot;extra&quot;&gt;&lt;div class=&quot;share_thumb&quot;&gt;&lt;a href=&quot;http://www.smh.com.au/opinion/the-brutal-thriving-industry-that-is-the-modern-garrett-hunt-20090717-do7z.html?page=-1&quot; target=&quot;_blank&quot;&gt;&lt;img src=&quot;http://images.smh.com.au/2009/07/17/638064/420rocco-420x0.jpg&quot; alt=&quot;&quot; class=&quot;img_loading&quot; onload=&quot;var img = this; onloadRegister(function() { adjustImage(img); });&quot; id=&quot;share_thumb_107130567870&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;story_content_excerpt textual&quot;&gt;&lt;div class=&quot;metadata&quot;&gt;&lt;div class=&quot;summary&quot;&gt;The Sydney Morning Herald - Business News, World News &amp; Breaking News in Australia Skip directly to: Search Box, Section Navigation, Content, Text Version.&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;story_comment&quot;&gt;&lt;div class=&quot;direction_ltr&quot;&gt;&lt;span class=&quot;start_quote&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;story_comment&quot;&gt;Interesting comment on new Uranium mine and Peter Garret, yet it confounds me how people can still refer to nuclear power as a &apos;viable&apos; option — how long can people keep their heads in the sand...&lt;/span&gt;&lt;span class=&quot;end_quote&quot;&gt;    
&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
      <pubDate>Sat, 18 Jul 2009 00:38:48 -0900</pubDate>
      <author>XXX</author>
      <dc:creator>XXX</dc:creator>
    </item>

When I try to scrape the description, the input is shown as:

<div class="ext_media clearfix no_extra"><div class="title"><a href="http://www.smh.com.au/environment/conservation/garrett-concedes-extinction-inevitable-20090817-enoe.html" title="http://www.smh.com.au/environment/conservation/garrett-concedes-extinction-inevitable-20090817-enoe.html" target="_blank">Garrett concedes: extinction inevitable</a></div><div class="url">Source: www.smh.com.au</div><div class="story_posted_item clearfix"><div class="story_content_excerpt textual"><div class="metadata"><div class="summary">THE Environment Minister, Peter Garrett, has warned that money to save endangered wildlife is limited and some species may have to be abandoned when funding decisions are made.</div></div></div></div></div>

The input does not list the image, so I am unsure how to scrape it. Does anyone have a solution?
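
One thing worth checking first: the sample item above has the classes `has_extra has_thumb`, while the input shown at the test stage has `no_extra` and a different headline. So the test stage appears to be showing a different feed item, one that carries no thumbnail at all. The sketch below is not the scraper from the linked guide. It is a standalone Python illustration of pulling the first `<img src>` out of a description and returning nothing when the item has no image. The names `ImgSrcCollector`, `first_image_url`, `with_thumb` and `no_thumb` are made up for this example, the two strings are abbreviated from the descriptions above, and the description is assumed to be already unescaped (as it is in the test-stage input).

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are also routed here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def first_image_url(description_html):
    """Return the first image URL in an unescaped description, or None if there is none."""
    parser = ImgSrcCollector()
    parser.feed(description_html)
    parser.close()
    return parser.sources[0] if parser.sources else None


# Abbreviated from the sample feed item (classes: has_extra has_thumb).
with_thumb = (
    '<div class="ext_media clearfix has_extra has_thumb">'
    '<div class="share_thumb">'
    '<a href="http://www.smh.com.au/opinion/the-brutal-thriving-industry-that-is-the-modern-garrett-hunt-20090717-do7z.html?page=-1" target="_blank">'
    '<img src="http://images.smh.com.au/2009/07/17/638064/420rocco-420x0.jpg" alt="" class="img_loading" id="share_thumb_107130567870" />'
    '</a></div></div>'
)

# Abbreviated from the input shown at the test stage (class: no_extra).
no_thumb = (
    '<div class="ext_media clearfix no_extra">'
    '<div class="title">Garrett concedes: extinction inevitable</div>'
    '<div class="url">Source: www.smh.com.au</div></div>'
)

if __name__ == "__main__":
    print(first_image_url(with_thumb))
    print(first_image_url(no_thumb))
```

If that reading is right, any pattern will come back empty on the `no_extra` item, so it may be worth testing against an item that is known to have a thumbnail before changing the scraping rule.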

