DOMDocument, DOMXPath and invalid HTML

I've been doing some work with DOMDocument and DOMXPath to parse web pages recently, and even though I could copy the XPath directly from Firefox or Chrome, it would not match in $xpath->query().

What I found is that $dom->loadHTML($page) will handle invalid HTML, but mostly just by stripping it out. This is fine, unless you depend on the structure for the XPath query. The problem was that the page contained tables without TBODY tags (we've all built them). Chrome and Firefox add the missing TBODY elements themselves and then include them in the XPath they give you, but in my case loadHTML just stripped out all the table contents!

By doing some regex replacements before parsing, I was able to fix the HTML so that loadHTML preserves the entire table and the original XPath query works. The regex replacements I used were:

  // Load the html page with curl first (details omitted).
  $page = curl_exec($curl);

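  // Fix invalid HTML here: wrap the table rows in the TBODY that browsers add implicitly.
  $patterns = array('/<TABLE(.*?)>\s+<TR>/i', '/<\/TR>\s+<\/TABLE>/i');
  $replacements = array('<TABLE$1><TBODY><TR>', '</TR></TBODY></TABLE>');
  $item_page = preg_replace($patterns, $replacements, $page);

  // Collect warnings about any remaining invalid HTML instead of printing them.
  libxml_use_internal_errors(true);
  $dom = new DOMDocument();
  $dom->loadHTML($item_page);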

  $xpath = new DOMXPath($dom);

  // Now do your xpath query etc.
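
To check whether the replacement changes anything for a given page, it can help to run the same XPath against the markup before and after the fix. The following is only a minimal sketch: the sample page, the "prices" id, the query and the countMatches() helper are all made up for illustration, and the patterns assume (as above) that only whitespace sits between the TABLE and TR tags and that the TR tag has no attributes.

```php
<?php
// Hypothetical sample page: a table written without TBODY tags.
$page = '<html><body>
<table id="prices">
<tr><td>Widget</td><td>9.99</td></tr>
</table>
</body></html>';

// Hypothetical browser-style XPath that includes the tbody step.
$query = '//table[@id="prices"]/tbody/tr/td[2]';

// Illustrative helper: parse the HTML and count the nodes the query matches.
function countMatches(string $html, string $query): int
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    return $nodes === false ? 0 : $nodes->length;
}

// Same replacements as above: add the TBODY wrapper around the rows.
$patterns = ['/<TABLE(.*?)>\s+<TR>/i', '/<\/TR>\s+<\/TABLE>/i'];
$replacements = ['<TABLE$1><TBODY><TR>', '</TR></TBODY></TABLE>'];
$fixed = preg_replace($patterns, $replacements, $page);

echo 'matches before fix: ', countMatches($page, $query), "\n";
echo 'matches after fix:  ', countMatches($fixed, $query), "\n";
```

If the count after the fix is still zero, the table markup probably differs from what the patterns expect (for example a TR tag with attributes, or a THEAD before the first row), and the patterns would need adjusting for that page.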

Hopefully that will help someone else.
