How do you parse and process HTML/XML in PHP?


Asked on December 13, 2018 in XML.


  • 3 Answer(s)

    Here is the answer:

    Native XML Extensions

         I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.

    DOM:

        The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. The DOM API is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

    DOM is capable of parsing and modifying real-world (broken) HTML, and it can run XPath queries. It is based on libxml.
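
    To make the two claims above concrete (tolerating broken HTML and running XPath queries), here is a minimal sketch. The markup string and the variable names are made up for illustration; only the DOMDocument, DOMXPath and libxml_* calls come from PHP itself.

```php
<?php
// Deliberately broken markup: unclosed <p> elements and an unclosed <a>.
$html = '<div><p>Intro<p>More <a href="/one">one</a> <a href="/two">two</div>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);   // uses libxml's HTML parser
libxml_clear_errors();

// Query every href attribute on <a> elements, in document order.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attribute) {
    $hrefs[] = $attribute->nodeValue;
}

print_r($hrefs);
```

    The same XPath object can be reused for further queries on the document, which is usually where DOM starts to pay off compared with string matching.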

    It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you’ll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language’s DOM API.

    A basic usage example can be found in "Grabbing the href attribute of an A element", and a general conceptual overview can be found at "DOMDocument in php".

    How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.

    XMLReader:

       The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

    XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM, where you can explicitly tell it to use libxml’s HTML Parser Module.

    A basic usage example can be found at "getting all values from h1 tags using php".
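
    As a rough illustration of the cursor behaviour described above, the sketch below walks a small, well-formed document and collects the text of each h1 element. The sample XML and variable names are assumptions made for the example.

```php
<?php
// Well-formed sample input; XMLReader uses the XML parser, so broken HTML may fail here.
$xml = '<doc><h1>First</h1><p>text</p><h1>Second</h1></doc>';

$reader = new XMLReader();
$reader->XML($xml);   // read from a string instead of a file or URL

$headings = [];
// read() moves the cursor forward one node at a time until the stream ends.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'h1') {
        // readString() returns the text content of the current element.
        $headings[] = $reader->readString();
    }
}
$reader->close();

print_r($headings);
```

    Because only the current node is held in memory, this style tends to suit large documents that would be expensive to load into a full DOM tree.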

    XML Parser:

         This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

    The XML Parser library is also based on libxml, and implements a SAX-style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
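
    For comparison with the pull parser, here is a minimal sketch of the push (SAX) style: you register handlers and the parser calls them as events occur. The sample document, the $inHeading flag and the $headings array are hypothetical names chosen for this example.

```php
<?php
$xml = '<doc><h1>First</h1><p>text</p><h1>Second</h1></doc>';

$headings  = [];
$inHeading = false;   // state we have to track ourselves between events

$parser = xml_parser_create();
// Case folding is on by default and would upper-case element names.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, $name, $attributes) use (&$headings, &$inHeading) {
        if ($name === 'h1') {
            $inHeading  = true;
            $headings[] = '';
        }
    },
    // Called for every closing tag.
    function ($parser, $name) use (&$inHeading) {
        if ($name === 'h1') {
            $inHeading = false;
        }
    }
);

// Text may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$headings, &$inHeading) {
        if ($inHeading) {
            $headings[count($headings) - 1] .= $data;
        }
    }
);

xml_parse($parser, $xml, true);   // true marks this as the final chunk

print_r($headings);
```

    The manual state tracking is the main reason this approach is more work than XMLReader, as noted above.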

    SimpleXML:

         The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

    SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don’t even consider SimpleXML because it will choke.

    A basic usage example can be found at "A simple program to CRUD node and node values of xml file", and there are lots of additional examples in the PHP Manual.
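
    To show what "normal property selectors and array iterators" look like in practice, here is a small sketch. The library/book document is invented for the example and must be well-formed for SimpleXML to accept it.

```php
<?php
// Hypothetical, well-formed sample document.
$xml = '<library>'
     . '<book id="b1"><title>First Title</title></book>'
     . '<book id="b2"><title>Second Title</title></book>'
     . '</library>';

$library = simplexml_load_string($xml);
if ($library === false) {
    throw new RuntimeException('The XML could not be parsed.');
}

$titles = [];
// Child elements are reached as properties, attributes with array syntax.
foreach ($library->book as $book) {
    // Cast to string: both values are SimpleXMLElement objects otherwise.
    $titles[(string) $book['id']] = (string) $book->title;
}

print_r($titles);
```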

     

    Answered on December 13, 2018.

    Try Simple HTML DOM Parser

    • An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
    • Requires PHP 5+.
    • Supports invalid HTML.
    • Find tags on an HTML page with selectors just like jQuery.
    • Extract contents from HTML in a single line.

    Examples:

    How to get HTML elements:

    // Create DOM from URL or file
    $html = file_get_html('http://www.example.com/');
    // Find all images
    foreach ($html->find('img') as $element) {
      echo $element->src . '<br>';
    }
    // Find all links
    foreach ($html->find('a') as $element) {
      echo $element->href . '<br>';
    }
    

    How to modify HTML elements:

    // Create DOM from string
    $html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
    $html->find('div', 1)->class = 'bar';
    $html->find('div[id=hello]', 0)->innertext = 'foo';
    echo $html;
    

    Extract content from HTML:

    // Dump contents (without tags) from HTML
    echo file_get_html('http://www.google.com/')->plaintext;
    

    Scraping Slashdot:

    // Create DOM from URL
    $html = file_get_html('http://slashdot.org/');
    // Start with an empty list so print_r() works even if nothing matches
    $articles = [];
    // Find all article blocks
    foreach ($html->find('div.article') as $article) {
      $item = [];
      $item['title'] = $article->find('div.title', 0)->plaintext;
      $item['intro'] = $article->find('div.intro', 0)->plaintext;
      $item['details'] = $article->find('div.details', 0)->plaintext;
      $articles[] = $item;
    }
    print_r($articles);
    
    Answered on December 13, 2018.

    Just use this:

     DOMDocument->loadHTML() and be done with it. libxml’s HTML parsing algorithm is quite good and fast, and contrary to popular belief, does not choke on malformed HTML.
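
    A quick way to check this claim yourself is sketched below: feed loadHTML() some sloppy markup, keep the parser's complaints in libxml's internal error buffer, and see whether a usable tree still comes out. The input string and variable names are made up for the example, and how many errors libxml reports may vary between versions.

```php
<?php
// Sloppy input: <li> end tags omitted, <p> and <b> never closed.
$malformed = '<ul><li>one<li>two</ul><p>tail <b>bold';

// Buffer parser errors instead of raising PHP warnings; remember the old setting.
$previous = libxml_use_internal_errors(true);

$dom    = new DOMDocument();
$loaded = $dom->loadHTML($malformed);

// Inspect what the parser complained about, then restore the previous state.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The list items are still available as separate elements.
$items = [];
foreach ($dom->getElementsByTagName('li') as $li) {
    $items[] = trim($li->textContent);
}

echo 'Parser messages: ' . count($errors) . PHP_EOL;
print_r($items);
```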

    Answered on December 13, 2018.

