PHP - How to Parse HTML with DOMDocument?
Imagine an HTML document with the following structure:

Code: [Select]
<div class="item">
  <div class="title">
    <a class="title" href="http://www.domain.com/title.html">Title is here</a>
  </div>
  <div class="image">
    <a href="http://www.domain.com/title.html"><img src="image.jpg" /></a>
  </div>
</div>

How do you make an array containing $title - $url - $image_url?
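A minimal sketch of one way to do this with DOMDocument and DOMXPath. The markup and domain.com URLs are the sample from the question; for a live page, $html would come from file_get_contents() or cURL:

Code: [Select]
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; suppress parser warnings
$dom->loadHTML($html);            // $html holds the markup shown above

$xpath = new DOMXPath($dom);
$result = array();
foreach ($xpath->query("//div[@class='item']") as $item) {
    // the second argument to query() scopes the search to this one item
    $titleLink = $xpath->query(".//div[@class='title']/a", $item)->item(0);
    $img       = $xpath->query(".//div[@class='image']//img", $item)->item(0);
    $result[] = array(
        'title'     => $titleLink ? trim($titleLink->nodeValue) : null,
        'url'       => $titleLink ? $titleLink->getAttribute('href') : null,
        'image_url' => $img ? $img->getAttribute('src') : null,
    );
}
print_r($result);

Each element of $result is then an array with the three keys the question asks for.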
Similar Tutorials

Delete ... need to change host ..

What would the code be to display server-side output, such as the date and time, in HTML pages? I have the date code for my copyright notice, <?php echo date('Y'); ?>, and I think I need to create an .htaccess file to put it in, but I'm not sure what to put there. Thanks.

OK... I'm pretty new to PHP for the most part, but I understand programming languages to a decent extent! Anyway, I'm trying to parse an HTML page to get data out of it and, in turn, probably put it into an SQL table. All I need help with is the parsing, using DOM and XPath queries, or however would be the best way to do this. The page I'm trying to parse: http://us.battle.net/wow/en/guild/Moonrunner/The%20Eternal%20Blade/news Basically, the data I want to put into SQL (or variables, for the time being) is the 25 results returned in the news list (the first one is "mudkips item Vicious Gladiator's Signet of Cruelty", and the last is "Lionus earned the achievement Level 30 for 10 points"). Can anyone please give me some help with a function that could do this? Please!

I'm trying to parse two things: 1. specific TD tags from a table, and 2. specific URLs from an HTML page. Here's part of the data I'm trying to parse:

Code: [Select]
<tr>
  <td class="f">
    <a href="http://main1.site.com/x.html">Page 1</a>
  </td>
  <td>1572</td>
  <td class="a">Type: F</td>
  <td><img src="http://site.com/image.gif" title="N" alt="N" /></td>
  <td class="f">F</td>
</tr>
<tr class="x">
  <td class="m">
    <a href="http://main2.site.com/x.html">Page 2</a>
  </td>
  <td>1771</td>
  <td class="a">Type: M</td>

Here's the parser that I'm working with:

Code: [Select]
<?php
$html = file_get_contents('http://www.website.com/page.html');

// use this to only match "td" tags
#preg_match_all("/(<(td)>)([^<]*)(<\/\\2>)/", $html, $matches);

// use this to match any tags
#preg_match_all("/(<([\w]+)[^>]*>)([^<]*)(<\/\\2>)/", $html, $matches);

// use this to match URLs
#preg_match_all("/http:\/\/[a-z0-9A-Z.]+(?(?=[\/])(.*))/", $html, $matches);

// use this to match URLs
#preg_match_all("/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $matches);

preg_match_all("/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $matches);

for ($i = 0; $i < count($matches[0]); $i++) {
    echo "matched: " . $matches[0][$i] . "\n<br>";
    echo "part 1: " . $matches[1][$i] . "\n<br>";
    echo "part 2: " . $matches[2][$i] . "\n<br>";
    echo "part 3: " . $matches[3][$i] . "\n<br>";
    echo "part 4: " . $matches[4][$i] . "\n\n<br>";
}
?>

What I'm trying to output is:

Code: [Select]
<a href="http://main1.site.com/x.html">Page 1</a> Hits: 1572
<a href="http://main2.site.com/x.html">Page 2</a> Hits: 1771

...for the entire table. What I've managed to get out of it so far are the "Hits", with the "td" snippet. What I can't figure out is how to extract the full <a href="http://main.site.com/p#.html">Page #</a>. So my question is: how can I make it look for just "<a href="http://main#.......">Page #</a>"? Currently it looks for every URL, which is not what I need.
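Since the thread is about DOMDocument, here is a hedged sketch of the same extraction done without regex, pairing each row's link with the hits cell beside it. The URL is the placeholder from the question, and the assumption that the hit count sits in the row's second cell comes from the sample rows above:

Code: [Select]
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('http://www.website.com/page.html'));

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//tr') as $row) {
    $link = $xpath->query('.//td/a', $row)->item(0);  // first link in the row
    $hits = $xpath->query('./td[2]', $row)->item(0);  // assumed: hits in the 2nd cell
    if ($link && $hits) {
        // saveXML($node) serializes just that node, giving the full <a ...> tag
        echo $dom->saveXML($link) . ' Hits: ' . trim($hits->nodeValue) . "<br>\n";
    }
}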
Hello, I need to parse some HTML for validation. Not an entire HTML page, but something like a "string" of HTML tags (I don't know how to say it correctly in English). Basically, I have this to parse (for example):

Code: [Select]
<object width="100%" height="81">
  <param value="http://player.soundcloud.com/player.swf?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F12483908" name="movie">
  <param value="always" name="allowscriptaccess">
  <embed width="100%" height="81" type="application/x-shockwave-flash" src="http://player.soundcloud.com/player.swf?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F12483908" allowscriptaccess="always">
</object>

I know it's possible in PHP, but I don't know which function there is for this. Anyone? Thanks!

Hi everyone, I have been successful in parsing some data out of an HTML page that I am downloading using cURL. I used arrays and preg_match to get the data I need. However, some of the data contains a great deal of SPACE characters, and it seems that my array method doesn't work there. Can someone please point out how I can parse the following to get only the information out and not the tags (quoted excerpt including all the space characters, as downloaded):

Code: [Select]
<span class="basic_serial">(777) 777-7777</span>
<br />
1111 ABCD, EFGH, IJKL
<br />

Thanks.

Is it possible to parse an HTML document containing snippets of PHP code using DOMDocument? I.e. load the HTML from a file, parse/change it with DOMDocument, and then save it back to the file. I have tried, but I get <?php%20echo%20URL();%20?>

I have some code:

Code: [Select]
$doc = new DOMDocument();
$doc->loadHTML('<html>
<head><title>Test</title></head>
<body></body></html>');
$doc->encoding = 'iso-8859-1';
file_put_contents('test.html', $doc->saveHTML());

When I view the output file I get <html><head><title>Test</title></head><body></body></html> all on one line. Is there no way of having it format the output like the original source code, so that it's not all bunched together?

Is there any reason that people can think of as to why DOMDocument::saveHTML would remove the following:

Code: [Select]
<![if !vml]>
<img src="someimage.jpg" />
<![endif]>

A little clarification: this HTML comment tag is used in my company's email newsletter code and is necessary to make Outlook 2007 behave properly. For whatever reason, saveHTML strips it out. I know that this doesn't conform to HTML standards, and I'm guessing that is why it is being stripped. BUT, from reading on the internet, saveHTML can produce junk HTML code anyway. Any help is appreciated.

Hi guys, just starting to play with PHP DOMDocument, only to fail at the very first step:

Code: [Select]
<?php
$html = 'test/php/somefile.html';
if (!empty($html)) {
    $dom_1 = new domDocument;
    $dom_1->loadHTML($html);
    $links = $dom_1->getElementsByTagName('li');
    foreach ($links as $link) {
        // echo $link;
        echo $link->nodeValue, PHP_EOL;
    }
}
?>

When I visit it in a browser I get a WSOD. What am I missing?

Hi all, I am pretty new to PHP and I am having an issue trying to load an XML document. Whenever I try to use XPath it negates all the code below the line, including the HTML, and returns a white page. Here is my code:

Code: [Select]
<html>
<head>
<?php $xpath = new DOMXPath("structure.xml"); ?>
<body>
hello world
</body>
</html>

I checked phpinfo() and I have both DOM and XPath enabled and installed. I have also tried using just DOM, and that worked, so it is only XPath that is not working. Ideas? Thank you, James S
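For the last two posts: loadHTML() expects a string of markup, not a file path (that is what loadHTMLFile() is for), and the DOMXPath constructor takes an already-loaded DOMDocument object, not a filename. A small sketch of both working patterns, using the paths from the questions:

Code: [Select]
<?php
// loadHTMLFile() takes a path; loadHTML() takes the markup itself.
$dom_1 = new DOMDocument();
libxml_use_internal_errors(true);
$dom_1->loadHTMLFile('test/php/somefile.html');
foreach ($dom_1->getElementsByTagName('li') as $link) {
    echo $link->nodeValue, PHP_EOL;
}

// DOMXPath wraps a DOMDocument; load the XML file first.
$xml = new DOMDocument();
$xml->load('structure.xml');
$xpath = new DOMXPath($xml);
foreach ($xpath->query('//*') as $node) {
    echo $node->nodeName, PHP_EOL;
}

As a general note, a white page usually means a fatal error with display_errors switched off; enabling it or checking the error log shows the actual message.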
So there's some HTML I'm having to fetch and parse for personal use, but some of the data I want starts in a table written this way:

Code: [Select]
rawTableData = {"rows": [{"colname1": value, "colname2": "value", "colname3: value}

How can I use DOMDocument to parse this data if, say, I want the value for colname2? There are no tags for me to use.

It is easy to get an image or a link by DOMDocument, but I did not find a way to get an image together with its target link. Imagine HTML such as:

Code: [Select]
<div class="image">
  <a href='http://site.com'><img src='imagelink.jpg'></a>
</div>

How to get both the image link and the href?

Code: [Select]
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//div[@class='image']");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    // ...
}

Now to get the image and its href, we first need getElementsByTagName('a') and getElementsByTagName('img'), but they do not work inside foreach. What's your idea?

Hi guys, reading this from php.net has got me a wee bit confused. Trying to implement it has got me doubly confused! My code:

Code: [Select]
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile($parent_node);
if ($dom->childNodes <> 0) {
    $kids = array(
        'url' => $parent_node,
        'No_of_kids' => count($dom->childNodes)
    );
}

Results in "Notice: Object of class DOMNodeList could not be converted to int". How the heck am I supposed to count the childNodes?

Good evening dear PHPFreaks - hello to everybody. I want to create a link parser. I have chosen to do it with cURL. I have some lines together now; I'd love to hear your review... Since I am new to programming I'd love to get some hints from experienced devs. Here are some details: we have several hundred result pages derived from this one: http://www.educa.ch/dyn/79362.asp?action=search Note: I want to iterate over the result pages with a loop, e.g. http://www.educa.ch/dyn/79376.asp?id=1568 and http://www.educa.ch/dyn/79376.asp?id=2149. I take this loop:

Code: [Select]
for ($i = 1; $i <= $match[1]; $i++) {
    $url = "http://www.example.com/page?page={$i}";
    // access new sub-page, extract necessary data
}

Dear PHPFreaks, what do you think? What about the loop over the target URLs? BTW, as you see, some pages will be empty. Note: the empty pages should be thrown away; I do not want to store "empty" stuff. Well, this is what I want to do, and now I need a good parser script. Note: this is a three-part job: 1. fetching the sub-pages, 2. parsing them, 3. storing the data in a MySQL DB. The problem: some of the above-mentioned pages are empty, so I need a solution to leave them aside, because I do not want to populate my MySQL DB with too much noise. BTW, parsing should be a part that can be done with DOMDocument - what do you think? I need to combine the first part with the second; can you give me some starting points and hints to get this done? The fetching job should be done with cURL, and the data then processed in a DOMDocument parsing job. Note: I've taken the script from this place: http://www.merchantos.com/makebeta/php/scraping-links-with-php/

Code: [Select]
function storeLink($url, $gathered_from) {
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed');
}

for ($i = 1; $i <= 10000; $i++) {
    $target_url = "http://www.educa.ch/dyn/79376.asp?id={$i}";
}
// access new sub-page, extract necessary data

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url, $target_url);
    echo "<br />Link stored: $url";
}

Dear PHPFreaks, what do you think? What about the loop over the target URLs? Love to hear from you!

I just finished (or so I thought) a project, but my client's server runs PHP 4, so I need to adapt my code. Here's what stopped working:

Code: [Select]
$localClasses = new DOMDocument;
$localClasses->load("file.xml");
$localClasses->get_elements_by_tagname('Title')->item(0)->firstChild->nodeValue;

Here's my petty attempt at adapting this code to run in PHP 4:

Code: [Select]
$file = file_get_contents("localClasses.xml");
$localClasses = new DOMDocument($file);
$test = $localClasses->get_elements_by_tagname('Title');
$testText = $test->item[0]->firstChild->nodeValue;
print $testText;

This doesn't give me any errors, but nothing shows up. Any help would be appreciated. Thanks for reading!

Code: [Select]
$domdoc = new DOMDocument();
$domdoc->formatOutput = TRUE;
$empty_cart_xml =
'<Order>
  <Cart>
    <Items>
      <Item>1</Item>
      <Item>2</Item>
      <Item>3</Item>
    </Items>
  </Cart>
</Order>';
$domdoc->loadXML($empty_cart_xml);
print $domdoc->saveXML() . "<hr/>"; // works up to this point

$xpath = new DOMXPath($domdoc);
$items = $xpath->query('Order/Cart/Items');
foreach ($itemses AS $items) {
    $items->appendChild($domdoc->createElement('Item', '4'));
}
print $domdoc->saveXML();

All I want to do is to add a new Item to Items. What am I doing wrong?
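In that last snippet the foreach reads from $itemses, which is never defined, and its loop variable overwrites $items. Iterating the DOMNodeList returned by query() fixes it; a sketch of the corrected tail, keeping the poster's XML:

Code: [Select]
$xpath = new DOMXPath($domdoc);
$nodes = $xpath->query('/Order/Cart/Items');
foreach ($nodes as $itemsNode) {
    // append <Item>4</Item> to each matched <Items> element
    $itemsNode->appendChild($domdoc->createElement('Item', '4'));
}
print $domdoc->saveXML();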
Hi, I have PHP code that can extract the categories and display them. However, I still can't extract the numbers that go along with them (without the brackets). This is my code:

Code: [Select]
<?php
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.lelong.com.my/Auc/List/BrowseAll.asp");
$finder = new DOMXPath($grep);
$class = "CatLevel1";
$nodes = $finder->query("//*[contains(@class, '$class')]");
foreach ($nodes as $node) {
    $span = $node->childNodes;
    echo $span->item(0)->nodeValue . "<br>";
}
?>

This is my desired output:

Code: [Select]
Arts, Antiques & Collectibles : 9768
B2B & Industrial Products : 2342
Baby : 3453
etc...
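A hedged sketch of one way to split the count out, assuming each CatLevel1 element's full text reads like "Arts, Antiques & Collectibles (9768)" (the exact markup of the live page isn't shown in the post):

Code: [Select]
<?php
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.lelong.com.my/Auc/List/BrowseAll.asp");
$finder = new DOMXPath($grep);
$nodes = $finder->query("//*[contains(@class, 'CatLevel1')]");
foreach ($nodes as $node) {
    $text = trim($node->textContent);  // whole text: name and count together
    // peel a trailing "(1234)" off the category name, if one is there
    if (preg_match('/^(.+?)\s*\((\d+)\)\s*$/s', $text, $m)) {
        echo trim($m[1]) . ' : ' . $m[2] . "<br>";
    } else {
        echo $text . "<br>";           // no count found; print as-is
    }
}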
Good day dear PHPFreaks - hello to everybody. This is the same link-parser job as in my post above: iterate over the educa.ch result pages (http://www.educa.ch/dyn/79376.asp?id=...), fetch each sub-page with cURL, parse it with DOMDocument, store the data in a MySQL DB, and throw the empty pages away. No problem there. But how to do the DOMDocument job? I have installed Firebug in Firefox, and now I have the XPaths for these pages: http://www.educa.ch/dyn/79376.asp?id=1187 http://www.educa.ch/dyn/79376.asp?id=2939 http://www.educa.ch/dyn/79376.asp?id=1515 http://www.educa.ch/dyn/79376.asp?id=1469

Code: [Select]
Altes Schulhaus Ossingen          :: /html/body/div[2]
Guntibachstrasse 10               :: /html/body/div[4]
8475 Ossingen                     :: /html/body/div[6]
sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
Tel:052 317 15 45                 :: /html/body/div[11]
Fax:052 317 04 42                 :: /html/body/div[12]

But how do I apply these in Simple HTML DOM? I want to use this: http://simplehtmldom.sourceforge.net/ I look forward to a hint that gives me a starting point.

Hello dear friends - I want to test whether the DOMDocument class exists. Can I do this in the shell (on openSUSE 11.3)?

Code: [Select]
bool class_exists ( string $class_name [, bool $autoload = true ] )
bool class_exists ( string $DOMdocument [, bool $autoload = true ] )

Or do I have to create a file that I then call from the shell!? I look forward to an idea / hint / tip. Regards, db1
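class_exists() can be run straight from the shell with the PHP CLI, no script file needed; the -r flag executes the code given on the command line:

Code: [Select]
php -r 'var_dump(class_exists("DOMDocument"));'
# prints bool(true) when the dom extension is loaded

# alternatively, list the compiled-in modules and look for "dom"
php -m | grep -i dom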