PHP - DOMDocument Parsing Obstacle
So there's some HTML I have to fetch and parse for personal use, but some of the data I want is stored in a table written this way:

    rawTableData = {"rows": [{"colname1": value, "colname2": "value", "colname3: value}

How can I use DOMDocument to parse this data if, say, I want the value for colname2? There are no tags for me to use.

Similar Tutorials

I am trying to take a specific link from my site and place it into my database. I only want links that start with CORPSEARCH.ENTITY_INFORMATION?p_nameid= Can someone point me in the right direction here? The code for this is below:

    // make the cURL request to $target_url
    $html= curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number:" .curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
        exit;
    }
    // parse the html into a DOMDocument
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    // grab all the links on the page
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href');
        $sql="INSERT INTO links(cid, nlink)VALUES('$i','$url')";
        $result=mysql_query($sql);
        echo $result;
        echo $url;
    }

Hi all, I am pretty new to PHP and I am having an issue trying to load an XML document. Whenever I try to use XPath, it breaks all the code below that line, including the HTML, and returns a white page. Here is my code:

    <html>
    <head>
    <?php
    $xpath = new DOMXPath("structure.xml");
    ?>
    <body>
    hello world
    </body>
    </html>

I checked phpinfo() and I have both DOM and XPath enabled and installed. I have also tried using just DOM and that worked, so it is only XPath that is not working. Ideas?
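For the white-page XPath question above, a likely cause is that DOMXPath's constructor expects an already loaded DOMDocument object rather than a file name; passing a string raises a fatal error (a TypeError on PHP 7 and later), which shows up as a blank page when error display is off. A minimal sketch, assuming a made-up XML layout in place of the real structure.xml:

```php
<?php
// Assumption: this inline XML stands in for the contents of structure.xml,
// whose real layout is not shown in the question.
$xml = '<structure><page id="1">Home</page><page id="2">About</page></structure>';

$doc = new DOMDocument();
$doc->loadXML($xml);   // for a file on disk, call $doc->load('structure.xml') instead

// DOMXPath is built from the DOMDocument object, not from a path string.
$xpath = new DOMXPath($doc);

$titles = [];
foreach ($xpath->query('/structure/page') as $page) {
    $titles[] = $page->nodeValue;
}

echo implode(', ', $titles), PHP_EOL;   // prints "Home, About"
```

While debugging, turning on display_errors (or checking the server's error log) should reveal the actual message instead of a white page.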
Thank you,
James S

I have some code:

    $doc = new DOMDocument();
    $doc->loadHTML( '<html> <head><title>Test</title></head> <body></body></html>' );
    $doc->encoding = 'iso-8859-1';
    file_put_contents('test.html', $doc->saveHTML());

When I view the output file I get

    <html><head><title>Test</title></head><body></body></html>

all on one line. Is there no way of having it formatted like the original source code so that it's not all bunched together?

Is there any reason people can think of as to why DOMDocument::saveHTML would remove the following?

    <![if !vml]>
    <img src="someimage.jpg" />
    <![endif]>

A little clarification: this conditional comment is used in my company's email newsletter code and is necessary to make Outlook 2007 behave properly. For whatever reason, saveHTML strips it out. I know that this doesn't conform to HTML standards, and I'm guessing that is why it is being stripped. But from what I have read on the internet, saveHTML can produce junk HTML anyway. Any help is appreciated.

Hi guys, I'm just starting to play with PHP DOMDocument, only to fail at the very first step:

    <?php
    $html = 'test/php/somefile.html' ;
    if(!empty($html)){
        $dom_1 = new domDocument ;
        $dom_1->loadHTML($html) ;
        $links = $dom_1->getElementsByTagName('li') ;
        foreach ( $links as $link) {
            // echo $link ;
            echo $link->nodeValue, PHP_EOL;
        }
    }
    ?>

When I visit it in a browser I get a WSOD. What am I missing?

    $domdoc=new DOMDocument();
    $domdoc->formatOutput=TRUE;
    $empty_cart_xml= '<Order> <Cart> <Items> <Item>1</Item> <Item>2</Item> <Item>3</Item> </Items> </Cart> </Order>';
    $domdoc->loadXML($empty_cart_xml);
    print $domdoc->saveXML()."<hr/>"; //works up to this point
    $xpath=new DOMXPath($domdoc);
    $items=$xpath->query('Order/Cart/Items');
    foreach($itemses AS $items) {
        $items->appendChild($domdoc->createElement('Item','4'));
    }
    print $domdoc->saveXML();

All I want to do is add a new Item to Items. What am I doing wrong?
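In the cart snippet above, the loop iterates over $itemses, a variable that is never defined, so the appendChild call never runs. A minimal sketch of one way it could be written, using distinct (made-up) variable names for the node list and the loop variable, and an absolute XPath to keep the query unambiguous:

```php
<?php
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->loadXML('<Order><Cart><Items><Item>1</Item><Item>2</Item><Item>3</Item></Items></Cart></Order>');

$xpath = new DOMXPath($doc);

// $itemsNodes is the node list; $itemsNode is each matched <Items> element.
$itemsNodes = $xpath->query('/Order/Cart/Items');
foreach ($itemsNodes as $itemsNode) {
    // New elements are created by the document, then attached to the parent node.
    $itemsNode->appendChild($doc->createElement('Item', '4'));
}

$count = $doc->getElementsByTagName('Item')->length;
echo $doc->saveXML();
```

With notices enabled, the original loop would also report the undefined variable, which is usually the quickest way to spot this kind of typo.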
It is easy to get an image or a link with DOMDocument, but I did not find a way to get an image together with its target link. Imagine HTML such as

    <div class=image>
    <a href='http://site.com'><img src='imagelink.jpg'></a>
    </div>

How do I get both the image link and the href?

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//div[@class='image']");
    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);

Now, to get the image and its href, we first need getElementsByTagName('a') and getElementsByTagName('img'), but they do not work inside foreach. What's your idea?

Imagine an HTML document with the following structure:

    <div class="item">
        <div class="title">
            <a class="title" href="http://www.domain.com/title.html">Title is here</a>
        </div>
        <div class="image">
            <a href="http://www.domain.com/title.html"><img src=image.jpg /></a>
        </div>
    </div>

How do I make an array containing $title - $url - $image_url ?

Hi, I have PHP code that can extract the categories and display them. However, I still can't extract the numbers that go along with them (without the brackets).

This is my code:

    <?php
    $grep = new DoMDocument();
    @$grep->loadHTMLFile("http://www.lelong.com.my/Auc/List/BrowseAll.asp");
    $finder = new DomXPath($grep);
    $class = "CatLevel1";
    $nodes = $finder->query("//*[contains(@class, '$class')]");
    foreach ($nodes as $node) {
        $span = $node->childNodes;
        echo $span->item(0)->nodeValue."<br>";
    }
    ?>

This is my desired output:

    Arts, Antiques & Collectibles : 9768
    B2B & Industrial Products : 2342
    Baby : 3453
    etc...

Any help is appreciated. Thanks!

I just finished (or so I thought) a project. But my client's server runs PHP4, so I need to adapt my code. Here's what stopped working:

    $localClasses = new DOMDocument;
    $localClasses -> load("file.xml");
    $localClasses -> get_elements_by_tagname('Title') -> item(0) -> firstChild -> nodeValue

Here's my petty attempt at adapting this code to run in PHP4:

    $file = file_get_contents("localClasses.xml");
    $localClasses = new DOMDocument($file);
    $test = $localClasses -> get_elements_by_tagname('Title');
    $testText = $test -> item[0] -> firstChild -> nodeValue;
    print $testText;

This doesn't give me any errors, but nothing shows up. Any help would be appreciated. Thanks for reading!

Good evening, dear PHPFreaks, hello to everybody. I want to create a link parser, and I have chosen to do it with cURL. I have some lines together now and would love to hear your review. Since I am new to programming, I would love to get some hints from experienced devs. Here are some details: we have several hundred result pages derived from this one: http://www.educa.ch/dyn/79362.asp?action=search

Note: I want to iterate over the result pages with a loop.

    http://www.educa.ch/dyn/79376.asp?id=1568
    http://www.educa.ch/dyn/79376.asp?id=2149

I use this loop:

    for($i=1;$i<=$match[1];$i++) {
        $url = "http://www.example.com/page?page={$i}";
        // access new sub-page, extract necessary data
    }

Dear PHP-Freaks, what do you think? What about the loop over the target URLs? By the way, as you can see, some of the pages will be empty.
Note: the empty pages should be thrown away. I do not want to store "empty" stuff. That is what I want to do, and now I need a good parser script. Note: this is a three-part job:

1. fetching the sub-pages
2. parsing them
3. storing the data in a MySQL database

The problem is that some of the pages mentioned above are empty, so I need a way to skip them, since I do not want to populate my MySQL database with useless data. By the way, the parsing part should be doable with DOMDocument. What do you think? I need to combine the first part with the second. Can you give me some starting points and hints for this? The fetching job should be done with cURL, with the data then processed in a DOMDocument parser job. Note: I have taken the script from this place: http://www.merchantos.com/makebeta/php/scraping-links-with-php/

    function storeLink($url,$gathered_from) {
        $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
        mysql_query($query) or die('Error, insert query failed');
    }
    for($i=1;$i<= 10000; $i++) {
        $target_url = "http://www.educa.ch/dyn/79376.asp?id={$i}";
    }
    // access new sub-page, extract necessary data
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    // make the cURL request to $target_url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL,$target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html= curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number:" .curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
        exit;
    }
    // parse the html into a DOMDocument
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    // grab all the links on the page
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href');
        storeLink($url,$target_url);
        echo "<br />Link stored: $url";
    }

Dear PHP-Freaks, what do you think? What about the loop over the target URLs? Love to hear from you!

Hi guys, reading this on php.net has got me a wee bit confused. Trying to implement it has got me doubly confused! My code:

    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $dom->loadHTMLFile($parent_node);
    if($dom->childNodes <>0) {
        $kids = array (
            'url' => $parent_node,
            'No_of_kids' => count($dom->childNodes)
        );
    }

This results in 'Notice: Object of class DOMNodeList could not be converted to int'. How the heck am I supposed to count the childNodes?

Good day, dear PHPFreaks, hello to everybody. I want to create a link parser, and I have chosen to do it with cURL. The fetching job should be done with cURL, with the data then processed in a DOMDocument parser job. No problem here. But how do I do the DOMDocument job? I have installed Firebug in Firefox, and now I have the XPaths for these pages:

    http://www.educa.ch/dyn/79376.asp?id=1187
    http://www.educa.ch/dyn/79376.asp?id=2939
    http://www.educa.ch/dyn/79376.asp?id=1515
    http://www.educa.ch/dyn/79376.asp?id=1469

    Altes Schulhaus Ossingen :: /html/body/div[2]
    Guntibachstrasse 10 :: /html/body/div[4]
    8475 Ossingen :: /html/body/div[6]
    sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
    Tel:052 317 15 45 :: /html/body/div[11]
    Fax:052 317 04 42 :: /html/body/div[12]

But how do I apply these in the Simple DOM parser? I want to use this one: http://simplehtmldom.sourceforge.net/ I look forward to a hint that gives me a starting point.

Hello dear friends, I want to test whether the DOMDocument class exists. Can I do this in the shell (on openSUSE 11.3)?

    bool class_exists ( string $class_name [, bool $autoload = true ] )
    bool class_exists ( string $DOMdocument [, bool $autoload = true ] )

Or do I have to create a file that I then call in the shell? Looking forward to an idea, hint, or tip.

Regards,
db1

Hi, I have an RSS feed cached as an XML file. I need to pull some info out of it so I can then print it to the page.
Currently I am using this code to extract the data:

    $doc = new DOMDocument();
    $doc->load('inthenews.xml');
    $inthenews = array();
    foreach ($doc->getElementsByTagName('item') as $node) {
        $itemRSS = array (
            'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
            'link' => $node->getElementsByTagName('link')->item(0)->nodeValue,
            'desc' => $node->getElementsByTagName('description')->item(0)->nodeValue,
        );
        array_push($inthenews, $itemRSS);
    }

However, the description node contains more than I want. I need to remove everything except for the image ( <img../> ) it contains. Is there some way of running preg or similar on the nodeValue as it is extracted? Or an alternative to getElementsByTagName that allows searching for strings? If not, does anyone have a suggestion for doing this? I tried running preg_replace on the array, but it doesn't seem to do anything. An example of the array created by my code above is shown below:

    [0] => Array
    (
        [title] => BBC radio Cambridge 7.20am
        [link] => images/news/Matthew_Freeman_Radio_Cambridgeshire_21-10-10.mp3
        [desc] => <img alt="BBC-logo" src="http://www2.mrc-lmb.cam.ac.uk/images/news/BBC-logo.jpg" height="54" width="127" /><br/>BBC radio Cambridge 7.20am 21.10.10: Dr Matthew Freeman"<br/> 21 October 2010
    )

Thanks in advance for any advice,
Phil

Hi, what I'm trying to do is get the contents of heading and text_details and merge them together before adding them as one entry into the database:

    <div class="heading"><p> Second quote: </div><div class=text_details> <p> Satan's substitute for repentance is the man's rationalization of evil. <p></b></div>

My code looks something like this:

    foreach( $dom->getElementsByTagName('div') as $div ) {
        foreach( $div->attributes as $attributes ) {
            if( strtolower($attributes->name) == 'class' ) {
                if( strtolower($attributes->value) == 'heading' || strtolower($attributes->value) == 'text_details') {
                    $quote = $div->textContent;
                    $clean_quote = mysql_real_escape_string($quote);
                    echo "the quote is: " . $clean_quote . "<br />";
                    mysql_query("INSERT INTO quotes (quote) VALUES ('$clean_quote')")or die(mysql_error());
                }
            }
        }
    }

When I do this, obviously I get "Second quote of the day" entered into a field by itself and "Satan's substitute for repentance is the man's rationalization of evil." into another field. How do I make them one? Thanks in advance.

I have been able to pull the html of a given node

Good evening, dear community. First of all: Feliz Navidad, I wish you a Merry Christmas! Today I'm trying to debug a little DOMDocument object in PHP. Ideally, I would like DOMDocument to output in an array-like format so that I can store the data in a database. My example: head over to the target URL: http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=8880

I investigated the source code. I want to filter out the data that is in the following class:

    <div class="floatbox">

See the source code:

    <span class="grey"> <span style="font-size:x-small;">></span></span>
    <a class="navLink" href="http://dms-schule.bildung.hessen.de/suchen/index.html" title="Suchformulare zum hessischen schulischen Bildungssystem">suche</a>
    </div>
    </div>
    <!-- begin of text -->
    <h3>Siegfried-Pickert Schule</h3>
    <div class="floatbox">

See my approach. Here is the solution to return the labels and values in a formatted array ready for input to MySQL:
    <?php
    $dom = new DOMDocument();
    @$dom->loadHTMLFile('http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=8880');
    $divElement = $dom->getElementById('floatbox');
    $innerHTML= '';
    $children = $divElement->childNodes;
    foreach ($children as $child) {
        $innerHTML = $child->ownerDocument->saveXML( $child );
        $doc = new DOMDocument();
        $doc->loadHTML($innerHTML);
        //$divElementNew = $dom->getElementsByTagName('td');
        $divElementNew = $dom->getElementsByTagname('td');
        /*** the array to return ***/
        $out = array();
        foreach ($divElementNew as $item) {
            /*** add node value to the out array ***/
            $out[] = $item->nodeValue;
        }
        echo '<pre>';
        print_r($out);
        echo '</pre>';
    }

Well, duh: this outputs a lot of garbage. The code spits out a lot of HTML anyway. What can I do to get cleaner output? What is wrong with the idea of using this call: $dom->getElementById('floatbox'); Any idea? Any and all help will be greatly appreciated.

Season's greetings,
db1

NB: The first section of PHP in this thread is extracted from the second. It is where the problem lies: ($newItem = $XMLpage->createElement('item', $updateValue);). I am simply trying to add an element to an XML document via DOMDocument.createElement(). However, the following returns a fatal error:

    Fatal error: Call to undefined method DOMElement::createElement()

The section in question, extracted from the full function:

    //load up and get ready the xml file to edit
    $xml = new DOMDocument('1.0', 'utf-8');
    $xml->load($fullPathToXML);
    //load the page we are changing
    $XMLpage = $xml->getElementsByTagName('page')->item($page_number);
    //create the new item node
    $newItem = $XMLpage->createElement('item', $updateValue);
    //append the item to the xml sheet on the correct page
    $XMLpage->appendChild($newItem);
    //now save the xml to file
    $xml->save($fullPathToXML);

Would anyone have any ideas why it is returning a fatal error here? I have used the official example for DOMDocument.createElement(), and the PHP version I am using is 5.3.8. The full function (the rest of which works fine):

    /*** modifying a single section of an xml sheet ***/
    public function model_updateXMLfile($id = null, $updateValue = null, $inputType = null) {
        //if the id sent is empty
        if(empty($id)) { exit(); }
        //get the domain name from the session
        $xmlFile= $_SESSION['xmlFile'];
        //replace all . with DOT as this will be the actual file name
        $xmalFileName = str_replace('.','DOT',$xmlFile);
        //get the xml location directory from the session
        $xmlLocation = $_SESSION['xml_location'];
        //first make the path a whole path based on $_SESSION['xml_location']
        $fullPathToXML = WEBROOT.'xml/'.$xmlLocation.'/'.$xmalFileName.'.xml';
        //now get the coords of what page to change
        $id_exploded = explode('_',$id);
        //page number to edit
        $page_number = $id_exploded[0];
        //item number
        $item_number = $id_exploded[1];
        //load up and get ready the xml file to edit
        $xml = new DOMDocument('1.0', 'utf-8');
        $xml->formatOutput = true;
        $xml->preserveWhiteSpace = false;
        $xml->load($fullPathToXML);
        //load the page we are changing
        $XMLpage = $xml->getElementsByTagName('page')->item($page_number);
        //if the user wants to change the top title
        if($item_number == 'topTitle') {
            $htmlTitle = $XMLpage->getElementsByTagName('topTitle')->item(0);
            $htmlTitle ->nodeValue = $updateValue;
            $XMLpage->replaceChild($htmlTitle, $htmlTitle);
            $xml->save($fullPathToXML);
        }
        //else changing an item
        else{
            /** check the item exists, if it does then simply edit **/
            if($XMLpage->getElementsByTagName('item')->item($item_number)) {
                //get the xml tag to update
                $xmlItem = $XMLpage->getElementsByTagName('item')->item($item_number);
                //change the content of the xml tag
                $xmlItem ->nodeValue = $updateValue;
                //update the node in the xml file
                $XMLpage->replaceChild($xmlItem, $xmlItem);
                //now change the attribute of the item according to what was sent
                $xmlItem->setAttribute('type', "$inputType");
                //now save the xml to file
                $xml->save($fullPathToXML);
            }
            /** a new item is being created **/
            else {
                //create the new item node
                $newItem = $XMLpage->createElement('item', $updateValue);
                //append the item to the xml sheet on the correct page
                $XMLpage->appendChild($newItem);
                //now save the xml to file
                $xml->save($fullPathToXML);
            }
        }
        //return the message saved
        return 'saved';
    }
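The fatal error in the last post is consistent with createElement() being a method of DOMDocument, not of DOMElement: $XMLpage is a DOMElement, so the new node has to be created on $xml and then appended to $XMLpage. A minimal sketch, assuming a small made-up XML document in place of the poster's file:

```php
<?php
// Assumption: this inline XML stands in for the poster's file on disk.
$xml = new DOMDocument('1.0', 'utf-8');
$xml->loadXML('<pages><page><item>first</item></page></pages>');

$pageNumber  = 0;          // hypothetical values for illustration
$updateValue = 'second';

$XMLpage = $xml->getElementsByTagName('page')->item($pageNumber);

// Create the node on the document, then attach it to the <page> element.
$newItem = $xml->createElement('item');
// A text node escapes characters such as & that createElement's value argument does not.
$newItem->appendChild($xml->createTextNode($updateValue));
$XMLpage->appendChild($newItem);

$items = $XMLpage->getElementsByTagName('item');
echo $xml->saveXML();
```

In the full function, the same change would apply inside the final else branch, with $xml->save($fullPathToXML) left as it is.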