PHP - Simple xml2array That Preserves HTML
The following code is relatively short, but it strips any HTML tags inside the XML:
    function object2array($object) // from php.net
    {
        $return = NULL;
        if (is_array($object)) {
            foreach ($object as $key => $value)
                $return[$key] = object2array($value);
        } else {
            $var = get_object_vars($object);
            if ($var) {
                foreach ($var as $key => $value)
                    $return[$key] = ($key && !$value) ? NULL : object2array($value);
            } else
                return $object;
        }
        return $return;
    }

    $bla = simplexml_load_file($xml_file);
    $bla = object2array($bla);

This one keeps the HTML but turns everything into one giant string:

    $bla = $bla->asXML();

So how can I easily preserve HTML? Better yet, can I somehow just tell PHP which tags to convert? For example, only <this> and <that> in:

Code: [Select]
    <this>
        <that>Text <foo>and</foo> test and <whatever>something</whatever>.</that>
    </this>

thus creating:

Code: [Select]
    Array
    (
        [this] => Array
            (
                [0] => Array
                    (
                        [that] => Text <foo>and</foo> test and <whatever>something</whatever>.
                    )
            )
    )

Similar Tutorials

I have stepped into more than one project with a function similar to the following:

Code: [Select]
    <?php
    /**
     * xml2array() will convert the given XML text to an array in the XML structure.
     * Link: http://www.bin-co.com/php/scripts/xml2array/
     * Arguments : $contents - The XML text
     *             $get_attributes - 1 or 0. If this is 1 the function will get the attributes as well as the tag values - this results in a different array structure in the return value.
     *             $priority - Can be 'tag' or 'attribute'. This will change the resulting array structure. For 'tag', the tags are given more importance.
     * Return: The parsed XML in an array form. Use print_r() to see the resulting array structure.
     * Examples: $array = xml2array(file_get_contents('feed.xml'));
     *           $array = xml2array(file_get_contents('feed.xml'), 1, 'attribute');
     */
    function xml2array($contents, $get_attributes = 1, $priority = 'tag')
    {
        if (!$contents) return array();

        if (!function_exists('xml_parser_create')) {
            //print "'xml_parser_create()' function not found!";
            return array();
        }

        //Get the XML parser of PHP - PHP must have this module for the parser to work
        $parser = xml_parser_create('');
        xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8"); # http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodings-a-tale-of-sadness-rage-and-data-loss
        xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
        xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
        xml_parse_into_struct($parser, trim($contents), $xml_values);
        xml_parser_free($parser);

        if (!$xml_values) return; //Hmm...

        //Initializations
        $xml_array = array();
        $parent = array();
        $opened_tags = array();
        $arr = array();

        $current = &$xml_array; //Reference

        //Go through the tags.
        $repeated_tag_index = array(); //Multiple tags with same name will be turned into an array
        foreach ($xml_values as $data) {
            unset($attributes, $value); //Remove existing values, or there will be trouble

            //This command will extract these variables into the foreach scope
            // tag(string), type(string), level(int), attributes(array).
            extract($data); //We could use the array by itself, but this is cooler.

            $result = array();
            $attributes_data = array();

            if (isset($value)) {
                if ($priority == 'tag') $result = $value;
                else $result['value'] = $value; //Put the value in an assoc array if we are in the 'attribute' mode
            }

            //Set the attributes too.
            if (isset($attributes) and $get_attributes) {
                foreach ($attributes as $attr => $val) {
                    if ($priority == 'tag') $attributes_data[$attr] = $val;
                    else $result['attr'][$attr] = $val; //Set all the attributes in an array called 'attr'
                }
            }

            //See tag status and do the needed.
            if ($type == "open") { //The starting of the tag '<tag>'
                $parent[$level-1] = &$current;
                if (!is_array($current) or (!in_array($tag, array_keys($current)))) { //Insert new tag
                    $current[$tag] = $result;
                    if ($attributes_data) $current[$tag . '_attr'] = $attributes_data;
                    $repeated_tag_index[$tag.'_'.$level] = 1;

                    $current = &$current[$tag];
                } else { //There was another element with the same tag name
                    if (isset($current[$tag][0])) { //If there is a 0th element it is already an array
                        $current[$tag][$repeated_tag_index[$tag.'_'.$level]] = $result;
                        $repeated_tag_index[$tag.'_'.$level]++;
                    } else { //This section will make the value an array if multiple tags with the same name appear together
                        $current[$tag] = array($current[$tag], $result); //This will combine the existing item and the new item together to make an array
                        $repeated_tag_index[$tag.'_'.$level] = 2;

                        if (isset($current[$tag.'_attr'])) { //The attribute of the last(0th) tag must be moved as well
                            $current[$tag]['0_attr'] = $current[$tag.'_attr'];
                            unset($current[$tag.'_attr']);
                        }
                    }
                    $last_item_index = $repeated_tag_index[$tag.'_'.$level] - 1;
                    $current = &$current[$tag][$last_item_index];
                }
            } elseif ($type == "complete") { //Tags that end in 1 line '<tag />'
                //See if the key is already taken.
                if (!isset($current[$tag])) { //New key
                    $current[$tag] = $result;
                    $repeated_tag_index[$tag.'_'.$level] = 1;
                    if ($priority == 'tag' and $attributes_data) $current[$tag . '_attr'] = $attributes_data;
                } else { //If taken, put all things inside a list(array)
                    if (isset($current[$tag][0]) and is_array($current[$tag])) { //If it is already an array...
                        // ...push the new element into that array.
                        $current[$tag][$repeated_tag_index[$tag.'_'.$level]] = $result;

                        if ($priority == 'tag' and $get_attributes and $attributes_data) {
                            $current[$tag][$repeated_tag_index[$tag.'_'.$level] . '_attr'] = $attributes_data;
                        }
                        $repeated_tag_index[$tag.'_'.$level]++;
                    } else { //If it is not an array...
                        $current[$tag] = array($current[$tag], $result); //...make it an array using the existing value and the new value
                        $repeated_tag_index[$tag.'_'.$level] = 1;
                        if ($priority == 'tag' and $get_attributes) {
                            if (isset($current[$tag.'_attr'])) { //The attribute of the last(0th) tag must be moved as well
                                $current[$tag]['0_attr'] = $current[$tag.'_attr'];
                                unset($current[$tag.'_attr']);
                            }

                            if ($attributes_data) {
                                $current[$tag][$repeated_tag_index[$tag.'_'.$level] . '_attr'] = $attributes_data;
                            }
                        }
                        $repeated_tag_index[$tag.'_'.$level]++; //0 and 1 index is already taken
                    }
                }
            } elseif ($type == 'close') { //End of tag '</tag>'
                $current = &$parent[$level-1];
            }
        }

        return ($xml_array);
    }
    ?>

The strange thing is, this function seems to be used a lot; I even started using it. The issue, however, is memory. When the XML is any larger than a small file, this function causes out-of-memory errors. Does anyone have a similar alternative that is easier on memory?

Hi everyone, I'm trying to select either a class or an id using PHP Simple HTML DOM Parser with absolutely no luck. My example is very simple and seems to comply with the examples given in the manual (http://simplehtmldom.sourceforge.net/manual.htm), but it just won't work; it's driving me up the wall. Here is my example: http://schulnetz.nibis.de/db/schulen/schule.php?schulnr=94468&lschb= I think the HTML is invalid: I cannot parse it. I need more examples; I have probably overlooked something. If anybody has a working example of Simple HTML DOM Parser, I would be happy. The examples on the developer site are not very helpful.
your dilbertone

    require_once 'phpSimpleHtmlDomClass.php';

    // find() needs a parsed DOM object, so the markup is loaded with str_get_html()
    $html = str_get_html('<div>
        <div class="man">Name: madac</div>
        <div class="man">Age: 18
        <div class="man">Class: 12</div>
    </div>');

    $name = $html->find('div[class="man"]', 0)->innertext;
    $age  = $html->find('div[class="man"]', 1)->innertext;
    $cls  = $html->find('div[class="man"]', 2)->innertext;

I want to get the text from each div with class="man", but it doesn't work because there is a missing closing div tag on the second line of the HTML. Please help me fix this. Thanks in advance.

So say I want to make a URL decoder and put it on my website:

    <?php
    $url = "url";
    $urldecode = urldecode($url);

I want there to be an entry box for someone to put the URL in and a submit button, so they enter the URL and hit submit, then a box to display the decoded URL. Any help with this? Thanks

Suppose a page has the following tags:

Code: [Select]
    <div class="news">
        <div class="article">
            <h2>title 1</h2>
            <div class="content">
                <p>content 1....</p>
            </div>
        </div>
        <div class="article">
            <h2>title 2</h2>
            <div class="content">
                <p>content 2....</p>
            </div>
        </div>
    </div>

Is it possible to check, using the Simple HTML DOM library, whether the value of <h2> is "title 1" and, if it is, echo it or store it in a variable? What I've been able to do up to now is:

    <?php
    include('simple_html_dom.php');
    // $article is assumed to already hold the parsed page
    foreach ($article->find('div[class=news]') as $news) {
        foreach ($news->find('div[class=article]') as $content) {
            foreach ($content->find('h2') as $heading) {
                echo $heading;
            }
        }
    }
    ?>

This only echoes all the h2 elements.

Hiya! I need to create a simple PDF script that will always create A4 documents. I need the content to be controlled using HTML and CSS. Where do you start in creating such a script? Any help is gratefully received.
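For the <h2> question above, here is a minimal sketch of the comparison step. Note that it swaps in PHP's built-in DOM extension (DOMDocument and DOMXPath) instead of Simple HTML DOM so that it runs without any extra library. The function name findHeading() and the sample markup are assumptions made for illustration.

```php
<?php
// Sketch only: uses PHP's built-in DOM extension, not Simple HTML DOM.
// findHeading() is a hypothetical helper name.
$page = <<<HTML
<div class="news">
  <div class="article"><h2>title 1</h2><div class="content"><p>content 1....</p></div></div>
  <div class="article"><h2>title 2</h2><div class="content"><p>content 2....</p></div></div>
</div>
HTML;

function findHeading(string $html, string $wanted): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Every <h2> that sits directly inside an article inside the news block
    foreach ($xpath->query('//div[@class="news"]//div[@class="article"]/h2') as $h2) {
        $text = trim($h2->textContent);
        if ($text === $wanted) {
            return $text;               // found the heading we were looking for
        }
    }
    return null;                        // no heading with that exact text
}

$match = findHeading($page, 'title 1');
if ($match !== null) {
    echo $match, "\n";                  // or keep $match for later use
}
```

With Simple HTML DOM itself, the same idea would probably be to compare each heading's plaintext property against the wanted string inside the innermost loop, rather than echoing every heading.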
im using simple_html_dom.php i want to extract the following html: and number the array key so i will know the location of each <td> and extract the value the this cell: <TD ALIGN=RIGHT NOWRAP class="ftableline1"> 3.7200 </TD> with this : Code: [Select] foreach($html->find('td[class=ftableline1]') as $e) echo $e->innertext . '<br>'; Code: [Select] <TR class="ftableline1"> <TD ALIGN=RIGHT NOWRAP class="ftableline1"> 3.7200 </TD> <TD ALIGN=RIGHT NOWRAP class="ftableline1"> 3.5400 </TD> <TD ALIGN=RIGHT NOWRAP class="ftableline1"> 3.6651 </TD> <TD ALIGN=RIGHT NOWRAP class="ftableline1"> 3.5982 </TD> <TD align="right" NOWRAP class="ftableline1"> <A HREF=_matbea=1><IMG SRC="images/tezuga_graphit.gif" WIDTH=15 HEIGHT=15 ALT="Show Graph" BORDER="0"></a><BR> </TD> <TD ALIGN=RIGHT NOWRAP>0.01%</TD> <TD ALIGN=right dir="rtl"> <IMG SRC="images/arrow_up.gif" WIDTH=10 HEIGHT=8 BORDER=0><BR> </TD> <TD align="right" NOWRAP dir="rtl" class="ftableline1"> 3.6316 </TD> <TD align="right" NOWRAP dir="rtl" class="ftableline1"> 1 </TD> <TD ALIGN=RIGHT NOWRAP dir="rtl" class="ftableline1"> <A HREF=_matbea=1> דולר ארה"ב</A><BR> </TD> <TD align="right" NOWRAP dir="rtl" class="ftableline1"> <A href="_matbea=1"><IMG SRC="../../meida/images/f1.gif" HEIGHT=15 WIDTH=21 border=0></A><BR> </TD> <TD ALIGN=center NOWRAP dir="rtl"><INPUT TYPE="Checkbox" VALUE="1" NAME="check" id="check" ></TD> </TR> Guys I m trying to make a crawler and the crawler works fine but the problem is that after some random amount of time the crawler crashes with the message "PHP Fatal error: Maximum function nesting level of '100' reached, aborting! in ....." . Please help me out . Is it my code or there is a problem with simple html dom ?? All some help would be welcome. I'm having troubles with this simple email script. the idea: would like to make a mailing from a form. This works just fine. I'm having troubles with the mailing it self. 
What I posted here is my code, edited to try it without concatenating strings (the $_POST variable inside the markup). This, however, doesn't work either way.

Tests:
all variables get set properly (tested with echoes, which I cut out to save you the reading :-))
the connection gets established just fine
all email addresses from the database pass by just fine (also checked by echo :-))

I'm thinking semantic error, but I can't find the problem. Can anyone help me?

    <?php
    //make connection with database
    $con = mysql_connect("XXXXXXXXX","XXXXXXXXXX","XXXXXXXX");
    if (!$con) {
        die('Could not connect: ' . mysql_error());
    }
    mysql_select_db("dejuistestudiek", $con);

    //select all emails from mailing table
    if ($con) {
        $sql = "SELECT * FROM mailing";
        $result = mysql_query($sql);

        //prepare text for html mail
        $text = trim($_POST['TEXT']);
        $text = nl2br($text);
        $text = stripslashes($text);

        if ($result) {
            while ($row = mysql_fetch_array($result)) {
                // single recipient
                $to = $row['email'];

                // subject
                $subject = 'StuWay - Nieuwsbrief';

                // message
                $message = '<html>
    <head>
      <title>Birthday Reminders for August</title>
    </head>
    <body>
      <p>Here are the birthdays upcoming in August!</p>
      <table>
        <tr>
          <th>Person</th><th>Day</th><th>Month</th><th>Year</th>
        </tr>
        <tr>
          <td>Joe</td><td>3rd</td><td>August</td><td>1970</td>
        </tr>
        <tr>
          <td>Sally</td><td>17th</td><td>August</td><td>1973</td>
        </tr>
      </table>
    </body>
    </html>';

                // To send HTML mail, the Content-type header must be set
                $headers = 'MIME-Version: 1.0' . "\r\n";
                $headers .= 'Content-type: text/html; charset=iso-8859-1' . "\r\n";

                // Additional headers
                $headers .= 'From: Do-Not-Reply@dejuistestudiekeuze.be' . "\r\n";

                // Mail it
                $bool = mail($to, $subject, $message, $headers);
            }
        }
    }
    ?>

I have a quick question to ask, as I can't see it in their docs. Maybe you can help me.
I am using http://simplehtmldom.sourceforge.net/. I have written a Simple HTML DOM script to collect data from this page, as an example: http://www.visualdesign.ie/_dev/myscraper/simplehtmldom/dev-env/scraping/daily/daily.html

The script is executed by running a cron job on this file (which generates the XML): http://www.visualdesign.ie/_dev/myscraper/simplehtmldom/dev-env/scraping/daily/daily.php

The data is collected and written to this XML file: http://www.visualdesign.ie/_dev/myscraper/simplehtmldom/dev-env/scraping/daily/daily.xml

That is fine, and the script essentially scrapes the entire page, with no conditional ifs for any sections. A sample of my code for one segment is here: http://pastebin.com/JLb8f92N

What I would like to do now, and I am hoping you can help, is scrape this same page but produce separate XML files based on the date. So if you view that page, I want to scrape the data in the table for Saturday 4th February 2012 and produce XML for that day only. Then I want to scrape the data in the section for Sunday 5th February 2012, and another for Tuesday 7th February 2012. I think it's self-explanatory. The date sections of the table are separated by this HTML attribute: bgcolor="#CCCCCC". But bear in mind that the table rows in each date section will change when the site is updated, and there may be more or fewer football game records. Anyway, I would appreciate any help on how to place conditionals in the code and scrape only the date sections needed. Separate script files would be fine if needed. Many thanks for your time, Darren.
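One possible approach to the per-date question, sketched with PHP's built-in DOM extension rather than Simple HTML DOM: walk the table rows in order, start a new group whenever a row contains a cell with bgcolor="#CCCCCC", and collect the following rows under that date. The sample markup, the groupRowsByDate() name, and the <games>/<game>/<cell> element names are all assumptions. The real page may mark the date rows differently.

```php
<?php
// Sketch only. Assumes each date section starts with a row whose <td> carries
// bgcolor="#CCCCCC" and holds the date text; groupRowsByDate() is a hypothetical helper.
function groupRowsByDate(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);          // tolerate sloppy real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath   = new DOMXPath($doc);
    $groups  = [];
    $current = null;                           // date heading we are currently under

    foreach ($xpath->query('//tr') as $row) {
        $header = $xpath->query('./td[@bgcolor="#CCCCCC"]', $row);
        if ($header->length > 0) {             // a separator row starts a new date section
            $current = trim($header->item(0)->textContent);
            $groups[$current] = [];
            continue;
        }
        if ($current === null) {
            continue;                          // rows before the first date are ignored
        }
        $cells = [];
        foreach ($xpath->query('./td', $row) as $td) {
            $cells[] = trim($td->textContent);
        }
        $groups[$current][] = $cells;
    }
    return $groups;
}

// Made-up sample data standing in for the scraped page
$sample = <<<HTML
<table>
<tr><td bgcolor="#CCCCCC">Saturday 4th February 2012</td></tr>
<tr><td>Team A</td><td>Team B</td></tr>
<tr><td>Team C</td><td>Team D</td></tr>
<tr><td bgcolor="#CCCCCC">Sunday 5th February 2012</td></tr>
<tr><td>Team E</td><td>Team F</td></tr>
</table>
HTML;

$byDate = groupRowsByDate($sample);

// Build XML for one date only
$wanted = 'Saturday 4th February 2012';
$xml = new SimpleXMLElement('<games/>');
$xml->addAttribute('date', $wanted);
foreach ($byDate[$wanted] as $cells) {
    $game = $xml->addChild('game');
    foreach ($cells as $cell) {
        $game->addChild('cell', htmlspecialchars($cell)); // escape & before adding
    }
}
echo $xml->asXML();
```

Because the groups are keyed by the date text, the number of rows under each date can change freely between runs. Writing one file per date would then be a loop over $byDate that calls asXML() with a file name.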
Hello, can anyone show me how to insert the following into a database?

Code: [Select]
    <?php
    include('dbconnect.php');
    include('simple_html_dom.php');

    $html = file_get_dom('test3.php');
    foreach ($html->find('h2') as $heading) {
        //echo $heading;
        foreach ($heading->find('a') as $link) {
            $item_title = md5($link);
            echo $link . "<br/>";
            $sql = "INSERT INTO articles(article_id, article_link) VALUES($item_title, $link)";
            $result = mysql_query($sql);
        }
    }
    ?>

What I want to do is insert every link into the database, but this is not working. The output is OK, but the insert doesn't work. Any help, please?

Hi all, I am using the PHP Simple HTML DOM parser to connect to a financials website, parse out a company's financial information (the income statement in this case), and then insert the scraped data into a MySQL database that I can later use to run automated calculations. Here is the code I have so far:

Code: [Select]
    <?php
    include_once 'simple_html_dom.php';

    //Connect to financial website and create DOM from URL
    $income_statement = file_get_html('http://www.WEBSITE.com/finance?etc..etc...etc...etc...');

    //PULL FINANCIAL DATA
    foreach ($income_statement->find('td[class]') as $lines => $data) {
        echo $data->plaintext . "<br/>";
    }

    // clean up memory
    $income_statement->clear();
    unset($income_statement);
    ?>

So far I am able to get output that looks like this:

Code: [Select]
    Revenue 336.57 331.52 324.32 319.29 320.40
    Other Revenue, Total - - - - -
    Total Revenue 336.57 331.52 324.32 319.29 320.40
    etc.............................

But being a newbie, I do not understand how I can break each $ value and each - into its own variable and then insert them into their corresponding MySQL table fields. During the database insert I would like to skip the field headings (i.e. Revenue, Total Revenue, etc.). Any help would be absolutely amazing, as I have been reading, scripting and searching for information like crazy, but just can't seem to figure it out.
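For the income statement question, a minimal sketch of the splitting step. It assumes the cell texts have already been collected per table row (for example by looping over each tr and reading the plaintext of its td elements), so the first cell is the heading and the rest are the period values. The function name parseStatementRow() and the array layout are hypothetical.

```php
<?php
// Sketch only: turns one row of scraped cell texts into a label plus numeric values.
// parseStatementRow() is a hypothetical helper; "-" is treated as "no value" (null).
function parseStatementRow(array $cells): ?array
{
    if (count($cells) < 2) {
        return null;                       // nothing to split: heading without values
    }
    $label  = trim(array_shift($cells));   // first cell is the heading, e.g. "Revenue"
    $values = [];
    foreach ($cells as $cell) {
        $cell = trim($cell);
        if ($cell === '-' || $cell === '') {
            $values[] = null;              // placeholder cell, store as NULL
        } else {
            $values[] = (float) str_replace(',', '', $cell); // drop thousands separators
        }
    }
    return ['label' => $label, 'values' => $values];
}

// Rows taken from the sample output in the post above
$rows = [
    ['Revenue', '336.57', '331.52', '324.32', '319.29', '320.40'],
    ['Other Revenue, Total', '-', '-', '-', '-', '-'],
    ['Total Revenue', '336.57', '331.52', '324.32', '319.29', '320.40'],
];

foreach ($rows as $cells) {
    $row = parseStatementRow($cells);
    printf("%s: %s\n", $row['label'], json_encode($row['values']));
}
```

Keeping the heading separate from the values means the insert can use only $row['values'], with the label deciding which table column or record the numbers belong to. Binding the values through a prepared statement (PDO or mysqli) avoids quoting problems when building the SQL.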
Hi, I am learning PHP now, so pardon my silly question, which I have not been able to resolve for a week. I have created a simple web form in which I display the values entered by a user.

    <form action="reply.php" id="myForm" method="post">
        Name: <input type="text" name="name" size="25" maxlength="50" /> <br>
        Description: <textarea name="editor1"> </textarea>
        <input type="submit" value="Submit" />
    </form>

and reply.php contains:

    <?php
    echo "In reply page";
    foreach ($_POST as $field => $value) {
        echo "$field = $value";
    }
    ?>

When I click the submit button, I just get a blank page without any values from the form. Can anyone please tell me what I am missing? Setup: I am using NetBeans with the PHP bundle added to it. When I run only a simple PHP project, it displays the page at localhost/nameofproj, but when I run a PHP file along with a JSP file, it runs at localhost:8080/nameofproj. Is this (localhost:8080 instead of just localhost) the reason for showing a blank page, without even a simple echo statement, when I click submit? I even reinstalled NetBeans; still no luck. Thanks in advance.

Hi everyone, I have just started using Simple HTML DOM today and I have spent 4 hours not getting what I want.
I want to be able to extract the following information: Code: [Select] <div class="listing_content"> <span class="serialNumb" style="line-height: 21px;">77777</span> <br /> 444 ASDF, Alpha, Tango, Beta <br /> 77777 Director:99999 <div> <img title='web' src='http://cpgimg.com/images/icon_sm_web.gif' alt='web'/> <a href='javascript:void(0)' onClick="window.open('/redir.jsp?p_url=http:%2f%2fwww.cnn.com&p_cid=2707304&p_hid=279E00&p_ct=3527&p_pr=KO&p_fr=U');" class='listing_link'>website</a> <img title='email' src='http://cpgimg.com/images/icon_sm_mail.gif' alt='email'/> <a class='listing_link' href="javascript:void(0)" onclick="popupEmail('/email.jsp?lang=0&p_cid=2707304');(new Image()).src='/redir.jsp?p_url=&p_cid=2707304&p_hid=279E00&p_ct=3527&p_pr=ON&p_fr=E&msec='+(new Date()).getMilliseconds()">E-mail</a> </div> </div> The content I need to pull separately from above include: 1- serialNumb = 77777 2- 444 ASDF, Alpha, Tango, Beta 3- 77777 Director:99999 4- www.cnn.com I want all the data to recorded to different variables so I can upload them to MySQL. Any help with this is much appreciated. I don't have to use Simple DOM HTML but per my search it seems to be the best tool (however, I am not so lucky with it.) ***Not to forget that this page is full of <div>, <br />, <img>, and other tags. The quoted part is just one excerpt but this part is unique and used once in the page "style="line-height: 21px;". Also the "('/redir.jsp?p_url" is also unique for the URL portion. Thanks again. hello dear community, i am currently wroking on a approach to parse some sites that contain datas on Foundations in Switzerland with some details like goals, contact-E-Mail and the like,,, See http://www.foundationfinder.ch/ which has a dataset of 790 foundations. All the data are free to use - with no limitations copyrights on it. I have tried it with PHP Simple HTML DOM Parser - but , i have seen that it is difficult to get all necessary data -that is needed to get it up and running. 
Who wants to jump in and help create this scraper/parser? I would love to hear from you. Please help me get up to speed with this approach. Regards, Dilbertone

Hello dear community, I have a document that I need to parse, and I want to output only this part of the table: see http://schulnetz.nibis.de/db/schulen/schule.php?schulnr=67003&lschb= How do I parse it? With Perl or PHP? Note that I have the XPaths (see below). Sadly, I cannot apply them with Simple HTML DOM Parser, since that parser works with CSS selectors rather than XPath. I want to get all the data within the table whose class is "fliess". How do I dump all the results? By the way, thinking about the most elegant way, I think it would be to do it with Perl, so I could try HTML::TableExtract or similar. What do you suggest? Which way should I choose to do this (very) simple thing? I look forward to hearing from you!

The XPaths:
    Schule: /html/body/center/table/tbody/tr[2]/td[1]
    Strasse: /html/body/center/table/tbody/tr[3]/td[1]
    Ort: /html/body/center/table/tbody/tr[4]/td[1]
    Tel: /html/body/center/table/tbody/tr[5]/td[1]
    Schulgliederungen: /html/body/center/table/tbody/tr[6]/td[1]
    Besonderheiten: /html/body/center/table/tbody/tr[7]/td[1]
    E-Mail: /html/body/center/table/tbody/tr[8]/td[1]
    Schulnummer: /html/body/center/table/tbody/tr[9]/td[1]

Good day dear community, this is a big issue.
I have to decide between the native PHP DOM extension and Simple HTML DOM Parser. I want to parse this site: http://buergerstiftungen.de/cps/rde/xchg/SID-A7DCD0D1-702CE0FA/buergerstiftungen/hs.xsl/db.htm I would suggest using the native PHP DOM extension instead of Simple HTML DOM Parser, since it will be much faster and easier. What do you think about this one:

Code: [Select]
    $doc = new DOMDocument();
    @$doc->loadHTMLFile('...URL....'); // Using the @ operator to hide parse errors
    $contents = $doc->getElementById('content')->nodeValue; // Text contents of #content

I look forward to hearing from you. Best regards, db1

Hi, I'm doing some work for myself, and because of that I have been reading a lot about PHP. There is something I would like a bit of enlightenment on. My question, as the title says, is about HTML forms using PHP to insert data into MySQL. I have been reading tutorials around the web and even made a few successful tests, but pretty much all tutorials use two files to accomplish this: the HTML file with the form and an insert.php where the actual code is stored. This made me wonder: is this how it's usually done? Overall you would have one file for the form, one for the insert, one for the edit PHP code and one for delete. How do you usually do it? PS: one of the tests I did was making a single file with all of these, using a switch. My only interest in asking this question is to learn how other people do it and to see whether I'm on the right track. Thanks in advance.

I'm using a PHP 5.2 server and Simple HTML DOM 1.5. This script scrapes data from a football site. It works fully on a PHP 5.9 server, but I need to know how I can fix it for the PHP 5.2 server. Can someone give me a hint on how to fix the error? Thanks in advance.
My PHP 5.2 server script output shows:

++++++++++++++++
    Object id #599 Object id #604 Object id #609 Object id #614 Object id #619
    Object id #627 Object id #632 Object id #637 Object id #642 Object id #647
    Object id #655 Object id #660 Object id #665 Object id #670 Object id #675
    Object id #683 Object id #688 Object id #693 Object id #698 Object id #703
    Object id #711 Object id #716 Object id #721 Object id #726 Object id #731
++++++++++++++++

while the PHP 5.9 server says:

++++++++++++++++
    Rk Player Team POS OPPONENT
    1 Aaron Rodgers GB QB at CAR
    2 Tom Brady NE QB vs. SD
    3 Matt Schaub HOU QB at MIA
    4 Michael Vick PHI QB at ATL
++++++++++++++++

I applied the bug fix listed at https://sourceforge.net/tracker/index.php?func=detail&aid=3107230&group_id=218559&atid=1044037 but it is still not working. It says:

++++++++++++++++
Details: I get compiler errors in PHP 5.2 when using this as an object. The offending lines are 609 and 940, which both contain this construct:

    if ($this->size>0) $this->char = $this->doc[0];

This tries to get the first character of $this->doc, but PHP 5.2 sees it as trying to access it as an array. It's easily fixed by this:

    if ($this->size>0) $this->char = substr($this->doc, 0, 1);

Or you could probably use chr(ord($this->doc)) as well. Either way solves the compile error without changing functionality.
++++++++++++++++

Here is my code:

Code: [Select]
    <?php
    # don't forget the library
    include('simple_html_dom.php');

    # this is the global array we fill with article information
    $articles = array();
    $source = 'http://www.athlonsports.com/columns/winning-game-plan/fantasy-football-qb-rankings';

    # passing in the first page to parse, it will crawl to the end
    # on its own
    getArticles($source);

    function getArticles($page) {
        global $articles, $descriptions;

        $html = new simple_html_dom();
        $html->load_file($page);

        //$items = $html->find('div[class=preview]');
        $items = $html->find('tbody tr');

        foreach ($items as $post) {
            # remember comments count as nodes
            /*$articles[] = array($post->children(3)->outertext,
                                  $post->children(6)->first_child()->outertext);*/
            $articles[] = array($post->children(0), $post->children(1), $post->children(2),
                                $post->children(3), $post->children(4));
        }

        # let's see if there's a next page
        if ($next = $html->find('a[class=nextpostslink]', 0)) {
            $URL = $next->href;
            echo "going on to $URL <<<\n";
            # memory leak clean up
            $html->clear();
            unset($html);

            getArticles($URL);
        }
    }
    ?>
    <html>
    <head>
    </head>
    <body>
    <?php echo "Source: " . $source; ?>
    <table cellpadding="5" cellspacing="0" border="0">
    <?php
    foreach ($articles as $item) {
        echo "<tr>";
        echo "<td>" . $item[0] . "</td><td>" . $item[1] . "</td><td>" . $item[2] . "</td>";
        echo "<td>" . $item[3] . "</td><td>" . $item[4] . "</td>";
        echo "</tr>";
    }
    ?>
    </table>
    </body>
    </html>
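The "Object id #..." output suggests that node objects, not their text, end up in the string concatenation on the older server. One way to sidestep that, sketched here under assumptions, is to store plain strings when collecting the cells and to follow the "next" link in a loop instead of recursing. The sketch uses PHP's built-in DOM extension rather than Simple HTML DOM. The $pages array is a made-up stand-in for fetching real URLs, and extractPage() is a hypothetical helper name.

```php
<?php
// Sketch only: $pages is a hypothetical stand-in for HTTP fetches (URL => HTML).
$pages = [
    'page1' => '<table><tbody><tr><td>1</td><td>Aaron Rodgers</td><td>GB</td><td>QB</td><td>at CAR</td></tr></tbody></table>'
             . '<a class="nextpostslink" href="page2">next</a>',
    'page2' => '<table><tbody><tr><td>2</td><td>Tom Brady</td><td>NE</td><td>QB</td><td>vs. SD</td></tr></tbody></table>',
];

// Returns the rows of one page as arrays of strings, plus the next page URL (or null).
function extractPage(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $rows = [];
    foreach ($xpath->query('//tbody/tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);   // a plain string, not a node object
        }
        $rows[] = $cells;
    }

    $next = $xpath->query('//a[@class="nextpostslink"]/@href')->item(0);
    return ['rows' => $rows, 'next' => $next ? $next->nodeValue : null];
}

$articles = [];
$url = 'page1';
while ($url !== null && isset($pages[$url])) {   // iterate instead of recursing
    $page     = extractPage($pages[$url]);
    $articles = array_merge($articles, $page['rows']);
    $url      = $page['next'];
}

foreach ($articles as $item) {
    echo implode(' | ', $item), "\n";
}
```

In the original script, the equivalent change would probably be to store something like $post->children(0)->plaintext instead of the node itself, so that the later echo only ever concatenates strings. The loop also keeps the call depth flat no matter how many pages are followed.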