PHP - DOMDocument - Parser: I Need A Starting Point
Good day dear PHPFreaks - hello to everybody.
I want to create a link parser. I have chosen to do it with cURL. I have some lines together now - love to hear your review... Since I am new to programming I would love to get some hints from experienced devs.

Here some details: we have several hundred result pages derived from this one: http://www.educa.ch/dyn/79362.asp?action=search

Note: I want to iterate over the result pages - with a loop:

http://www.educa.ch/dyn/79376.asp?id=1568
http://www.educa.ch/dyn/79376.asp?id=2149

I take this loop:

Code:
for ($i = 1; $i <= $match[1]; $i++) {
    $url = "http://www.example.com/page?page={$i}";
    // access new sub-page, extract necessary data
}

What do you think? What about the loop over the target URLs? BTW, as you see, some pages will be empty. Note: the empty pages should be thrown away - I do not want to store "empty" stuff.

Well, this is what I want to do, and now I need a good parser script. Note: this is a three-part job:

1. fetching the sub-pages
2. parsing them
3. storing the data in a MySQL db

The problem: some of the above-mentioned pages are empty, so I need to find a solution to leave them aside, since I do not want to populate my MySQL db with too much info. BTW, parsing should be a part that can be done with DOMDocument - what do you think? I need to combine the first part with the second - can you give me some starting points and hints to get this? The fetching job should be done with cURL, and the data then processed in a DOMDocument parsing job. No problem - but how to do the DOMDocument job? I have installed Firebug in Firefox, so I now have the XPaths for these sites:

http://www.educa.ch/dyn/79376.asp?id=1187
http://www.educa.ch/dyn/79376.asp?id=2939
http://www.educa.ch/dyn/79376.asp?id=1515
http://www.educa.ch/dyn/79376.asp?id=1469

Altes Schulhaus Ossingen :: /html/body/div[2]
Guntibachstrasse 10 :: /html/body/div[4]
8475 Ossingen :: /html/body/div[6]
sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
Tel:052 317 15 45 :: /html/body/div[11]
Fax:052 317 04 42 :: /html/body/div[12]

But how to apply this in Simple HTML DOM? I want to use this one: http://simplehtmldom.sourceforge.net/ - I look forward to a hint that gives me a starting point.

Similar Tutorials

Hello dear friends, first of all: merry merry Xmas!!! I want to parse with the Simple HTML DOM Parser; I am pretty new to PHP and to the Simple HTML DOM Parser. My example: http://schulen.bildung-rp.de/gehezu/startseite/einzelanzeige.html?tx_wfqbe_pi1[uid]=60119 - I want to collect the data in the content block. I have investigated the source code and found out that the attribute of interest should be this one:

Code:
<div class="content"><!-- TYPO3SEARCH_begin -->

Here the code is - my trials:

Code:
// include the Simple HTML DOM Parser
include_once('simple_html_dom.php');

// get the file we want to parse right now, create a DOM
$html = file_get_html('');

// simple_html_dom::find() creates a new simple_html_dom object
// that consists of the corresponding child elements
foreach ($html->find('class: content ') as $h3) {
    // simple_html_dom: get the text inside a tag
    if ($h3->innertext == 'Text of a H3 Tag') {
        break;
    }
}

// simple_html_dom::next_sibling() gives the next element
$table = $h3->next_sibling();

But believe me, it does not give back what I am aiming at. What have I done wrong...? dbone
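A likely fix, as a sketch: simple_html_dom uses CSS-style selectors, so a class is matched as 'div.content' (or just '.content'), not 'class: content'. The URL below is the one from the post; verify the live markup before building on this.

Code:
<?php
include_once('simple_html_dom.php');

// fetch and parse the page in one step
$html = file_get_html('http://schulen.bildung-rp.de/gehezu/startseite/einzelanzeige.html?tx_wfqbe_pi1[uid]=60119');
if ($html) {
    // CSS-style selector: every <div> whose class is "content"
    foreach ($html->find('div.content') as $div) {
        echo $div->plaintext . "\n";   // the text inside each matching block
    }
}

With that selector the loop visits every div carrying the "content" class, and plaintext strips the inner tags for you.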
Hello dear freaks, I am currently musing about porting a Python BS4 parser over to PHP - working with the simplehtmldom parser or the DOM selectors (see below). The project: fetch meta-data for a list of WordPress plugins. Approx. 50 plugins are of interest - but the challenge is: I want to fetch the meta-data of all the existing plugins. What I subsequently want to filter out after the fetch is those plugins that have the newest timestamp - that were updated most recently. It is all about actuality...

https://wordpress.org/plugins/participants-database ...and so on and so forth.
https://wordpress.org/plugins/wp-job-manager

We have the following set of meta-data for each WordPress plugin:

Version: 1.9.5.12
Active installations: 10,000+
WordPress Version: 5.0 or higher
Tested up to: 5.4
PHP Version: 5.6 or higher
Tags: database, member, sign-up form, volunteer
Last updated: 19 hours ago
The project consists of two parts: the looping part (which seems to be pretty straightforward) and the parser part, where I have some issues - see below. I'm trying to loop through an array of URLs and scrape the data below from a list of WordPress plugins. See my loop below - as a base I think it is a good starting point to work from the following target URL:
the plugins overview at wordpress.org/plugins/browse/popular, with 99 pages of content: cf ...
The output of text_nodes: ['Version: 1.9.5.12', 'Active installations: 10,000+', 'Tested up to: 5.6'] - but we want to fetch the data of all the WordPress plugins and subsequently sort them to show, let us say, the latest 50 updated plugins. This would be an interesting task:
First of all we need to fetch the URLs, then we fetch the information and have to sort by the newest timestamp, i.e. the plugins that were updated most recently - and list the 50 newest items: the 50 plugins that were updated most recently.
We have the following - see here the soup:

Code:
soup = BeautifulSoup(r.content, 'html.parser')
target = [item.get_text(strip=True, separator=" ")
          for item in soup.find("h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
head = [soup.find("h1", class_="plugin-title").text]
new = [x for x in target if x.startswith(("V", "Las", "Ac", "W", "T", "P"))]
return head + new

with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]
    for future in futures1:
        print(future.result())
See the formal output quoted above.
Background: https://stackoverflow.com/questions/61106309/fetching-multiple-urls-with-beautifulsoup-gathering-meta-data-in-wp-plugins - well, I guess that we can do this with the Simple HTML DOM Parser; here is the selector reference: https://stackoverflow.com/questions/1390568/how-can-i-match-on-an-attribute-that-contains-a-certain-string
I look forward to any hint and help. Have a great day!
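Not a full port, but a minimal sketch of the parser part in plain DOMDocument/DOMXPath. The helper name plugin_meta is hypothetical, and the "plugin-meta" class in the XPath is an assumption about wordpress.org's markup - verify it with the browser inspector before building on it.

Code:
<?php
// Sketch: fetch one plugin page and collect title + meta list.
// ASSUMPTION: the meta data sits in <li> elements inside a container
// whose class contains "plugin-meta" -- check the live markup.
function plugin_meta($url)
{
    $html = @file_get_contents($url);      // could also be a cURL fetch
    if ($html === false) {
        return null;                       // skip pages that fail to load
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html);                // silence warnings on sloppy HTML
    $xpath = new DOMXPath($dom);

    $title = $xpath->evaluate('string(//h1)');
    $meta  = array();
    foreach ($xpath->query('//div[contains(@class, "plugin-meta")]//li') as $li) {
        // collapse whitespace so each entry is a single tidy line
        $meta[] = trim(preg_replace('/\s+/', ' ', $li->textContent));
    }
    return array('title' => trim($title), 'meta' => $meta);
}

print_r(plugin_meta('https://wordpress.org/plugins/participants-database/'));

Sorting by "Last updated" would then be a matter of parsing that meta entry into a timestamp and sorting the collected array, outside the fetch loop.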
Is there any reason people can think of as to why DOMDocument::saveHTML would remove the following?

Code:
<![if !vml]>
<img src="someimage.jpg" />
<![endif]>

A little clarification... this HTML comment tag is used in my company's email newsletter code and is necessary to make Outlook 2007 behave properly. For whatever reason, saveHTML strips it out. I know that this doesn't conform to HTML standards and I'm guessing that that is why it is being stripped. BUT, from reading on the internet, saveHTML can produce junk HTML code anyway. Any help is appreciated.

I have some code:

Code:
$doc = new DOMDocument();
$doc->loadHTML(
    '<html>
    <head><title>Test</title></head>
    <body></body></html>'
);
$doc->encoding = 'iso-8859-1';
file_put_contents('test.html', $doc->saveHTML());

When I view the output file I get

Code:
<html><head><title>Test</title></head><body></body></html>

all on one line. Is there no way of having it format the output like the original source code, so that it's not all bunched together?

Hi guys, just starting to play with PHP DOMDocument, only to fail at the very first step:

Code:
<?php
$html = 'test/php/somefile.html';
if (!empty($html)) {
    $dom_1 = new domDocument;
    $dom_1->loadHTML($html);
    $links = $dom_1->getElementsByTagName('li');
    foreach ($links as $link) {
        // echo $link;
        echo $link->nodeValue, PHP_EOL;
    }
}
?>

When I visit it in a browser I get a WSOD. What am I missing?

Hi all, I am pretty new to PHP and I am having an issue trying to load an XML document. Whenever I try to use XPath it negates all the code below the line, including the HTML, and returns a white page. Here is my code:

Code:
<html>
<head>
<?php $xpath = new DOMXPath("structure.xml"); ?>
<body>
hello world
</body>
</html>

I checked phpinfo() and I have both DOM and XPath enabled and installed. I have also tried using just DOM, and that worked, so it is only XPath that is not working. Ideas? Thank you, James S
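A note on the last two snippets above, offered as a sketch rather than a definitive fix: DOMDocument::loadHTML() parses the string it is given, so handing it a file path parses the path itself (loadHTMLFile() is the call that reads a file); and DOMXPath's constructor requires a DOMDocument object, not a file name - passing a string is a fatal error, which is why the rest of the page never renders. A minimal corrected flow for the XPath case:

Code:
<?php
// load the XML file into a DOMDocument first,
// then build the DOMXPath from that document
$doc = new DOMDocument();
if (!$doc->load('structure.xml')) {
    die('could not load structure.xml');
}
$xpath = new DOMXPath($doc);

// example query: print the name of every element in the document
foreach ($xpath->query('//*') as $node) {
    echo $node->nodeName, "\n";
}
?>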
Hi, I have some PHP code that can extract the categories and display them; however, I still can't extract the numbers that go along with them (without the brackets). This is my code:
Code:
<?php
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.lelong.com.my/Auc/List/BrowseAll.asp");
$finder = new DomXPath($grep);
$class = "CatLevel1";
$nodes = $finder->query("//*[contains(@class, '$class')]");
foreach ($nodes as $node) {
    $span = $node->childNodes;
    echo $span->item(0)->nodeValue . "<br>";
}
?>

This is my desired output:

Arts, Antiques & Collectibles : 9768
B2B & Industrial Products : 2342
Baby : 3453
etc...

Any help is appreciated. Thanks!

Good evening dear PHPFreaks - hello to everybody. Back to the link parser from the opening post: I now have a fuller draft. Note: I've taken the script from this place: http://www.merchantos.com/makebeta/php/scraping-links-with-php/

Code:
function storeLink($url, $gathered_from)
{
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed');
}

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

for ($i = 1; $i <= 10000; $i++) {
    // access new sub-page, extract necessary data
    $target_url = "http://www.educa.ch/dyn/79376.asp?id={$i}";

    // make the cURL request to $target_url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);

    if (!$html) {
        echo "<br />cURL error number:" . curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
        exit;
    }

    // parse the html into a DOMDocument
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // grab all the links on the page
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");

    // use $j here so the outer page counter $i is not clobbered
    for ($j = 0; $j < $hrefs->length; $j++) {
        $href = $hrefs->item($j);
        $url = $href->getAttribute('href');
        storeLink($url, $target_url);
        echo "<br />Link stored: $url";
    }
}

Dear PHPFreaks, what do you think? What about the loop over the target URLs? Love to hear from you!
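One way to handle the "throw away empty pages" requirement from the question, as a sketch: test the fetched HTML before parsing and storing anything. The helper page_has_content is hypothetical, and its heuristic (no links in the body) is an assumption - replace it with whatever actually distinguishes an empty educa.ch result page, e.g. a marker string in the markup.

Code:
<?php
// Hypothetical helper: decide whether a fetched result page is "empty"
// before storing anything. The test (no <a> tags in the body) is an
// ASSUMPTION -- adjust it to the real markup of an empty page.
function page_has_content($html)
{
    if (trim($html) === '') {
        return false;                       // nothing came back at all
    }
    $dom = new DOMDocument();
    if (!@$dom->loadHTML($html)) {
        return false;                       // unparseable response
    }
    $xpath = new DOMXPath($dom);
    return $xpath->evaluate('count(/html/body//a)') > 0;
}

// inside the fetch loop, before parsing/storing:
// if (!page_has_content($html)) { continue; }   // skip empty pages

Using continue instead of exit also means one bad page does not abort the whole 10,000-page run.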
Imagine an HTML with the following structure:

Code:
<div class="item">
    <div class="title">
        <a class="title" href="http://www.domain.com/title.html">Title is here</a>
    </div>
    <div class="image">
        <a href="http://www.domain.com/title.html"><img src=image.jpg /></a>
    </div>
</div>

How to make an array containing $title - $url - $image_url? It is easy to get the image or the link by DOMDocument, but I did not find a way to get the image together with its target link.

Imagine an HTML such as:

Code:
<div class=image>
    <a href='http://site.com'><img src='imagelink.jpg'></a>
</div>

How to get both the image link and the href?

Code:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//div[@class='image']");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);

Now to get the image and its href, we first need getElementsByTagName('a') and getElementsByTagName('img'), but they do not work inside foreach. What's your idea?

I just finished (or so I thought) a project, but my client's server runs PHP4, so I need to adapt my code. Here's what stopped working:

Code:
$localClasses = new DOMDocument;
$localClasses->load("file.xml");
$localClasses->get_elements_by_tagname('Title')->item(0)->firstChild->nodeValue

Here's my petty attempt at adapting this code to run in PHP4:

Code:
$file = file_get_contents("localClasses.xml");
$localClasses = new DOMDocument($file);
$test = $localClasses->get_elements_by_tagname('Title');
$testText = $test->item[0]->firstChild->nodeValue;
print $testText;

This doesn't give me any errors, but nothing shows up. Any help would be appreciated. Thanks for reading!

Code:
$domdoc = new DOMDocument();
$domdoc->formatOutput = TRUE;
$empty_cart_xml =
'<Order>
  <Cart>
    <Items>
      <Item>1</Item>
      <Item>2</Item>
      <Item>3</Item>
    </Items>
  </Cart>
</Order>';
$domdoc->loadXML($empty_cart_xml);
print $domdoc->saveXML() . "<hr/>";
// works up to this point

$xpath = new DOMXPath($domdoc);
$items = $xpath->query('Order/Cart/Items');
foreach ($itemses AS $items) {
    $items->appendChild($domdoc->createElement('Item', '4'));
}
print $domdoc->saveXML();

All I want to do is to add a new Item to Items. What am I doing wrong?

Hi guys, reading this from php.net has got me a wee bit confused; trying to implement it has got me doubly confused! My code:

Code:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile($parent_node);
if ($dom->childNodes <> 0) {
    $kids = array(
        'url' => $parent_node,
        'No_of_kids' => count($dom->childNodes)
    );
}

Results in 'Notice: Object of class DOMNodeList could not be converted to int'. How the heck am I supposed to count the childNodes?

So there's some HTML I'm having to fetch and parse for personal use... but some of the data I want is stored in a table written this way:

Code:
rawTableData = {"rows": [{"colname1": value, "colname2": "value", "colname3: value}

How can I use DOMDocument to parse this data if, say, I want the value for colname2? There are no tags for me to use.
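For the rawTableData question just above: that block is JavaScript, not HTML, so DOMDocument has no tags to address there. One alternative, as a sketch: cut the object literal out of the fetched page with a regex and json_decode() it. This assumes the real blob is valid JSON ending in "};" - the sample shown has unquoted values and a missing quote, which json_decode() would reject, so those would need fixing up first.

Code:
<?php
// $html holds the fetched page source (e.g. from cURL).
// ASSUMPTION: the assignment looks like  rawTableData = { ... };
if (preg_match('/rawTableData\s*=\s*(\{.*?\});/s', $html, $m)) {
    $data = json_decode($m[1], true);          // decode to an associative array
    if ($data !== null && isset($data['rows'])) {
        foreach ($data['rows'] as $row) {
            echo $row['colname2'], "\n";       // the column you were after
        }
    }
}

json_decode() returning null is your signal that the extracted blob was not valid JSON after all.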
Hi, what I'm trying to do is get the contents of heading and text_details and merge them together before adding them as one entry into the database:

Code:
<div class="heading"><p> Second quote: </div>
<div class=text_details> <p> Satan's substitute for repentance is the man's rationalization of evil.
<p></b></div>

My code looks something like this:

Code:
foreach ($dom->getElementsByTagName('div') as $div) {
    foreach ($div->attributes as $attributes) {
        if (strtolower($attributes->name) == 'class') {
            if (strtolower($attributes->value) == 'heading' || strtolower($attributes->value) == 'text_details') {
                $quote = $div->textContent;
                $clean_quote = mysql_real_escape_string($quote);
                echo "the quote is: " . $clean_quote . "<br />";
                mysql_query("INSERT INTO quotes (quote) VALUES ('$clean_quote')") or die(mysql_error());
            }
        }
    }
}

When I do this I obviously get "Second quote of the day" entered into a field by itself and "Satan's substitute for repentance is the man's rationalization of evil." into another field... how to make them one?! Thanks in advance.

I am trying to take a specific link from my site and place it into my database. I only want links that start with CORPSEARCH.ENTITY_INFORMATION?p_nameid= - can someone point me in the right direction here? Code for this is below:

Code:
// make the cURL request to $target_url
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $sql = "INSERT INTO links(cid, nlink) VALUES('$i', '$url')";
    $result = mysql_query($sql);
    echo $result;
    echo $url;
}

Hello dear friends - I want to test whether the DOMDocument class exists. Can I do this in the shell (on openSUSE 11.3)?

Code:
bool class_exists ( string $class_name [, bool $autoload = true ] )

e.g. class_exists('DOMDocument'). Or do I have to create a file that I then call in the shell?! I look forward to an idea / hint / tip. Regards, db1

Hi, I have an RSS feed cached as an XML file. I need to pull some info out of it so I can then print it to the page. Currently I am using this code to extract the data:

Code:
$doc = new DOMDocument();
$doc->load('inthenews.xml');
$inthenews = array();
foreach ($doc->getElementsByTagName('item') as $node) {
    $itemRSS = array(
        'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
        'link'  => $node->getElementsByTagName('link')->item(0)->nodeValue,
        'desc'  => $node->getElementsByTagName('description')->item(0)->nodeValue,
    );
    array_push($inthenews, $itemRSS);
}

However, the description node contains more than I want. I need to remove everything except for the image (<img ... />) it contains. Is there some way of running preg or similar on the nodeValue as it is extracted? Or an alternative to getElementsByTagName that allows searching for strings? If not, does anyone have a suggestion for doing this? I tried running preg_replace on the array, but it doesn't seem to do anything??

An example of the array created by my code above is shown below:

Code:
[0] => Array
    (
        [title] => BBC radio Cambridge 7.20am
        [link] => images/news/Matthew_Freeman_Radio_Cambridgeshire_21-10-10.mp3
        [desc] => <img alt="BBC-logo" src="http://www2.mrc-lmb.cam.ac.uk/images/news/BBC-logo.jpg" height="54" width="127" /><br/>BBC radio Cambridge 7.20am 21.10.10: Dr Matthew Freeman"<br/> 21 October 2010
    )

Thanks in advance for any advice. Phil
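For the RSS question above: rather than running preg on the node value, one option is to parse each description fragment with its own DOMDocument and keep only the first <img>. A sketch - the helper name extract_img is hypothetical, and note that passing a node to saveHTML() needs PHP 5.3.6 or newer (on older versions use saveXML($img) instead):

Code:
<?php
// Hypothetical helper: reduce a description's HTML to just its first <img> tag.
function extract_img($descriptionHtml)
{
    $dom = new DOMDocument();
    // the XML prolog hints the fragment's encoding to the parser
    @$dom->loadHTML('<?xml encoding="utf-8"?>' . $descriptionHtml);
    $img = $dom->getElementsByTagName('img')->item(0);
    return $img ? $dom->saveHTML($img) : '';   // '' if no image was found
}

// e.g. inside the existing loop, after building $itemRSS:
// $itemRSS['desc'] = extract_img($itemRSS['desc']);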