PHP - Need Help With Simple_html_dom
I am using simple_html_dom.php
I am stuck on how to parse the content below:

Quote
<div id="entry_4" class="entry clearfix ">
  <div class="entry_title clearfix"><h1 class=" ">Smith J</h1></div>
  <div class="full_listing">
    <div class="blocks">
      <div id="entry_4_block_0" class="block indent-level-0">
        <div class="share_link" wpol:entryId="719183066N00W" wpol:contactPointId="719183066N00W">
          <div class="save_menu"><div class="icon"></div></div>
          <div class="share_menu"><div class="icon"></div></div>
          <a class="screen_reader_only" rel="nofollow" href="/mobile/send-to-mobile-accessible?entryId=719183066N00W&listingId=719183066N00W&searchType=R&channel=WP" name="Smith">Send this listing to your mobile</a>
        </div>
        <span class="phone_number ">0457 599 539</span>
        <div class="address">
          <span class="street_line">1 Martin Pl</span>
          <span class="locality">Sydney</span>
          <span class="state">NSW</span>
          <span class="postcode">2000</span>
        </div>
        <a rel="nofollow" class="show_map" name="Smith" href="/search/where-is?locality=Sydney&streetNumber=1&streetName=Martin&streetType=Pl&state=NSW&product=N00W%23719183066N00W%23Smith+J&channel=WP" onclick="return false;">Show map...</a>
      </div>
    </div>
  </div>
</div>

This is what I am trying:

    if(!$html->find('div[id=entry_' .$i.']',0)==""){
        echo "inside0000";
        foreach($html->find('div[id=entry_' .$i.']') as $result){
            $resultdata[]=array(
                'name' => $result->find('h[class=" "]',0)->innertext,
                'streetLine' => $result->find('span[class=street_line]',0)->innertext,
                'locality' => $result->find('span[class=locality]',0)->innertext,
                'state' => $result->find('span[class=state]',0)->innertext,
                'postcode' => $result->find('span[class=postcode]',0)->innertext,
                'phone' => $result->find('span[phone_number ]',0)->innertext
            );

It gets into the "inside0000" branch but doesn't parse the data. Can anyone help me, please?
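Two selectors in the snippet above look like the likely culprits: the heading in the quoted markup is an h1, so 'h[class=" "]' has no tag to match, and 'span[phone_number ]' names no attribute, so it asks for an attribute called phone_number instead of a class. With simple_html_dom, something like 'h1' and 'span[class^=phone_number]' would probably be closer to what was intended. The sketch below shows the same extraction with PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom, so it runs without the library. The trimmed markup, the firstText() helper and the value of $i are assumptions made for illustration.

```php
<?php
// Trimmed copy of the listing markup from the question (assumption: only the
// elements that are actually read are kept).
$markup = '<div id="entry_4" class="entry clearfix ">'
    . '<div class="entry_title clearfix"><h1 class=" ">Smith J</h1></div>'
    . '<span class="phone_number ">0457 599 539</span>'
    . '<div class="address"><span class="street_line">1 Martin Pl</span>'
    . '<span class="locality">Sydney</span><span class="state">NSW</span>'
    . '<span class="postcode">2000</span></div></div>';

// Hypothetical helper: trimmed text of the first node matching $query
// below $context, or null when nothing matches.
function firstText(DOMXPath $xpath, string $query, DOMNode $context): ?string
{
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world markup is rarely valid; collect warnings silently
$doc->loadHTML($markup);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$i = 4; // entry number, as in the question's 'entry_' . $i
$resultdata = [];
foreach ($xpath->query(sprintf('//div[@id="entry_%d"]', $i)) as $entry) {
    $resultdata[] = [
        'name'       => firstText($xpath, './/h1', $entry),
        'streetLine' => firstText($xpath, './/span[@class="street_line"]', $entry),
        'locality'   => firstText($xpath, './/span[@class="locality"]', $entry),
        'state'      => firstText($xpath, './/span[@class="state"]', $entry),
        'postcode'   => firstText($xpath, './/span[@class="postcode"]', $entry),
        // contains() tolerates the trailing space in class="phone_number "
        'phone'      => firstText($xpath, './/span[contains(@class, "phone_number")]', $entry),
    ];
}

print_r($resultdata);
```

Returning null from the helper instead of reading a property on a missing node makes it easier to see which selector failed.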
Similar Tutorials

I successfully load a page with simple_html_dom.php (developed at simplehtmldom.sourceforge.net) as

    $html = file_get_html('externalpage');

But sometimes this puts a high load on the CPU and the page does not load for a long time (probably because of the external site's server). How can I skip the process when it does not behave normally, to avoid high CPU usage?

Gidday all,
My ultimate goal is to parse the data in the first row of the first table and the first row of the second table from http://www.bom.gov.au/products/IDQ60901/IDQ60901.94580.shtml
At present I can only parse the data in the last row of the last table. I got to this point about 2 days ago, and I am unable to find any info on what I need to do to achieve what I want. Some of the info I've found I don't understand. Need newbie help. What do I need to add/change to parse the data in at least the first table row?

Code:

    <?php
    error_reporting(E_ALL);
    include_once('htmldom/simple_html_dom.php');

    $url = 'http://www.bom.gov.au/products/IDQ60901/IDQ60901.94580.shtml';

    // Create DOM from URL
    $html = file_get_html($url);

    foreach($html->find('table tr') as $weather) {
        if($weather->find('th')) {continue;}
        //apparently this needs to be added because there is a bug in simple_html_dom.php
        if(!$weather->find('td ', 0)) {continue;}
        $datetime = $weather->find('td', 0)->plaintext;
        $currentTemp = $weather->find('td', 1)->plaintext;
    }

    print_r('updated:' . ' ' .$datetime);
    print_r ('<br>');
    print_r('CurrentTmp:' . ' ' .$currentTemp);
    print_r ('<br>');
    ?>

I want to remove empty paragraphs from an HTML document using simple_html_dom.php. I know how to do it using the DOMDocument class, but because the HTML files I work with are prepared in MS Word, the DOMDocument's loadHTMLFile() function gives this exception: "Namespaces are not defined". This is the code I use with the DOMDocument object for HTML files not prepared in MS Word:

    <?php
    /* Using the DOMDocument class */

    /* Create a new DOMDocument object. */
    $html = new DOMDocument("1.0", "UTF-8");

    /* Load HTML code from an HTML file into the DOMDocument. */
    $html->loadHTMLFile("HTML File With Empty Paragraphs.html");

    /* Assign all the <p> elements into the $pars DOMNodeList object. */
    $pars = $html->getElementsByTagName("p");

    echo "The initial number of paragraphs is " . $pars->length . ".<br />";

    /* The trim() function is used to remove leading and trailing spaces as well as
     * newline characters. */
    for ($i = 0; $i < $pars->length; $i++){
        if (trim($pars->item($i)->textContent) == ""){
            $pars->item($i)->parentNode->removeChild($pars->item($i));
            $i--;
        }
    }

    echo "The final number of paragraphs is " . $pars->length . ".<br />";

    // Write the HTML code back into an HTML file.
    $html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
    ?>

This is the code I use with the simple_html_dom.php module for HTML files prepared in MS Word:

    <?php
    /* Using simple_html_dom.php */
    include("simple_html_dom.php");

    $html = file_get_html("HTML File With Empty Paragraphs.html");
    $pars = $html->find("p");

    for ($i = 0; $i < count($pars); $i++) {
        if (trim($pars[$i]->plaintext) == "") {
            unset($pars[$i]);
            $i--;
        }
    }

    $html->save("HTML File without Empty Paragraphs.html");
    ?>

It is almost the same, except that the $pars variable is a DOMNodeList when using DOMDocument and an array when using simple_html_dom.php. But this code does not work. First it runs for two minutes and then reports these errors: "Undefined offset: 1" and "Trying to get property of nonobject" for this line: "if (trim($pars[$i]->plaintext == "")) {". Does anyone know how I can fix this? Thank you. I also asked on stackoverflow.

Hello dear PHP experts,
https://europa.eu/youth/volunteering/organisations_en#open
    <?php
    // Report all PHP errors (see changelog)
    error_reporting(E_ALL);

    include('inc/simple_html_dom.php');

    //base url
    $base = 'https://europa.eu/youth/volunteering/organisations_en#open';

    //home page HTML
    $html_base = file_get_html( $base );

    //get all category links
    foreach($html_base->find('a') as $element) {
        echo "<pre>";
        print_r( $element->href );
        echo "</pre>";
    }

    $html_base->clear();
    unset($html_base);
    ?>
I have the above code and I'm trying to get certain elements of the page but it isn't returning anything.
Is it possible that certain PHP functions might be disabled on the server to stop that? The above code works perfectly on other sites.
Is there any workaround?
By the way, I have created a small snippet as a proof of concept that does this with Python and BeautifulSoup:
    import requests
    from bs4 import BeautifulSoup

    url = 'https://europa.eu/youth/volunteering/organisations_en#open'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    print(soup.find('title').text)
    block = soup.find('div', class_="eyp-card block-is-flex")
and this is what it gives:
    European Youth Portal
    >>> block.a
    <a href="/youth/volunteering/organisation/48592_en" target="_blank">"Academy for Peace and Development" Union</a>
    >>> block.a.text
    '"Academy for Peace and Development" Union'
    >>> block.select_one('div > div > p:nth-child(9)')
    <p><strong>PIC:</strong> 948417016</p>
    >>> block.select_one('div > div > p:nth-child(9)').text
    'PIC: 948417016'
What I am aiming for in the end: I want to gather the first 20 results of the page and put them into an SQL database or, alternatively, show the information in a small widget.
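One way the goal above might be approached in PHP is sketched below, using the built-in DOMDocument, DOMXPath and PDO instead of simple_html_dom. Everything here is an assumption made for illustration: the helper names fetchHtml() and extractOrganisations(), the organisations table, the in-memory SQLite database, and the sample markup, which is modelled only on the single card shown in the Python session (a div with the class eyp-card containing a link and a "PIC:" paragraph). The real page may be structured differently. Since the Python request does return the card, the data seems to be present in the served HTML, so an empty result in PHP may come from the fetch itself, for example allow_url_fopen being disabled on the server or the fetch returning false. The "#open" fragment is handled by the browser and is not part of the request.

```php
<?php
// Hypothetical helper: fetch a page with a timeout; returns null on any failure.
function fetchHtml(string $url, int $timeoutSeconds = 10): ?string
{
    if (!ini_get('allow_url_fopen')) {
        return null; // URL wrappers are disabled on this server
    }
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds, 'user_agent' => 'Mozilla/5.0 (compatible; example-fetcher)'],
    ]);
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Hypothetical helper: extract up to $limit organisations (name, link, PIC).
function extractOrganisations(string $html, int $limit = 20): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $rows = [];
    // Match "eyp-card" as a whole word inside the class attribute.
    $cards = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " eyp-card ")]');
    foreach ($cards as $card) {
        if (count($rows) >= $limit) {
            break;
        }
        $link = $xpath->query('.//a[@href]', $card)->item(0);
        if ($link === null) {
            continue;
        }
        $pic = null;
        foreach ($xpath->query('.//p[strong]', $card) as $p) {
            if (preg_match('/PIC:\s*(\d+)/', $p->textContent, $m)) {
                $pic = $m[1];
                break;
            }
        }
        $rows[] = [
            'name' => trim($link->textContent),
            'href' => $link->getAttribute('href'),
            'pic'  => $pic,
        ];
    }
    return $rows;
}

// Sample markup (assumption: modelled on the one card shown in the Python session).
$sample = '<div class="eyp-card block-is-flex"><div><div>'
    . '<a href="/youth/volunteering/organisation/48592_en" target="_blank">"Academy for Peace and Development" Union</a>'
    . '<p><strong>PIC:</strong> 948417016</p>'
    . '</div></div></div>';

$rows = extractOrganisations($sample);
print_r($rows);

// Store the rows if the SQLite PDO driver happens to be available.
$stored = null;
if (class_exists('PDO') && in_array('sqlite', PDO::getAvailableDrivers(), true)) {
    $db = new PDO('sqlite::memory:');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE organisations (name TEXT, href TEXT, pic TEXT)');
    $stmt = $db->prepare('INSERT INTO organisations (name, href, pic) VALUES (:name, :href, :pic)');
    foreach ($rows as $row) {
        $stmt->execute($row);
    }
    $stored = (int) $db->query('SELECT COUNT(*) FROM organisations')->fetchColumn();
}
```

For the live page, the sample string would be replaced by the result of fetchHtml('https://europa.eu/youth/volunteering/organisations_en'), checking for null before parsing. A null result points at the fetch and not at the selectors.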