PHP - Scrape Website With Curl
Hi!
I'm working on a little project where I want to retrieve data from the Swedish news website DN's SOS Live page (http://www.dn.se/nyheter/soslive). On the page there is an iFrame, and that is where the data I want to retrieve lives. The iFrame address is: http://div.dn.se/dn/sos/soslive.php?id= ... er/soslive

Here is the code I have; it stopped working yesterday.

Code: [Select]
function curl_download($Url){
    // is cURL installed yet?
    if (!function_exists('curl_init')){
        die('Sorry cURL is not installed!');
    }
    // OK cool - then let's create a new cURL resource handle
    $ch = curl_init();
    // Now set some options (most are optional)
    // Set URL to download
    curl_setopt($ch, CURLOPT_URL, $Url);
    // Set a referer
    curl_setopt($ch, CURLOPT_REFERER, "http://www.dn.se");
    // User agent
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    // Include header in result? (0 = no, 1 = yes)
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Should cURL return or print out the data? (true = return, false = print)
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Timeout in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    // Download the given URL, and return output
    $output = curl_exec($ch);
    // Close the cURL resource, and free system resources
    curl_close($ch);
    return $output;
}
$sosURL = 'http://div.dn.se/dn/sos/soslive.php?id=p://www.dn.se/nyheter/soslive';
$data = curl_download($sosURL);

The data variable is empty. I notice now that when I enter the web address "http://div.dn.se/dn/sos/soslive.php?id=p://www.dn.se/nyheter/soslive" in a browser there is no content, although this is the address used in the iFrame. How has DN solved this and how could I get around it?

Best regards
Stefan

Similar Tutorials

I'm just doing a simple scrape on a web page. It works great, except that the section I scrape is just 10 words inside a table which are updated every few hours. The problem is that my scrape seems to only grab what I found the first time I ran my script, despite the new words having been updated a couple of hours ago.
I've tested everything script wise and looked at the html source to be sure everything is showing up... just wondering, is there any thing you can think of that would allow for caching (from source site or the php script) of the words on a page when running my script? I even opened 2 different browsers and ran the script fresh from these browsers that had never ran it. When I echo the results, I still get the original result set. I'm perplexed. I'm trying to get a value which changes but is always contained the same way this is the text Next: Brigadier (81654 of 220000) as you can see below this is the value i need to get I just can't figure out the RegEx for it? help please! Smaller CODE Code: [Select] <div class="rankBar"> <p class="grade"><span id="ctl00_mainContent_identityBar_currentGlobalRankLabel" class="current">Colonel Grade 3</span><span id="ctl00_mainContent_identityBar_nextGlobalRankLabel" class="future">Next: Brigadier (81654 of 220000)</span></p> <div class="rankMeter"> <div id="ctl00_mainContent_identityBar_rankBarPanel" class="bar" style="width:37%;"> <span></span> </div> </div> FULL CODE Code: [Select] <div id="ctl00_mainContent_identityBar_idbarPanel"> <div class="glowBox idRankContainer"> <div class="corner bottomLeft"></div> <div class="lattice"></div> <div class="content"> <a href="/Stats/Reach/default.aspx?player=l RaH l" id="ctl00_mainContent_identityBar_avatar" class="avatar"><img id="ctl00_mainContent_identityBar_emblemImg" src="/Stats/emblem.ashx?s=70&0=0&1=0&2=0&3=0&fi=0&bi=0&fl=0&m=3" style="height:70px;width:70px;border-width:0px;"></a> <div class="idRank"> <div class="userInfo"> <img id="ctl00_mainContent_identityBar_namePlateImg" class="img_plate" src="/images/reachstats/nameplates/512.png" alt="" style="height:30px;width:60px;border-width:0px;"> <a href="/projects/reach/article.aspx?ucc=faq&cid=28756"><img id="ctl00_mainContent_identityBar_currentGlobalRankImage" title="Colonel Grade 3" class="img_rankIcon" 
src="/images/reachstats/grades/med/A4BF62C6-1E3F-468A-9A73-8237600B2AD3.png" alt="Colonel Grade 3" style="height:40px;width:40px;border-width:0px;"></a> <h2><a href="/Stats/Reach/default.aspx?player=l RaH l" id="ctl00_mainContent_identityBar_gamerTag">l RaH l</a> - FLIR</h2> </div> <div class="rankBar"> <p class="grade"><span id="ctl00_mainContent_identityBar_currentGlobalRankLabel" class="current">Colonel Grade 3</span><span id="ctl00_mainContent_identityBar_nextGlobalRankLabel" class="future">Next: Brigadier (81654 of 220000)</span></p> <div class="rankMeter"> <div id="ctl00_mainContent_identityBar_rankBarPanel" class="bar" style="width:37%;"> <span></span> </div> </div> </div> </div> </div> </div> <div class="glowBox shields"> <div class="corner topRight"></div> <div class="lattice"></div> <div class="content"> <ul> <li class="campaign"> <p>Campaign</p> <img id="ctl00_mainContent_identityBar_spCampaignImage" title="Completed Legendary Difficulty" alt="Campaign Progress" src="/images/reachstats/campaign_progress/legendary.png" style="height:59px;width:54px;border-width:0px;"> </li> <li class="coop"> <p>Co-op</p> <img id="ctl00_mainContent_identityBar_coopCampaignImage" title="Completed Normal Difficulty" alt="Coop Campaign Progress" src="/images/reachstats/campaign_progress/normal.png" style="height:59px;width:54px;border-width:0px;"> </li> <li class="arena_display"> <p>Arena</p> <a id="ctl00_mainContent_identityBar_arenaLink" href="/stats/reach/careerstats/playlists.aspx?player=l RaH l&vc=2"><img id="ctl00_mainContent_identityBar_currentBestArenaDivisionImg" alt="" src="/images/reachstats/arena_div/0.png" style="height:47px;width:48px;border-width:0px;"></a> <div class="glowBox popOut po_arenaSeason2"> <div class="corner topRight"></div> <div class="corner bottomLeft"></div> <div class="content"> <h4>Arena Season 6</h4> <div class="twoColumn"> <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl00_arenaImage" src="/images/reachstats/arena_div/0.png" 
style="height:47px;width:48px;border-width:0px;"> <h5>Doubles Arena</h5> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl00_divisionName">Not Qualified Yet</p> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl00_lastSeasonControl">Last Season: <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl00_lastSeasonImage" src="/images/reachstats/arena_div/0.png" style="height:23px;width:23px;border-width:0px;"> </p> </div> <div class="rule_transparent"></div> <div class="twoColumn"> <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl02_arenaImage" src="/images/reachstats/arena_div/0.png" style="height:47px;width:48px;border-width:0px;"> <h5>FFA Arena</h5> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl02_divisionName">Not Qualified Yet</p> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl02_lastSeasonControl">Last Season: <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl02_lastSeasonImage" src="/images/reachstats/arena_div/0.png" style="height:23px;width:23px;border-width:0px;"> </p> </div> <div class="rule_transparent"></div> <div class="twoColumn"> <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl04_arenaImage" src="/images/reachstats/arena_div/0.png" style="height:47px;width:48px;border-width:0px;"> <h5>Team Arena</h5> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl04_divisionName">Not Qualified Yet</p> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl04_lastSeasonControl">Last Season: <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl04_lastSeasonImage" src="/images/reachstats/arena_div/0.png" style="height:23px;width:23px;border-width:0px;"> </p> </div> </div> </div> </li> </ul> </div> </div> </div> I need to scrape a Chinese website. (I guess there is no difference between scraping a Chinese website and a normal one?) It's my suppliers weblink. They have have told me to download images and text from their website profile link off 1688.com. 
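On the earlier rank-label question (getting the values out of `Next: Brigadier (81654 of 220000)`): a minimal sketch with `preg_match`, assuming the label always has the shape `Next: <rank> (<current> of <total>)`. The function name `parse_next_rank` is made up for illustration.

```php
<?php
// Hypothetical helper: pulls the rank name and both numbers out of text
// shaped like "Next: Brigadier (81654 of 220000)".
function parse_next_rank(string $html): ?array
{
    // Rank name: anything up to the opening parenthesis (but no tags),
    // then "<digits> of <digits>" inside the parentheses.
    $pattern = '/Next:\s*([^<(]+?)\s*\((\d+)\s+of\s+(\d+)\)/';
    if (preg_match($pattern, $html, $m) !== 1) {
        return null; // label missing or in an unexpected shape
    }
    return [
        'rank'    => $m[1],
        'current' => (int) $m[2],
        'total'   => (int) $m[3],
    ];
}

$sample = '<span id="ctl00_mainContent_identityBar_nextGlobalRankLabel" class="future">'
        . 'Next: Brigadier (81654 of 220000)</span>';
$result = parse_next_rank($sample);
print_r($result);
```

Because the pattern keys on the visible text rather than on the surrounding markup, it should keep working if the `span` attributes change, but it will return `null` if the wording of the label changes.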
There is an API, but from what I've read it's pants, and my virus checker doesn't allow me to visit the API doc page. What tool should I use? I've got lots of experience coding, but I'm a master of none. Maybe I still fall into the newbie category. LOL. I saw a link from an article that gave these names:
Goutte Which should I consider? IMPORTANT: I need to download and then upload into my Woocommerce website. I need images + variation details. I'll have a little file with translations, so if XYZ is found in Chinese, then it is replaced with this. I thought I would mention the extra detail incase it was relevant to considering which scraping tool to pick. Thanks! Edited July 11, 2019 by acebase OK, I have the initial cURL working but need to figure out how to extract data I want off that webpage to display or store in a database, I tried using dom and xpath, but because of the way the page displays using css, i think its not picking it up. Here is my cURL script: <?php $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)'; $target_url = "www.test.com"; $ch = curl_init(); curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); curl_setopt($ch, CURLOPT_URL,$target_url); curl_setopt($ch, CURLOPT_FAILONERROR, true); curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); curl_setopt($ch, CURLOPT_AUTOREFERER, true); curl_setopt($ch, CURLOPT_RETURNTRANSFER,true); curl_setopt($ch, CURLOPT_TIMEOUT, 10); $html = curl_exec($ch); if (!$html) { echo "<br />cURL error number:" .curl_errno($ch); echo "<br />cURL error:" . curl_error($ch); exit; } // parse the html into a DOMDocument $dom = new DOMDocument(); $dom->loadHTML($html); // grab all the on the page $xpath = new DOMXPath($dom); $hrefs = $xpath->evaluate("/html/body//td"); for ($i = 0; $i < $hrefs->length; $i++) { $href = $hrefs->item($i); $url = $href->getAttribute('href'); storeLink($url,$target_url); echo "<br />Link stored: $url"; } ?> and here is a snippet of the source of the page I am getting: <span id="lblTest"><h1 id='surrZipTitle'>Agents in Surrounding Zip Codes</h1><table cellpadding='0' cellspacing='0' border='0' class='tblDent'><tr><td class='tdEliteTitle'><span class='caaSubHead3 addwidth'>H.K. 
Dent Elite</span></td></tr><tr><td class='tdEliteContent'><table cellpadding='0' cellspacing='0' border='0'><tr><td valign='top'><span class='caaAgencyName2 addwidth'>PROFESSIONAL INS ASSOC, INC.</span></td><td valign='top'> </td></tr></table><table cellpadding='0' cellspacing='0' border='0'><tr><td width='360px' valign='top'><div class='addressBlock'><span>4444 MANZANITA AVE STE 6</span><br /><span>CARMICHAEL , CA 95608-1488</span><br /><a class='faaBlueLink' id='lnkContact' href='http://www.safeco.com/portal/server.pt/gateway/PTARGS_0_20656_395_362_0_43/http%3B/por-portlets-prd.int.apps.safeco.com%3B13425/dotcom/FindAnAgent/find-an-agent/contactanagent.aspx?RequestType=agency&level=elite&Id=0415199904150295&lat=38.646142&lng=-121.327623' onclick='oOobj4.Preferences.Plugins.Events.poX=0;'>Contact & Directions</a> <a class='faaBlueLink' id='lnkWebSite' style='display: none;' href='http://' target='_blank' onclick="return trackEvent('/External-Link/AgentWebsite/ ','PROFESSIONAL INS ASSOC, INC. ');">Website</a></div></td><td valign='top'> </td></tr></table></td></tr></table><table cellpadding='0' cellspacing='0' border='0' class='tblDent'><tr><td class='tdEliteTitle'><span class='caaSubHead3 addwidth'>H.K. 
Dent Elite</span></td></tr><tr><td class='tdEliteContent'><table cellpadding='0' cellspacing='0' border='0'><tr><td valign='top'><span class='caaAgencyName2 addwidth'>AMERICAN AIM AUTO INS AGY, INC</span></td><td valign='top'> </td></tr></table><table cellpadding='0' cellspacing='0' border='0'><tr><td width='360px' valign='top'><div class='addressBlock'><span>5339 SAN JUAN AVE</span><br /><span>FAIR OAKS , CA 95628-3318</span><br /><a class='faaBlueLink' id='lnkContact' href='http://www.safeco.com/portal/server.pt/gateway/PTARGS_0_20656_395_362_0_43/http%3B/por-portlets-prd.int.apps.safeco.com%3B13425/dotcom/FindAnAgent/find-an-agent/contactanagent.aspx?RequestType=agency&level=elite&Id=0415911704151222&lat=38.66237&lng=-121.292429' onclick='oOobj4.Preferences.Plugins.Events.poX=0;'>Contact & Directions</a> So basically I want to extract the agency name like "<span class='caaAgencyName2 addwidth'>PROFESSIONAL INS ASSOC, INC.</span>" and the address which always use the same div class like "caaAgencyName2" and "addressBlock". How can this be accomplished? Hello, I am trying to use cURL to login to a website, but I can't seem to get it working. 
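On the agency-listing question above: the posted loop selects `td` elements and then reads an `href` attribute, which table cells do not have, so that may be why nothing useful comes back. A minimal sketch that queries by class name instead, assuming every `caaAgencyName2` span has a matching `addressBlock` div and that both appear in the same order on the page. The function names `class_query` and `extract_agents` are hypothetical.

```php
<?php
// Builds an XPath test that matches one class name inside a multi-class attribute.
function class_query(string $tag, string $class): string
{
    return sprintf("//%s[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]", $tag, $class);
}

// Hypothetical helper: returns a list of ['name' => ..., 'address' => [...]] entries.
function extract_agents(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // scraped HTML is rarely valid; keep warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath  = new DOMXPath($dom);
    $names  = $xpath->query(class_query('span', 'caaAgencyName2'));
    $blocks = $xpath->query(class_query('div', 'addressBlock'));

    $agents = [];
    for ($i = 0; $i < $names->length; $i++) {
        $lines = [];
        $block = $blocks->item($i); // assumption: same order as the names
        if ($block !== null) {
            // Address lines are the <span> children of the address block.
            foreach ($xpath->query('./span', $block) as $span) {
                $lines[] = trim($span->textContent);
            }
        }
        $agents[] = ['name' => trim($names->item($i)->textContent), 'address' => $lines];
    }
    return $agents;
}

$sample = "<table><tr><td><span class='caaAgencyName2 addwidth'>PROFESSIONAL INS ASSOC, INC.</span></td></tr></table>"
        . "<div class='addressBlock'><span>4444 MANZANITA AVE STE 6</span><br />"
        . "<span>CARMICHAEL , CA 95608-1488</span><br /><a href='#'>Contact</a></div>";
$agents = extract_agents($sample);
print_r($agents);
```

CSS only affects how a browser renders the page; the parser sees the same elements either way, so the class names in the source are enough to select on.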
Website I'm trying to login to: http://www.uniquearticlewizard.com/amember/member.php Here is what their form code looks like: Code: [Select] <form name="login" method="post" action="/amember/member.php"> <table class="vedit" > <tr> <th>Username</th> <td><input type="text" name="amember_login" size="15" value="" /></td> </tr> <tr> <th>Password</th> <td><input type="password" name="amember_pass" size="15" /></td> </tr> <tr> <td colspan="2" style='padding:0px; padding-bottom: 2px;'> <input type="checkbox" name="remember_login" value="1"> <span class="small">Remember my password?</span> </td> </tr> </table> <input type="hidden" name="login_attempt_id" value="1291657877" /> <br /> <span class='button'><input type="submit" value=" Login " /></span> <span class='button'><input type="button" value=" Back " onclick="history.back(-1)" /></span> </form> As you can see they are using a javascript button to submit the form, which doesn't have a name attribute. So I'm not sure how to get around this and tell cURL to submit the form. When I Googled I found something that said just submit the other information and it will submit itself, but I'm not sure if that's right. Here is my attempt, but I just get a blank screen. I think the script is working, but something on there end is exiting out due to me not supplying a required piece of information. I'm not sure what that is though. 
Code: [Select] <?php set_time_limit(0); $options = array( CURLOPT_RETURNTRANSFER => true, // return web page CURLOPT_HEADER => false, // don't return headers CURLOPT_FOLLOWLOCATION => true, // follow redirects CURLOPT_ENCODING => "", // handle all encodings CURLOPT_USERAGENT => "spider", // who am i CURLOPT_AUTOREFERER => true, // set referer on redirect CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect CURLOPT_TIMEOUT => 120, // timeout on response CURLOPT_MAXREDIRS => 10, // stop after 10 redirects ); $ch = curl_init( "http://www.uniquearticlewizard.com/amember/member.php" ); curl_setopt_array( $ch, $options ); $content = curl_exec( $ch ); $err = curl_errno( $ch ); $errmsg = curl_error( $ch ); $header = curl_getinfo( $ch ); curl_close( $ch ); $header['content'] = $content; preg_match('/name="login_attempt_id" value="(.*)" \/>/', $header['content'], $form_id); $value = $form_id[1]; $ch = curl_init(); // SET URL FOR THE POST FORM LOGIN curl_setopt($ch, CURLOPT_URL, 'http://www.uniquearticlewizard.com/amember/member.php'); // ENABLE HTTP POST curl_setopt ($ch, CURLOPT_POST, 1); $data = array('amember_login' => '*****', 'amember_pass' => '*****', 'login_attempt_id' => $value, 'remember_login' => '1'); // SET POST PARAMETERS : FORM VALUES FOR EACH FIELD curl_setopt($ch, CURLOPT_POSTFIELDS, $data); // IMITATE CLASSIC BROWSER'S BEHAVIOUR : HANDLE COOKIES curl_setopt ($ch, CURLOPT_COOKIEJAR, 'cookie.txt'); # Setting CURLOPT_RETURNTRANSFER variable to 1 will force cURL # not to print out the results of its query. # Instead, it will return the results as a string return value # from curl_exec() instead of the usual true/false. curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1); // EXECUTE 1st REQUEST (FORM LOGIN) $store = curl_exec($ch); echo $store; curl_close ($ch); ?> They do have a form value that changes on every page refresh, it just tracks the login attempt (which is a long number). I was able to scrape that and put it in the form with the correct value. 
I thought adding that would successfully log me in, but apparently there is something else going on. Any help would be greatly appreciated!

I am trying to create a remote login to another website from my own. The users will need to enter their username and password on my site, and if they are registered on my website, their login credentials will be sent to the other website and a page will be retrieved.
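On the aMember login attempt above: one possible cause (an assumption, not verified against that site) is that the request which scrapes `login_attempt_id` does not store any cookies, so the POST arrives in a different session from the one the token was issued for. The second request also sets only `CURLOPT_COOKIEJAR` (write) and not `CURLOPT_COOKIEFILE` (read). A minimal sketch that shares one cookie file across both requests and copies every hidden field from the form; the helper names and the shape of `$credentials` are hypothetical.

```php
<?php
// Collects name => value for every <input type="hidden"> found in the HTML.
function hidden_fields(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $fields = [];
    $xpath  = new DOMXPath($dom);
    foreach ($xpath->query("//input[@type='hidden'][@name]") as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Performs one request that both reads and writes the same cookie file.
function request(string $url, string $cookieFile, ?array $post = null)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // previously saved cookies are sent from here
        CURLOPT_TIMEOUT        => 30,
    ]);
    if ($post !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        // A urlencoded string is sent as a normal form post;
        // passing the array itself would send multipart/form-data.
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
    }
    $body = curl_exec($ch);
    $ch = null; // release the handle so the cookie jar is written before the next request
    return $body;
}

// Fetch the form first (to get cookies and hidden fields), then post the credentials.
function login(string $formUrl, array $credentials, string $cookieFile)
{
    $page = request($formUrl, $cookieFile);
    if ($page === false) {
        return false;
    }
    return request($formUrl, $cookieFile, array_merge(hidden_fields($page), $credentials));
}
```

A call might look like `login('http://www.uniquearticlewizard.com/amember/member.php', ['amember_login' => '*****', 'amember_pass' => '*****'], 'cookie.txt')`. Copying hidden fields automatically also picks up values that are easy to leave out of a hand-written POST string.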
I am stuck at sending the users' data to the original site. The original site's page source is this:
<form method=post> <input type="hidden" name="action" value="logon"> <table border=0> <tr> <td>Username:</td> <td><input name="username" type="text" size=30></td> </tr> <tr> <td>Password:</td> <td><input name="password" type="password" size=30></td> </tr> <td></td> <td align="left"><input type=submit value="Sign In"></td> </tr> <tr> <td align="center" colspan=2><font size=-1>Don't have an Account ?</font> <a href="?action=newuser"><font size=-1 color="#0000EE">Sign UP Now !</font></a></td> </tr> </table>I have tried this code, but not works. <?php $username="username"; $password="password"; $url="http://www.example.com/index.php"; $postdata = "username=".$username."&password=".$password; $ch = curl_init(); curl_setopt ($ch, CURLOPT_URL, $url); curl_setopt ($ch, CURLOPT_SSL_VERIFYPEER, FALSE); curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"); curl_setopt ($ch, CURLOPT_TIMEOUT, 60); curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1); curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1); curl_setopt ($ch, CURLOPT_REFERER, $url); curl_setopt ($ch, CURLOPT_POSTFIELDS, $postdata); curl_setopt ($ch, CURLOPT_POST, 1); $result = curl_exec ($ch); header('Location: track.html'); //echo $result; curl_close($ch); ?>Any help would be appreciated, Thanks in advance. Hi, I'm trying to auto login to a website(created by me ;-) ) using the curl. as I am new to this I don't know how to make this possible. following is the code I tried but this is not submitting the data in the other site. the 'usr_name' and 'password' the field names in the page "http://localhost/myproject/users/login". and I have given a print_r in that site and it is displaying Array ( [loginType] => L [step] => confirmation [usr_name] => dasd@hotmail.com [password] => test123 ) but not submitting the form. please help me.... this is the code i've tried..I got this from web... 
$login = "http://localhost/myproject/users/login"; $param="loginType=L&step=confirmation&usr_name=dasd@hotmail.com&password=test123"; $c = curl_init(); curl_setopt($c, CURLOPT_URL, $login); curl_setopt($c, CURLOPT_COOKIEJAR, "cookies.txt"); curl_setopt($c, CURLOPT_COOKIEFILE, "cookies.txt"); curl_setopt($c, CURLOPT_POST, 1); curl_setopt($c, CURLOPT_POSTFIELDS, $param); curl_setopt($c, CURLOPT_RETURNTRANSFER, 1); curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1); echo curl_exec($c); thanks in advance.... Hello! I would like to use cURL to login to the website: lockerz.com I have some code, but it doesn't seem to work: <?php // INIT CURL $ch = curl_init(); // SET URL FOR THE POST FORM LOGIN curl_setopt($ch, CURLOPT_URL, 'http://lockerz.com/auth/login'); // ENABLE HTTP POST curl_setopt ($ch, CURLOPT_POST, 1); // SET POST PARAMETERS : FORM VALUES FOR EACH FIELD curl_setopt ($ch, CURLOPT_POSTFIELDS, 'email-email=EMAIL@hotmail.com&password-password=PASSWPRD'); // IMITATE CLASSIC BROWSER'S BEHAVIOUR : HANDLE COOKIES curl_setopt ($ch, CURLOPT_COOKIEJAR, 'cookie.txt'); # Setting CURLOPT_RETURNTRANSFER variable to 1 will force cURL # not to print out the results of its query. # Instead, it will return the results as a string return value # from curl_exec() instead of the usual true/false. curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1); // EXECUTE 1st REQUEST (FORM LOGIN) $store = curl_exec ($ch); // SET FILE TO DOWNLOAD curl_setopt($ch, CURLOPT_URL, 'http://lockerz.com/auction'); // EXECUTE 2nd REQUEST (FILE DOWNLOAD) $content = curl_exec ($ch); // CLOSE CURL curl_close ($ch); echo $content; ?> Thank you very much if you can help! Hello, I'm using curl to grab a new solar image once an hour or so from the Solar Dynamics Observatory (example below). I'm trying to archive new images and am struggling with that. If I download an image, the filetime() function returns the current time since I downloaded it and wrote it to a fresh file. 
The result is that the file is always "new", even if the image hasn't changed on the SDO website. Do you have an idea how to check the last-modified time of a file through curl or other means so that I'm not downloading duplicate images? Thanks a ton!

//fetch image
$ch = curl_init("http://www.somewebsite.com/theimage.jpg");
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
$file = "../images/latest/theimage.jpg";
$fp = fopen($file, "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
//this part is no good...
//get last modified date
if (file_exists($file)) {
    $filetime = filemtime($file);
}

Hi all, I'm PHP stupid, but from what I've read it's what I need. I am looking to grab just the number this page outputs, http://api.radioreference.com/audio/listeners.php?feedId=2798, and put it on a page for some tracking software. When you view the source of that page it needs to show the number and not the code that produces it, so JavaScript is out of the question. Can anyone help me?

Good day dear community, I am working on a cURL loop to fetch multiple pages. I have some examples and a question.

Example: if we want to get information from 3 sites with cURL we can do it like so:

$list[1] = "http://www.example1.com";
$list[2] = "ftp://example.com";
$list[3] = "http://www.example2.com";

After creating the list of links we should initialize the cURL multi handle and add the cURL handles.

$curlHandle = curl_multi_init();
for ($i = 1;$i <= 3; $i++)
    $curl[$i] = addHandle($curlHandle,$list[$i]);

Now we should execute the cURL multi handle and retrieve the content from the sub handles that we added to it.

ExecHandle($curlHandle);
for ($i = 1;$i <= 3; $i++) {
    $text[$i] = curl_multi_getcontent ($curl[$i]);
    echo $text[$i];
}

In the end we should release the handles from the cURL multi handle by calling curl_multi_remove_handle and close the cURL multi handle!
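The steps above can be sketched as follows. The helpers `build_urls` and `fetch_all` are hypothetical names, and generating the URL list by appending numeric IDs to a base URL (with an `http://` scheme added) is an assumption for illustration.

```php
<?php
// Hypothetical helper: builds one URL per ID, keyed by the ID.
function build_urls(string $base, array $ids): array
{
    $urls = [];
    foreach ($ids as $id) {
        $urls[$id] = $base . rawurlencode((string) $id);
    }
    return $urls;
}

// Fetches all URLs in parallel and returns the bodies under the same keys.
function fetch_all(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return data as a string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive the transfers; wait for socket activity instead of spinning.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000); // select can fail on some systems; back off briefly
        }
    } while ($active && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);
    return $results;
}

// Only the URL list is built here; no request is sent.
$urls = build_urls('http://siteone.example/?show_subsite=', [9009, 9742, 9871]);
print_r($urls);
```

For several hundred pages it is probably kinder to the target server to call `fetch_all()` on small batches, for example `array_chunk($urls, 10, true)`, than to open every connection at once.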
If we want to another Fetch of sites with cURL-Multi - since this is the most pretty way to do it! Well I am not sure bout the string concatenation. How to do it - Note I want to fetch several hundred pages: see the some details for this target-server sites - /(I have to create a loop over several hundred sites). * siteone.example/?show_subsite=9009 * siteone.example/?show_subsite=9742 * siteone.example/?show_subsite=9871 .... and so on and so forth Question: How to appy this loop into the array of the curl-multi? <?php /************************************\ * Multi interface in PHP with curl * * Requires PHP 5.0, Apache 2.0 and * * Curl * ************************************* * Writen By Cyborg 19671897 * * Bugfixed by Jeremy Ellman * \***********************************/ $urls = array( "siteone", "sitetwo", "sitethree" ); $mh = curl_multi_init(); foreach ($urls as $i => $url) { $conn[$i]=curl_init($url); curl_setopt($conn[$i],CURLOPT_RETURNTRANSFER,1);//return data as string curl_setopt($conn[$i],CURLOPT_FOLLOWLOCATION,1);//follow redirects curl_setopt($conn[$i],CURLOPT_MAXREDIRS,2);//maximum redirects curl_setopt($conn[$i],CURLOPT_CONNECTTIMEOUT,10);//timeout curl_multi_add_handle ($mh,$conn[$i]); } do { $n=curl_multi_exec($mh,$active); } while ($active); foreach ($urls as $i => $url) { $res[$i]=curl_multi_getcontent($conn[$i]); curl_multi_remove_handle($mh,$conn[$i]); curl_close($conn[$i]); } curl_multi_close($mh); print_r($res); ?> I look forward to your ideas. Dear all, I write this code to extract the widget from this page:http://www.widgetbox.com/widget/accuwidget The widget information is hidden under the tag <iframe> and is inside the src. 
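Reading the `src` of an iframe with `DOMXPath` can be sketched as below. Note that `query()` returns a `DOMNodeList` rather than a single element, so an item has to be taken from the list before `getAttribute()` can be called. The function name `iframe_sources` and the sample markup are illustrative assumptions, not the real page.

```php
<?php
// Hypothetical helper: returns the src attribute of every iframe matched by $query.
function iframe_sources(string $html, string $query = '//iframe'): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate messy markup without warnings
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath   = new DOMXPath($dom);
    $sources = [];
    // Iterate over the node list; each item is a DOMElement with getAttribute().
    foreach ($xpath->query($query) as $node) {
        if ($node instanceof DOMElement && $node->hasAttribute('src')) {
            $sources[] = $node->getAttribute('src');
        }
    }
    return $sources;
}

$sample = '<div id="preview-div"><div><iframe src="http://example.com/widget?id=1"></iframe></div></div>';
$found  = iframe_sources($sample, "//div[@id='preview-div']/div/iframe");
print_r($found);
```

If only the first match is wanted, `$xpath->query(...)->item(0)` gives the element, or `null` when nothing matched, so it is worth checking for `null` before reading the attribute.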
I try using this code and it always show me error of: Fatal error: Call to undefined method DOMNodeList::getAttribute() Code: [Select] <?php get(); function get(){ $url = "http://www.widgetbox.com/widget/accuwidget"; $tidy = new tidy(); $repaired = $tidy->repairfile($url); //The code is dirty, so it need to be tidy $xml = new DOMDocument(); $xml->loadHTML($repaired); $xpath = new DOMXpath($xml); $cloud = $xpath->query("//div[@id='preview-div']/div/iframe"); $widget = $cloud->getAttribute("src"); echo $widget; } ?> Sorry that I didn't input the code of the page i want to scrape the information. It's just that the code is so long. Thank you all in advance Hey all, What's the most efficient way to wait until a page on your own website is done being rendered, and then parse it for something specific? The reason I'm having to scrape it rather than just generate it myself is because the part being scraped if being generated in an iframe on my site via another site, and the data inside of it is dynamic. Thanks Hi everyone, I need help with scraping. I have script for scraping IMDB, but I need a few more thing. I dont know how to scrape more Budget, Opening Weekend, and Gross. Rest information that I need I scrape on this way: Code: [Select] //code removed to discourage people from scraping IMDB Can somebody help me to scrape Budget, Opening Weekend and Gross also? Hi Guys quite new to php, but getting along roses. I am a big football fan and on my localhost I want to replicate the football fixtures for the week. Its really an learning exercise, but I need projects to learn - struggle to learn from books. I want to copy this list into a table (eventually): http://news.bbc.co.uk/sport1/hi/foot...es/default.stm I have managed to understand how to grab the data, however I think my problem is in the for loops that display the data. It displays the dates, and the tournament titles, but then it lists all the games - and doesnt get some quite right anyway (look at Stevenage). 
Then if you scroll down it then just displays the Tournaments. I think its pretty close, but please can someone point out my mistakes. Thanks guys: Code: [Select] <?Php $file_string = file_get_contents('http://news.bbc.co.uk/sport1/hi/football/fixtures/default.stm'); preg_match_all('/<div class="mvb"><b>(.*)<\/b><\/div>/i', $file_string, $links); preg_match_all('/<div class="pvtb"><b>(.*)<\/b><\/div>/i', $file_string, $games); preg_match_all('/class="stats">(.*)<\/a>/i', $file_string, $teams); echo '<ol>'; $l = 0; for($i = 0; $i < count($links[1]); $i++) { echo '<div>' . $links[1][$i] . '</div><BR>'; for($j = 0; $j < count($games[1]); $j++) { echo '<BR><B><U><div>' . $games[1][$j] . '</div></U></B><BR>'; for($k = $l; $k < count($teams[1]); $k++) { echo strip_tags($teams[1][$k]) . '</a><BR>'; $l=$k; } } } echo '</ol>'; ?> I think the problem is that it doesnt know what order the stuff is supposed to be in. But Im not sure how to write the code to tell it the order. Should each type of preg_match_all be an array or something? I am using file_get_contents with a url (http) to screen scrape certain web pages that are publicly accessible. However there are certain web pages (https) that require me to use an x509 certificate stored in my browser to make them visible. I would like to scrape them too. How would I scrape them using a php script? Thanks in Advance. Hello guys, i'm trying to screen scrape the original content from every RSS feed. The RSS feeds works fine however when i try to screen scrape every content using the library simple html dom. 
At first it works fine but when it tries to extract the second feed's original content then i get this error: Fatal error: Cannot redeclare file_get_html() (previously declared in C:\wamp\www\mashup\protected\views\articles\simple_html_dom.php:37) in C:\wamp\www\mashup\protected\views\articles\simple_html_dom.php on line 41 part of my code is as follows: Code: [Select] foreach($RSS_DOC->channel->item as $RSSitem) { $item_id = md5($RSSitem->title); $item_title = $RSSitem->title; $item_date = date("Y-m-j G:i:s", strtotime($RSSitem->pubDate)); $item_url = $RSSitem->link; echo "Processing item '" , $item_id , "<br/>"; echo $item_title, " - "; echo $item_date, "<br/>"; echo $item_url, "<br/>"; //screen scrape original article include('simple_html_dom.php'); $html = file_get_dom($item_url); foreach($html->find('td[class=rel_headline_cmt]') as $element) { echo $element; } } Any help with this? For my site I need to screenscrape a page on a site. The problem is, to access the page that contains the data I need, I have to login to my account first. I know there are ways to simulate a form submission with ASP, but my server is Linux and can't use ASP. I'm wondering if any of you know how I would be able to simulate a POST with something like cURL? And possibly write an example script? Thanks in advance. (This may be in the wrong section, please move it if it is. Thanks) I usually run scrape on a website and get information and the new ones are usually added automatically but its not adding a new information. I am wondering whethere I have an issue with my code. I usually run scrape through the sitemap. 
Code: [Select]
$qry = "CREATE TABLE sitemap ( id varchar(30), price decimal(6,2), url varchar( 1024 ) )";
// Create the table
mysql_query ( $qry, $con );
$numSitemapPages = 350;
$html = new simple_html_dom();
if($_ECHO) echo "START: Fetching site map...<br />";
for( $i = 0; $i < $numSitemapPages; $i++ ) {
    if($_ECHO) echo "Page $i<br />";
    $fileContents = file_get_contents( "http://www.website.co.uk/SiteMap-S" . $i . ".aspx" );
    $html->load( $fileContents );
    $hrefs = $html->find( "a[style=color: Blue; text-decoration: underline;]" );
    if ( isset( $hrefs[ 0 ] ) ) {
        foreach( $hrefs as $href ) {
            $url = "http://www.website.co.uk/" . $href->href;
            $qry = "INSERT INTO sitemap(url) VALUES( '$url' )";
            mysql_query( $qry, $con );
            if($_ECHO) echo "MYSQL: Added $href->href to DB<br />";
        }
    } else if($_ECHO) echo "NO URLS FOUND ON THIS PAGE!<br />";
}
echo "END: Fetching site map...<br />";
exit(0);
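A few things in the script above could explain missing rows, though none is confirmed here: `CREATE TABLE` fails silently on every run after the first, `$numSitemapPages` is fixed at 350 so any later pages are never visited, the selector has to match the anchor's `style` attribute exactly, and no query result is checked. The `mysql_*` functions were also removed in PHP 7. A minimal sketch of the link-collection step using the built-in DOM extension, with a looser style match and de-duplication; `collect_links`, the `color: Blue` substring test, and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns the unique absolute URLs of the styled sitemap links.
function collect_links(string $html, string $baseUrl): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $links = [];
    // Match on a substring of the style attribute so small markup changes do not break it.
    foreach ($xpath->query("//a[@href][contains(@style, 'color: Blue')]") as $a) {
        $url = rtrim($baseUrl, '/') . '/' . ltrim($a->getAttribute('href'), '/');
        $links[$url] = true; // array keys de-duplicate repeated links
    }
    return array_keys($links);
}

$sample = '<a style="color: Blue; text-decoration: underline;" href="item-1.aspx">One</a>'
        . '<a style="color: Blue; text-decoration: underline;" href="/item-1.aspx">One again</a>'
        . '<a href="other.aspx">Ignored</a>';
$links = collect_links($sample, 'http://www.website.co.uk/');
print_r($links);
```

Each URL could then be inserted with a prepared statement (for example through PDO), checking the result of every query and skipping URLs that are already stored, so that reruns only add new rows.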