PHP - Using Curl To Scrape - Why Are My Results Cached?
I'm just doing a simple scrape on a web page. It works great, except that the section I scrape is just 10 words inside a table, and those words are updated every few hours.
The problem is that my scrape only seems to grab what it found the first time I ran my script, even though the words were updated a couple of hours ago. I've tested everything script-wise and looked at the HTML source to be sure everything is showing up. Is there anything you can think of that would cause the words on a page to be cached (by the source site or by the PHP script) when running my script? I even opened 2 different browsers that had never run the script and ran it fresh from them. When I echo the results, I still get the original result set. I'm perplexed.

Similar Tutorials

Hi! I'm working on a little project where I want to retrieve data from the Swedish news website DN's SOS Live page (http://www.dn.se/nyheter/soslive). On the page there is an iframe, and that is where the data I want to retrieve is. The iframe address is: http://div.dn.se/dn/sos/soslive.php?id= ... er/soslive
Here is the code I have; it stopped working yesterday.
Code:
function curl_download($Url){
    // is cURL installed yet?
    if (!function_exists('curl_init')){
        die('Sorry cURL is not installed!');
    }
    // OK cool - then let's create a new cURL resource handle
    $ch = curl_init();
    // Now set some options (most are optional)
    // Set URL to download
    curl_setopt($ch, CURLOPT_URL, $Url);
    // Set a referer
    curl_setopt($ch, CURLOPT_REFERER, "http://www.dn.se");
    // User agent
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    // Include header in result? (0 = no, 1 = yes)
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Should cURL return or print out the data? (true = return, false = print)
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Timeout in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    // Download the given URL, and return output
    $output = curl_exec($ch);
    // Close the cURL resource, and free system resources
    curl_close($ch);
    return $output;
}
$sosURL = 'http://div.dn.se/dn/sos/soslive.php?id=p://www.dn.se/nyheter/soslive';
$data = curl_download($sosURL);
The data variable is empty. I notice now that when I enter the web address "http://div.dn.se/dn/sos/soslive.php?id=p://www.dn.se/nyheter/soslive" there is no content, although this is the web address in the iframe. How has DN solved this and how could I get around it?
Best regards, Stefan

When using cURL, how can I include a call inside my get_all that loops through every next_page, gets and stores the data, and then outputs it to $response->data once the "next_page" parameter becomes null?
**Here is the method**:
public function get_all() {
    return $response->data;
}
**This is what $response->data is returning as of now** (the cURL code wasn't included here):
"paginator": {
    "total": 3092,
    "per_page": 500,
    "current_page": 2,
    "last_page": 7,
    "prev_page": "https://oc.com/api/v1/employees/?page=1",
    "next_page": "https://oc.com/api/v1/employees/?page=3"
},
"data": [
    {
        "id": 1592,
        etc....
Here are all of my unsuccessful attempts:
public function get_all() {
    // $next_url = $response->paginator->next_page;
    //
    // foreach ($response as $next => $next_page) {
    //     print_r2($next);
    //
    //     if ($next_url !== null) {
    //         $next_page = $response->data;
    //     }
    // }
    // foreach ($response as $paginator => $next_page) {
    //     if ($next_url !== null) {
    //         $return[] = $response->data;
    //     }
    // }
    // var_dump($response->paginator);
    // if ($next_url !== null) {
    //     $this->get_all($path, $args, $next_url);
    // }
    return $response->data;
}
Edited October 30, 2019 by Sema314

Hi all, I should first mention I'm not much of a coder. I'm using PHP to create a custom weather solution for myself. Basically I am pulling XML from weather.gov and working with the data. So far, so good. I'm getting the data I want displayed correctly, but I noticed that when I refresh the page I sometimes receive old data (from the past hour, two hours, etc.). I figured this was cached info and I'm trying to figure out how to clear that out. This is how I'm accessing the XML:
$url = 'http://forecast.weather.gov/MapClick.php?lat=40.65160&lon=-74.34420&FcstType=digitalDWML';
$xml = file_get_contents($url);
I did some research and tried the following headers, but that doesn't seem to work:
<?php
header("Expires: Mon, 26 Jul 1997 05:00:00 GMT");
header("Cache-Control: no-cache");
header("Pragma: no-cache");
?>
I also tried appending a random number onto the $url (as per a forum question/response somewhere), but that didn't work. Any suggestions would be great. Thanks

Hi all, I'm no good at PHP, but from what I read it's what I need. I am looking to grab just the number this page outputs, http://api.radioreference.com/audio/listeners.php?feedId=2798, and put it on a page for some tracking software. When you view the source of my page it needs to show the number itself and not the code that produces it, so JavaScript is out of the question. Can anyone help me?
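For the two questions just above (stale data on refresh, and grabbing a single number from a page), here is a minimal sketch of one possible approach, not a confirmed fix. It assumes the stale copy comes from a cache sitting between the script and the source site, so it asks for a fresh copy through request headers; cURL itself keeps no response cache. The helper names fetch_fresh and extract_first_int are made up for this illustration, and it is only an assumption that the listener page returns the count as the first run of digits in the body.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and ask any intermediate cache
// for a fresh copy. Returns the body as a string, or false on failure.
function fetch_fresh($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_TIMEOUT        => 10,    // seconds
        CURLOPT_FRESH_CONNECT  => true,  // open a new connection, do not reuse one
        CURLOPT_HTTPHEADER     => [      // request headers sent to the remote site
            'Cache-Control: no-cache',
            'Pragma: no-cache',
        ],
    ]);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

// Hypothetical helper: return the first run of digits in a string as an int,
// or null when the string contains no digits.
function extract_first_int(string $body): ?int
{
    if (preg_match('/\d+/', $body, $match) === 1) {
        return (int) $match[0];
    }
    return null;
}

// Example use (performs a network request, so it is left commented out):
// $body = fetch_fresh('http://api.radioreference.com/audio/listeners.php?feedId=2798');
// if ($body !== false) {
//     echo extract_first_int($body);
// }
```

Note that header() calls like the ones in the weather post set headers on your own page's response; they do not change the request that file_get_contents or cURL sends to the remote server. If the remote site or a proxy still serves an old copy despite the request headers, the script cannot force a refresh from its side.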
Good day dear community, I am working on a cURL loop to fetch multiple pages. I have some examples and a question.
Example: if we want to get information from 3 sites with cURL we can do it like so:
$list[1] = "http://www.example1.com";
$list[2] = "ftp://example.com";
$list[3] = "http://www.example2.com";
After creating the list of links we should initialize the cURL multi handle and add the cURL handles.
$curlHandle = curl_multi_init();
for ($i = 1; $i <= 3; $i++)
    $curl[$i] = addHandle($curlHandle, $list[$i]);
Now we should execute the cURL multi handle and retrieve the content from the sub handles that we added to it.
ExecHandle($curlHandle);
for ($i = 1; $i <= 3; $i++) {
    $text[$i] = curl_multi_getcontent($curl[$i]);
    echo $text[$i];
}
In the end we should release the handles from the cURL multi handle by calling curl_multi_remove_handle and close the cURL multi handle!
I want to do another fetch of sites with cURL multi, since this is the prettiest way to do it. I am not sure about the string concatenation, though. Note that I want to fetch several hundred pages; here are some details for the target server (I have to create a loop over several hundred sites):
* siteone.example/?show_subsite=9009
* siteone.example/?show_subsite=9742
* siteone.example/?show_subsite=9871
.... and so on and so forth
Question: how do I apply this loop to the array of the curl-multi?
<?php
/************************************\
 * Multi interface in PHP with curl  *
 * Requires PHP 5.0, Apache 2.0 and  *
 * Curl                              *
 *************************************
 * Written By Cyborg 19671897        *
 * Bugfixed by Jeremy Ellman         *
\***********************************/
$urls = array( "siteone", "sitetwo", "sitethree" );
$mh = curl_multi_init();
foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, 1); // return data as string
    curl_setopt($conn[$i], CURLOPT_FOLLOWLOCATION, 1); // follow redirects
    curl_setopt($conn[$i], CURLOPT_MAXREDIRS, 2);      // maximum redirects
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 10); // timeout
    curl_multi_add_handle($mh, $conn[$i]);
}
do {
    $n = curl_multi_exec($mh, $active);
} while ($active);
foreach ($urls as $i => $url) {
    $res[$i] = curl_multi_getcontent($conn[$i]);
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
}
curl_multi_close($mh);
print_r($res);
?>
I look forward to your ideas.

Dear all, I wrote this code to extract the widget from this page: http://www.widgetbox.com/widget/accuwidget
The widget information is hidden under the <iframe> tag, inside its src attribute. When I try this code it always shows me this error: Fatal error: Call to undefined method DOMNodeList::getAttribute()
Code:
<?php
get();
function get(){
    $url = "http://www.widgetbox.com/widget/accuwidget";
    $tidy = new tidy();
    $repaired = $tidy->repairfile($url); // The code is dirty, so it needs to be tidied
    $xml = new DOMDocument();
    $xml->loadHTML($repaired);
    $xpath = new DOMXpath($xml);
    $cloud = $xpath->query("//div[@id='preview-div']/div/iframe");
    $widget = $cloud->getAttribute("src");
    echo $widget;
}
?>
Sorry that I didn't include the code of the page I want to scrape the information from. It's just that the code is so long. Thank you all in advance

Hey all, what's the most efficient way to wait until a page on your own website is done being rendered, and then parse it for something specific?
The reason I'm having to scrape it rather than just generate it myself is because the part being scraped if being generated in an iframe on my site via another site, and the data inside of it is dynamic. Thanks Hi everyone, I need help with scraping. I have script for scraping IMDB, but I need a few more thing. I dont know how to scrape more Budget, Opening Weekend, and Gross. Rest information that I need I scrape on this way: Code: [Select] //code removed to discourage people from scraping IMDB Can somebody help me to scrape Budget, Opening Weekend and Gross also? Hi Guys quite new to php, but getting along roses. I am a big football fan and on my localhost I want to replicate the football fixtures for the week. Its really an learning exercise, but I need projects to learn - struggle to learn from books. I want to copy this list into a table (eventually): http://news.bbc.co.uk/sport1/hi/foot...es/default.stm I have managed to understand how to grab the data, however I think my problem is in the for loops that display the data. It displays the dates, and the tournament titles, but then it lists all the games - and doesnt get some quite right anyway (look at Stevenage). Then if you scroll down it then just displays the Tournaments. I think its pretty close, but please can someone point out my mistakes. Thanks guys: Code: [Select] <?Php $file_string = file_get_contents('http://news.bbc.co.uk/sport1/hi/football/fixtures/default.stm'); preg_match_all('/<div class="mvb"><b>(.*)<\/b><\/div>/i', $file_string, $links); preg_match_all('/<div class="pvtb"><b>(.*)<\/b><\/div>/i', $file_string, $games); preg_match_all('/class="stats">(.*)<\/a>/i', $file_string, $teams); echo '<ol>'; $l = 0; for($i = 0; $i < count($links[1]); $i++) { echo '<div>' . $links[1][$i] . '</div><BR>'; for($j = 0; $j < count($games[1]); $j++) { echo '<BR><B><U><div>' . $games[1][$j] . '</div></U></B><BR>'; for($k = $l; $k < count($teams[1]); $k++) { echo strip_tags($teams[1][$k]) . 
'</a><BR>'; $l=$k; } } } echo '</ol>'; ?> I think the problem is that it doesnt know what order the stuff is supposed to be in. But Im not sure how to write the code to tell it the order. Should each type of preg_match_all be an array or something? I am using file_get_contents with a url (http) to screen scrape certain web pages that are publicly accessible. However there are certain web pages (https) that require me to use an x509 certificate stored in my browser to make them visible. I would like to scrape them too. How would I scrape them using a php script? Thanks in Advance. I'm trying to get a value which changes but is always contained the same way this is the text Next: Brigadier (81654 of 220000) as you can see below this is the value i need to get I just can't figure out the RegEx for it? help please! Smaller CODE Code: [Select] <div class="rankBar"> <p class="grade"><span id="ctl00_mainContent_identityBar_currentGlobalRankLabel" class="current">Colonel Grade 3</span><span id="ctl00_mainContent_identityBar_nextGlobalRankLabel" class="future">Next: Brigadier (81654 of 220000)</span></p> <div class="rankMeter"> <div id="ctl00_mainContent_identityBar_rankBarPanel" class="bar" style="width:37%;"> <span></span> </div> </div> FULL CODE Code: [Select] <div id="ctl00_mainContent_identityBar_idbarPanel"> <div class="glowBox idRankContainer"> <div class="corner bottomLeft"></div> <div class="lattice"></div> <div class="content"> <a href="/Stats/Reach/default.aspx?player=l RaH l" id="ctl00_mainContent_identityBar_avatar" class="avatar"><img id="ctl00_mainContent_identityBar_emblemImg" src="/Stats/emblem.ashx?s=70&0=0&1=0&2=0&3=0&fi=0&bi=0&fl=0&m=3" style="height:70px;width:70px;border-width:0px;"></a> <div class="idRank"> <div class="userInfo"> <img id="ctl00_mainContent_identityBar_namePlateImg" class="img_plate" src="/images/reachstats/nameplates/512.png" alt="" style="height:30px;width:60px;border-width:0px;"> <a 
href="/projects/reach/article.aspx?ucc=faq&cid=28756"><img id="ctl00_mainContent_identityBar_currentGlobalRankImage" title="Colonel Grade 3" class="img_rankIcon" src="/images/reachstats/grades/med/A4BF62C6-1E3F-468A-9A73-8237600B2AD3.png" alt="Colonel Grade 3" style="height:40px;width:40px;border-width:0px;"></a> <h2><a href="/Stats/Reach/default.aspx?player=l RaH l" id="ctl00_mainContent_identityBar_gamerTag">l RaH l</a> - FLIR</h2> </div> <div class="rankBar"> <p class="grade"><span id="ctl00_mainContent_identityBar_currentGlobalRankLabel" class="current">Colonel Grade 3</span><span id="ctl00_mainContent_identityBar_nextGlobalRankLabel" class="future">Next: Brigadier (81654 of 220000)</span></p> <div class="rankMeter"> <div id="ctl00_mainContent_identityBar_rankBarPanel" class="bar" style="width:37%;"> <span></span> </div> </div> </div> </div> </div> </div> <div class="glowBox shields"> <div class="corner topRight"></div> <div class="lattice"></div> <div class="content"> <ul> <li class="campaign"> <p>Campaign</p> <img id="ctl00_mainContent_identityBar_spCampaignImage" title="Completed Legendary Difficulty" alt="Campaign Progress" src="/images/reachstats/campaign_progress/legendary.png" style="height:59px;width:54px;border-width:0px;"> </li> <li class="coop"> <p>Co-op</p> <img id="ctl00_mainContent_identityBar_coopCampaignImage" title="Completed Normal Difficulty" alt="Coop Campaign Progress" src="/images/reachstats/campaign_progress/normal.png" style="height:59px;width:54px;border-width:0px;"> </li> <li class="arena_display"> <p>Arena</p> <a id="ctl00_mainContent_identityBar_arenaLink" href="/stats/reach/careerstats/playlists.aspx?player=l RaH l&vc=2"><img id="ctl00_mainContent_identityBar_currentBestArenaDivisionImg" alt="" src="/images/reachstats/arena_div/0.png" style="height:47px;width:48px;border-width:0px;"></a> <div class="glowBox popOut po_arenaSeason2"> <div class="corner topRight"></div> <div class="corner bottomLeft"></div> <div class="content"> 
<h4>Arena Season 6</h4> <div class="twoColumn"> <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl00_arenaImage" src="/images/reachstats/arena_div/0.png" style="height:47px;width:48px;border-width:0px;"> <h5>Doubles Arena</h5> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl00_divisionName">Not Qualified Yet</p> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl00_lastSeasonControl">Last Season: <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl00_lastSeasonImage" src="/images/reachstats/arena_div/0.png" style="height:23px;width:23px;border-width:0px;"> </p> </div> <div class="rule_transparent"></div> <div class="twoColumn"> <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl02_arenaImage" src="/images/reachstats/arena_div/0.png" style="height:47px;width:48px;border-width:0px;"> <h5>FFA Arena</h5> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl02_divisionName">Not Qualified Yet</p> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl02_lastSeasonControl">Last Season: <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl02_lastSeasonImage" src="/images/reachstats/arena_div/0.png" style="height:23px;width:23px;border-width:0px;"> </p> </div> <div class="rule_transparent"></div> <div class="twoColumn"> <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl04_arenaImage" src="/images/reachstats/arena_div/0.png" style="height:47px;width:48px;border-width:0px;"> <h5>Team Arena</h5> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl04_divisionName">Not Qualified Yet</p> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl04_lastSeasonControl">Last Season: <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl04_lastSeasonImage" src="/images/reachstats/arena_div/0.png" style="height:23px;width:23px;border-width:0px;"> </p> </div> </div> </div> </li> </ul> </div> </div> </div> Hello guys, i'm trying to screen scrape the original content from every RSS feed. 
The RSS feeds work fine. However, the problem comes when I try to screen scrape every article using the Simple HTML DOM library. At first it works fine, but when it tries to extract the second feed's original content I get this error:
Fatal error: Cannot redeclare file_get_html() (previously declared in C:\wamp\www\mashup\protected\views\articles\simple_html_dom.php:37) in C:\wamp\www\mashup\protected\views\articles\simple_html_dom.php on line 41
Part of my code is as follows:
Code:
foreach($RSS_DOC->channel->item as $RSSitem) {
    $item_id = md5($RSSitem->title);
    $item_title = $RSSitem->title;
    $item_date = date("Y-m-j G:i:s", strtotime($RSSitem->pubDate));
    $item_url = $RSSitem->link;
    echo "Processing item '" , $item_id , "<br/>";
    echo $item_title, " - ";
    echo $item_date, "<br/>";
    echo $item_url, "<br/>";
    //screen scrape original article
    include('simple_html_dom.php');
    $html = file_get_dom($item_url);
    foreach($html->find('td[class=rel_headline_cmt]') as $element) {
        echo $element;
    }
}
Any help with this?

For my site I need to screen scrape a page on another site. The problem is that to access the page that contains the data I need, I have to log in to my account first. I know there are ways to simulate a form submission with ASP, but my server is Linux and can't use ASP. I'm wondering if any of you know how I would be able to simulate a POST with something like cURL? And possibly write an example script? Thanks in advance. (This may be in the wrong section, please move it if it is. Thanks)

I usually run a scrape on a website to get information, and new items are usually added automatically, but now it's not adding any new information. I am wondering whether I have an issue with my code. I usually run the scrape through the sitemap.
Code:
$qry = "CREATE TABLE sitemap ( id varchar(30), price decimal(6,2), url varchar( 1024 ) )";
// Create the table
mysql_query ( $qry, $con );
$numSitemapPages = 350;
$html = new simple_html_dom();
if($_ECHO) echo "START: Fetching site map...<br />";
for( $i = 0; $i < $numSitemapPages; $i++ ) {
    if($_ECHO) echo "Page $i<br />";
    $fileContents = file_get_contents( "http://www.website.co.uk/SiteMap-S" . $i . ".aspx" );
    $html->load( $fileContents );
    $hrefs = $html->find( "a[style=color: Blue; text-decoration: underline;]" );
    if ( isset( $hrefs[ 0 ] ) ) {
        foreach( $hrefs as $href ) {
            $url = "http://www.website.co.uk/" . $href->href;
            $qry = "INSERT INTO sitemap(url) VALUES( '$url' )";
            mysql_query( $qry, $con );
            if($_ECHO) echo "MYSQL: Added $href->href to DB<br />";
        }
    }
    else if($_ECHO) echo "NO URLS FOUND ON THIS PAGE!<br />";
}
echo "END: Fetching site map...<br />";
exit(0);

I need to scrape a Chinese website. (I guess there is no difference between scraping a Chinese website and a normal one?) It's my supplier's web link. They have told me to download images and text from their website profile link on 1688.com. There is an API, but from what I've read it's pants, plus my virus checker doesn't allow me to visit the API doc page. What tool should I use? I've got lots of experience coding, but I'm a master of none. Maybe I still fall into the newbie category. LOL. I saw a link from an article... they gave these names:
Goutte
Which should I consider? IMPORTANT: I need to download and then upload into my WooCommerce website. I need images + variation details. I'll have a little file with translations, so if XYZ is found in Chinese, then it is replaced with this. I thought I would mention the extra detail in case it was relevant to deciding which scraping tool to pick. Thanks!
Edited July 11, 2019 by acebase

Hi all, I'm trying to scrape the contents of a page that is behind a login screen; namely: http://my.mail.ru/apps. Here's my code. It almost works, but doesn't appear to be properly logging in; I just get a login screen on the URL download. Any ideas? Thanks much. Here's my code:
<?php
$ch=login();
$html=downloadUrl('http://my.mail.ru/apps', $ch);
echo $html;
function downloadUrl($Url, $ch){
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_POST, 0);
    curl_setopt($ch, CURLOPT_REFERER, "http://my.mail.ru/cgi-bin/login?noclear=1&page=http%3a%2f%2fmy.mail.ru%2fapps%2f");
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $output = curl_exec($ch);
    return $output;
}
function login(){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://my.mail.ru/cgi-bin/login?noclear=1&page=http%3a%2f%2fmy.mail.ru%2fapps%2f'); //login URL
    curl_setopt ($ch, CURLOPT_POST, 1);
    $postData=' page=http%3A%2F%2Fmy.mail.ru%2Fapps%2F &Login=username &Domain=mail.ru &Password=password';
    curl_setopt ($ch, CURLOPT_POSTFIELDS, $postData);
    curl_setopt ($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
    curl_setopt ($ch, CURLOPT_FOLLOWLOCATION,1);
    curl_setopt ($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
    $store = curl_exec ($ch);
    return $ch;
}
?>

Hi everyone, I am making a screen scraper in PHP which scrapes the usernames from forum posts and stores them in an SQL database. I need some help with part of the preg_match code if possible please?
The code and the pseudocode I have so far are below (the pseudocode is the part I am having trouble with, but I will try to solve it myself if possible). Edit: sorry, I edited the post as it was confusing to read. Please ask for clarification if there is anything I have failed to explain properly. Thank you.
// I will be placing the following PHP in the confirmation page people see after making a new post, so for this example let's say the referrer header says: http://www.mysite.com/showthread.php?tid=1'
Code:
$threadurl=$_SERVER['HTTP_REFERER'];
// scrape the page
Code:
$content = file_get_contents($threadurl);
// find the pattern in the source which makes it easy to find the username; the only things that change are the uid and the color
Code:
if (preg_match("/\b uid=792"><span style="color:#ffcc00">fapafap</span></a>\b /i", $content)) {
// extract username from this string search: Don't know!
// copy the username (in this case 'fapafap') to the database along with the referral ID
Code:
$query_insert="INSERT INTO newpostersdatabase(username,referrerurl) VALUES('$username','$threadurl')" ;
$result=mysql_query ( $query_insert);
if(!$result){
    die(mysql_error());
}
Thank you so much for any guidance. I know that I have totally messed up the string search as well, but my brain is too small it seems!

Alright, so I play a browser game called Politics and War. I run an alliance that has 74 members. In that alliance we offer a bank service for all our members, but I, being the leader, am the only one who can access the bank. I have been building a site that works with the game API to gather data for members and create a dashboard. One of the features I am trying to build is allowing them to withdraw from their account instantly.
So, what I need: To be able to submit a POST request to login to the site (specifically on this page --> https://politicsandwar.com/login) with my username and password, but then I need to keep the session active and navigate to a different page (the alliance bank page). On that page I first need to scrape a value from a hidden input (token) and then I need to submit a POST request to this same page while still being logged in.
I am not asking someone to do it for me, but rather someone to help me know how to go about this. I have never submitted post requests with PHP, but I have used PHP cURL in the past. I also have made POST requests with JS, but never PHP.
Thank you so much to anyone who is able to help!
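As a starting point for the flow described above (log in with a POST, keep the session, read a hidden token, then POST again), here is a rough sketch under stated assumptions rather than a working client for that site. The helper names new_session, session_request and extract_hidden_input are invented for this example, and the form field names in the commented usage ('email', 'password', 'token', 'amount') and the $bankUrl variable are placeholders; the real names have to be read from the site's own HTML. The idea is to reuse one cURL handle with a cookie jar so the login cookie is sent on every later request.

```php
<?php
// Hypothetical helper: create one cURL handle that stores and resends cookies,
// so a login performed on it stays active for later requests.
function new_session(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return bodies as strings
        CURLOPT_FOLLOWLOCATION => true,        // follow post-login redirects
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies here on close
        CURLOPT_COOKIEFILE     => $cookieFile, // enable the cookie engine
        CURLOPT_TIMEOUT        => 15,          // seconds
    ]);
    return $ch;
}

// Hypothetical helper: GET when $post is null, otherwise POST the fields
// as a URL-encoded form. Returns the response body, or false on failure.
function session_request($ch, string $url, ?array $post = null)
{
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($post === null) {
        curl_setopt($ch, CURLOPT_HTTPGET, true);
    } else {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
    }
    return curl_exec($ch);
}

// Hypothetical helper: return the value of <input type="hidden" name="$name">,
// or null if no such input exists. $name must not contain double quotes.
function extract_hidden_input(string $html, string $name): ?string
{
    if ($html === '') {
        return null;
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real pages are rarely valid HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(sprintf('//input[@type="hidden"][@name="%s"]', $name));
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('value');
}

// Example flow (network calls, placeholder field names, left commented out):
// $ch = new_session(sys_get_temp_dir() . '/pw_cookies.txt');
// session_request($ch, 'https://politicsandwar.com/login',
//     ['email' => 'me@example.com', 'password' => 'secret']);
// $page  = session_request($ch, $bankUrl);            // $bankUrl: the bank page
// $token = extract_hidden_input((string) $page, 'token');
// session_request($ch, $bankUrl, ['token' => $token, 'amount' => 100]);
// curl_close($ch);
```

Reusing a single handle is a deliberate choice here: the cookie engine keeps the session cookie in memory between requests, so the second and third calls go out as the logged-in user without re-reading the cookie file.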