PHP - Https Screen Scrape
I am using file_get_contents with a url (http) to screen scrape certain web pages that are publicly accessible.
However, certain HTTPS pages are only visible when I present an X.509 client certificate that is stored in my browser. I would like to scrape those too. How would I scrape them using a PHP script? Thanks in advance.

Similar Tutorials

Hello guys, I'm trying to screen scrape the original content from every RSS feed. The RSS feed itself works fine; the problem appears when I scrape the content using the Simple HTML DOM library. At first it works, but when it tries to extract the second feed's original content I get this error:

Fatal error: Cannot redeclare file_get_html() (previously declared in C:\wamp\www\mashup\protected\views\articles\simple_html_dom.php:37) in C:\wamp\www\mashup\protected\views\articles\simple_html_dom.php on line 41

Part of my code is as follows:

Code: [Select]
foreach($RSS_DOC->channel->item as $RSSitem) {
    $item_id = md5($RSSitem->title);
    $item_title = $RSSitem->title;
    $item_date = date("Y-m-j G:i:s", strtotime($RSSitem->pubDate));
    $item_url = $RSSitem->link;
    echo "Processing item '" , $item_id , "<br/>";
    echo $item_title, " - ";
    echo $item_date, "<br/>";
    echo $item_url, "<br/>";
    //screen scrape original article
    include('simple_html_dom.php');
    $html = file_get_dom($item_url);
    foreach($html->find('td[class=rel_headline_cmt]') as $element) {
        echo $element;
    }
}

Any help with this?

For my site I need to screen scrape a page on another site. The problem is that, to access the page that contains the data I need, I have to log in to my account first. I know there are ways to simulate a form submission with ASP, but my server runs Linux and can't use ASP. I'm wondering if any of you know how I could simulate a POST with something like cURL? And possibly write an example script? Thanks in advance. (This may be in the wrong section; please move it if it is. Thanks)

Hi all, I'm PHP stupid, but from what I read it's what I need.
I am looking to grab just the number this page outputs http://api.radioreference.com/audio/listeners.php?feedId=2798 and put it on a page for some tracking software. When you view the source of my page it needs to show the number itself and not the code that produces it, so JavaScript is out of the question. Can anyone help me?

Dear all, I wrote this code to extract the widget from this page: http://www.widgetbox.com/widget/accuwidget The widget information is hidden in the <iframe> tag, inside its src attribute. When I run this code it always fails with: Fatal error: Call to undefined method DOMNodeList::getAttribute()

Code: [Select]
<?php
get();
function get(){
    $url = "http://www.widgetbox.com/widget/accuwidget";
    $tidy = new tidy();
    $repaired = $tidy->repairfile($url); //The markup is dirty, so it needs to be tidied
    $xml = new DOMDocument();
    $xml->loadHTML($repaired);
    $xpath = new DOMXpath($xml);
    $cloud = $xpath->query("//div[@id='preview-div']/div/iframe");
    $widget = $cloud->getAttribute("src");
    echo $widget;
}
?>

Sorry that I didn't include the source of the page I want to scrape; it is just too long. Thank you all in advance.

Hey all, What's the most efficient way to wait until a page on your own website is done being rendered, and then parse it for something specific? The reason I have to scrape it rather than just generate it myself is that the part being scraped is generated in an iframe on my site by another site, and the data inside it is dynamic. Thanks

Hi everyone, I need help with scraping. I have a script for scraping IMDB, but I need a few more things. I don't know how to scrape Budget, Opening Weekend, and Gross. The rest of the information I need I scrape this way:

Code: [Select]
//code removed to discourage people from scraping IMDB

Can somebody help me scrape Budget, Opening Weekend and Gross as well?
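On the "Call to undefined method DOMNodeList::getAttribute()" error in the widgetbox post above: DOMXPath::query() returns a DOMNodeList, not a single element, so one plausible fix is to pick a node out of the list before asking for its attribute. The sketch below is only an illustration under assumptions: the sample HTML and the helper name firstIframeSrc() are made up here, since the real page source was not posted.

```php
<?php
// Hypothetical stand-in for the repaired page; the real markup was not posted.
$html = '<html><body><div id="preview-div"><div>'
      . '<iframe src="http://example.com/widget-frame"></iframe>'
      . '</div></div></body></html>';

// Hypothetical helper: returns the src of the first node matched by $query,
// or null when nothing matches.
function firstIframeSrc(string $html, string $query): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so parser warnings are silenced here.
    @$doc->loadHTML($html);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query); // a DOMNodeList, or false for a bad expression

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // getAttribute() lives on DOMElement, so take one item from the list first.
    $node = $nodes->item(0);
    return $node instanceof DOMElement ? $node->getAttribute('src') : null;
}

echo firstIframeSrc($html, "//div[@id='preview-div']/div/iframe"), "\n";
```

If several iframes can match, looping over the list with foreach and reading the attribute from each element should work the same way.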
I'm trying to get a value which changes but is always contained the same way this is the text Next: Brigadier (81654 of 220000) as you can see below this is the value i need to get I just can't figure out the RegEx for it? help please! Smaller CODE Code: [Select] <div class="rankBar"> <p class="grade"><span id="ctl00_mainContent_identityBar_currentGlobalRankLabel" class="current">Colonel Grade 3</span><span id="ctl00_mainContent_identityBar_nextGlobalRankLabel" class="future">Next: Brigadier (81654 of 220000)</span></p> <div class="rankMeter"> <div id="ctl00_mainContent_identityBar_rankBarPanel" class="bar" style="width:37%;"> <span></span> </div> </div> FULL CODE Code: [Select] <div id="ctl00_mainContent_identityBar_idbarPanel"> <div class="glowBox idRankContainer"> <div class="corner bottomLeft"></div> <div class="lattice"></div> <div class="content"> <a href="/Stats/Reach/default.aspx?player=l RaH l" id="ctl00_mainContent_identityBar_avatar" class="avatar"><img id="ctl00_mainContent_identityBar_emblemImg" src="/Stats/emblem.ashx?s=70&0=0&1=0&2=0&3=0&fi=0&bi=0&fl=0&m=3" style="height:70px;width:70px;border-width:0px;"></a> <div class="idRank"> <div class="userInfo"> <img id="ctl00_mainContent_identityBar_namePlateImg" class="img_plate" src="/images/reachstats/nameplates/512.png" alt="" style="height:30px;width:60px;border-width:0px;"> <a href="/projects/reach/article.aspx?ucc=faq&cid=28756"><img id="ctl00_mainContent_identityBar_currentGlobalRankImage" title="Colonel Grade 3" class="img_rankIcon" src="/images/reachstats/grades/med/A4BF62C6-1E3F-468A-9A73-8237600B2AD3.png" alt="Colonel Grade 3" style="height:40px;width:40px;border-width:0px;"></a> <h2><a href="/Stats/Reach/default.aspx?player=l RaH l" id="ctl00_mainContent_identityBar_gamerTag">l RaH l</a> - FLIR</h2> </div> <div class="rankBar"> <p class="grade"><span id="ctl00_mainContent_identityBar_currentGlobalRankLabel" class="current">Colonel Grade 3</span><span 
id="ctl00_mainContent_identityBar_nextGlobalRankLabel" class="future">Next: Brigadier (81654 of 220000)</span></p> <div class="rankMeter"> <div id="ctl00_mainContent_identityBar_rankBarPanel" class="bar" style="width:37%;"> <span></span> </div> </div> </div> </div> </div> </div> <div class="glowBox shields"> <div class="corner topRight"></div> <div class="lattice"></div> <div class="content"> <ul> <li class="campaign"> <p>Campaign</p> <img id="ctl00_mainContent_identityBar_spCampaignImage" title="Completed Legendary Difficulty" alt="Campaign Progress" src="/images/reachstats/campaign_progress/legendary.png" style="height:59px;width:54px;border-width:0px;"> </li> <li class="coop"> <p>Co-op</p> <img id="ctl00_mainContent_identityBar_coopCampaignImage" title="Completed Normal Difficulty" alt="Coop Campaign Progress" src="/images/reachstats/campaign_progress/normal.png" style="height:59px;width:54px;border-width:0px;"> </li> <li class="arena_display"> <p>Arena</p> <a id="ctl00_mainContent_identityBar_arenaLink" href="/stats/reach/careerstats/playlists.aspx?player=l RaH l&vc=2"><img id="ctl00_mainContent_identityBar_currentBestArenaDivisionImg" alt="" src="/images/reachstats/arena_div/0.png" style="height:47px;width:48px;border-width:0px;"></a> <div class="glowBox popOut po_arenaSeason2"> <div class="corner topRight"></div> <div class="corner bottomLeft"></div> <div class="content"> <h4>Arena Season 6</h4> <div class="twoColumn"> <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl00_arenaImage" src="/images/reachstats/arena_div/0.png" style="height:47px;width:48px;border-width:0px;"> <h5>Doubles Arena</h5> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl00_divisionName">Not Qualified Yet</p> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl00_lastSeasonControl">Last Season: <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl00_lastSeasonImage" src="/images/reachstats/arena_div/0.png" style="height:23px;width:23px;border-width:0px;"> </p> </div> <div 
class="rule_transparent"></div> <div class="twoColumn"> <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl02_arenaImage" src="/images/reachstats/arena_div/0.png" style="height:47px;width:48px;border-width:0px;"> <h5>FFA Arena</h5> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl02_divisionName">Not Qualified Yet</p> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl02_lastSeasonControl">Last Season: <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl02_lastSeasonImage" src="/images/reachstats/arena_div/0.png" style="height:23px;width:23px;border-width:0px;"> </p> </div> <div class="rule_transparent"></div> <div class="twoColumn"> <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl04_arenaImage" src="/images/reachstats/arena_div/0.png" style="height:47px;width:48px;border-width:0px;"> <h5>Team Arena</h5> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl04_divisionName">Not Qualified Yet</p> <p id="ctl00_mainContent_identityBar_arenaRepeater_ctl04_lastSeasonControl">Last Season: <img id="ctl00_mainContent_identityBar_arenaRepeater_ctl04_lastSeasonImage" src="/images/reachstats/arena_div/0.png" style="height:23px;width:23px;border-width:0px;"> </p> </div> </div> </div> </li> </ul> </div> </div> </div> Hi! I'm into a little project where I want to retrieve data from the swedish news website DN's SOS Live page (http://www.dn.se/nyheter/soslive). On the page there is an iFrame and there is the data I want to retrieve. The iFrame address is: http://div.dn.se/dn/sos/soslive.php?id= ... er/soslive Here is the code I have and it stoped working yesterday. Code: [Select] function curl_download($Url){ // is cURL installed yet? 
    if (!function_exists('curl_init')){
        die('Sorry cURL is not installed!');
    }
    // OK cool - then let's create a new cURL resource handle
    $ch = curl_init();
    // Now set some options (most are optional)
    // Set URL to download
    curl_setopt($ch, CURLOPT_URL, $Url);
    // Set a referer
    curl_setopt($ch, CURLOPT_REFERER, "http://www.dn.se");
    // User agent
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    // Include header in result? (0 = no, 1 = yes)
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Should cURL return or print out the data? (true = return, false = print)
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Timeout in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    // Download the given URL, and return output
    $output = curl_exec($ch);
    // Close the cURL resource, and free system resources
    curl_close($ch);
    return $output;
}
$sosURL = 'http://div.dn.se/dn/sos/soslive.php?id=p://www.dn.se/nyheter/soslive';
$data = curl_download($sosURL);

The $data variable is empty. I notice now that when I enter the web address "http://div.dn.se/dn/sos/soslive.php?id=p://www.dn.se/nyheter/soslive" in a browser there is no content, although this is the address used in the iFrame. How has DN solved this and how could I get around it? Best regards, Stefan

Hi guys, I'm quite new to PHP but getting along nicely. I am a big football fan, and on my localhost I want to replicate the football fixtures for the week. It's really a learning exercise, but I need projects to learn from; I struggle to learn from books. I want to copy this list into a table (eventually): http://news.bbc.co.uk/sport1/hi/foot...es/default.stm I have managed to understand how to grab the data, but I think my problem is in the for loops that display it. The script displays the dates and the tournament titles, but then it lists all the games, and doesn't get some of them quite right anyway (look at Stevenage). Then, if you scroll down, it just displays the tournaments. I think it's pretty close, but please can someone point out my mistakes.
Thanks guys: Code: [Select] <?Php $file_string = file_get_contents('http://news.bbc.co.uk/sport1/hi/football/fixtures/default.stm'); preg_match_all('/<div class="mvb"><b>(.*)<\/b><\/div>/i', $file_string, $links); preg_match_all('/<div class="pvtb"><b>(.*)<\/b><\/div>/i', $file_string, $games); preg_match_all('/class="stats">(.*)<\/a>/i', $file_string, $teams); echo '<ol>'; $l = 0; for($i = 0; $i < count($links[1]); $i++) { echo '<div>' . $links[1][$i] . '</div><BR>'; for($j = 0; $j < count($games[1]); $j++) { echo '<BR><B><U><div>' . $games[1][$j] . '</div></U></B><BR>'; for($k = $l; $k < count($teams[1]); $k++) { echo strip_tags($teams[1][$k]) . '</a><BR>'; $l=$k; } } } echo '</ol>'; ?> I think the problem is that it doesnt know what order the stuff is supposed to be in. But Im not sure how to write the code to tell it the order. Should each type of preg_match_all be an array or something? I usually run scrape on a website and get information and the new ones are usually added automatically but its not adding a new information. I am wondering whethere I have an issue with my code. I usually run scrape through the sitemap. Code: [Select] $qry = "CREATE TABLE sitemap ( id varchar(30), price decimal(6,2), url varchar( 1024 ) )"; // Create the table mysql_query ( $qry, $con ); $numSitemapPages = 350; $html = new simple_html_dom(); if($_ECHO) echo "START: Fetching site map...<br />"; for( $i = 0; $i < $numSitemapPages; $i++ ) { if($_ECHO) echo "Page $i<br />"; $fileContents = file_get_contents( "http://www.website.co.uk/SiteMap-S" . $i . ".aspx" ); $html->load( $fileContents ); $hrefs = $html->find( "a[style=color: Blue; text-decoration: underline;]" ); if ( isset( $hrefs[ 0 ] ) ) { foreach( $hrefs as $href ) { $url = "http://www.website.co.uk/" . 
$href->href; $qry = "INSERT INTO sitemap(url) VALUES( '$url' )"; mysql_query( $qry, $con ); if($_ECHO) echo "MYSQL: Added $href->href to DB<br />"; } } else if($_ECHO) echo "NO URLS FOUND ON THIS PAGE!<br />"; } echo "END: Fetching site map...<br />"; exit(0); I'm just doing a simple scrape on a web page.. works great, except... the section I scrape is just 10 words inside a table which are updated every few hours. Problem is... my scrape seems to only grab what I found the first time I ran my script, despite the new words being updated a couple hours ago. I've tested everything script wise and looked at the html source to be sure everything is showing up... just wondering, is there any thing you can think of that would allow for caching (from source site or the php script) of the words on a page when running my script? I even opened 2 different browsers and ran the script fresh from these browsers that had never ran it. When I echo the results, I still get the original result set. I'm perplexed. I need to scrape a Chinese website. (I guess there is no difference between scraping a Chinese website and a normal one?) It's my suppliers weblink. They have have told me to download images and text from their website profile link off 1688.com. There is an API - but from what I've read, it's pants + my virus checked doesn't allow me to visit the API doc page. What tool should I use? I've got lots of experience coding - but master of none. Maybe I still fall into the newbie category. LOL. I saw a link from an article... they gave these names:
Goutte

Which should I consider? IMPORTANT: I need to download the data and then upload it into my WooCommerce website. I need images + variation details. I'll have a little file with translations, so if XYZ is found in Chinese, it is replaced with the translation. I thought I would mention the extra detail in case it is relevant to which scraping tool to pick. Thanks! Edited July 11, 2019 by acebase

Hi everyone, I am making a screen scraper in PHP which scrapes the usernames from forum posts and stores them in an SQL database. I need some help with part of the preg_match code, if possible please? The code and the pseudocode I have so far are below (the pseudocode is the part I am having trouble with, but I will try to solve it myself if possible). Edit: sorry, I edited the post as it was confusing to read. Please ask for clarification if there is anything I have failed to explain properly. Thank you.

//I will be placing the following PHP in the confirmation page people see after making a new post, so for this example let's say the referrer header says: http://www.mysite.com/showthread.php?tid=1'

Code: [Select]
$threadurl=$_SERVER['HTTP_REFERER'];

// scrape the page

Code: [Select]
$content = file_get_contents($threadurl);

// find the pattern in the source which makes it easy to find the username; the only things that change are the uid and the color

Code: [Select]
if (preg_match("/\b uid=792"><span style="color:#ffcc00">fapafap</span></a>\b /i", $content)) {

//extract username from this string search
Don't know!

//copy the username (in this case 'fapafap') to the database along with the referral ID

Code: [Select]
$query_insert="INSERT INTO newpostersdatabase(username,referrerurl) VALUES('$username','$threadurl')" ;
$result=mysql_query ( $query_insert);
if(!$result){
    die(mysql_error());
}

Thank you so much for any guidance. I know that I have totally messed up the string search as well, but my brain is too small, it seems!
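For the username question above, one possible approach is to let preg_match() capture the changing parts (the uid, the color, and the name) instead of hard-coding them. This is a minimal sketch under assumptions: the sample markup is modelled on the snippet quoted in the post, the href value "member.php?uid=792" and the helper name extractPoster() are invented for illustration, and the real forum HTML may differ.

```php
<?php
// Sample markup modelled on the quoted snippet; the href part is an assumption.
$content = '<a href="member.php?uid=792"><span style="color:#ffcc00">fapafap</span></a>';

// Hypothetical helper: returns the uid and username of the first poster link
// found in $content, or null when the pattern does not occur.
function extractPoster(string $content): ?array
{
    // "~" is used as the delimiter so the "/" in "</span>" needs no escaping.
    // Group 1: the numeric uid. Group 2: the text inside the <span>.
    $pattern = '~uid=(\d+)"><span style="color:#[0-9a-f]{3,6}">([^<]+)</span></a>~i';

    if (preg_match($pattern, $content, $m) === 1) {
        return ['uid' => (int) $m[1], 'username' => $m[2]];
    }
    return null;
}

$poster = extractPoster($content);
if ($poster !== null) {
    echo $poster['username'], "\n";
}
```

When the captured name goes into the database, a prepared statement (mysqli or PDO) is a safer choice than building the INSERT string by hand, because both the username and the referrer URL come from outside the script.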
Hi all, I'm trying to scrape the contents of a page that is behind a login screen; namely: http://my.mail.ru/apps. Here's my code. It almost works, but doesn't appear to be properly logging in -- I just get a login screen on the url download. Any ideas? Thanks much. Here's my code <?php $ch=login(); $html=downloadUrl('http://my.mail.ru/apps', $ch); echo $html; function downloadUrl($Url, $ch){ curl_setopt($ch, CURLOPT_URL, $Url); curl_setopt($ch, CURLOPT_POST, 0); curl_setopt($ch, CURLOPT_REFERER, "http://my.mail.ru/cgi-bin/login?noclear=1&page=http%3a%2f%2fmy.mail.ru%2fapps%2f"); curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0"); curl_setopt($ch, CURLOPT_HEADER, 0); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); curl_setopt($ch, CURLOPT_TIMEOUT, 10); $output = curl_exec($ch); return $output; } function login(){ $ch = curl_init(); curl_setopt($ch, CURLOPT_URL, 'http://my.mail.ru/cgi-bin/login?noclear=1&page=http%3a%2f%2fmy.mail.ru%2fapps%2f'); //login URL curl_setopt ($ch, CURLOPT_POST, 1); $postData=' page=http%3A%2F%2Fmy.mail.ru%2Fapps%2F &Login=username &Domain=mail.ru &Password=password'; curl_setopt ($ch, CURLOPT_POSTFIELDS, $postData); curl_setopt ($ch, CURLOPT_COOKIEJAR, 'cookie.txt'); curl_setopt ($ch, CURLOPT_FOLLOWLOCATION,1); curl_setopt ($ch, CURLOPT_MAXREDIRS, 10); curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1); $store = curl_exec ($ch); return $ch; } ?> Alright, so I play a browser game called Politics and War. I run an alliance that has 74 members. In that alliance we offer a bank service for all our members, but I - being the leader - am the only one who can access the bank. I have been building a site that works with the game API to gather data for members and create a dashboard. One of the features I am trying to build is allowing them to withdraw from their account instantly.
So, what I need: To be able to submit a POST request to login to the site (specifically on this page --> https://politicsandwar.com/login) with my username and password, but then I need to keep the session active and navigate to a different page (the alliance bank page). On that page I first need to scrape a value from a hidden input (token) and then I need to submit a POST request to this same page while still being logged in.
I am not asking someone to do it for me, but rather someone to help me know how to go about this. I have never submitted post requests with PHP, but I have used PHP cURL in the past. I also have made POST requests with JS, but never PHP.
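A rough outline of the flow described above, using PHP cURL: reuse one handle with a cookie jar so the session survives between requests, log in with a POST, fetch the second page, read the hidden token, and POST again. Everything site-specific here is an assumption: the form field names, the $bankUrl placeholder, and the helper names newSession(), request() and hiddenInputValue() are made up, so the usage part is left commented out.

```php
<?php
// Creates a cURL handle that stores and resends cookies, so the login
// session carries over to later requests made with the same handle.
function newSession(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects after login
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies here when the handle closes
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies and enable the cookie engine
        CURLOPT_TIMEOUT        => 15,
    ]);
    return $ch;
}

// Sends a GET (when $post is null) or a form-encoded POST and returns the body.
function request($ch, string $url, ?array $post = null): string
{
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($post === null) {
        curl_setopt($ch, CURLOPT_HTTPGET, true);
    } else {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
    }
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException(curl_error($ch));
    }
    return $body;
}

// Returns the value of the first hidden <input> with the given name, or null.
function hiddenInputValue(string $html, string $name): ?string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // silence warnings from imperfect markup
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(sprintf('//input[@type="hidden"][@name="%s"]', $name));
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $node = $nodes->item(0);
    return $node instanceof DOMElement ? $node->getAttribute('value') : null;
}

// Usage sketch; field names and $bankUrl are placeholders, not the site's real ones.
// $ch = newSession(sys_get_temp_dir() . '/session_cookies.txt');
// request($ch, 'https://politicsandwar.com/login', ['email' => '...', 'password' => '...']);
// $token = hiddenInputValue(request($ch, $bankUrl), 'token');
// request($ch, $bankUrl, ['token' => $token /* plus the withdrawal fields */]);
```

The real field names can usually be read from the login and bank forms in the browser's developer tools; keeping the same handle (or at least the same cookie file) for all three requests is what keeps the session logged in.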
Thank you so much to anyone who is able to help!

So I have a number of websites on one server, and they all work fine over http. But when I put https:// in front of the address of some of the sites, the server shows another site hosted on the same server without changing the domain name.
example:
http://www.example.com
works fine
https://www.example.com
shows another site on the server
How can I stop this?
I tried mod_rewrite rules in .htaccess, but somehow they aren't working.
Hello,
I cannot work out this one.
I am loading a css file on an https page as:
<link rel="stylesheet" type="text/css" href="/assets/fa687e60/jui/css/base/jquery-ui.css" />
But using the chrome element tool I see this error:
The page at 'https://mysite.com/deal/create' was loaded over HTTPS, but displayed insecure content from 'http://mysite.com/cs...bf9ee_1x400.png': this content should also be loaded over HTTPS.
Now I went in the css file and the code is like that:
.ui-state-highlight, .ui-widget-content .ui-state-highlight,

Anyone know how to force the URL to use SSL with www?

<rewrite>
  <rules>
    <rule name="Redirect to HTTPS" stopProcessing="true">
      <match url="(.*)" />
      <conditions>
        <add input="{HTTPS}" pattern="^OFF$" />
      </conditions>
      <action type="Redirect" url="https://www.site.com/{R:0}" redirectType="SeeOther" />
    </rule>
  </rules>
</rewrite>

Please check the sample.

I'm trying to determine the best way to provide HTTPS access to a web application that I'm building. I know that you can use the following code to redirect anyone manually accessing the http version of an https page:

Code: [Select]
if($_SERVER["HTTPS"] != "on") {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: https://" . $_SERVER["SERVER_NAME"] . $_SERVER["REQUEST_URI"]);
    exit();
}

I have also heard that you could use mod_rewrite in Apache to achieve similar results. The entire web application should use https, so I just want to make sure that I'm setting this up correctly. Feedback on the best approach or other suggestions would be very helpful. Thanks in advance.
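One way to make the PHP redirect above a little more defensive is to separate "work out the target URL" from "send the header", which also makes the logic easy to check. This is just a sketch under assumptions: httpsRedirectTarget() is a made-up helper name, and it treats a missing or "off" HTTPS value as plain http, which may not hold behind a proxy or load balancer that terminates SSL.

```php
<?php
// Hypothetical helper: returns the https:// URL to redirect to, or null when
// the request already came in over HTTPS.
function httpsRedirectTarget(array $server): ?string
{
    // $_SERVER['HTTPS'] is unset on plain http under most setups and "off" on some.
    $https = $server['HTTPS'] ?? '';
    if ($https !== '' && strtolower($https) !== 'off') {
        return null;
    }
    return 'https://' . $server['SERVER_NAME'] . ($server['REQUEST_URI'] ?? '/');
}

// Only redirect when running under a web server, not from the command line.
if (PHP_SAPI !== 'cli') {
    $target = httpsRedirectTarget($_SERVER);
    if ($target !== null) {
        header('Location: ' . $target, true, 301); // 301 = moved permanently
        exit;
    }
}
```

Doing the redirect in the web server configuration (mod_rewrite on Apache, or a rewrite rule like the IIS one above) has the advantage of covering static files as well, while the PHP version only protects pages that actually run this code.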