PHP - Scraping: Need Some Pointers
I need to scrape pages. I only need one page at a time, and I'm only looking for 2 or 3 bits of data within each page.
Can someone give me some pointers on where to start? I've searched and seen names like DOMXPath and XPath mentioned. Do I need these?
It's important that I can run the script on standard Linux hosting with nothing extra installed, such as packages. I'd like something I can use immediately with standard PHP and its built-in functions.
I've seen plenty of tutorials and YouTube videos; I'm just looking for recommendations and pointers on recommended practices.
Thanks
OM

Similar Tutorials

I am working on a forum script and I want to allow users to upload a photo album and pictures. I have everything in place, including all the database info. The problem is that I need a page that is considered a static page. Let's say I use the following code in images.php:
Code: [Select]
<a href="view_images.php?=<?php echo $images_id; ?>"><?php echo $images_name; ?></a>
This link should take me to that address, but the layout is designed in view_images. This is the code located in the view_albums page to load the picture from the link:
Code: [Select]
<IMG SRC="/images/<?php echo $images_id; ?>/<?php echo $images_name; ?>">
Now I'm not getting the image; it shows just the first one located in the database. I can change the img src line to any id number and name and it will work for that image only. How can I make this work the right way? Please respond ASAP, as this is for a client of mine.
Thanks in advance, Mike

Hi everybody. I have this idea and I need some pointers to get started. I am new to PHP and have therefore been searching around for the best way to solve my problem. I want to create a site where you can register your username. Once you have done that, I want you to be able to make a series of choices, for example different colors of your choice. Then I want those choices to be saved in my database. Next, I want my site to be able to compare the choices of two users.
Example: If user one selected blue, red, and yellow, and user two selected blue, red, and orange, I want the site to display that their common colors are red and blue. I know that what I'm asking is a lot, and I would therefore be grateful if someone could point me in some direction on where to get started.
Regards, Kristoffer

Being fresh to PHP, I installed PHPMailer. (I'm not sure how to install classes, so I dropped the PHPMailer files into my PEAR folder, since that's where all my extensions go in XAMPP.) Then I ran a basic script to send email using the functions offered by PHPMailer instead of mail(), but when I execute my code by going to localhost, Firefox just displays my PHP code right back at me.

Hello,
Currently my web scraper signs into the site and pulls all the HTML, which works perfectly.
What I need to do is loop over only the specific information (the horses that ran).
Here is my current PHP code:
<?php
$url = 'site';
$postdata = array('username' => "username", 'password' => "password");

$ch = curl_init();
if ($ch) {
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
    curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt'); // set cookie file to given file
    curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');  // set same file as cookie jar

    $content = curl_exec($ch);
    $headers = curl_getinfo($ch);
    curl_close($ch);

    // Debug option
    // print_r($headers);

    if ($headers['http_code'] == 200) {
        echo $content;
    }
}
?>
Here is the HTML I'm pulling:
<table width=100% border=1><tr><td class=instruction6 colspan=4><b>My Race Notes</b></td></tr> <tr><td width=90%><form action='races.php?id=7456132' method=post> <textarea name='comments' rows=2 cols=38>Type notes & press Add</textarea></td> <td width=5%><input type=submit class='weestatbutton' value='Add'></form></td></tr></table></td></tr></table><table width=100%><tr class=databreakdown2253><th><a href='races.php?id=7456132&sortby=1'>Place</a></th><th>Dist Bt</th><th>Stall</th> <th>Horse</th><th>Age</th><th><a href='races.php?id=7456132&sortby=3'>Weight</a></th><th>Headgear</th><th>OR</th><th>Trainer</th> <th><a href='races.php?id=7456132&sortby=2'>Odds</a></th><th>Jockey (Claim)</th></tr><tr><td class=databreakdown2253>1st</td><td class=databreakdown2253></td><td class=databreakdown2253>4</td> <td class=databreakdown2253><a href='horses.php?id=298745'>Telegraph (IRE)</a></td> <td class=databreakdown2253>3</td><td class=databreakdown2253>9-3</td><td class=databreakdown2253></td> <td class=databreakdown2253>57</td> <td class=databreakdown2253><a href='trainers.php?id=2448'>Evans, P D</a></td> <td class=databreakdown2253>28/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=694'>Egan, John</a> </td></tr><tr class=databreakdown18><td colspan=12>soon led, brought field
stands side from 3f out, headed 2f out, rallied inside final furlong, bumped and led again towards finish</td></tr><tr><td class=databreakdown2253>2nd</td><td class=databreakdown2253>0.5</td><td class=databreakdown2253>3</td> <td class=databreakdown2253><a href='horses.php?id=305855'>Ecliptic Sunrise</a></td> <td class=databreakdown2253>3</td><td class=databreakdown2253>8-12td><td class=databreakdown2253></td> <td class=databreakdown2253>52</td> <td class=databreakdown2253><a href='trainers.php?id=4516'>Donovan, D</a></td> <td class=databreakdown2253>10/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=3414'>Cosgrave, Pat</a> </td></tr><tr class=databreakdown18><td colspan=12>chased leaders, challenged 2f out, led 2f out, edged right inside final furlong, rider lost whip and headed towards finish</td></tr><tr><td class=databreakdown2253>3rd</td><td class=databreakdown2253>1.5</td><td class=databreakdown2253>1</td> <td class=databreakdown2253><a href='horses.php?id=300316'>Bookmaker</a></td> <td class=databreakdown2253>4</td><td class=databreakdown2253>9-6</td><td class=databreakdown2253><a title='Blinkers worn'>Blnk</a></td> <td class=databreakdown2253>59</td> <td class=databreakdown2253><a href='trainers.php?id=933'>Bridger, J J</a></td> <td class=databreakdown2253>6/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=3848'>Carson, William</a> </td></tr><tr class=databreakdown18><td colspan=12>prominent, took keen hold, led 2f out, headed over 1f out, not much room inside final furlong, stayed on same pace</td></tr><tr><td class=databreakdown2253>4th</td><td class=databreakdown2253>1</td><td class=databreakdown2253>2</td> <td class=databreakdown2253><a href='horses.php?id=261986'>Night Trade (IRE)</a></td> <td class=databreakdown2253>7</td><td class=databreakdown2253>8-8</td><td class=databreakdown2253><a title='Cheekpieces worn'>CkPc</a></td> <td class=databreakdown2253>50</td> <td class=databreakdown2253><a href='trainers.php?id=2653'>Harris, R 
A</a></td> <td class=databreakdown2253>6/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=7348'>Hardie, Cameron</a> (3)</td></tr><tr class=databreakdown18><td colspan=12>prominent, ridden over 2f out, switched left inside final furlong, no extra close home</td></tr><tr><td class=databreakdown2253>5th</td><td class=databreakdown2253>1.5</td><td class=databreakdown2253>6</td> <td class=databreakdown2253><a href='horses.php?id=299296'>Trigger Park (IRE)</a></td> <td class=databreakdown2253>3</td><td class=databreakdown2253>8-10</td><td class=databreakdown2253></td> <td class=databreakdown2253>50</td> <td class=databreakdown2253><a href='trainers.php?id=2653'>Harris, R A</a></td> <td class=databreakdown2253>20/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=3422'>Dobbs, Pat</a> </td></tr><tr class=databreakdown18><td colspan=12>chased leaders, ridden over 2f out, one pace over 1f out, no impression</td></tr><tr><td class=databreakdown2253>6th</td><td class=databreakdown2253>2.25</td><td class=databreakdown2253>7</td> <td class=databreakdown2253><a href='horses.php?id=300337'>Port Lairge</a></td> <td class=databreakdown2253>4</td><td class=databreakdown2253>8-11</td><td class=databreakdown2253><a title='Blinkers worn'>Blnk</a></td> <td class=databreakdown2253>50</td> <td class=databreakdown2253><a href='trainers.php?id=914'>Gallagher, J</a></td> <td class=databreakdown2253>33/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=193'>Catlin, Chris</a> </td></tr><tr class=databreakdown18><td colspan=12>slowly into stride, in rear, stayed on inside final furlong, never dangerous</td></tr><tr><td class=databreakdown2253>7th</td><td class=databreakdown2253>NK</td><td class=databreakdown2253>11</td> <td class=databreakdown2253><a href='horses.php?id=289934'>Lionheart</a></td> <td class=databreakdown2253>4</td><td class=databreakdown2253>8-13</td><td class=databreakdown2253></td> <td class=databreakdown2253>59</td> <td class=databreakdown2253><a 
href='trainers.php?id=4910'>Crate, Peter</a></td> <td class=databreakdown2253>10/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=7375'>Crouch, Hector</a> (7)</td></tr><tr class=databreakdown18><td colspan=12>reared start and slowly away, held up in rear, headway over 1f out, weakened inside final furlong</td></tr><tr><td class=databreakdown2253>8th</td><td class=databreakdown2253>2.75</td><td class=databreakdown2253>14</td> <td class=databreakdown2253><a href='horses.php?id=289421'>Koharu</a></td> <td class=databreakdown2253>4</td><td class=databreakdown2253>9-4</td><td class=databreakdown2253><a title='Cheekpieces worn'>CkPc</a></td> <td class=databreakdown2253>60</td> <td class=databreakdown2253><a href='trainers.php?id=2495'>Makin, P J</a></td> <td class=databreakdown2253>9/4 (Fav) </td> <td class=databreakdown2253><a href='jockeys.php?id=5952'>Bates, Mr D J</a> (3)</td></tr><tr class=databreakdown18><td colspan=12>in rear, ridden over 3f out, no impression</td></tr><tr><td class=databreakdown2253>9th</td><td class=databreakdown2253>3</td><td class=databreakdown2253>5</td> <td class=databreakdown2253><a href='horses.php?id=269827'>Saskias Dream</a></td> <td class=databreakdown2253>6</td><td class=databreakdown2253>9-6</td><td class=databreakdown2253><a title='Visor worn'>Vsor</a></td> <td class=databreakdown2253>59</td> <td class=databreakdown2253><a href='trainers.php?id=2002'>Chapple-Hyam, Jane</a></td> <td class=databreakdown2253>4/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=3544'>Hughes, Richard</a> </td></tr><tr class=databreakdown18><td colspan=12>mid-division, headway and switched left over 1f out, edged left entering final furlong, soon eased</td></tr><tr><td class=databreakdown2253>10th</td><td class=databreakdown2253>1.75</td><td class=databreakdown2253>12</td> <td class=databreakdown2253><a href='horses.php?id=304248'>Crafty Business (IRE)</a></td> <td class=databreakdown2253>3</td><td class=databreakdown2253>9-2</td><td 
class=databreakdown2253><a title='Visor worn'>Vsor</a></td> <td class=databreakdown2253>59</td> <td class=databreakdown2253><a href='trainers.php?id=695'>Moore, G L</a></td> <td class=databreakdown2253>14/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=6669'>Bishop, Mr C</a> (3)</td></tr><tr class=databreakdown18><td colspan=12>towards rear, pushed along over 3f out, well beaten 2f out</td></tr></table><br><hr></td></tr></table>
*Note: I'm using this for personal reasons.

Okay, so I am scraping websites for their descriptions, keywords, and titles. I noticed that a lot of websites use the same keywords and descriptions on every page, so my idea is to scrape the index, find all the links in there, and scrape them all. After they have been scraped, I would check all of the descriptions, and if the descriptions match, pull some text unique to each page and use that instead. I can't seem to wrap my head around it. How would I accomplish this? I scrape with cURL, find the keywords, description, and title, then find all links on the site and scrape those. I was thinking of making an array of the descriptions and then checking it before inserting into the DB, but that doesn't seem like it would work. Any ideas? Also, how would I grab just the text from each page that is different from every other page? Very confusing.

I need some help scraping links from a specified page. For example, if I have a page like http://br.4ce.info/ I want to scrape all the links on that page and show them in a WordPress widget on another blog. Can you help me with this? I don't want to use an iframe; I think cURL would be better. Thanks.

More information on the job posting. I am looking to fetch information from daily deal websites such as tuango.ca, socialliving.com, groupon.com, etc. I want to retrieve data from different daily deal sites, and I want to retrieve all the deals of the day from each different city on the website.
For example, www.tuango.ca has a deal a day in Montreal, Toronto, etc. I want to be able to retrieve data from all the different locations within the site. I want the script to fetch the deal data. To be more clear, I want the script to fetch:
- What site the deal was on
- What location it was for
- What the title of the deal is
- What the price of the deal is
- What the value of the deal is
- What the saving in percentage of the deal is
- How many were sold
- What the minimum amount of the deal is before it becomes activated
- What company did the deal
- Company address
- Company postal code
- Company phone number
(There might be more categories; we will talk more if you pass this stage of the interview process.)
Once all this data is fetched, I need it to be stored in a database automatically. I need the script to run every morning at 4 am (Eastern time), because the day's deals finish at midnight and it's the only way of getting the total number of coupons sold. You'll usually see the final stats of the deal on the recent deals page of the website. I want to know how a site like http://onespout.com/deals/montreal did it. I'm not asking somebody to do it for me; I'm just asking someone to guide me in taking the right steps.

Hi, I'm trying to work out a way to get the New York Lottery's Take 5 results. There are a few sites that list the winning numbers, I assume automatically, as there are a lot of lottery games on these sites. What would be the best way to get this?
http://www.myfreepost.com/lottery/index.php/us/newyorklottery/takefive/result/
http://www.elite-lottery-results.com/?action=view_game&gid=NY2

OK, I know how to screen scrape, but I don't know how to screen scrape when there is a login. I've been looking this up for a while, but no luck. I'd also like the script to be able to request a URL while it is logged in, for example: http://site.com/data.php?id=9912&submit=1
Thanks in advance.
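For the login question just above, one common approach is to let cURL keep the session cookie in a cookie file and reuse the same handle for the follow-up request. This is only a minimal sketch, assuming the site uses a plain form login. The function names (session_handle, login, fetch_logged_in), the login URL, and the form field names are all hypothetical and would have to match the real login form:

```php
<?php
// Sketch only: URLs and form field names used with these helpers are assumptions.

/**
 * Create a cURL handle that saves and re-sends cookies through $cookieFile,
 * so a session started by one request is reused by the next one.
 */
function session_handle(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the page instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the redirect many logins issue
        CURLOPT_CONNECTTIMEOUT => 15,
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle closes
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here (enables the cookie engine)
    ]);
    return $ch;
}

/** POST the login form; returns false if the request itself failed. */
function login($ch, string $loginUrl, array $fields): bool
{
    curl_setopt_array($ch, [
        CURLOPT_URL        => $loginUrl,
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query($fields),
    ]);
    return curl_exec($ch) !== false;
}

/** GET a page with the same handle, so the session cookie is sent along. */
function fetch_logged_in($ch, string $url)
{
    curl_setopt_array($ch, [
        CURLOPT_URL     => $url,
        CURLOPT_HTTPGET => true, // switch the handle back from POST to GET
    ]);
    return curl_exec($ch);
}

// Hypothetical usage (not executed here):
// $ch = session_handle(__DIR__ . '/cookies.txt');
// if (login($ch, 'http://site.com/login.php', ['username' => 'me', 'password' => 'secret'])) {
//     echo fetch_logged_in($ch, 'http://site.com/data.php?id=9912&submit=1');
// }
// curl_close($ch);
```

A request succeeding does not prove the login worked, so it is worth checking the returned HTML for something that only appears when logged in. Sites that add hidden tokens to the login form need those fields scraped from the form page first.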
I am a newbie and I am trying to do a site scraping project to obtain all of the following fields: Test Year, Test Name, Grade Level, Question #, Question Type, Reporting Category, Standard #, Standard description, and Example Question (with image), from this website, which has a page for each question:
http://www.doe.mass.edu/mcas/search/question.aspx?mcasyear=2010&QuestionSetID=1&grade=8&subjectcode=MTH&questionnumber=36
I am new to PHP and would love it if you could point me in the right direction. The page uses tables, and I need to extract the data from the body of the page as well as some of the info from the URL, and then have it inserted into a MySQL database. Thank you so much for your help.

Hi, I have written the following code, which scrapes price info from a website:
$url = 'http://www.mydomain.com';
$html = file_get_contents($url);
$pattern = '/<span class="price">(.*?)<\/span>/';
preg_match_all($pattern, $html, $matches);
print_r($matches);
It works well; however, I need to add the delivery cost to each array element using a different pattern:
'/<span class="delivery">(.*?)<\/span>/';
Any idea how I can do this so that each array element has both the price and the delivery cost in a two-dimensional array? Thanks for your advice.

I'm looking to scrape the schedule details for any particular class at my university as part of a school project. I have been able to log a student into the university site and grab their name and course information. In order to grab the schedule for a particular class, I now have to visit a different area of the university site, the registrar. The course schedule section of the registrar is written in ASP.NET, and I'm having trouble making HTTP requests to this area of the site. I understand the need to make POST requests that mimic the Viewstate, but I'm running into an issue before I even get to that part. I am able to load the page via an HTTP request almost every time, but it always takes almost exactly 2 minutes.
I have tried simple get requests, post requests with the Viewstate, and other variations to one of a few different pages on the site. Each time it works. But each time it takes 2 minutes. Any ideas why it takes so long? Any suggestions on what I can possibly do differently?
Here is the basic site I'm using to test my code on before implementing it fully into my program: University Site
Here is my link that takes 2 minutes to load the same page: My Site
Here is my latest code I've tried:
Code: [Select]
<?php
$postdata = "__VIEWSTATE=/wEPDwULLTIwNjY2MzUzMDEPZBYCAgUPDxYCHgRUZXh0BRNNYXIgMjMgMjAxMSAgNzoxNVBNZGQYAQUeX19Db250cm9sc1JlcXVpcmVQb3N0QmFja0tleV9fFgEFDmN0bDEwJGltZ0xvZ2luaYy4H4gz+Bjb4GVdsO1ecd9c9EA=";
$postdata .= "&__EVENTVALIDATION=/wEWAgKs/IaWBAKpyP2zAXWcNEO0tMqDX53r6m+Hzo/nKHwZ";
$postdata = urlencode($postdata);

$host = 'courseschedules.njit.edu';
$path = '/index.aspx';

$fp1 = fsockopen($host,80,$errno,$errstr,30);
if(!$fp1) die($_err.$errstr.$errno);
else {
    fputs($fp1, "POST $path HTTP/1.1\r\n");
    fputs($fp1, "Host: $host\r\n");
    fputs($fp1, "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15 ( .NET CLR 3.5.30729)\r\n");
    fputs($fp1, "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n");
    fputs($fp1, "Accept-Language: en-us,en;q=0.5\r\n");
    fputs($fp1, "Accept-Encoding: gzip,deflate\r\n");
    fputs($fp1, "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n");
    fputs($fp1, "Keep-Alive: 115\r\n");
    fputs($fp1, "Connection: keep-alive\r\n");
    fputs($fp1, "Content-length: ".strlen($postdata)."\r\n\r\n");
    fputs($fp1, $postdata."\r\n\r\n");

    $response = '';
    while(!feof($fp1)) $response .= fgets($fp1,2000);
    fclose($fp1);
    echo $response;
}
?>
Like I said, I've also tried a standard get request which works as well, just takes 2 minutes.
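One possible explanation for the 2-minute delay, offered as a guess that has not been tested against that site: the request asks for "Connection: keep-alive", and the loop reads until feof(), so the script may simply sit there until the server gives up on the idle connection and closes it. Two other details in the code above are worth a look: urlencode() is applied to the whole string, which also encodes the = and & separators, and "Accept-Encoding: gzip,deflate" invites a compressed response that the script never decompresses. The sketch below shows a request built with "Connection: close" and per-field encoding. The function names build_post_request and send_request are hypothetical, and HTTP/1.0 is used here only to avoid chunked responses:

```php
<?php
// Sketch under assumptions: names are illustrative, and the real page may need more headers.

/**
 * Build a raw form POST that asks the server to close the connection
 * when the response is complete, so a read-until-EOF loop can finish.
 */
function build_post_request(string $host, string $path, array $fields): string
{
    // http_build_query() encodes each name and value but keeps '=' and '&' as separators.
    $body = http_build_query($fields);

    return "POST $path HTTP/1.0\r\n"
         . "Host: $host\r\n"
         . "Content-Type: application/x-www-form-urlencoded\r\n"
         . "Content-Length: " . strlen($body) . "\r\n"
         . "Connection: close\r\n"
         . "\r\n"
         . $body;
}

/** Send the request over a plain socket and return headers plus body. */
function send_request(string $host, string $request, int $timeout = 30): string
{
    $fp = fsockopen($host, 80, $errno, $errstr, $timeout);
    if (!$fp) {
        throw new RuntimeException("Connection failed: $errstr ($errno)");
    }
    fwrite($fp, $request);
    $response = stream_get_contents($fp); // reads until the server closes the socket
    fclose($fp);
    return $response === false ? '' : $response;
}

// Hypothetical usage (not executed here); the field values come from the form page:
// $request = build_post_request('courseschedules.njit.edu', '/index.aspx', [
//     '__VIEWSTATE'       => $viewstate,
//     '__EVENTVALIDATION' => $eventValidation,
// ]);
// echo send_request('courseschedules.njit.edu', $request);
```

If the cURL extension is available, it handles connection reuse, compression, and chunked responses on its own, which sidesteps most of this.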

Could anyone point me in the right direction for downloading app store statistics from:
- the Apple App Store
- the Android Market
- the Amazon AppStore
Specifically, I'd like to get:
- the average selling price of apps
- the top selling apps
- the distribution of tablets vs. phones, etc. (e.g. how many apps are there for Honeycomb? How many for iPad?)
- the total number of apps in the store
- free vs. paid apps
I've seen some sites like http://148apps.biz/app-store-metrics/ and http://www.appbrain.com/stats/
How do these sites get their data? There must be a way to export the whole app store database as a CSV file, or import it into MySQL and run queries. Thanks much for any direction.

Hello, I have checked out many scripts and tried implementing them to help me scrape a single image from a URL, for example www.123.com/333.png. Getting a script to scrape that image isn't the problem. I'm not sure how to implement a simple cURL script that saves the image every 30 minutes and names the files in successive order, so they appear as 1.jpg, 2.jpg, 3.jpg. I am working with a Debian 6 server, and PHP would be the easiest way for me to do this. I have searched the web endlessly and still can't produce such a thing. Any help is appreciated.

I'm trying to pull a stock quote's Beta from Yahoo Finance, since the Yahoo Query Language doesn't support it. My code returns an empty array. Any ideas why?
Code: [Select]
<?php
$content = file_get_contents('http://finance.yahoo.com/q?s=NFLX');
preg_match('#<tr><th width="48%" scope="row">Beta:</th><td class="yfnc_tabledata1">(.*)</td></tr>#', $content, $match);
print_array($match);
?>

I am writing a SQL dump file, and some of my fields have ' in them; for example, a name is "Joe's Cake Shop". How should I add a ' in front of each ' to make it look like Joe''s Cake Shop? I got the idea of adding a ' in front of each ' from seeing other database dumps. Can someone please enlighten me as to why I should do it?
My code:
Code: [Select]
<?php
// $final - the array I am storing my scraped data in
// $final[1] - name
$inc = 1;
$data = file_get_contents('http://xxx.com');
$regex = '~<td\s+colspan="2"\s+width="350"><font\s+size="2">\s+<b>\s+(.*?) <\/b><br>(.*?) <br>(.*?),\s+(.*?)\s+<br>(.*?), (.*?)\s+<BR><BR><font\s+size="2"><img\s+src="\.\.\/images\/phone1.gif"\s+align="left"\s+hspace="4"\s+alt\s+=(.*)>\s+-\s+Phone\s+#\s+(.*?)\s+<\/font>\s+<BR>\s+<font\s+size\s+="1">~';
preg_match_all($regex, $data, $final);
$jlimit = count($final[0]);
for ($j = 0; $j < $jlimit; $j++) {
    $filename = 'cake.sql';
    $somecontent = "(".$inc.", '".$final[1][$j]."', '".$final[2][$j]."', '".$final[3][$j]."', '".$final[4][$j]."', '".$final[6][$j]."', '".$final[8][$j]."'),\n";
    if (is_writable($filename)) {
        if (!$handle = fopen($filename, 'a')) {
            echo "Cannot open file ($filename)";
            exit;
        }
        if (fwrite($handle, $somecontent) === FALSE) {
            echo "Cannot write to file ($filename)";
            exit;
        }
        echo "Success, wrote ($somecontent) to file ($filename)";
        $inc = $inc + 1;
        fclose($handle);
    } else {
        echo "The file $filename is not writable";
    }
}
?>

I have a form which lets the user put in the URL of their Twitter account. When they enter their URL, I am trying to create a screen scraping script that scrapes that page to get basic information like their Twitter name and number of tweets. I'm not sure how I am going to do this. I don't think there is a Twitter API for this, so I may have to use something like cURL. I was just wondering if anyone has done this and could give me any advice about the best method. Thanks for any help.

Hi, I have a lottery blog and I want to automate posting lottery results to my website instead of entering them manually. Is there any way to do it? I am not a technical expert. Please guide me step by step so I can do it easily.
Own website: https://uk49sresult.co.uk/
Scraping website: https://49s.co.uk//49s/
Edited May 26 by Johnwilliam
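
Coming back to the original question at the top of the thread (and it applies to the lottery results post as well): DOMDocument and DOMXPath belong to PHP's DOM extension, which is enabled by default in standard PHP builds, although a host can switch it off. Here is a minimal sketch of pulling 2 or 3 values out of one page with them. The function name extract_fields, the sample HTML, and the XPath queries are made up for illustration; a real page needs queries written against its own markup:

```php
<?php
// Sketch only: the HTML and the XPath expressions below are invented examples.

/**
 * Run each named XPath query against the HTML and return the trimmed text
 * of the first matching node, or null when nothing matches.
 */
function extract_fields(string $html, array $queries): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($doc);
    $result = [];
    foreach ($queries as $name => $query) {
        $nodes = $xpath->query($query);
        $result[$name] = ($nodes !== false && $nodes->length > 0)
            ? trim($nodes->item(0)->textContent)
            : null;
    }
    return $result;
}

// Stand-in for a fetched page; in practice $html would come from cURL or file_get_contents().
$html = '<html><body><h1>Blue Widget</h1>'
      . '<span class="price">9.99</span>'
      . '<span class="delivery">2.50</span></body></html>';

$data = extract_fields($html, [
    'title'    => '//h1',
    'price'    => '//span[@class="price"]',
    'delivery' => '//span[@class="delivery"]',
]);

print_r($data);
```

Compared with regular expressions, XPath queries tend to survive small markup changes such as reordered attributes or extra whitespace. Fetching the page with file_get_contents() only works if the host has allow_url_fopen enabled; otherwise cURL is the usual fallback.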