PHP - Data Scraping With Preg_match_all()
Hi,
I have written the following code, which scrapes price info from a website:

$url = 'http://www.mydomain.com';
$html = file_get_contents($url);
$pattern = '/<span class="price">(.*?)<\/span>/';
preg_match_all($pattern, $html, $matches);
print_r($matches);

It works well; however, I need to add the delivery cost to each array element, using a different pattern:

'/<span class="delivery">(.*?)<\/span>/';

Any idea how I can do this so that each array element has both the price and the delivery cost in a two-dimensional array?

Thanks for your advice.

Similar Tutorials

More information on the job posting. I am looking to fetch information from daily deal websites such as tuango.ca, socialliving.com, groupon.com, etc. I want to retrieve all the deals of the day for each city on each site. For example, www.tuango.ca has a deal a day in Montreal, Toronto, etc., and I want to be able to retrieve data for all the different locations within the site.

To be more clear, I want the script to fetch:
What site the deal was on
What location it was for
The title of the deal
The price of the deal
The value of the deal
The saving, as a percentage
How many were sold
The minimum number of sales before the deal becomes activated
The company that offered the deal
Company address
Company postal code
Company phone number
(There might be more categories; we will talk more if you pass this stage of the interview process.)

Once all this data is fetched, I need it to be stored automatically in a database. I need the script to run every morning at 4 am (Eastern time), because the day's deals finish at midnight and this is the only way of getting the total number of coupons sold. You'll usually see the final stats of a deal on the site's recent deals page. I want to know how a site like http://onespout.com/deals/montreal did it.
I'm not asking somebody to do it for me; I'm just asking someone to guide me in taking the right steps.

I am writing a SQL dump file and some of my fields have ' in them, like the name "Joe's Cake Shop". How should I add ' in front of ' to make it look like Joe''s Cake Shop? Also, I got the idea of adding ' in front of ' from seeing other database dumps. Can someone please explain why I should do it?

My code:

Code:
<?php
// $final is the array I am storing my scraped data in
// $final[1] - name
$inc = 1;
$data = file_get_contents('http://xxx.com');
$regex = '~<td\s+colspan="2"\s+width="350"><font\s+size="2">\s+<b>\s+(.*?) <\/b><br>(.*?) <br>(.*?),\s+(.*?)\s+<br>(.*?), (.*?)\s+<BR><BR><font\s+size="2"><img\s+src="\.\.\/images\/phone1.gif"\s+align="left"\s+hspace="4"\s+alt\s+=(.*)>\s+-\s+Phone\s+#\s+(.*?)\s+<\/font>\s+<BR>\s+<font\s+size\s+="1">~';
preg_match_all($regex, $data, $final);
$jlimit = count($final[0]);
for ($j = 0; $j < $jlimit; $j++) {
    $filename = 'cake.sql';
    $somecontent = "(".$inc.", '".$final[1][$j]."', '".$final[2][$j]."', '".$final[3][$j]."', '".$final[4][$j]."', '".$final[6][$j]."', '".$final[8][$j]."'),\n";
    if (is_writable($filename)) {
        if (!$handle = fopen($filename, 'a')) {
            echo "Cannot open file ($filename)";
            exit;
        }
        if (fwrite($handle, $somecontent) === FALSE) {
            echo "Cannot write to file ($filename)";
            exit;
        }
        echo "Success, wrote ($somecontent) to file ($filename)";
        $inc = $inc + 1;
        fclose($handle);
    } else {
        echo "The file $filename is not writable";
    }
}
?>

This topic has been moved to PHP Regex.
http://www.phpfreaks.com/forums/index.php?topic=334273.0

Hi, I'm trying to retrieve/scrape some information from a website using the class name and the tag name.
Below is the example in VB:
Dim htmL_cat As HTMLDocument
Dim objTableL_cat As Object, objDatL_cat As Object, objItemL_cat As Object, objKeyL_cat As Object
Dim intRowL_cat As Long

Set htmL_cat = New HTMLDocument

With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", "http://www.lelong.com.my/Auc/List/BrowseAll.asp", False
    .send
    htmL_cat.body.innerHTML = .responseText
End With

With htmL_cat
    Set objTableL_cat = .getElementsByClassName("CatLevel1") 'Find elements with class name first
    For Each objDatL_cat In objTableL_cat
        Set objKeyL_cat = objDatL_cat.getElementsByTagName("a") 'Next, find elements with tag name
        For Each objItemL_cat In objKeyL_cat
            Sheets("Analytics").Range("E6").Offset(intRowL_cat, 0) = objItemL_cat.innerText
            intRowL_cat = intRowL_cat + 1
        Next
    Next
End With

Set htmL_cat = Nothing
Set objTableL_cat = Nothing
Set objKeyL_cat = Nothing

How do I do the same using PHP? Thanks.

Hello again, I have some form data, which I then search through for particular code data like so:

$html2 = $_POST['fname'];
preg_match_all("/<bla>(.*)<\/bla>/", $html2, $matches40);

So the above searches for all the data between <bla>XXXXXX</bla> in $_POST. I then print the result to my page (only so I can see it while developing) using:

print_r($matches40);

This displays HTML output like so:

Code:
Array
(
    [0] => Array ( [0] => Hello [1] => My [2] => Name [3] => Is [4] => Tom )
    [1] => Array ( [0] => Hello [1] => My [2] => Name [3] => Is [4] => Tom )
)

What I am trying to do is use the preg_match_all function again to look through the array output and find data that I want to replace. E.g., if one of the values in $matches40 is 'Tom', I want to find it and replace it with 'Ben'. I spent a day searching Google but with no success. Any help?
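For the 'Tom' to 'Ben' question above, a second preg_match_all call may not be needed, because the captured values are already plain array elements. The following is only a minimal sketch: the sample string stands in for $_POST['fname'] and is an assumption, as is the lazy (.*?) quantifier, which is used here so that several <bla> tags on one line are matched separately.

```php
<?php
// Hypothetical stand-in for the form data in $_POST['fname'].
$html2 = '<bla>Hello</bla><bla>My</bla><bla>Name</bla><bla>Is</bla><bla>Tom</bla>';

// Lazy quantifier so that each <bla>...</bla> pair is matched on its own.
preg_match_all('/<bla>(.*?)<\/bla>/', $html2, $matches40);

// $matches40[1] holds only the captured text between the tags.
$values = $matches40[1];

// Replace every element that is exactly 'Tom' with 'Ben'; leave the rest untouched.
$replaced = array_map(function ($value) {
    return $value === 'Tom' ? 'Ben' : $value;
}, $values);

print_r($replaced);
```

If a substring replacement is wanted instead of an exact match, str_replace('Tom', 'Ben', $values) also accepts an array as its subject and returns an array.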
I want to find the text between "{:" and ":}"; there may be one or more instances of this. I'm using this PHP:

$str = "hello {:first_name:} ha, this is {:awesome:} haha";
$do = preg_match_all("/{:(.*):}/", $str, $matches);

This works if there is just one instance, but with more than one instance (like the above example) it returns:

first_name:} ha, this is {:awesome

I want it to return a value of first_name AND a separate value of "awesome". Ideas? Thanks.

Hello all! I am working on screen scraping a site for my son's rec league. I seem to be having a problem with the preg_match_all syntax. Here is my code:

Code:
<?php
$url = "http://www.mywebsite.com";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,'table border="1" cellpadding="1" cellspacing="0"');
$end = strpos($content,'</table>',$start) + 8;
$table = substr($content,$start,$end-$start);
preg_match_all("|<tr(.*)</tr>|U",$table,$rows);
foreach ($rows[0] as $row){
    if ((strpos($row,'<th')===false)){
        preg_match_all("|<td(.*)</td>|U",$row,$cells);
        $game_date = strip_tags($cells[0][0]);
        $game_time = strip_tags($cells[0][1]);
        $rink = strip_tags($cells[0][2]);
        $home_team = strip_tags($cells[0][3]);
        $home_score = strip_tags($cells[0][4]);
        $visiting_team = strip_tags($cells[0][5]);
        $visiting_score = strip_tags($cells[0][6]);
        echo "{$game_date} @ {$game_time} : [{$home_team}] - {$home_score} vs. [{$visiting_team}] - {$visiting_score} <br>\n";
    }
}
?>

My issue is that I am trying to get it to display the data only if the team name = x. I tried replacing preg_match_all("|<td(.*)</td>|U",$row,$cells); with preg_match_all("|Posse|U",$row,$cells); (Posse is one of the team names). No luck. Any input/thoughts? Thank you!

This topic has been moved to PHP Regex.
http://www.phpfreaks.com/forums/index.php?topic=328802.0

This is rather bothersome, as I know that the s pattern modifier should make the dot match newlines:

preg_match_all("%<p><b>(.*?)</b>%s", $html, $data);

It returns a blank array. The page data looks like this:

<p>
<b>41,910</b><br/>
Total Points
</p>

I have never had a problem like this before that I can recall, but for some reason this page is giving me issues. Maybe I'm missing something?

This topic has been moved to PHP Regex.
http://www.phpfreaks.com/forums/index.php?topic=348635.0

I am trying to use preg_match_all to find some information on a webpage. Here is what I am currently using:

<?php
$homepage = "http://www.example.com";
$page_contents1 = file_get_contents($homepage);
$names1 = preg_match_all('/<span class="video_date">(.*)</span> - <a class="b" href="/(.*)/">(.*)</a><br/>\/', $page_contents1, $matches1);
echo implode(", ", $matches1[1]);
?>

I am trying to match this piece of HTML:

<span class="video_date">Oct 21</span> - <a class="b" href="/meanwhilezealand/"> Meanwhile in New Zealand...</a><br/>

Thanks for looking!

Hi there, I have this code:

Code:
$str = "<i><font color="800080"> man </font></i><p><font color="9898989"> hi </font></p><p><font color="1111111"> cheers </font></p>";
$pattern = '/<font .*?>(.*?)<\/font>/';
if(preg_match_all($pattern, addslashes($str), $posts)){
    $i=0;
    for($i; $i < count($posts[0]); $i++){
        echo "content: " . $posts[0][$i] . "<br/>";
        echo "colour: " . $posts[1][$i] . "<br/>";
        echo "<br />";
    }
}

It doesn't work, apparently because of the addslashes, but that is really needed as the double quotes need to be escaped; consider that I'm applying this code to a larger HTML file with hundreds of double quotes to be escaped. The error message I get is:

Parse error: syntax error, unexpected T_LNUMBER in

Thanks in advance.
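On the parse error in the last post above: the unescaped double quotes inside a double-quoted PHP string literal end the string early, and addslashes() runs too late to help because the file never gets past parsing. A hedged sketch of one way around it is to delimit the literal with single quotes. The pattern below, which captures the color attribute and the tag text in separate groups, is an assumption about what the poster wanted and is not their original pattern.

```php
<?php
// Single quotes delimit the PHP string, so the double quotes in the HTML need no escaping.
$str = '<i><font color="800080"> man </font></i><p><font color="9898989"> hi </font></p><p><font color="1111111"> cheers </font></p>';

// Group 1 captures the color attribute, group 2 the text inside the tag (assumed intent).
$pattern = '/<font color="(.*?)">(.*?)<\/font>/';

$posts = [];
if (preg_match_all($pattern, $str, $matches, PREG_SET_ORDER)) {
    // PREG_SET_ORDER yields one array per match: [full match, group 1, group 2].
    foreach ($matches as $m) {
        $posts[] = ['colour' => $m[1], 'content' => trim($m[2])];
    }
}

foreach ($posts as $post) {
    echo "content: " . $post['content'] . "<br/>";
    echo "colour: " . $post['colour'] . "<br/><br/>";
}
```

When the HTML comes from a file or URL via file_get_contents(), it is ordinary runtime data rather than a string literal, so no quote escaping should be needed at all.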
I have noticed that if I run the preg_match_all function with the PREG_OFFSET_CAPTURE option to start capturing somewhere in the middle of the string, the second half of the string is searched first, returning the matching sections along with their positions, and then it goes back to the top half and returns matches from there too. Is there a way to parse only between the start point and the end of the string?

For example, I have the following: Andrew (Age 19). How would I get the content between the brackets, Age 19, using preg_match_all or a similar function? Thanks very much.

Hello, I am trying to pull the innerHTML out of this:

Code:
<a href="(.*?)">(.*?)</a>

Here is what I have:

Code:
<?php
$html = file_get_contents("http://www.businessinvestingsource.com/blcheck2.html");
preg_match_all('/<a href="(.*?)">(.*?)<\/a>/', $html, $links, PREG_SET_ORDER);
foreach ($links as $link) {
    $linkto = $link[1];
    $anchor = $link[0];
    echo "<b>Link:</b> ".$linkto."<br /><b>Anchor:</b> ".$anchor."<br /><br /> ";
}
?>

This code works, but the innerHTML is coming out as a link and I want it to come out as plain text. You can view it here: http://businessinvestingsource.com/anchorcheck2.php Can anyone help? Thank you.

I have this code:

$proname1 = preg_match_all('/<div class=("|\')agentContainer("|\')>(\n\s)<div class="strong">(\n\s)(.*?)(\n\s)<\/div>/', $html, $name1);

It puts everything between these tags into an array, but the info contains newlines and whitespace, so empty entries show up in the array. How do I strip the whitespace and newlines before the data gets to the array? The data I'm getting looks like this:

Code:
<div class="agentContainer">
<div class="strong">
Blah Blah Company
</div>

Blah Blah Company isn't showing up in the array, but I know the regex is working.

Hello,
Currently my webscraper signs into the site and pulls all the html -> perfect.
What I need to do is to loop only specific information (horses that ran)
here is my current php code
<?php
$url = 'site';
$postdata = array('username' => "username", 'password' => "password");
$ch = curl_init();
if ($ch) {
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
    curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt'); // set cookie file to given file
    curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt'); // set same file as cookie jar
    $content = curl_exec($ch);
    $headers = curl_getinfo($ch);
    curl_close($ch);
    // Debug option
    // print_r($headers);
    if ($headers['http_code'] == 200) {
        echo $content;
    }
}
?>

Here is the HTML I'm pulling:

<table width=100% border=1><tr><td class=instruction6 colspan=4><b>My Race Notes</b></td></tr> <tr><td width=90%><form action='races.php?id=7456132' method=post> <textarea name='comments' rows=2 cols=38>Type notes & press Add</textarea></td> <td width=5%><input type=submit class='weestatbutton' value='Add'></form></td></tr></table></td></tr></table><table width=100%><tr class=databreakdown2253><th><a href='races.php?id=7456132&sortby=1'>Place</a></th><th>Dist Bt</th><th>Stall</th> <th>Horse</th><th>Age</th><th><a href='races.php?id=7456132&sortby=3'>Weight</a></th><th>Headgear</th><th>OR</th><th>Trainer</th> <th><a href='races.php?id=7456132&sortby=2'>Odds</a></th><th>Jockey (Claim)</th></tr><tr><td class=databreakdown2253>1st</td><td class=databreakdown2253></td><td class=databreakdown2253>4</td> <td class=databreakdown2253><a href='horses.php?id=298745'>Telegraph (IRE)</a></td> <td class=databreakdown2253>3</td><td class=databreakdown2253>9-3</td><td class=databreakdown2253></td> <td class=databreakdown2253>57</td> <td class=databreakdown2253><a href='trainers.php?id=2448'>Evans, P D</a></td> <td class=databreakdown2253>28/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=694'>Egan, John</a> </td></tr><tr class=databreakdown18><td colspan=12>soon led, brought field 
stands side from 3f out, headed 2f out, rallied inside final furlong, bumped and led again towards finish</td></tr><tr><td class=databreakdown2253>2nd</td><td class=databreakdown2253>0.5</td><td class=databreakdown2253>3</td> <td class=databreakdown2253><a href='horses.php?id=305855'>Ecliptic Sunrise</a></td> <td class=databreakdown2253>3</td><td class=databreakdown2253>8-12td><td class=databreakdown2253></td> <td class=databreakdown2253>52</td> <td class=databreakdown2253><a href='trainers.php?id=4516'>Donovan, D</a></td> <td class=databreakdown2253>10/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=3414'>Cosgrave, Pat</a> </td></tr><tr class=databreakdown18><td colspan=12>chased leaders, challenged 2f out, led 2f out, edged right inside final furlong, rider lost whip and headed towards finish</td></tr><tr><td class=databreakdown2253>3rd</td><td class=databreakdown2253>1.5</td><td class=databreakdown2253>1</td> <td class=databreakdown2253><a href='horses.php?id=300316'>Bookmaker</a></td> <td class=databreakdown2253>4</td><td class=databreakdown2253>9-6</td><td class=databreakdown2253><a title='Blinkers worn'>Blnk</a></td> <td class=databreakdown2253>59</td> <td class=databreakdown2253><a href='trainers.php?id=933'>Bridger, J J</a></td> <td class=databreakdown2253>6/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=3848'>Carson, William</a> </td></tr><tr class=databreakdown18><td colspan=12>prominent, took keen hold, led 2f out, headed over 1f out, not much room inside final furlong, stayed on same pace</td></tr><tr><td class=databreakdown2253>4th</td><td class=databreakdown2253>1</td><td class=databreakdown2253>2</td> <td class=databreakdown2253><a href='horses.php?id=261986'>Night Trade (IRE)</a></td> <td class=databreakdown2253>7</td><td class=databreakdown2253>8-8</td><td class=databreakdown2253><a title='Cheekpieces worn'>CkPc</a></td> <td class=databreakdown2253>50</td> <td class=databreakdown2253><a href='trainers.php?id=2653'>Harris, R 
A</a></td> <td class=databreakdown2253>6/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=7348'>Hardie, Cameron</a> (3)</td></tr><tr class=databreakdown18><td colspan=12>prominent, ridden over 2f out, switched left inside final furlong, no extra close home</td></tr><tr><td class=databreakdown2253>5th</td><td class=databreakdown2253>1.5</td><td class=databreakdown2253>6</td> <td class=databreakdown2253><a href='horses.php?id=299296'>Trigger Park (IRE)</a></td> <td class=databreakdown2253>3</td><td class=databreakdown2253>8-10</td><td class=databreakdown2253></td> <td class=databreakdown2253>50</td> <td class=databreakdown2253><a href='trainers.php?id=2653'>Harris, R A</a></td> <td class=databreakdown2253>20/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=3422'>Dobbs, Pat</a> </td></tr><tr class=databreakdown18><td colspan=12>chased leaders, ridden over 2f out, one pace over 1f out, no impression</td></tr><tr><td class=databreakdown2253>6th</td><td class=databreakdown2253>2.25</td><td class=databreakdown2253>7</td> <td class=databreakdown2253><a href='horses.php?id=300337'>Port Lairge</a></td> <td class=databreakdown2253>4</td><td class=databreakdown2253>8-11</td><td class=databreakdown2253><a title='Blinkers worn'>Blnk</a></td> <td class=databreakdown2253>50</td> <td class=databreakdown2253><a href='trainers.php?id=914'>Gallagher, J</a></td> <td class=databreakdown2253>33/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=193'>Catlin, Chris</a> </td></tr><tr class=databreakdown18><td colspan=12>slowly into stride, in rear, stayed on inside final furlong, never dangerous</td></tr><tr><td class=databreakdown2253>7th</td><td class=databreakdown2253>NK</td><td class=databreakdown2253>11</td> <td class=databreakdown2253><a href='horses.php?id=289934'>Lionheart</a></td> <td class=databreakdown2253>4</td><td class=databreakdown2253>8-13</td><td class=databreakdown2253></td> <td class=databreakdown2253>59</td> <td class=databreakdown2253><a 
href='trainers.php?id=4910'>Crate, Peter</a></td> <td class=databreakdown2253>10/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=7375'>Crouch, Hector</a> (7)</td></tr><tr class=databreakdown18><td colspan=12>reared start and slowly away, held up in rear, headway over 1f out, weakened inside final furlong</td></tr><tr><td class=databreakdown2253>8th</td><td class=databreakdown2253>2.75</td><td class=databreakdown2253>14</td> <td class=databreakdown2253><a href='horses.php?id=289421'>Koharu</a></td> <td class=databreakdown2253>4</td><td class=databreakdown2253>9-4</td><td class=databreakdown2253><a title='Cheekpieces worn'>CkPc</a></td> <td class=databreakdown2253>60</td> <td class=databreakdown2253><a href='trainers.php?id=2495'>Makin, P J</a></td> <td class=databreakdown2253>9/4 (Fav) </td> <td class=databreakdown2253><a href='jockeys.php?id=5952'>Bates, Mr D J</a> (3)</td></tr><tr class=databreakdown18><td colspan=12>in rear, ridden over 3f out, no impression</td></tr><tr><td class=databreakdown2253>9th</td><td class=databreakdown2253>3</td><td class=databreakdown2253>5</td> <td class=databreakdown2253><a href='horses.php?id=269827'>Saskias Dream</a></td> <td class=databreakdown2253>6</td><td class=databreakdown2253>9-6</td><td class=databreakdown2253><a title='Visor worn'>Vsor</a></td> <td class=databreakdown2253>59</td> <td class=databreakdown2253><a href='trainers.php?id=2002'>Chapple-Hyam, Jane</a></td> <td class=databreakdown2253>4/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=3544'>Hughes, Richard</a> </td></tr><tr class=databreakdown18><td colspan=12>mid-division, headway and switched left over 1f out, edged left entering final furlong, soon eased</td></tr><tr><td class=databreakdown2253>10th</td><td class=databreakdown2253>1.75</td><td class=databreakdown2253>12</td> <td class=databreakdown2253><a href='horses.php?id=304248'>Crafty Business (IRE)</a></td> <td class=databreakdown2253>3</td><td class=databreakdown2253>9-2</td><td 
class=databreakdown2253><a title='Visor worn'>Vsor</a></td> <td class=databreakdown2253>59</td> <td class=databreakdown2253><a href='trainers.php?id=695'>Moore, G L</a></td> <td class=databreakdown2253>14/1 </td> <td class=databreakdown2253><a href='jockeys.php?id=6669'>Bishop, Mr C</a> (3)</td></tr><tr class=databreakdown18><td colspan=12>towards rear, pushed along over 3f out, well beaten 2f out</td></tr></table><br><hr></td></tr></table>

*Note: I'm using this for personal reasons.

Hello All, I have been wrestling with a regex for a couple of hours now and I finally had to give in and ask for help. The weird thing is that it works if there are no newlines in the text, and fails if one or more newlines are present. The code:

$matches = array();
$pattern = '~\[CUSTOM_TAG(.*?)\](.*?)\[/CUSTOM_TAG\]~';
preg_match_all($pattern, $html, $matches);
if (!empty($matches[0])) {
    foreach ($matches[0] as $code) {
        $parameter = preg_replace($pattern, '$1', $code);
        $content = preg_replace($pattern, '$2', $code); // get the content between the pattern
    } // foreach ($matches[0] as $code)
} else {
    echo 'Match failed';
} // if (!empty($matches[0]))

With that code in mind, if the $html variable (the text to be processed) is:

$html = '<h1>Hello, world!</h1><p style="color:#ff0000;">Some red text</p>';

a match is found. If the $html variable is:

$html = '<h1>Hello, world!</h1>
<p style="color:#ff0000;">Some red text</p>';

no match is found. Hopefully I'm just missing something simple in my regex. Thanks in advance!
Twitch

I need some help scraping links from a specified page. For example, if I have a page like http://br.4ce.info/, I want to scrape all the links on that page and show them in a WordPress widget on another blog. Can you help me with this? I don't want to use an iframe; I think it is better to use cURL. Thanks.
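For the link-scraping question above, one possible approach is to parse the fetched page with PHP's DOMDocument rather than a regex. This is a sketch under stated assumptions: the sample markup and the extract_links() helper are hypothetical, and in practice $html would be the string returned by curl_exec() with CURLOPT_RETURNTRANSFER enabled.

```php
<?php
// Hypothetical sample markup standing in for the page fetched with cURL.
$html = '<html><body><a href="http://example.com/a">First</a> <p><a href="/b">Second</a></p><a name="x">no href</a></body></html>';

/**
 * Return the href and text of every <a> element that has an href attribute.
 * extract_links() is an illustrative helper name, not part of PHP.
 */
function extract_links(string $html): array
{
    // Collect parser warnings internally, since real-world HTML is rarely valid.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = [
                'href' => $a->getAttribute('href'),
                'text' => trim($a->textContent),
            ];
        }
    }
    return $links;
}

$links = extract_links($html);
foreach ($links as $link) {
    echo $link['href'] . ' - ' . $link['text'] . "\n";
}
```

Relative URLs such as /b are returned as written, so a widget that displays them on another site would probably need to prefix the source site's base URL.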