PHP - Extract Text And Strip It
Hey folks, I am trying to create a small script that will retrieve content from a site, strip it of everything but human-readable words, then remove numbers, single letters, and words that I specify. I have the following code, which is live on http://salesleadhq.com/tools/crawler/meta.php?url=http://www.cooking.com. My problem is that it is not removing all of the words I specify, only some. I think I would rather use an external word list as well, if anyone can assist me with that. Thank you!

Code: [Select]
<?php
$url = (isset($_GET['url']) ? $_GET['url'] : 0);
$str = file_get_contents($url);

####################################################################
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt ($crl, CURLOPT_URL,$url);
    curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

#--------------------------------------Strip html tag----------------------------------------------------
function StripHtmlTags( $text )
{
    // PHP's strip_tags() function will remove tags, but it
    // doesn't remove scripts, styles, and other unwanted
    // invisible text between tags. Also, as a prelude to
    // tokenizing the text, we need to insure that when
    // block-level tags (such as <p> or <div>) are removed,
    // neighboring words aren't joined.
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
            // Add line breaks before & after blocks
            '@<((br)|(hr))@iu',
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        ),
        $text
    );
    // Remove all remaining tags and comments and return.
    return strtolower( $text );
}

function RemoveComments( & $string )
{
    $string = preg_replace("%(#|;|(//)).*%","",$string);
    $string = preg_replace("%/\*(?:(?!\*/).)*\*/%s","",$string); // google for negative lookahead
    return $string;
}

$html = StripHtmlTags($str);

###Remove number in html################
$html = preg_replace("/[0-9]/", " ", $html); #replace by ' '
$html = str_replace(" ", " ", $html);

######remove any words################
$remove_word = array(
    "amp","carry","serious","for","re","looking","accessories","you","used","wright","none","selection","come","second","you","new",
    "a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at",
    "be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every",
    "for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its",
    "just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not",
    "of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some",
    "than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us",
    "wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"
);
foreach($remove_word as $word) {
    $html = preg_replace("/\s". $word ."\s/", " ", $html);
}

######remove space
$html = preg_replace ('/<[^>]*>/', '', $html);
$html = preg_replace('/\s\s+/', ', ', $html);
$html = preg_replace('/[\s\W]+/',', ',$html); // Strip off spaces and non-alpha-numeric
#remove white space, Keep : . ( ) : &
//$html = preg_replace('/\s+/', ', ', $html);

###process#########################################################################
$array_loop = explode(",", $html);
$array_loop1 = $array_loop;
$arr_tem = array();
foreach($array_loop as $key=>$val) {
    if(in_array($val, $array_loop1)) {
        if(!$arr_tem[$val]) $arr_tem[$val] = 0;
        $arr_tem[$val] += 1;
        if ( ($k = array_search($val, $array_loop1) ) !== false )
            unset($array_loop1[$k]);
    }
}
arsort($arr_tem);

###echo top 20 words############################################################
echo "<h3>Top 20 words used most</h3>";
$i = 1;
foreach($arr_tem as $key=>$val) {
    if($i<=20) {
        echo $i.": ".$key." (".$val." words)<br />";
        $i++;
    }else break;
}
echo "<hr />";

###print array#####################################################################
echo (implode(", ", array_keys($arr_tem)));
?>

Similar Tutorials

Hi All, bit of a strange one, but I would like to be able to supply a URL to a page. This page will always contain an image and the copyright that goes with it, for example http://www.geograph.org.uk/photo/693325. The copyright lies underneath. I would like some PHP code that would automatically grab the image, copy it to a directory on my site, and also take the Creative Commons copyright notice as a string (which I will then display alongside the image when I add it to my site). How can I do this through PHP? I know that the word "copyright" only ever appears once on the page (as part of the bit I'm trying to grab), so can I use this somehow to grab the whole string? Basically I'm being lazy and would like to automate the process of grabbing the image and copyright without having to download it to my computer first and re-upload it to my server (as I will be doing this quite a lot). Any ideas much appreciated. Thanks

For example, I am using this code:
Code: [Select]
$myFile = "newuser.txt";
$fh = fopen($myFile, 'r');
$theData = fread($fh, 5);
fclose($fh);
echo $theData;
and it displays:
Code: [Select]
Bob 2
which I am reading from my newuser.txt file. It corresponds to the username Bob, who has the ID 2. Now I want to make that linkable like this:
Code: [Select]
<a href=.?act=Profile&id=$IDFROMTEXTFILE(2)>$NAMEFROMTEXTFILE(BOB)</a>
Is this possible? If so, thanks!

Hello. I have one programming problem. I have this log, from which I have to read a specific area of text:
webtopay.log
OK 123.456.7.89 [2012-03-15 09:09:59 -0400] v1.5: MIKRO to:"1398", from:"865458961", id:"13525948", sms:"MCLADM thing"
So I need the script to extract the word "thing" from that log. The script also has to check whether there are new entries in the log, and extract the text from the last one.
(In other words, the script should extract the word AFTER MCLADM. Every time it's a different word.) P.S. I need that script to be integrated here (this has to send the command "/manuadd (text from log)" to the server):
Code: [Select]
<?php
try{
    $HOST = "178.16.35.196"; //the ip of the bukkit server
    $password = "MCLietuva";
    //Can't touch this:
    $sock = socket_create(AF_INET, SOCK_STREAM, 0)
        or die("error: could not create socket\n");
    $succ = socket_connect($sock, $HOST, 4445)
        or die("error: could not connect to host\n");
    //Authentification
    socket_write($sock, $command = md5($password)."<Password>", strlen($command) + 1)
        or die("error: failed to write to socket\n");
    //Begin custom code here.
    socket_write($sock, $command = "/Command/ExecuteConsoleCommandAndReturn-SimpleBroadCast:broadcast lol;", strlen($command) + 1) //Writing text/command we want to send to the server
        or die("error: failed to write to socket\n");
    sleep(2); // This is example code and here has to be that script i want to make.
    //while(($returnedString = socket_read($sock,50000))!= ""){
    $returnedString = socket_read($sock,50000,PHP_NORMAL_READ);
    print($returnedString);
    //}
    print("End of script");
    socket_close($sock);
}catch(Exception $e){
    echo $e->getMessage();
}
?>
I hope I made things clear and you will help me. Thanks

This topic has been moved to Third Party PHP Scripts. http://www.phpfreaks.com/forums/index.php?topic=321546.0

$text = "wow {one|two|three}fsasfa happy ness";
preg_match('/\b{*+}\b/i', $text, $matches);
print_r($matches);
Basically, $matches should contain "one|two|three", but all I get is an array with "}".

So I have been working on my website for a while. It is all PHP and MySQL based, and I am now working on the social networking part, building in functions similar to those Facebook has. I have run into a difficulty with getting information back from a link.
I've checked several sources on how this is possible; the most popular answer was titled 'Facebook Like URL data Extract Using jQuery PHP and Ajax'. I got the scripts, but all of them work with .html links only. My site's pages all have .php extensions, and copying and pasting my site links into these demos does not return anything. I checked the code and all of them use file_get_contents() and parse through the HTML file, so if I pass 'filename.php' it returns nothing, presumably because the PHP has not been processed yet and the function gets the content of the PHP script, with no data of course. So my question is: how is it possible to extract data from a link with a .php extension (on Facebook it works), or how do I get the PHP file executed so that file_get_contents() gets back the HTML?
Here is the link with the code and demo I am using: http://www.sanwebe.c...-php-and-jquery
thanks in advance.
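A possible explanation, sketched under assumptions rather than confirmed against the linked demo: file_get_contents() given a local path such as 'filename.php' reads the script's source from disk, while a full http:// URL is requested through the web server, which runs the script and returns the generated HTML (this needs allow_url_fopen to be enabled). The helper name extract_title and the example.com address are made up for illustration.

```php
<?php
// Hypothetical helper (not from the linked demo): pull the <title> text
// out of an HTML string, or return null when there is no title.
function extract_title(string $html): ?string
{
    if (preg_match('@<title[^>]*>(.*?)</title>@si', $html, $m)) {
        // Decode entities such as &amp; and trim surrounding whitespace.
        return trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }
    return null;
}

// Assumed usage: pass the full URL, not the file name, so the server
// executes the script before the content is read. The URL is a placeholder.
// $html  = file_get_contents('http://www.example.com/filename.php');
// $title = extract_title($html);
```

The parsing step is kept separate from the fetch so it can be tried on any HTML string, whichever way the page was retrieved.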
I have a paragraph which has tags that need to be stripped off. The paragraph looks like this:
Quote
<div id="ctl00_placeholderMain_pnlInTheBox" class="tabitem">
<p>
HP LaserJet 9050 printer<br/>
Power cord<br/>
Parallel cable<br/>
HP LaserJet Q8543X Smart print cartridge<br/>
Printer documentation<br/>
Printer software CD<br/>
Control panel overlay<br/>
Face-up output bin<br/>
Two 500-sheet input tray<br/>
100 Sheet Multipurpose Tray<br/>
HP JetDirect Fast</p>
</div>
I want it to look like this:
Quote
HP LaserJet 9050 printer
Power cord
Parallel cable
HP LaserJet Q8543X Smart print cartridge
Printer documentation
Printer software CD
Control panel overlay
Face-up output bin
Two 500-sheet input tray
100 Sheet Multipurpose Tray
HP JetDirect Fast
How would I go about doing this? Currently I use:
Code: [Select]
$inbox = $html->find( "#ctl00_placeholderMain_pnlInTheBox" );
if ( isset( $inbox[ 0 ] ) ) {
    $box =( $inbox[0] );
    $box = strpos($box, ';') !== FALSE ? substr( $box, strpos( $box, ";" ) + 1 ) : $box;
} else {
    $box = "0";
}

Hi All, I am confused. I would like to put info into a database but need it to be secure. I have some code shown below. The problem is that I would like to store ' but keep the data secure, and when it comes back I do not want to show \'. I think you might know what I am trying to do. Here is the code; I would like to know how to stop the \' showing.
Code: [Select]
$password = mysql_real_escape_string(stripslashes(trim($_POST['password'])));
Any help would be great, thank you.

Hi everyone!
I'm trying to get the variables out of a string like this:
Code: [Select]
$string='this is a sentence with a load of [great, clever, textual] variables ';
What I want to do is output a list of strings that have used the variables to create unique strings like this:
Code: [Select]
this is a sentence with a load of great variables
this is a sentence with a load of clever variables
this is a sentence with a load of textual variables
I tried exploding the string into arrays, but what I really need to do is explode on '[', then output the words to an array (until I get to the closing ']') and then move on. The eventual string will potentially have loads of variables, but let's do one thing at a time. What's the best way of starting a project like this? Neil

Hi: Is this the proper way to remove slashes from apostrophes?
Code: [Select]
if ($_SERVER['REQUEST_METHOD'] == 'POST') {
    $myTitle = mysql_real_escape_string(stripslashes($_POST['myTitle']));
    $myDesc = mysql_real_escape_string(stripslashes($_POST['myDesc']));
    $myHeader = mysql_real_escape_string(stripslashes($_POST['myHeader']));
    $mySubHeader = mysql_real_escape_string(stripslashes($_POST['mySubHeader']));
    $myPageData = mysql_real_escape_string(stripslashes($_POST['myPageData']));
It seems to work fine, I'd just like to check that I'm not missing anything. Thanks!

I am building a component where users enter a YouTube URL and press save. I need to run some code on the form field where the URL was entered to strip all but the unique identifier, i.e. something like this:
$url = 'http://www.youtube.com/watch?v=_VaAlaIJ384&feature=rec-LGOUT-real_rev-rn-4r-10-HM';
strip everything before (and including) 'v='
strip everything after (and including) '&'
Can anyone suggest how I would do this?

$name = "D'Angelo"
OK, I'm running a MySQL query as
$query = " INSERT INTO TEST (ID, NAME) VALUES ('NULL','$NAME')";
If the name is "D'Angelo", the apostrophe causes it to fail. Is there a way to do this without stripping the characters?

Hey guys.
I am trying to strip everything between a key phrase and ending tag but for some reason it is not working. I always get blank data. I've tried many different ways but no luck.
Basically, I have a script that connects to IMAP and stores emails in MySQL as service tickets. It works great, but I am trying to strip everything except the user's reply, because currently if a user replies to an email it re-inserts the entire email into MySQL. I added a key phrase at the top of all outgoing emails.
1. structure looks like this.
--Reply below this line to respond--
------------------------------------------------------------------------------------------------
Email body message...
2. When replying to the message it becomes
New Message reply......
--Reply below this line to respond--
old message body.
3. so I would only like to insert the new reply message only.
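Steps 1 to 3 can be sketched as a plain string cut, assuming the marker from step 1 survives unchanged in the reply. The function name extract_reply is hypothetical, and the marker text is copied from step 1 above; a case-insensitive search is used so that "REPLY" and "Reply" both match.

```php
<?php
// Hypothetical helper: keep only the text that appears above the marker line.
// The default marker is the phrase shown in step 1 of the post.
function extract_reply(string $message, string $marker = '--Reply below this line to respond--'): string
{
    $pos = stripos($message, $marker);   // case-insensitive search for the marker
    if ($pos === false) {
        return trim($message);           // no marker found: keep the whole message
    }
    return trim(substr($message, 0, $pos));
}
```

Cutting at the first occurrence of the marker avoids having to match a closing tag, so the result does not depend on which HTML the mail client wraps around the quoted text.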
This is what I got so far.
$message=strip_tags($message, "<br><div><p><u><hr></section>");
$message=preg_replace("</p>", "br /", $message);
$message=preg_replace('#--REPLY above this line to respond--(.*?)</section>)#s', ' ', $message);
$message=clean("<br/><hr><u>Received On $rep_date / $from_email</u><br><br/>$message");
It inserts the "Received On" date and the from address, but $message is blank. If I remove
$message=preg_replace('#--REPLY above this line to respond--(.*?)</section>)#s', ' ', $message);
it inserts the entire email. Any suggestion on what I am doing wrong? Thank you all very much.

I have this command that basically populates my website with titles for pages based on the information that is placed within "info.txt". I have multiple titles being created, so I'd like a way of putting them in an order other than alphabetical. My thought was that I could place numbers as the first character in the "info.txt" file, and that would allow me to put the titles in an order I find acceptable. Now this causes the issue that I don't want those numbers displayed on my website. So the echo needs to strip the number. I'm lost, but my brain thinks it's a good idea. Here's the code I'm working with:
Code: [Select]
<tr>
<td id="body">
<table id="columns" cellspacing="0">
<tr>
<td>
<h2>LSAV Latest Events</h2>
<p>Select an event from below to view a gallery of images from that event.</p>
<ul>
<?php
$dir = 'latest-events';
if ($handle = opendir($dir)) {
    while (false !== ($file = readdir($handle))) {
        if( $file != '.' && $file != '..') {
            $title = file_get_contents( "$dir/$file/info.txt" );
            echo "<li><a href=\"event-viewer.php?event=$file\">$title</a></li>";
        }
    }
    closedir($handle);

This is going to sound like a very beginner question, because it is. I am modifying some free code snippets to fit my use and am stuck. It's probably a simple fix. First, here is the code. This is printing from a while loop.
<?php print("[". $current_month .",". number_format($remaining_balance, "2", ".", ",") . "],");?>
Here are the last remaining items from the loop as it comes to an end:
[343,14,495.86], [344,13,682.26], [345,12,863.91], [346,12,040.79], [347,11,212.87], [348,10,380.12], [349,9,542.52], [350,8,700.02], [351,7,852.61], [352,7,000.26], [353,6,142.94], [354,5,280.62], [355,4,413.26], [356,3,540.85], [357,2,663.34], [358,1,780.72], [359,892.95], [360,0.00],
My question is kind of simple: how do I get it so the comma doesn't print on the last print? I need it to end like this:
[359,892.95], [360,0.00]
Does that make sense? I don't know if I explained it well. I could really use some help! Thanks so much!

Looking for the best method to conditionally strip leading zeros for the following situation:
$a = array(01, 02, 03, ... 10, 11);
$a < 10 ? $a = ? : '';
For the following result: 1, 2, 3, ... 10, 11
Is there a better method than explode?

Firstly, I'm sure the answer is simple, however due to my inexperience I am unable to figure it out!
I am executing a simple query to retrieve all ISBN numbers within my db and display these in a table.
In my db some of the ISBN numbers contain hyphens and some don't. I would like to display them in my table without hyphens.
I'm grateful of any assistance and advice.
I know I can remove all the hyphens from the db by executing a separate query, however I'd like to strip them at display time if possible.
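Stripping the hyphens only when the value is printed could look like the sketch below. The function name isbn_without_hyphens is made up for illustration, and the sample ISBN in the comment is a placeholder rather than a row from the poster's table.

```php
<?php
// Hypothetical helper: remove hyphens from an ISBN for display purposes only.
// The value stored in the database is left unchanged.
function isbn_without_hyphens(string $isbn): string
{
    return str_replace('-', '', $isbn);
}

// Assumed usage inside the table row:
// echo isbn_without_hyphens($row['isbn']);   // e.g. "978-0-13-110362-7" is shown as "9780131103627"
```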
My Code;
$query = "SELECT * FROM book WHERE status != 'Archive' ORDER BY id";
while($row = mysqli_fetch_array($query)){ ?>
<tr>
<td><a href="itemView.php<?php echo '?isbn='.$row['isbn']; ?>" </a></td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<?php } ?>
J

Hi All,
How could I implement a tab strip (jQuery UI tabs) in PHP?
Any help in this regard will be appreciated
Thanks
Ahsan
I need to strip some html tags out of an uploaded string of code. I need to keep the <td> tags...but some code that is being uploaded include <p> tags INSIDE the <td> tag.
How would I go about stripping ALL other tags inside these allowed tags: <td> <tr><table>
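One option, sketched here as a starting point, is the second argument of PHP's built-in strip_tags(), which lists the tags to keep. The wrapper name keep_table_tags is hypothetical. Note that strip_tags() leaves attributes on the allowed tags untouched, so it is not a sanitizer on its own.

```php
<?php
// Hypothetical wrapper: drop every tag except <table>, <tr> and <td>.
// The text inside removed tags (such as <p>) is kept; only the tags go.
function keep_table_tags(string $html): string
{
    return strip_tags($html, '<table><tr><td>');
}
```

For example, a cell written as `<td><p>Hello</p></td>` would come back as `<td>Hello</td>`.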
Hiya peeps, I am using preg_match to validate URLs:
preg_match("/(((https?|ftp|gopher):\/\/|(mailto|file|news):)[^' <>\"]+|(www|web|w3).[-a-z0-9.]+)[^' .,;<>\":]/i", $this->_searchString)
It works great. The only issue I have is that I am using it in an if statement, and what I need to do is something like this:
if(preg_match("/(((https?|ftp|gopher):\/\/|(mailto|file|news):)[^' <>\"]+|(www|web|w3).[-a-z0-9.]+)[^' .,;<>\":]/i", $this->_searchString)) {
    CHECK IF $this->_searchString has http:// at the start of it or a / at the end, and remove them if it does. I need $this->_searchString to end up with only www.SITE.co.uk before I can input it into the relevant function.
} else {
}
Many thanks, James.
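The clean-up step described inside the if block could be sketched as follows. The function name bare_host is an assumption, and the sketch only handles an http:// or https:// prefix and trailing slashes, which is what the post asks for; other schemes matched by the validation pattern are left alone.

```php
<?php
// Hypothetical helper: remove a leading http:// or https:// and any
// trailing slashes, leaving e.g. "www.SITE.co.uk".
function bare_host(string $url): string
{
    $url = preg_replace('#^https?://#i', '', trim($url));  // drop the scheme if present
    return rtrim($url, '/');                               // drop trailing slashes
}

// Assumed usage inside the if block from the post:
// $this->_searchString = bare_host($this->_searchString);
```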