PHP - Help Cleaning Invisible Characters Out Of Scraped Data
Hi. I've been banging my head against the wall with this stupid problem, and I've just figured out the cause.
I scrape data from a website. I query my table to see if I already have everything in there. If no match is found, I insert it.
Today, I noticed even after inserting, my script kept telling me there's a new entry I need to insert, despite it actually being there when I physically check the table.
When I echo the query out and run it in phpMyAdmin, it finds the row, but not when I run the same query directly in my script.
Turns out, there are several invisible characters in my string. When I do var_dump(), it says length is 25. When I copy and paste the string into notepad, then back into a fresh php script and wrap it in quotes and echo out a strlen(), I get 21. There are apparently 4 invisible characters that I can't see. I trim() everything, so that apparently didn't catch it.
So... is there a good way to "clean" my data before inserting or comparing it to avoid this in the future? This wasted a ton of time and I hope to find a way to clean this junk out of my data.
Thanks! It just seems like I run into this sort of thing often when it's data scraped from the web. (breaks my regex! grr!)
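A minimal sketch of one way to do this, assuming the scraped text is UTF-8 and that the extra bytes are things like non-breaking spaces, zero-width characters, or a byte-order mark (trim() strips none of these by default). The helper name clean_scraped() and the sample string are made up for illustration. Dumping the string with bin2hex() first shows exactly which bytes are hiding in it.

```php
<?php
// Hypothetical helper, not from the original post. Assumes UTF-8 input.
function clean_scraped(string $s): string
{
    // Decode entities first, so "&nbsp;" becomes a real U+00A0 character.
    $s = html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse every run of Unicode space separators (including U+00A0),
    // tabs and line breaks into one ordinary space.
    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    $s = preg_replace('/[\p{Zs}\t\r\n]+/u', ' ', $s) ?? $s;

    // Remove remaining control characters (Cc) and format characters (Cf),
    // e.g. zero-width space U+200B, BOM U+FEFF, soft hyphen U+00AD.
    $s = preg_replace('/[\p{Cc}\p{Cf}]+/u', '', $s) ?? $s;

    return trim($s);
}

// Made-up sample: BOM + "John" + NBSP + "Smith" + zero-width space + entity.
$raw = "\xEF\xBB\xBFJohn\xC2\xA0Smith\xE2\x80\x8B&nbsp;";

echo bin2hex($raw), PHP_EOL;    // inspect the raw bytes
var_dump(strlen($raw));         // byte length before cleaning
var_dump(clean_scraped($raw));  // cleaned value
```

Running the same function over the value before both the SELECT comparison and the INSERT keeps the stored and compared strings normalised in the same way. Note that it also collapses runs of whitespace into one space, which may or may not be wanted for a given column.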
Similar Tutorials

I'm trying to get back into PHP, but I'm running into some trouble. I'm creating a little script that scrapes data from a page and then displays it elsewhere. (I'm actually going to insert it into a database, but for now I'm just trying to display it!) There are two pages I'm scraping from. This scrapes the list of players from one web page:

$html = file_get_contents("http://www.somesite.com/players.php?search=who");
preg_match_all(
    '/<td width=\'33\.3333333333\%\'><a href=\'\/players.php\?lookup=.*?\'>(.*?)<\/a>/',
    $html,
    $players,
    PREG_SET_ORDER
);

That matches fine. Then for each name I want to scrape another page that has more information about each individual player:

foreach ($players as $player) {
    $html2 = file_get_contents("http://www.somesite.com/players.php?lookup=$player");
    preg_match_all(
        '/<font size=\'\+1\'><strong>.*?<\/font><\/strong>/',
        $html2,
        $longName,
        PREG_SET_ORDER
    );
    preg_match_all(
        '/Race<\/font><\/td><td><font size=\'\-1\'>.*?<\/font>/',
        $html2,
        $race,
        PREG_SET_ORDER
    );
    preg_match_all(
        '/Citizenship<\/font><\/td><td><font size=\'\-1\'>.*?<\/font>/',
        $html2,
        $citizenship,
        PREG_SET_ORDER
    );
    preg_match_all(
        '/Level<\/font><\/td><td><font size=\'\-1\'>.*?<\/font>/',
        $html2,
        $level,
        PREG_SET_ORDER
    );
    echo $longName[0][0];
    echo ("<br>");
    $rac = implode("", $race[0]);
    $newrace = substr($rac, 4, -1);
    echo $newrace;
    echo ("<br>");
    $cit = implode("", $citizenship[0]);
    $newcit = substr($cit, 11, -1);
    echo $newcit;
    echo ("<br>");
    $lev = implode("", $level[0]);
    $newlevel = substr($lev, 5, -1);
    echo $newlevel;
    echo ("<br>");
}

This does not work. I get

Warning: implode() [function.implode]: Invalid arguments passed in /my/path/names.php on line <three implode lines>

and I get NO echos except for the break lines. HOWEVER, if I just use this code outside of a loop, i.e. I just do

$html2 = file_get_contents("http://www.somesite.com/players.php?lookup=MyPlayerName");

it works perfectly.
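Going only by the code posted, one likely cause is that with PREG_SET_ORDER each $player is itself an array (index 0 is the whole match, index 1 the captured name), so interpolating $player into the URL produces "lookup=Array". The sketch below uses made-up sample markup instead of a live request, and a simplified pattern, purely to show the indexing; it is not a drop-in replacement for the script.

```php
<?php
// Made-up markup standing in for the real players page (assumed structure).
$html = "<td width='33.3333333333%'><a href='/players.php?lookup=Bob'>Bob</a></td>"
      . "<td width='33.3333333333%'><a href='/players.php?lookup=Al%20Ice'>Al Ice</a></td>";

preg_match_all(
    '/<a href=\'\/players\.php\?lookup=[^\']*\'>(.*?)<\/a>/',
    $html,
    $players,
    PREG_SET_ORDER
);

$urls = [];
foreach ($players as $player) {
    // $player[0] is the full match, $player[1] is the captured player name.
    $name = $player[1];
    // urlencode() keeps names with spaces or symbols valid in the query string.
    $urls[] = 'http://www.somesite.com/players.php?lookup=' . urlencode($name);
}

print_r($urls);
```

Inside the real loop it would also be worth checking that a match exists, for example with !empty($race[0]), before passing it to implode(), since an empty match array would trigger the same kind of warning.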
I'm a newb at coding, so I can't even figure out what data it is or is not getting in the iteration, or why it fails when looped through each name when it works perfectly with a single name in the file_get_contents. Can anyone see errors in how I'm going about this or suggest some checks to put in to help figure out what the heck is going wrong? Thanks much for any help.

I'm trying to clean up my code as best I know how and get rid of code that isn't necessary. This is where I'm most confused: on every page, I start a session, and if someone goes to a page other than the index/login page, they are redirected if there is no session; otherwise, they are allowed to continue to view the page content. Here is my code:

Code: [Select]
<?php
session_start();
$email=$_SESSION['email']; // is this line necessary on every page...
if ($_SESSION['logged'] != 1) {
    $_SESSION['email'] = $email; // as well as this line?
    header("Location: index.php");
    exit();
}
?>

Basically, I need to know if $_SESSION['email'] has to be set on every page, or does it stay set as long as the session is logged? If it stays set for the entire session, I shouldn't need to include the "$email=$_SESSION['email'];" line on every page, right? Also, if my thinking is correct, I shouldn't need to include the "$_SESSION['email'] = $email;" line unless I need to use the $email variable for some other purpose. Am I correct on this? Thanks for any input.

Basically, this is like 60% written by me with a few copy/pastes. What I am looking to do is clean up this code a little. Functionally it does what I want it to relatively well, but it does look very, very messy and I would like it to be a little more tidy. Any suggestions on how to do this? I am no PHP expert, as you can most likely tell by the code itself.
<?php
require("cfg.php");
//FOLDER PATH
$dir_cat= $imgpath;
$openDir_cat = opendir($dir_cat);
while (false !== ($fileNames_cat = readdir($openDir_cat))) {
    $check_cat = $dir_cat. "/" . $fileNames_cat;
    $size_cat = @getimagesize($check_cat); // this checks the mime type of the file
    if($fileNames_cat == "." || $fileNames_cat == ".." || strpos($size_cat['mime'], "image") === FALSE) {
        continue; // exclude everything that is not an image
    } else {
        $imagesAll_cat[] = $fileNames_cat; // create an array of images
    }
}
$imgnr_cat=0;
$imgct_cat=count(glob($dir_cat . "*")); //COUNT THE NUMBER OF FILES IN THE UPLOAD DIR.
echo '<center><fieldset class=fieldset>';
echo '<legend> IMGS ';
echo ' (' .$imgct_cat. ' IMAGES)</legend><div>';
//GET IMAGES AND DISPLAY THEM
while ($imgnr_cat<$imgct_cat){
    echo '<a href=' .$dir_cat.$imagesAll_cat[$imgnr_cat]. '>';
    echo '<img width=140 height=100 src=' .$dir_cat.$imagesAll_cat[$imgnr_cat]. '>';
    echo '</a>';
    $imgnr_cat++;
}
echo '</div></fieldset></center>';
?>

OK, I have this code that I want to print out the total number of times the defined username appears in my database with the defined date. Right now all it prints out is the echo at the end, but with no numbers in the result. I just need help with getting it to display the number.

Code: [Select]
<?php
//select a database to work with
$selected = mysql_select_db("nwacompu_totals",$dbhandle) or die("Could not select examples");
$username = "bayyari";
$date = "3-16-2011";
$query = "SELECT COUNT(*) FROM totals WHERE date = '$date' AND username = '$username'";
$result = mysql_query($query) or die(mysql_error());
// Print out result
while($row = mysql_fetch_array($result)){
    echo "There are ". $row['COUNT(username)'] ." ". $row['date'] ." items.";
    echo "<br />";
    echo $result;
}
?>

What is printed is this:

There are items. Resource id #2Resource id #3

which makes no sense to me. Any help is very much appreciated.

I need to pull a list of mp3 file names to create a playlist.
I have a column for "priority" which will handle the rotation of the mp3's, so that a priority of 4 will play more often than a priority of 3, and so on. I have this code (which works) to select a row from the db that matches the selected criteria AND has a priority of "4". If there are no matches, it selects a row that has a priority of "3". How can I clean up this code, preferably to remove the need for adding 2's after each variable? Also, is there a better way of handling this, so that it's ONE select that then puts the mp3's in the playlist the correct number of times?

Code: [Select]
<?php
$sql = "SELECT * FROM audio WHERE `client_id` = '$client' AND (`start_date` <= '$nownohour' AND `end_date` >= '$nownohour') AND `$dow` = '1' AND `is_active` = '1' AND (`start_hour` <= '$hournozero' AND `end_hour` >= '$hournozero') AND `priority` = '4' LIMIT 1";
$rs = mysql_query($sql,$dbc);
$matches = 0;
while ($row = mysql_fetch_assoc($rs)) {
    $matches++;
    echo "$row[title].mp3<br />";
}
if (!$matches) {
    $sql2 = "SELECT * FROM audio WHERE `client_id` = '$client' AND (`start_date` <= '$now' AND `end_date` >= '$now') AND `$dow` = '1' AND `is_active` = '1' AND (`start_hour` <= '$hournozero' AND `end_hour` >= '$hournozero') AND `priority` = '3' LIMIT 1";
    $rs2 = mysql_query($sql2,$dbc);
    $matches2 = 0;
    while ($row2 = mysql_fetch_assoc($rs2)) {
        $matches2++;
        echo "$row2[title].mp3<br />";
    }
}
echo "-------------<br />";
?>

Thanks!

Hi there. I allow users to make new albums. For each album they give a full display name, e.g. "New Years Eve 2012 at Persons' place", and a shorter display name, e.g. "New Years Eve 2012". The first string should be stored in a db and should be shown exactly as it is, so for that string I have:

Code: [Select]
<?php
//$fullname gets stored in the db
$fullname = mysql_real_escape_string(htmlentities($fullname));
//when pulled from the db, it is only echoed, nothing else:
echo $fullname;
?>

I presume it is safe so far. Now is the 'tricky' part.
Currently this is what I do to the short name:

Code: [Select]
<?php
//$shortname gets stored in the database
$shortname = mysql_real_escape_string(urlencode(str_replace(' ','',strtolower($shortname))));
?>

From this shortname, before the mysql_real_escape_string(), it makes a directory with that name. What happens is, if there are special characters like '@', the '@' changes into '%40'. So the directory would be e.g. 'fun%40myplace'. The directories get made without a problem, but for some reason my uploader won't upload to this directory. It isn't the uploader's fault, because in folders without these special characters there is no problem with uploading. Any ideas on how to fix this, or what the best method is to clean a string for url/directory names?

The attached script does everything I want it to do, but it's not very elegant. I could use some help cleaning it up. Any and all help is greatly appreciated. Thanks!!!

Hi. This isn't so much a problem, as my code/page works; however, my host shut down scripting on the site because too many connections to the database were apparently left open by this page. I've also been looking into a way of cleaning up the code and making it less bloated, but the only thing I have found is mysqli for running multiple queries in one go, and I am unsure if this would help.

Code: [Select]
<?php
session_start();
include("./includes/db_con.inc.php");
$image = $_GET["id"];
$id = str_replace("-", "/", $_GET["id"]);
$cat = $_GET["cat"];
$sub_cat = $_GET["sub"];
$page_title = $cat . " - " . $sub_cat . " - " . $id;
$sql_meta = mysql_query("SELECT * FROM english WHERE PRODID='$id' AND display='1'");
$row_meta = mysql_fetch_assoc($sql_meta);
$meta_desc = $row_meta['META_DESC'];
$meta_key = $row_meta['META_KEY'];
include("./includes/head.php");
include("./includes/header.php");
?>
<div id="main-page-header">
<?php
$header = "img/product-pages/product-specific-headers/" . $image . ".jpg";
if (file_exists($header)) {
?>
<img src="<?php echo $img_loc; ?>/product-pages/product-specific-headers/<?php echo $image; ?>.jpg" width="940" height="310" />
<?php
} else {
    $desc_sql = mysql_query("SELECT * FROM sub_cat WHERE name='$sub_cat'");
    $row_desc = mysql_fetch_assoc($desc_sql);
    $description = $row_desc['description'];
    echo "<img src='$img_loc/product-pages/sub-cat-head/".str_replace(" ", "-", $row_desc['name'])."-sub-cat-head.jpg' width='940' height='310' />";
}
?>
</div>
<div id="page-content">
<div id="main-page-text">
<h1><?php
$result = mysql_query("SELECT * FROM $table WHERE PRODID='$id' AND display='1'");
while($row = mysql_fetch_array($result)) {
    $prod_title = $row['PROD_TITLE'];
    echo $prod_title;
?> </h1>
<h2 class="crumbs"><a href="<?php echo "$url/Products/$lang/"; ?>">Products</a> > <a href="<?php echo "$url/Category/$lang/".urlencode($cat)."/"; ?>"><?php echo $cat; ?></a> > <a href="<?php echo "$url/Range/$lang/".urlencode($cat)."/".urlencode($sub_cat)."/"; ?>"><?php echo $sub_cat; ?></a> > <?php echo $prod_title; ?> </h2>
<p><?php
    echo $row['DESCRIPTION'];
}
?></p>
<script src="http://connect.facebook.net/en_US/all.js#xfbml=1"></script><fb:like layout="button_count" show_faces="false" width="170" font="verdana"></fb:like>
<br />
<?php
$i = 1;
while ($i <= 8) {
    $filename = "img/product-pages/hovers-800px/" . $image . "-" . $i . ".jpg";
    if (file_exists($filename)) {
?>
<a class="fuglybox" rel="gxr" href="<?php echo "../../../../../" . $filename; ?>">
<img src="../../../../../img/product-pages/product-detail-90px/<?php echo $image . "-" . $i . ".jpg"; ?>" alt="" width="90" height="90" /></a>
<?php
    } else {
        break;
    }
    $i++;
}
?>
<table>
<th>Product Code</th>
<th>Description</th>
<?php
$table_result = mysql_query("SELECT * FROM PRODID WHERE PRODID='$id'");
$table_entry = mysql_fetch_assoc($table_result);
echo "<tr><td class='prod-id'>" . $table_entry['PROD_CODE'] . "</td>\n";
echo "<td class='prod-desc'>" . $table_entry['TA_DESC'] . "</td></tr>\n";
?>
</table>
<div id="FAQ-wrap"><h2>Frequently Asked Questions</h2>
<?php
$QA_table = $lang . "_qanda";
if ($id=="") {
    $list_QA = mysql_query("SELECT * FROM $QA_table WHERE SUB_CATEGORY='$sub_cat' AND DISPLAY='1'");
} else {
    $list_QA = mysql_query("SELECT * FROM $QA_table WHERE PRODID='$id' AND DISPLAY='1'");
}
while($row_QA = mysql_fetch_array($list_QA)) {
    echo "<div id='FAQ-QA'>";
    echo "<h2 class='FAQ-question'>Q: ".$row_QA['QUESTION']."</h2>";
    echo "<p class='FAQ-question'>A: ".$row_QA['ANSWER']."</p></div>";
}
?>
<div id="FAQ-question">
<h2 class="white">Got a question about the <span class="got-question-product"><?php echo $sub_cat; ?></span> </h2>
</div>
</div>
</div>
<div id="totem-menu-container"><?php include("./includes/search.php") ?> <?php include("./includes/totem.php") ?> </div>
</div>
</div>
<?php include("./footer.php") ?>

Please note that db_con.inc.php holds the connection to the database, stored in the variable $con, and footer.php includes mysql_close($con), which is also why I don't understand how connections are being left open. Any help greatly appreciated.

I am cleaning up my website (replacing tables with divs and turning all design code into CSS). I would like to know a couple of things. First, I see that register_globals is deprecated, but I still see new blog posts and such saying to keep it. I want to know if I should remove it, how it works, and, if I do remove it, what I should replace it with. Also, I have code like:

Code: [Select]
'".$SETTINGS['siteurl']."uploaded/".$val["pict_url"]."'

and I was wondering whether it can be replaced with JavaScript. Please don't call me a noob or tell me to Google it, because when it comes to me, Google, and PHP, I'm like a kid in a candy shop where all the candy is filled with dirt.

I have been trying to hack two parts of code together... I had code written that will grab the text from a website and completely clean it of all junk except for full words... then echo it.
Now I am trying to use the same script to pull from a database instead of a URL but am lost... Here is my code... I would make another donation to the site if we can get this going... THANK YOU!

Code: [Select]
<?php
$con = mysql_connect("localhost","USERNAME","PASSWORD!");
mysql_select_db("DATABASE",$con);
$get = "SELECT * FROM information_description WHERE information_id=4";
$SQ_query = mysql_query($get) or die("Query failed: $get\n" . mysql_error());
$fetch = mysql_fetch_array($SQ_query);
$raw = $fetch['description'];
/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");
mb_http_output( "UTF-8" );
ob_start("mb_output_handler");
function clean($html) {
    ###Remove number in html################
    //$html = preg_replace("/[0-9]/", " ", $html);
    $html = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $html);
    // $html = preg_replace('/(<[^>]+) style=".*?"/i', '$1', $html);
    echo $html;
    $html = str_replace(" ", " ", $html);
    $html = str_replace("&", " ", $html);
    $html = str_replace("-", " ", $html);
    ######remove space
    $html = preg_replace ('/<[^>]*>/', '', $html);
    $html = preg_replace('/\s\s+/', ', ', $html);
    $html = preg_replace('/[\s\W]+/',' ',$html); // Strip off spaces and non-alpha-numeric
    return $html;
}
#call function
//$raw = StripHtmlTags($raw);
$raw = clean($raw);
echo $raw;
##echo clean($html);
$url = (isset($_GET['url']) ?$_GET['url'] : 0);
$str = file_get_contents($url);
####################################################################3
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt ($crl, CURLOPT_URL,$url);
    curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}
#--------------------------------------Strip html tag----------------------------------------------------
function StripHtmlTags( $text ) {
    // PHP's strip_tags() function will remove tags, but it
    // doesn't remove scripts, styles, and
    // other unwanted
    // invisible text between tags. Also, as a prelude to
    // tokenizing the text, we need to insure that when
    // block-level tags (such as <p> or <div>) are removed,
    // neighboring words aren't joined.
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
            // Add line breaks before & after blocks
            '@<((br)|(hr))@iu',
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        ),
        $text
    );
    // Remove all remaining tags and comments and return.
    return strtolower( $text );
}
function RemoveComments( & $string ) {
    $string = preg_replace("%(#|;|(//)).*%","",$string);
    $string = preg_replace("%/\*(?:(?!\*/).)*\*/%s","",$string); // google for negative lookahead
    return $string;
}
$html = StripHtmlTags($str);
###Remove number in html################
$html = preg_replace("/[0-9]/", " ", $html);
#replace by ' '
$html = str_replace(" ", " ", $html);
######remove any words################
$remove_word = file("swords.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach($remove_word as $word) {
    $html = preg_replace("/\b".
        $word ."\b/", " ", $html);
}
######remove space
$html = preg_replace ('/<[^>]*>/', '', $html);
$html = preg_replace('/\b\s+/', ', ', $html);
$html = preg_replace('/[\b\W]+/',', ',$html); // Strip off spaces and non-alpha-numeric
#remove white space, Keep : . ( ) : &
//$html = preg_replace('/\s+/', ', ', $html);
###process#########################################################################
$array_loop = explode(",", $html);
$array_loop1 = $array_loop;
$arr_tem = array();
foreach($array_loop as $key=>$val) {
    if(in_array($val, $array_loop1)) {
        if(!$arr_tem[$val]) $arr_tem[$val] = 0;
        $arr_tem[$val] += 1;
        if ( ($k = array_search($val, $array_loop1) ) !== false )
            unset($array_loop1[$k]);
    }
}
arsort($arr_tem);
###echo top 20 words############################################################
echo "<h3>Top 20 words used most</h3>";
$i = 1;
foreach($arr_tem as $key=>$val) {
    if($i<=20) {
        echo $i.": ".$key." (".$val." words)<br />";
        $i++;
    }else break;
}
echo "<hr />";
###print array#####################################################################
echo (implode(", ", array_keys($arr_tem)));
?>

Just want to finalize what the best options/functions are when cleaning user-supplied input in the following 3 situations:
1. Inserting data into a database
2. Sending data to an email (and displaying it as HTML)
3. Displaying data back to screen

This topic has been moved to PHP Regex. http://www.phpfreaks.com/forums/index.php?topic=326004.0

I want to be able to keep people from entering certain characters in a form. I've tried Google, and had no luck so far. Thanks!

First, forgive me; I do not know if this is the correct forum for this question. I did not write the code, but I have a feeling that if it's possible to correct it, then it's probably a coding matter. My question is: is there any way to post larger News articles to the homepage? I am trying to post an article that a friend sent from another site onto mine (giving the other site proper credit). But the article is apparently too big.
I do not know how many characters the article is, or what the character limit set by Nuke 8.0 is. Is there anything that I can do? Thanks... oh, and congrats on apparently being the only active PHP forums :-S I have tried a few others and they haven't had posts since 2011. Anyway, thanks in advance for any and all help.

Hi guys! I am learning PHP now and I am enjoying it. I'm not an I.T. graduate, which is why I'm having a very difficult time understanding code. My problem is how to get the last part of a URL that I get using another PHP script. I can already post the URL on my page, but it displays the whole URL of the page that I get. Example: the URL is "http://mysite.com/page_1/pp1/?lang=zh" and I only want to get the "?lang=zh". I am working with 3 languages and I want to get only that last part of the URL so I can continue my work. I don't exactly know what string function or filtering I should use to get that part only. Please help me, guys. I will appreciate all your comments here.

Hi. What code would I use to take the first 2 letters of a post code? For example, with CM11 2AY I want the CM bit. What command would pull the first 2 characters out? Thanks

Hi experts. I am receiving a GET variable in a page: an id with a value such as uuid:3242_2323_4444_9909_433/child_repeat[1], which is used in my MySQL query. So my query will be SELECT * from table_1 WHERE id = $id. However, PHP is treating this string differently and my query fails; it says that there is an error near :3242_2323_4444_9909_433/child_repeat[1]. So it seems that it is interpreting the colon as something else and removing the text before the colon.

Does anybody know of a function or a way of letting me get the string between whatever characters? Say I had [SOMEWORD] text [SOMEWORD], then how could I go about getting the value "text"? Please note I'm not trying to make bbcode or similar.
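For the last question, a minimal sketch of one possible approach, assuming the same marker appears on both sides of the wanted text. The function name between() is made up for illustration; preg_quote() is used so that characters such as [ and ] in the marker are treated literally.

```php
<?php
// Hypothetical helper: returns the text between the first two occurrences
// of $marker, or null if the marker does not appear twice.
function between(string $haystack, string $marker): ?string
{
    $quoted = preg_quote($marker, '/');
    // Lazy (.*?) stops at the nearest closing marker; /s lets "." span lines.
    $pattern = '/' . $quoted . '(.*?)' . $quoted . '/s';

    if (preg_match($pattern, $haystack, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

var_dump(between('[SOMEWORD] text [SOMEWORD]', '[SOMEWORD]'));
```

If the opening and closing markers differ, the same idea works with two separately quoted markers in the pattern.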