PHP - Php Xml Large File Parser - Special Character Errors Xmlreader Expand()
Hi guys, I have read a lot of documentation on the internet about reading/writing/parsing an XML file. I ended up using the following code, because I have really large files (some about 200MB) and the regular DOM does not work:
while ($xml->read()) {
    switch ($xml->nodeType) {
        case (XMLReader::ELEMENT):
            if ($xml->localName == "job") {
                $node = $xml->expand();
                $dom = new DomDocument();
                $n = $dom->importNode($node,true);
                $dom->appendChild($n);
                $job = simplexml_import_dom($n);

The problem I have is a special character error in the XML file; the error is returned on this line: "$node = $xml->expand();"

I am literally banging my head against the wall trying to find a simple solution to this. I already have a cleaning function, but it can be applied only after the code above. As the file is large, to clean it I would have to use the same code above to work on partial content, so I would hit the same special character problem when trying to read and split the file. I bet I am not the first one to be in this situation, but after about 5 hours of searching on the internet, I cannot do it any more. And I am not enough of a PHP expert to come up with a new idea.

One other thing to do would probably be to split the file into multiple files and read them after that, without using XMLReader. But this would call for a different application. If, for example, I read a file that has an error with simplexml, without using XMLReader, I don't get the error. But I cannot use simplexml on the files, since the file size is variable. I have to use a reliable method that works in all situations.

Hopefully someone has an idea for this STUPID situation! Thanks.

Similar Tutorials

Is there a way I can readily stream-process a compressed XML file in PHP? Something like this NON WORKING example:
Code: [Select]
$reader = new XMLReader(); $reader->open(bzopen($planet_file,"r")); while ($reader->read()) { ...
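For what it's worth, XMLReader::open() expects a URI string rather than a file handle, which is presumably why passing the result of bzopen() does not work. A minimal sketch, assuming the bz2 and/or zlib extension is loaded so that PHP's compress.bzip2:// and compress.zlib:// stream wrappers are available; openCompressedXml() and countElements() are made-up helper names, not part of any library:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: open a plain, gzip- or bzip2-compressed XML file
 * with XMLReader, choosing a PHP stream wrapper from the file extension.
 */
function openCompressedXml(string $path): XMLReader
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // Assumption: the matching extension (bz2 / zlib) is installed.
    $wrappers = ['bz2' => 'compress.bzip2://', 'gz' => 'compress.zlib://'];
    $uri = ($wrappers[$ext] ?? '') . $path;

    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }
    return $reader;
}

/** Count the elements with the given name while streaming through the file. */
function countElements(string $path, string $name): int
{
    $reader = openCompressedXml($path);
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
            $count++;
        }
    }
    $reader->close();
    return $count;
}
```

The file is decompressed on the fly as XMLReader pulls data, so memory use should stay roughly constant regardless of file size.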
See also http://php.net/manual/en/xmlreader.open.php Folks, Requirement: I want a .htaccess level solution to 404 when the URL contains special characters other than mentioned in the below Rewrite rule: Code: [Select] RewriteRule ^([a-zA-Z0-9-!@#$^&*:"<>/?]{4,})\.html$ search.php?q=$1 [QSA,L] So, what i want is, i want to show a 404 when the URL contains anything other than "a-zA-Z0-9-!@#$^&*:"<>/?" What i have done: Code: [Select] RewriteRule ^([a-zA-Z0-9-!@#$^&*:"<>/?]{4,})\.html$ search.php?q=$1 [QSA,L] Problem: Its not working with Special characters but working only with English letters and Numerics in URL. Cheers Natasha T <form action="main.php?id=test.php" method="post"> <input name="name" type="text" /> <input name="submit" type="submit" value="Look Up" /> </form> <?php if(isset($_POST['submit'])) { $today = date("Y-m-d"); $pname = rawurlencode($_POST['name']); $xml_feed_url = 'http://api.eve-online.com/eve/CharacterID.xml.aspx?names='.$pname.''; The above works perfect until the user enters a character like ' to the box, how do I get it to pass the ' into the address? Thank you Hi people! I have a form with a select list where options are populated from a table in my db.. the string format is like this -> car - branch (ex. toyota - japan so on..) when viewing the options it displays correctly with the "-" but when i tried submitting the form which will be inserted into the db the "- branch" gets cut off.. i think i need to encode it but i don't know how to do it.. thanks for any reply! hello; I have: mysql: utf8_general_ci index.php: header( ... UTF-8) index.php: <meta ... content-type ... UTF-8> index.php, mysql query procedu ... mysql_set_charset( utf8 , .. ) ... So, if I put a special character in my db, it WILL display correctly in index.php But, if I put the same character in a php variable, it is BAD ... diamond-shape with question mark inside Since my special characters work from the db, I would like to also use them in html (index.php ). 
am I missing anything? thanks for your time .. Shannon I need to pass info to a php script via link while ignoring the character '&' for example: mywebsite.com/myphp.php?info=thisInformation&thecharacter& is there anyway I can tell php to ignore the character '&' and just read it as text (cause I understand its a special character.. maybe something like in XML you can ignore the character & by using <![CDATA[ ]]> thanks!!! much appreciated!!! PS: this is a must.. since in the end I am passing links.. and urls.. that are like amazon.com/blah&blah I'm having a problem that I didn't seem to have in the past... for some reason, it just popped up... or maybe I'm just noticing it. Anyway, I this in a field in a mysql database: "Here's a test." When I query the database and echo the text to the page with this code: Code: [Select] $result = mysql_query("SELECT question FROM signin WHERE email = '$email'") or die (mysql_error()); $row = mysql_fetch_row($result); $test = $row[0]; echo $test; This is the result: "Heres a test." Any ideas on why the apostrophe is not being displayed and how to fix it? I'm stumped. Thanks for any help. I have this code to pull the first few words from the body text to use as a title. But, when the body contains apostrophe, it shows on my title as '. I found the following code that's supposedly fixes the code. But, I can't figure out how. I need to add this: $return = = htmlspecialchars_decode(token_replace($output, 'node', $node), ENT_QUOTES); Or something similar ... to this: Code: [Select] $limit = 10; $text = $node->body[$node->language][0]['value']; $text = strip_tags($text); $words = str_word_count($text, 2); $pos = array_keys($words); if (count($words) > $limit) { $text = substr( $text, 0, $pos[$limit]); $text = trim( $text ); $text = rtrim( $text, '.' ); $text = trim( $text ) . '...'; } return $text; Thanks in advance. 
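One way to approach the title problem above, as a sketch only: if the apostrophe arrives as an HTML entity such as &#039;, decoding entities before cutting the text keeps the title readable. makeTitle() is a hypothetical helper written for illustration; it is not the Drupal token_replace() API mentioned in the post, and it counts words by whitespace rather than with str_word_count().

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: build a short title from the first $limit words
 * of a body text, decoding HTML entities such as &#039; first.
 */
function makeTitle(string $body, int $limit = 10): string
{
    // Strip markup, then turn entities back into real characters.
    $text = html_entity_decode(strip_tags($body), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Split on whitespace so apostrophes stay inside their words.
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) <= $limit) {
        return implode(' ', $words);
    }

    $short = implode(' ', array_slice($words, 0, $limit));
    return rtrim($short, '.') . '...';
}

echo makeTitle("It&#039;s a test of the title helper.", 3), "\n";
```

If the result is printed into an HTML page, it should be escaped once at output time, for example with htmlspecialchars(), rather than stored in escaped form.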
Hi guys, I am currently receiving a large text file ( > 500mb), once per week which I have been manually splitting then processing to obtain the required CSV files. However, this is taking in the region of 2 to 3 hours. Very soon, these files will be sent daily and I really dont have the time to split and process this everyday I have been playing for a while to try and parse everything properly/automatically with fopen, feof and fgets ( and other 'f' options), but the script never seems to read the file all the way to the end - I assume this is due to memory usage. The data received in the file follows a strict pattern throughout the file which is: Code: [Select] BSNY990141112271112270100000 POO2C35 122354000 DMUS 075 O BX NTY LOLANCSTR 1132 11322 TB LIMORCMSJ 1135 00000000 LICRNFNJN 1140 00000000 H LICRNF 1141H1142H 11421142 T LISDAL 1147H1148H 11481148 T LIARNSIDE 1152H1153 11531153 T LIGOVS 1158 1159 11581159 T LIKTBK 1202 1202H 12021202 T LICARK 1206 1207 12061207 T LIULVRSTN 1214H1215H 12151215 T LIDALTON 1223 1223H 12231223 T LIDALTONJ 1225 00000000 LIROOSE 1229 1229H 12291229 T 2 LTBAROW 1237 12391 TF That is just one record of informaton (1 of around 140,000 records), each record has no fixed amount of lines but each line in each record is fixed to 80 characters and all lines in each record need to have the same unique 'id', at present, Im using an md5 hash of microtime. The first line of every record starts with 'BS' and the last line of each record starts with 'LT' terminating with 'TF'. All the other stuff between also follows a certain pattern of which I can break down effectively. The record above show one train service schedule, hence why each line in each record needs the same unique id. Anyone got any ideas on how I could process such a file effectively?? Many thanks Dave Hello, Im trying to find a way to check around 500-600 links to check if they are alive. It works fine for 5-6 links but once i add more links it just times out. 
Is there a way i could process this so it does 1 link at a time or somthing ? <?php include("config.php"); $query = "SELECT * FROM `games` WHERE `r_fileserve` <> \"\" LIMIT 500"; $result = mysql_query($query); while($row=mysql_fetch_assoc($result)) { $link_str = file_get_contents("$row[r_fileserve]"); $pattern = '<input type="hidden" name="download" value="normal"/>'; preg_match($pattern,$link_str,$match); if ($match[0] != null) { echo "Working <br />"; } else { echo "File Down <br />"; } } ?> hello dear Freaks
I am currently musing about porting a Python bs4 parser to PHP, working with the simplehtmldom parser or the DOM selectors (see below). The project: a list of meta-data for WordPress plugins. Approx. 50 plugins are of interest, but the challenge is that I want to fetch the meta-data of all existing plugins. What I subsequently want to filter out after the fetch is the plugins with the newest timestamp, i.e. those that were updated most recently. It is all about being up to date... https://wordpress.org/plugins/participants-database ....and so on and so forth.
https://wordpress.org/plugins/wp-job-manager

We have the following set of meta-data for each WordPress plugin:
Version: 1.9.5.12
Active installations: 10,000+
WordPress Version: 5.0 or higher
Tested up to: 5.4
PHP Version: 5.6 or higher
Tags: database, member, sign-up form, volunteer
Last updated: 19 hours ago
The project consists of two parts: the looping part (which seems to be pretty straightforward) and the parser part, where I have some issues (see below). I'm trying to loop through an array of URLs and scrape the data below from a list of WordPress plugins. See my loop below. As a base, I think it is a good starting point to work from the following target URL:
plugins wordpress.org/plugins/browse/popular with 99 pages of content: cf ...
The output of text_nodes: ['Version: 1.9.5.12', 'Active installations: 10,000+', 'Tested up to: 5.6 '] But what if we want to fetch the data of all the WordPress plugins and subsequently sort them to show, let us say, the 50 most recently updated plugins? This would be an interesting task:
First of all we need to fetch the URLs; then we fetch the information and sort by the newest timestamp, i.e. the plugin that was updated most recently. Finally, list the 50 newest items, that is, the 50 plugins that were updated most recently.
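The sorting step could look roughly like this in PHP. It is a sketch under assumptions: the fetch has already produced an array of records, and the array keys 'name' and 'last_updated' as well as the function name newestPlugins() are invented for illustration. It relies on strtotime() understanding relative strings such as "19 hours ago".

```php
<?php
declare(strict_types=1);

/**
 * Return the $count most recently updated plugins.
 * Each entry is assumed to look like
 * ['name' => 'some-plugin', 'last_updated' => '19 hours ago'].
 */
function newestPlugins(array $plugins, int $count = 50, ?int $now = null): array
{
    $now = $now ?? time();

    usort($plugins, static function (array $a, array $b) use ($now): int {
        // Unparseable dates fall back to 0 and therefore sort last.
        $ta = strtotime($a['last_updated'], $now) ?: 0;
        $tb = strtotime($b['last_updated'], $now) ?: 0;
        return $tb <=> $ta; // newest first
    });

    return array_slice($plugins, 0, $count);
}
```

Passing a fixed $now keeps the comparison stable when all pages were scraped at about the same moment.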
we have the following set, see here the soup:
    soup = BeautifulSoup(r.content, 'html.parser')
    target = [item.get_text(strip=True, separator=" ") for item in soup.find(
        "h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
    head = [soup.find("h1", class_="plugin-title").text]
    new = [x for x in target if x.startswith(
        ("V", "Las", "Ac", "W", "T", "P"))]
    return head + new

with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]
    for future in futures1:
        print(future.result())
see the formal output Quote
background: https://stackoverflow.com/questions/61106309/fetching-multiple-urls-with-beautifulsoup-gathering-meta-data-in-wp-plugins Well, I guess that we can do this with the Simple DOM Parser; here is the selector reference: https://stackoverflow.com/questions/1390568/how-can-i-match-on-an-attribute-that-contains-a-certain-string
look forward to any hint and help.
have a great day
Edited May 3, 2020 by dil_bert

Hello, I am working on a project that downloads large zip files from a server. For small files the script works well and downloads the files successfully, but for larger files (currently we are trying to download a 922MB file) it gives us this message (in Firefox) and doesn't download anything:

"File not found. Firefox can't find the file at http://www.domainname.com/abc.zip"

The script to download the file is as below:

$filename = "xyz.mp3";
header("Pragma: public");
header("Expires: 0");
header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
header("Content-Type: application/force-download");
header("Content-Type: application/octet-stream");
header("Content-Type: application/download");
header("Content-Disposition: attachment; filename=".basename($filename).";");
header("Content-Transfer-Encoding: binary");
header("Content-Length: ".filesize($filename));
if( !ini_get('safe_mode') ) set_time_limit(360000000);
readfile("$filename");

Please advise what the issue could be. If it is a file size issue, then how and where can we increase the limit to solve it? Thanks in advance,

When I try to upload a file larger than the server's max limit, the following code is not executed. How am I supposed to inform the user that their file is too large? NOTE: I've stripped the code down for this post.
Code: [Select]
<?php if(isset($_POST['submit'])) { echo "test.."; } ?>
<html> <head> <title>Upload Test</title> </head> <body>
<form action='' enctype='multipart/form-data' method='POST'> <input type='file' name='file_upload' /> <input type='submit' name='submit' value='upload' /> </form>
</body> </html>

Hi, I'm currently writing a script that basically downloads videos from a specific page. I am downloading with cURL; however, some files are so large that cURL is timing out.
This is causing either a) PHP to timeout b) PHP memory to run out c) cURL to stop once defined timeout limit is reached This means that some files are only partitially downloaded as some files are over 100mb and some are only 20mb I have Code: [Select] set_time_limit(0);and Code: [Select] ini_set("memory_limit","500M");set but is there a way to make it so PHP will not timeout and the cURL session will not timeout until the file is downloaded? Hi I'm learning php and trying to write a script to extract registration information from a large text file. Sadly my meagre knowledge of php is letting me down a bit. It's a case of knowing what you want the script to do but not having the knowlege of how to 'say it'. So i was hoping that if I posted my code here someone could either give me a few pointers on where i am going wrong or suggest a better way. The text file data luckily has a recurring format as follows (for brevity i've only included one entry, which contains made up information): From: bella_done@yahoo.co.uk Sent: 02 February 2011 22:50 To: Jonny tum, patsy fells, dingly bongo Subject: Subject: Fun Run 2010 Categories: Fun Run Name: Bella Donna Address: 14 brondle avenue Postcode: cd83 1rg Phone: 0287343510 Email: bella_don@yahoo.co.uk DOB: 15/11/1945 Half or Full: Full fun run How did you hear: Took part in 2010 As you can see the data has a convenient boundary at the 'from' field and the colon (or so it occurred to me) so I created my script as follows: // the string being analysed $the_string = " From: bella_done@yahoo.co.uk Sent: 02 February 2011 22:50 To: Jonny tum, patsy fells, dingly bongo Subject: Subject: Fun Run 2010 Categories: Fun Run Name: Bella Donna Address: 14 brondle avenue Postcode: cd83 1rg Phone: 0287343510 Email: bella_don@yahoo.co.uk DOB: 15/11/1945 Half or Full: Full fun run How did you hear: Took part in 2010"; // remove all formatting to work with a clean string $clean_string = strip_tags($the_string); // remove form field entries from 
the data and replace with commas and a ZZZ boundary $remove_fields = array("Categories:" => "","Name:" => ",","Address:" => ",","Postcode:" => ",","Phone:" => ",","Email:" => ",","DOB:" => ",","Half or Full:" => ",","How did you hear:" => ",","From:" => "ZZZ","Sent:" => ",","To:" => ",", ); $new_string = strtr("$clean_string",$remove_fields); // split the data at the boundary ZZZ $string_to_array = explode("ZZZ", $new_string); $new_string2 = implode("</br>",$string_to_array); echo $new_string2; $myFile = "address_list.csv"; $fh = fopen($myFile, 'w') or die("can't open file"); $stringData = $new_string2; fwrite($fh, $stringData); fclose($fh); One major problem is when i write the new data to a csv file the csv contains spacings that cause it to be reproduced in a column form rather than as separate fields for each comma boundary. So can anyone suggest either a) a better way of extracting the data from the text file (doesn't need to be 100% clean and perfect) b) How can i stop the spaces in the csv (i thought i would have fixed this when i stripped the tags from the string at the start??). Any help would be greatly received by a newbie phper. It's my first shot at performing anything moderately taxing so if I've made some blaring oversites I would very much welcome your wisdom! Thank you Drongo Going to try and explain this the best I can but I don't really have the best idea on what's happening here. I have a submission form for users to fill out their information and upload an image. I've set the file limit size at 500000 which I assumed would be safe for images at 400k or below. When testing locally, any image that is below that file size gets uploaded successfully. However, when testing on my online host/server.. the submission form and data is successfully entered but the image isn't saved at all. It obviously isn't over the size of the file limit I set because it dooesn't return an error.. it successfuly submits but doesn't save or resize my image. 
I really have no clue what the problem could be. I went over the variables I set for the folder locations to move the image to, and everything works fine locally, but once on the host and online, it doesn't happen.

Hello. My script is set to upload files up to 5GB in size. For that script I've currently set memory_limit to 5GB. Is that all right? I mean, what is the ideal value for large upload scripts? If you feel 5GB is too large, I can make the script upload 2GB files and set memory_limit accordingly. Also, I have currently set max_execution_time to 86400, assuming that on a 500Kbps broadband connection it would take up to 24 hours to upload a 3-5GB file. Please suggest. Thank you.

I'm trying to use a PHP script to parse a large XML file (around 450 MB) into a MySQL database, following a certain structure and the definitions of the included XML elements. The problem is that the original script uses file_get_contents and SimpleXMLElement to get it done, but the cron job executed by the server halts due to the size of the XML file. I'm no PHP expert, so I bought this XMLSplit software, divided the XML into 17 separate XML files of 30 MB each, and parsed them one by one using the same script. However, the output database was missing a lot of input, and I have serious doubts whether this gives the same output as the original file would if it were parsed without being divided.
So, I've decided to use XMLReader with this exact PHP script to parse this big XML file, but so far I couldn't manage to simply replace the parsing code and keep other functionality intact.
I'm including the script below; I'd really appreciate it if someone could help me do so.
<?php set_time_limit(0); ini_set('memory_limit', '1024M'); include_once('../db.php'); include_once(DOC_ROOT.'/include/func.php'); mysql_query("TRUNCATE screenshots_list"); mysql_query("TRUNCATE pages"); mysql_query("TRUNCATE page_screenshots"); // This is the part I need help with to change into XMLReader instead of utilized function, to enable parsing of the large XML file correctly (while keeping rest of the script code as is if possible): $xmlstr = file_get_contents('t_info.xml'); $xml = new SimpleXMLElement($xmlstr); foreach ($xml->template as $item) { //print_r($item); $sql = sprintf("REPLACE INTO templates SET id = %d, state = %d, price = %d, exc_price = %d, inserted_date = '%s', update_date = '%s', downloads = %d, type_id = %d, type_name = '%s', is_flash = %d, is_adult = %d, width = '%s', author_id = %d, author_nick = '%s', package_id = %d, is_full_site = %d, is_real_size = %d, keywords = '%s', sources = '%s', description = '%s', software_required = '%s'", $item->id, $item->state, $item->price, $item->exc_price, $item->inserted_date, $item->update_date, $item->downloads, $item->template_type->type_id, $item->template_type->type_name, $item->is_flash, $item->is_adult, $item->width, $item->author->author_id, $item->author->author_nick, $item->package->package_id, $item->is_full_site, $item->is_real_size, $item->keywords, $item->sources, $item->description, $item->software_required); //echo '<br>'.$sql; mysql_query($sql); //print_r($item->screenshots_list->screenshot); foreach ($item->screenshots_list->screenshot as $scr) { $main = (!empty($scr->main_preview)) ? 1 : 0; $small = (!empty($scr->small_preview)) ? 
1 : 0; insert_data($item->id, 'screenshots_list', 0, $scr->uri, $scr->filemtime, $main, $small); } foreach ($item->styles->style as $st) { insert_data($item->id, 'styles', $st->style_id, $st->style_name); } foreach ($item->categories->category as $cat) { insert_data($item->id, 'categories', $cat->category_id, $cat->category_name); } foreach ($item->sources_available_list->source as $so) { insert_data($item->id, 'sources_available_list', $so->source_id, ''); } foreach ($item->software_required_list->software as $soft) { insert_data($item->id, 'software_required_list', $soft->software_id, ''); } //print_r($item->pages->page); if (!empty($item->pages->page)) { foreach ($item->pages->page as $p) { mysql_query(sprintf("REPLACE INTO pages SET tpl_id = %d, name = '%s', id = NULL ", $item->id, $p->name)); $page_id = mysql_insert_id(); if (!empty($p->screenshots->scr)) { foreach ($p->screenshots->scr as $psc) { $href = (!empty($psc->href)) ? (string)$psc->href : ''; mysql_query(sprintf("REPLACE INTO page_screenshots SET page_id = %d, description = '%s', uri = '%s', scr_type_id = %d, width = %d, height = %d, href = '%s'", $page_id, $psc->description, $psc->uri, $psc->scr_type_id, $psc->width, $psc->height, $href)); } } } }}?>I'd appreciate your help with that... |
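A common pattern for this kind of conversion, sketched here under assumptions: XMLReader walks the file, and each <template> element is expanded on its own and handed over as a SimpleXMLElement, so a loop body written for SimpleXML can stay largely unchanged. streamTemplates() is a hypothetical helper name, and the sketch assumes the <template> elements are siblings of each other, as the foreach over $xml->template in the script suggests.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: yield each matching element of a large XML file
 * as a SimpleXMLElement, keeping only one element in memory at a time.
 */
function streamTemplates(string $uri, string $element = 'template'): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }

    try {
        // Move forward to the first matching element.
        while ($reader->read() && $reader->name !== $element) {
            // keep reading
        }

        while ($reader->name === $element) {
            $expanded = $reader->expand();
            if ($expanded === false) {
                throw new RuntimeException("Cannot expand <$element>");
            }
            // Copy the node into its own small document for SimpleXML.
            $dom = new DOMDocument();
            $node = $dom->importNode($expanded, true);
            $dom->appendChild($node);

            yield simplexml_import_dom($node);

            // Jump to the next sibling with the same name.
            $reader->next($element);
        }
    } finally {
        $reader->close();
    }
}
```

With this in place, the two lines that load the whole file could be replaced by `foreach (streamTemplates('t_info.xml') as $item) { ... }`, leaving the existing loop body as it is. Whether memory_limit can then be lowered depends on the size of the largest single <template> element.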