PHP - How to Parse HTML with DOMDocument?
Imagine an HTML document with the following structure:

Code: [Select]
<div class="item">
  <div class="title">
    <a class="title" href="http://www.domain.com/title.html">Title is here</a>
  </div>
  <div class="image">
    <a href="http://www.domain.com/title.html"><img src="image.jpg" /></a>
  </div>
</div>

How do you make an array containing $title - $url - $image_url?
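A minimal sketch of one way to do this with DOMDocument and DOMXPath. The markup and domain.com URLs are the sample from the question; for a live page, $html would come from file_get_contents() or cURL:

Code: [Select]
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; suppress parser warnings
$dom->loadHTML($html);            // $html holds the markup shown above

$xpath = new DOMXPath($dom);
$result = array();
foreach ($xpath->query("//div[@class='item']") as $item) {
    // the second argument to query() scopes the search to this one item
    $titleLink = $xpath->query(".//div[@class='title']/a", $item)->item(0);
    $img       = $xpath->query(".//div[@class='image']//img", $item)->item(0);
    $result[] = array(
        'title'     => $titleLink ? trim($titleLink->nodeValue) : null,
        'url'       => $titleLink ? $titleLink->getAttribute('href') : null,
        'image_url' => $img ? $img->getAttribute('src') : null,
    );
}
print_r($result);

Each element of $result is then an array with the three keys the question asks for.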
Similar Tutorials

Delete ... need to change host ..

What would the code be to display server-side output, such as the date and time, in HTML pages? I have the date code for my copyright notice, <?php echo date('Y'); ?>, and I think I need to create an .htaccess file to put it in, but I'm not sure what to put there. Thanks.

OK... I'm pretty new to PHP for the most part, but I understand programming languages to a decent extent! Anyway, I'm trying to parse an HTML page to get data out of it and, in turn, probably put it into an SQL table. All I need help with is the parsing, using DOM and XPath queries, or however would be the best way to do this. The page I'm trying to parse: http://us.battle.net/wow/en/guild/Moonrunner/The%20Eternal%20Blade/news Basically, the data I want to put into SQL (or variables, for the time being) is the 25 results returned in the news list (the first one is "mudkips item Vicious Gladiator's Signet of Cruelty", and the last is "Lionus earned the achievement Level 30 for 10 points"). Can anyone please give me some help with a function that could do this? Please!

I'm trying to parse two things: 1. specific TD tags from a table, and 2. specific URLs from an HTML page. Here's part of the data I'm trying to parse:

Code: [Select]
<tr>
  <td class="f">
    <a href="http://main1.site.com/x.html">Page 1</a>
  </td>
  <td>1572</td>
  <td class="a">Type: F</td>
  <td><img src="http://site.com/image.gif" title="N" alt="N" /></td>
  <td class="f">F</td>
</tr>
<tr class="x">
  <td class="m">
    <a href="http://main2.site.com/x.html">Page 2</a>
  </td>
  <td>1771</td>
  <td class="a">Type: M</td>

Here's the parser that I'm working with:

Code: [Select]
<?php
$html = file_get_contents('http://www.website.com/page.html');

// use this to only match "td" tags
#preg_match_all("/(<(td)>)([^<]*)(<\/\\2>)/", $html, $matches);

// use this to match any tags
#preg_match_all("/(<([\w]+)[^>]*>)([^<]*)(<\/\\2>)/", $html, $matches);

// use this to match URLs
#preg_match_all("/http:\/\/[a-z0-9A-Z.]+(?(?=[\/])(.*))/", $html, $matches);

// use this to match URLs
#preg_match_all("/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $matches);

preg_match_all("/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $matches);

for ($i = 0; $i < count($matches[0]); $i++) {
    echo "matched: " . $matches[0][$i] . "\n<br>";
    echo "part 1: " . $matches[1][$i] . "\n<br>";
    echo "part 2: " . $matches[2][$i] . "\n<br>";
    echo "part 3: " . $matches[3][$i] . "\n<br>";
    echo "part 4: " . $matches[4][$i] . "\n\n<br>";
}
?>

What I'm trying to output is:

Code: [Select]
<a href="http://main1.site.com/x.html">Page 1</a> Hits: 1572
<a href="http://main2.site.com/x.html">Page 2</a> Hits: 1771

...for the entire table. What I've managed to get out of it so far are the "Hits", with the "td" snippet. What I can't figure out is how to extract the full <a href="http://main.site.com/p#.html">Page #</a>. So my question is: how can I make it look for just "<a href="http://main#.......">Page #</a>"? Currently it looks for every URL, which is not what I need.
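Since the thread is about DOMDocument, here is a hedged sketch of the same extraction done without regex, pairing each row's link with the hits cell beside it. The URL is the placeholder from the question, and the assumption that the hit count sits in the row's second cell comes from the sample rows above:

Code: [Select]
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('http://www.website.com/page.html'));

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//tr') as $row) {
    $link = $xpath->query('.//td/a', $row)->item(0);  // first link in the row
    $hits = $xpath->query('./td[2]', $row)->item(0);  // assumed: hits in the 2nd cell
    if ($link && $hits) {
        // saveXML($node) serializes just that node, giving the full <a ...> tag
        echo $dom->saveXML($link) . ' Hits: ' . trim($hits->nodeValue) . "<br>\n";
    }
}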
Hello, I need to parse some HTML for validation. Not an entire HTML page, but something like a "string" of HTML tags (I don't know how to say it correctly in English). Basically, I have this to parse (for example):

Code: [Select]
<object width="100%" height="81">
  <param value="http://player.soundcloud.com/player.swf?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F12483908" name="movie">
  <param value="always" name="allowscriptaccess">
  <embed width="100%" height="81" type="application/x-shockwave-flash" src="http://player.soundcloud.com/player.swf?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F12483908" allowscriptaccess="always">
</object>

I know it's possible in PHP, but I don't know which function there is for this. Anyone? Thanks!

Hi everyone, I have been successful in parsing some data out of an HTML page that I am downloading using cURL. I used arrays and preg_match to get the data I need. However, some of the data contains a great deal of SPACE characters, and it seems that my array method doesn't work there. Can someone please point out how I can parse the following to get only the information out and not the tags (quoted excerpt including all the space characters, as downloaded):

Code: [Select]
<span class="basic_serial">(777) 777-7777</span>
<br />
1111 ABCD, EFGH, IJKL
<br />

Thanks.

Is it possible to parse an HTML document containing snippets of PHP code using DOMDocument? I.e. load the HTML from a file, parse/change it with DOMDocument, and then save it back to the file. I have tried, but I get <?php%20echo%20URL();%20?>

I have some code:

Code: [Select]
$doc = new DOMDocument();
$doc->loadHTML('<html>
<head><title>Test</title></head>
<body></body></html>');
$doc->encoding = 'iso-8859-1';
file_put_contents('test.html', $doc->saveHTML());

When I view the output file I get <html><head><title>Test</title></head><body></body></html> all on one line. Is there no way of having it format the output like the original source code, so that it's not all bunched together?

Is there any reason that people can think of as to why DOMDocument::saveHTML would remove the following:

Code: [Select]
<![if !vml]>
<img src="someimage.jpg" />
<![endif]>

A little clarification: this HTML comment tag is used in my company's email newsletter code and is necessary to make Outlook 2007 behave properly. For whatever reason, saveHTML strips it out. I know that this doesn't conform to HTML standards, and I'm guessing that is why it is being stripped. BUT, from reading on the internet, saveHTML can produce junk HTML code anyway. Any help is appreciated.

Hi guys, just starting to play with PHP DOMDocument, only to fail at the very first step:

Code: [Select]
<?php
$html = 'test/php/somefile.html';
if (!empty($html)) {
    $dom_1 = new domDocument;
    $dom_1->loadHTML($html);
    $links = $dom_1->getElementsByTagName('li');
    foreach ($links as $link) {
        // echo $link;
        echo $link->nodeValue, PHP_EOL;
    }
}
?>

When I visit it in a browser I get a WSOD. What am I missing?

Hi all, I am pretty new to PHP and I am having an issue trying to load an XML document. Whenever I try to use XPath it negates all the code below the line, including the HTML, and returns a white page. Here is my code:

Code: [Select]
<html>
<head>
<?php $xpath = new DOMXPath("structure.xml"); ?>
<body>
hello world
</body>
</html>

I checked phpinfo() and I have both DOM and XPath enabled and installed. I have also tried using just DOM, and that worked, so it is only XPath that is not working. Ideas? Thank you, James S
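For the last two posts: loadHTML() expects a string of markup, not a file path (that is what loadHTMLFile() is for), and the DOMXPath constructor takes an already-loaded DOMDocument object, not a filename. A small sketch of both working patterns, using the paths from the questions:

Code: [Select]
<?php
// loadHTMLFile() takes a path; loadHTML() takes the markup itself.
$dom_1 = new DOMDocument();
libxml_use_internal_errors(true);
$dom_1->loadHTMLFile('test/php/somefile.html');
foreach ($dom_1->getElementsByTagName('li') as $link) {
    echo $link->nodeValue, PHP_EOL;
}

// DOMXPath wraps a DOMDocument; load the XML file first.
$xml = new DOMDocument();
$xml->load('structure.xml');
$xpath = new DOMXPath($xml);
foreach ($xpath->query('//*') as $node) {
    echo $node->nodeName, PHP_EOL;
}

As a general note, a white page usually means a fatal error with display_errors switched off; enabling it or checking the error log shows the actual message.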
So there's some HTML I'm having to fetch and parse for personal use, but some of the data I want starts in a table written this way:

Code: [Select]
rawTableData = {"rows": [{"colname1": value, "colname2": "value", "colname3: value}

How can I use DOMDocument to parse this data if, say, I want the value for colname2? There are no tags for me to use.

It is easy to get an image or a link by DOMDocument, but I did not find a way to get an image together with its target link. Imagine HTML such as:

Code: [Select]
<div class="image">
  <a href='http://site.com'><img src='imagelink.jpg'></a>
</div>

How to get both the image link and the href?

Code: [Select]
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//div[@class='image']");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    // ...
}

Now to get the image and its href, we first need getElementsByTagName('a') and getElementsByTagName('img'), but they do not work inside foreach. What's your idea?

Hi guys, reading this from php.net has got me a wee bit confused. Trying to implement it has got me doubly confused! My code:

Code: [Select]
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile($parent_node);
if ($dom->childNodes <> 0) {
    $kids = array(
        'url' => $parent_node,
        'No_of_kids' => count($dom->childNodes)
    );
}

Results in "Notice: Object of class DOMNodeList could not be converted to int". How the heck am I supposed to count the childNodes?

Good evening dear PHPFreaks - hello to everybody. I want to create a link parser. I have chosen to do it with cURL. I have some lines together now; I'd love to hear your review... Since I am new to programming I'd love to get some hints from experienced devs. Here are some details: we have several hundred result pages derived from this one: http://www.educa.ch/dyn/79362.asp?action=search Note: I want to iterate over the result pages with a loop, e.g. http://www.educa.ch/dyn/79376.asp?id=1568 and http://www.educa.ch/dyn/79376.asp?id=2149. I take this loop:

Code: [Select]
for ($i = 1; $i <= $match[1]; $i++) {
    $url = "http://www.example.com/page?page={$i}";
    // access new sub-page, extract necessary data
}

Dear PHPFreaks, what do you think? What about the loop over the target URLs? BTW, as you see, some pages will be empty. Note: the empty pages should be thrown away; I do not want to store "empty" stuff. Well, this is what I want to do, and now I need a good parser script. Note: this is a three-part job: 1. fetching the sub-pages, 2. parsing them, 3. storing the data in a MySQL DB. The problem: some of the above-mentioned pages are empty, so I need a solution to leave them aside, because I do not want to populate my MySQL DB with too much noise. BTW, parsing should be a part that can be done with DOMDocument - what do you think? I need to combine the first part with the second; can you give me some starting points and hints to get this done? The fetching job should be done with cURL, and the data then processed in a DOMDocument parsing job. Note: I've taken the script from this place: http://www.merchantos.com/makebeta/php/scraping-links-with-php/

Code: [Select]
function storeLink($url, $gathered_from) {
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed');
}

for ($i = 1; $i <= 10000; $i++) {
    $target_url = "http://www.educa.ch/dyn/79376.asp?id={$i}";
}
// access new sub-page, extract necessary data

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url, $target_url);
    echo "<br />Link stored: $url";
}

Dear PHPFreaks, what do you think? What about the loop over the target URLs? Love to hear from you!

I just finished (or so I thought) a project, but my client's server runs PHP 4, so I need to adapt my code. Here's what stopped working:

Code: [Select]
$localClasses = new DOMDocument;
$localClasses->load("file.xml");
$localClasses->get_elements_by_tagname('Title')->item(0)->firstChild->nodeValue;

Here's my petty attempt at adapting this code to run in PHP 4:

Code: [Select]
$file = file_get_contents("localClasses.xml");
$localClasses = new DOMDocument($file);
$test = $localClasses->get_elements_by_tagname('Title');
$testText = $test->item[0]->firstChild->nodeValue;
print $testText;

This doesn't give me any errors, but nothing shows up. Any help would be appreciated. Thanks for reading!

Code: [Select]
$domdoc = new DOMDocument();
$domdoc->formatOutput = TRUE;
$empty_cart_xml =
'<Order>
  <Cart>
    <Items>
      <Item>1</Item>
      <Item>2</Item>
      <Item>3</Item>
    </Items>
  </Cart>
</Order>';
$domdoc->loadXML($empty_cart_xml);
print $domdoc->saveXML() . "<hr/>"; // works up to this point

$xpath = new DOMXPath($domdoc);
$items = $xpath->query('Order/Cart/Items');
foreach ($itemses AS $items) {
    $items->appendChild($domdoc->createElement('Item', '4'));
}
print $domdoc->saveXML();

All I want to do is to add a new Item to Items. What am I doing wrong?
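In that last snippet the foreach reads from $itemses, which is never defined, and its loop variable overwrites $items. Iterating the DOMNodeList returned by query() fixes it; a sketch of the corrected tail, keeping the poster's XML:

Code: [Select]
$xpath = new DOMXPath($domdoc);
$nodes = $xpath->query('/Order/Cart/Items');
foreach ($nodes as $itemsNode) {
    // append <Item>4</Item> to each matched <Items> element
    $itemsNode->appendChild($domdoc->createElement('Item', '4'));
}
print $domdoc->saveXML();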
Hi, I have PHP code that can extract the categories and display them. However, I still can't extract the numbers that go along with them (without the brackets). This is my code:

Code: [Select]
<?php
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.lelong.com.my/Auc/List/BrowseAll.asp");
$finder = new DOMXPath($grep);
$class = "CatLevel1";
$nodes = $finder->query("//*[contains(@class, '$class')]");
foreach ($nodes as $node) {
    $span = $node->childNodes;
    echo $span->item(0)->nodeValue . "<br>";
}
?>

This is my desired output:

Code: [Select]
Arts, Antiques & Collectibles : 9768
B2B & Industrial Products : 2342
Baby : 3453
etc...
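A hedged sketch of one way to split the count out, assuming each CatLevel1 element's full text reads like "Arts, Antiques & Collectibles (9768)" (the exact markup of the live page isn't shown in the post):

Code: [Select]
<?php
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.lelong.com.my/Auc/List/BrowseAll.asp");
$finder = new DOMXPath($grep);
$nodes = $finder->query("//*[contains(@class, 'CatLevel1')]");
foreach ($nodes as $node) {
    $text = trim($node->textContent);  // whole text: name and count together
    // peel a trailing "(1234)" off the category name, if one is there
    if (preg_match('/^(.+?)\s*\((\d+)\)\s*$/s', $text, $m)) {
        echo trim($m[1]) . ' : ' . $m[2] . "<br>";
    } else {
        echo $text . "<br>";           // no count found; print as-is
    }
}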
Good day dear PHPFreaks - hello to everybody. This is the same link-parser job as in my post above: iterate over the educa.ch result pages (http://www.educa.ch/dyn/79376.asp?id=...), fetch each sub-page with cURL, parse it with DOMDocument, store the data in a MySQL DB, and throw the empty pages away. No problem there. But how to do the DOMDocument job? I have installed Firebug in Firefox, and now I have the XPaths for these pages: http://www.educa.ch/dyn/79376.asp?id=1187 http://www.educa.ch/dyn/79376.asp?id=2939 http://www.educa.ch/dyn/79376.asp?id=1515 http://www.educa.ch/dyn/79376.asp?id=1469

Code: [Select]
Altes Schulhaus Ossingen          :: /html/body/div[2]
Guntibachstrasse 10               :: /html/body/div[4]
8475 Ossingen                     :: /html/body/div[6]
sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
Tel:052 317 15 45                 :: /html/body/div[11]
Fax:052 317 04 42                 :: /html/body/div[12]

But how do I apply these in Simple HTML DOM? I want to use this: http://simplehtmldom.sourceforge.net/ I look forward to a hint that gives me a starting point.

Hello dear friends - I want to test whether the DOMDocument class exists. Can I do this in the shell (on openSUSE 11.3)?

Code: [Select]
bool class_exists ( string $class_name [, bool $autoload = true ] )
bool class_exists ( string $DOMdocument [, bool $autoload = true ] )

Or do I have to create a file that I then call from the shell!? I look forward to an idea / hint / tip. Regards, db1
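class_exists() can be run straight from the shell with the PHP CLI, no script file needed; the -r flag executes the code given on the command line:

Code: [Select]
php -r 'var_dump(class_exists("DOMDocument"));'
# prints bool(true) when the dom extension is loaded

# alternatively, list the compiled-in modules and look for "dom"
php -m | grep -i dom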