Parsing HTML

Andrew posted this at 15:34 — 4th December 2000.

Joined: Jan 2000

I'm writing a script where the response from a remote server contains an HTML string. I need to access only the info within the body tags of the HTML. For instance, the string returned might be:

<html>
<head><title>Page Title</title></head>
<body>
Here is the info that I need.
</body>
</html>

How would I accomplish this?

Rob Pengelly posted this at 22:55 — 4th December 2000.

They have: 850 posts

Joined: Jul 1999

Use a regular expression to do what you want. See
perldoc perlre
for more information about regular expressions.

my $html = "<html><body>i need this</body></html>";
# /s lets . match newlines; print only if the match succeeded
if ($html =~ m/<body>(.+?)<\/body>/is) {
	print $1;
}

output:
i need this

Hope that helps

[Edited by Rob Pengelly on Dec. 04, 2000 at 06:29 PM]


japhy posted this at 19:53 — 5th December 2000.

They have: 161 posts

Joined: Dec 1999

Technically, to properly parse HTML, you need some sort of tokenizer. Take, for instance, this regex, which supposedly removes all HTML tags:

$text =~ s/<.*?>//sg;

This breaks in several places:

<!-- comment out the <hr> tag -->

<img src="arrow.gif" alt="-->">

if X < 1 + Y, then Z > 2 - X
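To see the breakage concretely, here is a minimal sketch that runs the substitution over the three inputs above. The name strip_tags is made up for illustration; the regex is the one quoted in this post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive tag stripper: removes everything from a "<" to the next ">".
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<.*?>//sg;
    return $text;
}

my @samples = (
    '<!-- comment out the <hr> tag -->',   # leaves: [ tag -->]
    '<img src="arrow.gif" alt="-->">',     # leaves: [">]
    'if X < 1 + Y, then Z > 2 - X',        # leaves: [if X  2 - X]
);

# Brackets make leading and trailing whitespace visible.
print "[", strip_tags($_), "]\n" for @samples;
```

In each case the regex stops at the first ">" it finds, whether or not that character really closes a tag.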

Therefore, you need a module that can properly break things down. There is such a thing, HTML::Parser, on CPAN. I've also developed one, called YAPE::HTML, that will be available very soon. Look into using modules for this type of thing.
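For anyone who wants a starting point with HTML::Parser, here is a rough sketch using its version 3 handler interface. It assumes the module is installed from CPAN and that only the text inside the body is wanted (inner tags are dropped). The helper name body_text and the sample page are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;   # from CPAN, not part of core Perl

# body_text is a hypothetical helper: it returns the decoded text
# found between <body> and </body>, dropping any inner tags.
sub body_text {
    my ($html) = @_;
    my $in_body = 0;
    my $text    = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'tagname' arrives lowercased, without the slash on end tags
        start_h => [ sub { $in_body = 1 if $_[0] eq 'body' }, 'tagname' ],
        end_h   => [ sub { $in_body = 0 if $_[0] eq 'body' }, 'tagname' ],
        # 'dtext' is the text with entities such as &amp; decoded
        text_h  => [ sub { $text .= $_[0] if $in_body }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    return $text;
}

my $html = '<html><head><title>Page Title</title></head>'
         . '<body>Here is the info that I need.</body></html>';
print body_text($html), "\n";
```

The parser does the tokenizing, so attributes on the body tag do not confuse it, and the handlers only have to track whether they are currently inside the body.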

Andrew posted this at 20:05 — 5th December 2000.

They have: 16 posts

Joined: Jan 2000

Thank you both Rob and japhy.

I did look at the HTML::Parser module, but unfortunately I could not figure out how to use it properly. (The documentation is vague AND I'm a horrible Perl programmer... a bad combination.)

I ended up using something from the Perl Cookbook which is very similar to what Rob wrote:

($text) = ($html =~ m#<body>\s*(.*?)\s*</body>#is);

Luckily for me, the HTML I get back is always simple, so it works.

Thanks again!

Mark Hensler posted this at 03:09 — 6th December 2000.

He has: 4,048 posts

Joined: Aug 2000

Will there ever be anything in the <body> tag? This should work:

$html =~ /<BODY(.*)>(.*)<\/BODY>/i;
$text = $2;

japhy posted this at 03:39 — 6th December 2000.

They have: 161 posts

Joined: Dec 1999

Max - try that out first. First, you need the /s modifier in there to allow . to match newlines. Second, if you use it:

#!/usr/bin/perl

<< "HTML" =~ m!<BODY(.*)>(.*)</BODY>!is;
<html>
<body>
<b>Hello world!</b> What's up?
</body>   
</html>
HTML

print $2;

then you'll get too little content in $2. It will print " What's up?\n" instead of "\n<b>Hello world!</b> What's up?\n". This is because .* is greedy: it matches AS MUCH AS POSSIBLE while still allowing the overall match to succeed.

As it is, the first .* matches from the ">" at the end of "<body>" to the "b" in "</b>". Then the ">" in the pattern matches the ">" of "</b>", and the second (.*) is left with only " What's up?\n".

Greediness can bite you. And in cases like this, it's IMPERATIVE to use a properly formed tokenizer.
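A small sketch of the difference, using the same document as above. The second pattern is an assumed alternative, not something from the earlier posts: [^>]* keeps the attribute part from running past the end of the opening tag, and (.*?) stops at the first closing body tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = <<"HTML";
<html>
<body>
<b>Hello world!</b> What's up?
</body>
</html>
HTML

# Greedy, as in the post above: the first (.*) runs on to the "b" of "</b>".
my ($attrs, $greedy) = $html =~ m!<BODY(.*)>(.*)</BODY>!is;

# Assumed alternative: [^>]* cannot cross a ">", and (.*?) is non-greedy.
my ($lazy) = $html =~ m!<BODY[^>]*>(.*?)</BODY>!is;

print "greedy: [$greedy]\n";   # [ What's up?\n]
print "lazy:   [$lazy]\n";     # [\n<b>Hello world!</b> What's up?\n]
```

The non-greedy version still trips over a ">" inside an attribute value or a "</body>" inside a comment, so it is only suitable for simple, predictable input.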

Mark Hensler posted this at 05:01 — 6th December 2000.

He has: 4,048 posts

Joined: Aug 2000

uh huh...

so can someone tell me what all these do...
. * ?

I always thought . was one or more characters, * was zero or more characters, and ? was one character. But I'm probably wrong. I may be remembering that from the DOS days...


japhy posted this at 05:10 — 6th December 2000.

They have: 161 posts

Joined: Dec 1999

All regex metacharacters are explained in the perlre documentation.

. matches any character (except for \n, but that can be changed by using the /s modifier to a regex)

* matches 0 or more occurrences of some pattern, and opts for the most possible occurrences

? matches 0 or 1 occurrence of some pattern, and opts for 1 occurrence
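A few quick checks of the three metacharacters, as a sketch to paste into a script and experiment with. The last pair shows one more use of ? that the list above does not mention: placed after a quantifier, it makes that quantifier non-greedy, which is what the .*? patterns earlier in this thread rely on.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# . matches any one character except "\n", unless /s is used
my $dot_plain = "a\nb" =~ /a.b/  ? 1 : 0;   # 0: . will not match the newline
my $dot_s     = "a\nb" =~ /a.b/s ? 1 : 0;   # 1: /s lets . match the newline

# * takes as many repetitions as it can, and zero is allowed
my ($star) = "aaab" =~ /(a*)/;   # "aaa"
my ($none) = "bbb"  =~ /(a*)/;   # "" (matches zero a's at the start)

# ? makes the preceding item optional, preferring to match it
my ($with_u)    = "colour" =~ /(colou?r)/;   # "colour"
my ($without_u) = "color"  =~ /(colou?r)/;   # "color"

# After a quantifier, ? makes it non-greedy
my ($greedy) = "<b>one</b><b>two</b>" =~ m{<b>(.*)</b>};    # "one</b><b>two"
my ($lazy)   = "<b>one</b><b>two</b>" =~ m{<b>(.*?)</b>};   # "one"

print "dot: $dot_plain $dot_s\n";
print "star: [$star] [$none]\n";
print "optional: $with_u $without_u\n";
print "greedy: $greedy\nlazy: $lazy\n";
```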

If you have perl installed, you have perldoc, and can read the perlre documentation by typing perldoc perlre at your nearest shell. Or, read the documentation online, at http://www.perldoc.com/.