Parsing HTML

They have: 16 posts

Joined: Jan 2000

I'm writing a script where the response from a remote server contains an HTML string. I need to access only the info within the body tags of the HTML. For instance, the string returned might be:

<html>
<head><title>Page Title</title></head>
<body>Here is the info that I need.</body>
</html>

How would I accomplish this?

They have: 850 posts

Joined: Jul 1999

Use a regular expression to do what you want. See
perldoc perlre
for more information about regular expressions.

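my $html = "<html><body>i need this</body></html>";
# Capture everything between the body tags, non-greedily.
# /i ignores the case of the tags; /s lets . match newlines too.
if ($html =~ m/<body>(.+?)<\/body>/is) {
    print $1;
}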

output:
i need this

Hope that helps

[Edited by Rob Pengelly on Dec. 04, 2000 at 06:29 PM]

They have: 161 posts

Joined: Dec 1999

Technically, to properly parse HTML, you need some sort of tokenizer. Take, for instance, this regex, which supposedly removes all HTML tags:

$text =~ s/<.*?>//sg;

This breaks in several places:

<!-- comment out the <hr> tag -->

<img src="arrow.gif" alt="-->">

if X < 1 + Y, then Z > 2 - X
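To see the damage concretely, here is a small sketch that runs that same substitution over the three inputs above. The helper name strip_tags is made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the naive substitution shown above.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<.*?>//sg;    # delete from each "<" to the nearest ">"
    return $text;
}

my @samples = (
    '<!-- comment out the <hr> tag -->',
    '<img src="arrow.gif" alt="-->">',
    'if X < 1 + Y, then Z > 2 - X',
);

# Print each input next to what is left after "stripping".
for my $sample (@samples) {
    printf "%-36s => [%s]\n", $sample, strip_tags($sample);
}
```

The comment leaves " tag -->" behind, the image tag leaves a stray quote and ">", and the plain-text inequality loses everything between its "<" and ">".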

Therefore, you need a module that can properly break things down. There is such a thing, HTML::Parser, on CPAN. I've also developed one, called YAPE::HTML, that will be available very soon. Look into using modules for this type of thing.
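For anyone who wants a starting point with HTML::Parser, here is a minimal sketch of pulling out everything between the body tags. It assumes the module is installed from CPAN and uses its version 3 handler interface; the body_content function is a made-up name for illustration, not part of the module.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the original markup found inside <body>...</body>.
sub body_content {
    my ($html) = @_;
    my $in_body = 0;
    my $content = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tags: entering <body> switches collection on.
        start_h => [ sub {
            my ($tag, $text) = @_;
            if ($tag eq 'body') { $in_body = 1; return; }
            $content .= $text if $in_body;
        }, 'tagname, text' ],
        # End tags: leaving </body> switches collection off.
        end_h => [ sub {
            my ($tag, $text) = @_;
            if ($tag eq 'body') { $in_body = 0; return; }
            $content .= $text if $in_body;
        }, 'tagname, text' ],
        # Everything else (plain text, comments, ...) is copied as-is.
        default_h => [ sub { $content .= $_[0] if $in_body }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;
    return $content;
}

print body_content(
    '<html><head><title>Page Title</title></head>'
  . '<body>Hello <b>world</b></body></html>'
), "\n";
```

Because the parser tokenizes the input, a "</body>" that appears inside a comment or an attribute value is not mistaken for the real closing tag.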

They have: 16 posts

Joined: Jan 2000

Thank you both Rob and japhy.

I did look at the HTML::Parser module but unfortunately I could not figure out how to use it properly. (The documentation is vague AND I'm a horrible Perl programmer...a bad combination.)

I ended up using something from the Perl Cookbook which is very similar to what Rob wrote:

($text) = ($html =~ m#<body>\s*(.*?)\s*</body>#is);

Luckily for me, the HTML I get back is always simple, so it works.

Thanks again!

Mark Hensler's picture

He has: 4,048 posts

Joined: Aug 2000

Will there ever be anything in the <body> tag? This should work:

$html =~ /<body(.*)>(.*)<\/body>/i;
$text = $2;

They have: 161 posts

Joined: Dec 1999

Max - try that out first. First, you need the /s modifier in there to allow . to match newlines. Second, if you use it:

#!/usr/bin/perl

<< "HTML" =~ m!<BODY(.*)>(.*)</BODY>!is;
<html>
<body>
<b>Hello world!</b> What's up?
</body>  
</html>
HTML

print $2;

then you'll get too little content in $2. It will print " What's up?\n", instead of "\n<b>Hello world!</b> What's up?\n". This is because the .* is greedy, and will match AS MUCH AS POSSIBLE and still allow for a valid match.

As it is, the first .* matches from the ">" at the end of "<body>" to the "b" in "</b>". Then the ">" matches, and then you match " What's up?\n" in the second (.*).

Greediness can bite you. And in cases like this, it's IMPERATIVE to use a properly formed tokenizer.
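As a rough illustration of the greediness problem, the sketch below runs the greedy pattern next to a tightened variant on the same input. The tightened pattern is only an assumption about what a regex-only fix might look like; it is still not a tokenizer and will trip over a ">" inside an attribute value.

```perl
use strict;
use warnings;

my $html = "<html>\n<body>\n<b>Hello world!</b> What's up?\n</body>\n</html>\n";

# Greedy: the first (.*) runs past the end of the <body> tag.
my ($greedy_attrs, $greedy_body) = $html =~ m!<body(.*)>(.*)</body>!is;

# Tightened: [^>]* cannot leave the opening tag, and .*? stops at
# the first </body> it can reach.
my ($attrs, $body) = $html =~ m!<body([^>]*)>(.*?)</body>!is;

print "greedy    \$1: [$greedy_attrs]\n";
print "greedy    \$2: [$greedy_body]\n";
print "tightened \$1: [$attrs]\n";
print "tightened \$2: [$body]\n";
```

With the greedy pattern, $1 swallows ">\n<b>Hello world!</b" and $2 is left with " What's up?\n". With the tightened one, $1 is empty and $2 holds the whole body.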

Mark Hensler's picture

He has: 4,048 posts

Joined: Aug 2000

uh huh...

so can someone tell me what all these do...
. * ?

I always thought . was one or more characters, * was zero or more characters, and ? was one character. But I'm probably wrong. I may be remembering that from the DOS days...

Mark Hensler
If there is no answer on Google, then there is no question.

They have: 161 posts

Joined: Dec 1999

All regex metacharacters are explained in the perlre documentation.

. matches any character (except for \n, but that can be changed by using the /s modifier to a regex)

* matches 0 or more occurrences of some pattern, and opts for the most possible occurrences

? matches 0 or 1 occurrence of some pattern, and opts for 1 occurrence
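A few one-liners may make these concrete. This is just a sketch with made-up sample strings. It also shows one use not listed above: a ? placed directly after another quantifier (as in .*? or .+?) makes that quantifier non-greedy, which is how it is used in the patterns earlier in this thread.

```perl
use strict;
use warnings;

# . does not match a newline unless /s is given.
my $dot_plain = "a\nb" =~ /a.b/  ? 'match' : 'no match';
my $dot_s     = "a\nb" =~ /a.b/s ? 'match' : 'no match';

# * takes as many occurrences as it can.
my ($star) = 'aaab' =~ /(a*)/;

# ? takes one occurrence when one is available, but zero is fine too.
my ($opt)    = 'aab' =~ /(a?)/;
my $spelling = grep { /^colou?r$/ } qw(color colour);

# ? after a quantifier makes that quantifier non-greedy.
my ($greedy) = '<b>x</b>' =~ /<(.*)>/;
my ($lazy)   = '<b>x</b>' =~ /<(.*?)>/;

print "dot without /s: $dot_plain\n";
print "dot with /s:    $dot_s\n";
print "a* captured:    $star\n";
print "a? captured:    $opt\n";
print "colou?r hits:   $spelling\n";
print "greedy .*:      $greedy\n";
print "lazy .*?:       $lazy\n";
```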

If you have perl installed, you have perldoc, and can read the perlre documentation by typing perldoc perlre at your nearest shell. Or, read the documentation online, at http://www.perldoc.com/.
