Parsing HTML
I'm writing a script where the response from a remote server contains an HTML string. I need to access only the info within the body tags of the HTML. For instance, the string returned might be:
<html><head><title>Page Title</title></head>
<body>Here is the info that I need.</body></html>
How would I accomplish this?
Rob Pengelly posted this at 22:55 — 4th December 2000.
They have: 850 posts
Joined: Jul 1999
Use a regular expression to do what you want.
Run perldoc perlre for more information about regular expressions.
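my $html = "<html><body>i need this</body></html>";
# /i ignores the case of the tags; /s lets . match newlines inside the body
if ($html =~ m/<body>(.+?)<\/body>/is) {
    # only use $1 when the match actually succeeded
    print $1;
}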
output:
i need this
Hope that helps
[Edited by Rob Pengelly on Dec. 04, 2000 at 06:29 PM]
japhy posted this at 19:53 — 5th December 2000.
They have: 161 posts
Joined: Dec 1999
Technically, to properly parse HTML, you need some sort of tokenizer. Take, for instance, this regex, which supposedly removes all HTML tags:
$text =~ s/<.*?>//sg;
This breaks in several places:
<!-- comment out the <hr> tag -->
<img src="arrow.gif" alt="-->">
if X < 1 + Y, then Z > 2 - X
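To make those three failures concrete, here is a minimal sketch that runs the same substitution over each input. The helper name strip_tags_naively is hypothetical and exists only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The three problem inputs listed above.
my @samples = (
    '<!-- comment out the <hr> tag -->',
    '<img src="arrow.gif" alt="-->">',
    'if X < 1 + Y, then Z > 2 - X',
);

# Apply the naive tag-stripping regex to a copy and return the result.
# (strip_tags_naively is a made-up name for this sketch.)
sub strip_tags_naively {
    my ($text) = @_;
    $text =~ s/<.*?>//sg;
    return $text;
}

# Show what is left of each input after the substitution.
for my $sample (@samples) {
    printf "%-36s => [%s]\n", $sample, strip_tags_naively($sample);
}
```

The comment leaves " tag -->" behind, the img tag leaves a stray quote and angle bracket, and the plain-text inequality loses everything between its "<" and ">".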
Therefore, you need a module that can properly break things down. There is such a thing, HTML::Parser, on CPAN. I've also developed one, called YAPE::HTML, that will be available very soon. Look into using modules for this type of thing.
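For the original question (pulling out what sits between the body tags), a rough sketch with HTML::Parser might look like the following. It assumes HTML::Parser version 3 is installed from CPAN, since it is not part of core Perl. The helper name body_content is hypothetical, and this is one possible way to wire up the handlers rather than the module's recommended recipe.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, assumed to be installed

# Return the raw markup found between <body ...> and </body>.
# (body_content is a made-up helper name for this sketch.)
sub body_content {
    my ($html) = @_;
    my $in_body = 0;
    my $content = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tags: switch collection on at <body>, otherwise keep the tag.
        start_h => [ sub {
            my ($tag, $text) = @_;
            if    ($tag eq 'body') { $in_body = 1 }
            elsif ($in_body)       { $content .= $text }
        }, 'tagname, text' ],
        # End tags: switch collection off at </body>, otherwise keep the tag.
        end_h => [ sub {
            my ($tag, $text) = @_;
            if    ($tag eq 'body') { $in_body = 0 }
            elsif ($in_body)       { $content .= $text }
        }, 'tagname, text' ],
        # Everything else (plain text, comments, ...) is kept while in the body.
        default_h => [ sub {
            my ($text) = @_;
            $content .= $text if $in_body;
        }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;    # flush anything the parser is still buffering
    return $content;
}

print body_content(
    "<html><body bgcolor=\"white\">\n<b>Hello world!</b> What's up?\n</body></html>"
);
```

Because the parser tokenizes the input, attributes on the body tag and nested tags inside it do not confuse the extraction the way they can with a single regex.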
Andrew posted this at 20:05 — 5th December 2000.
They have: 16 posts
Joined: Jan 2000
Thank you both Rob and japhy.
I did look at the HTML::Parser module but unfortunately I could not figure out how to use it properly. (The documentation is vague AND I'm a horrible Perl programmer...a bad combination.)
I ended up using something from the Perl Cookbook which is very similar to what Rob wrote:
($text) = ($html =~ m#<body>\s*(.*?)\s*</body>#is);
Luckily for me, the HTML I get back is always simple, so it works.
Thanks again!
Mark Hensler posted this at 03:09 — 6th December 2000.
He has: 4,048 posts
Joined: Aug 2000
Will there ever be anything in the <body> tag? This should work:
$html =~ /<BODY(.*)>(.*)<\/BODY>/i;
$text = $2;
japhy posted this at 03:39 — 6th December 2000.
They have: 161 posts
Joined: Dec 1999
Max - try that out first. First, you need the /s modifier in there to allow . to match newlines. Second, if you use it:
#!/usr/bin/perl
<< "HTML" =~ m!<BODY(.*)>(.*)</BODY>!is;
<html>
<body>
<b>Hello world!</b> What's up?
</body>
</html>
HTML
print $2;
then you'll get too little content in $2. It will print " What's up?\n", instead of "\n<b>Hello world!</b> What's up?\n". This is because the .* is greedy, and will match AS MUCH AS POSSIBLE and still allow for a valid match.
As it is, the first .* matches from the ">" at the end of "<body>" to the "b" in "</b>". Then the ">" matches, and then you match " What's up?\n" in the second (.*).
Greediness can bite you. And in cases like this, it's IMPERATIVE to use a properly formed tokenizer.
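As a small illustration of the greediness problem described above, the sketch below runs the greedy pattern and a tighter variant against the same input. The tighter pattern ([^>]* for the attributes, .*? for the content) is an assumption added here for comparison, not something proposed in the thread, and it still breaks if an attribute value contains ">".

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = "<html>\n<body>\n<b>Hello world!</b> What's up?\n</body>\n</html>\n";

# Greedy: the first .* swallows everything up to the last ">" that still
# leaves a "</body>" to match, so the capture starts after "</b>".
my ($greedy_body) = ($html =~ m!<BODY.*>(.*)</BODY>!is);

# Tighter: [^>]* cannot run past the end of the opening tag, and .*? stops
# at the first "</body>" it can reach.
my ($tight_body) = ($html =~ m!<BODY[^>]*>(.*?)</BODY>!is);

print "greedy: [$greedy_body]\n";
print "tight:  [$tight_body]\n";
```

The greedy capture holds only " What's up?\n", while the tighter one holds the whole body including the bold tags.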
Mark Hensler posted this at 05:01 — 6th December 2000.
He has: 4,048 posts
Joined: Aug 2000
uh huh...
so can someone tell me what all these do...
. * ?
I always thought . was one or more characters, * was zero or more characters, and ? was one character. But I'm probably wrong. I may be remembering that from the DOS days...
japhy posted this at 05:10 — 6th December 2000.
They have: 161 posts
Joined: Dec 1999
All regex metacharacters are explained in the perlre documentation.
. matches any character (except for \n, but that can be changed by using the /s modifier to a regex)
* matches 0 or more occurrences of some pattern, and opts for the most possible occurrences
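? matches 0 or 1 occurrence of some pattern, and opts for 1 occurrence. When it directly follows another quantifier, as in .*? or .+?, it instead makes that quantifier non-greedy, so it opts for the fewest possible occurrences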
If you have perl installed, you have perldoc, and can read the perlre documentation by typing perldoc perlre at your nearest shell. Or, read the documentation online, at http://www.perldoc.com/.
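The behaviour of the three metacharacters can be checked with a few one-line matches. This is only a sketch, and the sample strings and variable names are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# . stops at a newline unless the /s modifier is given
my $two_lines   = "aaa\nbbb";
my ($without_s) = $two_lines =~ /(a.*b)/;     # no match, so undef
my ($with_s)    = $two_lines =~ /(a.*b)/s;    # matches across the newline

# * accepts zero occurrences, but takes as many as it can
my ($star_many) = "aaab" =~ /^(a*)/;          # "aaa"
my ($star_none) = "bbb"  =~ /^(a*)/;          # "" (zero occurrences still match)

# ? makes the preceding item optional (0 or 1 occurrence)
my @spellings = grep { /^colou?r$/ } qw(color colour colouur);

# ? after a quantifier makes that quantifier non-greedy
my $markup   = "<b>one</b><b>two</b>";
my ($greedy) = $markup =~ /<b>(.*)<\/b>/;     # "one</b><b>two"
my ($lazy)   = $markup =~ /<b>(.*?)<\/b>/;    # "one"

print "without /s: ", (defined $without_s ? "matched" : "no match"), "\n";
print "with /s:    ", (defined $with_s    ? "matched" : "no match"), "\n";
print "a* on aaab: [$star_many]\n";
print "a* on bbb:  [$star_none]\n";
print "colou?r:    @spellings\n";
print "greedy:     [$greedy]\n";
print "lazy:       [$lazy]\n";
```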