pattern matching in PHP

Suzanne posted this at 10:14 — 11th February 2002.

Joined: Feb 2000

I'm doing something daft, I just know it.

Using the source code from Dean Allen's word stripper, I want to convert it to search for a strange occurrence in some other files.

His openly-released source is:

<?
// Okay, so here's the MS Word HTML cleaner,
// by Dean Allen
// No copyright, no warranty whatsoever.
// include() this on a PHP page, and it should work.
// If you don't know what that means, it likely won't.
// You should eat more greens.
// if no file, ask for one
if(!($userfile)) {
?>
How to Use
" method=post>

Write something in Word
Save as Web Page
Choose the HTML file:

Then,

<?
} else { if (!file_exists($userfile)) {
print 'No file selected';
exit;
} $text = file($userfile);
$text = implode("\r",$text); // normalize white space
$text = eregi_replace("[[:space:]]+", " ", $text);
$text = str_replace("> <",">\r\r<",$text);
$text = str_replace("","\r",$text); // remove everything before
$text = strstr($text,"]*BodyTextIndent[^>]*>([^\n|\n\015|\015\n]*)","\\1",$text);
$text = eregi_replace("]*margin-left[^>]*>([^\n|\n\015|\015\n]*)","

\\1

",$text);
$text = str_replace(" ","",$text); //clean up whatever is left inside and

$text = eregi_replace("]*>","",$text);
$text = eregi_replace("

]*>","

",$text); // kill unwanted tags
$text = eregi_replace("]*>","",$text);
$text = eregi_replace("]*>","",$text);
$text = eregi_replace("]*>","",$text);
$text = eregi_replace("<\![^>]*>","",$text);
$text = eregi_replace("]*>","",$text); // kill style and on mouse* tags
$text = eregi_replace("([ \f\r\t\n\'\"])style=[^>]+", "\\1", $text);
$text = eregi_replace("([ \f\r\t\n\'\"])on[a-z]+=[^>]+", "\\1", $text); //remove empty paragraphs
$text = str_replace("","",$text);

//remove closing
$text = str_replace("","",$text); //clean up white space again
$text = eregi_replace("[[:space:]]+", " ", $text);
$text = str_replace("> <",">\r\r<",$text);
$text = str_replace("","\r",$text);?>

Converted Text
(Parsed preview – code appears below)

<?
print $text;
?>

Cleaned HTML
<?
print "$text";
}
?>

Now I have run the file through this and it works mostly fine -- still some things to clean up. Throughout the file (by which I mean there are hundreds of occurrences) I have stuff like this:

I don't need this, don't know where it comes from but some paragraphs have up to 50 of these alone, and this is a 15 page document.

I don't want to remove them by hand.

SO...

Mark suggested this line when I first asked him:

$text = preg_replace('#<a name=(["\'])[^"\']+\\1></a>#i', '', $text);

The problem is, when I use that, it only returns the value "Array".

I have muddled through this on my own a few times and can't seem to do any better than "Array", or parsing errors.

I read through php.net but I'm afraid that it is all just a little too much for me to get my head around.

I want to replace the bold code with this:

$text = file($userfile);
$text = preg_replace('#<a name=(["\'])[^"\']+\\1></a>#i', '', $text);

(only those two lines, as all I want to do is ditch the stupid internal targets and the file has already been stripped).

So, the big question I have is why do I keep getting "Array" instead of nice cleaned up code?

Suzanne

Wil posted this at 12:13 — 11th February 2002.

They have: 601 posts

Joined: Nov 2001

http://www.php.net/manual/en/function.preg-replace.php

On the page above, it gives you the following example - why don't you use this one? Looks a lot cleaner to me.

<?php
// $document should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.

$search = array ("'<script[^>]*?>.*?</script>'si",  // Strip out javascript
                 "'<[\/\!]*?[^<>]*?>'si",           // Strip out html tags
                 "'([\r\n])[\s]+'",                 // Strip out white space
                 "'&(quot|#34);'i",                 // Replace html entities
                 "'&(amp|#38);'i",
                 "'&(lt|#60);'i",
                 "'&(gt|#62);'i",
                 "'&(nbsp|#160);'i",
                 "'&(iexcl|#161);'i",
                 "'&(cent|#162);'i",
                 "'&(pound|#163);'i",
                 "'&(copy|#169);'i",
                 "'&#(\d+);'e");                    // evaluate as php

$replace = array ("",
                  "",
                  "\\1",
                  "\"",
                  "&",
                  "<",
                  ">",
                  " ",
                  chr(161),
                  chr(162),
                  chr(163),
                  chr(169),
                  "chr(\\1)");

$text = preg_replace ($search, $replace, $document);
?>

It seems to me a lot more efficient, compiling everything that needs to be searched and replaced into an array before calling the regex engine?
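
A minimal sketch of how the array form pairs things up, using three of the patterns from the manual example. The sample `$document` string and the "GBP " replacement are made up for illustration. Each pattern is applied in turn to the result of the previous one.

```php
<?php
// Hypothetical input, not taken from the thread.
$document = "<b>Fish &amp; chips</b>\n\n   &pound;5";

// Pattern N is replaced with replacement N, in order.
$search  = array("'<[\/\!]*?[^<>]*?>'si",  // strip html tags
                 "'&(amp|#38);'i",         // &amp;   -> &
                 "'&(pound|#163);'i");     // &pound; -> "GBP " (illustrative only)
$replace = array("", "&", "GBP ");

$text = preg_replace($search, $replace, $document);

echo $text, "\n";
```

Note that this strips every tag, so it suits a "plain text only" job rather than removing one specific kind of tag.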

- wil

Suzanne posted this at 19:47 — 11th February 2002.

She has: 5,507 posts

Joined: Feb 2000

That is very efficient, yes!

But that doesn't help me with my problem.

I don't know how to say "everything possible between the double quotes" in regular expressions.

I just need to search for and replace the extra internal targets. Everything else is done already.

Also, I don't want to remove all the html tags, I just want to remove specific html tags...

Hold the phone -- would "'<[/!]*?[^<>]*?>'si" be what I'm looking for? Albeit not entirely?

How would I adjust it to match just the empty <a name> anchors, where the value of name is alphanumeric (caps, lowercase, numbers, hyphens)?

Suzanne

Mark Hensler posted this at 21:37 — 11th February 2002.

He has: 4,048 posts

Joined: Aug 2000

Ahh... you're reading from a file, so it returns an array. We need to convert it into a string before we can work on it. So, we'll need to use the implode() function.

$text = file($userfile); 
$text = implode("", $text); 
$text = preg_replace('#<a name=(["\'])[^"\']+\\1></a>#i', '', $text); 
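
A small self-contained sketch of why "Array" shows up. file() returns an array of lines, preg_replace() handed an array gives an array back, and printing an array prints the word "Array". The temporary file and its contents are invented here just so the snippet runs on its own.

```php
<?php
// Hypothetical sample file, created only to make the sketch self-contained.
$userfile = tempnam(sys_get_temp_dir(), 'word');
file_put_contents($userfile, "<p>one</p>\n<a name=\"_Toc1\"></a>\n<p>two</p>\n");

$lines = file($userfile);       // array: one element per line, newlines kept
$text  = implode("", $lines);   // join the lines back into a single string

$pattern = '#<a name=(["\'])[^"\']+\\1></a>#i';
$clean   = preg_replace($pattern, '', $text);

echo gettype($lines), "\n";     // array
echo $clean;                    // the two paragraphs, anchor removed

unlink($userfile);
```

Where it is available, file_get_contents($userfile) reads the whole file into a string in one step, which should make the implode() unnecessary.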

Mark Hensler
If there is no answer on Google, then there is no question.

Suzanne posted this at 21:51 — 11th February 2002.

She has: 5,507 posts

Joined: Feb 2000

Sweet heavenly programmers.

Thank you, Mark.

I knew I was doing something daft!

That worked like a charm.

And now I understand a little more about php, in the bargain.

Thank you!

Suzanne

Suzanne posted this at 22:01 — 11th February 2002.

She has: 5,507 posts

Joined: Feb 2000

I also have (now visible since all the others are gone), a whole raft of internal targets that look like this:

To modify the code so it matches everything with or without quotation marks, would I do this?:

$text = preg_replace('##i', '', $text);

No. that doesn't work.

Hm.

I think maybe I need to understand how this all goes together. Can you (anyone) put it in English for me?

Suzanne

Mark Hensler posted this at 00:40 — 12th February 2002.

He has: 4,048 posts

Joined: Aug 2000

LXXXIV. Regular Expression Functions (Perl-Compatible)
- Pattern Syntax

After writing this, I made an observation. I don't want to grab strings up to any quote, but the quote that matches the opening quote...

$text = file($userfile);
$text = implode("", $text);
$text = preg_replace('#<a name=(["\'])[^\\1]+\\1></a>#Ui', '', $text);

Before, it wouldn't catch:

because it has a single quote within double quotes. This fixed that.

Lemme see if I can explain it...
$text = preg_replace('#<a name=(["\'])[^\\1]+\\1></a>#Ui', '', $text);

Single quote. Can also be a double quote. PHP won't parse for variables inside single quotes.

The pound/hash (#) is the delimiter. It can be almost any character that isn't a letter, a digit, a backslash or whitespace, but it must be the first character in the quotes. Most often the forward slash (/) is used. The syntax is: delimiter, pattern, delimiter, modifiers. Without the commas or spaces, of course.
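
To illustrate the delimiter point, here is a short sketch with the same pattern written twice. The sample string is made up. With / as the delimiter, the slash in </a> has to be escaped, which is one reason # is handy for HTML.

```php
<?php
// Hypothetical sample input.
$html = '<a name="top"></a><p>text</p>';

// Same pattern, two different delimiters.
$withHash  = preg_replace('#<a name="top"></a>#i', '', $html);
$withSlash = preg_replace('/<a name="top"><\/a>/i', '', $html);

var_dump($withHash === $withSlash);  // bool(true)
echo $withHash, "\n";                // <p>text</p>
```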

The backwards slash (\) is the escape character (as it is in the rest of PHP). This will tell the pattern to treat characters as literals which would normally have a special meaning. I had to escape the single quotes so that PHP wouldn't think I ended the pattern there.

[] this is a "character class definition". yadda yadda... basically, you're telling it to look specifically for one of the characters within there. I had ["\'], this means look for a double quote or single quote.

Backreferences (\\1, \\2, etc.) mean look for whatever was matched by the Xth set of parentheses. The first set of parentheses was the pattern for the first quote. So \\1 would match the first quote that started the string.

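[^] the circumflex (^) here is an inversion operator. But it's only an inversion operator when it is the first character in square brackets. I had [^\\1], which I meant as "anything but the quote that started the string". Strictly speaking, PCRE doesn't treat \\1 as a backreference inside square brackets (there it is just the character with code 1), so the class really matches nearly anything; it's the closing \\1 together with the U modifier that makes the match stop at the right quote.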

The plus sign (+) is a "quantifier", which means look for 1 or more of the previous. I had [^\\1]+, so I was really saying, one or more characters that don't match...

\\1 is looking for a quote to end the string. And it must match the quote that started the string.
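
A hedged sketch of how \1 behaves in the two positions. Outside square brackets it is a backreference. Inside them, PCRE reads it as the character with code 1, so [^\1] does not actually exclude the opening quote. The `$strict` pattern at the end is one possible stricter alternative and is an assumption of this sketch, not something from the thread.

```php
<?php
// Outside square brackets, \1 refers back to whatever group 1 matched.
$paired = '#^(["\'])[a-z]+\\1$#';
var_dump(preg_match($paired, '"abc"'));    // int(1): same quote on both ends
var_dump(preg_match($paired, '"abc\''));   // int(0): closing quote differs

// Inside square brackets, \1 is the character with code 1, so the class
// still accepts a quote identical to the opening one.
var_dump(preg_match('#^(["\'])[^\\1]$#', '""'));   // int(1)

// A stricter alternative (assumption): a lookahead that refuses the
// opening quote at every position inside the value.
$strict = '#<a name=(["\'])(?:(?!\\1).)+\\1></a>#i';
echo preg_replace($strict, '', "x<a name=\"testing 'quotes'\"></a>y"), "\n";  // xy
```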

The U modifier tells it to be "un-greedy". This means, match the closest rather than the furthest.

The i modifier tells it to do all this case-insensitive.
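
A quick sketch of what the two modifiers change, using a made-up string with two anchors and some text between them. Without U, the .+ grabs as much as it can and the text in the middle disappears along with both anchors.

```php
<?php
// Hypothetical sample input.
$html = '<a name="one"></a>keep me<a name="two"></a>';

// Greedy: .+ runs to the LAST "></a> in the string.
$greedy   = preg_replace('#<a name=".+"></a>#i', '', $html);
// Ungreedy (U): .+ stops at the FIRST "></a> it can.
$ungreedy = preg_replace('#<a name=".+"></a>#Ui', '', $html);

var_dump($greedy);     // string(0) ""
var_dump($ungreedy);   // string(7) "keep me"

// The i modifier: case is ignored, so uppercase tags match too.
var_dump(preg_match('#<a name=#i', '<A NAME='));   // int(1)
```

Writing .+? instead of adding U should have the same effect on that one quantifier.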

Did that help at all, or did I just make it worse?

Mark Hensler

Suzanne posted this at 00:50 — 12th February 2002.

She has: 5,507 posts

Joined: Feb 2000

Can I say both? lol...

No, actually, that's good. I can do all sorts of programming bits with exact matches, and can happily do while and if and other funny loops, but pattern matching makes my head spin.

I'm going to nail this, though, dammit.

Off to read the link, as well. Why couldn't I find that? Probably because I was looking under each function hoping to find it, heh.

Thanks, Mark,

Suzanne

Mark Hensler posted this at 01:08 — 12th February 2002.

He has: 4,048 posts

Joined: Aug 2000

ok.. next up to bat, anchors without quotes. hmm..

<?php

$text=<<<myHTML
blah
<a name=testing></a>
blah
<a name=bob's></a>
blah
<a name="testing 'quotes'"></a>

myHTML;

echo "<html><body><pre>\n";

$text = preg_replace('#<a name=(["\'])?(?(1)[^\\1]+\\1|[a-zA-Z0-9\-]+)></a>#Ui', '', $text); 

echo $text;


echo "</pre></body></html>";

?>

test:

<html><body><pre>
blah

blah
<a name=bob's></a>
blah

</pre></body></html>

Look right to you? I don't know the naming rules, so I just guessed...

So this should replace the code I already gave you:

$text = file($userfile);
$text = implode("", $text);
$text = preg_replace('#<a name=(["\'])?(?(1)[^\\1]+\\1|[a-zA-Z0-9\-]+)></a>#Ui', '', $text);

I'd rather not try to explain that one. In short, it checks for a quote to start the string. If there is a quote, it looks for one to end the string. If there is no quote, it looks for a string with only your alphanumerics and hyphens.
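
For the question mark parts, here is the same pattern spread over several lines as a sketch. It uses the x modifier so comments can sit inside the pattern, and ~ as the delimiter, because with x a # starts a comment. Both of those are choices made for this illustration, not part of the original one-liner. The (?(1)...|...) construct is a conditional: if group 1 matched, use the first branch, otherwise the second.

```php
<?php
$pattern = '~
    <a\ name=          # literal text; the space is escaped because of x
    (["\'])?           # group 1: optional opening quote
    (?(1)              # if group 1 took part in the match
        [^\\1]+ \\1    #   then: one or more characters, then the same quote
      |                #   else:
        [a-zA-Z0-9\-]+ #   a bare name of letters, digits and hyphens
    )
    ></a>
~Uix';

// Same test data as in the post above.
$text = "blah\n<a name=testing></a>\nblah\n<a name=bob's></a>\nblah\n"
      . "<a name=\"testing 'quotes'\"></a>\n";

$result = preg_replace($pattern, '', $text);
echo $result;
```

As in the test output above, the unquoted and quoted anchors are removed and the bob's one stays, because the apostrophe is not in the allowed set for unquoted names.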

Mark Hensler

Suzanne posted this at 02:46 — 12th February 2002.

She has: 5,507 posts

Joined: Feb 2000

I totally understand everything except the question mark parts, but I'll figure it out. The important thing is it works!

I have no idea why clients insist on sending Word files.

I would just copy and paste the text, except there are 10 files and each one is over 14 pages and I'm just not that much of a masochist!

Thank you very much Mark, I will try to do you proud by figuring out more on my own.

Suzanne