regular expression help

They have: 46 posts

Joined: May 2002

Hi,

I need a regular expression or any code (for a Perl script) that will return up to the first three letters of each word in a string.

For example, if:
$mystring = "George Washington is great";

Then I need to ultimately come up with: "GeoWasisgre" - which is the first three letters of each word in the string (only two letters for the word "is").

I plan to use the abbreviated string as a filename - one that's shorter than the full description of the file's contents but which is still barely readable.

Any ideas? thanks!

They have: 46 posts

Joined: May 2002

I think I figured out something that works...

my $part2 = "";
my @groupwords = split(/\b([A-Za-z]+)\b/, $group_name);
foreach $word (@groupwords) {
$word =~ s/[^A-Za-z]//g;
$part2 .= substr($word, 0, 3);
} # end for

They have: 461 posts

Joined: Jul 2003

ummm the perl shorts might be useful....

\w=[A-Za-z0-9_]
\W=^\w
and for the split, if you just want a-z, it's easier to go split(/\b([a-z]+)\b/ig, $group_name)

rest of the shorts....
\d=[0-9]
\D= not \d

\s=[\f\n\t\r ]
\S= not \s

* = 0 or more instances of what was just in front of it
? = 0 or 1 occurances
+ = 1 or more occuances
{#} = exactly # occurances
{#,#} = min of the first # occurances, max of the second #

i after the the last section delimiter used makes it case insensitive
g make it global (all instead of the first one)

POSIX. because a stable os that doesn't have memory leaks and isn't buggy is always good.

Mark Hensler's picture

He has: 4,048 posts

Joined: Aug 2000

tested on "This is perl, v5.8.0 built for i386-linux-thread-multi"
with both versions: alpha only (A-Za-z) and word characters (\w)

#!/usr/bin/perl

$text = "George Washington is great";
print "$text\n";

$i=1;
#@words = split(/[^a-z](?:([a-z]{1,3})[a-z]*?)(?:[^a-z]|$)/gi, $text);
@words = split(/\W(\w{1,3})\w*?(?:\W|$)/gi, $text);

foreach (@words) {
  print $i++ . ": ";
  print "$_\n";
}
'

Mark Hensler
If there is no answer on Google, then there is no question.

Mark Hensler's picture

He has: 4,048 posts

Joined: Aug 2000

#!/usr/bin/perl

$text = "George Washington is great";
print "string  : $text\n";

$i=1;
@words = split(/[^a-z]?(?:([a-z]{1,3})[a-z]*?)(?:[^a-z]|$)/gi, $text);
#@words = split(/\W?(\w{1,3})\w*?(?:\W|$)/gi, $text);

#foreach (@words) {
#  print $i++ . ": ";
#  print "$_\n";
#}

$filename = join('', @words);
$filename =~ tr/A-Z/a-z/;

print "filename: $filename\n";
'

Mark Hensler
If there is no answer on Google, then there is no question.

They have: 601 posts

Joined: Nov 2001

Hi

I put this question to a Perl-guru friend in an IRC channel I frequent and this was his blazingly fast response.

He's given me permission to reproduce the conversation here.

14:08 <wil> Prospero: golf this:
14:09 <wil> <a href="http://www.webmaster-forums.net/showthread.php?s=&threadid=22165" class="bb-url">http://www.webmaster-forums.net/showthread.php?s=&threadid=22165</a>
14:12 <Prospero> Well, right off:  perl -le '$_ = "George Washington is great"; print map /(.{1,3})/,
                 split'
'

Hope this helps.

- wil

They have: 46 posts

Joined: May 2002

one word: WOW!

thanks...

They have: 601 posts

Joined: Nov 2001

And a little later on ...

14:49 <Prospero> Oh, this is shorter, by the way:
14:49 <Prospero> perl -le '$_ = "George Washington is great"; print /(\w{1,3})\w+/g'
14:50 <Prospero> No, that's wrong.
14:50 <Prospero> perl -le '$_ = "George Washington is great"; print /(\w{1,3})\w*/g'
14:50 <wil> :)
'

- wil

Mark Hensler's picture

He has: 4,048 posts

Joined: Aug 2000

Do you not need to show boundries? I thought that by doing "/(\w{1,3})\w*/" would match "Geo", "eor", "org", etc.

They have: 5 posts

Joined: Aug 2003

Quote: Originally posted by Mark Hensler
Do you not need to show boundries? I thought that by doing "/(\w{1,3})\w*/" would match "Geo", "eor", "org", etc.

Well, I figured I'd better register... no sense holding a conversation through another person. Smiling

The thing is, the parens dictate what portion of the match will be returned. That regex actually matches whole words, by matching up to three 'word' (alphanumeric, as well as the underscore) characters, and then matching any subsequent contiguous word characters. This naturally stops at word boundaries, since the pattern won't match whitespace. The trick is, he parens only 'capture' the \w{1,3} part. The g switch (as you probably realize) means that it will match multiple instances... so you don't just get 'Geo'.

The reason you don't wind up with 'Geo', 'rge', and so on is the \w*. Because the pattern matches 'George', and only returns 'Geo', it resumes looking for its next match after 'George'. It looks at the space, sees that it doesn't match, and then it finds 'Washington', which does.

Anyway, one point to watch: you're returning a list of strings to &print, which is a list-context subroutine. But if you returned in scalar context, you wouldn't get the list, but the length of the list. So either save the results in an array, or, if you want them in a scalar variable, be sure to throw a join in there, or something similar:

perl -le '$_ = "George Washington is great"; $n = join "", /(\w{1,3})\w*/g; print $n'

Same applies for the split solution.

Suzanne's picture

She has: 5,507 posts

Joined: Feb 2000

welcome Prospero! And thanks for a super explanation. Smiling

Mark Hensler's picture

He has: 4,048 posts

Joined: Aug 2000

Yes, welcome!

Alright, I tested Prospero's code, and I see that it works.

I understand why \w* matches up to the end of the word string. But why does \w only match the begining of the word string. Seems like you'd need \W\w to match the begining of a word string.

I guess what's confusing me is that I've recently read about "Once-only subpatterns" -- or "(?>" -- and that's lingering in the back of my mind.

Mark Hensler
If there is no answer on Google, then there is no question.

They have: 5 posts

Joined: Aug 2003

Quote: Originally posted by Mark Hensler
Yes, welcome!

Alright, I tested Prospero's code, and I see that it works.

I understand why \w* matches up to the end of the word string. But why does \w only match the begining of the word string. Seems like you'd need \W\w to match the begining of a word string.

I guess what's confusing me is that I've recently read about "Once-only subpatterns" -- or "(?>" -- and that's lingering in the back of my mind.

Nothing quite so fancy as that. Smiling I'll try to explain it in a few different ways, and hope one will make sense.

Remember, to be satisfied, the string needs to satisfy the whole pattern. The pattern says, "Find instances where between one and three 'word characters' are immediately followed by zero or more word characters, and capture the one-to-three part."

So the pattern actually matches the whole word each time. If it finds between 1 and 3 word consecutive word characters, it will always match... because it can't fail to match the zero-or-more part. That'll match on nothing at all, if it has to, though it will try to match as many characters as it can. Regexes are greedy, that way. Hence it's a lot like saying \w+ That's exactly what we'd say, in fact, if we didn't need to capture the first several letters.

So, say you have a sequence like ' a '. The pattern first looks at the leading space. It realizes that no matter what comes next, a space cannot be the beginning of what it's looking for. So it ignores the space. Next step: look at the 'a'. Is 'a' a possible match? Yep. It matches \w, and also \w{1,3}.

But, it's greedy. It doesn't want 1, when it might be able to get 3. So it looks at the next character: another space. Drat. It's not getting any more than one... but that's enough. It satisfies \w{1,3}.

As it turns out, it also satisfies \w{1,3}\w*. It found one word character, followed immediately by zero word characters. So we have a valid match. Now, because the pattern is actually (\w{1,3})\w*, it 'captures' the \w{1,3} part, and puts it in a buffer.

(Incidentally, this is not a perfectly accurate description of what happens, but it should give you a pretty good model with which to work.)

Now, what if our string had been ' a priori '? Well, the matching for ' a ' went pretty much as described. The regex engine tosses out the second space, of course, because it has no use for it. Then it keeps going, because we used the /g switch, so it's not done yet. It comes upon a 'p'. Is that a possible match? Yep. And if the regex weren't greedy, it'd be satisfied already; 'p' matches (\w{1,3})\w* just as 'a' did.

But its job is to make the match as long as is possible. And good thing, too, or patterns like .* which would match an infinite number of times, and we'd never get anywhere. It's zero or more of anything, after all, and if it keeps stopping at zero, it's no fun.

So, 'p' is a satisfactory beginning. Next it looks and sees 'r', which matches \w. 'pr', too, matches \w{1,3} but isn't the maximum (3), so onward! Now it sees 'i'. 'pri' not only matches \w, it's the longest possible match for \w{1,3}, so there's no sense looking at the next one to see what it is, at this point. That part of the pattern matched.

The pattern, however, calls for \w{1,3} followed immediately by \w*. Well, the nice thing about * is, it's always a match. Whatever it is, you have at least zero of it. But again, * is greedy. So we look at the next character. Turns out it's 'o', and 'o' matches \w. But 'o' isn't the longest possible match... with *, there is no longest possible match. So we keep going. Next comes 'r', which matches \w, but 'or' again isn't the best \w* could hope for, so onward; 'i' matches \w, and we have 'ori'. Next comes a space. Oops. That doesn't match \w, so now our search for that pattern is finished. Stuff \w{1,3}, which is 'pri', in a buffer, and keep going.

What comes after space? The end of the string. So, no more matches, and we're done. Return the buffers, which contain 'a' and 'pri', and go back to our regularly scheduled programming.

Again, this description is subtly wrong, particularly the order in which things happen. But don't worry about that. The order doesn't interest you, just the results. If you ever get into writing parsers, it'll be more interesting to sweat the details.

Now, you were wondering why that pattern didn't match "Geo", "eor", "org", and so on. Think about this: what if our pattern had been .* and our string 'foobar'. Would you expect that to match 'foobar', 'oobar', 'obar', 'bar', 'ar', and 'r'? I wouldn't. The matching engine doesn't find 'f' 'o' 'o' 'b' 'a' 'r', see that it matches .*, and then head back to the first 'o', and try again.

Well, there. Hopefully you're not more confused than when I started. Smiling

Mark Hensler's picture

He has: 4,048 posts

Joined: Aug 2000

So would this be correct...

Because the pattern matched. The search was continued at the end of the last successfull match within the string.

Had the match failed, the search would have resumed at the letter after the beging of the last failed match within the string.

"3rd place"
^
first pass starts here and fails, as "3" is not \w
------------------------------------------------------------------
"3rd place"
  ^
second pass starts here and suceeds, as "rd" matches \w{1,3}
------------------------------------------------------------------
"3rd place"
    ^
third pass starts here and fails, as " " is not \w
------------------------------------------------------------------
"3rd place"
     ^
fourth pass starts here and suceeds, as "place" matches \w{1,3}\w*
'

Mark Hensler
If there is no answer on Google, then there is no question.

They have: 5 posts

Joined: Aug 2003

Quote: Originally posted by Mark Hensler
So would this be correct...

Because the pattern matched. The search was continued at the end of the last successfull match within the string.

Had the match failed, the search would have resumed at the letter after the beging of the last failed match within the string.

"3rd place"
^
first pass starts here and fails, as "3" is not \w
------------------------------------------------------------------
"3rd place"
  ^
second pass starts here and suceeds, as "rd" matches \w{1,3}
------------------------------------------------------------------
"3rd place"
    ^
third pass starts here and fails, as " " is not \w
------------------------------------------------------------------
"3rd place"
     ^
fourth pass starts here and suceeds, as "place" matches \w{1,3}\w*
'

That's pretty close. Actually, though, there's only one pass. The engine simply goes along one character at a time, moving to different states as it goes. When it gets to the end, it's done.

Let's say our pattern was /ri\w/g, and our string is "3rd prize".

"3rd prize"
^
Not a possible match. Discard.

"3rd prize"
  ^
A possible match. Keep the 'r', move on.

"3rd prize"
   ^
No longer a possible match. Discard the 'r', move on.

"3rd prize"
    ^
Not a possible match. Discard.

"3rd prize"
     ^
Not a possible match. Discard.

"3rd prize"
      ^
A possible match. Keep the 'r', move on.

"3rd prize"
       ^
A possible match. Keep the 'ri', move on.

"3rd prize"
        ^
A match. Keep the 'riz', move on, since /g is in effect.

"3rd prize"
         ^
Not a possible match. Discard.
'

But you've essentially got the idea, I think.

Another, less mechanical way to think about it is this: each character in the string is only going to be part of a single pattern match. Hence, if the 'r' in 'George' was part of the character sequence matched by (\w{1,3})\w* (which it was, since the \w* is greedy), it won't be used in any future matches. That's why I put \w* in there: to prevent the rest of the word from matching the part of the regex I actually care about, (\w{1,3})

They have: 5 posts

Joined: Aug 2003

You know, I'm feeling more and more squeamish about this explanation... so I wanted to say again, don't take the details here too seriously. Particularly the mechanical details, which, I hope, will get across a sense of how this sort of thing appears to work.

The model we're using to understand things actually describes what are called DFA regular expressions. It's a simpler model, which makes it nice for explanations like these. The problem is, Perl doesn't actually use DFA, but the other kind, NFA regular expressions. And technically, they're modified NFA, so they're not really 'regular' expressions at all.

You can probably see why I was loathe to go into this. Smiling Oversimplification is sometimes useful. However I'd hate to feel responsible for spreading misinformation, so I'm adding a caveat. This isn't a story you want to rely upon too closely.

And don't worry too much about this stuff, either. It's not that important for 99% of what you need in Perl.

Mark Hensler's picture

He has: 4,048 posts

Joined: Aug 2000

The mechanics happens to be exactly what I'm interested in.

I've taken several programming courses (in college), but they don't teach much farther than the language's syntax and grammar. Assembly (i386+) came the closest to "How's that work?", but still left me wondering about higher level languages. Regular Expressions are a whole other animal, and I'm quite interested to know what makes them tick.

Do you have any online resources you could refer me to?

Mark Hensler
If there is no answer on Google, then there is no question.

They have: 5 posts

Joined: Aug 2003

Quote: Originally posted by Mark Hensler
The mechanics happens to be exactly what I'm interested in.

I've taken several programming courses (in college), but they don't teach much farther than the language's syntax and grammar. Assembly (i386+) came the closest to "How's that work?", but still left me wondering about higher level languages. Regular Expressions are a whole other animal, and I'm quite interested to know what makes them tick.

Do you have any online resources you could refer me to?

I don't, offhand, know of online resources which give a good introduction to the mechanics. The information is certianly out there, but it's usually kind of raw. If I run across something, I'll let you know.

On the other hand, I can easily refer you to a printed resource: chapter 4 of 'Mastering Regular Expressions', from O'Reilly & Associates. Great introduction.

Mark Hensler's picture

He has: 4,048 posts

Joined: Aug 2000

Alright, thanks for the help!

They have: 601 posts

Joined: Nov 2001

Or try it free online from Safari for 30 days (the online Oreilly bookshop).

Want to join the discussion? Create an account or log in if you already have one. Joining is fast, free and painless! We’ll even whisk you back here when you’ve finished.