...making Linux just a little more fun!
Deividson Okopnik [deivid.okop at gmail.com]
Quick regular expression question: I have a string, and I want to return only what's inside the quotes in that string. For example, if the string is -> "Deividson" Okopnik <-, I want only -> "Deividson" <-. It's guaranteed that there will be only a single pair of double quotes in the string, but the length of the text inside them is not constant.
I'm using PHP, btw.
Thanks
DeiviD
Ben Okopnik [ben at linuxgazette.net]
On Thu, Jul 17, 2008 at 11:50:11PM -0300, Deividson Okopnik wrote:
> Quick regular expression question: I have a string, and I want to
> return only what's inside the quotes in that string. For example, if
> the string is -> "Deividson" Okopnik <-, I want only -> "Deividson"
> <-. It's guaranteed that there will be only a single pair of double
> quotes in the string, but the length of the text inside them is not
> constant.
Given that there's only one pair of double quotes, that's pretty easy. Assuming that you're using PHP's "preg_replace" function, and that your content is in a variable called $name:
echo preg_replace('/"(.*)"/', '$1', $name);
If there were more than one set of double quotes, and you wanted to make sure that you only got the content of the first one, you'd need to use a "balanced" capture, in which the captured part is not allowed to contain the closing delimiter. This is one of those classic regex methods that comes up all the time, and is well worth knowing.
echo preg_replace('/"([^"]+)"/', '$1', $name);
In Perl, you can comment regular expressions by using the '/x' option. I'll do that so I can explain what's going on:
/
    "           # Match the opening double quote
    (           # Begin capturing the content
        [^"]+   # One or more characters which are NOT double quotes
    )           # End capture (content will be in $1)
    "           # Closing double quote
/x;
This is very common in processing HTML. Capturing tag content, for example, looks like this:
/<([^>]+)>/
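To make the difference between the two patterns concrete, here is a minimal PHP sketch. The sample string is made up for illustration (it is not from the thread), and preg_match is used instead of preg_replace so that only the captured part is shown.

```php
<?php
// Hypothetical sample input with two quoted sections (not from the thread).
$line = 'say "first" and then "second" please';

// Greedy: .* runs on to the LAST double quote in the line.
preg_match('/"(.*)"/', $line, $greedy);

// Negated class: [^"]+ cannot cross a quote, so the capture ends
// at the first closing quote.
preg_match('/"([^"]+)"/', $line, $bounded);

echo $greedy[1], "\n";   // first" and then "second
echo $bounded[1], "\n";  // first
```

With exactly one pair of quotes in the input, both patterns capture the same text, which is why either one answers the original question.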
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Ben Okopnik [ben at linuxgazette.net]
On Thu, Jul 17, 2008 at 11:26:18PM -0400, Benjamin Okopnik wrote:
> echo preg_replace('/"(.*)"/', '$1', $name);
Whoops - I just realized that I forgot to throw away the rest of the
line (for some reason, I thought I was just extracting the matched
part.) I always knew that doing PHP would rot my brain sooner or later.
echo preg_replace('/.*"(.*)".*/', '$1', $name);
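The effect of the correction can be checked with a short sketch, assuming $name holds the example string from the original question. It also shows one thing worth keeping in mind: when the pattern does not match at all, preg_replace returns the subject unchanged.

```php
<?php
// The example string from the original question.
$name = '"Deividson" Okopnik';

// First version: only the quoted part is rewritten, the rest stays.
$first = preg_replace('/"(.*)"/', '$1', $name);

// Corrected version: the whole line is matched and replaced by the capture.
$fixed = preg_replace('/.*"(.*)".*/', '$1', $name);

// No quotes in the input: nothing matches, so the input comes back as-is.
$untouched = preg_replace('/.*"(.*)".*/', '$1', 'no quotes here');

echo $first, "\n";      // Deividson Okopnik
echo $fixed, "\n";      // Deividson
echo $untouched, "\n";  // no quotes here
```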
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Jim Jackson [jj at franjam.org.uk]
On Thu, 17 Jul 2008, Ben Okopnik wrote:
> echo preg_replace('/"(.*)"/', '$1', $name);
>
> echo preg_replace('/"([^"]+)"/', '$1', $name);
Any reason for the use of '+' instead of '*' in the second example? It could be there is a null string enclosed in quotes, which the first one would get and the second would miss.
Ben Okopnik [ben at linuxgazette.net]
On Fri, Jul 18, 2008 at 08:46:20AM +0100, Jim Jackson wrote:
> On Thu, 17 Jul 2008, Ben Okopnik wrote:
>
> > echo preg_replace('/"(.*)"/', '$1', $name);
> >
> > echo preg_replace('/"([^"]+)"/', '$1', $name);
>
> Any reason for the use of '+' instead of '*' in the second example? It
> could be there is a null string enclosed in quotes, which the first one
> would get and the second would miss.
I've been working with regexes for many years now, and have never seen a practical reason for matching a null string. Do you know of a situation in which having a null string is to be preferred over 'undef' (the result of checking $1 when no capture has occurred)?
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Jim Jackson [jj at franjam.org.uk]
On Fri, 18 Jul 2008, Ben Okopnik wrote:
> On Fri, Jul 18, 2008 at 08:46:20AM +0100, Jim Jackson wrote:
>> On Thu, 17 Jul 2008, Ben Okopnik wrote:
>>
>>> echo preg_replace('/"(.*)"/', '$1', $name);
>>>
>>> echo preg_replace('/"([^"]+)"/', '$1', $name);
>>
>> Any reason for the use of '+' instead of '*' in the second example? It
>> could be there is a null string enclosed in quotes, which the first one
>> would get and the second would miss.
>
> I've been working with regexes for many years now, and have never seen a
> practical reason for matching a null string. Do you know of a situation
> in which having a null string is to be preferred over 'undef' (the
> result of checking $1 when no capture has occurred)?
Still doesn't answer why you use the zero the first solution, and the one or more match operator '+' in the second example?
Maybe this string is valid input...
An "" example
A zero length string indicates the input was valid, and undef would indicate the input line was not of the correct format. A zero length string is often a perfectly ok value, and is different from nothing found.
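In PHP terms, the distinction between "empty quotes found" and "no quotes found" could be sketched as below. The helper name quoted_or_null is hypothetical, and null plays the role that undef plays in Perl; this is an illustration of the point being argued, not code from the thread.

```php
<?php
// Hypothetical helper: returns the quoted text, or null when the line
// contains no pair of double quotes at all.
function quoted_or_null(string $line): ?string
{
    // preg_match returns 1 on a match and 0 when nothing matches.
    return preg_match('/"([^"]*)"/', $line, $m) === 1 ? $m[1] : null;
}

$empty   = quoted_or_null('An "" example');  // empty string: quotes present
$missing = quoted_or_null('An example');     // null: no quotes present

// With + instead of *, the empty pair counts as no match at all.
$plus = preg_match('/"([^"]+)"/', 'An "" example');

var_dump($empty);    // string(0) ""
var_dump($missing);  // NULL
var_dump($plus);     // int(0)
```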
Jim Jackson [jj at franjam.org.uk]
On Fri, 18 Jul 2008, Jim Jackson wrote:
oops, I pressed send in too much haste...
> Still doesn't answer why you use the zero or more match operator in the
> first solution, and the one or more match operator '+' in the second
> example?
>
> Maybe this string is valid input...
>
> An "" example
>
> A zero length string indicates the input was valid, and undef would
> indicate the input line was not of the correct format. A zero length string
> is often a perfectly ok value, and is different from nothing found.
just curious.
Deividson Okopnik [deivid.okop at gmail.com]
Actually, I just noticed there are two sets of quotes in the string (the RSS feed returns a link, <a href="blablabla">). I'm using preg_replace('/.*"([^"]+)".*/', '$1', $verse_body), but it's returning the content of the second pair of quotes...
Ben Okopnik [ben at linuxgazette.net]
On Fri, Jul 18, 2008 at 12:37:59PM -0300, Deividson Okopnik wrote:
> Actually, I just noticed there are two sets of quotes in the string
> (the RSS feed returns a link, <a href="blablabla">). I'm using
> preg_replace('/.*"([^"]+)".*/', '$1', $verse_body), but it's returning
> the content of the second pair of quotes...
Yep - since the initial '.*' is (correctly) greedy and consumes everything up to the last pair of quotes. If you always want the first pair, you could specify that in a couple of different ways in PHP:
// Method #1
preg_match('/"([^"]+)"/', $verse_body, $found);
echo $found[1];

// Method #2
echo preg_replace('/^[^"]+"([^"]+)".*/', '$1', $verse_body);
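Both methods can be tried on a string shaped like the one described, with quoted text followed by a link. The input below is a made-up stand-in, not the real feed content. One detail to watch: as posted, Method #2 begins with `[^"]+`, which needs at least one character before the first quote, so the sketch uses `[^"]*` to cope with a string that starts with the quote.

```php
<?php
// Hypothetical input: quoted text first, then a link with a quoted href.
$verse_body = '"some verse text"<br> <a href="https://example.com">link</a>';

// Method #1: preg_match reports the first place where the pattern fits.
preg_match('/"([^"]+)"/', $verse_body, $found);

// Method #2 variant: [^"]* (zero or more) instead of [^"]+, so a line
// that begins with a double quote still matches.
$replaced = preg_replace('/^[^"]*"([^"]+)".*/', '$1', $verse_body);

echo $found[1], "\n";  // some verse text
echo $replaced, "\n";  // some verse text
```

Method #1 has the advantage that $found stays empty when nothing matches, whereas preg_replace hands back the whole input in that case.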
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Ben Okopnik [ben at linuxgazette.net]
On Fri, Jul 18, 2008 at 02:41:46PM +0100, Jim Jackson wrote:
> On Fri, 18 Jul 2008, Ben Okopnik wrote:
>
> Still doesn't answer why you use the zero the first solution, and the
> one or more match operator '+' in the second example?
This is like asking why I would use a cup to drink my coffee one morning and a mug the next. The answer is, there's no real reason - since it doesn't matter one way or the other. If there's any reason at all, it may well be that I didn't do the dishes the night before and that the mug happened to be clean - i.e., the reason doesn't have anything to do with the thing you're asking about.
There are plenty of situations where '*' vs. '+' would matter, of course. This just doesn't happen to be one of them.
> Maybe this string is valid input...
>
> An "" example
>
> A zero length string indicates the input was valid, and undef would
> indicate the input line was not of the correct format.
Really? That's a new one on me. In fact, I can demonstrate that this is incorrect in both directions.
ben@Tyr:~$ perl -wle'$a=undef; $b=qq["$a"]; $b=~/"([^"]*)"/; print $1'
Use of uninitialized value in concatenation (.) or string at -e line 1.
Even though the format was indeed correct - i.e., there were two double quotes in the string - the capture returned 'undef'.
ben@Tyr:~$ perl -wle'$b=qq["""]; $b=~/"([^"]*)"/; print "-$1-"'
--
Even though the format was incorrect - i.e., there were three double quotes in the string - the capture returned an empty string.
> A zero length string is often a perfectly ok value, and is different
> from nothing found.
"undef" is also often a perfectly OK value, although it is indeed different from an empty string.
Jim, I understand that you're wondering about the inconsistency in my two regexes. The inconsistency is indeed there, but - as I've explained above - in the case of the problem as originally defined by Deividson, it really makes zero difference. Your idea about "correct format", though, is a case of making way too much soup out of a single oyster.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Deividson Okopnik [deivid.okop at gmail.com]
huh
Weirdly enough, both ways are still returning the content of the second pair of quotes... With method 1, found[0] is the content of the last pair of quotes (including the quotes), found[1] is the same content without the quotes, and found[2] is empty.
The first quoted text has spaces - can that be a problem?
this is exactly what the server returns me, and it gets stored inside $verse_body: "I will praise you with an upright heart as I learn your righteous laws."<br><br> Brought to you by <a href="https://www.biblegateway.com">BibleGateway.com</a>. Copyright (C) NIV. All Rights Reserved.
> Yep - since the initial '.*' is (correctly) greedy and consumes
> everything up to the last pair of quotes. If you always want the first
> pair, you could specify that in a couple of different ways in PHP:
>
> // Method #1
> preg_match('/"([^"]+)"/', $verse_body, $found);
> echo $found[1];
>
> // Method #2
> echo preg_replace('/^[^"]+"([^"]+)".*/', '$1', $verse_body);
Ben Okopnik [ben at linuxgazette.net]
On Fri, Jul 18, 2008 at 06:06:18PM -0300, Deividson Okopnik wrote:
> huh
>
> Weirdly enough, both ways are still returning the content of the
> second pair of quotes... With method 1, found[0] is the content of the
> last pair of quotes (including the quotes), found[1] is the same
> content without the quotes, and found[2] is empty.
>
> The first quoted text has spaces - can that be a problem?
>
> this is exactly what the server returns me, and it gets stored inside
> $verse_body:
> "I will praise you with an upright heart as I learn your righteous
> laws."<br><br> Brought to you by <a
> href="https://www.biblegateway.com">BibleGateway.com</a>. Copyright (C)
> NIV. All Rights Reserved.

If you show us the wrong data, you're likely to get the wrong answer.
I've just taken a look at the site, and the line you're trying to process does not contain what you think it does. "View source" shows the following:
“I will praise you with an upright heart as I learn your righteous laws.”- [...]
This will, of course, not work with the regex. You'll need to do some processing first - PHP has functions for converting HTML to text - and then do the extraction. HTML can be pretty tricky that way.
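One way that preprocessing might look is sketched below. It assumes the feed delivers the quotation marks as the HTML entities &ldquo; and &rdquo; (the thread does not show the raw bytes, so this is a guess), and the sample text is shortened and hypothetical. html_entity_decode turns the entities into real characters, after which the curly quotes can be matched by their code points.

```php
<?php
// Assumed raw feed text: the quotes arrive as HTML entities, not as ASCII ".
$raw = '&ldquo;I will praise you.&rdquo;<br> <a href="https://example.com">x</a>';

// Convert entities into real characters (UTF-8 curly quotes here).
$text = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// U+201C and U+201D are the left and right curly double quotes;
// the /u modifier makes PCRE treat pattern and subject as UTF-8.
$verse = null;
if (preg_match('/\x{201C}([^\x{201D}]*)\x{201D}/u', $text, $m) === 1) {
    $verse = $m[1];
}

echo $verse, "\n";  // I will praise you.
```

Because the pattern looks only for curly quotes, the straight quotes around the href value no longer interfere.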
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Jim Jackson [jj at franjam.org.uk]
On Fri, 18 Jul 2008, Deividson Okopnik wrote:
> Actually, I just noticed there are two sets of quotes in the string
> (the RSS feed returns a link, <a href="blablabla">). I'm using
> preg_replace('/.*"([^"]+)".*/', '$1', $verse_body), but it's returning
> the content of the second pair of quotes...
It's being greedy, as has already been said. You need to alter the regexp to something like....
'/[^"]*"([^"]+)".*/'
i.e. match any non-" chars and find the first "