Regular Expressions

Deividson Okopnik [deivid.okop at gmail.com]

Thu, 17 Jul 2008 23:50:11 -0300

Quick regular expressions question: I have a string and I want to return only what's inside the quotes inside that string. For example, the string is -> "Deividson" Okopnik <-, and I want only -> "Deividson" <-. It's guaranteed that there will be only a single pair of double quotes inside the string, but the length of the string inside it is not constant.

I'm using PHP, btw.

Thanks

DeiviD

Ben Okopnik [ben at linuxgazette.net]

Thu, 17 Jul 2008 23:26:18 -0400

On Thu, Jul 17, 2008 at 11:50:11PM -0300, Deividson Okopnik wrote:

> Quick regular expressions question: I have a string and I want to
> return only what's inside the quotes inside that string. For example,
> the string is -> "Deividson" Okopnik <-, and I want only ->
> "Deividson" <-. It's guaranteed that there will be only a single pair
> of double quotes inside the string, but the length of the string
> inside it is not constant.

Given that there's only one pair of double quotes, that's pretty easy. Assuming that you're using PHP's "preg_replace" function, and that your content is in a variable called $name:

echo preg_replace('/"(.*)"/', '$1', $name);

If there were more than one set of double quotes, and you wanted to make sure that you only got the content of the first one, you'd need to use a "balanced" capture, i.e., one that excludes the closing delimiter from the captured text. This is one of those classic regex methods that comes up all the time, and is well worth knowing.

echo preg_replace('/"([^"]+)"/', '$1', $name);
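
As a minimal sketch of the difference between the two patterns, assuming a made-up sample string with two quoted parts (the string and the variable names $greedy and $first are illustrative, not from the original question), preg_match can be used to look at each capture on its own:

```php
<?php
// Hypothetical sample with two quoted parts.
$name = '"Deividson" Okopnik "extra"';

// Greedy: .* runs on to the LAST double quote in the string.
preg_match('/"(.*)"/', $name, $greedy);

// Negated class: [^"]+ cannot cross a double quote,
// so the capture ends at the first closing quote.
preg_match('/"([^"]+)"/', $name, $first);

echo $greedy[1], "\n";  // Deividson" Okopnik "extra
echo $first[1], "\n";   // Deividson
```

With only one pair of quotes in the input, both patterns capture the same text.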

In Perl, you can comment regular expressions by using the '/x' option. I'll do that so I can explain what's going on:

/
"		# Match the opening double quote
(		# Begin capturing the content
[^"]+	# One or more characters which are NOT double quotes
)		# End capture (content will be in $1)
"		# Closing double quote
/x;
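
PHP's PCRE functions accept the same 'x' modifier, so a commented pattern can be written there as well. A small sketch, assuming the sample string from the original question; note that the comments must not contain the pattern delimiter (here '/'):

```php
<?php
// Same pattern as above, written with the 'x' (extended) modifier.
// Whitespace outside character classes is ignored; '#' starts a comment.
$pattern = '~
    "          # opening double quote
    ( [^"]+ )  # capture one or more characters that are not double quotes
    "          # closing double quote
~x';

preg_match($pattern, '"Deividson" Okopnik', $m);
echo $m[1], "\n";  // Deividson
```

The '~' delimiter is used here only so that a stray '/' in a comment cannot end the pattern early; that choice is an assumption of this sketch, not part of the original reply.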

This is very common in processing HTML. Capturing everything between a tag's angle brackets (its name and attributes), for example, looks like this:

/<([^>]+)>/

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *

Ben Okopnik [ben at linuxgazette.net]

Thu, 17 Jul 2008 23:43:56 -0400

On Thu, Jul 17, 2008 at 11:26:18PM -0400, Benjamin Okopnik wrote:

> echo preg_replace('/"(.*)"/', '$1', $name);

Whoops - I just realized that I forgot to throw away the rest of the line (for some reason, I thought I was just extracting the matched part.) I always knew that doing PHP would rot my brain sooner or later.

echo preg_replace('/.*"(.*)".*/', '$1', $name);

Jim Jackson [jj at franjam.org.uk]

Fri, 18 Jul 2008 08:46:20 +0100 (BST)

On Thu, 17 Jul 2008, Ben Okopnik wrote:

> echo preg_replace('/"(.*)"/', '$1', $name);
>
> echo preg_replace('/"([^"]+)"/', '$1', $name);

Any reason for the use of '+' instead of '*' in the second example? It could be there is a null string enclosed in quotes, which the first one would get and the second would miss.

Ben Okopnik [ben at linuxgazette.net]

Fri, 18 Jul 2008 08:31:56 -0400

On Fri, Jul 18, 2008 at 08:46:20AM +0100, Jim Jackson wrote:

> On Thu, 17 Jul 2008, Ben Okopnik wrote:
>
> > echo preg_replace('/"(.*)"/', '$1', $name);
> >
> > echo preg_replace('/"([^"]+)"/', '$1', $name);
>
> Any reason for the use of '+' instead of '*' in the second example? It
> could be there is a null string enclosed in quotes, which the first one
> would get and the second would miss.

I've been working with regexes for many years now, and have never seen a practical reason for matching a null string. Do you know of a situation in which having a null string is to be preferred over 'undef' (the result of checking $1 when no capture has occurred)?

Jim Jackson [jj at franjam.org.uk]

Fri, 18 Jul 2008 14:41:46 +0100 (BST)

On Fri, 18 Jul 2008, Ben Okopnik wrote:

> On Fri, Jul 18, 2008 at 08:46:20AM +0100, Jim Jackson wrote:
>> On Thu, 17 Jul 2008, Ben Okopnik wrote:
>>
>>> echo preg_replace('/"(.*)"/', '$1', $name);
>>>
>>> echo preg_replace('/"([^"]+)"/', '$1', $name);
>>
>> Any reason for the use of '+' instead of '*' in the second example? It
>> could be there is a null string enclosed in quotes, which the first one
>> would get and the second would miss.
>
> I've been working with regexes for many years now, and have never seen a
> practical reason for matching a null string. Do you know of a situation
> in which having a null string is to be preferred over 'undef' (the
> result of checking $1 when no capture has occurred)?

Still doesn't answer why you use the zero the first solution, and the one or more match operator '+' in the second example?

Maybe this string is valid input...

An "" example

A zero length string indicates the input was valid, and undef would indicate the input line was not of the correct format. A zero length string is often a perfectly ok value, and is different from nothing found.
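
A small PHP sketch of that distinction, using the sample line above; the variable names are illustrative. With '*' the pattern matches and the capture is an empty string, while with '+' the pattern does not match at all:

```php
<?php
$line = 'An "" example';

// '*' allows an empty capture: the match succeeds and $m[1] is ''.
$star = preg_match('/"([^"]*)"/', $line, $m);

// '+' needs at least one character between the quotes: no match.
$plus = preg_match('/"([^"]+)"/', $line, $n);

var_dump($star);   // int(1)
var_dump($m[1]);   // string(0) ""
var_dump($plus);   // int(0)
var_dump($n);      // empty array: nothing was captured
```

In PHP the two cases can therefore be told apart by the return value of preg_match rather than by inspecting the capture itself.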

Jim Jackson [jj at franjam.org.uk]

Fri, 18 Jul 2008 14:50:39 +0100 (BST)

On Fri, 18 Jul 2008, Jim Jackson wrote:

> On Fri, 18 Jul 2008, Ben Okopnik wrote:
>> On Fri, Jul 18, 2008 at 08:46:20AM +0100, Jim Jackson wrote:
>>> On Thu, 17 Jul 2008, Ben Okopnik wrote:
>>>
>>>> echo preg_replace('/"(.*)"/', '$1', $name);
>>>>
>>>> echo preg_replace('/"([^"]+)"/', '$1', $name);
>>>
>>> Any reason for the use of '+' instead of '*' in the second example? It
>>> could be there is a null string enclosed in quotes, which the first one
>>> would get and the second would miss.
>>
>> I've been working with regexes for many years now, and have never seen a
>> practical reason for matching a null string. Do you know of a situation
>> in which having a null string is to be preferred over 'undef' (the
>> result of checking $1 when no capture has occurred)?
>

oops, I pressed send in too much haste...

> Still doesn't answer why you use the zero the first solution, and the

zero or more match operator in the first solution, and the

> one or more match operator '+' in the second example?
>
> Maybe this string is valid input...
>
>   An "" example
>
> A zero length string indicates the input was valid, and undef would
> indicate the input line was not of the correct format. A zero length string
> is often a perfectly ok value, and is different from nothing found.

just curious.

Deividson Okopnik [deivid.okop at gmail.com]

Fri, 18 Jul 2008 12:37:59 -0300

Actually, I just noticed there are 2 sets of quotes in the string (the RSS returns a link <a href="blablabla">). I'm using preg_replace('/.*"([^"]+)".*/', '$1', $verse_body), but it's returning the content of the second pair of quotes...

Ben Okopnik [ben at linuxgazette.net]

Fri, 18 Jul 2008 15:51:35 -0400

On Fri, Jul 18, 2008 at 12:37:59PM -0300, Deividson Okopnik wrote:

> Actually, I just noticed there are 2 sets of quotes in the string (the
> RSS returns a link <a href="blablabla">). I'm using
> preg_replace('/.*"([^"]+)".*/', '$1', $verse_body), but it's returning
> the content of the second pair of quotes...

Yep - since the initial '.*' is (correctly) greedy and consumes everything up to the last pair of quotes. If you always want the first pair, you could specify that in a couple of different ways in PHP:

// Method #1
preg_match('/"([^"]+)"/', $verse_body, $found);
echo $found[1];
 
// Method #2
echo preg_replace('/^[^"]*"([^"]+)".*/', '$1', $verse_body);
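
Both approaches can be tried on a string with two quoted parts. This is only a sketch: the input below is made up, and the 's' modifier is an addition that is needed only if the input may contain newlines. Using '*' rather than '+' in the leading character class lets the pattern also cope with a string that begins with the quote itself:

```php
<?php
// Hypothetical input: the quoted text comes first, a quoted href follows.
$verse_body = '"Some verse text." Brought to you by '
            . '<a href="https://example.com">Example</a>';

// preg_match reports the leftmost match, i.e. the first pair of quotes.
preg_match('/"([^"]+)"/', $verse_body, $found);
echo $found[1], "\n";   // Some verse text.

// preg_replace variant: skip any leading non-quote characters (possibly
// none), capture the first quoted part, and discard the rest of the string.
$first = preg_replace('/^[^"]*"([^"]+)".*/s', '$1', $verse_body);
echo $first, "\n";      // Some verse text.
```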

Ben Okopnik [ben at linuxgazette.net]

Fri, 18 Jul 2008 16:22:21 -0400

On Fri, Jul 18, 2008 at 02:41:46PM +0100, Jim Jackson wrote:

> On Fri, 18 Jul 2008, Ben Okopnik wrote:
> 
> Still doesn't answer why you use the zero the first solution, and the 
> one or more match operator '+' in the second example?

This is like asking why I would use a cup to drink my coffee one morning and a mug the next. The answer is, there's no real reason - since it doesn't matter one way or the other. If there's any reason at all, it may well be that I didn't do the dishes the night before and that the mug happened to be clean - i.e., the reason doesn't have anything to do with the thing you're asking about.

There are plenty of situations where '*' vs. '+' would matter, of course. This just doesn't happen to be one of them.

> Maybe this string is valid input...
> 
>    An "" example
> 
> A zero length string indicates the input was valid, and undef would 
> indicate the input line was not of the correct format.

Really? That's a new one on me. In fact, I can demonstrate that this is incorrect in both directions.

ben@Tyr:~$ perl -wle'$a=undef; $b=qq["$a"]; $b=~/"([^"]*)"/; print $1'
Use of uninitialized value in concatenation (.) or string at -e line 1.

Even though the format was indeed correct - i.e., there were two double quotes in the string - the capture returned 'undef'.

ben@Tyr:~$ perl -wle'$b=qq["""]; $b=~/"([^"]*)"/; print "-$1-"'
--

Even though the format was not correct - i.e., there were three double quotes in the string - the capture returned an empty string.

> A zero length string
> is often a perfectly ok value, and is different from nothing found.

"undef" is also often a perfectly OK value, although it is indeed different from an empty string.

Jim, I understand that you're wondering about the inconsistency in my two regexes. The inconsistency is indeed there, but - as I've explained above - in the case of the problem as originally defined by Deividson, it really makes zero difference. Your idea about "correct format", though, is a case of making way too much soup out of a single oyster.

Deividson Okopnik [deivid.okop at gmail.com]

Fri, 18 Jul 2008 18:06:18 -0300

huh

Weirdly enough, both ways are still returning the content of the second pair of quotes... With method 1, $found[0] is the content of the last pair of quotes (including the quotes), $found[1] is the same content without the quotes, and $found[2] is empty.

The first content has spaces - can that be a problem?

This is exactly what the server returns, and it gets stored in $verse_body: "I will praise you with an upright heart as I learn your righteous laws."<br><br> Brought to you by <a href="https://www.biblegateway.com">BibleGateway.com</a>. Copyright (C) NIV. All Rights Reserved.

> Yep - since the initial '.*' is (correctly) greedy and consumes
> everything up to the last pair of quotes. If you always want the first
> pair, you could specify that in a couple of different ways in PHP:
>
> // Method #1
> preg_match('/"([^"]+)"/', $verse_body, $found);
> echo $found[1];
>
> // Method #2
> echo preg_replace('/^[^"]*"([^"]+)".*/', '$1', $verse_body);

Ben Okopnik [ben at linuxgazette.net]

Fri, 18 Jul 2008 17:48:08 -0400

On Fri, Jul 18, 2008 at 06:06:18PM -0300, Deividson Okopnik wrote:

> huh
> 
> Weirdly enough, both ways are still returning the content of the
> second pair of quotes... With method 1, $found[0] is the content of
> the last pair of quotes (including the quotes), $found[1] is the same
> content without the quotes, and $found[2] is empty.
> 
> The first content has spaces - can that be a problem?
> 
> This is exactly what the server returns, and it gets stored in
> $verse_body:
> "I will praise you with an upright heart as I learn your righteous
> laws."<br><br> Brought to you by <a
> href="https://www.biblegateway.com">BibleGateway.com</a>. Copyright (C)
> NIV. All Rights Reserved.

If you show us the wrong data, you're likely to get the wrong answer.

The regex would have worked fine for the string that you initially showed us.

I've just taken a look at the site, and the line you're trying to process does not contain what you think it does. "View source" shows the following:

&ldquo;I will praise you with an upright heart as I learn your righteous laws.&rdquo;- [...]

This will, of course, not work with the regex. You'll need to do some processing first - PHP has functions for converting HTML to text - and then do the extraction. HTML can be pretty tricky that way.
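
One possible way to do that, sketched under assumptions: the fragment below is a made-up stand-in for the real page source, and html_entity_decode is just one of the PHP functions that could be used for the conversion. Note that &ldquo; and &rdquo; decode to the curly quote characters U+201C and U+201D, not to the ASCII double quote, so the pattern has to name them explicitly:

```php
<?php
// Hypothetical page fragment using HTML entities for the curly quotes.
$html = '&ldquo;Some verse text.&rdquo; '
      . '<a href="https://example.com">Example</a>';

// Convert the entities into real UTF-8 characters.
$text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// \x{201C} and \x{201D} are the left and right curly double quotes;
// the 'u' modifier makes PCRE treat pattern and subject as UTF-8.
if (preg_match('/\x{201C}([^\x{201D}]+)\x{201D}/u', $text, $found)) {
    echo $found[1], "\n";   // Some verse text.
}
```

Because the href attribute uses plain ASCII quotes, it no longer competes with the curly-quoted text for the match.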

Jim Jackson [jj at franjam.org.uk]

Mon, 21 Jul 2008 08:57:43 +0100 (BST)

On Fri, 18 Jul 2008, Deividson Okopnik wrote:

> Actually, I just noticed there are 2 sets of quotes in the string (the
> RSS returns a link <a href="blablabla">). I'm using
> preg_replace('/.*"([^"]+)".*/', '$1', $verse_body), but it's returning
> the content of the second pair of quotes...

It's being greedy, as has already been said. You need to alter the regexp to something like....

  '/[^"]*"([^"]+)".*/'

i.e. match any non-" chars and find the first "
