<font face="Verdana,Arial,Helvetica,sans-serif" size="2"><div>Thanks Philippe and Tony for your responses. <br><font face="Default Sans Serif,Verdana,Arial,Helvetica,sans-serif" size="2"><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt"><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt"><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt"><div dir="ltr"><div style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: small;"><br></div><div style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: small;">I have created a PR with the proposed fix - <a href="https://github.com/apple/swift-corelibs-foundation/pull/278" target="_blank">https://github.com/apple/swift-corelibs-foundation/pull/278</a></div><div style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: small;"><br></div><div style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: small;">@Philippe - I have a few questions mentioned inline. </div><div style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: small;"><br></div><div style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: small;">Regards,</div><span style="font-family:georgia,serif;"><span style="font-size: 1.143em;">Pushkar N Kulkarni,</span></span></div>
<div dir="ltr"><span style="font-family:georgia,serif;"><span style="font-size: 1.143em;">IBM Runtimes</span></span></div>
<div dir="ltr"> </div>
<div dir="ltr"><em><span style="font-family:georgia,serif;"><span style="font-size: 0.857em;">"Any sufficiently advanced technology is indistinguishable from magic." - Arthur Clarke</span></span></em></div></div></div></div></font></div><br><br><font face="Default Sans Serif,Verdana,Arial,Helvetica,sans-serif" size="2"><font color="#990099" style=""><font color="#000000"><a href="mailto:-----phausler@apple.com" target="_blank">-----phausler@apple.com</a> wrote: -----</font></font><div class="iNotesHistory" style="padding-left: 5px;"><div style="padding-right: 0px; padding-left: 5px; border-left-style: solid; border-left-color: black; border-left-width: 2px;">To: Pushkar N Kulkarni/India/IBM@IBMIN<br>From: Philippe Hausler <phausler@apple.com><br>Sent by: <a href="mailto:phausler@apple.com" target="_blank">phausler@apple.com</a><br>Date: 03/04/2016 02:10AM<br>Cc: <a href="mailto:swift-corelibs-dev@swift.org" target="_blank">swift-corelibs-dev@swift.org</a><br>Subject: Re: False match reported by NSRegularExpression.firstMatchInString()<br><br><!--Notes ACF
<meta http-equiv="Content-Type" content="text/html charset=utf8">--><div class="" style=""><font color="#000000">So there is something else going on here that might be of interest: the objective-c version has two distinct build paths, one goes down a very similar path as the code snippet from CFRegularExpression and the other takes a “shorter” path to uregex_setUText so it might be that this code path has some dust. From what I can tell the major differential between those two code-paths from a portability standpoint is CFAttributedString support (which previously did not exist in linux builds but now does). I think that if you can make a provably correct unit test to show this range differential it would be interesting to test against the objective-c version as well to verify the behavior. In the mean time I can investigate on what other portions we might be missing. It is worth noting CFRegularExpression has never existed before swift-corelibs-foundation so there may very well be mistakes in my re-write.</font></div><div class=""><font color="#0000cc">>> Could you please describe the two paths in greater detail ?</font></div><div class=""><font color="#0000cc">>> Could you please elaborate more on what is expected from the unit test? Are you talking about the range difference created by the two paths?</font></div><div class="" style=""><font color="#000000"><br></font></div><div class="" style=""><font color="#000000">For the record, the tests are nearly a direct transliteration of some of the objective-c tests we have; so those probably are “more correct”.</font></div><div class="" style=""><font color="#000000"><br class=""></font></div><div class="" style=""><font color="#000000">A few clarifications inline:</font></div><br class=""><div><blockquote type="cite" class="" style=""><font color="#000000"><div class="">On Mar 3, 2016, at 12:23 PM, Pushkar N Kulkarni <<a href="mailto:pushkar.nk@in.ibm.com" class="">pushkar.nk@in.ibm.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><font face="Verdana,Arial,Helvetica,sans-serif" size="2" class=""><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">Hi Philippe, other interested people:</font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">Here is the detailed description of an apparent bug in NSRegularExpression along with a possible solution. Request you to comment on the solution, in the context of correctness and performance. </font></p><div style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class=""> </font><br class="webkit-block-placeholder"></div><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">This is long - please bear with me! Alternatively, we could discuss this over a pull request if you concur.</font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">---> </font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">The following test case has been borrowed from some of those tests in TestNSRegularExpression that aren’t exercised (test_complexRegularExpression) as of today:</font></p><p style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class="">import Foundation</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier; min-height: 17px;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class="">let searchStr = "This this is the theway."</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class="">let testRegex = try NSRegularExpression(pattern: "\\b(th[a-z]+) \\1\\b", options: [])</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class="">let fm = testRegex.firstMatchInString(searchStr, options: .WithTransparentBounds, range: NSMakeRange(0,20))</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class="">let str = NSString.init(string: searchStr)</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class="">if let match = fm {</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class=""> print("Test failed")</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class=""> print(str.substringWithRange(match.range))</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class="">} else {</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class=""> print("Test passed")</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class="">}</font></p><p style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">The test fails on Linux - a false match is reported. The substring “the the” matches pattern “\b(th[a-z]+) \1\b” which is wrong because the second “the” does not occur on a word boundary. Note that we are using the option: WithTransparentBounds. This means the matcher can look beyond the search range, for word boundaries. </font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">The question is why the word boundary metacharacter at the end of the patter isn’t being honoured. I studied the functions _CFRegularExpressionEnumerateMatchesInString() and prepareRegularExpression() from CFRegularExpression.c and these are my findings that provide an explanation for a possible reason:</font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">1. We first try to get the search text - a UniChar* - using CFStringGetCharactersPtr(). I guess this is done to try improve performance. We copy the entire search text and set regionStart and regionLimit to the search range.</font></p><div class=""><br class=""></div></font></div></font></blockquote><div style=""><font color="#000000"><br class=""></font></div><div style=""><font color="#000000">It is always best practice to expect CFStringGetCharactersPtr to potentially fail. This is where either a) the encoding cannot support returning a direct pointer (e.g. tagged pointer strings) or if somehow the encoding cannot be represented as that storage type.</font></div><br class=""><blockquote type="cite" class="" style=""><font color="#000000"><div class=""><font face="Verdana,Arial,Helvetica,sans-serif" size="2" class=""><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">2. Alternatively, if CFStringGetCharactersPtr() fails, we try to fill the UniChar buffer using CFStringGetCharacters(). Here we try to reduce the size of the search text so that it matches the search range. We use an “enclosingRange” for this. For “transparentBounds” we use the entire text. For nonAnchoringBounds we just take the searchRange plus one character to the left (to match ^) plus one to the right (to match $) : <span style="font-family: Courier;" class="">enclosingRange = range; </span><span style="font-family: Courier;" class=""> </span></font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class="">if (transparentBounds) {</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class=""> enclosingRange = CFRangeMake(0, length);</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class="">} else if (nonAnchoringBounds) {</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class=""> if (enclosingRange.location > 0) {</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class=""> enclosingRange.location--;</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class=""> enclosingRange.length++;</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class=""> }</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class=""> if (enclosingRange.location + enclosingRange.length < length) enclosingRange.length++;</font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class="">}</font></p><p style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">We then set regionStart and regionLimit to the search range.</font></p><p style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">3. Further we set the search text using the ICU function uregex_setText() like this: </font></p><p style="font-size: 14px; line-height: normal; font-family: Courier;" class=""><font color="#454545" class=""> uregex_setText(regex, (const UChar *)stringBuffer, (int32_t)regionLimit, &errorCode);</font></p><div style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><font color="#454545" class=""> </font><br class="webkit-block-placeholder"></div><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">Note that we use “regionLimit” for the “textLength”, which seems questionable. This truncates the search text down to a substring matching the specified search range. So, in the above case where the search text is : "<b class="">This this is the theway.</b>" , using a search range of {0,20} the search text that we actually pass to ICU is “<b class="">This this is the the</b>” which matches “\b(th[a-z]+) \1\b”. Though this may not be a problem in most searches, the <b class="">WithTransparentBounds</b> option fails to take effect. </font></p><p style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><font color="#454545" class=""><br class=""></font></p></font></div></font></blockquote><div style=""><font color="#000000"><br class=""></font></div><div style=""><font color="#000000">I have a feeling that this is potentially an artifact of not using uregex_setUText</font></div><br class=""><blockquote type="cite" class="" style=""><font color="#000000"><div class=""><font face="Verdana,Arial,Helvetica,sans-serif" size="2" class=""><p style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><font color="#454545" class=""><br class=""></font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">Proposed solution</font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">————————</font></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class="">The straightforward solution is to simply set the length of the search text to the actual length in <font class="">prepareRegularExpression()</font> of the search string as follows: </font></p><div style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><font color="#454545" class=""> </font><br class="webkit-block-placeholder"></div><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""><font color="#454545" class=""> </font><b class="">int32_t textLength = length; //proposed fix</b></p><p style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><br class=""></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> …</p><p style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><br class=""></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> if (range.location + range.length <= INT_MAX) stringBuffer = (UniChar *)CFStringGetCharactersPtr(string);</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> if (stringBuffer) {</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> regionStart = (int64_t)range.location;</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> regionLimit = (int64_t)(range.location + range.length);</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> *offset = 0;</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> } else {</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> enclosingRange = range;</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> if (transparentBounds) {</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> enclosingRange = CFRangeMake(0, length);</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> } else if (nonAnchoringBounds) {</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> if (enclosingRange.location > 0) {</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> enclosingRange.location--;</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> enclosingRange.length++;</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> }</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> if (enclosingRange.location + enclosingRange.length < length) enclosingRange.length++;</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> }</p><p style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><br class=""></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> …</p><p style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><br class=""></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> regionStart = (int64_t)(range.location - enclosingRange.location);</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> regionLimit = (int64_t)((range.location + range.length) - enclosingRange.location);</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> *offset = enclosingRange.location;</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> if (enclosingRange.length <= STACK_BUFFER_SIZE) {</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> stringBuffer = stackBuffer;</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> if (enclosingRange.length > 0) {</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> CFStringGetCharacters(string, enclosingRange, stringBuffer);</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> }</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> } else {</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> stringBuffer = (UniChar *)malloc(sizeof(UniChar) * enclosingRange.length);</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> if (stringBuffer) {</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> CFStringGetCharacters(string, enclosingRange, stringBuffer);</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> *bufferToFree = stringBuffer;</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> }</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> }</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> <b class="">textLength = enclosingRange.length; //proposed fix</b></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> }</p><p style="font-size: 14px; line-height: normal; font-family: Arial; min-height: 16px;" class=""><br class=""></p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> if (stringBuffer) {</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> regex = checkOutRegularExpression(internal, checkout, checkedOutRegex);</p><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> uregex_setText(regex, (const UChar *)stringBuffer, <b class="">textLength</b>, &errorCode); //proposed fix</p><div class=""><br class=""></div></font></div></font></blockquote><div style=""><font color="#000000"><br class=""></font></div><div style=""><font color="#000000">I might be worried that this has some other side effect that does not behave as the darwin version. Is there a way we can prove this by unit test that this hits all of the expected range extrema?</font></div><div><font color="#0000cc">>> I have opened up 35 out of 37 tests in the test_complexRegularExpression() function in TestNSRegularExpression. These tests include most if the cases I could think of. Do you think these offer enough coverage? Or would you advise me to work on more?</font></div><br class=""><blockquote type="cite" class="" style=""><font color="#000000"><div class=""><font face="Verdana,Arial,Helvetica,sans-serif" size="2" class=""><p style="font-size: 14px; line-height: normal; font-family: Arial;" class=""> }</p><div class=""><br class=""></div><font face="Default Sans Serif,Verdana,Arial,Helvetica,sans-serif" size="2" class=""><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt"><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt"><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt"><div dir="ltr" class=""><br class=""></div><div dir="ltr" class=""><br class=""></div><div dir="ltr" class=""><br class=""></div><div dir="ltr" class="">Thanks!</div><div dir="ltr" class=""><br class=""></div><div dir="ltr" class=""><br class=""><span style="font-family:georgia,serif;" class=""><span style="font-size: 1.143em;" class="">Pushkar N Kulkarni,</span></span></div><div dir="ltr" class=""><span style="font-family:georgia,serif;" class=""><span style="font-size: 1.143em;" class="">IBM Runtimes</span></span></div><div dir="ltr" class=""> </div><div dir="ltr" class=""><em class=""><span style="font-family:georgia,serif;" class=""><span style="font-size: 0.857em;" class="">"Any sufficiently advanced technology is indistinguishable from magic." - Arthur Clarke</span></span></em></div></div></div></div></font></font><br class="">
</div></font></blockquote></div><br class=""></phausler@apple.com></div></div></font></font><BR>