Feature #2834
Make regex more strict
Added by Qiuhan Ding over 9 years ago. Updated almost 7 years ago.
0%
Description
This change is mainly for key generation, since we cannot infer a pattern or derive a name from an ambiguous pattern. We also want to simplify the regular expression syntax; some features are not essential.
Features that have been changed:
- Do not allow repetition for sub groups. This is to avoid uncertainty in the pattern. For example, we do not allow <ndn>(<>)* or <ndn>(<>*)*
- Remove '^' and '$' in pattern expressions. Every pattern needs to be complete. For example, the previous pattern ^<ndn><edu> needs to change to <ndn><edu><>* if it has more components after <edu>
- Do not allow sub groups inside the component matcher. This is to simplify the regular expression. For example, the previous pattern <(.*).(.*)> is not allowed.
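The three restrictions above can be illustrated with a small syntax checker. This is only a minimal sketch in Python, not the ndn-cxx implementation: the function name `check_strict_pattern` and its messages are hypothetical, and it looks only at the surface syntax described in this issue (it does not handle an escaped `>` inside a component matcher).

```python
def check_strict_pattern(pattern):
    """Return a sorted list of the restrictions that `pattern` violates.

    Hypothetical helper: it scans the pattern text once and does not
    build a real parser.
    """
    problems = set()
    in_component = False  # True while scanning between '<' and '>'
    prev = ""             # previous character seen outside a component matcher
    for ch in pattern:
        if in_component:
            if ch == ">":
                in_component = False
                prev = ">"
            elif ch == "(":
                # e.g. <(.*).(.*)>
                problems.add("sub group inside component matcher")
            continue
        if ch == "<":
            in_component = True
        elif ch == "$" or (ch == "^" and prev != "["):
            # '^' right after '[' negates a component set, it is not an anchor
            problems.add("anchor in pattern")
        elif ch in "*+?{" and prev == ")":
            # e.g. <ndn>(<>)*
            problems.add("repetition of sub group")
        prev = ch
    return sorted(problems)
```

Under these assumptions, `check_strict_pattern("<ndn><edu><>*")` returns an empty list, while each of the disallowed examples above is reported with the restriction it violates.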
Updated by Junxiao Shi over 9 years ago
- does not allow repetition for sub groups. This is to avoid uncertainty in pattern.
- remove '^' and '$' in pattern expression. Every pattern needs to be complete.
- does not allow subgroups in the wild card. This is to simplify the regular expression.
Please give examples for points 1 and 3. I don't fully understand them.
Updated by Qiuhan Ding over 9 years ago
- Description updated (diff)
- Status changed from New to Code review
Updated by Junxiao Shi over 9 years ago
- Description updated (diff)
do not allow subgroups in the wildcard. This is to simplify the regular expression.
For example: previous pattern <(.*).(.*)> is not allowed.
I still don't understand what "subgroups in the wildcard" means. The example given is "subgroups in a name component matcher".
From Regex:
A special case is that <> is a wildcard matcher that can match ANY component.
The only possible "wildcard" is <>. By definition, no subgroups can occur in the wildcard.
Updated by Junxiao Shi over 9 years ago
Please update Regex wiki page to reflect these changes.
Updated by Tai-Lin Chu over 9 years ago
I saw this and other related proposals for regexp. A less-ambiguous design might be applying one regexp on one component instead of the whole name. As a result, there will be a list of regexp patterns to match a name.
Also, redefining or changing regexp syntax makes ndn regexp deviate from the regexp standard, so it is not necessarily a good thing. (Removing support for complex syntax is ok though.)
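The per-component idea above could look roughly like the following. This is a hedged sketch in Python using the standard `re` module, not an existing ndn-cxx API; `match_components` is a hypothetical name, and components are assumed to be already split into text strings.

```python
import re

def match_components(patterns, components):
    """Match a name against a list of regexps, one regexp per component.

    Hypothetical helper: `patterns` and `components` are lists of str,
    and every component must be fully matched by the regexp at the same
    position.
    """
    if len(patterns) != len(components):
        return False
    return all(re.fullmatch(p, c) is not None
               for p, c in zip(patterns, components))
```

One limitation of this sketch: a fixed-length list cannot express "any number of components", which the whole-name syntax writes as <>*, so a real design would need some way to repeat a per-component pattern.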
Updated by Junxiao Shi over 9 years ago
do not allow boost regular expression inside the component matcher. This is to simplify the regular expression.
I disagree with this change.
This makes a component matcher either an exact match or a wildcard.
It's sometimes useful to match a pattern such as <ksk-.*>.
Updated by Yingdi Yu over 9 years ago
Junxiao Shi wrote:
do not allow boost regular expression inside the component matcher. This is to simplify the regular expression.
I disagree with this change.
This makes a component matcher either an exact match or a wildcard.
That is exactly what we want.
It's sometimes useful to match a pattern such as <ksk-.*>.
This name component per se is not correct. It should be separated into two name components, one for ksk and the other for the timestamp.
Updated by Junxiao Shi over 9 years ago
Followup on note-9:
Reducing name component matcher to either exact matching or wildcard significantly weakens the power of ndn-regex.
There are legit cases where pattern matching is necessary.
<ksk-.*> is a correct example, because this is how a real world protocol is defined. We should not weaken ndn-regex and force the protocol to be changed.
Other examples are <%FD.*> to match a version component, and <%00.*> to match a segment number. They are needed at least until #2012 completes and relevant matchers are added.
Updated by Yingdi Yu over 9 years ago
Junxiao Shi wrote:
Followup on note-9:
Reducing name component matcher to either exact matching or wildcard significantly weakens the power of ndn-regex.
There are legit cases where pattern matching is necessary.
There are three component matchers: exact matcher, wildcard, and wildcard specializer. Please check the TR: http://named-data.net/publications/techreports/ndn-0030-2-trust-schema/
<ksk-.*> is a correct example, because this is how a real world protocol is defined. We should not weaken ndn-regex and force the protocol to be changed.
Other examples are <%FD.*> to match a version component, and <%00.*> to match a segment number. They are needed at least until #2012 completes and relevant matchers are added.
There is no such "real world protocol" about ksk-...; it is just an early thought that Alex and I came up with, and we do not think it is correct. Ideally, ksk and dsk imply the privilege of a key, and should be expressed explicitly as a single name component rather than as part of the key id.
For the version number and segment number you mentioned above, why should the trust schema care about the value of these components?
Updated by Junxiao Shi over 9 years ago
For the version number and segment number you mentioned above, why should the trust schema care about the value of these components?
ndn-regex is a utility to help application developers.
It is NOT solely used by trust schema.
An application may want to use ndn-regex to classify incoming Interests, and take actions according to its structure.
The ndn::InterestFilter type encourages such usage.
It's reasonable for an application to process an Interest that requests a versioned item differently from an Interest without a version component.
You could make a separate ndn-trust-schema-regex that imposes all these limitations, and I won't object.
However, leave the original ndn-regex alone!
Updated by Yingdi Yu over 9 years ago
No, the original ndn-regex is wrong! And I did not see anything the new ndn-regex cannot do; for the example you just mentioned above, please use the wildcard specializer instead. The reason to make regex stricter is to avoid unnecessary flexibility. BTW, a normal regex can only match strings; what do you expect people to do with a binary name component? To me, the wildcard specializer is obviously a better solution than a normal regex.
Updated by Junxiao Shi over 9 years ago
I did not see anything the new ndn-regex cannot do
This statement is false.
Wildcard specializer as defined in commit:f6c11c06139518fe6378c5c4cbd6508c085bcc9d is incapable of creating a regex equivalent to original <ab*c>.
Updated by Yingdi Yu over 9 years ago
Junxiao Shi wrote:
Wildcard specializer as defined in commit:f6c11c06139518fe6378c5c4cbd6508c085bcc9d is incapable of creating a regex equivalent to original <ab*c>.
In which case do we need such a matcher?
Updated by Junxiao Shi over 9 years ago
Answer to note-15:
There are all kinds of applications, and you don't know all of them.
Since InterestFilter supports ndn-regex, applications start to rely on this feature.
There's no reason to take part of this feature away from application developers.
Updated by Yingdi Yu over 9 years ago
Even if we already know it's a bad feature that can be easily abused, and there is a better solution? I just do not get why people want to rely on the internal pattern of a component value. If there has to be some pattern, why not express it explicitly at the name level?
If you are talking about backward compatibility, then we should find a way to help people to migrate to the new syntax.
Updated by Junxiao Shi over 9 years ago
The ability to match inside a NameComponent is not a bad feature.
It is a legit feature that needs to be maintained.
For example, ndnpingserver may use an ndn-regex to ensure the sequence number component is a decimal integer.
Note: it's intentional to use decimal rather than nonNegativeInteger, because it's easily readable in logs.
Version and segment markers are another example where matching inside a component is necessary.
An application may want to classify an incoming Interest to see whether it ends with version and segment, and this classifier could be written with an ndn-regex.
Jeff's marker proposal was to use a separate component like _v/1/_s/2, but it was rejected.
I can agree with dropping regex matching in the component matcher, on the condition that an equivalent feature is provided as a function in the wildcard specializer, i.e. [regex:ab*c].
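The two use cases in this note, a decimal sequence number and a name ending with version and segment components, can be sketched as follows. This is an illustration in Python with byte-string patterns, not ndn-regex itself: the function names are hypothetical, and the marker octets 0xFD and 0x00 are taken from the <%FD.*> and <%00.*> patterns quoted earlier in this thread.

```python
import re

# Byte-string patterns; DOTALL lets '.' match every octet, including 0x0A.
VERSION = re.compile(rb"\xFD.*", re.DOTALL)  # counterpart of <%FD.*>
SEGMENT = re.compile(rb"\x00.*", re.DOTALL)  # counterpart of <%00.*>
DECIMAL = re.compile(rb"[0-9]+")             # decimal sequence number

def ends_with_version_and_segment(components):
    """True if the last two components look like a version and a segment.

    `components` is assumed to be a list of bytes, one per name component.
    """
    return (len(components) >= 2
            and VERSION.fullmatch(components[-2]) is not None
            and SEGMENT.fullmatch(components[-1]) is not None)

def is_decimal_sequence(component):
    """True if the component consists only of ASCII decimal digits."""
    return DECIMAL.fullmatch(component) is not None
```

Both checks need to look inside a single component, which is the capability under discussion here.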
Updated by Junxiao Shi over 9 years ago
commit:4d827b7db1c7f436b3fc838a1266d85d86c93be5 says:
A special case is that ``<>`` is a wildcard matcher that can match **ANY** component.
Why is this special case necessary?
The regular name component matcher <.*> is already able to match any component.
The concern of having this special case definition is: how to write a regular name component matcher that matches an empty component?
I know one way to match an empty component is [^<.+>], but this is really unobvious.
Updated by Yingdi Yu over 9 years ago
Given that the wildcard is the most frequently used matcher, <> is a convenient shortcut. BTW, in which case would people want to put an empty component in a name?
Updated by Junxiao Shi over 9 years ago
I can accept having <> as a wildcard.
But it's also necessary to show an example of how to match an empty component, so that application designers can use the same syntax, and therefore improve the readability of code.
I suggest taking <.{0}> as the recommended syntax for empty component matching.
See #1932 note-6 on why empty component is necessary.
If you disagree with these arguments, post your comments on #1932.
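As a rough check of the suggested syntax, the expression `.{0}` behaves as follows in Python's `re` module. This is only an analogy: ndn-regex is built on a different regex engine, and `is_empty_component` is a hypothetical helper operating on a component given as text.

```python
import re

def is_empty_component(component):
    """True only for the empty component: '.{0}' means exactly zero characters."""
    return re.fullmatch(r".{0}", component) is not None
```

With a full match, `.{0}` accepts only the empty string, whereas `.*` accepts the empty string and everything else, so the two are distinguishable when written inside a component matcher.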
Updated by Junxiao Shi over 9 years ago
From commit:70cb7c5fe11011f297529890ce56884e60bfbd21 :
Question:
Give an example for a name component matcher that can match a character not representable in UTF-8.
For example, how to write an exact matcher that matches the binary NameComponent FD00305E?
Qiuhan's answer was:
We do not support matching binary name component
This answer conflicts with NDN packet format.
TLV-VALUE := BYTE+
GenericNameComponent is a generic name component, without any restrictions on the content of the value.
The quoted lines from NDN packet format define GenericNameComponent as an array of octets, not restricted to UTF-8 text.
Furthermore, Naming Conventions require the use of octets outside of UTF-8 range in order to represent version number, segment number, etc.
Therefore, ndn-regex needs the ability to match any octet, and not to be restricted in UTF-8 range.
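For comparison, matching arbitrary octets is possible in a regex engine that works on byte strings. The sketch below uses Python's `re` module with `bytes` patterns to build an exact matcher for the component FD00305E; it is not ndn-regex, and `exact_matcher` is a hypothetical helper.

```python
import re

def exact_matcher(octets):
    """Compile a pattern that matches exactly the given octets.

    re.escape() is needed because a binary value may contain octets that
    are regex metacharacters, e.g. 0x5E is '^'.
    """
    return re.compile(re.escape(octets), re.DOTALL)

component = bytes.fromhex("FD00305E")
matcher = exact_matcher(component)
is_match = matcher.fullmatch(component) is not None
```

The same approach extends to patterns over octets, such as a fixed marker octet followed by any value, without assuming the component is UTF-8 text.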
Updated by Junxiao Shi over 8 years ago
In the 20160607 conference call, Yingdi revealed that Regex can actually match a binary (non-printable) NameComponent, but only with the wildcard matcher.
Capturing with wildcard matcher is supported.
I think this should be sufficient.
Updated by Junxiao Shi over 8 years ago
- Status changed from Code review to Feedback
The 20160901 conference call decided to neither merge nor reject these patches:
https://gerrit.named-data.net/2062
https://gerrit.named-data.net/2057