View Issue Details

ID: 0000603
Project: file
Category: General
View Status: public
Last Update: 2024-12-31 22:54
Reporter: Anton Monroe
Assigned To: christos
Priority: normal
Severity: minor
Reproducibility: always
Status: assigned
Resolution: open
Product Version: 5.46
Summary: 0000603: regex: $ does not match CRLF line-ending
Description: In a regular expression, $ represents the end of a line. "regex <string>$" works for files with LF line-endings but not for CRLF line-endings.
Steps To Reproduce: Using the attached files, type
        file -m regex.magic lf.c crlf.c
Tags: No tags attached.

Activities

Anton Monroe

2024-12-29 05:16

reporter  

regex.magic (183 bytes)   
0   regex   \^#[[:space:]]*ifdef        found #ifdef
>0  regex   \^#[[:space:]]*endif$       \b, found #endif$
# >0  default x                           \b, did not find #endif$

lf.c (65 bytes)   
/*  this file has LF line endings    */
#ifdef something
#endif

crlf.c (69 bytes)   
/*  this file has CRLF line endings  */
#ifdef something
#endif


christos

2024-12-29 19:54

manager   ~0004147

Perhaps use [\r\n] instead of $ if you want that? file(1) just uses regex(3).

Anton Monroe

2024-12-31 22:54

reporter   ~0004148

Okay, it sounds like my complaint is with regex(3) then. I was misled by the fact that grep on OS/2 behaves sanely.

Searching for \r\n would only work for a CRLF line ending. The portable way to do it seems to be to search for [[:space:]]*$. Most of the tests in Magdir/c-lang already use that; the test for "endif$" may have been an oversight.

So in Magdir/c-lang you might want to change
    0 search/8192 endif
    >0 regex \^#[[:space:]]*(if\|ifn)def
    >>&0 regex \^#[[:space:]]*endif$ C source text
    !:mime text/x-c
to
    0 search/8192 endif
    >0 regex \^#[[:space:]]*(if\|ifn)def
    >>&0 regex \^#[[:space:]]*endif[[:space:]]*$ C source text
    !:mime text/x-c
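
The difference between the two patterns can be checked against regex(3) directly, without going through file. The following is only a minimal sketch: it assumes the pattern is compiled with REG_EXTENDED | REG_NEWLINE, and that the magic-file escape \^ stands for a plain ^ in the regex. The helper name line_regex_matches is made up for this illustration.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of file(1).
 * Returns 1 if `pattern` matches somewhere in `text`, 0 if it does not,
 * and -1 if the pattern fails to compile.
 *
 * REG_NEWLINE (an assumption about how the pattern is compiled) makes
 * '^' match after a newline and '$' match before a newline, but it does
 * not give '\r' any special meaning.
 */
static int line_regex_matches(const char *pattern, const char *text)
{
    regex_t rx;
    int rc;

    if (regcomp(&rx, pattern, REG_EXTENDED | REG_NEWLINE | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&rx, text, 0, NULL, 0);
    regfree(&rx);

    return rc == 0;
}
```

Called with "^#[[:space:]]*endif$", this helper should report a match for text ending in "#endif\n" but not for text ending in "#endif\r\n", because the '\r' sits between "endif" and the position where $ can match. With "^#[[:space:]]*endif[[:space:]]*$" the trailing '\r' is consumed by [[:space:]]*, so both line endings should match.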

There are some other files that look like they might benefit from the same fix, but I'm not qualified to offer suggestions about them.

Start of rant:
The documentation for 'grep' says "The caret '^' and the dollar sign '$' are special characters that respectively match the empty string at the beginning and end of a line." The "end of a line" is a concept that is common to all operating systems. A CRLF marks the end of a line just as much as an LF, and should be treated the same. I discovered today that GNU grep (version 3.6) on Linux does not do that. The GNU grep (version 3.8) that I use on OS/2 does: a '$' matches the end of a line, whether the line is terminated by a LF, CRLF, or the end of the file. I don't know how it does it, but it is more logical and more useful this way. 'file' is a good example of why treating them alike is a good idea, because it must deal with files from multiple operating systems.

If '$' only matches the LF character then why have the '^' and '$' meta-characters at all? And why document that '$' represents "end of a line" when it only represents the end of some lines?

End of rant

Issue History

Date Modified Username Field Change
2024-12-29 05:16 Anton Monroe New Issue
2024-12-29 05:16 Anton Monroe File Added: regex.magic
2024-12-29 05:16 Anton Monroe File Added: lf.c
2024-12-29 05:16 Anton Monroe File Added: crlf.c
2024-12-29 19:53 christos Assigned To => christos
2024-12-29 19:53 christos Status new => assigned
2024-12-29 19:54 christos Status assigned => feedback
2024-12-29 19:54 christos Note Added: 0004147
2024-12-31 22:54 Anton Monroe Note Added: 0004148
2024-12-31 22:54 Anton Monroe Status feedback => assigned