View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000061 | file | General | public | 2019-02-15 04:34 | 2019-02-19 20:35 |
Reporter | tmc | Assigned To | christos | ||
Priority | normal | Severity | minor | Reproducibility | always |
Status | feedback | Resolution | open | ||
OS | GNU/Linux | ||||
Product Version | 5.35 | ||||
Summary | 0000061: Line endings misdetected for UTF16 files | ||||
Description | file misidentifies the line endings of little-endian UTF16 files (running on x86_64), while big-endian seems to work fine. It also doesn't seem to even try to identify the line endings of a UTF32 file. I also notice that file can't recognise UTF16/32 files without a BOM, which is a pity. | ||||
Steps To Reproduce | file misdetects utf16le with CRLF line endings sometimes as CR line endings:
> printf "\xff\xfe\r\0\n\0" | file -
/dev/stdin: Little-endian UTF-16 Unicode text, with CR line terminators
...and sometimes as mixed CR, CRLF line endings:
> printf "\xff\xfe\r\0\n\0\r\0\n\0" | file -
/dev/stdin: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
Maybe it's skipping over the final \n? This example reinforces that idea:
> printf "\xff\xfe\n\0\r\0" | file -
/dev/stdin: Little-endian UTF-16 Unicode text
utf16be doesn't seem to show this problem:
> printf "\xfe\xff\0\r\0\n" | file -
/dev/stdin: Big-endian UTF-16 Unicode text, with CRLF line terminators
UTF32 line endings not identified:
> printf "\xff\xfe\0\0\r\0\0\0B\0\0\0" | file -
/dev/stdin: Unicode text, UTF-32, little-endian | ||||
Tags | No tags attached. | ||||
|
Yes, the UTF-16 detection had length issues, which have been fixed on HEAD. The UTF-32 detection, however, is not built-in; it is done in regular magic, which is why it does not get the CR/LF info right. I guess we should move the UTF-32 detection to be built-in for consistency. Here's the latest output:
[11:59am] 2601>cat uni
#!/bin/sh
doit() {
    printf "$1" | ./file -m ../magic//magic.mgc -
}
doit "\xff\xfe\r\0\n\0"
doit "\xff\xfe\r\0\n\0\r\0\n\0"
doit "\xff\xfe\n\0\r\0"
doit "\xfe\xff\0\r\0\n"
doit "\xff\xfe\0\0\r\0\0\0B\0\0\0"
[11:59am] 2602>./uni
/dev/stdin: Little-endian UTF-16 Unicode text, with CRLF line terminators
/dev/stdin: Little-endian UTF-16 Unicode text, with CRLF line terminators
/dev/stdin: Little-endian UTF-16 Unicode text, with CR, LF line terminators
/dev/stdin: Big-endian UTF-16 Unicode text, with CRLF line terminators
/dev/stdin: Unicode text, UTF-32, little-endian |
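The note does not say what the length issues were. Purely as an illustration of what the expected output above means, here is a minimal sketch that classifies terminators in little-endian UTF-16 input one 16-bit code unit at a time. It is not file's implementation. The function name `utf16le_terminators` and its output format are hypothetical, and the inputs use octal escapes instead of `\x`.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the code used by file: classify line terminators
# in little-endian UTF-16 read from stdin, one 16-bit code unit at a time.
utf16le_terminators() {
    # One hex byte per line, then pair the bytes up in awk.
    od -An -v -tx1 | tr -s ' ' '\n' | awk '
        NF == 0 { next }                      # skip blank lines left by tr
        { byte[n++] = $1 }
        END {
            # Start at 2 to skip the BOM (ff fe); step by one code unit.
            for (i = 2; i + 1 < n; i += 2) {
                unit = byte[i + 1] byte[i]    # little-endian: low byte first
                if (unit == "000d") {
                    nxt = (i + 3 < n) ? byte[i + 3] byte[i + 2] : ""
                    if (nxt == "000a") { crlf++; i += 2 } else cr++
                } else if (unit == "000a") lf++
            }
            out = ""
            if (crlf) out = out "CRLF "
            if (cr)   out = out "CR "
            if (lf)   out = out "LF "
            sub(/ $/, "", out)
            print (out == "" ? "none" : out)
        }'
}

printf '\377\376\r\0\n\0' | utf16le_terminators
printf '\377\376\n\0\r\0' | utf16le_terminators
```

A CR code unit counts as part of a CRLF pair only when the next whole code unit is LF, so an LF followed by a CR is reported as two separate terminators.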
|
Should be fixed now:
#!/bin/sh
doit() {
    printf "$1" | ./file -m /dev/null -
}
doit "\xff\xfe\r\0\n\0"
doit "\xff\xfe\r\0\n\0\r\0\n\0"
doit "\xff\xfe\n\0\r\0"
doit "\xfe\xff\0\r\0\n"
doit "\xff\xfe\0\0\r\0\0\0B\0\0\0"
doit "\xff\xfe\0\0\r\0\0\0\n\0\0\0"
doit "\0\0\xfe\xff\0\0\0\r\0\0\0\n" |
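One caveat when rerunning the script: POSIX specifies only octal escapes for printf, so `\xHH` may not be interpreted by every `/bin/sh`. A small sketch, assuming a hypothetical helper named `hexbytes`, that writes two of the inputs with octal escapes and dumps the bytes so they can be checked before being piped to file:

```shell
#!/bin/sh
# Hypothetical helper (not part of file): print the bytes produced by a
# printf format string as one run of lowercase hex digits.
hexbytes() {
    # -v stops od from collapsing repeated lines into "*".
    printf "$1" | od -An -v -tx1 | tr -d ' \t\n'
    echo
}

hexbytes '\377\376\r\0\n\0'                 # UTF-16LE BOM, CR, LF
hexbytes '\0\0\376\377\0\0\0\r\0\0\0\n'     # UTF-32BE BOM, CR, LF
```

If the dump shows the expected BOM and code units, the same format string can be used in `doit` in place of the `\x` form.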
Date Modified | Username | Field | Change |
---|---|---|---|
2019-02-15 04:34 | tmc | New Issue | |
2019-02-18 17:02 | christos | Assigned To | => christos |
2019-02-18 17:02 | christos | Status | new => assigned |
2019-02-18 17:04 | christos | Status | assigned => acknowledged |
2019-02-18 17:04 | christos | Note Added: 0003210 | |
2019-02-19 20:35 | christos | Status | acknowledged => feedback |
2019-02-19 20:35 | christos | Note Added: 0003220 |