View Issue Details

ID: 0000061
Project: file
Category: General
View Status: public
Last Update: 2019-02-19 20:35
Reporter: tmc
Assigned To: christos
Priority: normal
Severity: minor
Reproducibility: always
Status: feedback
Resolution: open
OS: GNU/Linux
Product Version: 5.35
Summary: 0000061: Line endings misdetected for UTF16 files
Description: file misidentifies the line endings of little-endian UTF16 files (running on x86_64), while big-endian seems to work fine.
Also, it doesn't seem to even try to identify the line endings of a UTF32 file.

Also, I notice that file can't recognise UTF16/32 files without a BOM, which is a pity.
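
For reference, the BOM check that the examples in this report rely on can be sketched in plain sh. This is only an illustration, not how file implements it; `bom_of` is a hypothetical helper name. Note that `ff fe 00 00` is ambiguous: it is the UTF-32LE BOM, but it could also be a UTF-16LE BOM followed by a NUL character. The sketch assumes UTF-32 in that case.

```sh
# Hypothetical helper: report which BOM (if any) starts the data on stdin.
bom_of() {
    # First four bytes of stdin as one lowercase hex string, e.g. "fffe0000".
    hex=$(od -An -tx1 -N4 | tr -d ' \n')
    case "$hex" in
        fffe0000) echo "UTF-32LE" ;;   # assumption: prefer UTF-32 over UTF-16LE + NUL
        0000feff) echo "UTF-32BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        feff*)    echo "UTF-16BE" ;;
        efbbbf*)  echo "UTF-8" ;;
        *)        echo "none" ;;
    esac
}

# Octal escapes are used because not every printf accepts \xHH.
printf '\377\376\r\0\n\0' | bom_of
printf '\377\376\0\0\r\0\0\0' | bom_of
```

Input without a BOM falls through to "none", which is the case the report describes as unrecognised.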
Steps To Reproduce: file misdetects UTF-16LE with CRLF line endings, sometimes as CR line endings:

> printf "\xff\xfe\r\0\n\0" |file -
/dev/stdin: Little-endian UTF-16 Unicode text, with CR line terminators

...and sometimes as mixed CR, CRLF line endings:

> printf "\xff\xfe\r\0\n\0\r\0\n\0" |file -
/dev/stdin: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators

Maybe it's skipping over the final \n? This example reinforces that idea:

> printf "\xff\xfe\n\0\r\0" | file -
/dev/stdin: Little-endian UTF-16 Unicode text

utf16be doesn't seem to show this problem:

> printf "\xfe\xff\0\r\0\n" | file -
/dev/stdin: Big-endian UTF-16 Unicode text, with CRLF line terminators

UTF32 line endings not identified:

> printf "\xff\xfe\0\0\r\0\0\0B\0\0\0" | file -
/dev/stdin: Unicode text, UTF-32, little-endian
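
When reproducing these cases, it may help to confirm the exact bytes being piped into file, since `\xHH` escapes in printf are an extension that some shells do not support. A minimal sketch, assuming a POSIX `od` and `awk`; `hexbytes` is a hypothetical helper name:

```sh
# Hypothetical helper: print stdin as space-separated hex bytes on one line.
hexbytes() {
    od -An -v -tx1 | awk '
        { for (i = 1; i <= NF; i++) { printf "%s%s", sep, $i; sep = " " } }
        END { print "" }'
}

# Same bytes as the first test case, written with portable octal escapes.
printf '\377\376\r\0\n\0' | hexbytes
```

If the dump shows literal backslash sequences rather than `ff fe 0d 00 0a 00`, the local printf did not interpret the escapes and the test input is not what was intended.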

Tags: No tags attached.

Activities

christos

2019-02-18 17:04

manager   ~0003210

Yes, the UTF-16 detection had length issues, which have been fixed on HEAD. The UTF-32 detection, however, is not built in; it is done in regular magic, which is why it does not get the CR/LF info right. I guess we should move the UTF-32 detection to be built in for consistency.
Here's the latest output:

[11:59am] 2601>cat uni
#!/bin/sh
doit() {
        printf "$1" | ./file -m ../magic//magic.mgc -
}

doit "\xff\xfe\r\0\n\0"
doit "\xff\xfe\r\0\n\0\r\0\n\0"
doit "\xff\xfe\n\0\r\0"
doit "\xfe\xff\0\r\0\n"
doit "\xff\xfe\0\0\r\0\0\0B\0\0\0"
[11:59am] 2602>./uni
/dev/stdin: Little-endian UTF-16 Unicode text, with CRLF line terminators
/dev/stdin: Little-endian UTF-16 Unicode text, with CRLF line terminators
/dev/stdin: Little-endian UTF-16 Unicode text, with CR, LF line terminators
/dev/stdin: Big-endian UTF-16 Unicode text, with CRLF line terminators
/dev/stdin: Unicode text, UTF-32, little-endian
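
To cross-check the UTF-16LE results above independently of file, the line terminators can be counted over whole 16-bit code units rather than single bytes. The following is a rough sketch, not the code in file itself; `utf16le_eol` is a hypothetical helper, and it assumes well-formed input with an even number of bytes.

```sh
# Hypothetical helper: count CRLF, lone CR and lone LF in UTF-16LE data on stdin.
utf16le_eol() {
    od -An -v -tx1 | awk '
        {
            for (i = 1; i <= NF; i++) {
                if (lo == "") { lo = $i; continue }   # low byte of the code unit
                unit = $i lo                          # little-endian: high byte arrives second
                lo = ""
                n++
                if (n == 1 && unit == "feff") continue   # skip the BOM
                if (unit == "000a") {
                    # LF completes a CRLF pair only if the previous unit was CR.
                    if (cr_pending) crlf++; else lf++
                    cr_pending = 0
                } else {
                    if (cr_pending) cr++              # previous CR stood alone
                    cr_pending = (unit == "000d")
                }
            }
        }
        END {
            if (cr_pending) cr++                      # a trailing CR is a lone CR
            printf "CRLF=%d CR=%d LF=%d\n", crlf, cr, lf
        }'
}

printf '\377\376\r\0\n\0' | utf16le_eol
printf '\377\376\n\0\r\0' | utf16le_eol
```

A CR is only classified once the following code unit has been seen, so the last unit of the input must be processed for a trailing CRLF to be counted correctly.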

christos

2019-02-19 20:35

manager   ~0003220

Should be fixed now:
#!/bin/sh
doit() {
        printf "$1" | ./file -m /dev/null -
}

doit "\xff\xfe\r\0\n\0"
doit "\xff\xfe\r\0\n\0\r\0\n\0"
doit "\xff\xfe\n\0\r\0"
doit "\xfe\xff\0\r\0\n"
doit "\xff\xfe\0\0\r\0\0\0B\0\0\0"
doit "\xff\xfe\0\0\r\0\0\0\n\0\0\0"
doit "\0\0\xfe\xff\0\0\0\r\0\0\0\n"
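
The two UTF-32 cases added at the end of the script can be sanity-checked the same way, independent of file. This is a sketch under assumptions: `utf32_eol` is a hypothetical helper taking `le` or `be` as its byte order, and the input is assumed to be a whole number of 4-byte units.

```sh
# Hypothetical helper: utf32_eol le|be
# Counts CRLF, lone CR and lone LF in UTF-32 data on stdin.
utf32_eol() {
    od -An -v -tx1 | awk -v order="$1" '
        {
            for (i = 1; i <= NF; i++) {
                # Assemble each unit most-significant byte first.
                if (order == "le") unit = $i unit; else unit = unit $i
                if (length(unit) < 8) continue
                u = unit
                unit = ""
                n++
                if (n == 1 && u == "0000feff") continue   # skip the BOM
                if (u == "0000000a") {
                    if (cr_pending) crlf++; else lf++
                    cr_pending = 0
                } else {
                    if (cr_pending) cr++
                    cr_pending = (u == "0000000d")
                }
            }
        }
        END {
            if (cr_pending) cr++
            printf "CRLF=%d CR=%d LF=%d\n", crlf, cr, lf
        }'
}

printf '\377\376\0\0\r\0\0\0\n\0\0\0' | utf32_eol le
printf '\0\0\376\377\0\0\0\r\0\0\0\n' | utf32_eol be
```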

Issue History

Date Modified Username Field Change
2019-02-15 04:34 tmc New Issue
2019-02-18 17:02 christos Assigned To => christos
2019-02-18 17:02 christos Status new => assigned
2019-02-18 17:04 christos Status assigned => acknowledged
2019-02-18 17:04 christos Note Added: 0003210
2019-02-19 20:35 christos Status acknowledged => feedback
2019-02-19 20:35 christos Note Added: 0003220