View Issue Details

ID: 0000337
Project: file
Category: General
View Status: public
Last Update: 2022-05-22 20:01
Reporter: jmp3r
Assigned To: christos
Priority: normal
Severity: major
Reproducibility: always
Status: feedback
Resolution: open
Product Version: 5.41
Summary: 0000337: Data files identified as text, text as data
Description: New bug, based on previous tickets 0000319 and 0000334.

I used the latest sources (HEAD) from GitHub.

I understand that correctly identifying text files can be complicated, but I hope it is possible.

I have attached more files for testing.

Folders in archive:
data - binaries (files encrypted with ransomware) detected as `Unicode text, UTF-16, big-endian text, with no line terminators`
text - correct text files detected as data
CORRECT_DETECTED_AS_DATA - one file similar to the others (in data folder) but identified correctly.

I can provide more files if needed.
Steps To Reproduce: Scan the files from the attachment with `file` built from the latest sources (HEAD).
Tags: bug

Activities

jmp3r

2022-04-05 19:06

reporter  

bug_new.zip (166,197 bytes)

christos

2022-05-22 20:01

manager   ~0003748

It all has to do with the ucs16 detection in looks_ucs16 in encoding.c. If you change https://github.com/file/file/blob/master/src/encoding.c#L480 to 'return 0;', i.e. ignore all ucs16 files that don't have a BOM, then the mis-detected text files get fixed (but then we'll miss ucs16 files without a BOM).
The Estonian and Hebrew files have invalid low surrogate characters (0xdc00 and 0xde05 respectively). If you comment out https://github.com/file/file/blob/master/src/encoding.c#L516, they succeed.

The English.tr file has two 0x13 (^S) characters, which is why it fails. The rest contain some 0x7f (DEL) characters, which is why they fail. If you comment out https://github.com/file/file/blob/master/src/encoding.c#L511, they all succeed.
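
For illustration, here is a rough sketch of the kind of checks described above: a BOM-less 16-bit buffer is rejected as text when it contains a disallowed control character (such as 0x13 or 0x7f) or a low surrogate that does not follow a high surrogate. This is not the code from encoding.c. The names `sketch_looks_utf16` and `sketch_is_text_unit` are made up for this example, and the exact set of allowed control characters is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: is this code unit acceptable in plain text?
 * Assumption: only BEL..CR (0x07-0x0d) and ESC (0x1b) are allowed among
 * the control characters; DEL (0x7f) is rejected.
 */
static int
sketch_is_text_unit(uint32_t uc)
{
	if (uc == 0x7f)
		return 0;
	if (uc < 0x20)
		return (uc >= 0x07 && uc <= 0x0d) || uc == 0x1b;
	return 1;
}

/*
 * Hypothetical check: return 1 if buf looks like 16-bit Unicode text in
 * the given byte order (bigend != 0 means big-endian), 0 otherwise.
 * No BOM is expected or skipped. A trailing odd byte is ignored.
 */
int
sketch_looks_utf16(const unsigned char *buf, size_t nbytes, int bigend)
{
	int need_low = 0;	/* previous unit was a high surrogate */
	size_t i;

	if (nbytes < 2)
		return 0;

	for (i = 0; i + 1 < nbytes; i += 2) {
		uint32_t uc = bigend
		    ? ((uint32_t)buf[i] << 8) | buf[i + 1]
		    : ((uint32_t)buf[i + 1] << 8) | buf[i];
		int is_high = uc >= 0xd800 && uc <= 0xdbff;
		int is_low = uc >= 0xdc00 && uc <= 0xdfff;

		if (need_low) {
			/* A high surrogate must be followed by a low one. */
			if (!is_low)
				return 0;
			need_low = 0;
			continue;
		}
		if (is_low)
			return 0;	/* low surrogate with no high before it */
		if (is_high) {
			need_low = 1;
			continue;
		}
		if (!sketch_is_text_unit(uc))
			return 0;	/* e.g. 0x0013 (^S) or 0x007f (DEL) */
	}
	return 1;
}
```

With this sketch, a single 0x0013, 0x007f, 0xdc00 or 0xde05 unit anywhere in the buffer is enough to reject the whole file as text, which matches the failures described above. Relaxing either check accepts those files but also makes it easier for binary data to pass as text.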

Issue History

Date Modified Username Field Change
2022-04-05 19:06 jmp3r New Issue
2022-04-05 19:06 jmp3r File Added: bug_new.zip
2022-04-05 19:06 jmp3r Tag Attached: bug
2022-05-22 19:50 christos Assigned To => christos
2022-05-22 19:50 christos Status new => assigned
2022-05-22 20:01 christos Status assigned => feedback
2022-05-22 20:01 christos Note Added: 0003748