View Issue Details

ID: 0000337
Project: file
Category: General
View Status: public
Last Update: 2022-05-22 20:01
Reporter: jmp3r
Assigned To: christos
Status: feedback
Resolution: open
Product Version: 5.41
Summary: 0000337: Data files identified as text, text as data
Description: New bug, based on previous tickets 0000319 and 0000334.

I used the latest sources (HEAD) from GitHub.

I understand that correctly identifying text files can be complicated, but I hope it is possible.

I have attached more files for testing.

Folders in archive:
data - binaries (files encrypted with ransomware) detected as `Unicode text, UTF-16, big-endian text, with no line terminators`
text - correct text files detected as data
CORRECT_DETECTED_AS_DATA - one file similar to the others (in data folder) but identified correctly.

I can provide more files if needed.
Steps To Reproduce: Scan the attached files with `file` built from the latest sources (HEAD).



2022-04-05 19:06

reporter (166,197 bytes)


2022-05-22 20:01

manager   ~0003748

It all has to do with the ucs16 detection in looks_ucs16 in encoding.c. If you change it to 'return 0;', i.e. ignore all ucs16 files that don't have a BOM, then the mis-detected text files get fixed (but then we'll miss ucs16 files without a BOM).
The Estonian and Hebrew files contain invalid low surrogate characters (dc00 and de05 respectively). If you comment out that check, they succeed.
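
For readers unfamiliar with the surrogate rule mentioned above, here is a minimal sketch of such a check. It is not the code in encoding.c: the function name `sketch_utf16be_surrogates_ok` is hypothetical, the input is assumed to be big-endian UTF-16 with no BOM, and an odd byte count is simply treated as invalid. A low surrogate (dc00 to dfff) is accepted only directly after a high surrogate (d800 to dbff), so a stray dc00 or de05 makes the buffer fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the actual looks_ucs16(): returns 1 if every
 * surrogate in the big-endian UTF-16 buffer is correctly paired, else 0.
 */
int
sketch_utf16be_surrogates_ok(const unsigned char *buf, size_t nbytes)
{
	size_t i;
	int need_low = 0;	/* set after a high surrogate has been seen */

	assert(buf != NULL);
	if (nbytes % 2 != 0)
		return 0;	/* assumption: odd length is rejected outright */

	for (i = 0; i + 1 < nbytes; i += 2) {
		uint32_t u = ((uint32_t)buf[i] << 8) | buf[i + 1];
		int is_high = (u >= 0xD800 && u <= 0xDBFF);
		int is_low = (u >= 0xDC00 && u <= 0xDFFF);

		if (need_low) {
			if (!is_low)
				return 0;	/* high surrogate not followed by a low one */
			need_low = 0;
		} else if (is_low) {
			return 0;	/* low surrogate with no preceding high one */
		} else if (is_high) {
			need_low = 1;
		}
	}
	return !need_low;	/* a trailing high surrogate is also invalid */
}
```

A check of this kind is strict by design: one unpaired unit anywhere in the buffer is enough to reject the whole file as UTF-16, which would match the behaviour described for the Estonian and Hebrew samples.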

The file has two 0x13 (^S) characters, which is why it fails. The rest contain some 0x7f (DEL) characters, which is why they fail. If you comment out those checks, they all succeed.
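
The control-character failures described above can be illustrated with a small sketch. This is not the table used in encoding.c: both function names are hypothetical, and the set of control characters accepted as text (BEL, BS, HT, LF, VT, FF, CR, ESC) is an assumption chosen for illustration. It scans a big-endian UTF-16 buffer and reports the first code unit that the policy rejects, so a single 0x13 or 0x7f is enough to disqualify the file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy: does this code unit count as ordinary text? */
int
sketch_is_text_unit(unsigned int u)
{
	if (u == 0x7F)			/* DEL is rejected */
		return 0;
	if (u < 0x20) {
		switch (u) {
		case 0x07: case 0x08: case 0x09: case 0x0A:
		case 0x0B: case 0x0C: case 0x0D: case 0x1B:
			return 1;	/* assumed set of acceptable controls */
		default:
			return 0;	/* everything else, including 0x13 (^S) */
		}
	}
	return 1;
}

/*
 * Returns the index (counted in 16-bit units) of the first unit rejected
 * by sketch_is_text_unit(), or -1 if every unit passes. A trailing odd
 * byte is ignored.
 */
long
sketch_first_bad_unit_utf16be(const unsigned char *buf, size_t nbytes)
{
	size_t i;

	assert(buf != NULL);
	for (i = 0; i + 1 < nbytes; i += 2) {
		unsigned int u = ((unsigned int)buf[i] << 8) | buf[i + 1];

		if (!sketch_is_text_unit(u))
			return (long)(i / 2);
	}
	return -1;
}
```

Returning the offset rather than a plain yes/no makes it easier to see which character caused a given sample to fall back to "data".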

Issue History

Date Modified Username Field Change
2022-04-05 19:06 jmp3r New Issue
2022-04-05 19:06 jmp3r File Added:
2022-04-05 19:06 jmp3r Tag Attached: bug
2022-05-22 19:50 christos Assigned To => christos
2022-05-22 19:50 christos Status new => assigned
2022-05-22 20:01 christos Status assigned => feedback
2022-05-22 20:01 christos Note Added: 0003748