View Issue Details

ID: 0000337
Project: file
Category: General
View Status: public
Last Update: 2022-10-09 18:55
Reporter: jmp3r
Assigned To: christos
Priority: normal
Severity: major
Reproducibility: always
Status: resolved
Resolution: fixed
Product Version: 5.41
Summary: 0000337: Data files identified as text, text as data
Description: New bug, based on previous tickets 0000319 and 0000334.

I used the latest sources (HEAD) from GitHub.

I understand that correctly identifying text files can be complicated, but I hope it is possible.

I have attached more files for testing.

Folders in the archive:
data - binaries (files encrypted with ransomware) detected as `Unicode text, UTF-16, big-endian text, with no line terminators`
text - valid text files detected as data
CORRECT_DETECTED_AS_DATA - one file similar to those in the `data` folder, but correctly identified as data.

If needed, I can provide more files.
Steps To Reproduce: Scan the attached files with `file` built from the latest sources (HEAD).
Tags: bug

Activities

jmp3r

2022-04-05 19:06

reporter  

bug_new.zip (166,197 bytes)

christos

2022-05-22 20:01

manager   ~0003748

It all has to do with the ucs16 detection in looks_ucs16 in encoding.c. If you change https://github.com/file/file/blob/master/src/encoding.c#L480 to 'return 0;', i.e. ignore all ucs16 files that don't have a BOM, then the files mis-detected as text will get fixed (but then we'll miss ucs16 files without a BOM).
The Estonian and Hebrew files have invalid low surrogate characters (dc00 and de05 respectively). If you comment out https://github.com/file/file/blob/master/src/encoding.c#L516, they succeed.
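
To illustrate why a stray dc00 or de05 makes a UTF-16 candidate invalid, here is a minimal Python sketch of a surrogate pairing check. It is not the code in encoding.c; the function name `surrogates_well_formed` is made up for this example, and it works on a list of already-decoded 16-bit code units.

```python
def surrogates_well_formed(units):
    """Return True if every surrogate code unit is part of a high+low pair."""
    expecting_low = False
    for unit in units:
        is_high = 0xD800 <= unit <= 0xDBFF
        is_low = 0xDC00 <= unit <= 0xDFFF
        if expecting_low:
            # A high surrogate must be followed immediately by a low one.
            if not is_low:
                return False
            expecting_low = False
        elif is_high:
            expecting_low = True
        elif is_low:
            # Low surrogate with no preceding high surrogate (e.g. dc00, de05).
            return False
    # A trailing high surrogate with nothing after it is also malformed.
    return not expecting_low


print(surrogates_well_formed([0x0041, 0xDC00]))  # False: lone low surrogate
print(surrogates_well_formed([0xD83D, 0xDE05]))  # True: valid pair
```

A strict check like this rejects the whole buffer on the first unpaired surrogate, which matches the behaviour described above: one bad code unit is enough for the file to stop being reported as UTF-16 text.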

The English.tr file has two 0x13 (^S) characters, which is why it fails. The rest have some 0x7f (DEL) characters, which is why they fail. If you comment out https://github.com/file/file/blob/master/src/encoding.c#L511, they all succeed.
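
A quick way to see which code units trip a check of this kind is to list the control characters in a buffer. The sketch below is only an approximation: `ALLOWED_CONTROLS` is an assumed allow-list chosen for this example (the real table in encoding.c may differ), and `disallowed_controls` is a hypothetical helper name.

```python
# Assumed allow-list: BEL, BS, HT, LF, VT, FF, CR and ESC.
# The actual table used by `file` may differ.
ALLOWED_CONTROLS = {0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x1B}


def disallowed_controls(units):
    """Return (index, code unit) for each control character not in the allow-list."""
    bad = []
    for index, unit in enumerate(units):
        is_control = unit < 0x20 or unit == 0x7F
        if is_control and unit not in ALLOWED_CONTROLS:
            bad.append((index, unit))
    return bad


# 'H', ^S, 'i', DEL, LF: the ^S and DEL are reported, the line feed is not.
print(disallowed_controls([0x48, 0x13, 0x69, 0x7F, 0x0A]))  # [(1, 19), (3, 127)]
```

Run against the decoded code units of a failing file, the returned positions show exactly which characters (such as 0x13 or 0x7f) cause the rejection.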

jmp3r

2022-05-27 22:56

reporter   ~0003750

Thank you for the explanation. I'm now trying third-party tools (chardet / charset-normalizer) to detect text files.

But what about the files in the `data` folder? They are encrypted but still detected as Unicode text.

christos

2022-05-28 00:27

manager   ~0003751

That is covered by the first sentence of my previous note (about UCS16 files without BOM). If you comment out line 480, they will all return data.
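
For comparison, a BOM-only policy (treat a buffer as UTF-16 only when it starts with a byte order mark) can be sketched in a few lines of Python. This is an illustration of the trade-off, not the logic in encoding.c; `utf16_bom` is a hypothetical name, and the sketch ignores the fact that a UTF-32 little-endian BOM begins with the same two bytes.

```python
import codecs


def utf16_bom(data):
    """Return the byte order announced by a UTF-16 BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF16_LE):  # b"\xff\xfe"
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):  # b"\xfe\xff"
        return "big-endian"
    return None


print(utf16_bom(b"\xff\xfeh\x00i\x00"))  # little-endian
print(utf16_bom(b"\x00h\x00i"))          # None: valid UTF-16BE text, but no BOM
```

The second call shows the cost: random or encrypted bytes are no longer mistaken for text, but genuine UTF-16 text without a BOM is missed as well.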

jmp3r

2022-05-28 01:02

reporter   ~0003752

Oh, sorry, I didn't check correctly. Now I get `data` as I want and can handle those files in post-processing using `charset-normalizer` (pypi.org/project/charset-normalizer).

Can I ask for a future option (maybe post-processing for possible text files, or a replacement for the current text detection) that would give results as good as charset-normalizer?

For all encrypted files (`data` folder) we get the correct `undefined` output:

normalizer -m *
Unable to identify originating encoding for "encry-https___download.eclipse.org_eclipse_updates_4.22_R-4.22-202111241800_p2.index". Maybe try increasing maximum amount of chaos.
Unable to identify originating encoding for "encry-https___download.eclipse.org_technology_epp_packages_2021-12_202112021200_p2.index". Maybe try increasing maximum amount of chaos.
Unable to identify originating encoding for "encry-https___download.eclipse.org_tools_cdt_releases_10.5_p2.index". Maybe try increasing maximum amount of chaos.
Unable to identify originating encoding for "encry-led-20-headers.h". Maybe try increasing maximum amount of chaos.
Unable to identify originating encoding for "encry-pom.properties". Maybe try increasing maximum amount of chaos.
undefined
undefined
undefined
undefined
undefined

and also for the files in the `text` folder:
normalizer -m *
utf_16
utf_16_le
utf_16
utf_16_le
utf_16
utf_16
utf_16

As you can see, the output for all files is close to perfect.
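
The post-processing step described in this note could look roughly like the Python sketch below. It assumes the verdict from `file` is already available as a string, and it takes the second detector as a callable so that no particular library is required; `refine_verdict` and `detect_encoding` are hypothetical names.

```python
def refine_verdict(file_verdict, payload, detect_encoding):
    """Keep the verdict from `file` unless it is 'data'; then ask a second detector.

    detect_encoding is any callable that takes bytes and returns an
    encoding name, or None when it cannot identify one.
    """
    if file_verdict != "data":
        return file_verdict
    encoding = detect_encoding(payload)
    if encoding is None:
        return "data"
    return f"text ({encoding})"


# Stand-in detector for demonstration only: accepts strict UTF-16LE and nothing else.
def toy_detector(payload):
    try:
        payload.decode("utf-16-le")
    except UnicodeDecodeError:
        return None
    return "utf_16_le"


print(refine_verdict("data", "hi".encode("utf-16-le"), toy_detector))  # text (utf_16_le)
print(refine_verdict("data", b"\x00\xdc", toy_detector))               # data
```

With charset-normalizer installed, the callable could wrap `charset_normalizer.from_bytes(payload).best()` and return the match's `encoding` attribute, or None when there is no match.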

christos

2022-10-09 18:55

manager   ~0003832

Seems to be fine now.

Issue History

Date Modified Username Field Change
2022-04-05 19:06 jmp3r New Issue
2022-04-05 19:06 jmp3r File Added: bug_new.zip
2022-04-05 19:06 jmp3r Tag Attached: bug
2022-05-22 19:50 christos Assigned To => christos
2022-05-22 19:50 christos Status new => assigned
2022-05-22 20:01 christos Status assigned => feedback
2022-05-22 20:01 christos Note Added: 0003748
2022-05-27 22:56 jmp3r Note Added: 0003750
2022-05-27 22:56 jmp3r Status feedback => assigned
2022-05-28 00:27 christos Note Added: 0003751
2022-05-28 01:02 jmp3r Note Added: 0003752
2022-10-09 18:55 christos Status assigned => resolved
2022-10-09 18:55 christos Resolution open => fixed
2022-10-09 18:55 christos Note Added: 0003832