View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0000337||file||General||public||2022-04-05 19:06||2022-10-09 18:55|
|Summary||0000337: Data files identified as text, text as data|
|Description||new bug, based on previous tickets|
I used the latest sources (HEAD) from GitHub.
I understand that correctly identifying text files can be complicated, but I hope it is possible.
I have attached more files for testing.
Folders in archive:
data - binaries (files encrypted with ransomware) detected as `Unicode text, UTF-16, big-endian text, with no line terminators`
text - correct text files detected as data
CORRECT_DETECTED_AS_DATA - one file similar to the others (in data folder) but identified correctly.
If needed, I can provide more files.
|Steps To Reproduce||Scan the files from the attachment with `file` built from the latest sources (HEAD)|
bug_new.zip (166,197 bytes)
It all has to do with the UCS-16 detection in looks_ucs16 in encoding.c. If you change https://github.com/file/file/blob/master/src/encoding.c#L480 to 'return 0;', i.e. ignore all UCS-16 files that don't have a BOM, then the mis-detected text files will be fixed (but then we'll miss UCS-16 files without a BOM).
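To make the BOM trade-off concrete, here is a minimal Python sketch of a BOM check. It is not the code in encoding.c; the function name `utf16_bom` is hypothetical and only illustrates what "has a BOM" means for UTF-16 input.

```python
import codecs
from typing import Optional


def utf16_bom(data: bytes) -> Optional[str]:
    """Return the byte order announced by a UTF-16 BOM, or None if there is none.

    Hypothetical helper for illustration only. Note that a UTF-32 little-endian
    BOM (FF FE 00 00) also starts with the UTF-16 little-endian BOM, so a real
    detector would have to rule that case out first.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little-endian"
    return None
```

Without a BOM, almost any even-length byte sequence can be read as 16-bit code units, which is why encrypted files can look like UTF-16 text to a heuristic and why rejecting BOM-less input trades one kind of mis-detection for another.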
The Estonian and Hebrew files contain invalid low surrogate characters (0xdc00 and 0xde05 respectively). If you comment out https://github.com/file/file/blob/master/src/encoding.c#L516, they succeed.
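For anyone who wants to locate such code units in their own files, the following Python sketch scans 16-bit units for surrogates that are not part of a valid high/low pair. It is an illustration under assumptions, not the check in encoding.c: the name `unpaired_surrogates` is hypothetical, the input is assumed to be BOM-less UTF-16, and a trailing odd byte is ignored.

```python
from typing import List, Tuple


def unpaired_surrogates(data: bytes, big_endian: bool = True) -> List[Tuple[int, int]]:
    """Return (byte_offset, code_unit) for every surrogate not in a valid pair."""
    order = "big" if big_endian else "little"
    # Split the buffer into 16-bit code units; a trailing odd byte is dropped.
    units = [int.from_bytes(data[i:i + 2], order) for i in range(0, len(data) - 1, 2)]
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append((i * 2, unit))
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            bad.append((i * 2, unit))
        i += 1
    return bad
```

Calling it on the raw bytes of a file (after stripping any BOM) lists the offsets worth inspecting in a hex editor.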
The English.tr file has two 0x13 (^S) characters, which is why it fails. The rest have some 0x7f (DEL) characters, which is why they fail. If you comment out https://github.com/file/file/blob/master/src/encoding.c#L511, they all succeed.
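A quick way to find such characters in a decoded file is sketched below in Python. The set of control characters treated as acceptable is an assumption made for this example, not a copy of the table used by `file`, and the name `suspicious_controls` is hypothetical.

```python
from typing import List, Tuple

# Assumption for this sketch: these control characters are considered normal
# in text (BEL, BS, TAB, LF, VT, FF, CR, ESC). Adjust to taste.
ALLOWED_CONTROLS = {0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x1B}


def suspicious_controls(text: str) -> List[Tuple[int, int]]:
    """Return (index, code_point) for control characters unusual in text."""
    hits = []
    for index, char in enumerate(text):
        code = ord(char)
        is_c0_control = code < 0x20 and code not in ALLOWED_CONTROLS
        if is_c0_control or code == 0x7F:  # 0x7F is DEL
            hits.append((index, code))
    return hits
```

For a UTF-16 file, decode the bytes first (for example with `data.decode("utf-16", errors="replace")`) and pass the resulting string, so that the zero bytes of the encoding itself are not reported.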
Thank you for the explanation. I am now trying to use a third-party library (chardet / charset-normalizer) to detect text files.
But what about the files in the `data` folder? They are encrypted but still detected as Unicode text?
||That is covered by my first sentence (about UCS-16 files without a BOM). With the change at line 480, they will all be reported as data.|
Oh, sorry, I didn't check correctly. Now I get `data` as I want, and I can handle those files in post-processing with `charset-normalizer` (pypi.org/project/charset-normalizer).
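The post-processing step could look roughly like the Python sketch below. It is only one possible arrangement: the function names `refine` and `default_detector` are hypothetical, the `text (<encoding>)` label is made up for this example, and `default_detector` needs the third-party charset-normalizer package (it uses `from_bytes(...).best()`).

```python
from typing import Callable, Optional


def default_detector(data: bytes) -> Optional[str]:
    """Ask charset-normalizer for the most plausible encoding, if any."""
    # Imported lazily so the rest of the sketch works without the package.
    from charset_normalizer import from_bytes

    best = from_bytes(data).best()
    return best.encoding if best is not None else None


def refine(file_verdict: str, data: bytes,
           detect: Callable[[bytes], Optional[str]] = default_detector) -> str:
    """Keep the verdict from `file` unless it is plain 'data'.

    Only files reported as 'data' get a second opinion from the detector;
    if the detector finds no encoding either, the verdict stays 'data'.
    """
    if file_verdict.strip() != "data":
        return file_verdict
    encoding = detect(data)
    return f"text ({encoding})" if encoding else "data"
```

The detector is passed in as a parameter so that chardet or any other library can be substituted without touching the rest of the logic.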
Could I ask for a future option (perhaps post-processing for possible text files, or a replacement for the current text detection) that gives results as good as charset-normalizer's?
For all encrypted files (`data` folder) we get the correct `undefined` output:
normalizer -m *
Unable to identify originating encoding for "encry-https___download.eclipse.org_eclipse_updates_4.22_R-4.22-202111241800_p2.index". Maybe try increasing maximum amount of chaos.
Unable to identify originating encoding for "encry-https___download.eclipse.org_technology_epp_packages_2021-12_202112021200_p2.index". Maybe try increasing maximum amount of chaos.
Unable to identify originating encoding for "encry-https___download.eclipse.org_tools_cdt_releases_10.5_p2.index". Maybe try increasing maximum amount of chaos.
Unable to identify originating encoding for "encry-led-20-headers.h". Maybe try increasing maximum amount of chaos.
Unable to identify originating encoding for "encry-pom.properties". Maybe try increasing maximum amount of chaos.
and also for the files in the `text` folder:
normalizer -m *
As you can see, all files are detected with 'closest to perfect' output.
||Seems to be fine now.|
|2022-04-05 19:06||jmp3r||New Issue|
|2022-04-05 19:06||jmp3r||File Added: bug_new.zip|
|2022-04-05 19:06||jmp3r||Tag Attached: bug|
|2022-05-22 19:50||christos||Assigned To||=> christos|
|2022-05-22 19:50||christos||Status||new => assigned|
|2022-05-22 20:01||christos||Status||assigned => feedback|
|2022-05-22 20:01||christos||Note Added: 0003748|
|2022-05-27 22:56||jmp3r||Note Added: 0003750|
|2022-05-27 22:56||jmp3r||Status||feedback => assigned|
|2022-05-28 00:27||christos||Note Added: 0003751|
|2022-05-28 01:02||jmp3r||Note Added: 0003752|
|2022-10-09 18:55||christos||Status||assigned => resolved|
|2022-10-09 18:55||christos||Resolution||open => fixed|
|2022-10-09 18:55||christos||Note Added: 0003832|