View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000186 | file | General | public | 2020-08-24 02:14 | 2022-04-13 07:16 |
Reporter | joveler | Assigned To | christos | ||
Priority | normal | Severity | minor | Reproducibility | always |
Status | confirmed | Resolution | fixed | ||
Product Version | 5.39 | ||||
Fixed in Version | 5.40 | ||||
Summary | 0000186: Korean text file misidentified as 'COM executable for DOS' | ||||
Description | [Summary] Some Korean text files encoded as EUC-KR (aka CP949 on Windows) are misidentified as 'COM executable for DOS'. Part of the COM signatures should be disabled to fix this. [Technical Detail] EUC-KR encodes 4% of Korean characters as 'B8xx' ('륫/B8A0' ~ '뫼/B8FE'). In libmagic, the simplest COM signature only checks for 0xB8 at offset 0. As a result, libmagic produces a false positive on any EUC-KR text that starts with one of these Korean characters. Windows Notepad (prior to Windows 10 v19H1) used ANSI encoding by default, which means almost every text file produced on Korean Windows is encoded as EUC-KR. This is therefore a critical issue for Korean text files, as many of them are misidentified as executables. [Fix] To reduce the negative impact, I propose disabling the simplest COM file signature. I have attached the diff file. | ||||
Steps To Reproduce | Run the file command on the attached euckr_falsepositive.txt. $ file euckr_falsepositive.txt euckr_falsepositive.txt: COM executable for DOS $ file euckr_falsepositive.txt --mime-type euckr_falsepositive.txt: application/x-dosexec | ||||
Tags | No tags attached. | ||||
|
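The false positive described above can be reproduced in miniature. The following is a minimal sketch, not libmagic's implementation: `naive_com_match` is a hypothetical stand-in for a rule that only tests the first byte, and both sample buffers are made up for illustration.

```python
# Hypothetical stand-in for a signature that only checks byte 0 for 0xB8.
# This is NOT libmagic's code; it only mirrors the rule described in the report.
def naive_com_match(buf: bytes) -> bool:
    return len(buf) > 0 and buf[0] == 0xB8


# A genuine 8086 sequence: MOV AX, 0x4C00 followed by INT 0x21 (DOS exit).
com_stub = bytes([0xB8, 0x00, 0x4C, 0xCD, 0x21])

# An EUC-KR character from row 0xB8, built from its byte pair so that no
# particular character has to be assumed.
korean_char = bytes([0xB8, 0xA1]).decode("euc_kr")
korean_text = (korean_char + " plain text\n").encode("euc_kr")

print(naive_com_match(com_stub))     # matches: a real COM-style prefix
print(naive_com_match(korean_text))  # also matches: the false positive
```

Both buffers satisfy the one-byte check, which is why a text file beginning with such a character is reported as an executable.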
euckr_falsepositive.txt (293 bytes)
������ This file is encoded as EUC-KR.
In EUC-KR, 4% of Korean characters are encoded as 'B8xx' ('륫/B8A0' ~ '뫼/B8FE').
The COM file signature of libmagic searches for '0xB8' at offset 0, causing a false positive on this file.

```shell
$ file custom.txt
custom.txt: COM executable for DOS
```

0001-Disable-simplest-COM-signature-to-avoid-FP.patch (1,869 bytes)
From 31245d71d9d279b649f5a13c2aee60525266d8f6 Mon Sep 17 00:00:00 2001
From: Hajin Jang <hajin_jang@worksmobile.com>
Date: Mon, 24 Aug 2020 11:02:34 +0900
Subject: [PATCH] Disable simplest COM signature to avoid FP

The simplest COM signature causes false-positive on EUC-KR text files.
Disable it to avoid misidentification.
---
 magic/Magdir/msdos | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/magic/Magdir/msdos b/magic/Magdir/msdos
index 8bf85892..0b7993ff 100644
--- a/magic/Magdir/msdos
+++ b/magic/Magdir/msdos
@@ -565,17 +565,19 @@
 # syslinux version (4.x)
 # "COM executable (COM32R)" or "Syslinux COM32 module" by TrID
 >>>1 lelong 0x21CD4CFe \b, relocatable)
-# remaining are DOS COM executables starting with assembler instruction MOV
-# like FreeDOS BANNER*.COM FINDDISK.COM GIF2RAW.COM WINCHK.COM
-# MS-DOS SYS.COM RESTART.COM
-# SYSLINUX.COM (version 1.40 - 2.13)
-# GFXBOOT.COM (version 3.75)
-# COPYBS.COM POWEROFF.COM INT18.COM
->>1 default x COM executable for DOS
-!:mime application/x-dosexec
-#!:mime application/x-ms-dos-executable
-#!:mime application/x-msdos-program
-!:ext com
+# Hajin Jang <hajin_jang@worksmobile.com>:
+# Disable simplest COM signature to prevent false positive on some EUC-KR text files.
+## remaining are DOS COM executables starting with assembler instruction MOV
+## like FreeDOS BANNER*.COM FINDDISK.COM GIF2RAW.COM WINCHK.COM
+## MS-DOS SYS.COM RESTART.COM
+## SYSLINUX.COM (version 1.40 - 2.13)
+## GFXBOOT.COM (version 3.75)
+## COPYBS.COM POWEROFF.COM INT18.COM
+#>>1 default x COM executable for DOS
+#!:mime application/x-dosexec
+##!:mime application/x-ms-dos-executable
+##!:mime application/x-msdos-program
+#!:ext com

 # URL: https://en.wikipedia.org/wiki/UPX
 # Reference: https://github.com/upx/upx/archive/v3.96.zip/upx-3.96/
--
2.28.0.windows.1 |
|
Patched, thanks! |
|
Will revert for now and revisit. Breaks too many com executables. Perhaps we can limit it on what follows b8? |
|
> Perhaps we can limit it on what follows b8?

I have tried, but it is not possible. In 8086 machine code, 0xB8 is the opcode for 'MOV AX, imm16'. Since the immediate can be any two bytes (stored little-endian), we cannot constrain what follows.
- B8 0A 16 -> MOV AX, 0x160A
- B8 40 00 -> MOV AX, 0x0040 |
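To make the encoding concrete, here is a small sketch of how a leading B8 sequence decodes. `decode_mov_ax` is a hypothetical helper written for this note, not part of libmagic or any disassembler. It relies only on the fact that the 16-bit immediate follows the opcode in little-endian order, so B8 0A 16 loads 0x160A.

```python
import struct


def decode_mov_ax(buf: bytes) -> str:
    """Decode a leading 'B8 lo hi' sequence as 8086 'MOV AX, imm16'."""
    if len(buf) < 3 or buf[0] != 0xB8:
        raise ValueError("buffer does not start with a complete B8 instruction")
    # The immediate is a 16-bit little-endian value right after the opcode.
    (imm16,) = struct.unpack_from("<H", buf, 1)
    return f"MOV AX, 0x{imm16:04X}"


print(decode_mov_ax(bytes([0xB8, 0x0A, 0x16])))  # MOV AX, 0x160A
print(decode_mov_ax(bytes([0xB8, 0x40, 0x00])))  # MOV AX, 0x0040
# The start of EUC-KR text is just as valid an instruction:
print(decode_mov_ax(bytes([0xB8, 0xA1, 0xB8])))  # MOV AX, 0xB8A1
```

Every possible pair of following bytes yields a legal instruction, so the two bytes after 0xB8 carry no information that could separate code from text.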
|
Every Extended Unix Code charset, such as EUC-JP, shares the same address space as EUC-KR (lead bytes fall in the 0xA1-0xFE range). Keeping the 0xB8 COM signature may therefore cause the same problem in every EUC charset. One idea is to use text/binary detection on the buffer, since EUC charsets avoid ASCII control characters. I do not yet know how libmagic's text detection works; would this require patching the code? |
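The text/binary idea could look roughly like the sketch below. It is an assumption-laden illustration, not libmagic's text detection: `looks_like_euc_text` is a hypothetical helper that accepts only printable ASCII, common whitespace, and two-byte pairs whose bytes both lie in 0xA1-0xFE. It does not handle EUC-JP single-shift sequences (0x8E, 0x8F) or CP949 extension bytes outside that range.

```python
def looks_like_euc_text(buf: bytes) -> bool:
    """Rough heuristic: ASCII text plus well-formed two-byte EUC pairs."""
    i = 0
    while i < len(buf):
        b = buf[i]
        if b in (0x09, 0x0A, 0x0D) or 0x20 <= b <= 0x7E:
            i += 1  # printable ASCII or common whitespace
        elif (0xA1 <= b <= 0xFE and i + 1 < len(buf)
              and 0xA1 <= buf[i + 1] <= 0xFE):
            i += 2  # two-byte EUC character
        else:
            return False  # control byte, stray high byte, or truncated pair
    return len(buf) > 0


euc_text = bytes([0xB8, 0xA1]) + b" plain text\n"
com_stub = bytes([0xB8, 0x00, 0x4C, 0xCD, 0x21])  # MOV AX, 0x4C00; INT 0x21

print(looks_like_euc_text(euc_text))  # True
print(looks_like_euc_text(com_stub))  # False
```

A check like this would only be useful as a guard in front of the 0xB8 rule, and it needs a reasonably long buffer: a short binary whose first bytes happen to be high-range pairs would still pass.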
Date Modified | Username | Field | Change |
---|---|---|---|
2020-08-24 02:14 | joveler | New Issue | |
2020-08-24 02:14 | joveler | File Added: euckr_falsepositive.txt | |
2020-08-24 02:14 | joveler | File Added: 0001-Disable-simplest-COM-signature-to-avoid-FP.patch | |
2020-09-06 15:14 | christos | Assigned To | => christos |
2020-09-06 15:14 | christos | Status | new => assigned |
2020-09-06 15:14 | christos | Status | assigned => resolved |
2020-09-06 15:14 | christos | Resolution | open => fixed |
2020-09-06 15:14 | christos | Fixed in Version | => 5.40 |
2020-09-06 15:14 | christos | Note Added: 0003482 | |
2021-10-12 18:24 | christos | Status | resolved => confirmed |
2021-10-12 18:24 | christos | Note Added: 0003648 | |
2022-04-13 06:54 | joveler | Note Added: 0003734 | |
2022-04-13 07:16 | joveler | Note Added: 0003735 |