View Issue Details

ID: 0000186
Project: file
Category: General
View Status: public
Last Update: 2022-04-13 07:16
Reporter: joveler
Assigned To: christos
Priority: normal
Severity: minor
Reproducibility: always
Status: confirmed
Resolution: fixed
Product Version: 5.39
Fixed in Version: 5.40
Summary: 0000186: Korean text file misidentified as 'COM executable for DOS'
Description:
[Summary]
Some Korean text files encoded as EUC-KR (aka CP949 on Windows) are misidentified as 'COM executable for DOS'.
Part of the COM signature set should be disabled to fix this.

[Technical Detail]
EUC-KR encodes about 4% of Korean characters as 'B8xx' ('륫/B8A0' ~ '뫼/B8FE').
In libmagic, the simplest COM signature only checks for 0xB8 at offset 0.
As a result, libmagic reports a false positive on any EUC-KR text that starts with one of these Korean characters.
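
The proportion above can be checked with a short Python sketch. It assumes Python's built-in `euc_kr` codec and counts only the precomposed Hangul syllables that encode to a plain two-byte code. The helper `simplest_com_check` is a hypothetical stand-in for the one-byte test described here, not libmagic's actual implementation.

```python
def euc_kr_lead_byte_counts():
    """Count two-byte EUC-KR Hangul syllables per lead byte."""
    counts = {}
    for cp in range(0xAC00, 0xD7A4):  # precomposed Hangul syllables
        try:
            encoded = chr(cp).encode("euc_kr")
        except UnicodeEncodeError:
            continue  # syllable not representable in this codec
        if len(encoded) != 2:
            continue  # skip anything that is not a plain two-byte code
        counts[encoded[0]] = counts.get(encoded[0], 0) + 1
    return counts


def simplest_com_check(data: bytes) -> bool:
    """Hypothetical stand-in for the simplest COM test: only byte 0 is examined."""
    return data[:1] == b"\xb8"


counts = euc_kr_lead_byte_counts()
total = sum(counts.values())
b8_count = counts.get(0xB8, 0)
print(f"{b8_count} of {total} two-byte syllables start with 0xB8 ({b8_count / total:.1%})")
```

Any text whose first character falls in that group passes `simplest_com_check`, which is the false positive reported here.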

Windows Notepad (prior to Windows 10 version 19H1) used the ANSI encoding by default.
This means almost every text file produced on Korean Windows is encoded as EUC-KR.
It is therefore a critical issue for Korean text files, as many of them are misidentified as executables.

[Fix]
To reduce the negative impact, I propose disabling the simplest COM file signature.
I have attached the diff file.

Steps To Reproduce:
Run the file command on the attached euckr_falsepositive.txt.

$ file euckr_falsepositive.txt
euckr_falsepositive.txt: COM executable for DOS

$ file euckr_falsepositive.txt --mime-type
euckr_falsepositive.txt: application/x-dosexec
Tags: No tags attached.

Activities

joveler

2020-08-24 02:14

reporter  

euckr_falsepositive.txt (293 bytes)   
������

This file is encoded as EUC-KR. In EUC-KR, 4% of Korean characters are encoded as 'B8xx' ('륫/B8A0' ~ '뫼/B8FE').
The COM file signature of libmagic searches for '0xB8' at offset 0, causing a false positive on this file.

```shell
$ file custom.txt
custom.txt: COM executable for DOS
```
0001-Disable-simplest-COM-signature-to-avoid-FP.patch (1,869 bytes)   
From 31245d71d9d279b649f5a13c2aee60525266d8f6 Mon Sep 17 00:00:00 2001
From: Hajin Jang <hajin_jang@worksmobile.com>
Date: Mon, 24 Aug 2020 11:02:34 +0900
Subject: [PATCH] Disable simplest COM signature to avoid FP

The simplest COM signature causes false-positive on EUC-KR text files.
Disable it to avoid misidentification.
---
 magic/Magdir/msdos | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/magic/Magdir/msdos b/magic/Magdir/msdos
index 8bf85892..0b7993ff 100644
--- a/magic/Magdir/msdos
+++ b/magic/Magdir/msdos
@@ -565,17 +565,19 @@
 # syslinux version (4.x)
 # "COM executable (COM32R)" or "Syslinux COM32 module" by TrID
 >>>1	lelong		0x21CD4CFe	\b, relocatable)
-# remaining are DOS COM executables starting with assembler instruction MOV
-# like FreeDOS BANNER*.COM FINDDISK.COM GIF2RAW.COM WINCHK.COM
-# MS-DOS SYS.COM RESTART.COM
-# SYSLINUX.COM (version 1.40 - 2.13)
-# GFXBOOT.COM (version 3.75)
-# COPYBS.COM POWEROFF.COM INT18.COM
->>1	default	x			COM executable for DOS
-!:mime	application/x-dosexec
-#!:mime	application/x-ms-dos-executable
-#!:mime	application/x-msdos-program
-!:ext com
+# Hajin Jang <hajin_jang@worksmobile.com>:
+# Disable simplest COM signature to prevent false positive on some EUC-KR text files.
+## remaining are DOS COM executables starting with assembler instruction MOV
+## like FreeDOS BANNER*.COM FINDDISK.COM GIF2RAW.COM WINCHK.COM
+## MS-DOS SYS.COM RESTART.COM
+## SYSLINUX.COM (version 1.40 - 2.13)
+## GFXBOOT.COM (version 3.75)
+## COPYBS.COM POWEROFF.COM INT18.COM
+#>>1	default	x			COM executable for DOS
+#!:mime	application/x-dosexec
+##!:mime	application/x-ms-dos-executable
+##!:mime	application/x-msdos-program
+#!:ext com
 
 # URL:		https://en.wikipedia.org/wiki/UPX
 # Reference:	https://github.com/upx/upx/archive/v3.96.zip/upx-3.96/
-- 
2.28.0.windows.1

christos

2020-09-06 15:14

manager   ~0003482

Patched, thanks!

christos

2021-10-12 18:24

manager   ~0003648

Will revert for now and revisit. Breaks too many com executables. Perhaps we can limit it on what follows b8?

joveler

2022-04-13 06:54

reporter   ~0003734

> Perhaps we can limit it on what follows b8?

I have tried, but it is impossible.

In 8086 machine code, 0xB8 is the opcode for 'MOV AX, imm16'.
Since the immediate can be any two bytes (stored little-endian), we cannot constrain what follows.
- B8 0A 16 -> MOV AX, 0x160A
- B8 40 00 -> MOV AX, 0x0040
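
As a rough illustration of why the two bytes after 0xB8 cannot be constrained, here is a minimal Python sketch that decodes the instruction. The function name `decode_mov_ax` is made up for this example; it handles only the three-byte `MOV AX, imm16` form and reads the immediate as a little-endian 16-bit value.

```python
import struct


def decode_mov_ax(code: bytes) -> str:
    """Decode a 3-byte 'MOV AX, imm16' instruction (opcode 0xB8)."""
    if len(code) < 3 or code[0] != 0xB8:
        raise ValueError("not a MOV AX, imm16 instruction")
    # The 16-bit immediate follows the opcode in little-endian byte order.
    (imm,) = struct.unpack_from("<H", code, 1)
    return f"MOV AX, 0x{imm:04X}"


# The last sample is 0xB8 followed by two arbitrary high bytes, as in EUC text.
for raw in (bytes.fromhex("b80a16"), bytes.fromhex("b84000"), bytes.fromhex("b8feb8")):
    print(raw.hex(), "->", decode_mov_ax(raw))
```

Every two-byte value decodes to a valid immediate, including bytes in the range used by EUC-encoded text, so the operand alone gives no signal.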

joveler

2022-04-13 07:16

reporter   ~0003735

Every Extended Unix Code (EUC) charset, such as EUC-JP, uses the same byte range as EUC-KR (bytes in the 0xA0-0xFF range).
Keeping the 0xB8 COM signature may therefore cause the same problem with every EUC charset.

One idea is to use text/binary detection on the buffer, since EUC charsets avoid ASCII control characters.
I do not yet know how libmagic's text detection works. Wouldn't that involve patching the code?
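
A minimal Python sketch of that idea follows. It is not libmagic's text detection, which is not described in this thread; the allowed control set and the name `has_binary_control_bytes` are assumptions made for illustration.

```python
# Control bytes that commonly appear in plain text (assumed set for this sketch):
# BEL, BS, TAB, LF, VT, FF, CR, ESC.
TEXT_CONTROLS = frozenset({0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x1B})


def has_binary_control_bytes(buf: bytes) -> bool:
    """Return True if buf contains an ASCII control byte that text rarely uses."""
    return any(b < 0x20 and b not in TEXT_CONTROLS for b in buf)


# Bytes in the EUC range followed by ASCII text and a CRLF line ending.
euc_like_text = b"\xb8\xfe\xb8\xa1 plain text\r\n"
# MOV AX, 0x4C00 ; INT 21h  (a typical DOS program exit sequence)
com_like_code = bytes.fromhex("b8004ccd21")

print(has_binary_control_bytes(euc_like_text))
print(has_binary_control_bytes(com_like_code))
```

A check like this would only be a heuristic: a COM file whose first bytes happen to contain no control characters would still look like text, so a real implementation would need to scan a larger buffer.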

Issue History

Date Modified Username Field Change
2020-08-24 02:14 joveler New Issue
2020-08-24 02:14 joveler File Added: euckr_falsepositive.txt
2020-08-24 02:14 joveler File Added: 0001-Disable-simplest-COM-signature-to-avoid-FP.patch
2020-09-06 15:14 christos Assigned To => christos
2020-09-06 15:14 christos Status new => assigned
2020-09-06 15:14 christos Status assigned => resolved
2020-09-06 15:14 christos Resolution open => fixed
2020-09-06 15:14 christos Fixed in Version => 5.40
2020-09-06 15:14 christos Note Added: 0003482
2021-10-12 18:24 christos Status resolved => confirmed
2021-10-12 18:24 christos Note Added: 0003648
2022-04-13 06:54 joveler Note Added: 0003734
2022-04-13 07:16 joveler Note Added: 0003735