View Issue Details

ID: 0000427
Project: file
Category: General
View Status: public
Last Update: 2023-05-09 18:08
Reporter: a1rind
Assigned To: christos
Priority: high
Severity: major
Reproducibility: always
Status: resolved
Resolution: reopened
Product Version: 5.44
Fixed in Version: 5.45
Summary: 0000427: docx file is determined as zip
Description:

Hi!

There is an OOXML-format docx file that is being detected as application/zip. Unfortunately I cannot share the document yet, but I have some debug info that will hopefully help.

`zipinfo` lists the following directories/files:

```
Zip file size: 36239 bytes, number of entries: 36
-rw---- 4.5 fat 399 b- stor 80-Jan-01 00:00 [trash]/0000.dat
-rw---- 4.5 fat 739 b- defN 80-Jan-01 00:00 _rels/.rels
-rw---- 4.5 fat 41347 b- defN 80-Jan-01 00:00 word/document.xml
-rw---- 4.5 fat 1116 b- defN 80-Jan-01 00:00 docProps/app.xml
-rw---- 4.5 fat 381 b- stor 80-Jan-01 00:00 [trash]/0002.dat
-rw---- 4.5 fat 269 b- stor 80-Jan-01 00:00 [trash]/0003.dat
-rw---- 4.5 fat 450 b- stor 80-Jan-01 00:00 [trash]/0001.dat
-rw---- 4.5 fat 288 b- defN 80-Jan-01 00:00 word/_rels/header1.xml.rels
-rw---- 4.5 fat 288 b- defN 80-Jan-01 00:00 word/_rels/header3.xml.rels
-rw---- 4.5 fat 3225 b- defN 80-Jan-01 00:00 word/fontTable.xml
-rw---- 4.5 fat 2864 b- defN 80-Jan-01 00:00 word/footer1.xml
-rw---- 4.5 fat 3380 b- defN 80-Jan-01 00:00 word/header1.xml
-rw---- 4.5 fat 9807 b- defN 80-Jan-01 00:00 word/header2.xml
-rw---- 4.5 fat 3380 b- defN 80-Jan-01 00:00 word/header3.xml
-rw---- 4.5 fat 680 b- defN 80-Jan-01 00:00 word/media/image1.wmf
-rw---- 4.5 fat 38367 b- defN 80-Jan-01 00:00 word/numbering.xml
-rw---- 4.5 fat 9410 b- defN 80-Jan-01 00:00 word/settings.xml
-rw---- 4.5 fat 31843 b- defN 80-Jan-01 00:00 word/styles.xml
-rw---- 4.5 fat 6992 b- defN 80-Jan-01 00:00 word/theme/theme1.xml
-rw---- 4.5 fat 483 b- defN 80-Jan-01 00:00 word/webSettings.xml
-rw---- 4.5 fat 1768 b- stor 80-Jan-01 00:00 [trash]/0005.dat
-rw---- 4.5 fat 296 b- defS 80-Jan-01 00:00 customXml/_rels/item1.xml.rels
-rw---- 4.5 fat 201 b- defS 80-Jan-01 00:00 customXml/itemProps2.xml
-rw---- 4.5 fat 219 b- defS 80-Jan-01 00:00 customXml/item2.xml
-rw---- 4.5 fat 201 b- defS 80-Jan-01 00:00 customXml/itemProps1.xml
-rw---- 4.5 fat 296 b- defS 80-Jan-01 00:00 customXml/_rels/item2.xml.rels
-rw---- 4.5 fat 443 b- stor 80-Jan-01 00:00 [trash]/0004.dat
-rw---- 4.5 fat 2383 b- defN 80-Jan-01 00:00 word/_rels/document.xml.rels
-rw---- 4.5 fat 236 b- stor 80-Jan-01 00:00 [trash]/0006.dat
-rw---- 4.5 fat 201 b- defS 80-Jan-01 00:00 customXml/itemProps3.xml
-rw---- 4.5 fat 296 b- defS 80-Jan-01 00:00 customXml/_rels/item3.xml.rels
-rw---- 4.5 fat 775 b- defN 80-Jan-01 00:00 docProps/core.xml
-rw---- 4.5 fat 563 b- defN 80-Jan-01 00:00 docProps/custom.xml
-rw---- 4.5 fat 2530 b- defN 80-Jan-01 00:00 [Content_Types].xml
-rw---- 4.5 fat 11932 b- defS 80-Jan-01 00:00 customXml/item1.xml
-rw---- 4.5 fat 587 b- defS 80-Jan-01 00:00 customXml/item3.xml
36 files, 178635 bytes uncompressed, 31036 bytes compressed: 82.6%
```

The first file listed comes from a [trash] directory, e.g. [trash]/0000.dat, and the regex at line 36 here (https://github.com/file/file/blob/master/magic/Magdir/msooxml#L36) isn't expecting such a file.
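
For reference, offset 0x1E (30) is where the name of the first entry begins in a ZIP local file header, which is why the rule tests the name there. A minimal Python sketch of reading that name follows; the helper `first_entry_name` is hypothetical, and the small in-memory archive only stands in for the real document:

```python
import io
import struct
import zipfile


def first_entry_name(data: bytes) -> str:
    """Return the file name stored in the first ZIP local file header."""
    # A local file header starts with the signature PK\x03\x04.
    if data[:4] != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    # Name length and extra-field length are little-endian 16-bit values
    # at offsets 26 and 28; the name itself starts at offset 30 (0x1E).
    name_len, _extra_len = struct.unpack_from("<HH", data, 26)
    return data[0x1E:0x1E + name_len].decode("utf-8", errors="replace")


# Stand-in archive (not the reporter's file) whose first entry is a trash item.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[trash]/0000.dat", b"discarded")
    zf.writestr("[Content_Types].xml", b"<Types/>")

print(first_entry_name(buf.getvalue()))
```

With a trash item stored first, the name found at 0x1E is the trash item's name rather than one of the names the regex lists.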

Furthermore, according to the OOXML specification, a trash directory can exist:

> Trash items represent parts that have been discarded or are no longer in use. Trash items shall not conform to
> OPC part naming guidelines as defined in ECMA-376-2 and shall not be associated with a content type. All trash
> items shall follow the naming scheme: [trash]/HHHH.dat where H represents a hexadecimal digit.
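
The quoted naming scheme can be checked with a short pattern. This is only a sketch in Python: the pattern and the helper `is_trash_item` are my own reading of "[trash]/HHHH.dat", and accepting both upper- and lower-case hex digits is an assumption:

```python
import re

# Assumed pattern for the quoted scheme "[trash]/HHHH.dat", where H is a
# hexadecimal digit (both letter cases are accepted here).
TRASH_ITEM = re.compile(r"\[trash\]/[0-9A-Fa-f]{4}\.dat")


def is_trash_item(name: str) -> bool:
    """True if the whole entry name follows the trash naming scheme."""
    return TRASH_ITEM.fullmatch(name) is not None


names = ["[trash]/0000.dat", "_rels/.rels", "word/document.xml", "[trash]/0006.dat"]
# Keep only the entries that are not trash items.
print([n for n in names if not is_trash_item(n)])
```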

As far as I understand, the msooxml magic rules expect the files in a certain order, so that they can identify the correct content type from magic bytes at fixed offsets in the file. The presence of trash items causes this to fail.

Any tips and tricks to skip over trash items?

Thanks!
Tags: bug, magic

Activities

a1rind

2023-02-28 16:46

reporter   ~0003898

Hi! Any thoughts on this?

christos

2023-03-05 19:52

manager   ~0003903

Does this diff fix it?

--- msooxml 16 Aug 2022 11:16:39 -0000 1.18
+++ msooxml 5 Mar 2023 19:51:25 -0000
@@ -33,7 +33,7 @@
 # make sure the first file is correct
 >0x1E use msooxml
 >0x1E default x
->>0x1E regex \\[Content_Types\\]\\.xml|_rels/\\.rels|docProps|customXml
+>>0x1E regex \\[trash\\]|\\[Content_Types\\]\\.xml|_rels/\\.rels|docProps|customXml
 # skip to the second local file header
 # since some documents include a 520-byte extra field following the file
 # header, we need to scan for the next header
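
To see what the changed alternation does to individual entry names, here is a rough check in Python. It is an approximation only: Python's `re` stands in for the regex engine used by file, and I am assuming the doubled backslashes in the magic file each stand for a single regex backslash:

```python
import re

# Alternations copied from the diff, with the doubled backslashes
# reduced to ordinary regex escaping (an assumption, see above).
OLD = re.compile(r"\[Content_Types\]\.xml|_rels/\.rels|docProps|customXml")
NEW = re.compile(r"\[trash\]|\[Content_Types\]\.xml|_rels/\.rels|docProps|customXml")

for name in ["[trash]/0000.dat", "_rels/.rels", "word/document.xml"]:
    # Columns: entry name, matched by old pattern, matched by new pattern.
    print(name, bool(OLD.search(name)), bool(NEW.search(name)))
```

Only the trash entry changes from unmatched to matched. This says nothing about the rules that run after the regex, which still read the entries that follow.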

a1rind

2023-03-09 10:43

reporter   ~0003909

Hi!

Thanks for the response. The suggested change doesn't fix the problem. I think we need to skip the trash files and make the logic after the regex read bytes from the expected file header. As you can see, the trash files are not in any particular order; they can appear anywhere, not just at the start or at the end.
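
To illustrate the skipping I have in mind, here is a sketch in Python. It is not how the magic rules work, and `first_non_trash_entry` is a hypothetical helper; it just walks the entries in the order they are stored and ignores anything under [trash]/:

```python
import io
import zipfile
from typing import Optional


def first_non_trash_entry(data: bytes) -> Optional[str]:
    """Name of the first entry, in stored order, outside [trash]/."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # The central directory may list entries in a different order than
        # they are stored, so sort by local header offset.
        entries = sorted(zf.infolist(), key=lambda info: info.header_offset)
    for info in entries:
        if not info.filename.startswith("[trash]/"):
            return info.filename
    return None


# Stand-in archive (not the real document) with trash items mixed in.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[trash]/0000.dat", b"x")
    zf.writestr("_rels/.rels", b"<Relationships/>")
    zf.writestr("[trash]/0002.dat", b"x")
    zf.writestr("word/document.xml", b"<document/>")

print(first_non_trash_entry(buf.getvalue()))
```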

Unfortunately I cannot share the document yet, but I will soon, to make debugging easier.

Kind Regards!

a1rind

2023-03-13 11:33

reporter   ~0003915

Hi!

I've attached the problematic document. I had to remove some confidential information and zip it manually, keeping the files in the same order as before.

Thanks!
unsupported-prepared.docx (182,798 bytes)

christos

2023-03-14 19:46

manager   ~0003916

Fixed, thanks!

a1rind

2023-03-15 12:49

reporter   ~0003918

Hi!

Thanks a lot for looking into this. However, I don't think the latest changes fix the issue. When I try the latest magic rules, the file is still recognized as application/zip:

```
file -m msooxml unsupported-prepared.docx
```

Produces:
```
Zip archive data, at least v2.0 to extract, compression method=store
```

Also when I try to compile the rules with the latest changes I get the following error:
```
/usr/share/file/magic/mail.news, 84: Warning: Unparsable number `xu \b, dcrypt version %d'
```

a1rind

2023-03-21 12:16

reporter   ~0003919

Hi!

Any thoughts on the issue? Or am I doing something wrong?

Kind Regards!

christos

2023-03-21 14:03

manager   ~0003920

Why is it picking up files from /usr/share/file/magic? Is there some environment setting? Also, line 84 in the most recent version of file does not match that string...

a1rind

2023-04-04 10:22

reporter   ~0003921

Sorry for getting back late on this. It turned out the newer changes work only with the latest version. I tested with file-5.44 and it works fine, but it does not work with file-5.41. I was unable to test file-5.42 and file-5.43.
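
For anyone who wants to guard against the same mix-up, a small version check could look like the sketch below. It parses a line such as the first line printed by `file --version`. The helper names are hypothetical, and the 5.44 threshold is an assumption based only on what I tested, so the real cutoff may be lower:

```python
import re
from typing import Tuple

# Assumed threshold: file-5.44 was seen working and file-5.41 was not;
# 5.42 and 5.43 were not tested, so the true cutoff may be lower.
ASSUMED_MINIMUM = (5, 44)


def parse_file_version(first_line: str) -> Tuple[int, int]:
    """Parse a version line such as 'file-5.44' into (5, 44)."""
    match = re.fullmatch(r"file-(\d+)\.(\d+)", first_line.strip())
    if match is None:
        raise ValueError(f"unrecognised version line: {first_line!r}")
    return int(match.group(1)), int(match.group(2))


def is_new_enough(first_line: str) -> bool:
    """True if the parsed version is at least the assumed minimum."""
    return parse_file_version(first_line) >= ASSUMED_MINIMUM


print(is_new_enough("file-5.41"), is_new_enough("file-5.44"))
```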

christos

2023-05-09 18:08

manager   ~0003924

Submitter verified it is fixed on the latest version.

Issue History

Date Modified Username Field Change
2023-02-18 00:37 a1rind New Issue
2023-02-18 00:37 a1rind Tag Attached: bug
2023-02-18 00:37 a1rind Tag Attached: magic
2023-02-28 16:46 a1rind Note Added: 0003898
2023-03-05 19:51 christos Assigned To => christos
2023-03-05 19:51 christos Status new => assigned
2023-03-05 19:52 christos Status assigned => feedback
2023-03-05 19:52 christos Note Added: 0003903
2023-03-09 10:43 a1rind Note Added: 0003909
2023-03-09 10:43 a1rind Status feedback => assigned
2023-03-13 11:33 a1rind Note Added: 0003915
2023-03-13 11:33 a1rind File Added: unsupported-prepared.docx
2023-03-14 19:46 christos Status assigned => resolved
2023-03-14 19:46 christos Resolution open => fixed
2023-03-14 19:46 christos Fixed in Version => 5.45
2023-03-14 19:46 christos Note Added: 0003916
2023-03-15 12:49 a1rind Status resolved => feedback
2023-03-15 12:49 a1rind Resolution fixed => reopened
2023-03-15 12:49 a1rind Note Added: 0003918
2023-03-21 12:16 a1rind Note Added: 0003919
2023-03-21 12:16 a1rind Status feedback => assigned
2023-03-21 14:03 christos Note Added: 0003920
2023-04-04 10:22 a1rind Note Added: 0003921
2023-05-09 18:08 christos Status assigned => resolved
2023-05-09 18:08 christos Note Added: 0003924