View Issue Details

ID: 0000590
Project: file
Category: General
View Status: public
Last Update: 2025-01-30 20:43
Reporter: hiran
Assigned To: christos
Priority: normal
Severity: minor
Reproducibility: always
Status: feedback
Resolution: open
Product Version: 5.45
Summary: 0000590: Mime-type for .docx files not always detected correctly
Description: I have a couple of Word documents (.docx). For some of them, file detects this:

hiran@silver:~/test$ file -i somefile.docx
somefile.docx: application/octet-stream; charset=binary
hiran@silver:~/test$ file somefile.docx
somefile.docx: Microsoft OOXML
hiran@silver:~/test$

for some others it detects this:

hiran@silver:~/test$ file somefile2.docx
somefile2.docx: Microsoft Word 2007+
hiran@silver:~/test$ file -i somefile2.docx
somefile2.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary
hiran@silver:~/test$

I strongly suspect it is related to the age of the documents. Word 2007 documents are detected fine, but any application relying on the MIME type will be confused by the documents typed as OOXML, for which the MIME type emitted is application/octet-stream. On the other hand, the format is not unknown to file, as it reliably identifies these documents as OOXML.

It looks like a mistake. Can it be corrected or is this as per specification?
Steps To Reproduce: Use Word documents created by different versions of MS Office, all stored in .docx format.
Run the command as mentioned in the description and check both the emitted file type and the mime type.
Tags: No tags attached.

Activities

hiran

2024-12-17 20:24

reporter   ~0004134

This issue was detected and recorded here: https://github.com/paperless-ngx/paperless-ngx/discussions/8489

christos

2025-01-30 20:43

manager   ~0004176

Detecting word as opposed to a general office doc depends on the structure of the document. The magic spec tries to find a word/ folder in the first few entries of the zip directory. If you have examples of files that don't work, I can take a look at them.
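To check whether a given .docx matches the condition described above, one can list the first few members of the ZIP container and see whether a word/ entry appears among them. The following is only a rough diagnostic sketch, not the logic file itself uses: the function names and the limit of 5 entries are assumptions made for illustration (the actual number of entries the magic spec inspects is not stated here), and sorting by header offset is an assumed approximation of the order in which entries appear in the file.

```python
import sys
import zipfile


def first_entries(path, limit=5):
    """Return the names of the first `limit` ZIP members.

    Members are sorted by the offset of their local header, as an
    approximation of physical order in the file. `limit` is an
    assumed value, not the number libmagic actually inspects.
    """
    with zipfile.ZipFile(path) as zf:
        members = sorted(zf.infolist(), key=lambda info: info.header_offset)
        return [info.filename for info in members[:limit]]


def has_early_word_entry(path, limit=5):
    """True if any of the first `limit` members lives under word/."""
    return any(name.startswith("word/") for name in first_entries(path, limit))


if __name__ == "__main__":
    # Usage: python check_docx.py somefile.docx somefile2.docx
    for docx in sys.argv[1:]:
        print(docx, first_entries(docx), has_early_word_entry(docx))
```

Running this on one file that is reported as "Microsoft OOXML" and one reported as "Microsoft Word 2007+" should show whether the two differ in where the word/ entries sit inside the archive.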

Issue History

Date Modified     Username  Field                Change
2024-12-17 20:15  hiran     New Issue
2024-12-17 20:24  hiran     Note Added: 0004134
2025-01-30 20:42  christos  Assigned To          => christos
2025-01-30 20:42  christos  Status               new => assigned
2025-01-30 20:43  christos  Status               assigned => feedback
2025-01-30 20:43  christos  Note Added: 0004176