View Issue Details

ID: 0000104
Project: file
Category: General
View Status: public
Last Update: 2020-08-23 19:26
Reporter: Ilrandar
Assigned To: christos
Priority: normal
Severity: minor
Reproducibility: have not tried
Status: resolved
Resolution: fixed
Product Version: 5.37
Fixed in Version: 5.40
Summary: 0000104: pdf file incorrectly reported as `data`
Description: Some PDF files downloaded from the internet are incorrectly reported as `data` by file. Their associated mime-type is `application/octet-stream` and not `application/pdf`. I attach such a PDF to this report.
Tags: No tags attached.



2019-09-10 21:04



2019-09-11 14:42

manager   ~0003288

These are the first few lines of the file:

HTTP/1.1 200 OK
Date: Tue, 10 Sep 2019 08:38:20 GMT
Server: Apache/2.4.38 (Debian)
Content-Disposition: attachment; filename="21808995-2019-certificat-scolarite.pdf"
Cache-Control: no-cache, private
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 58
Content-Length: 1184531
Content-Type: application/pdf

Here's where the pdf file starts:


The tool you used to download it, or the original file, has junk in front. Of course some browsers ignore the junk and process it as a PDF file (because users want things to just work), but this is just crappy behavior. Most applications will not open it properly, and it is also a security issue, since you can masquerade files this way. It is also fragile: how many lines does a reader try to parse? 10? 1K of data? Who knows; it depends on the implementation. Of course, file can also be modified to mimic this behavior, at the cost of efficiency and of encouraging people to produce junk...


2019-09-11 17:07

reporter   ~0003295

Oh, I didn’t know I could open pdf files with a text editor.
I don’t think you should ignore junk in front of files. I just needed some way to get this file (and a few others) recognized as PDF files, but if I can just open them and remove the leading incorrect lines, I will do that.
Thank you for your answer.
As far as I’m concerned, you can consider this issue closed.
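
For anyone who needs to clean such files in bulk rather than by hand in a text editor, a minimal Python sketch follows. It assumes the junk is a short prefix (such as the HTTP response headers quoted above) and that the real PDF begins at the first `%PDF-` marker starting within the first 1024 bytes. The function name `strip_leading_junk` and the 1024-byte window are illustrative choices, not anything file itself prescribes.

```python
PDF_MAGIC = b"%PDF-"


def strip_leading_junk(data: bytes, window: int = 1024) -> bytes:
    """Return data starting at the first %PDF- marker.

    The marker must start within the first `window` bytes (assumed limit);
    otherwise ValueError is raised and the data is left alone.
    """
    # Bound the search so an arbitrary file that merely contains "%PDF-"
    # somewhere deep inside is not silently truncated.
    end = window + len(PDF_MAGIC) - 1
    offset = data.find(PDF_MAGIC, 0, end)
    if offset < 0:
        raise ValueError("no %PDF- header found near the start of the data")
    return data[offset:]


if __name__ == "__main__":
    # Hypothetical sample: HTTP headers saved in front of the PDF body.
    sample = b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n%PDF-1.4\n"
    print(strip_leading_junk(sample))
```

Raising an error instead of returning the input unchanged is a deliberate choice in this sketch: a file with no header near the start is probably not a PDF with junk in front, and should be looked at by hand.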

2020-06-29 16:21

reporter   ~0003432

I've noticed a similar issue with other technically-malformed PDF files, but it appears almost every PDF reader and tool not using libmagic supports these.

As an aside, it does appear that a rule to implement this is actually in place [1]. However, it appears that due to a longstanding issue, searching for text strings in files detected as binary doesn't work [2], so this rule is never successfully hit. (This can be verified by the fact that a single-line text file with the string '%PDF-1.7' at the end will be detected as 'PDF document, version 1.7, ASCII text'.) Presumably this rule should either be removed, or a workaround or fix for the search issue is required.



2020-06-29 16:50

manager   ~0003433

Actually there is a simple fix to make search work for binary data (add /b). I will do that and it will probably fix your problem.
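
To make the discussion concrete, here is a rough Python emulation of what a `search` rule with a range does, as described in magic(5): try to match a literal string at a limited number of consecutive positions from a start offset. This is only an illustration of those semantics, not libmagic's code, and it deliberately ignores the text/binary classification that the `/b` flag overrides. The function name `search_match` and the default range of 1024 are assumptions made for the example.

```python
def search_match(data: bytes, pattern: bytes, start: int = 0, count: int = 1024) -> int:
    """Return the first offset where `pattern` matches, or -1.

    A match is attempted at `count` consecutive positions beginning at
    `start`, which mirrors the idea of a range on a search rule.
    """
    for pos in range(start, start + count):
        # startswith with an offset past the end of data simply returns False.
        if data.startswith(pattern, pos):
            return pos
    return -1


if __name__ == "__main__":
    # Hypothetical input: one line of junk, then the PDF header.
    print(search_match(b"junk\n%PDF-1.7\n", b"%PDF-"))
```

With a small range the header is missed once enough junk precedes it, which is the fragility raised earlier in the thread: any fixed limit is an arbitrary cut-off.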

2020-06-29 16:56

reporter   ~0003434

My apologies - I was apparently looking at an old manpage that didn't have documentation on this. Having tested locally with a custom magic file this does appear to work correctly.


2020-08-23 19:26

manager   ~0003455

Confirmed fixed

Issue History

Date Modified Username Field Change
2019-09-10 21:04 Ilrandar New Issue
2019-09-10 21:04 Ilrandar File Added: certificat_scolarité_l2_eco.pdf
2019-09-11 14:39 christos Assigned To => christos
2019-09-11 14:39 christos Status new => assigned
2019-09-11 14:42 christos Status assigned => feedback
2019-09-11 14:42 christos Note Added: 0003288
2019-09-11 17:07 Ilrandar Note Added: 0003295
2019-09-11 17:07 Ilrandar Status feedback => assigned
2020-06-29 16:21 Note Added: 0003432
2020-06-29 16:50 christos Note Added: 0003433
2020-06-29 16:56 Note Added: 0003434
2020-08-23 19:26 christos Status assigned => resolved
2020-08-23 19:26 christos Resolution open => fixed
2020-08-23 19:26 christos Fixed in Version => 5.40
2020-08-23 19:26 christos Note Added: 0003455