View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0000104||file||General||public||2019-09-10 21:04||2020-08-23 19:26|
|Priority||normal||Severity||minor||Reproducibility||have not tried|
|Fixed in Version||5.40|
|Summary||0000104: pdf file incorrectly reported as `data`|
|Description||Some PDF files downloaded from the internet are incorrectly reported as `data` by file. Their reported MIME type is `application/octet-stream` instead of `application/pdf`. I have attached one such PDF to this report.|
|Tags||No tags attached.|
certificat_scolarité_l2_eco.pdf (1,184,843 bytes)
These are the first few lines of the file:
HTTP/1.1 200 OK
Date: Tue, 10 Sep 2019 08:38:20 GMT
Server: Apache/2.4.38 (Debian)
Content-Disposition: attachment; filename="21808995-2019-certificat-scolarite.pdf"
Cache-Control: no-cache, private
Here's where the pdf file starts:
Either the tool you used to download the file put junk in front of it, or the original file already had it. Of course, some browsers ignore the junk and process it as a PDF file (because users want things to just work), but this is just crappy behavior. Most applications will not open it properly, and it is also a security issue, since files can be masqueraded this way. It is also fragile: how much of the file does a reader scan for the header? 10 lines? 1K of data? Who knows; it depends on the implementation. Of course, file could also be modified to mimic this behavior, at the cost of efficiency and of encouraging people to produce junk...
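To make the fragility concrete, here is a minimal Python sketch of the lenient behavior described above: it looks for the `%PDF-` marker anywhere within a fixed window at the start of the data. The function name `find_pdf_header` and the 1024-byte default are assumptions chosen for illustration; they are not taken from file or from any particular PDF reader.

```python
def find_pdf_header(data: bytes, window: int = 1024) -> int:
    """Return the offset of the '%PDF-' marker within the first
    `window` bytes of `data`, or -1 if it is not found there.

    The window size is an arbitrary illustrative choice; real
    readers differ in how far they are willing to scan.
    """
    return data[:window].find(b"%PDF-")


# A well-formed file has the marker at offset 0.
clean = b"%PDF-1.7\n..."
# A file like the one in this report has HTTP headers in front.
junk = b"HTTP/1.1 200 OK\r\nCache-Control: no-cache, private\r\n\r\n"
prefixed = junk + clean

print(find_pdf_header(clean))     # offset 0: strict and lenient checks agree
print(find_pdf_header(prefixed))  # positive offset: only a lenient check accepts it
```

Whether a given file is accepted depends entirely on `window`, which is the implementation-dependent part.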
Oh, I didn’t know I could open pdf files with a text editor.
I don’t think you should ignore junk in front of a file. I just needed some way to get this file (and a few others) recognized as PDF files, but if I can just open them and get rid of the leading incorrect lines, I will do that.
Thank you for your answer.
As far as I’m concerned, you can consider this issue closed.
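Removing the leading lines can also be scripted instead of done by hand in a text editor. The following is a minimal Python sketch, assuming the junk is a plain prefix and the first occurrence of `%PDF-` marks the real start of the document; the helper name `strip_leading_junk` and the file names in the usage comment are hypothetical.

```python
PDF_MAGIC = b"%PDF-"


def strip_leading_junk(data: bytes) -> bytes:
    """Return `data` with everything before the first '%PDF-' removed.

    Raises ValueError if the marker is absent, so that a non-PDF
    file is never silently passed through.
    """
    offset = data.find(PDF_MAGIC)
    if offset < 0:
        raise ValueError("no %PDF- header found")
    return data[offset:]


# Hypothetical usage: write the cleaned copy to a new file and
# keep the original untouched.
#
# with open("downloaded.pdf", "rb") as src:
#     cleaned = strip_leading_junk(src.read())
# with open("cleaned.pdf", "wb") as dst:
#     dst.write(cleaned)
```

The file has to be read and written in binary mode, since a PDF is not plain text and a text-mode round trip could alter its bytes.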
I've noticed a similar issue with other technically-malformed PDF files, but it appears almost every PDF reader and tool not using libmagic supports these.
As an aside, it does appear that a rule to implement this is actually in place. However, it appears that, due to a longstanding issue, searching for text strings in files detected as binary doesn't work, so this rule is never successfully hit (this can be verified by the fact that a single-line text file with the string '%PDF-1.7' at the end will be detected as 'PDF document, version 1.7, ASCII text'). Presumably this rule should either be removed, or a workaround or fix for the search issue is required.
Actually there is a simple fix to make search work for binary data (add /b). I will do that and it will probably fix your problem.
My apologies - I was apparently looking at an old manpage that didn't have documentation on this. Having tested locally with a custom magic file, this does appear to work correctly.
|2019-09-10 21:04||Ilrandar||New Issue|
|2019-09-10 21:04||Ilrandar||File Added: certificat_scolarité_l2_eco.pdf|
|2019-09-11 14:39||christos||Assigned To||=> christos|
|2019-09-11 14:39||christos||Status||new => assigned|
|2019-09-11 14:42||christos||Status||assigned => feedback|
|2019-09-11 14:42||christos||Note Added: 0003288|
|2019-09-11 17:07||Ilrandar||Note Added: 0003295|
|2019-09-11 17:07||Ilrandar||Status||feedback => assigned|
|2020-06-29 16:21||holmesmr.pf||Note Added: 0003432|
|2020-06-29 16:50||christos||Note Added: 0003433|
|2020-06-29 16:56||holmesmr.pf||Note Added: 0003434|
|2020-08-23 19:26||christos||Status||assigned => resolved|
|2020-08-23 19:26||christos||Resolution||open => fixed|
|2020-08-23 19:26||christos||Fixed in Version||=> 5.40|
|2020-08-23 19:26||christos||Note Added: 0003455|