View Issue Details

ID: 0000526
Project: file
Category: General
View Status: public
Last Update: 2024-06-16 15:00
Reporter: maiphuc
Assigned To: christos
Priority: normal
Severity: minor
Reproducibility: always
Status: assigned
Resolution: open
Platform: linux
OS: ubuntu
OS Version: 20.04
Summary: 0000526: html file gets misdetected as csv
Description: When I request https://www.doordash.com/ and save the response body to a file, the file should be detected as HTML, but the file command reports it as CSV.
Steps To Reproduce: file test.txt
Tags: No tags attached.

Activities

maiphuc

2024-05-15 02:58

reporter  

test.txt (1,413,228 bytes)

christos

2024-05-18 15:15

manager   ~0004049

That's unfortunate: it happens that this HTML file has 3 comma-separated fields on each of the first 19 lines, and file only requires that 10 lines have the same number of fields. So in this special case it misdetects:

[11:13am] 337>cc -DCSV_LINES=40 -DDEBUG -DTEST is_csv.c
[11:13am] 338>./a.out o
0 3 0
1 3 3
2 3 3
3 3 3
4 3 3
5 3 3
6 3 3
7 3 3
8 3 3
9 3 3
10 3 3
11 3 3
12 3 3
13 3 3
14 3 3
15 3 3
16 3 3
17 3 3
18 3 3
19 3 3
20 12 3
is csv 0
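
For readers unfamiliar with this kind of check, the heuristic described above can be sketched roughly as follows. This is a simplified Python illustration, not the code in is_csv.c: the function names, the quote handling, and the "at least two lines" rule are assumptions made for the example. The synthetic input mirrors the debug output above (lines 0 to 19 with 3 fields, line 20 with 12 fields).

```python
def count_fields(line: str) -> int:
    """Count comma-separated fields, ignoring commas inside double quotes."""
    fields = 1
    in_quotes = False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            fields += 1
    return fields


def looks_like_csv(text: str, required_lines: int = 10) -> bool:
    """Hypothetical check: the first `required_lines` lines must all have
    the same number of fields, and that number must be greater than 1."""
    expected = 0
    checked = 0
    for line in text.splitlines():
        if checked == required_lines:
            break  # enough consistent lines seen, stop looking
        n = count_fields(line)
        if checked == 0:
            expected = n  # the first line sets the expected field count
        elif n != expected:
            return False  # field count changed: not treated as CSV
        checked += 1
    # Assumption for this sketch: need more than one field and two lines.
    return expected > 1 and checked >= 2


# Synthetic stand-in for the attached file: 20 lines of 3 fields, then 12 fields.
sample = "\n".join(["a,b,c"] * 20 + [",".join("x" * 12)])

print(looks_like_csv(sample, required_lines=10))  # True: misdetected as CSV
print(looks_like_csv(sample, required_lines=40))  # False: line 20 breaks the pattern
```

With a threshold of 10 lines the sample passes, because the mismatch on line 20 is never examined. Raising the threshold to 40 (as with -DCSV_LINES=40 in the test build above) reaches the 12-field line and rejects the input.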

maiphuc

2024-05-21 09:18

reporter   ~0004051

Thank you for your attention to this issue. I have a question: if a file can be identified as either HTML or CSV, why does file choose CSV?

christos

2024-06-16 15:00

manager   ~0004054

The answer is a little complicated. Using -k (keep going) should print both. file(1) has both built-in recognition for formats that magic files can't easily handle (tar, csv, json, der, ctf, etc.) and the regular magic definitions (softmagic). The softmagic entries are sorted by "strength", a heuristic for how specific a magic entry is, but the built-ins are not sorted and are applied in sequence before softmagic. Typically this is not a problem because the built-in ones usually don't have spurious matches.
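
The ordering described in this note can be modelled with a small Python sketch. It is only an illustration of the control flow, not file(1)'s actual code: the `identify` function, the `SoftMagic` type, the toy matchers, and the strength numbers are all hypothetical. Built-in detectors run first in a fixed order, softmagic entries follow in descending strength, and a `keep_going` flag plays the role of -k.

```python
from typing import Callable, List, NamedTuple, Tuple


class SoftMagic(NamedTuple):
    name: str
    strength: int  # made-up values; higher means more specific
    matches: Callable[[str], bool]


def identify(
    data: str,
    builtins: List[Tuple[str, Callable[[str], bool]]],
    softmagic: List[SoftMagic],
    keep_going: bool = False,
) -> List[str]:
    """Return the matching type names; stop at the first unless keep_going."""
    results: List[str] = []
    # Built-ins are tried in list order, before any softmagic entry.
    for name, matches in builtins:
        if matches(data):
            results.append(name)
            if not keep_going:
                return results
    # Softmagic entries are tried from strongest to weakest.
    for entry in sorted(softmagic, key=lambda e: e.strength, reverse=True):
        if entry.matches(data):
            results.append(entry.name)
            if not keep_going:
                return results
    return results


# Toy detectors, purely for illustration.
builtins = [("CSV text", lambda d: d.count(",") >= 2)]
softmagic = [
    SoftMagic("ASCII text", 1, lambda d: d.isascii()),
    SoftMagic("HTML document", 50, lambda d: "<html" in d.lower()),
]

data = "<html>a,b,c</html>"
print(identify(data, builtins, softmagic))
print(identify(data, builtins, softmagic, keep_going=True))
```

In this model the built-in CSV match wins by default even though a more specific softmagic entry also matches, which is the situation in this report. With `keep_going=True` the HTML match is reported as well, analogous to running `file -k test.txt`.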

Issue History

Date Modified Username Field Change
2024-05-15 02:58 maiphuc New Issue
2024-05-15 02:58 maiphuc File Added: test.txt
2024-05-18 15:13 christos Assigned To => christos
2024-05-18 15:13 christos Status new => assigned
2024-05-18 15:15 christos Status assigned => feedback
2024-05-18 15:15 christos Note Added: 0004049
2024-05-21 09:18 maiphuc Note Added: 0004051
2024-05-21 09:18 maiphuc Status feedback => assigned
2024-06-16 15:00 christos Note Added: 0004054