I needed to use Python to process a CSV file like the one below:

So I wrote some code to try reading it first:
import csv
with open('Datax.CSV', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['Process Name'])
Then something strange happened: I could read the value of every column in a row except the first one, "Time of Day". When I wrote print(row['Time of Day']), I got the following error:
C:\Users\Administrator\AppData\Local\Programs\Python\Python38>go2.py
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\go2.py", line 5, in <module>
    print(row['Time of Day'])
KeyError: 'Time of Day'
The test data file is here, and the code is here. If you are interested, try to solve it yourself before reading the explanation.

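Next I made all sorts of guesses, assuming the cause was a stray space or something similar, but none of them was the root cause. Eventually I had the code print the keys it had actually read: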

To my surprise, the first key was '\ufeff"Time of Day"'. I then opened the data file in a hex editor:

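The file begins with the bytes EF BB BF, which is the UTF-8 byte order mark (BOM). That is what caused the strange problem: when the file is opened with encoding='utf-8', the BOM is kept as the character \ufeff in front of the first header, so the opening quote is no longer the first character of the field and the csv module keeps the quotes as part of the key.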
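The effect can be reproduced without the original file. The following is a minimal sketch, assuming a two-line sample with the same header layout; the row values and the name example.exe are made up for illustration and are not taken from Datax.CSV.

```python
import codecs
import csv
import io

# Hypothetical sample: the UTF-8 BOM (EF BB BF) followed by a quoted header row.
raw = (codecs.BOM_UTF8
       + b'"Time of Day","Process Name"\r\n'
       + b'"10:21:22.8436230 PM","example.exe"\r\n')

# Plain 'utf-8' decoding keeps the BOM as the character U+FEFF.
text = raw.decode('utf-8')
reader = csv.DictReader(io.StringIO(text, newline=''))
keys = reader.fieldnames

# The first key carries the BOM and the literal quotes; the second is clean.
print(keys)
```

Because \ufeff comes before the opening quote, the csv parser treats the first header as an unquoted field and leaves both quote characters in it.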
Once the cause is known, one way around it is to build exactly the same key:
import csv
with open('Datax.CSV', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row["\ufeff\"Time of Day\""])
It runs correctly:

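Conclusion: the problem was caused by the UTF-8 BOM at the start of the data file. That said, Python really is convenient for data processing: debugging and writing took a fair amount of time, but the code turned out far shorter than I expected.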
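Building the key by hand only works while the file keeps its BOM. A possible alternative, sketched here rather than taken from the original post, is Python's standard 'utf-8-sig' codec, which drops a leading BOM if one is present. The sample bytes and the temporary file below are assumptions made so the example is self-contained.

```python
import codecs
import csv
import os
import tempfile

# Hypothetical BOM-prefixed sample written to a temporary file.
sample = (codecs.BOM_UTF8
          + b'"Time of Day","Process Name"\r\n'
          + b'"10:21:22.8436230 PM","example.exe"\r\n')
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # 'utf-8-sig' strips the BOM, so the header is parsed as a quoted field.
    with open(path, encoding='utf-8-sig', newline='') as csvfile:
        times = [row['Time of Day'] for row in csv.DictReader(csvfile)]
finally:
    os.remove(path)

print(times)
```

With this codec the plain key 'Time of Day' works, and the same code also reads UTF-8 files that have no BOM.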
import csv, os
try:
    import chardet
except ModuleNotFoundError:
    raise ImportError('Please install chardet by "pip install chardet" first.')

def checkEncoding(file):
    # Read only the first few bytes; a BOM, if present, is at the very start.
    size = min(32, os.path.getsize(file))
    with open(file, 'rb') as f:
        raw = f.read(size)
    # For a file that starts with a UTF-8 BOM, chardet is expected to report 'UTF-8-SIG'.
    result = chardet.detect(raw)
    encoding = result['encoding']
    return encoding

with open('Datax.CSV', encoding=checkEncoding('Datax.CSV')) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['Time of Day'])
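If installing chardet is not an option, the BOM itself can be checked with the standard library alone. This is only a sketch: sniff_bom_encoding is a hypothetical helper name, it recognises BOMs and nothing else, and it falls back to a default encoding that the caller has to choose.

```python
import codecs

# UTF-32 marks are tested before UTF-16 because BOM_UTF32_LE
# begins with the same two bytes as BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def sniff_bom_encoding(head, default='utf-8'):
    """Return a codec name for the BOM at the start of `head` (bytes), else `default`."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default

print(sniff_bom_encoding(codecs.BOM_UTF8 + b'"Time of Day"'))
```

In use, one would read the first four bytes of the file in 'rb' mode, pass them to the helper, and hand the result to open(..., encoding=...).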
//Here is another approach, using flex (or antlr). The following is the content of mylex.flex:
%OPTION noyywrap
%{
/* Redefine ECHO as empty so that unmatched input is not copied to the output. */
#define ECHO
%}
%%
[0-9]{2}":"[0-9]{2}":"[0-9]{2}"."[0-9]+" "([A]|[P])[M] {printf("time is: %s\n",yytext);}
%%
int main()
{
    while(yylex())
    {
        //nothing;
    }
    return 0;
}
//The following is the test output:
E:\temp\test>flex --wincompat mylex.flex
E:\temp\test>cl lex.yy.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.15.26730 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
lex.yy.c
Microsoft (R) Incremental Linker Version 14.15.26730.0
Copyright (C) Microsoft Corporation. All rights reserved.
/out:lex.yy.exe
lex.yy.obj
E:\temp\test>type DataX.CSV | lex.yy.exe
time is: 10:21:22.8436230 PM
time is: 10:21:22.8436625 PM
time is: 10:21:22.8436974 PM
time is: 10:21:22.8440542 PM
time is: 10:21:22.8440997 PM
time is: 10:21:22.8441291 PM
time is: 10:21:22.8442544 PM
//flex can be downloaded here: https://sourceforge.net/projects/winflexbison/
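For comparison, the same pattern can be written with Python's re module. This is a rough equivalent of the flex rule above, not a replacement for parsing the CSV properly; TIME_RE and find_times are names chosen for this sketch.

```python
import re

# Two digits, colon, two digits, colon, two digits, a dot, one or more digits,
# a space, then AM or PM: the same shape the flex rule matches.
TIME_RE = re.compile(r'[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+ [AP]M')

def find_times(text):
    """Return every timestamp in `text` that matches TIME_RE."""
    return TIME_RE.findall(text)

# Hypothetical input lines in the same layout as the flex test output.
sample = ('\ufeff"Time of Day","Process Name"\n'
          '"10:21:22.8436230 PM","example.exe"\n'
          '"10:21:22.8436625 PM","example.exe"\n')
for t in find_times(sample):
    print('time is:', t)
```

Like the flex scanner, this scans the raw text and ignores the CSV structure, so the BOM in front of the header does not get in the way.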
Very professional!