How do I get csvkit to recognise ASCII text with very long lines?

isomorphismes

I'm on Ubuntu, and I downloaded a CSV file which the file command tells me is encoded as:

ASCII text, with very long lines, with CRLF line terminators 

However, when I run csvcut datafile, I get:

Your file is not "utf-8" encoded. Please specify the correct encoding with the -e flag. Use the -v flag to see the complete error. 

and when I run csvcut -e ASCII datafile, I get:

Your file is not "ASCII" encoded. Please specify the correct encoding with the -e flag. 

(Neither changing the capitalisation nor copy-pasting the exact file output helps.)


The full error (-v) looks like this:

Traceback (most recent call last):
  File "/usr/local/bin/csvcut", line 9, in <module>
    load_entry_point('csvkit==0.9.2', 'console_scripts', 'csvcut')()
  File "/usr/local/lib/python2.7/dist-packages/csvkit-0.9.2-py2.7.egg/csvkit/utilities/csvcut.py", line 64, in launch_new_instance
    utility.main()
  File "/usr/local/lib/python2.7/dist-packages/csvkit-0.9.2-py2.7.egg/csvkit/utilities/csvcut.py", line 53, in main
    for row in rows:
  File "/usr/local/lib/python2.7/dist-packages/csvkit-0.9.2-py2.7.egg/csvkit/unicsv.py", line 51, in next
    row = next(self.reader)
  File "/usr/local/lib/python2.7/dist-packages/six.py", line 535, in next
    return type(self).__next__(self)
  File "/usr/local/lib/python2.7/dist-packages/csvkit-0.9.2-py2.7.egg/csvkit/unicsv.py", line 35, in __next__
    return next(self.reader).encode('utf-8')
  File "/usr/lib/python2.7/codecs.py", line 615, in next
    line = self.readline()
  File "/usr/lib/python2.7/codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "/usr/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

1 answer

4ae1e1

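Your file is neither ASCII nor UTF-8 encoded: the traceback shows a 0xc2 byte, which lies outside the ASCII range (0x00 to 0x7F), and csvkit rejected the file as UTF-8 as well. You can quickly find the non-ASCII bytes: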

awk '/[^\x00-\x7F]/{ print NR ":", $0 }' data.csv | less 
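If your awk build does not handle \x escapes inside a regular expression, roughly the same search can be sketched in Python. This is only an illustration: find_non_ascii is a made-up helper name, and the sample bytes are invented for the demo rather than taken from your file.

```python
import os
import tempfile


def find_non_ascii(path):
    """Yield (line_number, byte_offset, byte_value) for each byte above 0x7F."""
    with open(path, "rb") as fh:  # binary mode, so nothing is decoded
        for lineno, line in enumerate(fh, start=1):
            for offset, byte in enumerate(line):
                if byte > 0x7F:
                    yield lineno, offset, byte


# Hypothetical two-line sample with CRLF terminators; 0xE9 is "é" in ISO 8859-1.
sample = b"name\r\nCaf\xe9\r\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(sample)

hits = list(find_non_ascii(tmp.name))
os.remove(tmp.name)

for lineno, offset, byte in hits:
    print(f"line {lineno}, byte offset {offset}: 0x{byte:02x}")
```

Reading the raw byte values instead of rendered text avoids any dependence on how the terminal displays them.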

In a UTF-8 terminal emulator, the awk output shows things like Briarcliffe College�??Patchogue, which suggests that the file is not UTF-8 encoded. A reasonable first guess for the encoding is ISO 8859-1 (Western European). Let's test:

# piping to /dev/null to suppress printing and speed up processing (printing to tty is slow)
csvcut -e iso-8859-1 data.csv >/dev/null

No error this time, voila!
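One caveat: ISO 8859-1 assigns a character to every possible byte value, so decoding with it can never fail, and the lack of an error does not by itself prove the guess is right. It is worth looking at how the suspicious text decodes under a few candidates. A minimal sketch, assuming Python 3; try_encodings, the candidate list, and the sample bytes are all hypothetical:

```python
def try_encodings(raw, candidates=("ascii", "utf-8", "cp1252", "iso-8859-1")):
    """Return {encoding: decoded text, or None if decoding fails}."""
    results = {}
    for enc in candidates:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Hypothetical bytes; in practice, paste in a line flagged by the search above.
sample = b"Caf\xe9"
results = try_encodings(sample)

for enc, text in results.items():
    print(enc, "->", "failed" if text is None else repr(text))

# ISO 8859-1 decodes all 256 byte values, so it never raises.
all_bytes_decoded = bytes(range(256)).decode("iso-8859-1")
```

If the decoded text reads sensibly under one candidate (names and accented letters look right), that encoding is a plausible value for the -e flag.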