Tuesday, 2 September 2014

Convert Little-endian UTF-16 to ascii

I took a CSV export from MSSQL and had to read the file in Python. However, there was a problem: Python was reading it as raw bytes instead of readable text.

file.readlines() returned lines like this:

'\xff\xfe2\x000\x001\x003\x00-\x001\x000\x00-\x001\x000\x00 \x000\x000\x00:\x000\x002\x00:\x000\x000\x00,\x00i\x00n\x00s\x00t\x00a\x00l\x00l\x00 \x002\x000\x001\x003\x00,\x00t\x00o\x00o\x00k\x00 \x00r\x00a\x00\r\x00\n'

When I ran the file command on that file, I got:

Little-endian UTF-16 Unicode text, with very long lines, with CRLF, CR line terminators
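
That matches the bytes Python printed: the leading \xff\xfe is the UTF-16 little-endian byte order mark, and every ASCII character is followed by a \x00 byte. A minimal sketch of decoding the first few bytes of that line by hand (the variable names are only for illustration):

```python
# First bytes of the line shown above: BOM (FF FE) followed by "2013",
# each character stored as two bytes in little-endian order.
raw = b'\xff\xfe2\x000\x001\x003\x00'

# The "utf-16" codec uses the BOM to pick the byte order and then drops it.
text = raw.decode('utf-16')
print(text)  # 2013
```

Reading the file line by line in binary mode splits on the single \n byte, which cuts the two-byte newline character in half. That is why the line above ends in \r\x00\n with no trailing \x00.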

When I tried the dos2unix command, I got this error:

dos2unix: Binary symbol 0x000B found at line 9419305
dos2unix: Skipping binary file calls.csv

So I tried the iconv command to convert the file to UTF-8:

iconv -f utf-16 -t utf-8 input_file > output_file

And it worked! Now Python reads it properly.
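
If iconv is not available, the same conversion can probably be done in Python itself. The sketch below is not what I ran; utf16_to_utf8 and its parameters are hypothetical names, and it assumes the input starts with a byte order mark, as this export did.

```python
import io


def utf16_to_utf8(src_path, dst_path, chunk_size=1 << 20):
    """Re-encode a UTF-16 file (with BOM) as UTF-8, one chunk at a time."""
    # newline='' turns off newline translation, so CRLF and CR pass through as-is.
    with io.open(src_path, 'r', encoding='utf-16', newline='') as src, \
         io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        while True:
            chunk = src.read(chunk_size)  # reads decoded characters, not bytes
            if not chunk:
                break
            dst.write(chunk)
```

Calling utf16_to_utf8('input_file', 'output_file') should mirror the iconv command above. When a converted copy is not needed, opening the original with io.open(path, encoding='utf-16') and iterating over it should yield decoded lines directly.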