Playing with data: CSV, quirks and pitfalls

Comma Separated Values, CSV, is the most basic way to export data out of just about any program that handles data in table format, from databases to spread sheets. The problem is that CSV isn’t standardized, so different programs can interpret what “CSV” is supposed to mean in slightly different ways, which winds up doing bad things to your data.

For very simple things, CSV is rather straightforward. Every line is a row, all values are separated by commas, and that’s that. You could slurp all your data in in a few lines of code.

<python>
#naively slurp files into a data list…
ifile = open(“data.csv”,’r')
data = [ x.split(',') for x in ifile ]
</python>

This is usually all you need to read simple outputs like the kind you can get off a finance web site of historic stock values.

The problem is that for more complicated sets of data, you cells that include commas, or newline characters that belong in a single entry. Most often you’d expect this sort of thing for user input data like survey comments, and all it takes is one occurrence to throw your software off.

Excel for example uses double quotes to enclose entries that have \n or commas in them, and then double double quotes to put in a double quote character (… yeah…)

In Python, there’s the <a href=”http://docs.python.org/library/csv.html”>csv module </a> in the standard library that takes care of most of these quirks. It also provides ways to define a new dialect to take into account different delimiters, escape characters, quote characters, and so on. It also provides readers/writers that will handle the dialects for you transparently so you don’t have to think about what you’re doing as much.

1
2
3
4
5
6
7
8
9
10
import csv
ifile = open("data.csv",'r')
r = csv.reader(ifile, dialect='excel') #build a reader object using the excel dialect
data = []
for row in r: #and just iterate over the reader
  data.append(r)
#...processing happens to data...
ofile = open('output.csv','w')
w = csv.writer(ofile,dialect='excel')
w.writerows(data) #write all the rows all at once.

Still, for data dumps with human free responses mixed in, it might make more sense to use TSV, Tab Separated Values, simply because users aren’t able to insert tabs into forms and such as easily as they can put newlines and commas.

This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>