Excel’s data format is everywhere in the business world, and more often than not, you’ll wind up getting chunks of data from people dumped into an Excel file. Sure, CSV tends to be the “lowest common denominator” exchange format for data between software, but that’s quirky for its own reasons.
One way to pull data off Excel using Python is to use the win32 extensions and go through COM. However, this requires you to have a copy of Excel installed, and isn’t all that fast if you’re doing large batches of data grabbing.
The other way is to use the wonderful xlrd package. xlrd is great because it’s pure python, can run anywhere Python 2.1 and higher runs. It handles many of the quirks of the data format (such as the date epochs, encodings). Finally, perhaps most important for people in the business world, it’s BSD licensed, so we can include it in any products we make without running into any GPL conflicts.Of course, xlrd isn’t all powerful. It won’t handle password protected files, show charts/pictures/etc, display formulas (it will display the saved result), do any VBA-related things, display comments and hyperlinks. Also, the new Excel 2007 format isn’t supported as of this writing, but it’ll handle just about all the formats before that point.
The API for the module is pretty large, with lots of things you didn’t think you’d need, so take a look at the docs when you have time.
1 2 3 4 5 6 7 8 9 10 11 12 13 | import xlrd book = xlrd.open_workbook("data.xls") #open our xls file, there's lots of extra default options in this call, for logging etc. take a look at the docs sheet = book.sheets()[0] #book.sheets() returns a list of sheet objects... alternatively... sheet = book.sheet_by_name("qqqq") #we can pull by name sheet = book.sheet_by_index(0) #or by the index it has in excel's sheet collection r = sheet.row(0) #returns all the CELLS of row 0, c = sheet.col_values(0) #returns all the VALUES of row 0, data = [] #make a data store for i in xrange(sheet.nrows): data.append(sheet.row_values(i)) #drop all the values in the rows into data |
Note, there’s a difference between CELLS and VALUES. xlrd has a Cell class that encodes information about the cell in addition to its value, such as the data type (number, date, text, etc). Be careful about this when doing operations, for example:
1 2 3 | #if sheet.col(0)[1] returns a cell with 'number:2.0' sheet.col(0)[1] + 2 # will throw a TypeError. sheet.col(0)[1].value + 2 #will yield the expected 4.0 |
There’s many more goodies in the xlrd API, but even with just this little, you can pull out the tables you need with a little bit of work.
Thanks for the example. This has gotten me started. Your example uses a .xls doc. Can we somehow open and manipulate a .xlsx doc?