Pulling data from Excel using Python, COM

Previously, I had a quick overview of extracting data from Excel files using the python xlrd module. While that’s the recommended way of extracting data that you need from Excel data files you receive, it’s not the way to make interactive programs. For that, you’re likely to want to pull data out in real time. This brings us to win32 extensions and COM.

Of course you’ll need the python win32 extensions installed for this, check the resources for the link. After that, you’ll need to open up a COM client to the Excel app.

import win32com.client
xl = win32com.client.Dispatch("Excel.Application")

The first way to think of to pull data out of Excel is to just read it from the cells:

data = xl.Range("A1").value  #active book, active sheet
data2 = xl.Sheets("Sheet2").Range("A1").value #active book, specific sheet
data3 = xl.WorkBooks(2).Sheets("Sheet3").Cells(1,1).value #specific book, specific sheet

All the ways you can think about referencing a given cell in Excel in VBA you can use here cleanly.

Iteration
If you can pull a single cell, you could easily do some kind of iteration, for example:

1
2
3
4
5
6
data = []
data2=[]
for i in xrange(200):
  data.append(xl.Cells(1,i+1).value)#grab all along row 1, for 200 cells
  #i+1 because win32com counts from 1, not 0.
  data2.append( xl.Range("A1").Offset(1,i+1) #same idea, but using offset instead of range

In the end, you’d have a list of the values going across 200 cells in the first row. However! This is really slow! It’s not too noticeable if you’re doing it for 200 cells. However, if you were to read a 5k cell table like this (like a 25-question survey with 200 respondents) then you’ll definitely feel it.

Alternative
A much faster way to grab data out of Excel is whole blocks of data at a time.

1
2
3
4
5
6
7
#in interpreter
>>> data = xl.Range("A1:D4")
>>> data.Value
((1.0, None, 3.0, 4.0),
(2.0,None,None,None),
(u'asdf',u'test',u'data',4.0),
(None,None,None,u' '))

Calling the “Value” property returns you a tuple of tuples, representing your data in a block. Now you can iterate over the tuple as much as you want in Python. Also, since you’ve only made 1 (well, here 2, but you can obviously make it 1) COM call, it’s much faster. This same method works for things like the Formula property, etc.

Block iteration

One thing about pulling by blocks is that you need to remember that xl.Range() usually takes addresses in A1 notation, and won’t take R1C1. This makes iterating automatically a bit difficult. One way to get around that is to use the End(xlToRight) method of motion, which works unless you have blank cells in the way, we’ll cover that another time.

The alternative is to feed xl.Range() xl.Cells(row,col) arguments. xl.Range() will happily accept two Cells arguments and return the box with the two cells as the top-left/top-right corners.

1
2
3
4
5
6
data = xl.Range(xl.Cells(1,1), xl.Cells(4,4)).Values #the same as A1:D4
data2=[]
for i in xrange(10)
  if i%2 == 0:
    data2.append(xl.Range(xl.Cells(i,1),xl.Cells(i,200)).Value)
  #grab 200 cells for the first 5 even rows

Memory warning

You might be tempted to say “why not grab ALL the cells on the sheet at once and just worry about it in pure Python later?” Sadly, this doesn’t work. If you try calling data = xl.Cells().Value it will throw a com_error, “Not enough storage is available to complete this operation.”

I’m not exactly sure where the lack of memory is, whether in Python, Excel itself, or the COM part in between, but you can’t just pull the 65,536*256 = 16,777,216 cells out that way. It’s probably COM or Excel’s fault, because Python can certainly hold the cell data in memory if forced to (as you’ll see below).

With a bit of work, you can create a block iterator that grabs a few thousand cells at a time quickly. For example, here’s one that will grab every column, going across the sheet. You could do one for rows too, but there’s only 256 columns, and there’s over 65k rows.

data = [] #big list of columns
for i in xrange(256):
  data.append(xl.Range(xl.Cells(1,i+1), xl.Cells(65536,i+1)).Value)

Mind you, on testing, this is still a pretty slow operation, taking something like 0.5~1 second a column on my test machine. Watching the resource meter, it seems to take up a bit under 600MB of memory, even though it’s mostly full of “None”. Not pretty at all, but doable.

This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>