Previously, I had a quick overview of extracting data from Excel files using the python xlrd module. While that’s the recommended way of extracting data that you need from Excel data files you receive, it’s not the way to make interactive programs. For that, you’re likely to want to pull data out in real time. This brings us to win32 extensions and COM.
Named ranges are useful things, especially for writing formulas that need to refer to the same block of cells constantly. You can easily set them by hand just by selecting the range you’re interested in, and then clicking the box to the left to the formula entry bar, and type in the name of your range.
On the programming side, named ranges are a bit more complicated to work with. You can easily refer to a range, just set the formula attribute to “=name” and that’s it. myRange.Formula=”=myName” for VBA, range.formula = ‘=myName’ for python via win32 COM once you have ‘range’ storing a range object.
Closing Excel from Python using COM seems pretty simple, but there’s a few little caveats to remember. Continue reading
Everyone knows about the “A1 style” notation for referring to cells in Excel. It’s what the labels on the edges of the worksheet are after all, and it’s the most natural thing to work with if you’re using the standard interface.
However, there’s a second style of notation, the R1C1 style. It’s not commonly known because it’s not even turned on normally, and it’s much more difficult to use intuitively. Continue reading
Comma Separated Values, CSV, is the most basic way to export data out of just about any program that handles data in table format, from databases to spread sheets. The problem is that CSV isn’t standardized, so different programs can interpret what “CSV” is supposed to mean in slightly different ways, which winds up doing bad things to your data.
Excel’s data format is everywhere in the business world, and more often than not, you’ll wind up getting chunks of data from people dumped into an Excel file. Sure, CSV tends to be the “lowest common denominator” exchange format for data between software, but that’s quirky for its own reasons.
One way to pull data off Excel using Python is to use the win32 extensions and go through COM. However, this requires you to have a copy of Excel installed, and isn’t all that fast if you’re doing large batches of data grabbing.
The other way is to use the wonderful xlrd package. xlrd is great because it’s pure python, can run anywhere Python 2.1 and higher runs. It handles many of the quirks of the data format (such as the date epochs, encodings). Finally, perhaps most important for people in the business world, it’s BSD licensed, so we can include it in any products we make without running into any GPL conflicts. Continue reading
One of the things that can confuse a lot of people who are new to data analysis is the concept of “data is data.” That is to say, so long as I have the data, the topic that I am reporting on does not matter from the standpoint of reporting. It doesn’t matter if I am reporting on income or velocity, revenue or shuttle launches.
At a previous job I would be called on to demo the <acronym title=”Online Analaytical Processing”>OLAP</acronym> software that our company configured. In these demos I might show a report of customers by revenue vs. income and get asked “Can you this with total orders” or “can you do that by business region instead of income?”
Sure. Data is data. If you have the data that supports that kind of breakdown, then most software can do the analysis. It doesn’t know that this is “income”: it only knows that it is a number associated to a field in a database.
Context is mostly necessary for interpretation. Knowing whether a number is in miles-per-hour, degrees Kelvin, or revenue per customer gives us an idea of how to interpret the data and indicates what kind of reports might be appropriate, but the mechanics in the background of reporting are not going to be all that different.
The main question then becomes “what kind of reports does this data support.” This is a harder question and one I will go more in depth with at a later time.
Working with Excel is all about manipulating the data and relationships between cells, and there are lots of ways to do such a thing. Here’s a tour of a few of those variations.
One of the problems we frequently encounter in business is that our data is not as clean as we might want it to be. We are plagued by duplicates, fields that are formatted incorrectly or in the wrong place, needed values that are missing, outdated information, or even data that is flat-out wrong.
This becomes an especially prevalent problem for companies that have to deal with other companies data sets either through consulting or through mergers, but it can happen anywhere due to user error, software glitches, imported lists for marketing, or miscommunication.
In one case, a coworker ended up with his weight in the database coming out as “blue”–his eye color. In another case working with the D&B (Dunn and Bradstreet) business intelligence services, information would be drawn for a company at one point in time and then fall out of date as D&B updated their records.
In another case, the company was using region codes from several different sources and didn’t have a master key to sort them. To this day I still don’t know where “OH” is in Japan.
There are also fake records that come in from various sources. At one company I saw a record for “Doc Hudson from Radiator Springs</a>” and another for “Utopia Planitia, Mars.” These kinds of problems are particularly prevalent for sites that require registration before use.
Cleaning Data is going to be an ongoing series about taking care of these and other related problems.
Manipulating Excel with Python feels very much like manipulating Excel from within the VBA language. In fact, it uses the same objects, and the only major differences are that you don’t have access to autocompletion, and the syntax for function calls look a bit more ‘pythonic’.