One of my hobbies is working as an amateur translator for scripts (of the textual kind) and assorted short stories. Since I move between machines often, it just makes sense to keep everything in Subversion. Of course, when you have a Subversion repository lying around, you start wondering what sort of data is lurking in it, right?
Since I’m working primarily in plain text files, I figured it would be interesting to look at the number of lines added/deleted/changed for my files. The statistics make a certain amount of sense, since the amount of work completed is roughly proportional to the number of lines touched. This isn’t so true for source code, where I could spend 5 days on 20 lines of code and 15 minutes on another 50 lines.
Method
The basic idea of this project is simple: 1) grab all the incremental diffs for a file, 2) parse the diffs, counting the features that I want, 3) generate some statistics.
The core logic of the script was borrowed in part from Tigris’s contributed tools page, where svn_all_diffs.pl can be found. It’s a little Perl script that gets all the diffs of a given file. I didn’t really want to use Perl, and it was more convenient to have one script than two, so I extracted the logic and folded it into a single Python script.
(1) Figure out what revisions are needed for the file in question. This just involves running svn log through os.popen() and reading its output.
    import os

    #--------------------------Sample revision------------------------------
    #r140 | Randy | 2007-09-18 01:21:40 -0400 (Tue, 18 Sep 2007) | 1 line
    #
    #Neeed Sleeeeeeep.....
    #------------------------------------------------------------------------
    def get_svn_log(filename, revision_time, author):
        """Executes svn log for the given file, building a list of revisions to work with.
        Returns a list of revisions [r#, user, timestamp, linelength]
        and records each revision's timestamp in the revision_time dict.
        """
        rev_list = []
        reader = os.popen('svn log ' + filename)
        for line in reader:
            fields = line.split('|')
            if len(fields) == 4:  # only the header line of a log entry has four fields
                # check author strings in lower case
                if author.lower() == fields[1].strip().lower():
                    rev_list.append(fields)  # glue it onto the list of revisions
                    revision_time[fields[0].strip()] = fields[2]  # r# -> timestamp
        reader.close()
        return rev_list
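To see what the four-field check is doing without needing a repository, here is a minimal sketch that runs the same split on the sample header from the comment above. The helper name parse_log_header is made up for illustration and isn't part of the original script.

```python
# The sample header line from the comment in get_svn_log.
SAMPLE_HEADER = "r140 | Randy | 2007-09-18 01:21:40 -0400 (Tue, 18 Sep 2007) | 1 line"


def parse_log_header(line):
    """Split an svn log header into (revision, author, timestamp, line_info).

    Returns None for lines that don't have exactly four '|'-separated
    fields, such as commit messages and the dashed separator lines.
    (Hypothetical helper, for illustration only.)
    """
    fields = [field.strip() for field in line.split('|')]
    if len(fields) != 4:
        return None
    return tuple(fields)


header = parse_log_header(SAMPLE_HEADER)
message = parse_log_header("Neeed Sleeeeeeep.....")
print(header)
print(message)
```

One thing to watch out for: a commit message line that happens to contain exactly three '|' characters would also pass this check, so it is a heuristic rather than a real parser.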
(2) Next, taking that list of revisions, we fetch the diffs from the server (again with os.popen()) and parse the output as needed.
    def process_revisions(revisions, filename, author, revision_time):
        """Takes the revision list for a single filename and parses the diffs generated by svn diff.
        Returns a list of tuples (days from start of file, line count counter, change history counter)."""
        revisions.reverse()  # list is from newest -> oldest revision, we want to reverse that
        older_rev = None
        newer_rev = None
        output = []  # list of tuples that we'll be using to plot
        change_history = []  # stores (date, number of english lines modified (counts +^))
        line_count = []  # stores (date, number of lines added to a file (+1 for + -1 for -))
        date_start = 0
        for revnum in revisions:  # revisions comes sorted: older items first
            older_rev = newer_rev
            newer_rev = revnum[0].strip()
            # shifting buffer - it would've been more elegant to do this with slices,
            # but this pattern happened to jump to mind quicker at the time
            if older_rev is not None:
                print('handling: %s:%s' % (older_rev, newer_rev))
                # need to drop the 'r' in the r### string
                reader = os.popen('svn diff -r' + older_rev[1:] + ':' + newer_rev[1:] + ' ' + filename)
                line_count_counter = 0
                change_history_counter = 0
                for line in reader:
                    # the specific logic for the line parsers is here, and isn't particularly important
                    line_count_counter += parse_line_count(line)
                    change_history_counter += parse_change_hist(line)
                reader.close()
                # counting done
                # To generate the 'days from start', we'll have to compare the dates in the
                # two revisions with a rather ugly conversion function
                time_diff = convert_timestamp(revision_time[newer_rev]) - convert_timestamp(revision_time[older_rev])
                date_start += time_diff.days
                output.append((date_start, line_count_counter, change_history_counter))
        # all revisions done
        return output  # (days from start, line count counter, change history counter)
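The line parsers aren't shown in this post. Going only by the comment above (+1 for +, -1 for -), one plausible version of parse_line_count might look like the sketch below. Treat it as a guess at the idea rather than the code from the actual script; the diff text and the filename story.txt are invented for the example.

```python
def parse_line_count(line):
    """Return +1 for an added line, -1 for a removed line, 0 for anything else.

    Assumption: the '--- file' / '+++ file' headers of a unified diff
    should not count as removed or added lines.
    """
    if line.startswith('--- ') or line.startswith('+++ '):
        return 0  # file header, not content
    if line.startswith('+'):
        return 1
    if line.startswith('-'):
        return -1
    return 0  # context lines, hunk markers, 'Index:' lines, etc.


# A made-up diff in the shape svn diff produces.
SAMPLE_DIFF = """\
Index: story.txt
===================================================================
--- story.txt\t(revision 139)
+++ story.txt\t(revision 140)
@@ -1,3 +1,4 @@
 First line.
-Old second line.
+New second line.
+Brand new third line.
 Last line.
"""

net_change = sum(parse_line_count(line) for line in SAMPLE_DIFF.splitlines())
print(net_change)
```

The header check is a heuristic: a removed line of text that itself starts with "-- " shows up in the diff as "--- ..." and would be skipped. For translated prose that is rare enough not to matter much.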
(3) Handle time differences using a small function to get the difference between two revisions in days.
This is an ugly hack, a very ugly one. At the very least, it should use a proper time.strptime(string, format) call instead of string splitting… Oh well, as it stands it works.
    import datetime

    def convert_timestamp(stamp):
        """Returns a datetime.date object built from an svn log timestamp."""
        # Really, this should be implemented using time.strptime()
        date_shards = stamp.split()              # date, time, UTC offset, ...
        date_shards = date_shards[0].split('-')  # year, month, day as strings
        # int() copes with leading zeros, so '09' becomes 9 without any stripping
        processed_date = datetime.date(int(date_shards[0]), int(date_shards[1]), int(date_shards[2]))
        return processed_date
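For comparison, a strptime-based version could look something like the sketch below. It uses datetime.datetime.strptime rather than time.strptime, since that hands back a datetime directly, and it only parses the leading date because the day count is all that matters here. The function name convert_timestamp_strptime and the older timestamp are invented for the example.

```python
import datetime


def convert_timestamp_strptime(stamp):
    """Return a datetime.date parsed from an svn log timestamp field.

    Only the leading YYYY-MM-DD part is parsed; the time of day and
    UTC offset are ignored, matching what the split-based version does.
    """
    date_part = stamp.split()[0]
    return datetime.datetime.strptime(date_part, '%Y-%m-%d').date()


# The newer stamp is the sample from the log above; the older one is made up.
older = convert_timestamp_strptime(' 2007-09-16 22:10:05 -0400 (Sun, 16 Sep 2007) ')
newer = convert_timestamp_strptime(' 2007-09-18 01:21:40 -0400 (Tue, 18 Sep 2007) ')
days_apart = (newer - older).days
print(days_apart)
```

Subtracting two date objects gives a timedelta, so .days works the same way as it does in process_revisions.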
(4) Analyze the returned data and plop down some charts, etc. I’ll save this for another post, since it gets involved with matplotlib and the like.