Python, Subversion, and revision diffs

One of my hobbies is working as an amateur translator for scripts (of the textual kind) and assorted short stories. Since I move around between different machines often, it just makes sense to put everything in Subversion. Of course, when you have a Subversion repository lying around, you start wondering what sort of data is lurking in it, right?

Since I’m working primarily in plain text files, I figured it would be interesting to look at the number of lines added/deleted/changed for my files. The statistics make a certain amount of sense here, since the amount of work completed is roughly proportional to the number of lines touched. This isn’t so true for source code, where I could spend 5 days on 20 lines of code, and 15 minutes on another 50 lines.

Method

The basic idea of this project is simple: 1) grab all the incremental diffs for a file, 2) parse the diffs, counting the features that I want, and 3) generate some statistics.

The core logic of the script was borrowed in part from Tigris’s contributed tools page, where svn_all_diffs.pl can be found. It’s a little Perl script that gets all the diffs of a given file. I didn’t really want to use Perl, and it was more convenient to have one script than two, so I extracted the logic and rolled it into a single Python script.
(1) Figure out what revisions exist for the file in question. This just involves running svn log through os.popen() and parsing the output.

import os

#--------------------------Sample revision------------------------------
#r140 | Randy | 2007-09-18 01:21:40 -0400 (Tue, 18 Sep 2007) | 1 line
#
#Neeed Sleeeeeeep.....
#------------------------------------------------------------------------

def get_svn_log(filename, revision_time, author):
    """Executes svn log for the file involved, building a list of revisions to work with.
    Returns a list of revisions [r#, user, timestamp, line length]."""
    rev_list = []
    reader = os.popen('svn log ' + filename)
    for line in reader:
        fields = line.split('|')
        if len(fields) == 4:
            #check author strings in lower case
            if author.lower() == fields[1].strip().lower():
                #glue it onto the list of revisions
                rev_list.append(fields)
                #also remember when each revision happened, keyed by the 'r###' string
                revision_time[fields[0].strip()] = fields[2]
    reader.close()
    return rev_list
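
For reference, calling it looks something like this (the filename here is just a placeholder; the author string is whatever svn log prints, e.g. 'Randy' in the sample above):

#Hypothetical usage -- 'chapter01.txt' is a placeholder filename
revision_time = {}
revisions = get_svn_log('chapter01.txt', revision_time, 'Randy')
#revisions comes back newest-first; revision_time maps 'r###' -> timestamp string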

(2) Next, taking that list of revisions, we start fetching the diffs from the server (again with os.popen()) then parse through the output as needed.

def process_revisions(revisions, filename, author, revision_time):
    """Takes the revision list for a single filename and parses the diffs generated by svn diff.
    Returns a list of tuples (days from start of file, line count counter, change history counter)."""
    revisions.reverse()  #list is from newest -> oldest revision, we want to reverse that
    older_rev = None
    newer_rev = None
    output = []  #list of tuples that we'll be using to plot
    change_history = []  #stores (date, number of english lines modified)
    line_count = []  #stores (date, number of lines added to a file (+1 for +, -1 for -))
    date_start = 0
    for revnum in revisions:
        #revisions comes sorted: older items first
        #shifting buffer - it would've been more elegant to do this with slices,
        #but this pattern happened to jump to mind quicker at the time
        older_rev = newer_rev
        newer_rev = revnum[0].strip()

        if older_rev != None:
            print 'handling: ', older_rev, ':', newer_rev
            #need to drop the 'r' in the r### string
            reader = os.popen('svn diff -r' + older_rev[1:] + ':' + newer_rev[1:] + ' ' + filename)

            line_count_counter = 0
            change_history_counter = 0
            for line in reader:
                #the specific logic for the line parsers lives here, and isn't particularly important
                line_count_counter += parse_line_count(line)
                change_history_counter += parse_change_hist(line)
            reader.close()
            #counting done

            #To generate the 'days from start', compare the dates of the two revisions
            #with a rather ugly conversion function
            time_diff = convert_timestamp(revision_time[newer_rev]) - convert_timestamp(revision_time[older_rev])
            date_start += time_diff.days
            output.append((date_start, line_count_counter, change_history_counter))
    #all revisions done
    return output  #(days from start, line count counter, change history counter)
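
The two parse_* helpers aren't shown since their internals aren't particularly interesting, but for completeness, a rough sketch matching the counters described above (net growth for line_count, total lines touched for change_history) would look like this -- not necessarily the exact logic I used:

def parse_line_count(line):
    """Sketch: net line growth -- +1 for an added diff line, -1 for a removed one.
    Skips the '+++'/'---' file header lines that svn diff emits."""
    if line.startswith('+') and not line.startswith('+++'):
        return 1
    if line.startswith('-') and not line.startswith('---'):
        return -1
    return 0

def parse_change_hist(line):
    """Sketch: counts every line touched (added or removed) as one modification."""
    if line.startswith('+++') or line.startswith('---'):
        return 0
    if line.startswith('+') or line.startswith('-'):
        return 1
    return 0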

(3) Handle time differences using a small function to get the difference between two revisions in days.

This is an ugly hack, a very ugly one. At the very least, it should use a proper time.strptime(string, format) call instead of string splitting… Oh well, as it stands it works.

import datetime

def convert_timestamp(stamp):
    """Returns a datetime.date object from an svn log timestamp."""
    #Really, this should be implemented using time.strptime()
    date_shards = stamp.split()             #'2007-09-18 01:21:40 -0400 (...)' -> grab the date chunk
    date_shards = date_shards[0].split('-')
    date_shards = [x.lstrip('0') for x in date_shards]
    processed_date = datetime.date(int(date_shards[0]), int(date_shards[1]), int(date_shards[2]))
    return processed_date
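
If I ever get around to cleaning it up, the strptime-based version would be something like the sketch below (assuming the timestamp starts with a YYYY-MM-DD date, as in the svn log output above):

import datetime, time

def convert_timestamp_strptime(stamp):
    """Sketch of the cleaner version: let time.strptime() parse the leading date."""
    parsed = time.strptime(stamp.split()[0], '%Y-%m-%d')
    return datetime.date(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)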

(4) Analyze the returned data, and plop down some charts, etc. I’ll save this for another post, since it gets involved with matplotlib and the like.
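
For completeness, wiring the pieces above together ends up looking roughly like this (a sketch only -- the filename and author are placeholders, and the plotting is left for the follow-up post):

if __name__ == '__main__':
    filename = 'chapter01.txt'   #placeholder filename
    author = 'Randy'             #author string as it appears in svn log
    revision_time = {}
    revisions = get_svn_log(filename, revision_time, author)
    results = process_revisions(revisions, filename, author, revision_time)
    for days, net_lines, lines_touched in results:
        print days, net_lines, lines_touched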
