Your Code as a Crime Scene

Author: Adam Tornhill

I picked up this book thinking it would teach me techniques for debugging code better. It turned out to be a completely different book, and I am glad I read it.

The book is about finding defects, complexity, and decay in a codebase by borrowing the forensic techniques investigators use to find criminals.

You can get it from Amazon.

Throughout the book, the author uses Code Maat, a tool he created himself.

He even uses Code Maat to find defects in Code Maat's own codebase.

The idea in the book is to master forensic techniques rather than the tools. Don’t fall for the tools.

I recommend reading this book because it’s very short and provides step-by-step guidance. However, if you just want an overview, watch the video below.

Let’s talk about some forensic techniques used to find defects in the codebase, but first, we need some tools.

Tools

A version control system (VCS) such as Git is one of the most important tools for investigating a codebase.

Before reading this book, I thought of a VCS as just a tool for maintaining history, collaborating with a team, and easily reverting changes when something goes wrong.

Using code history to find defects and decay with forensic crime investigation techniques was beyond my imagination.

We don’t need to worry about how to collect this data. The author has already created scripts to format the output in CSV and some scripts to merge different outputs for different analyses.

Finding the hotspot

There will be multiple hotspots around the place where a crime is committed. These hotspots are derived from the crime locations combined with other data, such as connecting roads and shops. (In most criminal investigations, the criminal turned out to be based near the crime locations. Jack the Ripper is one example.)

We can use the same techniques with the code history.

The first thing we need is a map of the area, i.e., the time range of the code history. The time frame we choose is really important because, over time, our development focus shifts and the hotspots shift with it. Similarly, as design issues get resolved, hotspots cool down.

So we need to choose the range wisely.

The author suggests starting with a smaller range, for example the past six months.

We can convert the Git history to a CSV file showing the number of revisions of each file.

entity,                 n-revs
commom/InfoUtils.java,  60
BarChart.java,          30
route/Page.java,        27
...
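The book's scripts already produce this kind of file, so the following is only a minimal Python sketch of the idea, not the author's tool. It assumes the input is the text printed by `git log --name-only --pretty=format:`, where each commit lists the files it touched. The function names and the sample paths in the usage note are made up for illustration.

```python
import csv
import io
import subprocess
from collections import Counter


def read_git_log(since="6 months ago"):
    """Return the file names touched by each commit (must run inside a Git repo)."""
    result = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def count_revisions(log_text):
    """Count how many commits touched each file; blank lines are skipped."""
    return Counter(line.strip() for line in log_text.splitlines() if line.strip())


def to_csv(revisions):
    """Render the counts as an `entity,n-revs` CSV, most revised file first."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["entity", "n-revs"])
    for entity, n_revs in revisions.most_common():
        writer.writerow([entity, n_revs])
    return out.getvalue()
```

Inside a repository, `print(to_csv(count_revisions(read_git_log())))` would print the table for the past six months. The parsing step can also be tried on any pasted log text, which makes it easy to experiment with different time ranges.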

We can also merge this CSV with the number of lines of code for each file.

Why do we choose revisions and lines of code? Because measuring change frequencies is based on the idea that code that has changed in the past is likely to change again, and lines of code give a rough, language-independent measure of how complex a file is.

Keep in mind that this data won’t be accurate in all cases. For example, a Config.java file might have a high number of revisions and many lines of code simply because config files tend to change a lot, and that does not indicate a defect. The same can happen with generated code.

So use your domain knowledge of the codebase while analyzing the data.
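To make the merge and the false-positive filtering concrete, here is a small Python sketch. It is an assumption-laden illustration: the `loc` column name, the sample numbers, the ignore patterns, and the choice to rank by revisions first and lines of code second are all mine, not taken from the book.

```python
import csv
import io
from fnmatch import fnmatch

# Hypothetical sample data in the same shape as the revisions CSV.
REVISIONS_CSV = """entity, n-revs
Config.java, 80
InfoUtils.java, 60
BarChart.java, 30
Removed.java, 5
"""

LINES_OF_CODE_CSV = """entity, loc
Config.java, 900
InfoUtils.java, 1200
BarChart.java, 200
"""


def load_counts(csv_text, value_column):
    """Read an `entity,<value_column>` CSV into a {entity: int} dict."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return {row["entity"]: int(row[value_column]) for row in reader}


def rank_hotspots(revisions, lines_of_code, ignore=()):
    """Merge both measures and sort by revisions, then by size."""
    rows = []
    for entity, n_revs in revisions.items():
        if entity not in lines_of_code:
            continue  # file no longer exists in the current tree
        if any(fnmatch(entity, pattern) for pattern in ignore):
            continue  # known false positive, e.g. config or generated code
        rows.append((entity, n_revs, lines_of_code[entity]))
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)


ranked = rank_hotspots(
    load_counts(REVISIONS_CSV, "n-revs"),
    load_counts(LINES_OF_CODE_CSV, "loc"),
    ignore=["*Config.java"],
)
```

The `ignore` list is where domain knowledge enters: the numbers alone cannot tell a busy config file from a real problem, so someone who knows the codebase has to decide which patterns to exclude.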

We can also use visualization tools like D3 charts to get a good visual representation.

code_age_sample.png

Dissect Your Architecture

You can also use code history to identify potential design or architectural issues.

For example, a file that changes together with other files in almost every commit can indicate tight coupling. Why do these files always change together? Are they related to each other?

In some cases, files that change together can be a good sign. For example, a test file that changes with its prod file is a good indicator that the test and prod code are in sync and up to date.

But if the files are unrelated, then it’s an issue.

One way to figure this out is to check the file names or their contents against your domain knowledge. If the names suggest completely unrelated responsibilities, that may be a coupling issue.

Identifying those files lets you build a safety net around them by writing or improving tests, and then refactor them to remove the coupling.
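As a rough illustration of how files that change together can be found, the Python sketch below counts how often each pair of files appears in the same commit. It assumes log text in which commits are separated by blank lines. The coupling formula (shared commits divided by the average number of revisions of the two files) is one reasonable choice and should be read as my assumption, as are the function names and sample file names.

```python
import re
from collections import Counter
from itertools import combinations


def parse_commits(log_text):
    """Split log text into one set of file names per commit."""
    commits = []
    for chunk in re.split(r"\n\s*\n", log_text.strip()):
        files = {line.strip() for line in chunk.splitlines() if line.strip()}
        if files:
            commits.append(files)
    return commits


def change_coupling(commits, min_shared=2):
    """Return (file_a, file_b, shared_commits, degree_percent), strongest first."""
    revisions = Counter()
    shared = Counter()
    for files in commits:
        revisions.update(files)
        shared.update(combinations(sorted(files), 2))
    results = []
    for (file_a, file_b), n_shared in shared.items():
        if n_shared < min_shared:
            continue  # too few joint changes to mean anything
        average_revs = (revisions[file_a] + revisions[file_b]) / 2
        degree = round(100 * n_shared / average_revs)
        results.append((file_a, file_b, n_shared, degree))
    return sorted(results, key=lambda row: row[3], reverse=True)


# Hypothetical history: four commits, blank line between commits.
SAMPLE_LOG = """Order.java
Invoice.java

Order.java
Invoice.java

Order.java
Invoice.java
Readme.md

Order.java
"""

couplings = change_coupling(parse_commits(SAMPLE_LOG))
```

A high degree between a test and its production file is expected. A high degree between files whose names have nothing in common is the case worth a closer look.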

Social Aspects of Code

After finding the hotspot or architecture problem, it’s time to figure out who is the right person for the job.

One way to identify that person is the ownership pattern: count each author's revisions of the file. The author with the most revisions probably has the most knowledge needed to refactor it.

entity,                author, author-revs, total-revs
analysis/authors.clj,     apt,           5,         10
analysis/authors.clj,     qew,           3,         10
analysis/authors.clj,      jt,           1,         10
analysis/authors.clj,     apt,           1,         10
...

This also helps when that person is no longer on the team, because the same data shows who is next in line.
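The ownership idea can be sketched in a few lines of Python. This assumes each commit has already been reduced to an (author, files) pair, for example by adding `%an` to the `git log` format. The author names, file name, and function names are hypothetical.

```python
from collections import Counter, defaultdict


def author_revisions(commits):
    """Build {file: Counter({author: revisions})} from (author, files) pairs."""
    per_file = defaultdict(Counter)
    for author, files in commits:
        for entity in files:
            per_file[entity][author] += 1
    return per_file


def ownership_ranking(per_file, entity):
    """Return (author, author_revs, total_revs) rows, main contributor first."""
    counts = per_file.get(entity, Counter())
    total = sum(counts.values())
    return [(author, n_revs, total) for author, n_revs in counts.most_common()]


# Hypothetical history for one file.
SAMPLE_COMMITS = [
    ("alice", ["billing/Invoice.java"]),
    ("bob", ["billing/Invoice.java"]),
    ("alice", ["billing/Invoice.java"]),
    ("alice", ["billing/Invoice.java"]),
    ("bob", ["billing/Invoice.java"]),
    ("carol", ["billing/Invoice.java"]),
]

ranking = ownership_ranking(author_revisions(SAMPLE_COMMITS), "billing/Invoice.java")
```

The first row is the likely main contributor and the second row is the natural fallback if the first has left the team. Revision counts are only a hint, since one large commit can carry more knowledge than several small ones.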

Some people might use this data to blame a person. But keep in mind that we are using these techniques to solve a problem rather than questioning the person.

Just remember that no matter how many innovative data analyses we have, there’s no replacement for actually talking to the rest of the team and taking an active role in the daily work. These methods just help you ask the right questions.

This data also helps us build a knowledge map of the team, which we can use to rotate people across features or improvements.

Conclusion

The book is short and easy to read. For me, it was a completely new way to see software development and find potential design flaws. Thanks to Raghunath Jawahar for the recommendation.

Now I see Git history differently. I try to manage my commits in such a way that I can use the commit data in the future to find defects and design issues.

I used these techniques on my current project and was amazed at how useful and accurate they were.

Apart from programming, this has also helped me train my brain muscles for analytical thinking.

This is also an opportunity to learn burndown charts, statistics, and stuff. After doing this you will appreciate your manager’s work more 😃

Highlighted Quotes

  1. Software development is a learning activity, and maintenance reflects what we’ve learned about the project thus far.
  2. Measuring change frequencies is based on the idea that code that has changed in the past is likely to change again.
  3. Pure text is the universal interface.
  4. Remember that hotspots reflect the probability of there being a problem, so false positives are possible.
  5. Heuristics are mental shortcuts. When we rely on them, we trade precision for simplicity. There’s always a risk that we may draw incorrect conclusions.
  6. False memories happen when we remember a situation or an event differently from how it actually looked or occurred.
  7. Patterns are more of a communication tool than a technical solution.
  8. Just remember that when you choose a technology, you also choose a culture.
  9. The relative success of any large-scale programming effort depends more on the people on the project than it does on any single technology.
  10. We just need to remind ourselves that the power of the situation is strong and often a better predictor of behavior than a person’s personality.
