The Bus Problem

Disclaimer

I can neither confirm nor deny that any of the events described in this post ever occurred. What I can say is that everything I wrote is true, even the parts I made up myself.

As Pale As A Sheet

»As pale as a sheet,« I thought while watching my customers, the CEO and the project leader, who shifted nervously on their chairs. They were sitting to my left and right. Opposite me sat the lead developer of their only product. Quite unexpectedly, he had found the love of his life and decided to quit the company and relocate to a faraway place.

I was invited to interview the developer to determine what information was still missing during the handover to new employees who had recently joined the company and were not very pleased about facing an intensive phase of drinking from the firehose.

The meeting was scheduled for about an hour. It was clear that this was an emergency, so I decided to play it nice: I did not ask about software development processes or put a CMMI questionnaire on the table, but instead asked simple questions such as:

Where do you store passwords? (“Nowhere, they are all in my head.”)
How do you shut down and restart the production system? (“That’s a 30-minute procedure where you manually generate and upload new container images to Azure, but we also run some VMs, some of them on a different hoster, and for some I am not sure whether we really need them.”)
Is there an architecture diagram? (“No”)
Is there a comprehensive list of all software components that are active in production and their corresponding repositories? (“No”)
For the REST APIs in use, is there an OpenAPI specification? (“We deactivated autogeneration based on OpenAPI files and did it manually.”)
Is there any documentation about the REST APIs in use? (“There used to be a Word document, but it’s outdated. Read the code. It’s all there.”)

This went on for about three hours. »As pale as a sheet,« I thought while watching my customers, and rightly so. Luckily, I had taken notes on my laptop: approximately 15 pages of issues to be clarified.

It was apparent we had a huge bus problem. The bus factor is a measurement of the risk resulting from information and capabilities not being shared among team members, derived from the phrase “in case they get hit by a bus”.

Judging from the color of my customers’ faces, I concluded that I did not have to share the story of a good friend who suffered a stroke that led to total aphasia. He eventually recovered and regained his speech, reaching a level where his struggle to find words was only noticed by people who had known him before. The only data that never came back were all his passwords. My customers no longer needed to hear this story.

On that day, the only thing I asked for was for the developer to make me an additional administrator of all systems (of all systems). Everyone was deeply exhausted, but I felt energetic to an extent that seemed inappropriate. So I kept that inside.

A Deep Crisis Ahead

Having access to all repositories, my software archaeology project started on the same evening.

To add to the problem we were already facing, the code quality reflected the absence of any kind of product strategy, development processes, or application of best practices, let alone common sense.

The client informed me they were facing significant operational challenges. Over the years, their software stack had grown through an unstructured, ad-hoc approach, which was further exacerbated by high staff turnover. The result was a system plagued by errors and instability, leading to massive customer dissatisfaction and developer frustration.

From looking at the code, I could confirm they were in big trouble.

Nothing to Lose

Having nothing more to lose gives you some freedom to take unorthodox measures. You can only win, because if you fail, no one can blame you, given the starting point where you joined.

I had been there before (creating a working software solution in ten weeks for a piece of software that a team of five had failed to deliver within three years), but this time I perceived the task to be different. This was not a pure rescue action; it was more. I had the responsibility to create a genuinely sustainable solution, and for that, I had to develop a vision for how to enable the team. It turned out that the person who left was challenging to communicate with (surprise, surprise). Some team members had lost all hope, but to me, they appeared to be “good citizens” who only needed someone to give them a little direction and hope.

Since other developers had very recently joined, we had a considerable opportunity to form a new organization and set some things straight from the beginning.

The next morning, I called the CEO and asked whether the current system was stable enough not to fail immediately. He confirmed that the house of cards the company was built on showed some singularity in time and space and could keep running if (and only if) someone restarted the whole system from time to time - as we had already learned, a tedious and manual process.

That was the point in our discussion where I requested a complete halt to all development activities. I needed the whole team to pursue my idea. The plan that I was going to propose was as follows:

Collect all accounts and passwords and put them into a password database.
Document how the production system could be shut down and restarted. The definition of done was along these lines: the task was complete when a person who had never heard about software development and had zero technical background could perform it simply by reading the manual. We already had such a person on board: the CEO.
Completely document the source code repository structure, i.e., add a README to every repository that explains what the code was meant for, its maturity level, and whether it was actually part of the production system.
Completely document the current system and all software components.
Completely document the REST APIs in a machine-readable format, preferably using OpenAPI v3 documents under version control.
Create some kind of architecture documentation.
Perform a deep source code analysis and make an improvement plan.
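The README item in the plan above is the kind of task that is easy to check mechanically. The following is a minimal sketch, not what the team actually used. It assumes all repositories have been cloned side by side into one local directory, and the function name `repos_missing_readme` and the set of accepted file names are hypothetical choices for illustration.

```python
from pathlib import Path

# File names accepted as a README (an assumption; adjust to your own conventions).
README_NAMES = {"readme", "readme.md", "readme.txt", "readme.rst"}


def repos_missing_readme(root: Path) -> list[str]:
    """Return the names of Git checkouts directly under `root` that lack a README."""
    missing = []
    for repo in sorted(p for p in root.iterdir() if p.is_dir()):
        if not (repo / ".git").exists():
            continue  # not a Git checkout, so it is not part of the audit
        file_names = {f.name.lower() for f in repo.iterdir() if f.is_file()}
        if not file_names & README_NAMES:
            missing.append(repo.name)
    return missing
```

Calling `repos_missing_readme(Path("~/work/customer").expanduser())` (a hypothetical path) would return the list of repositories that still need a README. The check only proves that the file exists, not that it explains purpose, maturity level, and production status; that part still needs a human review.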

Luckily, I got a green light on the very same day.

Wiki Forever

To enable the team, I provided clear direction on how to work collaboratively on the documentation.

After more than three decades in software development, I have reached the conclusion that the tools a company adopts not only reflect its culture but also actively shape it over time. I considered the choice of tools to be critical and was eager to find tooling that did not require a significant amount of education and was easy to use.

The first step was to create a new Git repository dedicated to holding all our documentation. I believed the accessibility and ease of use of the chosen documentation tool were crucial for our success, so I decided to go with a GitHub Wiki inside a GitHub repository.

The plan was to put all files, such as API definition files, under version control and subject them to a review process via pull requests, and to utilize wiki pages for all regular documentation. The intuitive graphical user interface of the GitHub Wiki and the minimal set of markdown features available for generating documentation helped to have everyone clearly instructed within a couple of hours.

Because a GitHub wiki is essentially a Git repository containing Markdown files, every change you make is a commit that you can (re-)view. So, late at night, when no one was editing, I was able to pull the whole wiki onto my disk, perform some editorial work (cleaning up, fixing formatting errors, creating consistency, reorganizing documents), and push all changes back, so that everyone would find a clean environment the next morning.
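For readers who have never cloned a wiki: GitHub exposes each wiki as its own repository, with `.wiki.git` appended to the repository name. The sketch below shows the pull, edit, and push cycle under that convention. The helper names are hypothetical, the owner and repository are placeholders, and it assumes `git` is installed and already authenticated.

```python
import subprocess
from typing import Optional


def wiki_clone_url(owner: str, repo: str) -> str:
    """Build the HTTPS clone URL of the wiki that belongs to a GitHub repository."""
    return f"https://github.com/{owner}/{repo}.wiki.git"


def run_git(*args: str, cwd: Optional[str] = None) -> None:
    """Run a git command and raise CalledProcessError if it fails."""
    subprocess.run(["git", *args], cwd=cwd, check=True)


def clone_wiki(owner: str, repo: str, target_dir: str) -> None:
    """Clone the wiki into a local directory for offline editing."""
    run_git("clone", wiki_clone_url(owner, repo), target_dir)


def publish_changes(target_dir: str, message: str) -> None:
    """Stage, commit, and push all local edits back to the wiki.

    Note: `git commit` exits with an error when there is nothing to commit,
    so this raises if no file was changed.
    """
    run_git("add", "--all", cwd=target_dir)
    run_git("commit", "-m", message, cwd=target_dir)
    run_git("push", cwd=target_dir)
```

A nightly session would then be `clone_wiki("example-org", "docs", "docs.wiki")`, followed by manual edits in any text editor and `publish_changes("docs.wiki", "Editorial cleanup")`. Doing this while nobody else is editing keeps merge conflicts unlikely.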

Markdown has the advantage that it is easy to learn, allowing people to focus on content instead of wasting time on formatting issues. Based on the immediate feedback the team received from each other and from me, best practices for editing emerged within days.

Success is a Team Effort

Over the following two weeks, we scheduled video conferences where the “bus problem” developer presented the entire system landscape and performed a live session for the shutdown and restart of the production system. The team then worked through the video recording and practiced the procedure while he was watching until they were able to do it themselves. Then they put that information into the wiki.

In the meantime, I had my own interview sessions with the individual and provided him with a list of information I would like to obtain. So he dug out some office documents from somewhere on his laptop (no backup) and shared them on a Google Drive. I collected the data and converted everything to new wiki pages using the Swiss Army Knife of document format conversions, pandoc.
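A batch conversion like this can be scripted in a few lines. The following is a sketch under assumptions rather than the script I actually ran: it assumes the office documents are `.docx` files, that `pandoc` is on the `PATH`, and that GitHub-flavored Markdown (`gfm`) is the desired target format. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def pandoc_command(source: Path, wiki_dir: Path) -> list[str]:
    """Build the pandoc call that converts one .docx file into a wiki page."""
    target = wiki_dir / (source.stem + ".md")
    return [
        "pandoc",
        "--from", "docx",   # input format (assumed: Word documents)
        "--to", "gfm",      # GitHub-flavored Markdown
        "--output", str(target),
        str(source),
    ]


def convert_all(source_dir: Path, wiki_dir: Path) -> None:
    """Convert every .docx file in source_dir into a Markdown page in wiki_dir."""
    for source in sorted(source_dir.glob("*.docx")):
        subprocess.run(pandoc_command(source, wiki_dir), check=True)
```

The converted pages usually still need manual cleanup, for example for tables and embedded images, but the conversion gets the bulk of the text into version control quickly.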

Initially, some team members were hesitant to use a wiki instead of a traditional text processor, but as the documentation evolved and they discovered how easy it was to correct errors, refine the text, and link to other sections, they became more comfortable with it.

Initially, it was I who, during meetings where people verbally shared new findings about the system, asked, “Have you put that piece of information in the wiki?” and persuaded them to share all their knowledge there, regardless of its importance. A couple of weeks later, the project manager called me and told me that I was no longer required to attend all knowledge exchange meetings: “You know what?” he said, “They start to ask each other exactly your question and they really mean it and they are really motivated and behind it all day long.”

I watched the wiki grow in size and quality during the following weeks.

After several internal tests with the project manager, they finally held a session where the CEO shared his screen, going through the Wiki document that explained how to shut down and restart the production system. It was a great success.

My contribution to all this was a set of 70 action items for the next two years to further clean up the codebase and introduce proper Testing, Continuous Integration, and Continuous Deployment, as well as complete automation and code generation from interface specifications.

Together with a vast number of documentation tasks, we created tickets for each of them and assigned them to the appropriate team members. All was clearly visible in the cross-repository Project view, featuring a Kanban board which could be filtered for certain aspects and per assignee.

That was helpful for our daily standup. Yes, it came naturally to introduce agile best practices. I did not tell them it was Agile, though. Leave out the buzzwords, find out what works.

The next day, I got a call from my customer. They said:

We always had a bad feeling about the whole system, but we couldn’t clearly identify what it was. We were also quite unhappy with the way our work was organized, as no one had an overview of what was happening within the team.

You opened our eyes and gave us a clear vision of how to go on. And this lean project management tool is quite cool.

Reflection

The person who left the company was actually quite intelligent. Yet, the work products he left behind were a real mess. I have questions about the overall system’s architecture. The code quality was not at the level it could have been.

So, what had happened here? What led to the mess? Why was there no structure? Why did they fail to organize themselves? And, since I am as German as German can be, the most important question is: Who is to blame for this? 😉

Neil Kandalgaonkar once put it this way¹:

Some people blame incompetent developers for cursed code. But the most cursed code is caused by highly competent developers. You just have to give them a task in a team that has the right kind of organizational dysfunctions - poor communication, power imbalances, diva designers or managers, unclear decisions. Where the incompetent developer would just fail, the highly competent developer will deliver an eldritch monstrosity - that works. Just don’t ask how it works.

This statement is spot on. In the case discussed here, I think that several factors contributed to the situation:

A lack of true leadership:

After all, it is always about the management. Management is to blame for everything. Really, I mean it. If managers fail to communicate purpose, prioritize, provide direction, and have an overview of what is happening within their own company, then they are doomed. Highly skilled developers can only partly compensate for a lack of true leadership.

It is worth noting that the CEO was a person who was very reluctant to make decisions while everyone was still waiting for his advice and input.

A lack of experience:

If you do not know how to set up a development process, if you have never learned about best practices for project management or for how to keep track of all tasks, then how are you going to survive facing a constant stream of requirements changes, incidents in the field, and trouble in the production system?

It also becomes problematic when people who have never developed software and never engineered a system are the ones who decide how the work is organized. I disagree with those who think that good managers are not required to know the problem domain the engineering team is addressing.

For software, many good ideas on how to set up a development process emerged over the past decades. Except Scrum. Simply do not use Scrum. Using it only reveals your lack of experience². This is not the future you are looking for³.

A lack of appropriate tooling:

As outlined in another post, tools whose administrative overhead creates more work than their feature set saves can ruin your day. It is incredible to watch people have their aha moment while using reductionist tools like the GitHub ticket system and its lean project views instead of a full-blown JIRA/Confluence instance or office suite. It always makes my day when people who used to drown in a never-ending stream of e-mail threads and a mass of documents on a shared drive finally learn that information can be shared in small text snippets under version control. I celebrate when they finally grasp it and fully understand the benefits of centralized knowledge management.

This is Not Real

Are you sure?
Do you make backups?
Can you recover data from the backup you made two months ago?
When was the last time you did this test?
What happens if the data center gets flooded?
Do you host a centralized password database with dedicated user access control and clear instructions about every password being stored exactly there?
Do you have a backup of your centralized password database?
What happens if the data center gets flooded? Do you still have a backup of your centralized password database?
Did you make sure the password database backup is encrypted? Who holds the key to decrypting it? What happens when she gets hit by a bus?
How many different documentation systems do you run in parallel?
Is your wiki up-to-date?
Is every artifact the team has ever created documented?
Do you have a centralized overview of who is working on what in your company and why?
How large is the portion of maintenance effort that goes into the tools you use?
Do you have a development process?
Do you think your development process is efficient?

Meta Questions

Do you put individuals and interactions over processes and tools?
Do you put working software over comprehensive documentation?

Final Question

Would you like me to come to your company and ask some more questions about how you organize your work until you are as pale as a sheet?

¹ Posted here on Nov 28, 2023, 02:03 AM.
² Take it with a grain of salt.
³ TODO: add an animated GIF of a Jedi master waving his hand.