Managers, Architects and Product Development: A Case Study
Recently we were asked to help a UK retail chain that wanted to replace their order management system. This application was the very heart of their business but it was old, expensive to run and was holding them back. They had a number of internal teams who were looking at the problem and asked us to develop and update a number of products that would be impacted by the change.
This would normally be an ideal project for us. We would work with the team and the users to identify what changes were needed to make them work with the new system but also explore opportunities to improve them or drop functionality that is no longer needed. Unfortunately it didn't work out that way since the project became an example of what can happen when too much time is spent planning and not enough time actually testing.
The Challenge
This retailer had over 100 stores in the UK and several in Europe. They had warehouses, distribution centres, websites, mobile applications, a call center and a wholesale offering. Their business was complex and everything came together in their central order management system. It held information about their customers, products, orders and suppliers so they could not afford for it to fail.
The central system was named System One. It was a mainframe application that was maintained by a third-party supplier. It had been in place for the past 20 years and in that time it had been modified by so many company projects that it was difficult to know everything it did. A lot of systems interacted with it using bespoke, point-to-point integrations that had been built over the years by lots of different teams. There were batch file uploads, message queues that allowed some systems to call functions and a number of SOAP-based APIs. Each project had developed its own integration without any central design.
Existing design
This environment was complex and a migration would be difficult but we were up for the challenge. This situation was not uncommon, a lot of companies find themselves with a central system that is only becoming older and more expensive so we set to work and came up with a plan.
The client had identified the new core system they wanted to use (we will call it System Two), fortunately it was built by the same vendor that maintained System One. It was a more modern, cloud based system that would do most of what the client needed out of the box. There was some development that the vendor needed to do and a long business approval process so System Two would take some time to be ready.
Our job was to support and improve all the products that were interacting with System One. The client wanted to be able to run both systems for a while since there was a lot of concern about moving to a new system. The idea was that a few stores would be migrated at a time, then some more and eventually everything would be moved. This presented a couple of challenges since data would be in two places and it would be impossible to update all the clients to be aware of individual store migrations. Despite the challenges we still felt it was the right approach because it avoided a big bang release and meant it was possible to roll back stores if there was a problem.
Team Structure
We were approached about this project by the client's Head of Programme Delivery. He knew of us from a previous client but we had not had a chance to work with him before. When we discussed the project we were told that the scope was to work with the vendor to deliver System Two and to make sure all the connected products continued to work.
There was no design or ideas about how to do this but that’s the case for most of our projects. We knew we would need to design, prototype and test a number of approaches to find one that worked. Integrating with System One would be a challenge since we knew making changes there was going to be difficult so we needed to test what was there to find reliable ways of working with it.
We were told there was an internal architecture team who would sign off on the approach and had some useful information about System One. They did not manage the application since the vendor did that, but they did have diagrams and background information on how the integrations worked.
Architecture Teams
Faced with a complex challenge the Perrio team got to work on the approach we would take. We had dealt with similar problems before, that was why we were asked to help. The team came up with a design that would allow the migration in the lowest risk way.
Facade based design
We proposed a design using API Facades. This is a well-established integration pattern that would require a set of APIs to be built that could interact with either System One or System Two. We would create a single set of modern REST APIs for all the external systems to use. Our API layer could then use the existing SOAP APIs and function calls to manage data in System One. Eventually we would begin to use System Two as well, and we would build logic into the APIs so they knew where to read and write data. The external clients would never need to know, and we could update both systems if needed.
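To make the facade idea concrete, here is a minimal sketch of the routing logic. It is an illustration only, not the code we built: `OrderFacade`, `InMemoryBackend` and the field names `store_id` and `order_id` are hypothetical, and the in-memory backends stand in for real adapters that would wrap the System One SOAP APIs and function calls or the System Two APIs.

```python
from typing import Protocol


class OrderBackend(Protocol):
    """What the facade needs from any backend (hypothetical interface)."""

    def get_order(self, order_id: str) -> dict: ...

    def save_order(self, order: dict) -> None: ...


class InMemoryBackend:
    """Stand-in for a real adapter; a real one would call System One or Two."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.orders: dict[str, dict] = {}

    def get_order(self, order_id: str) -> dict:
        return self.orders[order_id]

    def save_order(self, order: dict) -> None:
        # Store a copy so later changes by the caller do not leak in.
        self.orders[order["order_id"]] = dict(order)


class OrderFacade:
    """Single entry point for clients; decides which system holds a store."""

    def __init__(self, system_one: OrderBackend, system_two: OrderBackend) -> None:
        self.system_one = system_one
        self.system_two = system_two
        self.migrated_stores: set[str] = set()

    def migrate_store(self, store_id: str) -> None:
        # Assumption: the store's existing data has already been copied across.
        self.migrated_stores.add(store_id)

    def rollback_store(self, store_id: str) -> None:
        self.migrated_stores.discard(store_id)

    def _backend_for(self, store_id: str) -> OrderBackend:
        if store_id in self.migrated_stores:
            return self.system_two
        return self.system_one

    def save_order(self, order: dict) -> None:
        self._backend_for(order["store_id"]).save_order(order)

    def get_order(self, store_id: str, order_id: str) -> dict:
        return self._backend_for(store_id).get_order(order_id)
```

A client calls `save_order` and `get_order` in exactly the same way before and after a store is migrated. Only the facade knows which stores have moved, so a store can be migrated or rolled back without touching any client.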
The major risk was that the underlying system differences would make it impossible to expose these common APIs. To try and manage this we identified a number of data points that we would prototype. Our concern was that System One data would be inconsistent and requests would be slow. We were less worried about System Two since it was already being used by other companies at scale. We targeted a few different integration methods and data structures that we would test so we could validate the approach before committing to it.
We would have to work with the client product teams and update some products ourselves so they used our APIs rather than the current methods. Key products were identified for us to test with so that we could validate the design end to end. It would result in a clear, consistent integration approach and we could even manage issues like auditing and fault tolerance in our new APIs.
We were happy with the approach and it was clear that even if System Two was delayed, the project could deliver value as soon as the products started to use the new APIs. Our team presented the idea to the client's architecture team and we asked for input so we could refine it, identifying the system details that are not always obvious at the beginning.
The response was a shock. We were told not to interfere with the architecture team and that system designs would be provided to us in due course. We were amazed. Our team had been asked to help with this design and build, but the central architecture team was asking everyone to wait until it decided on an approach by edict.
This was a major red flag for us that we needed to deal with. After speaking to some of the senior stakeholders we got more of an insight into the politics of the organisation. We found that several of the senior project sponsors and the central architecture team wanted to follow an upfront design process, and all of them had been in the organisation for such a long time that it was hard to get past this process. There had been some previous projects that failed to replace System One and so the management team wanted to focus more on design before trying to code anything. Most people would immediately see the error there, since small, rapid iterations would be a better way to reduce risk, but it was clear they had had a bad experience in the past and were not going to listen now.
It was clear that the Head of Programme Delivery who hired us in the first place was not as senior as he had presented himself to be and the project was starting to look like an old-fashioned waterfall. As we looked around the programme more we saw an ever-increasing number of managers and planners. There were almost no engineers or designers but there were managers for both. Those managers were in charge of plans and oversight but their teams had not started yet. It was clear that the organisation had no real ability to do anything internally. They were totally dependent on outside help and covered it by having a large management tier.
Despite the concerns of the team we agreed to wait and focus on the minor upgrades that various products needed. This was a new client and we would have to build the relationships before we could help them with a more modern approach to development. In retrospect that was an error.
The design
The actual high-level design was eventually delivered several months after it was originally due and it contained a lot more assumptions and risks than we would normally have expected.
Synchronizer design
The key feature of the design was a 2 way synchronization component that would be built by the vendors between the old and the new system. With this in place all of the client products could migrate to the APIs offered by System Two and the synchronization would allow either system to be read or written to.
Now, the prospect of a synchronization process will be terrifying to most software developers because it introduces a number of risks. Here are a few of the questions that we raised:
What happens if the synchronization fails and the two systems have different records?
What happens if there is latency in the synchronization and both systems have been updated, resulting in a conflict?
What if there is an error in the data that prevents it from being written to the other system?
How can this be tested at scale? System One had only one production-scale instance, and it was used for production, so there didn’t seem to be a way to test the synchronization until enough stores had migrated, by which point it would be too late to change.
What was the rollback strategy? Once data was being written to System Two it would be impossible to undo the synchronization without the updated clients being rolled back.
All the integration was still point to point so the organisation was locked into System Two. What would they do if they wanted to change in the future?
There were a lot of other concerns but the core of the problem was that everything depended on this component working quickly and correctly 100% of the time. And, of course, this component did not exist yet. The vendors believed it would be possible but had never actually done it. Remember, System One had 20 years' worth of modifications behind it, so getting an exact idea of the shape of all the data and what it meant was hard.
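The latency question is easy to demonstrate with a toy model. The sketch below is deliberately naive and is not the vendors' design, which we never saw in code form. `NaiveTwoWaySync` is a hypothetical class that copies every write to the other system after a delay and has no conflict resolution.

```python
from collections import deque


class NaiveTwoWaySync:
    """Toy model: each write is replayed on the other system when flushed."""

    def __init__(self) -> None:
        self.systems: dict[str, dict[str, str]] = {"one": {}, "two": {}}
        # Changes waiting to be copied across: (target system, key, value).
        self.pending: deque[tuple[str, str, str]] = deque()

    def write(self, system: str, key: str, value: str) -> None:
        """Write locally now and queue the change for the other system."""
        self.systems[system][key] = value
        target = "two" if system == "one" else "one"
        self.pending.append((target, key, value))

    def flush(self) -> None:
        """Simulate the delayed synchronization run catching up."""
        while self.pending:
            target, key, value = self.pending.popleft()
            self.systems[target][key] = value

    def in_agreement(self, key: str) -> bool:
        return self.systems["one"].get(key) == self.systems["two"].get(key)


sync = NaiveTwoWaySync()

# Both systems update the same order before the synchronization runs.
sync.write("one", "order-1", "deliver to store")
sync.write("two", "order-1", "deliver to home")
sync.flush()

# Each system now holds the other's value, so they still disagree.
print(sync.systems["one"]["order-1"], "|", sync.systems["two"]["order-1"])
print("in agreement:", sync.in_agreement("order-1"))
```

With no rule for deciding which write wins, the two updates simply swap places and the systems stay out of step. A real component would need versioning or an agreed conflict policy, and that policy would have to be correct for every kind of record in System One.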
In response to the concerns we were told that the architecture teams from the vendors had agreed to some maximum response times for this component and that the vendors would be contractually obligated to make sure it worked. There was no need for a rollback strategy since the vendors' architects had agreed it could be done. A large contract was awarded to them to line up their team.
The vendor team started listing dependencies and said everything should work correctly if all their requirements were met. Which meant that no-one had a technical plan for managing this system and everyone was getting positioned to point fingers.
A lot of time and money was committed to this project now. Timescales were agreed and published so that other teams could plan their migration, all before a single line of code was written or a single test had taken place.
So what happened next?
Our team stayed on with the client for some months. We made some much-needed improvements and optimisations to their products but confidence was low. Progress on the commercial agreement with the vendor was slow and the synchronization component was delayed. The Perrio team did what they could but it was clear we were only adding marginal value while everyone waited for the synchronization that no one was confident in anymore.
In the end our team all agreed it was pointless to carry on. We looked around the programme and saw lots of consultants who all knew this was not going to work but would collect their day rate all the same. We saw promises being made by project managers and architects that we knew the development teams could not keep. At some point the product development teams would be asked to work crazy hours to catch up, resulting in products that don’t do what the users need.
The project had become a prime example of what we wanted to do differently and our team was not comfortable with it. We decided to stand down for now and potentially come back later if needed. From a business point of view it was illogical since we could have carried on and charged the client but maintaining morale in a project that no one thinks will work is impossible, life is too short.
So we spoke to our client and explained our concerns about the programme. They were sad to lose our team but clearly understood the reasons. The original request had been to help them deliver System Two and make the entire ecosystem work. Instead we made small changes while the main System Two delivery was stuck with an approach that we knew would not work. We handed over our work to an internal team and stepped away.
What about after that?
Several months later we were contacted by the client. They needed help. Things had not really progressed, the prototype synchronizer had major flaws and it was unlikely to be fixable without major changes to both System One and System Two. Several people in the project management and architecture teams had lost their jobs and costs were already spiralling. All of the planned dates were slipping and everyone was feeling the pressure but could not do much to help. The company board was extremely unhappy and everyone was worried about the future.
Based on our previous experience, we knew what to look out for so we took it slow. Firstly we spent some time with the internal teams to understand exactly what was happening. We met with the vendor and discovered the developers there had never believed the synchronization would work but were forced into it. We met with well-meaning managers in the client's team who felt out of their depth because they didn’t have a team to provide technical input into what was going on. New people had been hired and other roles had disappeared but the same problems existed: there were a lot of planners but no doers.
We formed a plan to get the project back on track by using a modified version of our original architecture and streamlining the project planning process to only cover timescales that we could be sure about.
Then we insisted on presenting this to the company board. It was clear that the senior leadership needed to be more active in the project to have any hope of cutting through the politics. We had some tough sessions with the senior leaders but we managed to agree on a plan that worked for everyone and that we were confident could be delivered. We settled on a partnership agreement because we wanted both of us to have a stake in the success of the programme and we needed a way to ensure there was genuine change. The programme was streamlined with less management and closer working relationships with the actual developers of System One and System Two.
Now, the first products are live with data from System Two, the integration layer works great and priorities are moving. We have shown that the two systems can work simultaneously without a lot of pain so the company wants to leave some of the low priority data in System One for now and focus on user products. Leaving the old system running will eventually have to be dealt with, but now the business has the option to make short-term user gains that would have been impossible before.
Conclusion
There is a long running joke in software development that the biggest problem in any project is not the technology, it's the people and process. That was true here, the technical problem was complex but manageable since it was all things that we had done before. The challenge was that different internal teams were pulling in different directions and there was no higher leadership to help to steer the project. The various internal teams wanted to protect their own areas of responsibility which resulted in no-one being willing to accept input from elsewhere and everyone preparing to cover themselves when things went wrong.
All of this was made worse by the fact that there was no-one actually writing code or designing the product. Everyone was a manager of some description, normally overseeing a supplier. The result was that the managers had little ability to change anything and would succeed or fail entirely on the basis of the supplier. If there were product developers it would have been possible to cut through some of the politics but in this case the management team were all concerned about their own position.
This culture of blame and fear made it impossible for teams to work openly and honestly together. The organisation structure put people into silos of their responsibilities but did not manage the interdependence properly.
Since we have returned to the programme there have been several key changes. We insisted on a direct line of communication to the company senior management so we could cut through the silos. We also worked with the senior management team to make sure the vision and goals of the project were properly understood so that everyone could feel confident in how to help others. At the same time we insisted on working directly with the developers in other teams, we needed to get past the managers and work closely with those who are closest to the code. We promoted a flatter structure within the product teams so that teams could deliver independently.
Politics impacts every organisation but in this case it was so bad that it made delivery impossible. We were reminded that building a great product is often not enough. Delivering in a siloed and highly political environment requires senior support and it is important to get that view early.
We were also reminded that technology can be complex and intimidating. For non-technical leaders it can be hard to sit through a slick presentation by an architecture team and probe the detail of what is being said. Everyone is too far removed from users and code to see problems but vendor contracts are easy to understand. In this case the approach was sold to the senior management before it was presented to a wider technical team so everyone was committed. For some of our longest-standing clients we do little more than provide a sounding board a lot of the time. A few good questions early on can stop a project going down the wrong road. Today, all companies are software companies so senior leadership needs support with what to ask and when.