papajuans weblog

Writings and notes


Additional Thoughts on "How Platform Teams Get Things Done"

Pete Hodgson’s “How platform teams get stuff done” article is a brief synopsis of archetypes for how feature and platform teams can collaborate. It’s a good summary of things to consider when you have this sort of organizational setup, but it doesn’t touch upon two other key questions that surround the choice of a collaboration model:

  1. Should this thing even be in The Platform?
  2. You made the change, now what? How do you further adoption of this? What if there are competing priorities?

I ran the platform team (in the same sense Hodgson describes, i.e. a team that builds things for other teams) at my old company for the better part of 6 years. We were a team of about 50 engineers across 3-4 different domains in the DDD sense. What follows are my own experiences and some guidelines that might be helpful if you’re in the unique position of running and managing internal platforms.

Quick Recap

Pete Hodgson’s article on martinfowler.com, “How platform teams get stuff done”, succinctly summarizes a few different ways that platform and feature teams can work together. He views this through 3 different “phases” of a platform. Note that “phases” here doesn’t refer to maturity, but rather to who is driving the change. I think a better term would’ve been “proposer” or “requesting party”.

Hodgson’s archetypes are essentially variations of who-does-the-work: the requesting team (“internal open source”), the platform team (file a ticket), platform team embedded in the requesting team (embedded expert), the requesting team embedded in the platform team (tour of duty), etc.

Before choosing a model

But before we even get to which model is most applicable, there’s an important conversation to be had about whether the thing/request/proposal should be in the platform at all. In my experience, this can be a fundamental and challenging question for many platform teams - deciding what is user-land has long-term implications. Moreover, these questions can lead to principled discussions (of varying degrees of annoyance) and bikeshedding that can wear on the teams involved.

Answering that question requires 3 things, each an essential prerequisite for high-performing platform groups:

  1. Strong conviction and vision of what the service/component is meant to do with regard to the overall ecosystem and some idea of what it should look like over time - this is essentially treating the component as a product itself
  2. Excellent collaborators (largely in terms of communication) on both sides who are aligned on high-level architecture (services, data modeling, etc.)
  3. An organizational support structure and system that allows for “no” or “not now”

The owning team of the service/component in the platform needs to have a clear vision and understanding of how that thing fits into the overall ecosystem. Nobody likes awkward APIs, so discussing the proposed change is essential to ensure cohesiveness of that service/component over time. For example, suppose you have a team responsible for the Bookmarks API (save/retrieval of a user’s bookmarks) and there’s a proposal for adding a “recommended bookmarks” feature. There’s already a Recommendations API; should the “recommended bookmarks” feature be a new endpoint on the Bookmarks API? Or should it be a modification to the Recommendations API? Should it be a new endpoint in the Bookmarks API that calls Recommendations internally? There are a myriad of approaches, and choosing the right one depends on the context of the teams involved. Moreover, owners will have an understanding of upcoming changes, so if there’s some lucky alignment, now can be a good opportunity to shuffle things around.
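To make the third option concrete, here is a minimal sketch of a Bookmarks service that exposes “recommended bookmarks” itself but delegates the actual recommending to the Recommendations API. Everything in it is hypothetical and only for illustration: the `RecommendationsClient` interface, the `kind="bookmark"` parameter, the in-memory `FakeRecommendations` stand-in, and the method names are assumptions, not anything from a real system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class RecommendationsClient(Protocol):
    """Hypothetical client interface for the Recommendations API."""

    def recommend(self, user_id: str, kind: str, limit: int) -> List[str]:
        ...


class FakeRecommendations:
    """In-memory stand-in so the sketch runs without a real service."""

    CANNED = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
        "https://example.com/d",
    ]

    def recommend(self, user_id: str, kind: str, limit: int) -> List[str]:
        return self.CANNED[:limit]


@dataclass
class BookmarksService:
    """Owns bookmark storage; borrows ranking from Recommendations."""

    recommendations: RecommendationsClient
    saved: Dict[str, List[str]] = field(default_factory=dict)

    def save(self, user_id: str, url: str) -> None:
        self.saved.setdefault(user_id, []).append(url)

    def bookmarks_for(self, user_id: str) -> List[str]:
        return list(self.saved.get(user_id, []))

    def recommended(self, user_id: str, limit: int = 5) -> List[str]:
        # Ask for extra candidates so that filtering out URLs the user
        # already bookmarked can still leave `limit` results.
        already = set(self.bookmarks_for(user_id))
        candidates = self.recommendations.recommend(
            user_id, kind="bookmark", limit=limit + len(already)
        )
        return [url for url in candidates if url not in already][:limit]
```

The trade-off this sketch makes visible: the Bookmarks team keeps a cohesive API surface, but it now carries a runtime dependency on Recommendations, which is exactly the kind of thing the owning teams should weigh in the discussion.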

This discussion is really only productive when you have effective collaborators, aka communicators. Each party honestly and effectively conveying their respective needs is a helpful start (i.e. non-violent communication), and if there are established workflows for proposals, like an RFC or some sort of meeting, then taking the time to follow them will obviously be beneficial for both sides.

Through the discussion, it’s completely reasonable (although possibly unwanted) for the answer to the proposal and/or change to be “no”. This could be for a variety of reasons: it doesn’t fit the long-term vision, it makes for an awkward API, the platform isn’t ready for it, etc. In these scenarios it’s imperative that the requesting engineering team has an escape plan (build it separately, move logic somewhere else, etc.) and moreover that the organization is OK with this sort of escape plan. Tempers will undoubtedly flare if schedules must be shuffled to account for things. Unfortunately, there’s no right answer here, and it’s part of the challenge of common platforms. On a positive note - these conversations can help shape/mold the vision of the platform moving forward (aka “maybe we should allow that behavior”).

Ok, the change is done - now what?

Once you have determined that the change is needed and the thing is deployed, there’s still a laundry list of tasks that need to happen before the change actually benefits the platform and its clients. Hodgson’s article doesn’t mention “who should do these tasks” in his collaboration models, but I’m an advocate for the platform group taking on these responsibilities. It’s more natural for the owner of the component to be broadcasting these changes rather than an “outside” contributor - other dependent teams will expect that messaging from the same voice as normal changes/bug-fixes.

The biggest task will be evangelization of the change. This can take a lot of forms, and unsurprisingly the effort will be proportional to how large the change is. If it’s work that aligns with long-term strategic functions (new data model, new Kafka topic, whatever), then this will induce more work (potentially a migration of clients, more planning, etc.). Don’t forget that this needs to be accounted for in the first question of “does this thing belong”.

If the work/change is smaller, run-of-the-mill stuff (add a new filter parameter, a new optional argument, etc.) then additional work will not necessarily be generated, but communicating these changes is still required.

Disseminating this change is not a one-time thing! Platforms that have a lot of client teams will require multiple, repeated ways of telling people - and absolutely expect to have to say the same thing multiple times. Having an automated, standard way of communicating these changes is paramount, and how well it works directly affects how effective the platform is. It’s helpful to skew towards aggressive over-communication (Slack, e-mail, all-hands, meetings for feature walkthroughs, etc.) and treat it like a formal product release.
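As a rough illustration of “automated, standard, and repeated”, the sketch below takes one announcement record and expands it into a schedule of identical messages across several channels. The `ChangeAnnouncement` fields, the channel names, and the weekly cadence are all assumptions made up for this example; actually sending anything (Slack, e-mail, and so on) is deliberately left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Tuple

# Hypothetical channel names; swap in whatever your organization uses.
CHANNELS = ("slack", "email", "all-hands")


@dataclass(frozen=True)
class ChangeAnnouncement:
    component: str
    summary: str
    released_on: date
    action_required: str = "None"

    def render(self) -> str:
        # One standard template, so every channel says the same thing.
        return (
            f"[{self.component}] {self.summary}\n"
            f"Released: {self.released_on.isoformat()}\n"
            f"Action required: {self.action_required}"
        )


def reminder_dates(released_on: date, repeats: int = 3, every_days: int = 7) -> List[date]:
    """Release day plus follow-up reminders at a fixed interval."""
    return [released_on + timedelta(days=every_days * i) for i in range(repeats)]


def communication_plan(
    announcement: ChangeAnnouncement, repeats: int = 3
) -> List[Tuple[date, str, str]]:
    """Expand one announcement into (day, channel, message) entries."""
    message = announcement.render()
    return [
        (day, channel, message)
        for day in reminder_dates(announcement.released_on, repeats)
        for channel in CHANNELS
    ]
```

For example, `communication_plan(ChangeAnnouncement("Bookmarks API", "New optional filter parameter", date(2024, 1, 8)))` yields nine entries: the same message on three channels, on the release day and on each of the following two weeks.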

The second-largest task will be the activities around the initial deployment of the changes. I’ve seen (and been a part of) groups with a you-build-it-you-own-it approach, but this model breaks down very quickly for slices of features within a single deployed component. If it’s safe to assume that the new thing will be used by only a few clients, then having a temporary shared on-call setup will be most effective for the first iteration(s). This could take the form of new specific alerts/monitors created for that feature, or modifications of existing alerts to include participating teams. As the feature/change is hardened in production, you can disband the shared on-call setup in favor of normal operations (aka platform team owns it). Regardless, having an answer for when-things-blow-up in production is necessary and should be accounted for.
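One way to picture the temporary shared on-call setup is as a routing rule with an expiry date: alerts for the new feature page both the platform team and the contributing team until the hardening window ends, after which only the platform team is paged. This is a sketch under assumptions; the team names, the `SharedOnCall` record, and the idea of tagging alerts with a feature name are hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List

# Hypothetical name for the platform team's normal rotation.
PLATFORM_ROTATION = "platform-oncall"


@dataclass(frozen=True)
class SharedOnCall:
    feature: str            # feature tag carried by the alert
    contributing_team: str  # team that built the feature slice
    ends_on: date           # last day of the shared arrangement


def teams_to_page(
    alert_feature: str, today: date, shared: Iterable[SharedOnCall]
) -> List[str]:
    """Platform is always paged; contributors join while the window is open."""
    teams = [PLATFORM_ROTATION]
    for arrangement in shared:
        if arrangement.feature == alert_feature and today <= arrangement.ends_on:
            teams.append(arrangement.contributing_team)
    return teams
```

Putting an explicit end date on the arrangement makes the “disband” step the default rather than something someone has to remember to do.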

If all of this feels too it-depends, it should serve as a reminder of how difficult it can be to run effective platform teams. And note - as with everything else in software development, there will always be it-depends scenarios; your judgment as a manager is crucial in deciding which portion of this process should be emphasized.

Thanks to John, Wayne, and Kat for their invaluable input and review.
