
Software Engineering at Google

Tags: #technology #software engineering #programming #teams #culture #scaling #google

Authors: Titus Winters, Tom Manshreck, Hyrum Wright

Overview

In “Software Engineering at Google,” we delve into the principles and practices that have enabled us to build and maintain a massive, ever-evolving codebase. We argue that software engineering is more than just programming: it is about understanding the impact of time and scale on software development, and about making informed decisions around the trade-offs inherent in building sustainable systems. We cover topics such as managing dependencies, designing for scalability, handling large-scale changes, and the crucial role of tooling and automation in enabling these practices.

Drawn from our experiences at Google, the book offers specific examples, tools, and processes we have developed, and shows how they can be adapted to other contexts and problem domains. We also emphasize the importance of a data-driven culture and a willingness to re-evaluate decisions as new information emerges. Our goal is to foster a broader conversation about software engineering principles and practices, promoting more sustainable and scalable software development across the industry. Whether you are a seasoned engineer or just starting your journey, this book will equip you with the knowledge and insights to build software that lasts, adapts, and scales to meet the challenges of a constantly changing technological landscape.

Book Outline

1. Chapter 1: What is Software Engineering?

The essence of software engineering lies in recognizing that software development extends beyond just writing code. It involves managing that code over time, considering its evolution, scalability, and the trade-offs inherent in decision-making throughout its lifecycle. The crucial element of time sets it apart from programming, introducing dimensions of maintenance, modification, and long-term sustainability.

Key concept: “Software engineering is programming integrated over time.” Programming is certainly a significant part of software engineering: after all, programming is how you generate new software in the first place. If you accept this distinction, it also becomes clear that we may need to delineate between programming tasks (development) and software engineering tasks (development, modification, maintenance).

1. Chapter 1: What is Software Engineering?

Software must be designed not just to work, but to be maintainable over time. The concept of software sustainability, the capacity to adapt to change throughout a project’s lifespan, is crucial. Hyrum’s Law highlights the inevitability of dependencies arising on any observable behavior in a system. This means even seemingly insignificant changes can have unforeseen consequences, requiring us to anticipate and plan for these dependencies to make our software truly sustainable.

Key concept: With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody. - Hyrum’s Law
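A toy illustration of Hyrum’s Law (function and data names are invented, not from the book): a function documents only *which* records it returns, yet callers can still observe, and come to depend on, the order in which they happen to come back.

```python
# Hypothetical sketch of Hyrum's Law: the docstring promises a set of
# results, but callers can still observe (and depend on) their order.

def find_users(prefix, users):
    """Return the users whose names start with `prefix` (order unspecified)."""
    # Implementation detail: results happen to come back in input order.
    return [u for u in users if u.startswith(prefix)]

users = ["alice", "albert", "bob"]

# A caller that honors the contract: treats the result as an unordered set.
assert set(find_users("al", users)) == {"alice", "albert"}

# A caller that violates it: depends on the observable (but unpromised)
# ordering. It works today, so per Hyrum's Law somebody will ship it --
# and later "optimizing" find_users to return a different order breaks them.
assert find_users("al", users)[0] == "alice"
```

Once the second caller exists, changing the result order is no longer a private implementation detail, whatever the docstring says.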

1. Chapter 1: What is Software Engineering?

Scalability considerations are another key differentiator between programming and software engineering. A program may function perfectly in isolation but fail to scale as the number of users, code size, and development team grows. Software engineering emphasizes creating systems that can handle this growth, ensuring that processes, policies, and the codebase itself remain efficient and manageable as they scale.

Key concept: “Your organization’s codebase is sustainable when you are able to change all of the things that you ought to change, safely, and can do so for the life of your codebase.”

1. Chapter 1: What is Software Engineering?

In software development, understanding that the cost of fixing issues increases exponentially the later they are found is crucial. The “Shifting Left” principle encourages us to focus on identifying and addressing problems as early as possible in the development workflow, utilizing tools and practices like static analysis and code review to catch defects before they escalate into more significant challenges.

Key concept: “Shifting Left”: Shifting problem detection to the “left,” earlier on the timeline of development, makes it cheaper to fix than waiting longer.
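As a minimal sketch of shifting left (my example, not from the book), static analysis can flag a defect before the code is ever run or even reviewed. Here Python’s `ast` module catches a classic bug, a mutable default argument, directly from the source text:

```python
import ast

# A minimal "shift-left" check: statically flag mutable default arguments,
# a classic Python defect, before the code is ever executed.

def find_mutable_defaults(source):
    """Return (function name, line) pairs for list/dict/set default args."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for default in node.args.defaults + node.args.kw_defaults:
                if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                    findings.append((node.name, default.lineno))
    return findings

snippet = """
def append_item(item, bucket=[]):   # defect: shared mutable default
    bucket.append(item)
    return bucket
"""

print(find_mutable_defaults(snippet))  # → [('append_item', 2)]
```

Caught here, the fix is a one-line edit; caught in production, it is a debugging session plus a rollout.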

1. Chapter 1: What is Software Engineering?

Software engineering requires careful consideration of tradeoffs and costs. Each project comes with its own context and constraints, making it crucial to evaluate the trade-offs between various paths forward. This involves analyzing not only financial costs but also resource utilization, personnel effort, opportunity cost, and even societal impacts when making informed decisions.

Key concept: Rarely is there a “one size fits all” solution in software engineering… Google’s experience will probably not match yours… Most of the practices that we find are necessary at that scale will also work well for smaller endeavors: consider this a report on one engineering ecosystem that we think could be good as you scale up.

2. Chapter 2: Build Systems and Build Philosophy

Modern build systems are an essential component of a scalable and efficient software development process. While seemingly simple, they automate the complex process of transforming source code into executable binaries. Prioritizing build speed enables rapid iteration, and ensuring consistent and reproducible builds across different machines and developers is crucial for collaborative development.

Key concept: Fundamentally, all build systems have a straightforward purpose: they transform the source code written by engineers into executable binaries that can be read by machines. A good build system will generally try to optimize for two important properties: fast and correct.

2. Chapter 2: Build Systems and Build Philosophy

A significant challenge for build systems is managing dependencies. Whether internal dependencies within the project or external ones on third-party libraries, it is essential to ensure that these dependencies are correctly resolved, versioned, and isolated to maintain build integrity and avoid conflicts.

Key concept: In looking through the above problems, one theme repeats over and over: managing your own code is fairly straightforward, but managing its dependencies is much harder.
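The classic way transitive dependencies bite is the “diamond”: your project needs two libraries that each pin a different version of a third. A toy resolver (all names and versions invented) makes the conflict concrete:

```python
# A toy dependency resolver illustrating the "diamond" problem: app needs
# libA and libB, which transitively pin different versions of libC.
# All package names and versions here are hypothetical.

deps = {
    "app":      [("libA", "1.0"), ("libB", "1.0")],
    "libA@1.0": [("libC", "1.2")],
    "libB@1.0": [("libC", "2.0")],
    "libC@1.2": [],
    "libC@2.0": [],
}

def resolve(root):
    """Walk the dependency graph; report packages pinned at two versions."""
    chosen, conflicts, stack = {}, [], [root]
    while stack:
        node = stack.pop()
        for name, version in deps.get(node, []):
            if name in chosen and chosen[name] != version:
                conflicts.append((name, chosen[name], version))
            chosen.setdefault(name, version)
            stack.append(f"{name}@{version}")
    return conflicts

print(resolve("app"))  # one conflict: libC is wanted at both 1.2 and 2.0
```

Managing your own code never produces this situation; pulling in two dependencies of dependencies can.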

2. Chapter 2: Build Systems and Build Philosophy

Artifact-based build systems, like Google’s Bazel, represent a shift from the traditional task-based approach. By focusing on defining the desired artifacts (like binaries or libraries) and their dependencies, they grant the build system more control over how to perform the build. This leads to significant improvements in efficiency, parallelism, and build reproducibility, allowing for much larger and more complex projects to be built reliably.

Key concept: Reframing the build process in terms of artifacts rather than tasks is subtle but powerful. By reducing the flexibility exposed to the programmer, the build system can know more about what is being done at every step of the build.
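The artifact-based idea can be sketched in a few lines (an illustrative toy, not Bazel): targets declare only their sources and dependencies, and the engine itself derives the build order and skips any step whose inputs are unchanged, keyed by a content hash:

```python
import hashlib

# Minimal artifact-based build engine (illustrative, not Bazel): each
# target declares its inputs and dependencies; the engine derives the
# build order itself and caches artifacts keyed by a hash of the inputs.

targets = {
    "lib":  {"deps": [],      "srcs": {"lib.c": "int add(int a,int b){return a+b;}"}},
    "main": {"deps": ["lib"], "srcs": {"main.c": "int main(){return add(1,2);}"}},
}

cache, log = {}, []

def build(name):
    """Build `name` after its deps; reuse the cached artifact if inputs match."""
    dep_artifacts = [build(d) for d in targets[name]["deps"]]
    key = hashlib.sha256(
        repr((sorted(targets[name]["srcs"].items()), dep_artifacts)).encode()
    ).hexdigest()
    if key not in cache:
        log.append(name)                 # actual work happens only here
        cache[key] = f"artifact({name}:{key[:8]})"
    return cache[key]

build("main")
assert log == ["lib", "main"]   # deps built first, automatically
build("main")
assert log == ["lib", "main"]   # second build: everything is a cache hit
```

Because the engine, not the engineer, owns the “how,” it is free to parallelize independent targets or distribute the cache, which is exactly the leverage the chapter describes.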

2. Chapter 2: Build Systems and Build Philosophy

Choosing the right module granularity is a key consideration for build system performance. Fine-grained modules, where a module maps to a single directory or even a source file, offer advantages in caching, parallelism, and change impact analysis. While requiring more effort to maintain dependencies, tools like Bazel can automate this, making it a viable approach, especially for large codebases.

Key concept: The 1:1:1 Rule: For languages like Java that have a strong built-in notion of packaging, each directory usually contains a single package, target, and BUILD file.
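As an illustration of the 1:1:1 rule, a directory’s lone BUILD file might contain a single target covering that directory’s one Java package. This is a hypothetical fragment in Bazel’s Starlark (Python-like) syntax; the paths and names are invented:

```python
# src/main/java/com/example/payments/BUILD -- hypothetical 1:1:1 layout:
# one directory, one Java package, one BUILD file, one target.
java_library(
    name = "payments",
    srcs = glob(["*.java"]),                     # only this directory's sources
    deps = [
        "//src/main/java/com/example/billing",   # likewise one target per directory
    ],
    visibility = ["//visibility:public"],
)
```

With targets this small, a change to one directory invalidates only that target’s artifact, and everything else is served from cache.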

2. Chapter 2: Build Systems and Build Philosophy

Managing external dependencies and their versions is crucial for build stability. Google’s experience has shown that a strict one-version rule, allowing only one version of a specific third-party dependency within the codebase, greatly reduces conflicts and ensures consistency. While requiring more upfront work, the payoff in long-term stability is significant.

Key concept: Google has found this to cause a lot of problems in practice, and so we enforce a strict one-version rule for all third-party dependencies in our internal codebase.
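A one-version rule is mechanically checkable. A toy sketch (package names and versions are made up): collect every third-party dependency declared across the codebase and fail if any package is pinned at more than one version:

```python
# Toy one-version-rule check: given every third-party dependency declared
# across the codebase, report any package pinned at more than one version.
# Package names and versions below are invented for illustration.

def check_one_version(declared):
    """Map package -> set of versions; return packages with more than one."""
    versions = {}
    for package, version in declared:
        versions.setdefault(package, set()).add(version)
    return sorted(p for p, vs in versions.items() if len(vs) > 1)

declared = [
    ("protobuf", "3.21"), ("grpc", "1.46"),
    ("protobuf", "3.19"),            # violation: a second protobuf version
]

print(check_one_version(declared))  # → ['protobuf']
```

Run as a presubmit check, this turns the one-version rule from a convention into an invariant.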

3. Chapter 3: Large-Scale Changes

Large-scale changes (LSCs) are essential for maintaining and evolving large codebases. As codebases grow, making sweeping changes in a single commit becomes infeasible due to technical limitations, merge conflicts, and testing challenges. Adopting practices for handling large-scale changes allows for ongoing improvements and modernization without disrupting ongoing development.

Key concept: In our experience, a large-scale change is any set of changes which are logically related, but cannot practically be submitted as a single atomic unit.

3. Chapter 3: Large-Scale Changes

Effectively managing LSCs often involves a dedicated team responsible for the change process. Centralizing this responsibility allows for the development of expertise, efficient tooling, and standardized processes, ultimately reducing the overall cost of LSCs and ensuring their successful completion.

Key concept: Centralization also allows for faster recovery when faced with errors, as errors generally fall into a small set of categories, and the team running the migration can have a playbook–formal or informal–for addressing them.

3. Chapter 3: Large-Scale Changes

Embracing LSCs and investing in the necessary infrastructure allows for greater flexibility in decision-making. Teams can confidently implement changes, knowing that even widespread modifications can be executed safely and efficiently, encouraging ongoing improvement and adaptation of the codebase over time.

Key concept: As the ability to make changes across our entire codebase has improved, the diversity of changes has also expanded, and we can make some engineering decisions knowing that they aren’t immutable in the future.

3. Chapter 3: Large-Scale Changes

A key aspect of executing LSCs is breaking them down into smaller, independent units called shards. Each shard should be testable, reviewable, and committable independently, minimizing disruption to ongoing development and allowing for easier validation and rollback if necessary.

Key concept: For any large-scale change process, individual shards should be committable independently.
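One simple sharding strategy (a sketch of the general idea, with invented paths, not Google’s actual policy) is to group the files an LSC touches by top-level project directory, so each shard stays within one reviewable, submittable unit:

```python
from collections import defaultdict

# Toy sharding step for an LSC: group the files a change touches by their
# top-level project directory, so each shard can be tested, reviewed, and
# submitted independently. Paths here are invented for illustration.

def shard_by_project(changed_files):
    """Return {project: [files]}, each shard confined to one project."""
    shards = defaultdict(list)
    for path in changed_files:
        project = path.split("/", 1)[0]
        shards[project].append(path)
    return dict(shards)

changed = [
    "search/index.cc", "search/query.cc",
    "ads/server.cc",
    "maps/tiles.cc",
]

for project, files in sorted(shard_by_project(changed).items()):
    print(project, files)
```

If the `maps` shard fails its tests, it can be rolled back or reworked without touching the `search` and `ads` shards at all.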

3. Chapter 3: Large-Scale Changes

Tools like Google’s Rosie facilitate the LSC process by automating the sharding, testing, mailing, reviewing, and submitting of changes. These tools integrate with existing developer infrastructure, ensuring that LSCs can be executed efficiently and without overloading shared resources like testing and continuous integration systems.

Key concept: Rosie can be a heavy user of other pieces of Google’s developer infrastructure, so it caps the number of outstanding shards for any given LSC, runs at lower priority, and communicates with the rest of the infrastructure about how much load it is acceptable to generate on our shared testing infrastructure.
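The load-capping idea can be sketched with a counting semaphore (the cap of 3 and the structure are my invention, not Rosie’s actual implementation): however many shards an LSC produces, only a bounded number are ever in flight against shared infrastructure at once:

```python
import threading

# Sketch of Rosie-style load capping (numbers invented): at most
# MAX_OUTSTANDING shards run against shared test infrastructure at any
# moment, no matter how many shards the LSC was split into.

MAX_OUTSTANDING = 3
gate = threading.Semaphore(MAX_OUTSTANDING)
peak, in_flight = 0, 0
lock = threading.Lock()

def run_shard(shard_id):
    global peak, in_flight
    with gate:                        # blocks once the cap is reached
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        # ... test, review-mail, and submit shard `shard_id` here ...
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=run_shard, args=(i,)) for i in range(10)]
for t in threads: t.start()
for t in threads: t.join()
assert peak <= MAX_OUTSTANDING        # never more than the cap in flight
```

Running shards at lower priority and coordinating with the testing infrastructure, as the quote describes, are refinements of this same back-pressure idea.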

Essential Questions

1. How does the book define software engineering and differentiate it from programming?

Software engineering encompasses not just writing code, but also maintaining and evolving it over time, often at a large scale. Key differences from programming include considerations of time (lifespan of code), scale (number of users, engineers, and code size), and the complexity of trade-offs involved in making software sustainable and scalable.

2. What does the book mean by “sustainable software” and how is it achieved?

The book argues that software projects should be sustainable, meaning they can adapt to necessary changes throughout their lifecycle. This is achieved by understanding Hyrum’s Law (dependencies arise on observable behaviors), designing for scalability, and embracing change as a constant.

3. Why does the book advocate for artifact-based build systems over task-based systems?

Artifact-based build systems, like Google’s Bazel, offer advantages in parallelism, reproducibility, and incremental builds. They shift focus from defining tasks to defining desired artifacts and their dependencies, giving the system more control over the build process.

4. What module granularity does the book recommend for build targets, and why?

Google emphasizes fine-grained modules, even down to the 1:1:1 rule (one package, target, and BUILD file per directory), to maximize caching, parallelization, and isolation of changes. While requiring more effort to manage dependencies, automation tools can mitigate this downside.

5. What are the challenges of managing external dependencies, and how does Google approach them?

Managing external dependencies, especially their versions, is crucial for build stability. Google enforces a strict one-version rule for third-party dependencies to minimize conflicts, advocating for manual version management and potential vendoring for increased control and security.

Key Takeaways

1. Design for Sustainability

The book emphasizes that change is inevitable in software. Designing for sustainability means anticipating and planning for future changes in dependencies, technology, and product requirements. This reduces long-term costs and ensures the system can evolve without major rewrites.

Practical Application:

In developing an AI product, anticipating future needs and potential changes is crucial. Embracing modular design, using well-defined APIs, and adhering to semantic versioning can make the system more adaptable to future algorithm updates, data format changes, or integration with new platforms.

2. Leverage Robust Build Systems

As projects grow, managing dependencies becomes increasingly complex. Artifact-based build systems offer advantages in parallelism, reproducibility, and incremental builds, making them essential for scaling software development efforts.

Practical Application:

AI systems often involve complex dependencies on libraries, frameworks, and data sources. Using a robust build system like Bazel can help manage these dependencies effectively, ensuring reliable builds and consistent environments for development and deployment.

3. Consider Language Choices

The choice of programming language can significantly impact the long-term maintainability of a system. Statically typed languages offer advantages in terms of automated refactoring, large-scale changes, and build system efficiency, making them more suitable for large and long-lived codebases.

Practical Application:

When developing AI models, consider using strongly typed languages and tools. This allows for static analysis, automated refactoring, and early detection of errors, contributing to more maintainable code.

4. Embrace Large-Scale Changes

Traditional refactoring methods often break down at scale. Adopting a dedicated process for making large-scale changes, including automated tooling for change creation, management, review, and testing, is crucial for maintaining and evolving large codebases.

Practical Application:

When updating an AI product’s core algorithms or data pipelines, leverage an LSC process. Break down changes into smaller, independent units, test them thoroughly, and use automation for management and review. This enables controlled and safe evolution of the AI system.

5. Foster a Culture of Collaboration

Centralizing responsibility for large-scale changes within a dedicated team allows for focused expertise and more efficient execution. However, fostering a collaborative culture where engineers understand the importance of these changes and are empowered to contribute is crucial for long-term success.

Practical Application:

In an AI product team, foster a culture of shared knowledge and ownership, where engineers are encouraged to contribute to code improvements and understand the rationale behind changes. Clear communication and documentation of LSCs will help the entire team adapt and maintain consistency.

Memorable Quotes

Chapter 1: What is Software Engineering?

“Software engineering is programming integrated over time.”

Chapter 1: What is Software Engineering?

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody. - Hyrum’s Law

Chapter 1: What is Software Engineering?

“It’s programming if clever is a compliment, but it’s software engineering if clever is an accusation.”

Chapter 2: Build Systems and Build Philosophy

Reframing the build process in terms of artifacts rather than tasks is subtle but powerful.

Chapter 3: Large-Scale Changes

A large-scale change process makes it possible to rethink the immutability of certain technical decisions.

Comparative Analysis

While many software engineering books focus on specific methodologies like Agile or design patterns, “Software Engineering at Google” distinguishes itself by taking a broader, principles-based approach, emphasizing the impact of time and scale. It aligns with the core principles of “The Mythical Man-Month” by Frederick Brooks, acknowledging the challenges of scaling software projects and the importance of communication. It also complements “Site Reliability Engineering” by delving deeper into the development side of building and maintaining reliable systems. However, this book focuses less on prescriptive processes or specific technologies and more on Google’s philosophy and approach, making it a valuable resource for understanding the challenges and considerations of engineering at scale, but potentially less directly applicable to specific situations than more focused texts.

Reflection

“Software Engineering at Google” offers valuable insights into engineering at scale. The emphasis on sustainability, scalability, and data-driven decision making provides a solid foundation for building and maintaining large, complex systems. However, the book is heavily skewed towards Google’s specific context and experiences. While the principles are generally applicable, the tools and processes discussed are often tailored to Google’s infrastructure and culture, making direct implementation in other environments challenging. A skeptical reader might question the feasibility of replicating Google’s approach, especially the reliance on a monolithic repository, in organizations with different structures and legacy systems. However, the book’s strength lies in its thought-provoking discussions and the overarching themes it presents, encouraging readers to critically evaluate their own practices and strive for more sustainable and scalable software development, regardless of their specific context. It serves as a valuable guide for navigating the complexities of engineering for the long term, even if the exact solutions may need to be adapted.

Flashcards

What is Hyrum’s Law?

Dependencies arise on all observable behaviors of a system, regardless of what is promised in the contract.

What is Google’s definition of software engineering?

Programming integrated over time.

What is software sustainability?

The capacity of software to adapt to necessary changes throughout its lifespan.

What is the “Shifting Left” principle?

Shifting problem detection to the “left,” earlier in the development workflow.

What are artifact-based build systems?

Build systems that define desired artifacts and their dependencies, leaving the “how” of building to the system.

What is the 1:1:1 rule in Bazel?

One package, target, and BUILD file per directory.

What is the one-version rule?

Allowing only one version of a third-party dependency in the entire codebase.

What are large-scale changes (LSCs)?

Changes that are logically related but cannot be submitted as a single atomic commit.

What is Rosie?

Google’s tool for sharding, testing, and submitting large-scale changes.