Guest Blog: A Heretic Speaks (Why Hardware Doesn’t Fit the Agile Model)

Fair to say that what we’ve posted on AgileSoC.com to date is decidedly pro-agile. Bryan, the guest bloggers we’ve had thus far and I all believe in agile hardware development, so we haven’t spent much time talking about why agile hardware wouldn’t work. No surprise there. But when you’re getting a steady diet of opinions from one side of an argument, it can be easy to forget that there are some very practical arguments on the other side. Today – after a little cajoling from Bryan over the past year – Mike Thompson from Huawei in Ottawa brings a little balance to AgileSoC.com by examining the flip side of the coin.

In A Heretic Speaks, Mike talks about reasons why agile and hardware just don’t go together. Given that Mike brings years of experience to the discussion, it’s hard to call this anything but a fair assessment. But I’d love to know what you think. Do you agree with Mike’s assessment? Extra points to those involved with physical design who jump into the discussion!

Take it away, Mike!


A Heretic Speaks

For a couple of years now, I’ve been reading the AgileSoC blogs.  They are great reading, and each update provides an opportunity to see SoC development from a different perspective.  Every now and then a tidbit from one of the blog entries you’ll find at this site will make its way into my own work – which is probably the whole point.  Having said that, I simply do not agree with the idea that the Agile software development model can be applied to SoC development.   There.  I said it.   I can feel the virtual slings and arrows of an outraged Agile community allied against me.  🙂

There is a contradiction in what I just said.  If I do not believe that Agile can be applied to SoC development, then why do I continue to read this blog and even adopt some of the concepts discussed here?  The answer is simple: I am a Verification geek, and more and more, SoC verification has become a software activity.  So Agile can teach verifiers a thing or two about their craft.  However, apart from Verification, SoC development is not software development.  SoC development is hardware development.  That distinction is real and it matters.

Hardware is not Software

As Verifiers, our view of the overall SoC[1] development cycle can be somewhat stunted.  Our view of the design is mostly formed by our interactions with the RTL and the designers who write the RTL.  Of course, RTL is just a specific type of software, right?  It can be compiled into an executable for simulation, or it can be synthesized into a gate-level model that can be mapped onto a netlist of logic cells from a library.  Sounds like software to me.  The trouble is, RTL is just the tip of the iceberg.  It’s the part of the design that is the easiest for us to see, but it is not all there is to the design – and it’s not the biggest part – not by a long shot.  Let’s have a look at a few reasons why.

Hardware is not Virtual

One of the coolest features of software is that it can use virtual resources.  Need more memory – just call malloc()!  Need another object – just call new()!  We do this all the time in Verification.  Heck, if you are a SystemVerilog user, you can even have virtual interfaces.  You can’t do this in hardware.  All resources – the amount and type of memory needed, the number and type of interfaces, the amount of logic to perform tasks – all of this is fixed and is not easily changed.  Why?  Because you are building hardware and all of these resources are physical.  Worse, all of them cost money.  Before your organization lets you build a chip, the-powers-that-be will want to know what it costs.  As in how-many-dollars-per-chip?  In order to answer this question, you’ll need to answer a bunch of important questions such as what process node you will use (e.g. 32nm) and how big the die will be (e.g. 10mm x 10mm).  In order to answer those questions, you’ll need to know things like how many and how big the I/O interface macros are (e.g. high-speed SerDes and DDR PHYs), and the types and geometries of memories you can use.  In short, you will need to know a lot about your design.  And you’ll need to know it long before you start writing any RTL.  This is one of the reasons that Test Driven Development (TDD) doesn’t fit RTL coding very well… but I’m getting ahead of myself…
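
To make the contrast concrete, here is a minimal sketch (the names, widths and depths are made up purely for illustration): the verification side conjures up objects on demand at run time, while the RTL side has to commit to a physical buffer whose depth and width were decided back when the die was being costed.

```systemverilog
// Purely illustrative names and sizes -- not from any real design.

// Verification code: resources are virtual and grow on demand.
class packet;
  rand bit [7:0] payload[];                 // dynamic array, sized at run time
endclass

module tb;
  packet q[$];                              // queue of packets, grows as needed
  initial begin
    packet p;
    repeat (1000) begin
      p = new();                            // "just call new()" -- no cost review
      void'(p.randomize() with { payload.size() inside {[1:256]}; });
      q.push_back(p);
    end
  end
endmodule

// RTL: the equivalent resource is physical, sized up front, and costs die area.
module pkt_buffer #(
  parameter int DEPTH = 256,                // fixed at design time; changing it
  parameter int WIDTH = 64                  //   re-opens the area/power discussion
) (
  input  logic             clk,
  input  logic             rst_n,
  input  logic             wr_en,
  input  logic [WIDTH-1:0] wr_data,
  output logic             full
);
  logic [WIDTH-1:0]        mem [DEPTH];     // a real SRAM or flop array on the die
  logic [$clog2(DEPTH):0]  count;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      count <= '0;
    end else if (wr_en && !full) begin
      mem[count[$clog2(DEPTH)-1:0]] <= wr_data;
      count                         <= count + 1'b1;
    end
  end
  assign full = (count == DEPTH);
endmodule
```

The testbench never has to justify its thousand packets to anyone; the buffer’s 256 x 64 bits of storage, on the other hand, were paid for in die area before the first line of RTL existed.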

RTL is not Abstract

Describing digital logic at the Register Transfer Level is probably the single biggest advance in hardware development ever.  RTL reduces effort and improves quality by allowing digital logic to be described at a functional, as opposed to structural, level of abstraction.  But it’s not that abstract – not really.  An RTL designer still needs to understand and deal with clocks, resets, complex interfaces to memories, clock gating and clock-domain-crossings.  She needs to know how many levels of logic she can get away with between flops, which is a function of the clock period and the cell library.  Also, RTL doesn’t allow designers to explicitly control things like area, timing and power.  A lot of very fancy tools have been deployed to improve this situation, but for the most part, getting all this right still involves a lot of blood, sweat and tears.  The implication here is that designers often spend more time on structural issues, such as clock-domain crossings and clock-gating, than they do on functional issues.  Strike 2 for TDD…
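
To see how little of this is “functional”, here is a minimal sketch (a hypothetical module, not from any real design) of a two-flop synchronizer.  Its behavior is trivially simple, yet the reset scheme, the deliberately “useless” metastability flop and the timing of the path arriving from the other clock domain are exactly the structural details that consume a designer’s time – and that no functional test will tell you are right:

```systemverilog
// A two-flop synchronizer: structural plumbing, not function.
module cdc_sync (
  input  logic clk_b,      // destination clock domain
  input  logic rst_b_n,    // asynchronous reset for the clk_b domain
  input  logic din_a,      // level signal launched from the clk_a domain
  output logic dout_b      // safe to consume in the clk_b domain
);
  logic meta;              // first flop may go metastable; never use it directly

  always_ff @(posedge clk_b or negedge rst_b_n) begin
    if (!rst_b_n) begin
      meta   <= 1'b0;
      dout_b <= 1'b0;
    end else begin
      meta   <= din_a;     // STA/CDC tools still have to treat this as a crossing
      dout_b <= meta;
    end
  end
endmodule
```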

Physical Design Takes a Loooong Time

One of the biggest differences between software and hardware is how they are mapped into a product that actually does something useful.  Software is compiled.  For a large software product (tens of millions of lines of code), a compile/build cycle can be many hours or days.  Long enough, but not too bad, and most of it is fully automated.  The implication is that software can cut a release at just about any time.  The equivalent task in SoC development is called Physical Design, and it’s a big task.  For a large SoC (tens of millions of gates), the physical design – getting from that netlist to working hardware – can be several weeks or even months.   To get it right, several iterations may be required.  Much of the work cannot be automated.  The implication here is that SoC development time is driven primarily by Physical Design, not Functional Verification.  The PD cycle is set very early on in the project and it is extremely difficult, if not impossible, to change it.  One of the things a good Program Manager will try to do is fit the Verification cycle entirely within the PD cycle, so that Verification doesn’t drive the schedule’s critical path[2].

Why Hardware Doesn’t Fit the Agile Model

OK, hopefully by now you’re convinced that hardware development is a distinct task from software development, and SoC development is mostly a hardware development activity.   That still doesn’t explain why Agile can’t be applied to SoC development.  I’ll try to do that by making reference to a couple of AgileSoC posts from last year: “When Done Actually Means DONE” and “TDD And A New Paradigm For Hardware Verification”.

You’re not Done until PD Says you’re DONE.

In “When Done Actually Means DONE”, a contrast is made between the Waterfall development model and the Agile model:

Figure 1: Waterfall vs. Agile Development Models

I doubt that anyone is working on a team that is using the waterfall model anymore.  All development teams that were doing this have long ago gone out of business because their products were either very late to market, full of defects, or both.   The Agile model is much better – except that it doesn’t work for hardware.  There are a couple of big reasons why this is so.  First, since hardware is not virtual, some aspects of the physical design must start first.   Yes, even before the Specification and certainly before the RTL coding.  Second, the pace of the development is set by PD.  This effectively negates one of the biggest advantages of Agile development: the ability to quit development and generate a release of your product whenever it suits the project.

So most large SoC projects look more like a hybrid of the Waterfall and Agile methods as shown in Figure 2.   Let’s call this model “Concurrent Engineering”.   Now you could say that this is a visual description of how SoC development can take advantage of some aspects of Agile development.  Getting rid of that waterfall and replacing it with concurrent tasks really does improve quality and reduce schedules.   But we cannot build the SoC at the 50% mark.  Software can do that.  Not hardware.

Figure 2: Concurrent Engineering of SoC Tasks

TDD is not a good fit for RTL coding

In Figure 2, the RTL coding task has been explicitly broken out.  This was done to illustrate an important point that was alluded to earlier: designers actually do not spend that much time coding RTL.  There are two big reasons for this:

  1. Most of the real design work such as spec’ing out memories, clocking, resets, resource allocation/management, etc. must be done before RTL coding starts.
  2. In order to begin PD trials, the bulk of the RTL must be available early – long before it is fully verified.  This is an important point: at this stage, the RTL need not be functionally correct, but it must be ‘complete’ in that all functions are implemented[3].

So, the actual RTL coding doesn’t take that much time (3 to 6 weeks is typical).  Given that, a Test Driven Development (TDD) process for RTL is a bad idea since it slows down the delivery of the code that is needed for PD.  Note that this is not true for the verification code.  TDD for testbenches, testcases, coverage models, etc. is very beneficial and should be part of all verification flows[4].
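
To make that concrete, here is a minimal sketch of what a unit test of verification code can look like, following the SVUnit unit-test template (the parity_checker class and its calc_parity() function are made up for illustration):

```systemverilog
`include "svunit_defines.svh"

// The (hypothetical) piece of verification code under test.
class parity_checker;
  function bit calc_parity(bit [7:0] data);
    return ^data;   // XOR-reduce the byte
  endfunction
endclass

module parity_checker_unit_test;
  import svunit_pkg::svunit_testcase;

  string name = "parity_checker_ut";
  svunit_testcase svunit_ut;

  parity_checker uut;

  function void build();
    svunit_ut = new(name);
  endfunction

  task setup();
    svunit_ut.setup();
    uut = new();
  endtask

  task teardown();
    svunit_ut.teardown();
  endtask

  `SVUNIT_TESTS_BEGIN

    // Written before calc_parity() existed: it defines what "done" means.
    `SVTEST(all_zeros_has_even_parity)
      `FAIL_UNLESS(uut.calc_parity(8'h00) == 1'b0)
    `SVTEST_END

    `SVTEST(single_bit_has_odd_parity)
      `FAIL_UNLESS(uut.calc_parity(8'h01) == 1'b1)
    `SVTEST_END

  `SVUNIT_TESTS_END

endmodule
```

The tests fail until calc_parity() is implemented, and from then on they run in seconds as part of every check-in – which is exactly the feedback loop TDD is after, and it costs PD nothing.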

Conclusion

Well, if you’ve gotten this far then you are pretty open minded.   After all, if you are even a semi-regular reader of this blog, then you probably think Agile is a pretty darn good idea.  I do too – but just not for the hardware side of SoC development.   For Verification and Emulation, it’s a different matter.  In our Verification shop we cherry pick many Agile concepts (such as unit-testing of verification code and continuous integration of the verification environment and RTL) and for Emulation we pretty much go all out with Agile.  So I hope that this blog continues to be an open forum for applying Agile to these activities.

Thanks to Neil for setting up this blog and continuing to drive it, and special thanks for permission to re-use his slide from “When Done Actually Means DONE”.  Not many people have the self-confidence to provide ammo for the opposition in a public forum.  🙂  Also, if you haven’t already – get a copy of SVUNIT and use it!

Michael Thompson
   ASIC Verification Manager
   Huawei Technologies
   Ottawa, Ontario, Canada
   michael.thompson@huawei.com


[1] Actually, I’d prefer to use the term “integrated circuit” or “IC” in this discussion.  By “IC”, I mean any large scale integrated circuit implemented as an ASIC or ASSP.  However, SoC is used throughout this site, so I’ll continue to use the term, or sometimes I’ll just say “chip”.

[2] I know, I know.  It is rarely possible to keep Verification off the critical path for at least part of the project.

[3] It shouldn’t be brain-dead either.  In our shop, PD trials start on a netlist that has passed at least some basic functional testing.

[4] If you do not already have an in-house methodology for TDD of your verification code, I strongly recommend that you take a good look at SVUNIT.

6 thoughts on “Guest Blog: A Heretic Speaks (Why Hardware Doesn’t Fit the Agile Model)”

  1. Mike,

    Thanks for taking the time to put this together! I don’t think this site is complete without somebody railing against the “propaganda” so I appreciate you playing devil’s advocate. I think you’re right that pd is the big elephant when it comes to the design cycle. As people can probably guess by the content on agilesoc.com, it’s not my or bryan’s area of expertise so it’s hard for me to argue with your position. I’d hope, though, that others with more expertise may start chatting and devising ways agile might help transform the way we do/interface with pd.

    a few questions for you…
    * how do resource estimates vary over the course of a project? Does “needing” to know resource requirements early in the design cycle translate to accurate estimates that change very little over time? Is it common to have die size estimates change over time even though we’d prefer they don’t? (I ask that because “we need to know X” is still a common argument against agile in the sw world from what I understand, especially from architecture and management teams).
    * in your experience, how often are iterations through the pd toolchain initiated due to functional issues? I ask because I often wonder how functional correctness earlier in the dev process may end up cutting down on the number of pd runs/eco’s/etc saving a team a few weeks at release time.
    * could there be anything gained by faster incomplete cycles in early pd vs. waiting for the rtl sane/done milestone? I’ve always wondered why we can’t approximate functionally dismal code early on for the pd team as opposed to waiting for the actual (yet still functionally dismal) code from designers.
    * do you think unit tests would affect defect rates as they’re changing code to address pd issues like timing, size, etc? or are bugs injected at that stage relatively rare/insignificant?
    * is there feedback you see missing from pd back to rtl design? how about the verification team? What could design/verification engineers be doing to make life easier for the pd team that we don’t do now?

    btw… despite my lack of experience with it, I found your comment about pd where “To get it right, several iterations may be required” very consoling. That’s not the first time I’ve heard the word ‘iterating’ used in an argument against agile. Those are the kinds of arguments against agile that I like. They’re my glimmer of hope :).

    Thanks again Mike for stepping up!

    -neil

  2. Great questions Neil. I’ll do my best to answer them. Keep in mind that like you, I am not a PD engineer. So for that reason, I am not the best person to be answering these questions. Of course, that never stopped me from having an opinion… 🙂

    [Neil] How do resource estimates vary over the course of a project? Does “needing” to know resource requirements early in the design cycle translate to accurate estimates that change very little over time? Is it common to have die size estimates change over time even though we’d prefer they don’t?

    [Mike] It is _not_ common for the die-size to change once it has been set. There are a bunch of reasons for this. Most of these reasons boil down to time-and-money. Changing the die size will almost certainly have a schedule impact, but the impact that matters most is cost. Die size is a first-order driver of the cost of an IC. The bigger the die, the more the IC will cost. The people who are paying for this chip expect to make money from it, so they will want to know how much it will cost before they commit the resources to get it developed and built. If you change this halfway through the project, you change the business case for the chip. You do not want to be the one standing in front of the boss with _that_ message…
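
    As a rough back-of-the-envelope (the classic textbook approximation – no real project numbers attached), for a wafer of diameter d and a die of area A:

    \[ \text{cost per good die} \approx \frac{\text{wafer cost}}{\text{dies per wafer} \times \text{yield}}, \qquad \text{dies per wafer} \approx \frac{\pi\,(d/2)^2}{A} - \frac{\pi\,d}{\sqrt{2A}} \]

    Yield itself drops as the die area grows (a bigger die is more likely to catch a defect), so the cost of a good die rises faster than linearly with area – which is why nobody wants to revisit the die size once the business case is set.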

    [Neil] In your experience, how often are iterations through the pd toolchain initiated due to functional issues? I ask because I often wonder how functional correctness earlier in the dev process may end up cutting down on the number of pd runs/eco’s/etc saving a team a few weeks at release time.

    [Mike] In large ASIC/ASSP chips, functional changes should not drive the PD schedule. The PD guys really do not care if the device functions as expected, they only care that it can be built. So, the number of iterations through the PD toolchain is set by PD concerns, not functional concerns. Where some projects run into trouble is when the front-end (e.g. design, verification, STA) is not done before the last iteration through PD. In this situation, verification concerns can drive the schedule. A good Program Manager will arrange the project to ensure that the front end is complete (DONE!) when the final PD cycle is scheduled to start.

    [Neil] Could there be anything gained by faster incomplete cycles in early pd vs. waiting for the rtl sane/done milestone? I’ve always wondered why we can’t approximate functionally dismal code early on for the pd team as opposed to waiting for the actual (yet still functionally dismal) code from designers.

    [Mike] In fact, this is exactly how it’s done. The first PD cycle starts with “functionally dismal code”.

    [Neil] Do you think unit tests would affect defect rates as they’re changing code to address pd issues like timing, size, etc? or are bugs injected at that stage relatively rare/insignificant?

    [Mike] By the time a Designer is making changes to improve timing, increase clock gating or reduce area, there is usually a stable verification environment in place. So any functional defects injected at this stage are usually caught in the next regression.

    [Neil] Is there feedback you see missing from pd back to rtl design? how about the verification team? What could design/verification engineers be doing to make life easier for the pd team that we don’t do now?

    [Mike] The most important thing is that verification should be completed (DONE!) just as PD enters its final iteration. That will mean no unexpected surprises that trigger an ECO. This is usually a _very_ difficult thing to do since schedules are tight and Program Managers will typically trade risk for schedule (maybe we can handle a few ECOs without a significant impact to the schedule). So to answer your question, verification can help PD by prioritizing those features in the chip that will be hard for PD to ECO. For example, in the last PD iteration it will be almost impossible to consider changing the depth of an internal SRAM, but it is relatively painless to add/delete a flop.

  3. Neil and Mike, glad to read both your posts. 🙂
    I agree with Mike that hardware is more fixed, and I have always thought the Agile methodology more applicable to the verification and emulation or even SW integration side of ASIC/FPGA cycles. However, the initial decisions on die size and memory sizes are also best-effort guesstimates. There are times when functional verification can spot an error and shake the ground (e.g. a larger memory may be needed to trade off for higher speed in some bottleneck blocks), unless a big enough buffer has been pre-placed in the first-cut decision.

  4. Hi all,
    sorry if I’m wrong, but AgileSoC for me means “Bringing Agile Methods to SoC Development”, or at least this is what I can read at the top of the blog. So I don’t know if I agree that SoC development is just PD. OK, PD is really long (and I agree with Mike that “a good Program Manager” should “arrange the project to ensure that the front end is complete (DONE!) when the final PD cycle is scheduled to start”), but there are so many other tasks to be done (Architecture/Front-End Design/Verification) to which Agile methods can be (and, more and more, are being) applied, which makes me conclude that Agile methods are good for SoC development, just not for the whole flow.
    Regards,
    Andrea

  5. I treat the whole SoC task as the following pieces, handled concurrently:
    – requirement spec closure or sign off
    – functional closure or sign off
    – timing closure or sign off
    – floor plan closure or sign off
    – power closure or sign off
    – DfT or DfM closure or sign off
    – Physical verification closure or sign off
    PD includes the last 5 items. If you analyze the root causes of chip respins, you will find that spec or functional issues are the key driving force. PD can estimate its schedule with good precision, almost down to tool run time and iteration time per PD stage. But functional or spec issues are hard to schedule or predict well in an SoC, especially for IPs with no production record; spec issues are driven by competitors or customers. PD doesn’t care about spec or function, only whether the netlist or floorplan is final or needs to be done again. Thus, in PD, the schedule is basically a for loop. Until those PD items reach closure, every run is a trial run rather than a final run, since there will be new check-ins. So we should enlarge the view to when we can reach mass production in volume, instead of tape-out or design-done, since time to profit is the most important thing for a company, and even for a good flow or working methodology. Thus, with agile, there can be chances to shorten the time to profit or reduce the number of iterations, whether in PD or elsewhere.

  6. AFAIK the physical design is done by a separate team (if you mean the P&R phase, sign-off to foundry, wafer tests, and so on). It is done after other phases like synthesis trials, RTL validation, formal verification, bits of post-synthesis validation, … It’s not in the development loop, it’s pipelined to it, and usually near the end.

    So an Agile methodology seems perfectly possible, and coupled with early synthesis of individual modules it even allows the project to be adjusted as the resource estimates are narrowed down. Provided the other teams are Agile too, or flexible enough to take advantage of it.

    In a waterfall methodology, the synthesis, validation and P&R phases start once all the code is written (sometimes a mock-up is sent earlier to prepare the environment, scripts and so on) – quite unlike Figure 2.

    With an Agile methodology, there is an opportunity to overlap, so it can’t take more time – it’s actually the opposite.

    By the way, RTL coding takes much more than 3-6 weeks on an average project, but maybe it depends on what we are talking about. SoCs are large, and it’s usually much more than 3 to 6 months just for the RTL. If the RTL took as little as 3-6 weeks it would be a small project and I wouldn’t expect the methodology to have any impact – for example, if you are only integrating known IPs that you trust are working correctly.

    TDD is perfectly possible in RTL, and I don’t see why it should slow down the process. It’s just a methodology for writing code, in which you first write a test, then the RTL that makes it pass, refactor, and go to the next step, until all unit tests and acceptance tests are done. Which means the requirements have a better chance of being clarified earlier than with another approach, so there are benefits. Otherwise the same code is written in about the same time. The only case of shorter development (without using TDD) I can think of is when the RTL validations have been reduced because there wasn’t enough time left – which is frequent in that type of project, there is so much to do in the later stages. But that’s not a risk I’d be willing to take here.

    This is an old article, so I suppose techniques have evolved since then, and maybe we see more clearly how to apply Agile and TDD in general. But SoC projects are still pretty much the same, and I believe the initial hypotheses on which this article is based are incorrect.
