An open SOC with an open GPU could alleviate a number of problems. For one, your OpenCV+ML cat-feeding Raspberry Pi (or any other generic SBC) would finally have access to the computational acceleration that is currently locked up on the GPU. A whole class of IoT and edge devices could benefit as well.
That being said, we’re building an open source SOC, Libre-SOC, with a GPU, and we’re using an open-source toolchain the entire way, from our source code and routing, right up to the silicon cells that will be photolithographically etched (currently) on TSMC’s 180nm process.
We’ve made quite a bit of progress over the past several months, so we wanted to take this moment to describe our toolchain and how we work. We’re proud of our progress so far, and we’d like to share it with the broader community so that everyone can draw inspiration.
We still have a long way to go until we’re able to achieve the sort of performance you’ll find in the smartphone in your pocket or on a Raspberry Pi, but we’re excited to be here and to be moving open-source hardware forward.
Step 1. Choose the ISA
One of the most important decisions to make when designing a CPU is the ISA, or Instruction Set Architecture. Libre-SOC was originally targeting RISC-V, but RaptorCS expressed interest in a POWER SOC, so we pivoted to IBM’s Open POWER ISA.
A RISC-V chip may be in the pipeline later.
Step 2. Choose the Hardware Description Language
Instead of designing at the gate level, nearly all modern chips are designed using a Hardware Description Language, or HDL. Although Verilog and VHDL are industry standards, they have some glaring flaws and are difficult to integrate with other programming languages for testing.
Because of this, Libre-SOC chose nMigen, which is an open source Python DSL; it addresses some of the issues with VHDL and Verilog, and allows us to do all our testing in Python.
Step 3. Break the overall design up into pieces
To simplify testing and design, we needed to break the overall design up into smaller pieces that could be designed and tested independently. Our CPU is broken down as follows:
- Memory subsystem
- External memory interface
- Instruction Fetch
- Instruction decode
- Instruction Issue
- Register File
- Execution Resources
- Integer ALU - executes all additions, shifts, rotates, etc
- Load/store unit
- LPC interface
After breaking the chip up into parts, we also needed to define the interfaces and basic operations of each block above, so we’d be able to connect them together when the time came.
Step 4. RTL Design
Each of the blocks above is then written in nMigen. One of the interesting things about nMigen is that, although it is Python, it doesn’t take Python code and convert it to logic gates. Rather, running the Python code generates a description of the circuit. Perhaps an example would be good:
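    from nmigen import Module, Signal

    m = Module()
    comb = m.d.comb  # the combinatorial (unclocked) domain of the module

    a = Signal()
    b = Signal()
    c = Signal()
    o = Signal()

    # m.If / m.Else describe a 2:1 multiplexer; neither branch is "taken" here
    with m.If(a == 1):
        comb += o.eq(b)
    with m.Else():
        comb += o.eq(c)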
Notice how the code above does not use a standard Python if, but a with statement? That’s because the Python code is building an Abstract Syntax Tree of the module so that it can generate logic gates.
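To make the "building a tree" idea concrete, here is a minimal pure-Python sketch of how with blocks can record structure instead of executing a branch. ToyModule and its assign method are hypothetical names invented for this illustration; they are not nMigen's API or internals, and signals and conditions are plain strings to keep it short.

```python
from contextlib import contextmanager


class ToyModule:
    """Hypothetical stand-in for an HDL module; it only records structure."""

    def __init__(self):
        self.tree = []              # top-level list of recorded nodes
        self._current = self.tree   # list that new statements are appended to

    @contextmanager
    def If(self, condition):
        node = {"if": condition, "then": [], "else": []}
        self._current.append(node)
        outer, self._current = self._current, node["then"]
        try:
            yield
        finally:
            self._current = outer

    @contextmanager
    def Else(self):
        node = self._current[-1]    # the most recently recorded If
        outer, self._current = self._current, node["else"]
        try:
            yield
        finally:
            self._current = outer

    def assign(self, target, source):
        self._current.append(("assign", target, source))


m = ToyModule()
with m.If("a == 1"):
    m.assign("o", "b")
with m.Else():
    m.assign("o", "c")

print(m.tree)
```

Both bodies run exactly once, regardless of any value of a, and both assignments end up in the recorded tree. A plain Python if would have evaluated the condition immediately and kept only one branch, which is no use when the goal is to describe hardware that contains both.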
Step 5. Testing
Arguably the most important aspect of hardware design is testing. Unlike software bugs, hardware bugs are nearly impossible to patch short of fabricating a new chip, which is horrendously expensive. That’s why it’s important to extensively test each and every piece of the design before it’s ever put on the chip.
One of the nice things about using nMigen is that all the tests can be written in Python. This makes it really easy to check the RTL against other libraries that were never intended to be used with hardware. For instance, to test the decoder, we have the PowerPC assembler assemble random instructions, and check that the fields the decoder extracts match the fields we fed to the assembler.
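The shape of that decoder test can be sketched in plain Python. Everything here is a toy assumption made for illustration: toy_assemble stands in for the real assembler, toy_decode stands in for the decoder under test, and the field layout is not the real POWER encoding.

```python
import random

# Toy 32-bit layout (an assumption, NOT the real POWER encoding):
#   opcode[31:26]  rt[25:21]  ra[20:16]  rb[15:11]


def toy_assemble(opcode, rt, ra, rb):
    """Reference model: pack the fields into one instruction word."""
    return (opcode << 26) | (rt << 21) | (ra << 16) | (rb << 11)


def toy_decode(word):
    """Unit under test: split an instruction word back into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rt": (word >> 21) & 0x1F,
        "ra": (word >> 16) & 0x1F,
        "rb": (word >> 11) & 0x1F,
    }


def run_random_roundtrip(trials=1000, seed=0):
    """Assemble random instructions and check the decoder recovers the fields."""
    rng = random.Random(seed)   # fixed seed so any failure is reproducible
    for _ in range(trials):
        fields = {
            "opcode": rng.randrange(64),
            "rt": rng.randrange(32),
            "ra": rng.randrange(32),
            "rb": rng.randrange(32),
        }
        word = toy_assemble(**fields)
        assert toy_decode(word) == fields, f"mismatch for {fields}"
    return trials


checked = run_random_roundtrip()
print(f"{checked} random instructions round-tripped")
```

In a real test the decode step would drive the simulated RTL rather than a Python function, but the structure stays the same: an independent reference produces the stimulus, and the hardware's answer is compared against what went in.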
Step 6. Formal Verification (Optional)
While regular testing is very helpful, for many modules it is impossible to test every possible combination of inputs. This means unknown edge cases may exist that never show up in testing, but are encountered in hardware.
Fortunately, there’s an alternative called Formal Verification. Unlike testing, where you generate some inputs and check that the output matches some expected value, in formal verification the inputs are left unspecified and some assertions are written about the output. For instance, when formally verifying an ALU you might assert that the ALU’s output is (A & B) when the ALU’s opcode input selects an AND operation.
Formal verification is especially useful when you’re trying to do something clever to save a few gates or a few nanoseconds. For instance, we were able to verify that a complex tree-based popcount (a count of the number of 1 bits in a number) implementation behaved exactly the same as a very dumb chain of adders. By doing this, we were able to catch a couple of edge cases that would never have been discovered through random testing.
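The popcount comparison can be illustrated with a small Python sketch. Note that this is a brute-force exhaustive check over a narrow width, not formal verification: a formal tool establishes the same equivalence symbolically, which is what makes it feasible at 64 bits where enumeration is not. The function names and the 12-bit width are assumptions chosen for the example, not Libre-SOC code.

```python
def popcount_chain(value, width):
    """Obviously-correct reference: add the bits one at a time."""
    total = 0
    for i in range(width):
        total += (value >> i) & 1
    return total


def popcount_tree(value, width):
    """Tree version: sum neighbouring partial counts, halving the list each level."""
    counts = [(value >> i) & 1 for i in range(width)]
    while len(counts) > 1:
        if len(counts) % 2:         # odd length: pad so every entry has a partner
            counts.append(0)
        counts = [counts[i] + counts[i + 1] for i in range(0, len(counts), 2)]
    return counts[0]


def find_counterexample(width):
    """Return an input where the two disagree, or None if they agree everywhere."""
    for value in range(1 << width):
        if popcount_tree(value, width) != popcount_chain(value, width):
            return value
    return None


WIDTH = 12  # small enough to enumerate all 4096 inputs
print(find_counterexample(WIDTH))
```

The property being checked is the same one a formal tool would be given: for every input, the clever implementation equals the dumb one. When the property fails, both approaches hand back a concrete counterexample, which is usually exactly the edge case that random testing missed.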
For more information on formal verification, see Wolfe’s talks on the matter or one of ZipCPU’s many articles on it.
Step 7. Synthesis/Layout
Rendering of POWER BPermd instruction (page 100) logic from nMigen source to RTLIL as described below.
So now we have a bunch of python modules that describe hardware, and a bunch of tests that confirm that they work. How do we get all this on a piece of silicon?
First, we need to have nMigen spit out a high-level description of the hardware, in a format called ilang (the textual form of Yosys’s RTLIL). In doing this, nMigen generates a simplified version of the hardware we described, converting constructs such as the m.If statements above into multiplexers and unrolling Python for loops. However, this simplified description is still not suitable to be placed on a chip; we need to convert it to a network of logic gates.
For that, we use the wonderful Yosys Open SYnthesis Suite to convert nMigen’s ilang file to a network of logic gates. However, that’s not the only thing Yosys can do; in addition to the formal verification mentioned above, it can also convert ilang or Verilog designs to the lookup tables and other cells used in several FPGA families, such as Lattice iCE40, Lattice ECP5, and Xilinx 7-series.
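As a rough illustration of what "converting to lookup tables" means, the sketch below turns the if/else multiplexer from Step 4 into a truth table stored as a single integer, which is essentially what a lookup table (LUT) cell holds. The helper names and the bit ordering (input 0 is the least significant index bit) are assumptions for this example; real FPGA cells and Yosys's actual technology mapping are considerably more involved.

```python
def lut_init(func, num_inputs):
    """Build func's truth table as an integer.

    Bit i of the result is func's output when the inputs, read as a binary
    number with input 0 as the least significant bit, equal i.
    """
    init = 0
    for index in range(1 << num_inputs):
        bits = [(index >> k) & 1 for k in range(num_inputs)]
        if func(*bits):
            init |= 1 << index
    return init


def lut_eval(init, *bits):
    """Look up the output for the given input bits in the truth table."""
    index = sum(bit << k for k, bit in enumerate(bits))
    return (init >> index) & 1


def mux(a, b, c):
    """The Step 4 example: o = b when a == 1, otherwise o = c."""
    return b if a == 1 else c


MUX_INIT = lut_init(mux, 3)
print(f"3-input LUT contents for the mux: {MUX_INIT:#010b}")

# The table reproduces the original function on every input combination.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert lut_eval(MUX_INIT, a, b, c) == mux(a, b, c)
```

Once the logic is expressed this way, the original if/else has disappeared entirely: the hardware just indexes into eight stored bits.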
Finally, we need to arrange the logic gates on the chip and route wires between them. For this, Libre-SOC is using a FOSS CAD tool called Coriolis2, which allows us to control the placement of the higher-level logic blocks and lets the tool automatically place and route the gates inside those blocks.
Step 8. Manufacturing
Finally, after all that, we have a layout of a chip with our CPU on it. To get it manufactured, we take the output of Coriolis2 and give it to a semiconductor manufacturer, such as TSMC or Samsung. Provided that our layout passes their error checking, they’ll use it to create a set of masks. Those masks will then be used to pattern many copies of our design onto a silicon wafer. The individual dies on the wafer will get tested, cut apart, packaged, and (possibly) tested again before being shipped out to us.
What’s Our Status?
We have a tapeout scheduled for TSMC in late fall. Our chip currently supports about 70 POWER instructions, and the out-of-order scheduler (we use a variation on the scoreboard architecture) is almost done.
We aren’t sure if our fall tapeout will support Virtual Memory. In addition, the fall tapeout definitely won’t have GPU support.
I’d estimate that a fully ready 4-core CPU with a GPU will be market ready by fall 2022.
Where to Learn More?
Learn more here.
Also, if you want to learn more about how to design a CPU, I wrote a course for Georgia Tech here.