Universal Chiplet Interconnect Express Notes

This article is based on references and ideas from Dr. Zhen Zhou and Prof. Yang Yi; thank you for your guidance and help.

Wave physics as an analog recurrent neural network¹
A Radio Frequency Analog Computer for Computational Electromagnetics²
Universal Chiplet Interconnect Express (UCIe): An Open Industry Standard for Innovations With Chiplets at Package Level³
Challenges and recent prospectives of 3D heterogeneous integration⁴
Chiplet Technology and Heterogenous Integration⁵
Chiplet Technology & Heterogeneous Integration⁶

Overview

UCIe is short for Universal Chiplet Interconnect Express⁷. The UCIe Consortium provides whitepapers for V1.0⁸ as well as for V1.1⁹. Intel’s Dr. Debendra Das Sharma¹⁰ wrote the paper that defines UCIe. In addition, the consortium provides webinars¹¹ as YouTube videos embedded in its webpage, and it has a YouTube channel¹² that contains many useful tutorials. The specifications¹³ are not accessible to the general public; only employees of partner organizations can view them.

The following table gives a high-level overview of the UCIe architecture, from the hardware side on the left to the software protocol stack on the right.

	PHY	RDI	D2D Adapter	FDI	Protocol
name	Physical Layer	Raw Die to Die Interface	Die to Die Adapter	Flit-aware Die to Die Interface	Protocol Layer
Concepts⁸	1️⃣Link Training 2️⃣Lane Repair 3️⃣Lane Reversal 4️⃣Scrambling 5️⃣Sideband 6️⃣Analog Front End 7️⃣Clock Forwarding		1️⃣ARB/MUX 2️⃣CRC/Retry 3️⃣Link State Mgmt 4️⃣Parameter Negotiation		1️⃣PCIe¹⁴ 2️⃣CXL¹⁵¹⁶ 🔹CXL.io 🔹CXL.cache 🔹CXL.mem 3️⃣Streaming¹⁷ 4️⃣Raw
Extension	Packaging			Flit

UCIe layering approach; page 13 of the slides¹⁸ has a flowchart of the goal to achieve.

	PHY	D2D	Protocol
CXL		ARB/MUX
CRC		lightweight 16 bit triple bit detection, converge at high frequency, pipeline.
RETRY		simplify from ~~PCIe by selective Nak~~¹⁹
LSM²⁰		Link state management: Reset/Active/PowerMgmt/Error flows
PN²¹		PN with remote link partner²²
PCI
CXL
Streaming		supports Raw formats

UCIe Die-to-Die Adapter Concepts

Hardware

This section discusses the hardware aspect of UCIe™ as well as advanced packaging for chiplets.

Packaging

Integrated circuit packaging draws on knowledge and technology from many areas. The following table summarizes the categories of chiplet packaging technology.

Categories	Details
Die/Package Stacking	2.5D, 3D Die Thinning/Stacking Package on Package Package in Package
Wafer Level Package	Wafer Bumping Cu Pillar Fan-out WLP Wafer Level IPD Wafer Level MEMS
Interconnection	Cap Wire Bond Wire Bond Flip Chip
Embedded Substrate	SESUB a-EASI
Double Side Molding	Double Side Molding Selective Molding Flexible Encapsulation Molded Underfill Irregular Packaging
Antenna	Antenna on Package Antenna in Package
Board Assembly	High Density SMT ACF Bonding Wire Bond on Flex Laser Welding Flex Bending
Shielding	Conformal Shielding Compartment Shielding Selective Shielding Magnetic shielding

System in Package Technology²³

UCIe V1.0⁸ supports two types of packaging: standard package (UCIe-S) and advanced package (UCIe-A).

	Standard Package	Advanced Package
Characteristics
Data Rate per Lane(GT/s)²⁴	4, 8, 12, 16, 24, 32	4, 8, 12, 16, 24, 32
Width/Configuration(each cluster)	16 Full Duplex	64 Full Duplex
Bump pitch(\(\mu m\))	100-130	25-55
Channel Reach(\(mm\))	\(10\leqslant L \leqslant 25\)	\(\leqslant 2\)
Raw Bit Error Rate(BER)	◾️\(10^{-27}\) 16GT/s short reach ◾️\(10^{-15}\) 16GT/s long reach	◾️\(10^{-27}\) 12GT/s ◾️\(10^{-15}\) 16GT/s
Targets for Key Metrics
B/W Shoreline(\(GB/s/mm\))	28-224	165-1317
B/W Density(\(GB/s/mm^{2}\))	22-125	188-1350
Bump
bump maps!!	◾️\(\times16\) ◾️\(\times32\)	◾️8-columns, ◾️10-columns ◾️16-columns ❗️UCIe-A area/column type efficiency plots²⁵

UCIe™ Packaging Specs Comparison²⁶, Table I¹⁰, Table I²⁷
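
The lane widths and data rates in the table above translate directly into raw bandwidth per module. The following is a minimal sketch of that arithmetic, assuming each lane carries one bit per transfer (so GT/s equals Gbit/s per lane) and ignoring any protocol overhead. The names `LANES`, `DATA_RATES_GT_S`, and `module_bandwidth_gb_s` are illustrative and do not come from the UCIe specification.

```python
# Raw per-direction bandwidth of one UCIe module, derived from the table above.
# Assumption: one bit per lane per transfer, no protocol overhead counted.

LANES = {"standard": 16, "advanced": 64}   # lanes per direction, per module
DATA_RATES_GT_S = (4, 8, 12, 16, 24, 32)   # supported data rates per lane


def module_bandwidth_gb_s(package: str, rate_gt_s: int) -> float:
    """Return the raw bandwidth in GB/s for one direction of one module."""
    if package not in LANES:
        raise ValueError(f"unknown package type: {package}")
    if rate_gt_s not in DATA_RATES_GT_S:
        raise ValueError(f"unsupported data rate: {rate_gt_s} GT/s")
    return LANES[package] * rate_gt_s / 8  # 8 bits per byte


if __name__ == "__main__":
    for package in LANES:
        for rate in DATA_RATES_GT_S:
            bw = module_bandwidth_gb_s(package, rate)
            print(f"{package:8s} {rate:2d} GT/s -> {bw:6.1f} GB/s")
```

Under these assumptions, a standard-package module at 32 GT/s moves 64 GB/s in each direction, and an advanced-package module at the same rate moves 256 GB/s.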

At the moment (V1.1), UCIe™ supports only 2D and 2.5D packaging. Two figures are included in this article for illustration. The following figure, "Figure 4: UCIe: Layering Approach and different packaging choices", is extracted from the V1.0 whitepaper⁸.

The standard/laminate packaging (2D) is used for cost-effective performance. The advanced packaging(2.5D) is used for power-efficient performance. There are multiple commercially available options, some of which are shown in the diagram. UCIe specification embraces all types of packaging choices in these categories.
Universal Chiplet Interconnect Express (UCIe)®: Building an open chiplet ecosystem⁸

The following figure is from Figure 4 of V1.1 whitepaper⁹.

Heterogeneous Integrations

Several books give a comprehensive description of heterogeneous integration, enabling readers to quickly learn the basics of problem-solving methods and understand the trade-offs inherent in system-level decisions.

“Heterogeneous Integrations”²⁸

“Chiplet Design and Heterogeneous Integration Packaging”²⁹

https://semiengineering.com/is-ucie-really-universal/

https://www.computer.org/csdl/magazine/mi/2023/02/10013705/1JP1z96RPP2

https://docs.pingcode.com/info/34526.html

UCIe-S Bump map

The video³⁰ explains the routing of UCIe-S bump maps in detail.

A checkerboard arrangement is useful for interrupt concerns and mechanical compatibility.

UCIe-A area/column type efficiency plots³¹

UCIe-A interoperability across bump pitch range³²: fixed beachfront, signal ordering rules.

UCIe covers not only the software protocol side; the hardware side, both electrical and packaging, is also heavily emphasized for chiplet design.

Flit means flow control unit.

CXL: Compute Express Link³³

Wafer Level Packaging

Fan-In

Fan-Out: any package with connections fanned out of the chip surface, enabling more external I/Os.³⁴ An illustration comparing Fan-In and Fan-Out for an RF module.³⁵

FOWLP: Fan-Out Wafer Level Packaging, this image³⁶ shows how the FOWLP based chip is made.

Interposer³⁷

Interposer contains the following characteristics:

Purpose: electronic substrate to interconnect between the fine-pitch I/Os at the die level on the top side to the coarser dimensional features on the package on the bottom side of the interposer.³⁷
TSV: Through-silicon Via, illustration of TSV on High Bandwidth Memory³⁸; reference³⁹, page 10, figure 17 illustrates how RDL is manufactured.
TPV: Through-package Via
RDL: redistribution layers on both sides of the interposer; reference³⁹, page 10, figure 16 illustrates how RDL is manufactured.

Through-silicon Via⁴⁰

Via First
Via Middle
Via Last

Advanced Packaging Technology

There is an incredible number of advanced packaging types and brand names from Intel (EMIB, Foveros, Foveros Omni, Foveros Direct), TSMC (InFO-OS, InFO-LSI, InFO-SOW, InFO-SoIS, CoWoS-S, CoWoS-R, CoWoS-L, SoIC), Samsung (FOSiP, X-Cube, I-Cube, HBM, DDR/LPDDR DRAM, CIS), ASE (FoCoS, FOEB), Sony (CIS), Micron (HBM), SKHynix (HBM), and YMTC (XStacking).
Advanced Packaging Part 2 – Review Of Options/Use From Intel, TSMC, Samsung, AMD, ASE, Sony, Micron, SKHynix, YMTC, Tesla, and Nvidia⁴¹

Technology	Corporation	Category
EMIB⁴²⁴³	Intel	2.5D	🔵thin pieces of silicon with multi-layer BEOL⁴⁴ interconnects, embedded in organic substrates. Fig 4(A)⁴³ ❗️limitations of interposer SI and enable similar inter-die bandwidth at lower cost⁴⁵
Foveros⁴⁶⁴⁷	Intel	2.5D/3D⁴⁸	🔬⁴⁹, with EMIB named Co-EMIB⁵⁰ 🔵F2F \(\mu\)bumps for Foveros 🔵Cu Column for Foveros Omni
Foveros Direct	Intel	Hybrid Bonding⁴⁸	🔵Cu to Cu Bonding for Foveros Direct⁵¹
ODI⁵²⁴⁵	Intel	3D	🔵Type I: sharing 🔵Type II: single die 🔵Copper Pillar or Package
CoWoS⁵³	TSMC	2.5D Package Chip Last	🔬⁵⁴, CoWoS®-S structure⁵⁵, NEC SX-Aurora⁵⁶, high-K MiM⁵⁷, iCAP⁵⁸ AMD EPYC MI300⁵⁹🔬⁶⁰
CoWoS-R⁶¹	TSMC	2.5D Package Chip Last	🔬⁶², 🔴InFO technology to utilize an RDL interposer composed of polymer and copper traces. 🔴Build bigger chips: adding high-bandwidth memory (HBM) stacks to one or more processors.⁶³
CoWoS-L⁶⁴	TSMC	2.5D Package Chip Last	🔬⁶⁵, 🔴interposer with LSI⁶⁶ chip 🔴RDL layers for power and signal delivery
InFO⁶⁷	TSMC	Package Chip First	🔴InFO_oS⁶⁸: LSI⁶⁶+RDL 🔴InFO_PoP🔬⁶⁹: FOWLP⁷⁰+PoP⁷¹, DRAM TIV⁷² connect to Fan-Out processor. InFO_B⁷³ ❗️InFO_PoP vs CoWoS vs SoIC🔬⁷⁴ 🔴InFO_LSI: InFO_oS+LSI on Substrate⁶⁶ 🔴InFO_SoIS: System-on-Integrate Substrate⁵⁵ 🔴InFO_SoW: System-on-Wafer
SoIC⁷⁵	TSMC	3D Chip Stack⁷⁶	🔴CoW⁷⁷: 🔴WoW⁷⁸: bond face to face with same size⁷⁹. ❗️CoW and WoW comparison: 🔬⁸⁰ ❗️SoIC: Frontend, chiplet; InFO&CoWoS: Backend, integration. AMD EPYC MI300⁵⁹🔬⁶⁰
I-Cube⁸¹	Samsung⁸²	2.5D
H-Cube⁸³	Samsung	2.5D
X-Cube⁸⁴	Samsung	3D	⚪️compete with SoIC
	ASE
	Sony
	Micron
	SKHynix⁸⁵
	YMTC⁸⁶

Packaging technologies³⁹

Concepts for packaging

2.5D vs 3D: The major difference is that 2.5D uses passive silicon, whereas 3D uses active silicon, which has not only electrical connections but integrated circuits as well.⁸⁷ In a 2.5D structure, two or more active semiconductor chips are placed side by side on a silicon interposer to achieve extremely high die-to-die interconnect density. In a 3D structure, active chips are integrated by die stacking for the shortest interconnect and smallest package footprint.⁸⁸ A 2.5D structure needs an interposer, while a 3D structure enables direct vertical connection between dies (page 11 of ³⁹).

chip first vs chip last: InFO is a chip-first process, where the chip is placed first and the RDLs are then built around it. With CoWoS, the RDLs are built up first and the chip is placed afterwards.

Hybrid Packaging

Optics and UCIe: figure 6 of the article⁸⁹ shows the evolution of integration between optical data connections and ICs to achieve larger bandwidth for co-packaged datacenter optics.

	Raw	Standard 256 bit flit mode

Stream	✅ only⁹⁰

UCIe 1.0

PHY Layer

Module

A module is defined as a pair of physical interfaces between dies, with a main band for data transfer and a sideband for connection overhead.

The following figure⁹¹ shows the connection of modules for UCIe-S as well as UCIe-A.

UCIe-S: 16TX, 16 RX, Side Band DATA, CLK

Die 2 Die Adapter

Flit formats

flit header carries: protocol identifier, stack identifier⁹², sequence number, Ack/Nak completion, pause of data stream indication.

pause of data stream indication (PDS): because the 68-byte flit size is not a multiple of the lane count, a pause indication is needed. It is a dedicated header followed by all zeros, which the receiver reacts to.
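
The alignment problem behind PDS can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that bytes are laid out one per lane in rows of `lanes` bytes; the function name `leftover_bytes` is hypothetical and the real byte-to-lane mapping is defined by the specification.

```python
# Why a 68-byte flit needs padding: it does not fill whole rows of lanes.
# Assumption: one byte per lane per row; the actual mapping is spec-defined.

def leftover_bytes(flit_bytes: int, lanes: int) -> int:
    """Bytes remaining after filling as many whole rows of `lanes` bytes as possible."""
    return flit_bytes % lanes


if __name__ == "__main__":
    for lanes in (16, 64):          # UCIe-S and UCIe-A module widths
        for flit in (68, 256):      # 68B and 256B flit sizes
            rest = leftover_bytes(flit, lanes)
            print(f"{flit:3d}B flit on {lanes:2d} lanes: {rest} byte(s) left over")
```

Under this assumption, a 68-byte flit leaves 4 bytes over on both 16 and 64 lanes, while a 256-byte flit divides evenly on both.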

Types

Raw format: bypasses the D2D adapter and can be used directly on the PHY layer.

68B Flit format: this stems from the CXL flit format. 64 bytes come from the FDI, and 4 bytes are inserted in the D2D adapter: 2 bytes of flit header and 2 bytes of CRC. The data is shifted for the RDI, as it allows a 64-byte flit size.
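
The 64 + 2 + 2 byte budget of the 68B format can be sketched as follows. This is only an illustration of the size arithmetic: the placement of the header before the payload, the big-endian byte order, and the use of Python's `binascii.crc_hqx` (a CRC-CCITT variant) are all assumptions made here, and the CRC is a stand-in rather than the polynomial UCIe actually uses. The function names are hypothetical.

```python
import binascii
import struct

PAYLOAD_BYTES = 64   # delivered by the protocol layer over the FDI
HEADER_BYTES = 2     # flit header added by the D2D adapter
CRC_BYTES = 2        # CRC added by the D2D adapter
FLIT_BYTES = PAYLOAD_BYTES + HEADER_BYTES + CRC_BYTES  # 68


def pack_68b_flit(header: int, payload: bytes) -> bytes:
    """Build a 68-byte flit: header, payload, then a 16-bit CRC over both."""
    if len(payload) != PAYLOAD_BYTES:
        raise ValueError("payload must be exactly 64 bytes")
    if not 0 <= header <= 0xFFFF:
        raise ValueError("header must fit in 2 bytes")
    body = struct.pack(">H", header) + payload
    crc = binascii.crc_hqx(body, 0)  # stand-in CRC-16, not the UCIe polynomial
    return body + struct.pack(">H", crc)


def check_68b_flit(flit: bytes) -> bool:
    """Return True if the flit has the right size and its CRC matches."""
    if len(flit) != FLIT_BYTES:
        return False
    body = flit[:-CRC_BYTES]
    (crc,) = struct.unpack(">H", flit[-CRC_BYTES:])
    return binascii.crc_hqx(body, 0) == crc


if __name__ == "__main__":
    flit = pack_68b_flit(0x1234, bytes(PAYLOAD_BYTES))
    print(len(flit), check_68b_flit(flit))
```

A receiver that sees `check_68b_flit` fail would, in the D2D adapter's terms, request a retry instead of forwarding the payload.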

Standard 256B End Header Flit Format: this design is related to how the data buffer works, so the header is placed at the end.

Standard 256B Start Header Flit Format: CXL 256B Flit Mode protocol

Latency Optimized 256B without Optional bytes:

Latency Optimized 256B with Optional bytes:

Format	Name	PCIe Non-Flit Mode	PCIe Flit Mode	CXL 68B Flit Mode	CXL 256B Flit Mode	Streaming¹⁷
1	Raw	Optional	Optional	Optional	Optional	Mandatory
2	68B	Mandatory	🚫	Mandatory	🚫	🚫⁸, ✅⁹
3	Std 256B End Header	🚫	Mandatory	🚫	🚫	🚫⁸, ✅⁹
4	Std 256B Start Header	🚫	Optional	🚫	Mandatory	🚫⁸, ✅⁹
5	Latency Optimized 256B without optional bytes	🚫	🚫	🚫	Optional	🚫⁸, ✅⁹
6	Latency Optimized 256B with optional bytes	🚫	Strongly Recommended	🚫	Strongly Recommended	🚫⁸, ✅⁹

Protocol and Flit Format Matrix⁹³
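
The matrix above can be encoded as a small lookup table, which makes it easy to ask which flit formats a given protocol may use. This is a sketch that transcribes the table as shown; the names `MATRIX`, `PROTOCOLS`, and `formats_for` are illustrative, and the Streaming entries marked "v1.1 only" reflect the table's note that those cells differ between the V1.0 and V1.1 whitepapers.

```python
# Transcription of the protocol / flit format matrix into a lookup table.
PROTOCOLS = (
    "PCIe Non-Flit Mode",
    "PCIe Flit Mode",
    "CXL 68B Flit Mode",
    "CXL 256B Flit Mode",
    "Streaming",
)

N, O, M, R, V = "no", "optional", "mandatory", "strongly recommended", "v1.1 only"

# One row per flit format; columns follow the order of PROTOCOLS.
MATRIX = {
    "Raw": (O, O, O, O, M),
    "68B": (M, N, M, N, V),
    "Std 256B End Header": (N, M, N, N, V),
    "Std 256B Start Header": (N, O, N, M, V),
    "Latency Optimized 256B without optional bytes": (N, N, N, O, V),
    "Latency Optimized 256B with optional bytes": (N, R, N, R, V),
}


def formats_for(protocol: str) -> list[tuple[str, str]]:
    """Return (format name, support level) pairs available to a protocol."""
    col = PROTOCOLS.index(protocol)  # raises ValueError for unknown protocols
    return [(name, row[col]) for name, row in MATRIX.items() if row[col] != N]


if __name__ == "__main__":
    for protocol in PROTOCOLS:
        print(protocol)
        for name, level in formats_for(protocol):
            print(f"  {name}: {level}")
```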

Unraveling PCIe 6.0 FLIT Mode Challenges

Link Initialization Flow

Link Training State Machine

Expand from UCIe

This section lists some papers that apply chiplet concepts to machine learning, aiming for better performance, scalability, and energy efficiency.

Computing In-Memory

“SIAM: Chiplet-based Scalable In-Memory Acceleration with Mesh for Deep Neural Networks”⁹⁴

📌benchmarking simulator named SIAM to evaluate the performance of chiplet-based IMC architectures.
📎SIAM integrates device, circuit, architecture, network-on-chip (NoC), network-on-package (NoP), and DRAM access models to realize an end-to-end system.
✅The chiplet-based IMC architecture obtained through SIAM shows \(130\times\) and \(72\times\) improvement in energy-efficiency for ResNet-50 on the ImageNet dataset compared to Nvidia V100⁹⁵ and T4⁹⁶ GPUs.

“Neuromorphic Computing Based on CMOS-Integrated Memristive Arrays: Current State and Perspectives”⁹⁷

“SWAP: A Server-Scale Communication-Aware Chiplet-Based Manycore PIM Accelerator”⁹⁸

📌a novel server scale 2.5-D manycore architecture called SWAP that accounts for the traffic characteristics of DL applications.
✅SWAP achieves significant performance and energy consumption improvements with much lower fabrication cost than state-of-the-art network-on-package NoP topologies.

“In-Memory Computing based Acceleration: Large-Scale to Edge Computing”⁹⁹

💡heterogeneous big-little chiplet-based IMC architecture that utilizes big and little IMC-based chiplets coupled with an optimal network-on-package or NoP configuration.
💡on-Chip Training: a ReRAM-based in-memory computing accelerator for on-chip training and inference of millimeter Wave (mmWave) CNN and inference of RGB CNN models for personalized home-based rehabilitation systems.
✅on-chip training and inference using IMC architectures can enable energy-efficient edge computing.

Neuromorphic Computing

“Chiplet-based Architecture Design for Multi-Core Neuromorphic Processor”¹⁰⁰

📌chiplet-based architecture for a multi-core neuromorphic processor.
💡multiple neural processing chiplets are connected together through the interposer.
📎neuromorphic chip¹⁰¹ contains programmable neuron and routers for spike handling and topology mapping.

Spiking Neural Network

“Abisko: Deep codesign of an architecture for spiking neural networks using novel neuromorphic materials”¹⁰²

📌designed as a chiplet that can be deployed in contemporary computer architectures and investigating novel neuromorphic materials to improve its design.
📌developing a productive software stack for the neuromorphic accelerator that will also be portable.

Understand UCIe from other stacks

If we consider UCIe to be data transmission at the chip-package scale and PCIe to be data transmission at the device scale, then Ethernet/Internet is data transmission at the network scale. Though they target different problems with different addresses, there are some similarities between these protocol stacks.

OSI(Ethernet Protocol)

The OSI 7-layer network protocol model is perhaps one of the most widely recognized protocol models in the world. It is designed to separate different functionalities in a hierarchical structure, which is similar to the layering in UCIe.

Posted

November 15, 2023

Xiaomeng Wang
