4 posts tagged with "Go"

Golang exploration projects

Digging into sync.Once: How Go Ensures One-Time Execution

April 19, 2025 · 7 min read

Software Engineer II @ Blinkit

How it started?

While writing some concurrent code for Blinkit, I found myself reaching for sync.Once—a common utility in Go to ensure an action is performed just once, no matter how many goroutines attempt it. Out of curiosity, I decided to dig into how sync.Once works internally and how its implementation has evolved over time. While investigating the internals I came across something interesting and ended up contributing myself — a small step, but super rewarding!

In this blog, I’ll walk through the internals of sync.Once, how it leverages atomics for performance, and trace its evolution through Go versions. This blog is meant to motivate you to explore and solve your own doubts by diving into the source code of Go itself. You’ll be amazed at how much you can learn just by following the code and seeing how things work behind the scenes!

Prerequisites: What's sync.Once?

sync.Once ensures a function is only executed once, no matter how many times it's called, even across goroutines.
It's most commonly used to initialize shared resources like config, DB connections, or singletons.

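package main

import (
	"fmt"
	"sync"
)

// Config holds the parsed settings (fields omitted here).
type Config struct{}

var (
	readConfigOnce sync.Once
	config         *Config
)

// GetConfig builds the config on the first call and returns the same pointer afterwards.
func GetConfig() *Config {
	readConfigOnce.Do(func() {
		// Read yaml and make config object
		config = &Config{}
	})
	return config
}

func main() {
	cfg := GetConfig()
	fmt.Println(cfg)
}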

Link to sync.Once documentation
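
To see the "only once, even across goroutines" guarantee in action, here is a minimal sketch. The helper name runOnceConcurrently and the goroutine count are made up for illustration; only sync.Once and sync.WaitGroup come from the standard library.

```go
package main

import (
	"fmt"
	"sync"
)

// runOnceConcurrently starts n goroutines that all call once.Do with the same
// function and reports how many times that function actually ran.
func runOnceConcurrently(n int) int {
	var (
		once  sync.Once
		wg    sync.WaitGroup
		calls int // only written inside once.Do, so no extra locking is needed
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			once.Do(func() { calls++ })
		}()
	}
	wg.Wait()
	return calls
}

func main() {
	fmt.Println(runOnceConcurrently(100)) // prints 1
}
```

No matter how many goroutines race to call Do, the counter ends up at 1.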

Digging into sync.Once internals?

To follow along, feel free to clone the Go repository.

From the src directory of the clone, run the bash script ./make.bash to build and install the latest Go toolchain.
Then point your system or editor (like VSCode) to use the newly built Go version.

Bootstrapped Compilers

When I first came across the concept of a bootstrapped compiler, it honestly felt like a total brain teaser. The idea that a compiler could be written in the same language it’s supposed to compile? Wild.

Here’s the bombshell: the Go compiler is written in Go itself. Sounds paradoxical, right?

Like a classic chicken-and-egg dilemma - “How can a compiler compile itself if it doesn’t exist yet?”

In programming, bootstrapping refers to: The process of building a system using a simpler or initial version of itself.

make.bash is a shell script located at src/make.bash inside the Go source tree.
It's used to bootstrap the Go toolchain: it builds the Go compiler (cmd/compile), linker (cmd/link), and other core tools from scratch using the Go bootstrap toolchain.
It uses an already installed Go compiler as that bootstrap toolchain.
That compiler then builds the new version of Go from the cloned source code.

Internals (Go 1.18)

Let's start with the basics. The sync.Once struct looks like this:

src/sync/once.go | GOVERSION=1.18
type Once struct {
	done uint32
	m    Mutex
}

Pretty simple, right?

One uint32 flag done
- done == 1 means that the function has already run once
- done == 0 means that the function hasn't run yet
One mutex m
- A mutex to avoid race condition while updating done

Ok, somehow these two are used together to:

Ensure that a particular action executes only once, regardless of how many times it is called concurrently.
Achieve this guarantee efficiently, minimizing lock contention for better performance.

Ok so far so good right?

Let's move to the implementation of once.Do(f): It ensures that the function f() is only executed once, no matter how many times it's called—even if from multiple goroutines.

src/sync/once.go | GOVERSION=1.18
func (o *Once) Do(f func()) {
	if atomic.LoadUint32(&o.done) == 0 {
		o.doSlow(f)
	}
}

Goal: Avoid acquiring a mutex unless absolutely necessary (i.e., the function f() hasn’t run yet).

if atomic.LoadUint32(&o.done) == 0
- We check whether done == 0, which means that the function hasn't run yet
- The value of done is read atomically, in one indivisible operation. More about atomic package

func (o *Once) doSlow(f func()) {
	o.m.Lock()
	defer o.m.Unlock()
	if o.done == 0 {
		defer atomic.StoreUint32(&o.done, 1)
		f()
	}
}

if o.done == 0
- A second check inside the locked section.
- Why? Because multiple goroutines might pass the atomic check in Do(), but only one should actually run the function. So we check again inside the lock to be 100% sure.
- This is a double-checked locking pattern.

defer atomic.StoreUint32(&o.done, 1)
- Marks the function as executed after f() is done.
- It’s deferred so even if f() panics, we still consider it “done” and don't call it again (intentional in Go’s design).
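
The panic behaviour described above is easy to check. This is a small sketch against the real sync.Once; the helper name callsAfterPanic is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// callsAfterPanic calls once.Do twice; the first function panics.
// It returns how many of the two functions actually ran.
func callsAfterPanic() int {
	var once sync.Once
	calls := 0

	func() {
		// Recover so the panic from the first Do does not crash the program.
		defer func() { _ = recover() }()
		once.Do(func() {
			calls++
			panic("init failed")
		})
	}()

	// done was already set by the deferred store, so this function is skipped.
	once.Do(func() { calls++ })
	return calls
}

func main() {
	fmt.Println(callsAfterPanic()) // prints 1
}
```

So if your initialization can fail, sync.Once will not retry it for you; that has to be handled separately.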

Internals (Go 1.24)

In newer versions of Go, the implementation of sync.Once was revised.

src/sync/once.go | GOVERSION=1.24
type Once struct {
	_ noCopy
	done atomic.Uint32
	m    Mutex
}

Ok! What's changed now?

A blank field of type noCopy is added to the Once struct
done is now an atomic.Uint32 instead of a plain uint32

"noCopy" What's that?

It's a zero-size struct, so it adds no memory overhead.
Go has a static analysis tool, go vet, that checks your Go source code for common mistakes and suspicious constructs that the compiler won’t catch.
Some types must never be copied once they’ve been initialized, most notably synchronization primitives like sync.Mutex, sync.Once, etc. Accidental copies can lead to deadlocks or data races.
With a noCopy field in your struct, go vet will produce a warning if a value of your type is ever copied.
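
The noCopy type in package sync is unexported, but the same trick can be reproduced in your own code. The sketch below assumes that go vet's copylocks check flags copies of any type that has pointer-receiver Lock and Unlock methods. Tracker and noCopySize are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// noCopy is a local re-creation of the idea; the one in package sync is unexported.
// The empty Lock and Unlock methods are what the vet check looks for.
type noCopy struct{}

func (*noCopy) Lock()   {}
func (*noCopy) Unlock() {}

// Tracker is a hypothetical type that must not be copied after first use.
type Tracker struct {
	_     noCopy
	count int
}

// Inc increments the counter through a pointer receiver.
func (t *Tracker) Inc() { t.count++ }

// noCopySize reports the size of the marker type in bytes.
func noCopySize() uintptr { return unsafe.Sizeof(noCopy{}) }

func main() {
	t := &Tracker{} // share a pointer instead of copying the value
	t.Inc()
	fmt.Println(t.count, noCopySize()) // prints 1 0

	// The next line compiles if uncommented, but go vet is expected to
	// report that the assignment copies a lock value:
	// bad := *t; _ = bad
}
```

The compiler never complains about the copy; the protection only exists if go vet is part of your workflow.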

Regular uint32 vs. atomic.Uint32

When you don't know how something works in Go, follow the approach we have used so far and look at the source code:

src/sync/atomic/type.go | GOVERSION=1.24
type Uint32 struct {
	_ noCopy
	v uint32
}

Ok, wow! As you can see, atomic.Uint32 is just a wrapper type around a uint32 with noCopy. But why?

Let's look further at the methods bound to this struct:

// Load atomically loads and returns the value stored in x.
func (x *Uint32) Load() uint32 { return LoadUint32(&x.v) }

// Store atomically stores val into x.
func (x *Uint32) Store(val uint32) { StoreUint32(&x.v, val) }

// Swap atomically stores new into x and returns the previous value.
func (x *Uint32) Swap(new uint32) (old uint32) { return SwapUint32(&x.v, new) }

// CompareAndSwap executes the compare-and-swap operation for x.
func (x *Uint32) CompareAndSwap(old, new uint32) (swapped bool) { return CompareAndSwapUint32(&x.v, old, new) }

Ok, it seems like it's just a wrapper type that provides methods for atomic operations.

And that is exactly what atomic.Uint32 is:

A Go 1.19+ wrapper type around a uint32 that provides methods for atomic operations

Bonus: Go 1.25 (Hopefully)

While exploring the internals of sync.Once, I noticed that the done field — which indicates whether the function has already been executed — was originally an atomic.Uint32.

However, since it’s only ever used as a boolean flag (0 or 1), I realized it could be more semantically clear to use atomic.Bool instead. Even though atomic.Bool is just a thin wrapper around a uint32 under the hood, switching to it makes the code more self-explanatory and aligns better with the intent of the field. So I decided to raise a PR and it got merged :)

Now the struct looks like this:

src/sync/once.go | GOVERSION=1.25
type Once struct {
	_ noCopy
	done atomic.Bool
	m    Mutex
}
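
To show how the pieces fit together with a boolean flag, here is a simplified re-creation of the pattern. It is a sketch for illustration, not the standard library source: myOnce and runMyOnce are hypothetical names, and the real type also carries the noCopy field.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// myOnce is a simplified stand-in for sync.Once, written for illustration.
type myOnce struct {
	done atomic.Bool
	m    sync.Mutex
}

// Do runs f only on the first call.
func (o *myOnce) Do(f func()) {
	if !o.done.Load() { // fast path: one atomic load, no lock
		o.doSlow(f)
	}
}

func (o *myOnce) doSlow(f func()) {
	o.m.Lock()
	defer o.m.Unlock()
	if !o.done.Load() { // second check, now under the lock
		defer o.done.Store(true)
		f()
	}
}

// runMyOnce calls Do from n goroutines and returns how often f ran.
func runMyOnce(n int) int {
	var (
		once  myOnce
		wg    sync.WaitGroup
		calls int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			once.Do(func() { calls++ })
		}()
	}
	wg.Wait()
	return calls
}

func main() {
	fmt.Println(runMyOnce(100)) // prints 1
}
```

Reading !o.done.Load() says "not done yet" directly, which is the readability gain over comparing a Uint32 with 0.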

Conclusion

Exploring sync.Once from Go 1.18 to Go 1.25 shows how a small, fundamental primitive can evolve for clarity, safety, and maintainability:

Go 1.18
- Used a plain uint32 flag plus a Mutex and double‑checked locking
- Minimized lock contention by atomically checking the flag on the fast path

Go 1.24
- Embeds noCopy to catch accidental copies via go vet
- Switches to atomic.Uint32, providing a clean, method‑based API

Go 1.25
- Switches atomic.Uint32 to atomic.Bool

Along the way we’ve seen:

Bootstrapping – how Go builds itself from source via make.bash
Atomic vs. mutex – why lock‑free fast paths matter in high‑concurrency code
Static analysis – how noCopy and go vet help prevent subtle bugs

The beauty of Go’s standard library is that it balances performance, safety, and readability. Whenever you have a question about how Go works under the hood, the answer is just a GitHub clone and a make.bash away. Dive into the source, follow the code, and you’ll not only solve your doubts—you’ll discover deeper principles that make Go such a pleasure to work with.

And it’s downright fun to see how these technologies evolve over time.

What does memory mean actually?

March 12, 2025 · 12 min read

Prabhav Dogra

Software Engineer II @ Blinkit

Introduction

While exploring how Go manages memory, I stumbled upon an intricate hierarchy that determines how fast data moves between the CPU and RAM. Go’s runtime optimizations, like garbage collection and stack allocation, made me curious about what happens under the hood. This led me to registers, caches (L1, L2, L3), and RAM—each playing a crucial role in balancing speed and storage.

Prerequisites: What's a CPU Cycle?

CPU Cycle:

It's the smallest unit of processing that a CPU can do.
Each cycle allows the CPU to execute instructions like fetching data, performing arithmetic, or storing results.
For example, a compare-and-swap such as atomic.CompareAndSwapInt32 in Go conceptually works as follows:
- It reads a value from memory.
- It compares it with an expected value.
- If they match, it writes a new value.
- This requires at least three steps (read, compare, write), which take multiple cycles.
A 1 GHz CPU completes one CPU cycle in 1 nanosecond.
Similarly, a 3 GHz CPU completes one CPU cycle in about 0.33 nanoseconds.
Not every operation takes 1 cycle
- Simple instructions like integer addition (a + b) may take 1 cycle.
- More complex operations (e.g., division, memory access) can take multiple cycles.
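
The numbers above follow from period = 1 / frequency. Here is a tiny sketch of that arithmetic; the function names are made up for illustration, and the 100-cycle figure is just an example input.

```go
package main

import "fmt"

// cyclePeriodNs returns the duration of one clock cycle in nanoseconds
// for a clock frequency given in GHz (period = 1 / frequency).
func cyclePeriodNs(freqGHz float64) float64 {
	return 1 / freqGHz
}

// latencyNs converts a cost in cycles into nanoseconds at the given frequency.
func latencyNs(cycles int, freqGHz float64) float64 {
	return float64(cycles) * cyclePeriodNs(freqGHz)
}

func main() {
	fmt.Printf("%.2f ns\n", cyclePeriodNs(1))  // prints 1.00 ns
	fmt.Printf("%.2f ns\n", cyclePeriodNs(3))  // prints 0.33 ns
	fmt.Printf("%.2f ns\n", latencyNs(100, 3)) // prints 33.33 ns
}
```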

Clock edge

NOTE: Here Period = Clock Cycle

A clock edge refers to the transition point of a clock signal where changes in a digital circuit occur. The clock signal is a periodic waveform (square wave), and it has two main edges:

Rising Edge (Positive Edge)
- The transition from low (0) to high (1).
- Many digital circuits, including registers and flip-flops, are designed to capture input and update their state on this edge.
Falling Edge (Negative Edge)
- The transition from high (1) to low (0).
- Some circuits use this edge for synchronization, though it is less common than the rising edge.

Why is the Clock Edge Important?

It synchronizes operations in digital circuits.
Registers and flip-flops capture and store data only on a specific edge, ensuring controlled data flow.
In CPU pipelines, clock edges trigger different stages like instruction fetch, decode, execute, etc.

Understanding Computer Memory Hierarchy

Registers (Fastest, Smallest)
- Located inside the CPU, closest to the execution units.
- Stores data for immediate operations (e.g., arithmetic calculations).
- Extremely small (few bytes) but operates at CPU clock speed.
- Access time: 1 CPU cycle (fastest).
L1 Cache (Level 1)
- Smallest and fastest cache (typically 32KB to 128KB per core).
- Directly integrated into the CPU core.
- Stores frequently used instructions (will discuss this later) and data for ultrafast access.
- Access time: 2-4 CPU cycles.
L2 Cache (Level 2)
- Larger than L1 (256KB to a few MB per core).
- Slightly slower than L1 but still much faster than RAM.
- Used to store recently accessed data that might be needed again soon.
- Access time: 10-20 CPU cycles.
L3 Cache (Level 3)
- Shared among multiple CPU cores, ranging from a few MB to tens of MB.
- Acts as a buffer between L2 and RAM, reducing latency for core-to-core communication.
- Access time: 30-60 CPU cycles.
RAM (Random Access Memory)
- Main working memory for the system (GBs in size).
- Much slower than CPU caches but holds more data.
- Stores the code and data of active processes, including everything that isn’t used often enough to stay in the CPU caches.
- Access time: 100+ CPU cycles.

Registers: The Fastest Storage

What Are Registers?

Registers are ultra-fast, small storage units embedded directly inside a computer’s CPU (Central Processing Unit). They are temporary holding areas for data, instructions, or memory addresses that the CPU needs to access immediately during computations. Registers are the fastest type of memory in a computer, designed to minimize delays in processing.

How Registers Work?

Fetching Data: When the CPU needs to perform an operation (e.g., 5 + 3), it first copies the values 5 and 3 from RAM into two registers.
Processing: The CPU’s arithmetic logic unit (ALU) performs the addition directly using the values stored in the registers.
Storing Results: The result (8) is placed into another register, which can either be used for further operations or written back to RAM.
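
The three steps above can be modelled with a toy machine. This is only a sketch: the machine type, its four registers, and its eight RAM cells are invented for illustration and do not correspond to any real CPU.

```go
package main

import "fmt"

// machine is a toy model: a small RAM and four general-purpose registers.
type machine struct {
	ram  [8]int
	regs [4]int
}

// load copies a value from RAM into a register (the "fetching data" step).
func (m *machine) load(reg, addr int) { m.regs[reg] = m.ram[addr] }

// add stands in for the ALU: it reads two registers and writes a third.
func (m *machine) add(dst, a, b int) { m.regs[dst] = m.regs[a] + m.regs[b] }

// store writes a register back to RAM (the optional "storing results" step).
func (m *machine) store(reg, addr int) { m.ram[addr] = m.regs[reg] }

// addFromRAM runs the load, add, store sequence for two values placed in RAM.
func addFromRAM(x, y int) int {
	var m machine
	m.ram[0], m.ram[1] = x, y
	m.load(0, 0)   // R0 <- RAM[0]
	m.load(1, 1)   // R1 <- RAM[1]
	m.add(2, 0, 1) // R2 <- R0 + R1
	m.store(2, 2)  // RAM[2] <- R2
	return m.ram[2]
}

func main() {
	fmt.Println(addFromRAM(5, 3)) // prints 8
}
```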

Basic Structure: Flip-Flops
- Core Component: Registers are built using D-type flip-flops, each flip-flop just stores one bit. A 32-bit register, for example, contains 32 flip-flops.
- Function: Each flip-flop has:
  - Data Input (D): Receives the bit to store.
  - Write Enable (WE): Controls whether the flip-flop should capture and store the input value on the next active clock edge.
  - Clock Input (CLK): Provides the timing signal that synchronizes when data is captured by the flip-flop. The flip-flop updates its value on the rising edge of the clock.
  - Output (Q): Provides the stored bit.

Data Storage and Clock Synchronization

Writing Data:
- When the CPU writes to a register:
  - The write enable signal for the register is activated.
  - Data is placed on the input bus. The input bus is a set of electrical connections that carry data to the register.
  - On the next rising edge of the clock, the flip-flops capture the input values.
Reading Data:
1. Stored Data in Flip-Flops
  - A register consists of multiple flip-flops, each storing a single bit. Once a value is stored in a flip-flop, it remains available at its output until changed by a new write operation. However, this stored value is not automatically placed on the CPU’s internal bus; something must drive it onto the bus when the register is read.
  - Each flip-flop's output is connected to a tri-state buffer, which controls whether the stored bit is driven onto the bus connecting the CPU and the register (for reading the bit).
2. Role of the Tri-State Buffer
  - The tri-state buffer ensures conflict-free data access (kind of like a mutex), enabling the CPU to perform billions of operations per second reliably.
  - A tri-state buffer is a special circuit that is in one of two states:
    - Tri-State Buffer Enabled: Passes the stored data from the register to the bus connecting the CPU and the register.
    - Tri-State Buffer Disabled: Disconnects the register from the bus connecting the CPU and the register.
  - This is necessary because multiple registers share the same internal bus, and only one should be active at a time to avoid conflicting reads and writes (more of a hardware constraint).
  - The enable signal is synchronized with the CPU clock to ensure stable data transfer.
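
The write and read paths described above can be simulated in a few lines. This is a software sketch of the behaviour, not a hardware description: register8 and its methods are hypothetical, and the "disconnected from the bus" state is represented by a boolean.

```go
package main

import "fmt"

// register8 models an 8-bit register: q holds the eight flip-flop outputs.
type register8 struct {
	q uint8
}

// clock simulates one clock transition. prevClk and clk are the clock levels
// before and after the transition; the flip-flops capture d only on a rising
// edge (low to high) and only while write enable is asserted.
func (r *register8) clock(prevClk, clk, writeEnable bool, d uint8) {
	risingEdge := !prevClk && clk
	if risingEdge && writeEnable {
		r.q = d
	}
}

// read models the tri-state buffer: when outputEnable is false the register
// is disconnected from the bus, reported here as ok == false.
func (r *register8) read(outputEnable bool) (value uint8, ok bool) {
	if !outputEnable {
		return 0, false
	}
	return r.q, true
}

func main() {
	var r register8
	r.clock(false, true, true, 0x2A)  // rising edge, WE high: captures 0x2A
	r.clock(true, false, true, 0xFF)  // falling edge: ignored
	r.clock(false, true, false, 0xFF) // rising edge, WE low: ignored
	v, ok := r.read(true)
	fmt.Println(v, ok) // prints 42 true
	_, ok = r.read(false)
	fmt.Println(ok) // prints false
}
```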

Types of Registers (A little extra context)

General-Purpose Registers:
- Used for temporary data storage and most calculations.
Special-Purpose Registers:
- Program Counter (PC): Holds the memory address of the next instruction to execute.
- Instruction Register (IR): Stores the current instruction being decoded/executed.
- Stack Pointer (SP): Tracks the top of the stack in memory.
- Status/Flag Register: Stores metadata about operations (e.g., whether a result was zero or caused an overflow).

Why Registers Are Essential

Eliminate Bottlenecks: Without registers, the CPU would need to read/write data directly from RAM for every operation, which is too slow.
Enable Pipelining: Registers allow the CPU to work on multiple instructions simultaneously by holding intermediate states.
Direct Hardware Access: Registers interface directly with the CPU’s ALU and control unit, enabling rapid execution of machine-level instructions.

CPU Cache: L1, L2, L3 CPU caches

CPU caches are small, ultra-fast memory layers between the CPU and main memory (RAM). They store frequently accessed data and instructions to reduce latency and improve performance. Modern CPUs use three levels of cache: L1, L2, and L3.

Cached content comes in two types, instructions and data, and the L1 cache is split accordingly:

Instruction Cache:
- What it stores:
  - Instructions are the actual binary code (machine code) of the program being executed by the CPU.
  - Examples: ADD, MOV, JUMP, LOAD, or any operation the CPU performs.
- Purpose:
  - Allows the CPU to quickly fetch the next operation to execute.
  - For example, when running a loop, the instruction cache holds the repeated code (for, while loops) so the CPU doesn’t have to fetch it repeatedly from slower memory.

Data Cache:
- What it stores:
  - Data refers to the values the CPU is actively working with.
  - Examples: Variables (e.g., int x = 5), memory addresses, temporary results, or input/output values.
- Purpose:
  - Provides fast access to the operands (numbers, addresses) needed by instructions.
  - For example, when calculating x + y, the data cache holds the values of x and y for the ADD instruction to use.

Why Split Them?

Parallel Access:
- The CPU can fetch the next instruction (from the instruction cache) while simultaneously reading/writing data (from the data cache). This avoids bottlenecks.
- Example: While executing an ADD instruction, the CPU can already fetch the next instruction (MOV or JUMP) from the instruction cache.
Specialization:
- Instruction caches are optimized for sequential access (program code is usually read in order).
- Data caches are optimized for random access (variables can be accessed in any order).

L1 Cache (Level 1 Cache)

Role:
- The fastest and smallest cache, directly integrated into the CPU core.
- Split into L1 Instruction Cache (stores executable code) and L1 Data Cache (stores data).
Characteristics:
- Size: Typically 32–64 KB per core (e.g., 64 KB total: 32 KB data + 32 KB instructions).
- Speed: 2–4 clock cycles access time (fastest).
- Location: Embedded within each CPU core.

L2 Cache (Level 2 Cache)

Role:
- Acts as a middle layer between L1 and L3.
- Stores data/instructions not held in L1 but likely to be reused.
Characteristics:
- Size: 256 KB–2 MB per core (varies by CPU design).
- Speed: 10–20 clock cycles access time.
- Location: Dedicated per core in many designs, though some designs share it between a cluster of cores.

L3 Cache (Level 3 Cache)

Role:
- The largest and slowest CPU cache, shared across all cores.
- Reduces traffic to RAM by storing data shared between multiple cores.
Characteristics:
- Size: 4–64 MB
- Speed: 30–60 clock cycles access time.
- Location: On the CPU die but outside individual cores.

Why Three Levels?

Latency vs. Size Trade-off: L1 prioritizes speed for critical data, L2 balances speed and size, and L3 minimizes RAM access.
Efficiency: Reduces "cache misses" by filtering requests through layers (90% of data is often found in L1/L2).
Multicore Optimization: L3 enables shared data (e.g., game textures, OS tasks) to stay accessible to all cores.

Practical Example:

When running a game:
- L1: Stores code for rendering a character (e.g., position calculations).
- L2: Caches textures used in the current scene.
- L3: Holds shared assets like audio files or global physics data.

Why not replace L2 and L3 with L1?

Physical Limits:
- L1 is fast but bulky/power-hungry. Scaling it to L2/L3 sizes would make CPUs impractical (cost, heat, latency).
Hierarchy Efficiency:
- L1: Speed-optimized for critical data.
- L2: Balances size/speed for common data.
- L3: Shared, large storage to minimize RAM trips.
Cache Miss Mitigation:
- Without L2/L3, frequent RAM access (~100x slower) would cripple performance.
Power/Heat:
- Larger L1 would drain power and overheat CPUs.
Multicore Sharing:
- L3 allows cores to access shared data without duplicating it in L1/L2.
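
One way to feel the cache hierarchy from Go is to walk the same matrix in two different orders. The sketch below only shows the two access patterns and checks that they compute the same sum. The matrix size and function names are arbitrary choices for illustration, and any timing difference depends on your hardware, so no numbers are claimed here.

```go
package main

import "fmt"

const n = 1024

// newMatrix allocates an n x n matrix filled with ones.
func newMatrix() *[n][n]int32 {
	m := new([n][n]int32)
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			m[i][j] = 1
		}
	}
	return m
}

// sumRowMajor walks the matrix in the order it is laid out in memory, so
// consecutive accesses tend to land in the same cache line.
func sumRowMajor(m *[n][n]int32) int64 {
	var sum int64
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			sum += int64(m[i][j])
		}
	}
	return sum
}

// sumColMajor jumps a whole row (n elements) between consecutive accesses,
// which touches a different cache line almost every time.
func sumColMajor(m *[n][n]int32) int64 {
	var sum int64
	for j := 0; j < n; j++ {
		for i := 0; i < n; i++ {
			sum += int64(m[i][j])
		}
	}
	return sum
}

func main() {
	m := newMatrix()
	fmt.Println(sumRowMajor(m), sumColMajor(m)) // prints 1048576 1048576
}
```

Wrapping the two functions in Go benchmarks (testing.B) is a simple way to compare them on your own machine.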

RAM: The Parts of the Memory Cell

Imagine a single memory cell in your computer’s RAM (the temporary memory your computer uses to do stuff). Think of it like a tiny light switch and a tiny battery working together to store a 0 or a 1 (the basic "yes/no" language computers use). Here’s how it works:

Capacitor: A tiny “battery” that can hold an electric charge.
- Charged (has electricity) = 1
- Not charged (empty) = 0
Transistor: A tiny “light switch” that controls access to the capacitor.
- ON (switch closed) = Lets electricity flow.
- OFF (switch open) = Blocks electricity.
Address Line: The wire that tells the transistor to turn ON/OFF.
Data Line: The wire that reads or writes the charge (0 or 1) to the capacitor.

How It Works

Writing Data (Saving a 0 or 1)
- Step 1: The CPU (computer’s brain) says, “Hey, I need to save a 1 at this specific memory cell!”
- Step 2: The Address Line sends electricity (like flipping the switch ON).
- Step 3: The Data Line sends electricity to charge the capacitor (filling the tiny battery). Result: Capacitor is charged = 1 is stored. If the CPU wants to save a 0, the Data Line drains the capacitor instead.
Reading Data (Checking if it’s 0 or 1)
- Step 1: The CPU says, “What’s stored at this memory cell?”
- Step 2: The Address Line sends electricity (switch ON).
- Step 3: If the capacitor is charged (storing 1), electricity flows out through the Data Line.
  - Result: The CPU detects this flow = 1.
- Step 4: If the capacitor is empty (storing 0), no electricity flows.
  - Result: The CPU detects no flow = 0.

Refresh Cycle

Each DRAM cell consists of a capacitor (storing a 1 or 0 as charge) and an access transistor.
When a cell is “charged” (1) or “discharged” (0), that state is maintained only temporarily because the charge leaks away.
The DRAM controller (or memory controller) periodically reads each memory cell and then rewrites (recharges) it to restore the original value. This refresh cycle typically occurs every 64–128 milliseconds for all cells.
Without refreshing, the leakage would eventually cause the stored bits to flip, leading to data corruption. The periodic refresh ensures data integrity over time.
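
The leak-and-refresh idea can be sketched as a small simulation. All the numbers here (the 0.5 threshold, losing 10% of the charge per step, the step counts) are hypothetical values picked so the effect is visible; they are not real DRAM parameters.

```go
package main

import "fmt"

// Hypothetical parameters chosen only for illustration.
const (
	fullCharge = 1.0
	threshold  = 0.5 // charge above this reads as 1
	leakFactor = 0.9 // fraction of charge left after one time step
)

// dramCell models one capacitor: the charge level is the stored bit.
type dramCell struct{ charge float64 }

// write charges the capacitor for a 1 and drains it for a 0.
func (c *dramCell) write(bit bool) {
	if bit {
		c.charge = fullCharge
	} else {
		c.charge = 0
	}
}

// read reports the bit by comparing the charge with the threshold.
func (c *dramCell) read() bool { return c.charge > threshold }

// leak simulates the charge draining away over one time step.
func (c *dramCell) leak() { c.charge *= leakFactor }

// refresh reads the cell and writes the same value back at full strength.
func (c *dramCell) refresh() { c.write(c.read()) }

// simulate stores a 1, lets the cell leak for the given number of steps and
// refreshes every refreshEvery steps (0 disables refresh).
func simulate(steps, refreshEvery int) bool {
	var c dramCell
	c.write(true)
	for t := 1; t <= steps; t++ {
		c.leak()
		if refreshEvery > 0 && t%refreshEvery == 0 {
			c.refresh()
		}
	}
	return c.read()
}

func main() {
	fmt.Println(simulate(20, 0)) // prints false: the stored 1 leaked away
	fmt.Println(simulate(20, 5)) // prints true: refresh restored the charge in time
}
```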

Writing your own Goroutines

March 1, 2025 · One min read

Prabhav Dogra

Software Engineer II @ Blinkit

This all started when someone asked me how goroutines work internally and all I could respond with was:

"Goroutines are lightweight threads managed by the Go runtime instead of the operating system. Go runtime automatically multiplexes—mapping multiple goroutines onto a smaller number of OS threads. And that somehow makes them fast?? 👉👈"

If anyone asked me any in-depth questions about how this multiplexing worked, I drew a blank. So I decided to gain a deeper understanding by implementing goroutines myself, and cloned the Go GitHub repo.

To be continued...

References

Goroutines and their scheduling basics:
- https://www.youtube.com/watch?v=S-MaTH8WpOM [Goroutines: Vicki Niu]
- https://youtu.be/MYtUOOizITs?si=FVGFtez2z3fNCjx7 [Goroutines: Jesus Espino]
- https://youtu.be/wQpC99Xu1U4?si=uOu0RiLyMpNXKYa0 [Go Scheduler: Madhav Jivrajani]
Go scheduler basics
- Channel primitives: https://youtu.be/KBZlN0izeiY?si=8HAeSVJxE3Vc3GC0 [Channels - Kavya Joshi]
How memory allocation works in go
- https://goog-perftools.sourceforge.net/doc/tcmalloc.html [tcmalloc]
- https://andrestc.com/post/go-memory-allocation-pt1/ [tcmalloc inspired allocator]
Go garbage collection internals:
- https://youtu.be/gPxFOMuhnUU?si=O9pn99sLiqptgyw3 [Garbage collector: Maya Rosecrance]
- https://youtu.be/We-8RSk4eZA?si=QNXxqq2xVEoh9At9 [GC Pacer: Madhav Jivrajani]
Netpoll: https://youtu.be/xwlo3xigknI?si=dmTrK_CH_fa0Bs51 [netpoll - Cindy Sridharan]

How Go atomic operations avoid race conditions?

February 28, 2025 · 5 min read

Prabhav Dogra

Software Engineer II @ Blinkit

Introduction

This question popped up in my head: "How do Go atomic operations avoid race conditions?"

I finally gathered the courage to open the cloned Go GitHub repo and scan through it.

Go Code Structure

[Figure: Go code structure. Source: ChatGPT]

I went inside the implementation of CompareAndSwapInt32 and found this:

src/sync/atomic/doc.go
// CompareAndSwapInt32 executes the compare-and-swap operation for an int32 value.
// Consider using the more ergonomic and less error-prone [Int32.CompareAndSwap] instead.
//
//go:noescape
func CompareAndSwapInt32(addr *int32, old, new int32) (swapped bool)

Finding the implementation of this was not straightforward, because this method is implemented in Go Assembly:

src/sync/atomic/asm.s
TEXT ·CompareAndSwapInt32(SB),NOSPLIT,$0
	JMP	internal∕runtime∕atomic·Cas(SB)

What's Go Assembly?

Simply put, Go Assembly is the low-level language used to write performance-critical functions in Go. The Go assembler (code directory path: cmd/asm) is the tool that assembles Go assembly (.s) files into machine code. The Go assembler was heavily inspired by the Plan 9 C compilers.

Plan 9 C compilers

Plan 9 C compilers (6c, 8c, 5c, etc.) were architecture-specific compilers designed to generate optimized code for different CPU architectures. Unlike GCC or LLVM, which support multiple architectures within a single compiler framework, Plan 9 used separate compilers for different instruction sets. These compilers were originally developed for the Plan 9 operating system, an experimental OS designed as a potential successor to Unix-based systems.

You can read more about it here: https://9p.io/sys/doc/compiler.html

Go drew inspiration from the Plan 9 C compilers:

Plan 9 had separate compilers for different architectures (e.g., 6c for x86-64, 8c for 32-bit x86, 5c for ARM).
Go’s assembler follows a similar architecture-based approach: it keeps a common syntax but has architecture-specific instruction sets and back ends for x86, ARM, RISC-V, etc.

You can watch this, an interesting talk about Go Assembler presented by Rob Pike himself.

Go Assembler Documentation

Go Assembler streamlined a lot of things:

Portability: It abstracts CPU architecture details better.
Simpler syntax: No need for % prefixes, brackets, or complex addressing.
Unified across architectures: ARM, AMD64, RISC-V, etc., use the same structure.
Designed for the Go runtime: Helps implement Go features like garbage collection, goroutines, and stack growth efficiently.

The Go source tree has architecture-specific assembly implementations of the routine behind atomic.CompareAndSwapInt32(), for example:

amd64.s: For AMD64 (x86-64) architecture (Intel, AMD CPUs).
arm64.s: For ARM64 (AArch64) processors (used in Apple M1/M2, mobile devices, servers).
ppc64le.s: For PowerPC 64-bit, Little Endian (used in IBM systems).
s390x.s: For IBM Z-series mainframes (used in enterprise computing).

Go runs on multiple architectures, and low-level atomic operations must be natively implemented for each to ensure compatibility.

Here is the implementation for one architecture, amd64 (the others follow the same idea), in Go Assembly:

src/internal/runtime/atomic/atomic_amd64.s
// bool Cas(int32 *val, int32 old, int32 new)
// Atomically:
//	if(*val == old){
//		*val = new;
//		return 1;
//	} else
//		return 0;
//  }
TEXT ·Cas(SB),NOSPLIT,$0-17
	MOVQ	ptr+0(FP), BX
	MOVL	old+8(FP), AX
	MOVL	new+12(FP), CX
	LOCK
	CMPXCHGL	CX, 0(BX)
	SETEQ	ret+16(FP)
	RET

Let's go through this line by line to see how it maintains atomicity.

TEXT ·Cas(SB),NOSPLIT,$0-17

TEXT ·Cas(SB): Declares the function Cas (CompareAndSwap) in Go assembly.
NOSPLIT: Tells the assembler not to insert the stack-split check (the prologue that grows the goroutine stack when needed) for this function. That is safe here because the function has a zero-byte frame.
$0-17: Specifies the stack frame size for the function (0 bytes for local variables and 17 bytes for arguments/return values: 8 for the pointer, 4 for old, 4 for new, and 1 for the boolean result).

MOVQ ptr+0(FP), BX:

Moves the pointer ptr (the address of val) into the BX register. FP is Go assembly's pseudo-register for addressing the function's arguments, so ptr+0(FP) is the first argument.

MOVL old+8(FP), AX:

Moves the old value (the argument at offset 8) into the AX register.

MOVL new+12(FP), CX:

Moves the new value (the argument at offset 12) into the CX register.

LOCK:

This is a crucial instruction. It prefixes the next instruction (CMPXCHGL) with a lock, ensuring that the memory operation is atomic. No other core or thread can modify the memory location while the compare-and-exchange instruction is running.

CMPXCHGL CX, 0(BX):

This is the Compare and Exchange instruction. It performs the following:
- Compares the value in AX (the old value) with the value at the memory location pointed to by BX (the val value).
- If the values are equal, it replaces the value at 0(BX) with the value in CX (the new value) and sets the zero flag.
- If they differ, memory is left unchanged, the current value at 0(BX) is loaded into the AX register, and the zero flag is cleared.

SETEQ ret+16(FP):

SETEQ sets the byte at the destination to 1 if the zero flag is set, and to 0 otherwise. In this case, it sets the return value to 1 if the comparison was equal (meaning the swap was successful), and to 0 otherwise.

RET:

Returns from the function

Conclusion

At the register level, atomicity is achieved because:

The LOCK prefix serializes access across CPU cores.
CMPXCHGL ensures all three steps (compare, swap, write-back) happen as one unit.
The CPU guarantees atomicity, eliminating race conditions without software locks.
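
From the Go side, this instruction is what makes lock-free retry loops possible. Here is a minimal sketch of a counter built only on compare-and-swap. The helper names casIncrement and countWithCAS are made up for illustration; in real code atomic.AddInt32 would be the simpler choice.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// casIncrement adds 1 to *addr using only compare-and-swap: read the current
// value, then retry until no other goroutine changed it in between.
func casIncrement(addr *int32) {
	for {
		old := atomic.LoadInt32(addr)
		if atomic.CompareAndSwapInt32(addr, old, old+1) {
			return
		}
	}
}

// countWithCAS runs the given number of goroutines, each incrementing a shared
// counter perGoroutine times, and returns the final value.
func countWithCAS(goroutines, perGoroutine int) int32 {
	var counter int32
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perGoroutine; i++ {
				casIncrement(&counter)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&counter)
}

func main() {
	fmt.Println(countWithCAS(8, 1000)) // prints 8000
}
```

No increment is lost even though no mutex is involved: a failed swap simply means another goroutine got there first, and the loop tries again with the fresh value.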

Feel free to be curious and figure out the answers to your questions on your own.
