Discovering goroutine leaks with Semgrep

Originally published May 10, 2021

While learning how to write multithreaded code in Java or C++ can make computer science students reconsider their career choices, calling a function asynchronously in Go is just a matter of prefixing a function call with the go keyword. However, writing concurrent Go code can also be risky, as vicious concurrency bugs can slowly sneak into your application. Before you know it, there could be thousands of hanging goroutines slowing down your application, ultimately causing it to crash. This blog post provides a Semgrep rule that can be used in a bug-hunting quest and includes a link to a repository of specialized Semgrep rules that we use in our audits. It also explains how to use one of those rules to find a particularly pesky type of bug in Go: goroutine leaks.

The technique described in this post is inspired by GCatch, a tool that uses interprocedural analysis and the Z3 solver to detect misuse-of-channel bugs that may lead to hanging goroutines. The technique and development of the tool are particularly exciting because of the lack of research on concurrency bugs caused by the incorrect use of Go-specific structures such as channels.

Although the process of setting up this sort of tool, running it, and using it in a practical context is inherently complex, it is worthwhile. When we closely analyzed confirmed bugs reported by GCatch, we noticed patterns in their origins. We were then able to use those patterns to discover alternative ways of identifying instances of these bugs. Semgrep, as we will see, is a good tool for this job, given its speed and the ability to easily tweak Semgrep rules.

Goroutine leaks explained

Perhaps the best-known concurrency bugs in Go are race conditions, which often result from improper memory aliasing when working with goroutines inside of loops. Goroutine leaks are also common but are seldom discussed, partly because their consequences become apparent only after many leaks have accumulated and begun to affect performance and reliability in a noticeable way.

Goroutine leaks typically result from the incorrect use of channels to synchronize goroutines or pass messages between them. The problem often occurs when an unbuffered channel is used where a buffered channel is needed. Goroutines affected by this type of bug stay blocked in memory, and enough of them can eventually exhaust a system’s resources, resulting in a system crash or a denial-of-service condition.

Let’s look at a practical example:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// The timeout is a single nanosecond, so it fires long before the server answers.
	requestData(time.Nanosecond)
	time.Sleep(time.Second * 1)
	// Subtract 1 to exclude the main goroutine from the count.
	fmt.Printf("Number of hanging goroutines: %d\n", runtime.NumGoroutine()-1)
}

func requestData(timeout time.Duration) string {
	dataChan := make(chan string) // unbuffered

	go func() {
		newData := requestFromSlowServer()
		dataChan <- newData // blocks until a receiver is ready
	}()

	select {
	case result := <-dataChan:
		fmt.Printf("[+] request returned: %s\n", result)
		return result
	case <-time.After(timeout):
		fmt.Println("[!] request timeout!")
		return ""
	}
}

func requestFromSlowServer() string {
	time.Sleep(time.Second * 1)
	return "very important data"
}

In the above code, the channel write operation dataChan <- newData blocks the anonymous goroutine that encloses it. That goroutine, which is started by the go func() statement in requestData, will be blocked until a read operation occurs on dataChan. This is because both read and write operations on an unbuffered channel block until the other side is ready, so every write operation must have a corresponding read operation.

There are two scenarios that cause anonymous goroutine leaks:

  • If the second case, case <-time.After(timeout), occurs before the read operation in the first case, the requestData function will exit, and the anonymous goroutine inside of it will be leaked.
  • If both cases are ready at the same time, the Go runtime will pick one of the two at random. If the second case is selected, the anonymous goroutine will be leaked.
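The second scenario can be observed in isolation. The following is a minimal sketch, not part of the original program: countSelections is a hypothetical helper that makes both cases of a select ready at once and counts how often each is chosen.

```go
package main

import "fmt"

// countSelections runs a two-case select `rounds` times with both cases
// ready and reports how often each case was chosen.
func countSelections(rounds int) (first, second int) {
	for i := 0; i < rounds; i++ {
		// Both channels are buffered and pre-filled, so both receives
		// can proceed at the moment the select is evaluated.
		a := make(chan struct{}, 1)
		b := make(chan struct{}, 1)
		a <- struct{}{}
		b <- struct{}{}

		select {
		case <-a:
			first++
		case <-b:
			second++
		}
	}
	return first, second
}

func main() {
	first, second := countSelections(1000)
	fmt.Printf("first case: %d, second case: %d\n", first, second)
}
```

Over many rounds both counters should be well above zero, which is why a timeout case that becomes ready together with the data case can still win and strand the sender.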

Running the example program produces the following output:

[!] request timeout!
Number of hanging goroutines: 1
Program exited.

The hanging goroutine is the anonymous goroutine started inside requestData.

Using a buffered channel would fix the above issue. While a read or write on an unbuffered channel blocks the goroutine until the other side is ready, a send (a write) to a buffered channel blocks only when the channel buffer is full. Similarly, a receive operation blocks only when the channel buffer is empty.
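The difference can be probed without hanging the program. The sketch below uses a hypothetical helper, trySend, which relies on a select with a default case to report whether a send would have blocked at that moment.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether it succeeded.
// A false result means a plain `ch <- v` would have blocked at that moment.
func trySend(ch chan string, v string) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan string)
	buffered := make(chan string, 1)

	fmt.Println(trySend(unbuffered, "a")) // false: no receiver is waiting
	fmt.Println(trySend(buffered, "a"))   // true: the buffer has a free slot
	fmt.Println(trySend(buffered, "b"))   // false: the buffer is now full
	<-buffered                            // drain the buffer
	fmt.Println(trySend(buffered, "c"))   // true: the slot is free again
}
```

A capacity of 1 is enough for the requestData case because the anonymous goroutine sends exactly one value.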

To prevent a goroutine leak, all we need to do is add a capacity to the make call that creates dataChan, which gives us the following:

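func requestData(timeout time.Duration) string {
	// A capacity of 1 lets the goroutine complete its send even if nobody receives.
	dataChan := make(chan string, 1)

	go func() {
		newData := requestFromSlowServer()
		dataChan <- newData // no longer blocks: the buffer has room for one value
	}()

	// ... the select block is unchanged ...
}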

After running the updated program, we can confirm that there are no more hanging goroutines.

[!] request timeout!
Number of hanging goroutines: 0
Program exited.

This bug may seem minor, but leaked goroutines accumulate, and in certain situations they can take down a service. For a real-world example of a goroutine leak, see this PR in the Kubernetes repository. While running 1,496 goroutines, the author of the patch experienced an API server crash resulting from a goroutine leak.
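To get a feel for how such leaks pile up, here is a small sketch under simplified assumptions. The names request and leakedAfter are hypothetical, and the timings (a 50 ms worker against a 1 ms timeout) are chosen only so that the timeout reliably wins.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// request mimics requestData: a worker sends its result on a channel with the
// given capacity, and the caller gives up after a short timeout.
func request(capacity int) {
	data := make(chan string, capacity)
	go func() {
		time.Sleep(50 * time.Millisecond) // slower than the caller's timeout
		data <- "result"                  // blocks forever if capacity is 0
	}()
	select {
	case <-data:
	case <-time.After(time.Millisecond):
	}
}

// leakedAfter calls request n times and returns how many extra goroutines
// are still alive once the workers have had time to finish.
func leakedAfter(n, capacity int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		request(capacity)
	}
	time.Sleep(200 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("unbuffered:", leakedAfter(100, 0))
	fmt.Println("buffered:  ", leakedAfter(100, 1))
}
```

On a typical run, the unbuffered variant should leave one goroutine behind per call, while the buffered variant should leave none. In a long-running server, that per-request cost never goes away.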

Finding the bug

The process of debugging concurrency issues is so complex that a tool like Semgrep may seem ill-equipped for it. However, when we closely examined common Go concurrency bugs found in the wild, we identified patterns that we could easily leverage to create Semgrep rules. Those rules enabled us to find even complex bugs of this kind, largely because Go concurrency bugs can often be described by a few sets of simple patterns.

Before using Semgrep, it is important to recognize the limitations on the types of issues that it can detect. When searching for concurrency bugs, the most significant limitation is Semgrep’s inability to conduct interprocedural analysis. This means that we’ll need to target bugs that are contained within individual functions. This is a manageable problem when working in Go and won’t prevent us from using Semgrep, since Go programmers often rely on anonymous goroutines defined within individual functions.

Now we can begin to construct our Semgrep rule, basing it on the following typical manifestation of a goroutine leak:

  1. An unbuffered channel, C, of type T is declared.
  2. A write/send operation to channel C is executed in an anonymous goroutine, G.
  3. C is read/received in a select block (or another location outside of G).
  4. The program follows an execution path in which the read operation of C does not occur before the enclosing function is terminated.

It is the last step that generally causes a goroutine leak.

Bugs that result from the four conditions listed above tend to leave recognizable patterns in the code, which we can detect using Semgrep. Regardless of the forms that these patterns take, there will be an unbuffered channel declared in the program, which we’ll want to analyze:

- pattern-inside: |
       $CHANNEL := make(...)
       ...

We’ll also need to exclude instances in which the channel is declared as a buffered channel:

- pattern-not-inside: |
       $CHANNEL := make(..., $T)
       ...

To detect the goroutine leak from our example, we can use the following pattern: