30 new Semgrep rules: Ansible, Java, Kotlin, shell scripts, and more

By Matt Schwager and Sam Alws

We are publishing a set of 30 custom Semgrep rules for Ansible playbooks, Java/Kotlin code, shell scripts, and Docker Compose configuration files. These rules were created and used to audit for common security vulnerabilities in the listed technologies. This new release of our Semgrep rules joins our public CodeQL queries and Testing Handbook in an effort to share our technical expertise with the security community. This blog post will briefly cover the new Semgrep rules, then go in depth on two lesser-known Semgrep features that were used to create these rules: generic mode and YAML support.

For this release of our internal Semgrep rules, we focused on issues like unencrypted network transport (HTTP, FTP, etc.), disabled SSL certificate verification, insecure flags specified for common command-line tools, unrestricted IP address binding, miscellaneous Java/Kotlin concerns, and more. Here are our new rules:

Mode | Rule ID | Rule description
Generic | container-privileged | Found container command with extended privileges
Generic | container-user-root | Found container command running as root
Generic | curl-insecure | Found curl command disabling SSL verification
Generic | curl-unencrypted-url | Found curl command with unencrypted URL (e.g., HTTP, FTP, etc.)
Generic | gpg-insecure-flags | Found gpg command using insecure flags
Generic | installer-allow-untrusted | Found installer command allowing untrusted installations
Generic | openssl-insecure-flags | Found openssl command using insecure flags
Generic | ssh-disable-host-key-checking | Found ssh command disabling host key checking
Generic | tar-insecure-flags | Found tar command using insecure flags
Generic | wget-no-check-certificate | Found wget command disabling SSL verification
Generic | wget-unencrypted-url | Found wget command with unencrypted URL (e.g., HTTP, FTP, etc.)
Java, Kotlin | gc-call | Calling gc suggests to the JVM that the garbage collector should be run, and memory should be reclaimed. This is only a suggestion, and there is no guarantee that anything will happen. Relying on this behavior for correctness or memory management is an anti-pattern.
Java, Kotlin | mongo-hostname-verification-disabled | Found MongoDB client with SSL hostname verification disabled
YAML (Ansible) | apt-key-unencrypted-url | Found apt key download with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible) | apt-key-validate-certs-disabled | Found apt key with SSL verification disabled
YAML (Ansible) | apt-unencrypted-url | Found apt deb with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible) | dnf-unencrypted-url | Found dnf download with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible) | dnf-validate-certs-disabled | Found dnf with SSL verification disabled
YAML (Ansible) | get-url-unencrypted-url | Found file download with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible) | get-url-validate-certs-disabled | Found file download with SSL verification disabled
YAML (Ansible) | rpm-key-unencrypted-url | Found RPM key download with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible) | rpm-key-validate-certs-disabled | Found RPM key with SSL verification disabled
YAML (Ansible) | unarchive-unencrypted-url | Found unarchive download with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible) | unarchive-validate-certs-disabled | Found unarchive download with SSL verification disabled
YAML (Ansible) | wrm-cert-validation-ignore | Found Windows Remote Management connection with certificate validation disabled
YAML (Ansible) | yum-unencrypted-url | Found yum download with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible) | yum-validate-certs-disabled | Found yum with SSL verification disabled
YAML (Ansible) | zypper-repository-unencrypted-url | Found Zypper repository with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible) | zypper-unencrypted-url | Found Zypper package with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Docker Compose) | port-all-interfaces | Service port is exposed on all interfaces

Semgrep 201: intermediate features

Semgrep is a static analysis tool for finding code patterns. This includes security vulnerabilities, bug variants, secrets detection, performance and correctness concerns, and much more. While Semgrep includes a proprietary cloud offering and more advanced rules, Semgrep CLI is free to install and run locally. You can run Trail of Bits’ rules, including the rules mentioned above, with the following command:

semgrep scan --config p/trailofbits /path/to/code

This post will not go into all the details of each rule presented above. The basics of Semgrep have already been discussed extensively by both Trail of Bits and the broader security community, so this post will discuss two lesser-known Semgrep features in more depth: generic mode and YAML support.

Generic mode

Semgrep’s generic mode provides an easy method for searching for arbitrary text. Unlike Semgrep’s syntactic support for programming languages like Java and Python, generic mode is glorified text search. Naturally, this provides both advantages and disadvantages: generic mode has a tendency to produce more false positives but also fewer false negatives. In other words, it produces more findings, but you may have to sift through them. Restricting a rule to specific paths is one way to reduce false positives. However, the primary reason for using generic mode is the breadth of data it can search.

Generic mode can roughly be thought of as an ergonomic alternative to regular expressions. Both perform arbitrary text search, but generic mode offers improved handling of newlines and other white space. It also offers Semgrep’s familiar ellipsis operator, metavariables, and tight integration with the rest of the Semgrep ecosystem for managing findings. Any text file or text-based data can be analyzed in generic mode, so it’s a great option when you want to analyze less commonly used formats such as Jinja templates, NGINX configuration files, HAML templates, TOML files, HTML content, or any other text-based format.
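To make the comparison with regular expressions concrete, here is a toy Python sketch of the regex you would otherwise have to write by hand for a pattern like `curl ... -k`. The `to_regex` helper and the sample script are hypothetical, and the translation is only a rough approximation for illustration; it is not how Semgrep implements generic mode.

```python
import re

def to_regex(pattern: str) -> re.Pattern:
    """Toy translation of an ellipsis-style pattern into a regex.

    Assumption for illustration only: '...' matches any run of text
    (including newlines), and any whitespace in the pattern matches
    any run of whitespace in the target.
    """
    regex = ""
    for i, token in enumerate(pattern.split()):
        if token == "...":
            # Optional filler between the neighbouring tokens.
            regex += r"(?:\s+.*?)?" if i else r".*?"
        else:
            regex += (r"\s+" if i else "") + re.escape(token)
    return re.compile(regex, re.DOTALL)

# Hypothetical shell script with a line continuation before the flag.
script = "curl --silent \\\n     -k https://example.com/install.sh | sh\n"

print(to_regex("curl ... -k").search(script) is not None)
print(to_regex("curl ... -k").search("curl https://example.com") is not None)
```

Even this small case needs explicit whitespace handling, escaping, and multi-line flags, which is the bookkeeping that the ellipsis operator and whitespace handling take care of in a generic mode pattern.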

The primary disadvantage of generic mode is that it has no semantic understanding of the text it parses. This means, for example, that patterns may be incorrectly detected in commented code or other unintended places—in other words, false positives. For example, if we search for os.system(...) in both generic mode and python mode in the following code, we will get different results:

import os

# Uncomment when debugging
# os.system("debugger")

os.system("run_production")

Figure 1: Python code with a line commented out

$ semgrep scan --lang python --pattern "os.system(...)" test.py 
...                       
    test.py 
            6┆ os.system("run_production")
...
Ran 1 rule on 1 file: 1 finding.

Figure 2: python mode semantically understands the comment.

$ semgrep scan --lang generic --pattern "os.system(...)" test.py 
...                       
    test.py 
            4┆ # os.system("debugger")
            ⋮┆----------------------------------------
            6┆ os.system("run_production")
...
Ran 1 rule on 1 file: 2 findings.

Figure 3: generic mode does not semantically understand the comment.
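The difference between the two scans can be reproduced in miniature with Python’s standard library. The sketch below is only an analogy, not Semgrep’s implementation: a line-based regex stands in for text search, and the `ast` module stands in for a language-aware parser. The function names are made up for this example.

```python
import ast
import re

SOURCE = '''import os

# Uncomment when debugging
# os.system("debugger")

os.system("run_production")
'''

def text_matches(source: str) -> list[int]:
    """Line numbers where a plain text search sees os.system(...)."""
    pattern = re.compile(r"os\.system\(.*\)")
    return [n for n, line in enumerate(source.splitlines(), start=1)
            if pattern.search(line)]

def syntax_matches(source: str) -> list[int]:
    """Line numbers of actual os.system(...) calls in the syntax tree."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "system"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "os"):
            hits.append(node.lineno)
    return sorted(hits)

print(text_matches(SOURCE))    # [4, 6]
print(syntax_matches(SOURCE))  # [6]
```

The parser discards the comment before any matching happens, so only the text search reports line 4.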

Another disadvantage of generic mode is that it lacks the extensive list of equivalences that Semgrep applies in its language-aware modes. Despite this, we still felt it was the right tool for the job when searching for these specific patterns. Sifting through a few false positives is okay if it means we don’t miss a critical security bug.

Given generic mode’s disadvantages, why use it for many of the rules released in this post? After all, Semgrep has official language support for both Bash and Dockerfiles. But consider the ssh-disable-host-key-checking rule. Using generic mode will find SSH commands disabling StrictHostKeyChecking in Bash scripts, Dockerfiles, CI configuration, documentation files, system calls in various programming languages, or other places we may not even be considering. Using the official Bash or Dockerfile support will cover only a single use case. In other words, using generic mode gives us the broadest possible coverage for a relatively simple heuristic that is applicable in many different scenarios.
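As a rough illustration of that breadth, the Python sketch below applies one text heuristic to snippets standing in for several file types. The regex is a hypothetical stand-in, not the published ssh-disable-host-key-checking rule, and the file names and contents are invented for the example.

```python
import re

# Hypothetical heuristic (not the published rule): an ssh invocation that
# sets StrictHostKeyChecking to "no" later on the same line.
HEURISTIC = re.compile(r"\bssh\b.*StrictHostKeyChecking[= ]no\b")

# Made-up snippets standing in for different file types.
SAMPLES = {
    "deploy.sh": 'ssh -o StrictHostKeyChecking=no deploy@host "uptime"',
    "Dockerfile": "RUN ssh -o StrictHostKeyChecking=no git@host true",
    "ci.yml": "  script: ssh -oStrictHostKeyChecking=no ci@host ./run.sh",
    "README.md": "Run `ssh -o 'StrictHostKeyChecking no' user@host` to connect.",
    "tool.py": 'subprocess.run(["ssh", "-o", "StrictHostKeyChecking=no", host])',
    "safe.sh": "ssh deploy@host uptime",
}

def flagged(samples: dict[str, str]) -> list[str]:
    """Names of the samples in which the heuristic matches."""
    return [name for name, text in samples.items() if HEURISTIC.search(text)]

print(flagged(SAMPLES))
```

Every sample except `safe.sh` is flagged, even though the snippets would need five different parsers to analyze syntactically.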

For more information, see Semgrep’s official documentation on generic pattern matching.

YAML support

In addition to generic mode, YAML support helps make Semgrep a one-stop shop for searching for code, or text, in basically any text-based file in your filesystem. And YAML is eating the world: Kubernetes configuration, AWS CloudFormation, Docker Compose, GitHub Actions, GitLab CI, Argo CD, Ansible, OpenAPI specifications, and yes, Semgrep rules themselves are even written in YAML. In fact, Semgrep has best practice rules written for Semgrep rules in Semgrep rules. Sem-ception.

Of course, you could write a basic utility in your programming language of choice that uses a mainstream YAML library to parse YAML and search for basic heuristics, but then you would be missing out on the rest of the Semgrep ecosystem. The fact that you can manage all these different types of files and file formats in one place is Semgrep’s killer feature. YAML rules sit next to Python rules, which sit next to Java rules, which sit next to generic rules. They all run in CI together, and findings can be managed in the same place. Ten tools for 10 types of files are no longer necessary.

We were recently engaged in an audit that included a large Ansible implementation. With this in mind, we set out to cover many of the basic security concerns one may expect in the ansible.builtin namespace. Searching for YAML patterns using Semgrep’s YAML rule format has a tendency to make your head spin, but once you get used to it, it becomes relatively formulaic. The highly structured nature of formats like JSON and YAML makes searching for patterns straightforward. The Ansible rules presented at the top of this post are relatively clear-cut, so instead let’s consider the patterns of the port-all-interfaces rule, which highlight the YAML functionality more distinctly:

patterns:
  - pattern-inside: |
      services:
        ...
  - pattern: |
      ports:
        - ...
        - "$PORT"
        - ...
  - focus-metavariable: $PORT
  - metavariable-regex:
      metavariable: $PORT
      regex: '^(?!127\.\d{1,3}\.\d{1,3}\.\d{1,3}:).+'

Figure 4: patterns searching for ports listening on all interfaces

The | YAML block style indicator used in the pattern-inside and pattern operators states that the text below is a plaintext string, not additional Semgrep rule syntax. Semgrep then interprets this plaintext string as YAML. Again, the fact that this is YAML within YAML takes some squinting at first, but the rest of the rule is relatively straightforward Semgrep syntax.

The rule itself is looking for services binding to all interfaces. The Docker Compose documentation states that, by default, services will listen on 0.0.0.0 when specifying ports. This rule finds ports that don’t start with a loopback address, like 127.0.0.1, which usually indicates they listen on all interfaces. This is not always a problem, but it can lead to issues like firewall bypass in certain circumstances.
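The same check can be sketched in plain Python to show what the patterns select. This is an approximation under stated assumptions: `exposed_ports` and the sample data are hypothetical, the input is a dictionary such as a YAML loader might return for a Compose file, only short-syntax string ports are considered, and the regex is a variant of the rule’s with the dots escaped so they match literal dots.

```python
import re

# Matches any port string that does not begin with a 127.x.x.x host
# address followed by a colon.
NOT_LOOPBACK = re.compile(r"^(?!127\.\d{1,3}\.\d{1,3}\.\d{1,3}:).+")

def exposed_ports(compose: dict) -> list[tuple[str, str]]:
    """Return (service, port) pairs whose port entry lacks a loopback host."""
    findings = []
    for name, service in (compose.get("services") or {}).items():
        for port in service.get("ports") or []:
            # Only short-syntax string entries are checked in this sketch.
            if isinstance(port, str) and NOT_LOOPBACK.match(port):
                findings.append((name, port))
    return findings

# Hypothetical parsed Compose file.
sample = {
    "services": {
        "web": {"ports": ["8080:80", "127.0.0.1:8443:443"]},
        "db": {"ports": ["0.0.0.0:5432:5432"]},
        "cache": {"image": "redis"},
    }
}

for service, port in exposed_ports(sample):
    print(f"{service}: {port}")
```

Here `8080:80` and `0.0.0.0:5432:5432` are reported, while `127.0.0.1:8443:443` is not. A standalone script like this works, but it sits outside the rule management and CI integration that a Semgrep rule gets for free.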

Extend your reach with Semgrep

Semgrep is a great tool for finding bugs across many disparate technologies. This post introduced 30 new Semgrep rules and discussed two lesser-known features: generic mode and YAML support. Adding YAML and generic searching to Semgrep’s extensive list of supported programming languages makes it an even more universal tool. Heuristics for problematic code or infrastructure and their corresponding findings can be managed in a single location.

To read more about our work with Semgrep, see our posts on securing machine learning pipelines, discovering goroutine leaks, and securing Apollo GraphQL servers.

Contact us if you’re interested in custom Semgrep rules for your project.
