This blog post examines a tricky bug in the incredibly useful libarchive-ruby-swig Ruby gem. This gem wraps the libarchive C library, which can be used to read and write archives in many different formats.
The bug, in the C++ code of the RubyGem itself, causes Ruby’s GC to mistakenly free an in-use object, which later leads to a segfault.
A fix for this issue was sent as a pull request and subsequently merged.
A portion of the backend software for packagecloud uses Ruby to determine the type of package the user is uploading. libarchive-ruby-swig is pointed at uploaded files as part of the file type detection process. During the development of support for DSCs, a recurrent segmentation fault was encountered that only seemed to be related to processing DSC packages.
Getting a reproducible test case
Getting a reproducible test case for a garbage collection related bug can be quite tricky as changes to the environment, directory structure, time of day, etc. can all affect when and how a garbage collection run is executed.
I realized that re-running the test suite only triggered this segfault for DSC files, which are plain text. So, I created a simple test program which used libarchive-ruby-swig and pointed it at a plain text file and forced a garbage collector run:
And we’ve got a winner:
So, what is going on here?
Investigating with GDB
The first step in any fun debugging session is to fire up GDB, get a backtrace, and see what’s what.
It’s a bit tricky if you need Bundler to run your test program, but not too bad:
(Some alerts from GDB about threads and libthread_db were removed for brevity, but the important pieces here are getting the program running and seeing the SIGSEGV come through.)
And now, for a backtrace:
From the backtrace, we can see that:
- The read_open_filename class method is called (stack frames 8 and 9)
- An exception is raised at stack frame 7
- Internal MRI functions from frames 6 to 0 attempt to create an exception
- A lookup on a hash table via st_lookup causes a segfault
It’s important to note that libarchive-ruby-swig uses SWIG to autogenerate some wrapper code for interacting with the libarchive C library. This means that we’ll need to dive into some interesting generated C++ code to fully debug this issue.
So, we begin by first examining the source code described in stack frame 8, libarchive_wrap.cxx, line 2486:
At first glance, this code looks reasonable. When an exception occurs, it is caught and then raised in Ruby-land so that Ruby programs using this RubyGem can handle the error appropriately.
Why would this cause a segfault?
Read the assembly
Most of the time, it is far more useful and instructive to read the actual assembly code that is being executed, especially when debugging. In this case, once the assembly is examined, it’ll be a bit clearer why the segfault happens.
So, ask GDB to show some of the assembly instructions for the function in question:
If you’ve never disassembled C++ code before, the output above will surely look a bit overwhelming, but the key thing to notice is:
This function is provided by the compiler (in this case g++) and it is used above to implement the static storage class specifier we saw in the exception handling code in the generated SWIG wrapper code earlier:
This usage was intended to initialize the static variables, o_except among them, just one time so that any future exceptions raised to Ruby would not need to reinitialize o_except. Their values are stored the first time and re-used.
The assembly code is a bit convoluted, but let’s walk through how static is implemented here.
The C++ code:
The assembly code starts by calling the guard function to determine if o_except has been initialized. If not, control is transferred via a jump instruction (jne) to another piece of code:
The code that is jumped to initializes o_except. You’ll see a call there to the function that rb_exc_new2 expands to: rb_exc_new2 is actually just a macro in the Ruby VM source and is replaced during preprocessing.
After the function is called, its return value is written to o_except and control is transferred back to the mov instruction above, which appears after the jump instruction.
And, in this way, static is implemented for variables defined within functions in C++.
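To make this concrete, here is a minimal C++ sketch of the same idea. It is not the SWIG-generated code: Handle, make_exception, get_static, and get_guarded are hypothetical names, and the hand-written guard in get_guarded is only an approximation that ignores the thread safety a real compiler-generated guard provides.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Ruby's VALUE: an opaque, pointer-sized handle.
using Handle = std::uintptr_t;

// Counts how many times the initializer below actually runs.
int init_calls = 0;

// Hypothetical stand-in for the call that creates the exception object.
Handle make_exception() {
    ++init_calls;
    return 0xCAFE;
}

// What the programmer writes: the initializer runs on the first call only.
Handle get_static() {
    static Handle o_except = make_exception();
    return o_except;
}

// Roughly what the compiler arranges behind the scenes (thread safety omitted):
// a guard flag and the variable itself both live in static storage.
Handle get_guarded() {
    static bool guard = false;   // constant-initialized, so it needs no guard itself
    static Handle o_except = 0;  // storage reserved outside the stack
    if (!guard) {                // first call: take the initialization path
        o_except = make_exception();
        guard = true;
    }
    assert(guard);               // from here on the stored value is simply re-used
    return o_except;
}
```

Calling either function twice runs make_exception only once. Every later call just reads the value back out of static storage.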
But, what does this have to do with the segfault?
Ruby’s GC implementation
In order to understand this bug, you must understand how Ruby’s garbage collector works. Ruby’s garbage collector is a conservative mark-and-sweep garbage collector.
It works by:
- Crawling in use objects, starting at a set of root objects, and marking them
- Checking the program stack and heap for any value that looks like it could be a Ruby object, and marking those.
- Checking the register set of the CPU for any value that looks like it could be a Ruby object, and marking those.
Due to implementation details, it is impossible for Ruby to know if a particular value found on the stack or heap is actually a Ruby object or not. Ruby acts “conservatively” in that when it finds a value in a CPU register or on the program stack that looks like it refers to a Ruby object, the Ruby object that could be referenced is marked as in-use just in case.
After all objects (and things which look like objects) are marked, a sweep phase begins freeing objects which are not marked.
This process is demonstrated in the animation below (taken from here):
The bug occurs because:
- The Ruby VM’s object allocator doesn’t know which objects are in use; all it can do is guess
- The variables marked as static are initialized once and never again afterward
- The compiler has optimized the generated assembly to use the fewest registers possible. As such, references to Ruby objects aren’t always guaranteed to exist in registers, or on the program stack, if the compiler thinks it can complete a function call or other operation without them. Remember, your compiler just needs to satisfy the ABI of the target system - it doesn’t “know” anything about Ruby or Ruby objects and performs valid optimizations for the target system.
- The Ruby objects that are marked static are not stored on the heap or the program stack; they are stored in a different program memory segment entirely
When Ruby’s garbage collector runs, it does not see the static objects because:
- References to the objects aren’t found in the program stack where the Ruby GC will scan
- References to the objects don’t exist in registers due to optimizations and the static initialization code path running once
And so, Ruby’s GC mistakenly frees this object even though it will be used when an exception is generated.
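The failure mode can be reproduced in miniature with a toy collector. The sketch below is not Ruby's GC. ToyHeap, collect, and cached_exception are hypothetical names, and the "stack" is just a vector of words handed to collect. It only illustrates that an object whose sole reference sits in static storage is invisible to a collector that scans nothing but the stack.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Toy model only: this is NOT Ruby's GC, just the marking idea in miniature.
struct Object {
    bool marked = false;
    bool freed = false;
};

class ToyHeap {
public:
    // Allocates an object and returns its address as an opaque word.
    std::uintptr_t allocate() {
        objects_.push_back(std::make_unique<Object>());
        return address_of(*objects_.back());
    }

    // Conservative collection: any scanned word equal to the address of a
    // live object marks that object; every unmarked object is then swept.
    void collect(const std::vector<std::uintptr_t> &scanned_words) {
        for (auto &obj : objects_) obj->marked = false;
        for (std::uintptr_t word : scanned_words)
            for (auto &obj : objects_)
                if (!obj->freed && address_of(*obj) == word) obj->marked = true;
        for (auto &obj : objects_)
            if (!obj->marked) obj->freed = true;  // a real GC would reuse the slot
    }

    // Reports whether the object behind a handle has been swept.
    bool is_freed(std::uintptr_t handle) const {
        assert(handle != 0);
        for (const auto &obj : objects_)
            if (address_of(*obj) == handle) return obj->freed;
        return true;  // not an object of this heap
    }

private:
    static std::uintptr_t address_of(const Object &obj) {
        return reinterpret_cast<std::uintptr_t>(&obj);
    }
    std::vector<std::unique_ptr<Object>> objects_;
};

// Mirrors the buggy pattern: the only reference lives in a function-local
// static, i.e. in static storage that collect() is never asked to scan.
std::uintptr_t cached_exception(ToyHeap &heap) {
    static std::uintptr_t o_except = heap.allocate();
    return o_except;
}
```

If cached_exception is called once and collect is then given a simulated stack that does not contain the returned word, the cached object is swept. Later calls to cached_exception still return the same stale address, which is the toy equivalent of the dangling reference behind the segfault.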
Writing a fix
The quickest and simplest fix for this is to remove the static storage class specifier:
There are two effects of this change:
- The exception object is recreated in Ruby-land every time an exception occurs in C++
- A reference to the Ruby object will actually exist on the program stack so Ruby’s GC will see it when it scans
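The first effect is easy to see in isolation. The following is a small sketch with hypothetical names (Handle, make_exception, handler_with_static, handler_without_static), not the patched wrapper code. It counts how often the initializer runs with and without static.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Ruby's VALUE.
using Handle = std::uintptr_t;

// Counts how many exception objects have been created so far.
int creations = 0;

// Hypothetical stand-in for creating the exception object in Ruby-land.
Handle make_exception() {
    ++creations;
    return 0xBEEF + static_cast<Handle>(creations);
}

// Before the fix: created once, then served from static storage forever.
Handle handler_with_static() {
    static Handle o_except = make_exception();
    return o_except;
}

// After the fix: an ordinary local, created afresh on every call and held
// in a stack slot or register for as long as the function needs it.
Handle handler_without_static() {
    Handle o_except = make_exception();
    assert(o_except != 0);  // sanity check: a freshly created handle is never null
    return o_except;
}
```

Three calls to handler_with_static create one object, while three calls to handler_without_static create three. That extra allocation is the price paid for keeping the reference somewhere a conservative scan can find it.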
A fix was sent via pull request on GitHub and merged to the project.
Deploying the fix
packagecloud uses packagecloud for maintaining internal dependencies. A fix for this issue was deployed immediately so that development of the feature could continue in development environments without waiting for a fix to be accepted and merged to the general project.
Our exact steps for this:
- The original repository was forked
- The fix was committed
- A version bump was committed
- The RubyGem was rebuilt and pushed to packagecloud
Rebuilding the RubyGem and pushing to packagecloud was quick and easy:
Ruby’s garbage collector can prove to be a tricky adversary when writing or using C or C++ based RubyGems. You need to carefully consider how the garbage collector will interact with Ruby objects created and allocated in C/C++ and what the implications are when using storage class specifiers.
Any complex system will eventually require the developers to manage their own set of dependencies in order to get bugfixes, performance improvements, or new features that can be used in the application.
Having a place for these objects to live and be tracked is crucial for ensuring that production, development, and test environments are using the same versions of every piece of software in the stack.