Open source is a great thing. This becomes especially obvious if one is confronted with a program that refuses to work, and furthermore refuses to yield any kind of helpful error message. Reading the source may be the only way to determine what is actually going on.
Sadly, I've been doing rather a lot of that lately. This post shall serve as an example of how to navigate the Open Solaris source code in search of an answer.
h3. The problem
This specific problem arose during my experiments to create a small Solaris installation for use in an embedded system (small in this context means around 60MB used disk space). More details on this later.
The system has a cfgadm(1M) binary, but it does not work:
# cfgadm
cfgadm: Library error: Device library initialize failed: Facility is not active
As error messages go, this is only marginally better than "Failed". Telling the user exactly which facility is not active would have been helpful.
But at least there are some search-friendly strings in there that may help to locate the source code responsible for this message.
h3. The source
One thing the classical UNIX source approach of "all the source in one tree" has going for it is that it makes searching the source relatively easy. The Open Solaris web site has built a search engine on top of the source tree which automatically cross-references symbols in the code and has some other nice features. "The entry page to the search engine is here.":http://src.opensolaris.org/source
Searching for "Facility is not active" (note the quotes) yields just a handful of hits. One of those (in /onnv/onnv-gate/usr/src/uts/common/sys/errno.h) hints that there is a system error (and corresponding symbol) called ENOTACTIVE which belongs to this error message.
Running cfgadm under truss(1) confirms this:
# truss cfgadm
execve("/usr/sbin/cfgadm", 0x08047E24, 0x08047E2C) argc = 1
[...]
sysconfig(_CONFIG_PAGESIZE) = 4096
open("/devices/pseudo/devinfo@0:devinfo", O_RDONLY) = 3
ioctl(3, DINFOIDENT, 0x00000000) = 57311
ioctl(3, 0x10DF00, 0x08047460) Err#73 ENOTACTIVE
close(3) = 0
[...]
Things go kind of downhill from there. Some code opens the devinfo device and runs two IOCTLs on it, and the second one fails. Furthermore, truss only knows the first IOCTL by name, not the one that actually fails.
Searching for the first name turns up /onnv/onnv-gate/usr/src/uts/common/sys/devinfo_impl.h:
#define DINFOIDENT (DIIOC | 0x82) /* identify the driver */
Looking around in this file some more yields two other definitions:
#define DIIOC (0xdf<<8)
[...]
#define DINFOCACHE (DIIOC | 0x100000) /* use cached data */
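Combining these definitions should reproduce the raw value 0x10DF00 that truss printed for the failing call. A minimal C sketch of that check follows. The three macros are copied from the header excerpts above; the helper di_request_name and the macro TRUSS_RAW_REQUEST are hypothetical names introduced purely for illustration.

```c
/* Copied from the devinfo_impl.h excerpts quoted above. */
#define DIIOC       (0xdf << 8)          /* 0xDF00 */
#define DINFOIDENT  (DIIOC | 0x82)       /* 0xDF82 */
#define DINFOCACHE  (DIIOC | 0x100000)   /* 0x10DF00 */

/* Raw request value that truss printed for the failing ioctl. */
#define TRUSS_RAW_REQUEST 0x10DF00

/*
 * Hypothetical helper (not part of Solaris): map a raw ioctl request
 * value to the name of the matching macro, if any.
 */
static const char *
di_request_name(unsigned int request)
{
	if (request == DINFOIDENT)
		return ("DINFOIDENT");
	if (request == DINFOCACHE)
		return ("DINFOCACHE");
	return ("unknown");
}
```

The helper only knows the two request values discussed in this post; any other value is reported as "unknown".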
So the second IOCTL is actually called DINFOCACHE. Tracing IOCTLs through the code is, unfortunately, a bit tricky, because the routine that handles the IOCTL depends on the passed file descriptor (the first parameter to the IOCTL call). The file descriptor in this case belongs to the file /devices/pseudo/devinfo@0:devinfo (see the open call directly above the two IOCTLs).
But since the IOCTL handling code most likely contains the symbol DINFOCACHE as well (that's what constants are for, after all), searching for the name will turn up the correct file, possibly buried among others.
Armed with this knowledge, the search results for DINFOCACHE can be narrowed down to one likely candidate: /onnv/onnv-gate/usr/src/uts/common/io/devinfo.c. This file belongs to the kernel code (it lives in usr/src/uts), and its name matches the name of the device opened above.
DINFOCACHE appears twice in a function called di_ioctl, which sounds good. Following the code flow through this function (DINFOCACHE is passed in the cmd parameter), the first relevant code part reads as follows:
if ((st->command & DINFOCACHE) && !cache_args_valid(st, &error)) {
di_freemem(st);
(void) di_setstate(st, IOC_IDLE);
return (error);
}
(By the time execution reaches this code, the cmd variable has been copied to st->command, more or less.) cache_args_valid, among other things, does the following:
if (!modrootloaded || !i_ddi_io_initialized()) {
CACHE_DEBUG((DI_ERR,
"cache lookup failure: I/O subsystem not inited"));
*error = ENOTACTIVE;
return (0);
}
That looks pretty promising, as it sets the right error code if the condition holds. modrootloaded is a kernel symbol, so mdb(1) can be used to inspect its value in a running kernel.
# mdb -k
Loading modules: [ unix genunix specfs mac cpu.generic uppc pcplusmp scsi_vhci
ufs sockfs ip hook neti sctp arp usba uhci sd lofs logindmux ptm random crypto
zfs ipc ]
> modrootloaded/X
modrootloaded:
modrootloaded: 1
That's not the culprit. i_ddi_io_initialized() basically returns the value of sysevent_daemon_init, so what about that?
# mdb -k
Loading modules: [ unix genunix specfs mac cpu.generic uppc pcplusmp scsi_vhci
ufs sockfs ip hook neti sctp arp usba uhci sd lofs logindmux ptm random crypto
zfs ipc ]
> modrootloaded/X
modrootloaded:
modrootloaded: 1
> sysevent_daemon_init/X
sysevent_daemon_init:
sysevent_daemon_init: 0
Bingo. From the name of the variable, the probable name of the facility that is not running (remember the original error message?) can be deduced: svc:/system/sysevent:default, which, indeed, is not running on the minimal system. Starting it makes cfgadm work.
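The chain of checks traced above can be condensed into a small userland model. This is only a sketch under stated assumptions, not the kernel code: the function cache_gate_sketch and the macros DINFOCACHE_BIT and ENOTACTIVE_SKETCH are hypothetical names. The error number 73 is taken from the truss output, and sysevent_daemon_init stands in for i_ddi_io_initialized(), which basically returns that variable. As a simplification, the model tests only the 0x100000 bit that distinguishes DINFOCACHE from the DIIOC prefix.

```c
/* From the devinfo_impl.h excerpt: DINFOCACHE is (DIIOC | 0x100000). */
#define DIIOC             (0xdf << 8)
#define DINFOCACHE        (DIIOC | 0x100000)
/* Hypothetical: the part of DINFOCACHE that is not the DIIOC prefix. */
#define DINFOCACHE_BIT    (DINFOCACHE & ~DIIOC)
/* Hypothetical name; the value 73 comes from truss ("Err#73 ENOTACTIVE"). */
#define ENOTACTIVE_SKETCH 73

/*
 * Simplified model of the path through di_ioctl and cache_args_valid:
 * a request that asks for cached data fails with ENOTACTIVE unless the
 * root module is loaded and the sysevent daemon has initialized.
 * Returns 0 when the request would pass this particular check.
 */
static int
cache_gate_sketch(unsigned int command, int modrootloaded,
    int sysevent_daemon_init)
{
	if ((command & DINFOCACHE_BIT) != 0 &&
	    (!modrootloaded || !sysevent_daemon_init))
		return (ENOTACTIVE_SKETCH);
	return (0);
}
```

With the values observed in mdb (modrootloaded set to 1, sysevent_daemon_init set to 0), cache_gate_sketch(0x10DF00, 1, 0) returns 73, matching the failure seen in truss. Once the sysevent service is running and the variable becomes 1, the same call returns 0.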
That wasn't so hard, now was it?