Sunday, February 22. 2009
Building an OpenSolaris storage - Software, Part 3

Installing Solaris can be a strange experience for someone who is only used to modern Linux installers. Yes, there is a graphical installer, but it consists of little more than an X window which basically asks the same questions as the text mode installer. Unless you already know how to install Solaris, and what the installer expects of you, some of the questions and dialogs seem a little strange. Due to Solaris' focus on binary compatibility some of the defaults don't make that much sense anymore, either, but changing them to more sensible values would cause confusion, or so it seems.

For the install on the storage system, though, most of the defaults are sensible, and since Solaris does not have to share any disks with other operating systems the partitioning process is not that painful, either.

The first question the installer asks (always in text mode) is about the general installation mode the user wishes to perform (roughly: graphical, text based, or rescue shell). Interactive/text mode (option 4) is usually fine.

If the system has booted from the network the installer will not ask about IP configuration for the network cards but assume DHCP for IPv4.

The question about the name resolution service is one of the odd quirks in the installer. The naming service defaults to NIS, which is probably wrong for almost any new installation on this planet. Usually DNS is the right choice here. The installer will then ask for the DNS server IPs and default domains. If the installer can not resolve the current machine's IP via these nameservers it will explicitly ask for confirmation that the data is really correct.

The default answers for the next questions (Kerberos/NFSv4) are sensible in the usual cases.

When asked for the file system to use for the root filesystem the default is UFS. Change it to ZFS. I prefer to use separate datasets for …

The (almost) final question is about the amount of packages to be installed. The installer offers five predefined groups, ranging from several hundred megabytes to almost three gigabytes of installed data. Selecting the smallest set will do fine here: the system will boot and have network and NFS client support, which is enough to get at the rest of the packages to install later.

That's it, basically. The installer will now copy the files to the boot disk, prepare the bootloader and restart the system.
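Getting at those remaining packages later can be as simple as mounting the install image from the install server described in Part 2 (below) and running pkgadd against it. A rough sketch; the package directory name Solaris_11/Product is an assumption (it differs between builds), and SUNWgtar is just an example package:

# mount -F nfs 10.200.200.1:/jumpstart /mnt
# pkgadd -d /mnt/Solaris_11/Product SUNWgtar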
Monday, February 16. 2009
Building an OpenSolaris storage - Software, Part 2

After solving the cache issue on the MSI board it was time to install the OS for real. I had planned to do the install via the network, for two reasons.

Installing Solaris via the network has been supported for a very long time, and Solaris being what it is, the process has not changed very much. That means that there are some quirks in it. Because this was the first Solaris installation in my network some non-Solaris machine had to take over the job of providing the various services needed for an installation. The job was delegated to my notebook running Linux.

The following services are needed to install Solaris over the network:

* A DHCP server
* A TFTP server
* An NFS server

In addition to that, some software:

* A Solaris medium (obviously)
* A bootloader (I'll use SYSLINUX, at least version 3.73 is required)

First and foremost, though, the network card in the system that is to be the target of the installation needs to support PXE. PXE defines a way for a network card BIOS to obtain an IP address via DHCP and load a piece of software from the network, which is then executed by the system. In addition, PXE provides a handful of library functions that give the just-loaded software a way to talk to the network itself (to load even more software, for example). Most modern network cards and BIOSes support PXE and booting from the network. If this is not the case, the nice people over at etherboot.org have a large library of network card specific code that can be booted via a floppy disk or a bootable CDROM and will provide the network card with the appropriate capabilities.

A dedicated network will be used for the installation, namely 10.200.200.0/24. The install server has the IP 10.200.200.1.

h3. Preparing the tftpboot directory

The TFTP server will serve its files from the /tftpboot/ directory, which is structured like this:

/tftpboot/
|-- mboot.c32
|-- pxelinux.0
|-- pxelinux.cfg
|   `-- default
`-- solaris
    |-- platform
    |   `-- i86pc
    |       `-- kernel
    |           `-- unix
    `-- x86.miniroot

The pxelinux.cfg/default file contains the boot configuration:

DEFAULT jumpstart

LABEL jumpstart
KERNEL mboot.c32
APPEND -solaris solaris/platform/i86pc/kernel/unix -v -m verbose -B install_media=10.200.200.1:/jumpstart --- solaris/x86.miniroot

The parameters describe the path of the kernel image under the TFTP server root (solaris/platform/i86pc/kernel/unix), the options passed to the kernel, and, after the --- separator, the x86.miniroot boot archive to load. The -B install_media setting tells the kernel where to find the install image on the network. The two files in the solaris directory are taken from the installation medium.
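Populating /tftpboot might look roughly like this. A sketch only: it assumes the install ISO is already loopback-mounted at /jumpstart (as set up for NFS below), and the locations of pxelinux.0 and mboot.c32 inside the SYSLINUX tarball, as well as of the kernel and miniroot on the medium, vary between versions and builds:

# mkdir -p /tftpboot/pxelinux.cfg /tftpboot/solaris/platform/i86pc/kernel
# cp syslinux-3.73/pxelinux.0 /tftpboot/             # location in the SYSLINUX tree varies
# cp syslinux-3.73/com32/mboot/mboot.c32 /tftpboot/
# cp /jumpstart/boot/x86.miniroot /tftpboot/solaris/ # location on the medium varies
# cp /jumpstart/platform/i86pc/kernel/unix /tftpboot/solaris/platform/i86pc/kernel/
# $EDITOR /tftpboot/pxelinux.cfg/default             # paste the configuration shown above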
h3. Preparing the DHCP server

There is not much to this, really; all that is required (besides the obvious IP address and netmask) is the IP address of the TFTP server and the filename of the SYSLINUX bootloader in the TFTP directory structure. The complete config file (for the ISC DHCP server) looks like this:

subnet 10.200.200.0 netmask 255.255.255.0 {
    range 10.200.200.128 10.200.200.200;
    option routers 10.200.200.1;
    option subnet-mask 255.255.255.0;
    next-server 10.200.200.1;
    filename "/pxelinux.0";
}

h3. Preparing the NFS server

There are several ways to present the contents of the install media via NFS, but for this install the method that worked best for me was to simply mount the ISO into a directory and share that via NFS.

# mkdir /jumpstart
# mount -o ro,loop /tmp/nv105.iso /jumpstart

The entry in /etc/exports looks like this:

/jumpstart *(ro,no_subtree_check,sec=sys)

h3. Putting it all together

That should be it, basically. If the machine is turned on and set to boot from the network the following chain of events will take place, provided all goes well:

* The card's PXE BIOS obtains an IP address from the DHCP server, which also points it at the TFTP server and the bootloader file
* pxelinux.0 is fetched via TFTP and executed; it loads its configuration file and mboot.c32
* mboot.c32 loads the Solaris kernel and the x86.miniroot boot archive from the TFTP server
* The kernel mounts the install image from the NFS server (as given in install_media) and starts the installer

Friday, February 6. 2009
Building an OpenSolaris storage - Software, Part 1

After the hardware was assembled I made a quick attempt to boot Nevada 105 from a DVD in order to see how things went (the installation proper will be made via the network). The system booted, but was very slow. It took over 10 minutes to get to the first prompt (which asks about the kind of installation you want to perform, and which is usually reached in a few seconds). Older releases and OpenSolaris 2008.11 behaved the same. A Linux system booted from a USB stick behaved normally, though.

I wrestled with this for two days, but then noticed something while running memtest86+ on the system. Since the system has 4GB of RAM, and a lot of the physical address space between 3 and 4GB is used by PCI devices, quite a lot of physical memory is remapped to physical addresses above the 4GB mark. memtest shows the start and end address of the block it is currently testing (this is why start and end addresses above 4GB can show up there, even if the system has less than 4GB of RAM). While testing the relocated memory block memtest slowed to a crawl, while the memory below 4GB was tested at normal speed. It looks like memory accesses above 4GB are not covered by the processor cache. Linux seems to put its kernel below this magic mark, and thus runs normally, while Solaris lands above it, and is less than usable.

h3. Memory layout

In order to explain what is going on here (and why it is bad) a small detour is in order.

Physical address space is a shared resource on most architectures, the Intel x86 platform (in 32 and 64 bit) included. It is shared between real, physical memory and IO memory. Physical memory is the kind that comes (usually, these days) on DIMM memory sticks that go into the appropriate slots on the main board. IO memory, on the other hand, is a way of talking to extension cards, for example network adapters and graphics cards. These adapters register one or more areas of memory with the BIOS during system startup. Accessing these addresses results in reading from or writing to the extension cards instead of real memory. The benefit is that programs can treat extension cards just like normal memory. The end result is that a given physical memory address can have one of three "backgrounds":

* A physical memory cell
* A device
* Nothing

The memory ranges that devices register usually live between 3GB and 4GB in the physical address space. This was all well and good as long as practically no system had that much real memory, so there was no contention for address space. However, two things happened in the last few years: memory got incredibly cheap, and devices got more hungry for address space. Modern graphics cards, for example, map a large chunk of their on-board graphics RAM into the memory space, sometimes all of it, easily taking up half a gigabyte or more of address space.

There are two ways to handle address conflicts in this situation. The easy way is to simply ignore the physical memory in the address ranges claimed by devices. The physical memory cells become inaccessible, and the storage they provide is lost. This is obviously not a popular solution. The other way is to relocate the physical memory from the contended address ranges into non-contended space. This usually means "above the 4GB border". So although you may only have 4GB of physical RAM in the system, some of it must be accessed at addresses above 4GB.
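Both the address map and the resulting cache setup (each discussed below) can be inspected on a running Linux system; the kernel prints the BIOS map at boot and exposes the cache ranges via procfs:

$ dmesg | grep BIOS-e820    # the memory map handed over by the BIOS (explained below)
$ cat /proc/mtrr            # the CPU's cache policy ranges (more on these below)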
The exact layout of the memory is passed from the BIOS to the system in the so-called E820 memory map. On the MSI IM-GM45 board with 4GB of memory installed it looks like this:

BIOS-e820: 0000000000000000 - 0000000000099000 (usable)
BIOS-e820: 0000000000099000 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000bdc80000 (usable)
BIOS-e820: 00000000bdc80000 - 00000000bdc8e000 (ACPI data)
BIOS-e820: 00000000bdc8e000 - 00000000bdcd0000 (ACPI NVS)
BIOS-e820: 00000000bdcd0000 - 00000000bdce0000 (reserved)
BIOS-e820: 00000000bdcec000 - 00000000bde00000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 000000013c000000 (usable)

The lines marked "(usable)" denote real, usable memory. The last line shows the physical memory that has been relocated above the 4GB memory barrier. Almost one gigabyte of physical memory has been relocated.

h3. Processor caches

These days there are multiple layers of caches between the CPU core and the main memory. Data read from memory is held in the caches, since it might be needed again soon, and data written to memory is held for the same reason. Data written to memory addresses belonging to devices, however, may or may not be eligible for caching. While memory is expected to keep its content stable (unless explicitly written to), devices may change the content of their memory maps as they see fit, and caching the values read or written would mess up the CPU's view of the world. CPUs therefore contain a list of memory ranges and the cache policies associated with those ranges. These lists are called memory type range registers (MTRR) on Intel CPUs. Below is the list from a different system, also with 4GB of physical memory:

reg00: base=0x100000000 (4096MB), size= 512MB: write-back, count=1
reg01: base=0x120000000 (4608MB), size= 256MB: write-back, count=1
reg02: base=0x00000000 (   0MB), size=2048MB: write-back, count=1
reg03: base=0x80000000 (2048MB), size=1024MB: write-back, count=1
reg04: base=0xc0000000 (3072MB), size= 256MB: write-back, count=1
reg05: base=0xcff00000 (3327MB), size=   1MB: uncachable, count=1

Three quarters of the memory range between 3 and 4GB do not appear on this list, and are thus considered uncacheable by the CPU. The memory above 4GB, however, is marked as cacheable (write-back).

Now the same list on the MSI system:

reg00: base=0x00000000 (   0MB), size=4096MB: write-back, count=1

This is wrong in several ways, but the most important one is that the physical memory above 4GB is no longer cached by the CPU. The result is that access to that memory becomes painfully slow. This also explains why the problem goes away when only 2GB of memory are present: no physical memory is relocated beyond the 4GB barrier, because not enough address contention exists, so all physical memory is cached again.

I have filed a support request with MSI and hope someone there understands the problem.

h3. [Update 1]

After a short discussion about supported operating systems, MSI has sent me a newer BIOS file. I'll try that this evening.

h3. [Update 2]

The new BIOS does indeed fix this issue.
The new MTRRs look like this:

reg00: base=0x13c000000 (5056MB), size=  64MB: uncachable, count=1
reg01: base=0x00000000 (   0MB), size=4096MB: write-back, count=1
reg02: base=0x100000000 (4096MB), size=1024MB: write-back, count=1
reg03: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
reg04: base=0xbdd00000 (3037MB), size=   1MB: uncachable, count=1
reg05: base=0xbde00000 (3038MB), size=   2MB: uncachable, count=1
reg06: base=0xbe000000 (3040MB), size=  32MB: uncachable, count=1

There are still some rough edges here, but by and large it does what it should: all memory (minus that for the onboard graphics) is cached now. My thanks to MSI support for the fast (and working) response.
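For a quick sanity check of a list like this one can total up the write-back ranges straight from /proc/mtrr. A small sketch, assuming the output format shown above; uncachable ranges that overlap write-back ones are not subtracted, so the result is an upper bound:

$ awk -F'size=' '/write-back/ { sub(/MB.*/, "", $2); total += $2 }
                 END { print total " MB marked write-back" }' /proc/mtrr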