Discussion:
Help me select hardware and software options for very large server
Terry Kennedy
2009-01-24 02:30:50 UTC
[I posted the following message to freebsd-questions, as I thought it
would be the most appropriate list. As it has received no replies in two
weeks, I'm trying freebsd-current.]

--------

[I decided to ask this question here as it overlaps -hardware, -current,
and a couple other lists. I'd be glad to redirect the conversation to a
list that's a better fit, if anyone would care to suggest one.]

I'm in the process of planning the hardware and software for the second
generation of my RAIDzilla file servers (see http://www.tmk.com/raidzilla
for the current generation, in production for 4+ years).

I expect that what I'm planning is probably "off the scale" in terms of
processing and storage capacity, and I'd like to find out and address any
issues before spending lots of money. Here's what I'm thinking of:

o Chassis - CI Design SR316 (same model as current chassis, except i2c link
between RAID controller and front panel)
o Motherboard - Intel S5000PSLSATAR
o CPU - 2x Intel Xeon E5450 BX80574E5450P
o Remote management - Intel Remote Management Module 2 - AXXRMM2
o Memory - 16GB - 8x Kingston KVR667D2D4F5/2GI
o RAID controller - 3Ware 9650SE-16ML w/ BBU-MODULE-04
o Drives - 16x 2TB drives [not mentioning manufacturer yet]
o Cables - 4x multi-lane SATA cables
o DVD-ROM drive
o Auxiliary slot fan next to BBU card
o Adaptec AHA-39160 (for Quantum Superloader 3 tape drive)

So much for the hardware. On the software front:

o FreeBSD 8.x?
o amd64 architecture
o MBR+UFS2 for operating system partitions (hard partition in controller)
o GPT+ZFS for data partitions
o Multiple 8TB data partitions (separate 8TB controller partitions or one
big partition divided with GPT? A rough sketch of the latter is below)
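
Roughly, the "one big unit divided with GPT" option I have in mind would
look something like this (device and pool names are just placeholders for
whatever the 3Ware ends up exporting, so treat it as a sketch only):

  # one large controller unit, say da1, carved into 8TB ZFS partitions
  gpart create -s gpt da1
  gpart add -t freebsd-zfs -s 8T da1        # becomes da1p1
  gpart add -t freebsd-zfs -s 8T da1        # becomes da1p2
  zpool create data1 da1p1
  zpool create data2 da1p2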

I looked at "Large data storage in FreeBSD", but that seems to be a stale
page from 2005 or so: http://www.freebsd.org/projects/bigdisk/index.html

I'm pretty sure I need ZFS, since even with the 2TB partitions I have now,
taking snapshots for dump or doing a fsck takes approximately forever 8-)
I'll be using the hardware RAID 6 on the 3Ware controller, so I'd only be
using ZFS to get filesystems larger than 2TB.

I've been following the ZFS discussions on -current and -stable, and I
think that while it isn't quite ready yet, it probably will be ready in
a few months, being available around the same time I get this hardware
assembled. I recall reading that there will be an import of newer ZFS
code in the near future.

Similarly, the ports collection seems to be moving along nicely with
amd64 support.

I think this system may have the most storage ever configured on a
FreeBSD system, and it is probably up near the top in terms of CPU and
memory. Once I have it assembled I'd be glad to let any FreeBSD
developers test and stress it if that would help improve FreeBSD on that
type of configuration.

In the meantime, any suggestions regarding the hardware or software
configuration would be welcomed.

Terry Kennedy http://www.tmk.com
***@tmk.com New York, NY USA
Freddie Cash
2009-01-24 16:17:50 UTC
Post by Terry Kennedy
I'm in the process of planning the hardware and software for the second
generation of my RAIDzilla file servers (see http://www.tmk.com/raidzilla
for the current generation, in production for 4+ years).
I expect that what I'm planning is probably "off the scale" in terms of
processing and storage capacity, and I'd like to find out and address any
o Chassis - CI Design SR316 (same model as current chassis, except i2c link
between RAID controller and front panel
o Motherboard - Intel S5000PSLSATAR
o CPU - 2x Intel Xeon E5450 BX80574E5450P
o Remote management - Intel Remote Management Module 2 - AXXRMM2
o Memory - 16GB - 8x Kingston KVR667D2D4F5/2GI
o RAID controller - 3Ware 9650SE-16ML w/ BBU-MODULE-04
o Drives - 16x 2TB drives [not mentioning manufacturer yet]
o Cables - 4x multi-lane SATA cables
o DVD-ROM drive
o Auxiliary slot fan next to BBU card
o Adaptec AHA-39160 (for Quantum Superloader 3 tape drive)
o FreeBSD 8.x?
o amd64 architecture
o MBR+UFS2 for operating system partitions (hard partition in controller)
o GPT+ZFS for data partitions
o Multiple 8TB data partitions (separate 8TB controller partitions or one
big partition divided with GPT?)
We did something similar for our off-site, automated backups box.

Hardware (2 identical boxes):
- Chenbro 5U chassis with 24 hot-swap drive bays
- 1350 W 4-way redundant PSU
- Tyan h2000M motherboard
- 2x Opteron 2200-series CPUs (dual-core) @ 2.8 GHz
- 8 GB DDR2-800 SDRAM
- 3Ware 9650SE-12ML PCIe RAID controller
- 3Ware 9550SX-12ML PCI-X RAID controller
- 12x Seagate Barracuda 7200.11 500 GB SATA drives
- 12x WD 500 GB SATA drives
- Intel Pro/1000MT 4-port gigabit NIC

One box has 2x 2 GB CompactFlash in IDE adapters, the other has 2x 2
GB USB flash drives.

Software:
- 64-bit FreeBSD 7.1-RELEASE (started with 7.1-STABLE from August 08)
- UFS for / partition, using gmirror across the two CF/USB drives
- ZFS for everything else
- uses rsync and ssh to back up 83 remote servers every night,
creating ZFS snapshots after each run (rough sketch below)
- uses rsync to transfer "snapshots" between the two servers
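
The nightly job itself boils down to something like this (the host list,
paths, and pool/filesystem names here are simplified placeholders, not our
actual script):

  #!/bin/sh
  # pull each server's filesystem into its own directory under /storage
  for host in $( cat /usr/local/etc/backup-hosts ); do
      rsync -aH --delete -e ssh root@${host}:/ /storage/${host}/
  done
  # then snapshot the backup filesystem, named by date
  zfs snapshot storage/backups@$( date "+%Y-%m-%d" )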

The drives on each of the RAID controllers are configured as "Single
Disk Array", so they appear as 24 separate drives to the OS, but still
benefit from the controller's disk cache, management interface, and so
on (as compared to JBOD where it acts like nothing more than a SATA
controller).

The drives on one box are configured as 1 large 24-drive raidz2 in ZFS
(this box also has 12x 400 GB drives).

The drives on the other box are configured as 2 separate 11-drive
raidz2 arrays, with 2 hot spares.

The usable space on both boxes is 9 TB.
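
For reference, the pool on the second box was created along these lines
(the pool name and exact device numbering are from memory, so this is a
sketch rather than a transcript):

  zpool create storage \
      raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 \
      raidz2 da11 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 \
      spare da22 da23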

I'm experimenting with different ways to allocate the drives in ZFS,
to get the best performance. The recommendations of the Solaris folks
seem to be to use raidz arrays of fewer than 10 disks. Depending on
the cost, for the next storage box, I may use 3x 3Ware 9550SX RAID
controllers and use 3x raidz2 arrays of 8 disks each, to see how that
compares.

Other than a bit of kernel tuning back in August/September, these
boxes have been running nice and smooth. Just waiting for either the
release of FreeBSD 8.0 or an MFC of ZFS v13 to 7-STABLE to get support
for auto-rebuild using hot spares.

Also waiting for a pair of CF-to-SATA controllers to come in so that I
can remove the USB flash drives from the one box. They're just too
unreliable in my testing. The CF-to-IDE adapters work really well,
but there's only a single IDE controller on the motherboard, so it's
doing gmirror across master and slave devices on the same controller,
which isn't the fastest way of doing things.
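
The gmirror part itself is nothing exotic, roughly (device names are
examples; ours differ):

  # in /boot/loader.conf, so the mirror is assembled at boot
  geom_mirror_load="YES"

  # label the two flash devices as one mirror named gm0,
  # then partition/newfs /dev/mirror/gm0 and point fstab at it
  gmirror label -v -b round-robin gm0 ad0 ad2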
--
Freddie Cash
***@gmail.com
Bob Bishop
2009-01-24 17:18:35 UTC
Hi,
Post by Terry Kennedy
[...]
o Motherboard - Intel S5000PSLSATAR
o CPU - 2x Intel Xeon E5450 BX80574E5450P
We're using plenty of the closely related S5000PAL m/b with various
Xeons and we're very happy with them.
Post by Terry Kennedy
o Remote management - Intel Remote Management Module 2 - AXXRMM2
We tried this and decided it's not worth having (vs the piggybacked
IPMI on that m/b). We found that RMM's power control sometimes doesn't
work (usually when you need it!). Also it's very tricky indeed to use
down an ssh tunnel. OTOH if you need the remote device functionality
there's not really any alternative.

Can't comment to the rest of your list, except that ZFS is definitely
the way to go.

--
Bob Bishop
***@gid.co.uk
Terry Kennedy
2009-01-25 02:21:41 UTC
Post by Freddie Cash
We did something similar for our off-site, automated backups box.
One box has 2x 2 GB CompactFlash in IDE adapters, the other has 2x 2
GB USB flash drives.
I assume this was for FreeBSD itself? I was concerned about write
cycles (and, to a lesser extent, storage capacity) on CF or USB
media. I haven't seen much degradation due to seeks when the FreeBSD
space is a separate logical drive on the AMCC controller. That also
gets me inherent RAID 6 protection (assuming I carve it from the main
batch of drives).
Post by Freddie Cash
- 64-bit FreeBSD 7.1-RELEASE (started with 7.1-STABLE from August 08)
- UFS for / partition, using gmirror across the two CF/USB drives
- ZFS for everything else
This seems to mirror what I'll be doing. It is good to hear that this
has been working well for you.
Post by Freddie Cash
- uses rsync and ssh to back up 83 remote servers every night,
creating ZFS snapshots after each run
- uses rsync to transfer "snapshots" between the two servers
In my case, the data will originate on one of the servers (instead of
being backups of other servers) and will be synchronized with an off-
site server (same hardware) via either 1Gbit or 10Gbit Ethernet (right
now I have regular Gigabit hardware), but all I'd need is new add-in
cards and an upgrade to the switches at each end of the fiber. The
synchronization is currently done nightly via rdiff-backup. But that
package can occasionally lose its marbles, even with < 2TB of data,
so I may have to consider alternatives.
Post by Freddie Cash
The drives on each of the RAID controllers are configured as "Single
Disk Array", so they appear as 24 separate drives to the OS, but still
benefit from the controller's disk cache, management interface, and so
on (as compared to JBOD where it acts like nothing more than a SATA
controller).
Hmmm. I was planning to use the hardware RAID 6 on the AMCC, for a
number of reasons: 1) that gives me front-panel indications of broken
RAID sets, controller-hosted rebuild, and so forth. 2) I'd be using
fewer ZFS features (basically, just large partitions and snapshots)
so if anything went wrong, I'd have a larger pool of expertise to draw
on to fix things (all AMCC users, rather than all FreeBSD ZFS users).

Did you consider this option and reject it? If so, can you tell me
why?
Post by Freddie Cash
The drives on one box are configured as 1 large 24-drive raidz2 in ZFS
(this box also has 12x 400 GB drives).
The drives on the other box are configured as 2 separate 11-drive
raidz2 arrays, with 2 hot spares.
The usable space on both boxes is 9 TB.
So a single ZFS partition of 8TB would be manageable without long
delays for backup snapshots?
Post by Freddie Cash
Other than a bit of kernel tuning back in August/September, these
boxes have been running nice and smooth. Just waiting for either the
release of FreeBSD 8.0 or an MFC of ZFS v13 to 7-STABLE to get support
for auto-rebuild using hot spares.
That's good to hear. What sort of tuning was involved (if it is still
needed)?

Thanks,
Terry Kennedy http://www.tmk.com
***@tmk.com New York, NY USA
Freddie Cash
2009-01-25 04:16:12 UTC
On Sat, Jan 24, 2009 at 6:21 PM, Terry Kennedy
Post by Terry Kennedy
Post by Freddie Cash
We did something similar for our off-site, automated backups box.
One box has 2x 2 GB CompactFlash in IDE adapters, the other has 2x 2
GB USB flash drives.
I assume this was for FreeBSD itself? I was concerned about write
cycles (and, to a lesser extent, storage capacity) on CF or USB
media. I haven't seen much degradation due to seeks when the FreeBSD
space is a separate logical drive on the AMCC controller. That also
gets me inherent RAID 6 protection (assuming I carve it from the main
batch of drives).
Correct. On the original box, just / is on the CF. /usr, /var, /tmp,
/home, and a bunch of sub-directories of /usr are ZFS filesystems. We
also have a /storage directory that we put all the backups in.

On the second box (the offsite replica), / and /usr are on the USB,
while /home, /var, /tmp, /usr/src, /usr/obj, /usr/ports, and /usr/local
are all ZFS filesystems. I put /usr onto the USB as well, as I ran into
an issue with zpool corruption that couldn't be fixed as not enough
apps were available between / and /rescue. 2 GB is still plenty of
space for the OS, and it's not like it will be changing all that much.
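
The ZFS side of that layout is just a handful of filesystems with explicit
mountpoints, something like this (the pool name is only an example):

  zfs create -o mountpoint=/home      storage/home
  zfs create -o mountpoint=/var       storage/var
  zfs create -o mountpoint=/tmp       storage/tmp
  zfs create -o mountpoint=/usr/ports storage/ports
  zfs create -o mountpoint=/usr/local storage/local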
Post by Terry Kennedy
Post by Freddie Cash
The drives on each of the RAID controllers are configured as "Single
Disk Array", so they appear as 24 separate drives to the OS, but still
benefit from the controller's disk cache, management interface, and so
on (as compared to JBOD where it acts like nothing more than a SATA
controller).
Hmmm. I was planning to use the hardware RAID 6 on the AMCC, for a
number of reasons: 1) that gives me front-panel indications of broken
RAID sets, controller-hosted rebuild, and so forth. 2) I'd be using
fewer ZFS features (basically, just large partitions and snapshots)
so if anything went wrong, I'd have a larger pool of expertise to draw
on to fix things (all AMCC users, rather than all FreeBSD ZFS users).
Did you consider this option and reject it? If so, can you tell me
why?
Originally, I was going to use hardware RAID6 as well, creating two
arrays, and just joining them together with ZFS. But then I figured, if
we're going to use ZFS, we may as well use it to the fullest, and use
the built-in raidz features. In theory, the performance should be
equal or better, due to the CoW feature that eliminates the
"write-hole" that plagues RAID5 and RAID6. However, we haven't done
any formal benchmarking to see which is actually better: multiple
hardware RAID arrays added to the pool, or multiple raidz datasets
added to the pool.
Post by Terry Kennedy
Post by Freddie Cash
The drives on one box are configured as 1 large 24-drive raidz2 in ZFS
(this box also has 12x 400 GB drives).
The drives on the other box are configured as 2 separate 11-drive
raidz2 arrays, with 2 hot spares.
The usable space on box boxes is 9 TB.
So a single ZFS partition of 8TB would be manageable without long
delays for backup snapshots?
Creating ZFS snapshots is virtually instantaneous. Destroying ZFS
snapshots takes a long time, depending on the age and size of the
snapshot. But creating and accessing snapshots is nice and quick.

The really nice thing about ZFS snapshots is that if you set the ZFS
property snapdir to "visible", then you can navigate to
/<zfs-filesystem>/.zfs/snapshot/ and have access to all the snapshots.
They'll be listed here, by snapshot name. Just navigate into them as
any other directory, and you have full read-only access to the
snapshot.
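
For example (the filesystem name is just an illustration):

  zfs set snapdir=visible storage/backups
  ls /storage/backups/.zfs/snapshot/
  # each snapshot appears as a read-only directory named after the snapshot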
Post by Terry Kennedy
Post by Freddie Cash
Other than a bit of kernel tuning back in August/September, these
boxes have been running nice and smooth. Just waiting for either the
release of FreeBSD 8.0 or an MFC of ZFS v13 to 7-STABLE to get support
for auto-rebuild using hot spares.
That's good to hear. What sort of tuning was involved (if it is still
needed)?
Here are the loader.conf settings that we are currently using:
# Kernel tunables to set at boot (mostly for ZFS tuning)
# Disable DMA for the CF disks
# Set kmem to 1.5 GB (the current max on amd64)
# Set the ZFS Adaptive Replacement Cache (ARC) to about half of kmem
# (leaving the other half for the OS)
hw.ata.ata_dma=0
kern.hz="100"
vfs.zfs.arc_min="512M"
vfs.zfs.arc_max="512M"
vfs.zfs.prefetch_disable="1"
vfs.zfs.zil_disable="0"
vm.kmem_size="1596M"
vm.kmem_size_max="1596M"

Finding the correct arc_min/arc_max and kmem_size_max settings is a
bit of a black art, and will depend on the workload for the server.
There's a max of 2 GB for kmem_size on FreeBSD 7.x, but the usable max
appears to be around 1596 MB, and will change depending on the server.
The second box has a max of 1500 MB, for example (won't boot with
anything higher).

Some people run with an arc_max of 64 MB, we ran with it set to 2 GB
for a bit (8 GB of RAM in the box). Basically, we just tune it down a
little bit every time we hit a "kmem_map too small" kernel panic.
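
While tuning, it helps to watch the actual numbers; these sysctls exist on
our 7.x boxes (names may differ slightly between versions):

  sysctl vm.kmem_size vm.kmem_size_max
  sysctl vfs.zfs.arc_min vfs.zfs.arc_max
  sysctl kstat.zfs.misc.arcstats.size     # current ARC usage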

FreeBSD 8.0 won't have these limitations (kmem_max is 512 GB), and ZFS
v13 will auto-tune itself as much as possible.
--
Freddie Cash
***@gmail.com
Terry Kennedy
2009-01-25 02:38:28 UTC
Post by Bob Bishop
We're using plenty of the closely related S5000PAL m/b with various
Xeons and we're very happy with them.
That's good to know.
Post by Bob Bishop
Post by Terry Kennedy
o Remote management - Intel Remote Management Module 2 - AXXRMM2
We tried this and decided it's not worth having (vs the piggybacked
IPMI on that m/b). We found that RMM's power control sometimes doesn't
work (usually when you need it!). Also it's very tricky indeed to use
down an ssh tunnel. OTOH if you need the remote device functionality
there's not really any alternative.
I also have out-of-band power control, so that shouldn't be a problem.
I assume that anything that borked the system so badly the AXXRMM2
couldn't power it off isn't going to be made worse by pulling the plug.

I'll be building one of these systems as a test box before I go and
order the parts for the other 2 (primary, backup, hot spare) so if the
management card is a waste of money, I'll omit it from the other two
builds. At under $170 it's a very small part of the cost of the first
build. [I have 4 Tyan M8239's which were completely non-functional on
the Tyan S2721-533 motherboards in the current systems, so I definitely
learned my lesson there.]

Terry Kennedy http://www.tmk.com
***@tmk.com New York, NY USA
Aristedes Maniatis
2009-01-25 03:57:49 UTC
Post by Terry Kennedy
I'm pretty sure I need ZFS, since even with the 2TB partitions I have now,
taking snapshots for dump or doing a fsck takes approximately forever 8-)
I'll be using the hardware RAID 6 on the 3Ware controller, so I'd only be
using ZFS to get filesystems larger than 2TB.
If you do RAID on the hardware card you get a few downsides:

* if your 3ware card dies you'll need a spare of the same model and
same BIOS to replace it with. Otherwise you risk your disks not being
properly detected.

* you don't get the advantage of ZFS check-summing and auto-repairing
the data. There are some very nice features in ZFS for doing this on
the fly and periodically (see the sketch after this list). You'll want to
compare that against what the hardware card gives you.

* personally I've found the 3ware utilities much more cumbersome to
use than ZFS tools

* you can't use RAIDz which has some nice features
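
The periodic checking itself is a one-liner; a scrub walks every block and
repairs anything repairable from redundancy (the pool name is an example):

  zpool scrub tank        # kick off a full verify/repair pass
  zpool status -v tank    # shows scrub progress and any checksum errors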

You'll want to do some tests and see how it performs in different
configurations using the type of data and access you plan on
implementing. Another person's benchmarks may not necessarily apply to
your specific setup, so try it and see. The 3ware cards are still very
good to use even if you don't use them in RAID mode.
Post by Terry Kennedy
I've been following the ZFS discussions on -current and -stable, and I
think that while it isn't quite ready yet, it probably will be ready in
a few months, being available around the same time I get this hardware
assembled. I recall reading that there will be an import of newer ZFS
code in the near future.
That new code is already in Current. Whether it will be ported to 7 is
not yet known. On the other hand a great many people are running
7.0/7.1 ZFS in production very successfully with proper tuning.
Post by Terry Kennedy
I think this system may have the most storage ever configured on a
FreeBSD system, and it is probably up near the top in terms of CPU and
memory.
I doubt it. 8 core systems are very common these days. I've seen
benchmarking with anything up to 64 cores (which I believe is the
current FreeBSD limit). As for memory, there was a recent thread with
someone installing 64 GB also claiming theirs was the biggest, but
others insisting that it certainly wasn't. As for storage, I know Sun
sell systems into the petabyte range which I assume these days are
powered by ZFS.

The above is not to belittle your project; more to reassure you that
you aren't anywhere near the limits and you should have no troubles if
you tune things properly.

Ari Maniatis



-------------------------->
ish
http://www.ish.com.au
Level 1, 30 Wilson Street Newtown 2042 Australia
phone +61 2 9550 5001 fax +61 2 9550 4001
GPG fingerprint CBFB 84B4 738D 4E87 5E5C 5EFA EF6A 7D2E 3E49 102A
Terry Kennedy
2009-01-27 03:27:15 UTC
Post by Aristedes Maniatis
Post by Terry Kennedy
I think this system may have the most storage ever configured on a
FreeBSD system, and it is probably up near the top in terms of CPU and
memory.
I doubt it. 8 core systems are very common these days. I've seen
benchmarking with anything up to 64 cores (which I believe is the
current FreeBSD limit). As for memory, there was a recent thread with
someone installing 64Gb also claiming theirs was the biggest, but
others insisting that it certainly wasn't. As for storage, I know Sun
sell systems into the petabyte range which I assume these days are
powered by ZFS.
The above is not to belittle your project; more to reassure you that
you aren't anywhere near the limits and you should have no troubles if
you tune things properly.
Regarding storage, while there are other ZFS deployments (particularly
on Sun equipment) which are a lot larger, I haven't seen any discussion
of pools of this size on FreeBSD. And I have been keeping an eye on the
FreeBSD ZFS discussions. That isn't to say that nobody has one, but if
they do, they're keeping pretty quiet about it...

Part of the reason for rolling my own is the fun I get from it -
otherwise I could just buy a pre-configured Sun / NetApp / whatever box.

It is good to know that my CPU / memory choices aren't pushing any
limits in FreeBSD. I still think they're probably in the 95th percentile,
size-wise.

I've always been a bit careful about sizing at the top end - back when
I was using BSD/OS, I'd run into problems with configurations with more
than a certain amount of memory failing in subtle (and not-so-subtle)
ways.

I'm sure that people have experimented enough to find the upper limits,
at least for quick testing. This is particularly common among hardware
vendors and integrators, since they have all those parts sitting around
waiting - years ago, I did such a test on a DEC system that shipped with
16MB memory standard and 64MB max. I loaded it up with 768MB to see what
would happen (it worked). I'm also the guy that once set up a demo Cisco
2511 with full BGP tables (now *that* was a hack 8-).

Terry Kennedy http://www.tmk.com
***@tmk.com New York, NY USA
Oliver Fromme
2009-01-27 17:18:39 UTC
[Aristedes Maniatis wrote:]
[Terry Kennedy wrote:]
Post by Terry Kennedy
I think this system may have the most storage ever configured on a
FreeBSD system, and it is probably up near the top in terms of CPU and
memory.
I doubt it. 8 core systems are very common these days. I've seen
benchmarking with anything up to 64 cores (which I believe is the
current FreeBSD limit). As for memory, there was a recent thread with
someone installing 64Gb also claiming theirs was the biggest, but
others insisting that it certainly wasn't. As for storage, I know Sun
sell systems into the petabyte range which I assume these days are
powered by ZFS.
The above is not to belittle your project; more to reassure you that
you aren't anywhere near the limits and you should have no troubles if
you tune things properly.
Regarding storage, while there are other ZFS deployments (particularly
on Sun equipment) which are a lot larger, I haven't seen any discussion
of pools of this size on FreeBSD. And I have been keeping an eye on the
FreeBSD ZFS discussions. That isn't to say that nobody has one, but if
they do, they're keeping pretty quiet about it...
Well, I can tell you that your storage setup is _not_ the
largest ever configured on a FreeBSD system. Sometimes
there are reasons why you cannot disclose details about
your (or your customers') setups.

Best regards
Oliver
--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

"C++ is the only current language making COBOL look good."
-- Bertrand Meyer
Paul Tice
2009-01-27 18:41:20 UTC
Excuse my rambling, perhaps something in this mess will be useful.

I'm currently using 8 cores (2x Xeon E5405), 16G FB-DIMM, and 8 x 750GB drives on a backup system (I plan to add the other drives in the chassis one by one, testing the speed along the way).
8-current AMD64, ZFS, Marvell 88sx6081 PCI-X card (8 port SATA) + LSI1068E (8 port SAS/SATA) for the main array, and the Intel onboard SATA for the boot drive(s).
Data is sucked down through 3 gigabit ports, with another available but not yet activated.
Array drives all live on the LSI right now. Drives are <ATA ST3750640AS K>.

ZFS is stable _IF_ you disable the prefetch and ZIL, otherwise the classic ZFS wedge rears its ugly head. I haven't had a chance to test just one yet, but I'd guess it's the prefetch that's the quick killer. Even with prefetching and ZIL disabled, my current bottleneck is the GigE. I'm waiting to get new switches in that support jumbo frames; quick and dirty testing shows almost a 2x increase in throughput and a ~40% drop in interrupt rates from the NICs compared to the current standard (1500 MTU) frames.

Pool was created with 'zpool create backup raidz da0 da1 da2 da3 da4 da5 da6 da7'

I've seen references to 8-current having a kernel memory limit of 8G (compared to 2G for pre-8, from what I understand so far) and ZFS ARC (caching) is done in kernel memory space. (Please feel free to correct me if I'm wrong on any of this!)
With default ZFS (no disables), a 1536M kern mem limit, and a 512M ARC limit, I saw 2085 ARC memory throttles before the box wedged.

Using rsync over several machines with this setup, I'm getting a little over 1GB/min to the disks.
'zpool iostat 60' is a wonderful tool.
I would mention something I've noticed that doesn't seem to be documented:
The first reading from 'zpool iostat' (whether single run or with an interval) is a running average, although I haven't found the time period averaged yet. (from pool mount time maybe?)

The jumbo frame interrupt reduction may be important. I run 'netstat -i -w60' right beside 'zpool iostat 60', and the throughput is closely inversely related. I can predict a disk write (bursty writes in ZFS, it seems) by the throughput dropping on the NIC side. The drop is up to 75%, averaging around 50%. Using a 5-second interval instead of 60, I see disk output throughput spikes up to 90MB/s, although 55, 0, 0, 0, 55 is more common.
Possibly, binding interrupts to particular CPUs might help a bit too. I haven't found, and don't feel competent to write, userspace tools to do this.

CPU usage during all this is surprisingly low. rsync is running with -z, the files themselves are compressed as they go onto the drives with pbzip2, and the whole thing runs on (ducking) BackupPC, which is all perl script.
With all that, 16 machines backing up, and 1+ GB/min going to the platters, CPU is still avg 40% idle according to top. I'm considering remaking the array as raidz2; I seem to have enough CPU to handle it.

Random ZFS thoughts:
You cannot shrink/grow a raidz or raidz2. You can grow a stripe array; I don't know if you can shrink it successfully.
You cannot promote a stripe array to raidz/z2, nor demote in the other direction.
You can have hot spares, haven't seen a provision for warm/cold spares.
/etc/default/rc.conf already has cron ZFS status/scrub checks, but not enabled.

Anyway, enough rambling, just thought I'd use something not too incredibly far from your suggested system to toss some data out.

Thanks
Paul





Freddie Cash
2009-01-27 19:42:41 UTC
Post by Paul Tice
Excuse my rambling, perhaps something in this mess will be useful.
I'm currently using 8 cores (2x Xeon E5405), 16G FB-DIMM, and 8 x 750GB
drives on a backup system (I plan to add the other in the chassis one by
one, testing the speed along the way) 8-current AMD64, ZFS, Marvell
88sx6081 PCI-X card (8 port SATA) + LSI1068E (8 port SAS/SATA) for the
main Array, and the Intel onboard SATA for boot drive(s). Data is sucked
down through 3 gigabit ports, with another available but not yet
activated. Array drives all live on the LSI right now. Drives are <ATA
ST3750640AS K>.
ZFS is stable _IF_ you disable the prefetch and ZIL, otherwise the
classic ZFS wedge rears it's ugly head. I haven't had a chance to test
just one yet, but I'd guess it's the prefetch that's the quick killer.
You probably don't want to disable the ZIL. That's the journal, and an
important part of the data integrity setup for ZFS.

Prefetch has been shown to cause issues on a lot of systems, and can be a
bottleneck depending on the workload. But the ZIL should be enabled.
Post by Paul Tice
I've seen references to 8-Current having a kernel memory limit of 8G
(compared to 2G for pre 8 from what I understand so far) and ZFS ARC
FreeBSD 8.x kmem_max has been bumped to 512 GB.
Post by Paul Tice
Using rsync over several machines with this setup, I'm getting a little
over 1GB/min to the disks. 'zpool iostat 60' is a wonderful tool.
gstat is even nicer, as it shows you the throughput to the individual
drives, instead of the aggregate that zpool shows. This works at the GEOM
level. Quite nice to see how the I/O is balanced (or not) across the drives
in the raidz datasets, and the pool as a whole.
Post by Paul Tice
CPU usage during all this is suprisingly low. rsync is running with -z,
If you are doing rsync over SSH, don't use -z as part of the rsync command.
Instead, use -C with ssh. That way, rsync is done in one process, and the
compression is done by ssh in another process, and it will use two
CPUs/cores instead of just one. You'll get better throughput that way, as
the rsync process doesn't have to do the compression and reading/writing in
the same process. We got about a 25% boost in throughput by moving the
compression out of rsync, and CPU usage is balanced across CPUs instead of
just hogging one.
Post by Paul Tice
You cannot shrink/grow a raidz or raidz2.
You can't add devices to a raidz/raidz2 dataset. But you can replace the
drives with larger ones, do a resilver, and the extra space will become
available. Just pull the small drive, insert the large drive, and do a "zpool
replace <poolname> <device> <device>".

And you can add extra raidz/raidz2 datasets to a pool, and ZFS will stripe
the data across the raidz datasets. Basically, the pool becomes a RAID 5+0
or RAID 6+0, instead of just a RAID 5/RAID 6.

If you have lots of drives, the recommendation from the Solaris folks is to
use a bunch of raidz datasets comprised of <=9 disks each, instead of one
giant raidz dataset across all the drives. ie:

zpool create pool raidz2 da0 da1 da2 da3 da4 da5
zpool add pool raidz2 da6 da7 da8 da9 da10 da11
zpool add pool raidz2 da12 da13 da14 da15 da16 da17

Will give you a single pool comprised of three raidz2 datasets, with data
being striped across the three datasets.

And you can add raidz datasets to the pool as needed.
Post by Paul Tice
You can grow a stripe array,
I'm don't know if you can shrink it successfully. You cannot promote a
stripe array to raidz/z2, nor demote in the other direction. You can have
hot spares, haven't seen a provision for warm/cold spares.
ZFS in FreeBSD 7.x doesn't support hot spares, in that a faulted drive won't
start a rebuild using a spare drive. You have to manually "zpool replace" the
drive using the spare.

ZFS in FreeBSD 8.x does support auto-rebuild using spare drives (hot spare).
Post by Paul Tice
/etc/default/rc.conf already has cron ZFS status/scrub checks, but not enabled.
periodic(8) does ZFS checks as part of the daily run. See
/etc/defaults/periodic.conf.

However, you can whip up a very simple shell script that does the same, and
run it via cron at whatever interval you want. We use the following, which
runs every 15 mins:

#!/bin/sh

status=$( zpool status -x )

if [ "${status}" != "all pools are healthy" ]; then
    echo "Problems with ZFS: ${status}" | mail -s "ZFS Issues on <server>" \
        <mail>
fi

exit 0
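
The matching /etc/crontab entry is along these lines (the script path is an
example):

  */15  *  *  *  *  root  /usr/local/sbin/zfs-check.sh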
--
Freddie
***@gmail.com
Paul Tice
2009-01-28 04:54:31 UTC
I just bumped up the kmem, arc.max, enabled zil and reenabled mdcomp. Prefetch is disabled.
Less than 1 minute into a backup run of only 4 machines, I've got a fresh ZFS wedgie. Ouch.
As I understand it, the ZIL is not as much of an integrity boost as a speed boost, especially since we already have checksum-per-block. I did see spikes of up to 140MB/s on the ZFS pool in the 2 minutes before ZFS wedged.
I think of ZIL as the equivalent of journaling for more traditional filesystems. I can certainly see where it would help with integrity for certain cases, but a good UPS seems to offer many of the same benefits. ;>

(As always, I'm ready to be better informed)

I'm intrigued by the ssh/rsync double process throughput bump, but it does require ssh as well as rsync.
Alas, many of the boxes being backed up belong to the dark side, and many of them are only managed by us.
For some reason, many 'dark side' owners trust rsync more than ssh, to the point of disallowing ssh installation. Realistically, we could do a lot more with SSH/rsync, since we can start up a Volume Shadow copy, back it up, then remove the Shadow copy.

Good to know there is a pretty high limit to kmem now.

I'm not 100% sure about the gstat: I did no slicing/labeling on the disks, they are purely /dev/adX devices used by ZFS, so would the GEOM level even see this? zpool iostat -v will give the overall plus per-device stats; I'm curious to see what difference there would be between gstat and zpool iostat -v, if any.

I suspected you could upsize raidz pools one disk at a time; I do wonder how the inner/outer track speed differences would affect throughput over a one-by-one whole-array disk replacement. And yes, I might wonder too much about corner cases. ;>

I assume that under 7.x, you could have your cron script take care of automating the zpool replace.
Built-in hot sparing is much nicer, but warm spares would be best (IMHO); a powered-down drive isn't spinning towards failure. Of course, this is probably too much to ask of any multi-platform FS, since the method of spinning a drive up (and the access required to do so) varies widely. Sounds like another cron script possibility. :)

If I can get the time on the box, I may try graphing 3 disks -> 17 disks, with throughput on a 2nd axis and CPU usage on a 3rd axis. Drive-wise, 8+1 is easily understandable as a theoretical best of 8 bits + 1 parity number of drives. Assuming you transfer multiple 8-bit bytes per cycle, and your controller/system/busses can keep up, 16+1 would in theory increase throughput, both from more bits read at a time, and from 1/2 the parity calculations per double-byte written to disk. Of course, this assumes the 'splitting' and parity generation code can optimize for multiple 8-bit byte transfers.

Anyway, yet another overly long ramble must come to a close.

Thanks
Paul




Freddie Cash
2009-01-28 06:17:15 UTC
Post by Paul Tice
I just bumped up the kmem, arc.max, enabled zil and reenabled mdcomp. Prefetch is disabled.
Less than 1 minute into a backup run of only 4 machines, I've got a fresh ZFS wedgie. Ouch.
Others more knowledgeable than I will have to jump in here, but
getting the kmem_max and arc_max settings set right for each system is
a bit of an art unto itself. It took almost 6 weeks to get the
settings we are currently using, which keeps the system stable while
running backups for 83 servers via rsync+ssh. Others find smaller
arc_max settings work better, others find smaller kmem_max work
better. Each system is different.
Post by Paul Tice
I'm intrigued by the ssh/rsync double process throughput bump, but it does
require ssh as well as rsync.
Yes, if you aren't using ssh as the transport, then adding -z to rsync
can be beneficial. :)
Post by Paul Tice
I'm not 100% sure about the gstat, I did no slicing/labeling on the disks,
they are purely /dev/adX used by ZFS, would GEOM level even see this?
Yes. All disk access goes through GEOM. You can't hit the device
nodes without going through GEOM. It's quite interesting watching the
output of gstat.
Post by Paul Tice
I suspected you could upsize raidz pools one disk at a time, I do wonder how
the inner/outer track speed differences would affect throughput over a
one-by-one disk full array replacement. And yes, I might wonder too much
about corner cases. ;>
We haven't got that far yet. :) We've only filled 6 TB of the 9, so
won't be needing to move to larger drives for several more months.
--
Freddie Cash
***@gmail.com
Oliver Fromme
2009-01-29 14:48:48 UTC
Post by Paul Tice
I just bumped up the kmem, arc.max, enabled zil and reenabled
mdcomp. Prefetch is disabled. Less than 1 minute into a backup
run of only 4 machines, I've got a fresh ZFS wedgie.
How exactly did you bump up kmem? As far as I know, it
is not necessary anymore on 8-current/amd64, because the
defaults are already much larger. In fact it might be
possible that your tuning made things worse.
Post by Paul Tice
As I understand it, the ZIL is not as much of an integrity boost as
a speed boost, especially since we already have checksum-per-block.
The checksum feature works purely on a block level, while
the intent log (ZIL) records certain changes to meta data
on the file system, similar to a journal, which can be
used for recovery after a crash (power outage, hardware
failure, kernel panic, human error). Both features are
completely orthogonal, one cannot replace the other.

Therefore I recommend keeping the ZIL enabled.

Best regards
Oliver
--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

"In My Egoistical Opinion, most people's C programs should be indented
six feet downward and covered with dirt."
-- Blair P. Houghton