ZFS committed to the FreeBSD base.

Post by Pawel Jakub Dawidek
Hi.
I'm happy to inform that the ZFS file system is now part of the FreeBSD
operating system. ZFS is available in the HEAD branch and will be
available in FreeBSD 7.0-RELEASE as an experimental feature.
Please welcome ZFS - The last word in file systems.
ZFS file system was ported from OpenSolaris operating system. The code
in under CDDL license.
I'd like to thank all SUN developers that created this great piece of
software.
Supported by: Wheel LTD (http://www.wheel.pl/)
Supported by: The FreeBSD Foundation (http://www.freebsdfoundation.org/)
Supported by: Sentex (http://www.sentex.net/)
Limitations.
Currently ZFS is only compiled as kernel module and is only available
for i386 architecture. Amd64 should be available very soon, the other
archs will come later, as we implement needed atomic operations.
Missing functionality.
- We don't have iSCSI target daemon in the tree, so sharing ZVOLs via
iSCSI is also not supported at this point. This should be fixed in
the future, we may also add support for sharing ZVOLs over ggate.
- There is no support for ACLs and extended attributes.
- There is no support for booting off of ZFS file system.
Other than that, ZFS should be fully-functional.
Enjoy!

Give yourself a pat on the back :)

Kris

Juha Saarinen

2007-04-06 03:42:51 UTC

Post by Kris Kennaway

Post by Pawel Jakub Dawidek
Please welcome ZFS - The last word in file systems.

Give yourself a pat on the back :)

Seconded.

--
Juha
http://www.geekzone.co.nz/juha

Sean Bryant

2007-04-06 03:21:08 UTC

Is it fully 128bit? From wikipedia, which is by no means an
authoritative source but I have no idea if this was ever an issue.

Pawel Jakub Dawidek

2007-04-06 10:40:04 UTC

Is it fully 128bit? From wikipedia, which is by no means an authoritative source but I have no idea if this was ever an issue.

It's 64-bit even in Solaris. The "128-bitness" is only in the storage format, not for file system ops visible to applications.
(AFAIK).

That's correct. We are limited by POSIX, but the on-disk format is
128bit.

--
Pawel Jakub Dawidek http://www.wheel.pl
***@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Sean Bryant

2007-04-06 16:14:07 UTC

Is it fully 128bit? From wikipedia, which is by no means an authoritative source but I have no idea if this was ever an issue.

It's 64-bit even in Solaris. The "128-bitness" is only in the storage format, not for file system ops visible to applications.
(AFAIK).

That's correct. We are limited by POSIX, but the on-disk format is
128bit.

Thanks for the update,
I'll probably update that Wikipedia entry to reflect recent changes and
more correctly state the limitations.

Richard Elling

2007-04-06 05:10:33 UTC

Well done, team! Everyone who cares about their data will be happy :-)
-- richard

Ricardo Correia

2007-04-06 04:54:37 UTC

Hi Pawel,

Post by Pawel Jakub Dawidek
Other than that, ZFS should be fully-functional.

Congratulations, nice work! :)

I'm interested in the cross-platform portability of ZFS pools, so I have
one question: did you implement the Solaris ZFS whole-disk support
(specifically, the creation and recognition of the EFI/GPT label)?

Unfortunately some tools in Linux (parted and cfdisk) have trouble
recognizing the EFI partition created by ZFS/Solaris..

Eric Anderson

2007-04-06 05:22:14 UTC

Pawel - you're a madman! :)

I'm afraid of what your next project will be.

Thanks for the solid work (again..),
Eric

Rich Teer

2007-04-06 04:58:47 UTC

Post by Pawel Jakub Dawidek
I'm happy to inform that the ZFS file system is now part of the FreeBSD
operating system. ZFS is available in the HEAD branch and will be
available in FreeBSD 7.0-RELEASE as an experimental feature.

This is fantastic news! At the risk of raking over ye olde arguments,
as the old saying goes: "Dual licensing? We don't need no stinkeen
dual licensing!". :-)

--
Rich Teer, SCSA, SCNA, SCSECA, OGB member

CEO,
My Online Home Inventory

Voice: +1 (250) 979-1638
URLs: http://www.rite-group.com/rich
http://www.myonlinehomeinventory.com

Alex Dupre

2007-04-06 07:26:34 UTC

Post by Pawel Jakub Dawidek
I'm happy to inform that the ZFS file system is now part of the FreeBSD
operating system.

Congratulations! You're great!

Post by Pawel Jakub Dawidek
- There is no support for booting off of ZFS file system.

Even booting kernel from a removable ufs media and then mounting a zfs
root via vfs.root.mountfrom?

--
Alex Dupre

Robert Watson

2007-04-06 09:28:34 UTC

Post by Alex Dupre

Post by Pawel Jakub Dawidek
I'm happy to inform that the ZFS file system is now part of the FreeBSD
operating system.

Congratulations! You're great!

Post by Pawel Jakub Dawidek
- There is no support for booting off of ZFS file system.

Even booting kernel from a removable ufs media and then mounting a zfs root
via vfs.root.mountfrom?

I believe the key issue here is that the boot loader doesn't yet support ZFS.
In 6.x and 7.x, the mechanism for mounting the root file system is identical
to all other file systems, so it should be possible to use any file system as
the root file system as long as you get can get the kernel up and running.
And, in the case of ZFS, the ZFS module loaded (since it currently must be a
module).

This is really exciting work and I'm very glad to see this in the tree!

Robert N M Watson
Computer Laboratory
University of Cambridge

Pawel Jakub Dawidek

2007-04-06 21:48:04 UTC

Post by Alex Dupre

Post by Pawel Jakub Dawidek
I'm happy to inform that the ZFS file system is now part of the FreeBSD
operating system.

Congratulations! You're great!

Post by Pawel Jakub Dawidek
- There is no support for booting off of ZFS file system.

Even booting kernel from a removable ufs media and then mounting a zfs
root via vfs.root.mountfrom?

I just verified that this will be possible:

# mount
tank on / (zfs, local)
devfs on /dev (devfs, local)

but I need some time to implement it right.

--
Pawel Jakub Dawidek http://www.wheel.pl
***@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Mike Wolman

2007-04-06 23:48:51 UTC

Hi,

Currently I am using gmirror and ggated to run a live network mirror.
Obviously this can cause problems if the server exporting the 'backup'
device is offline then the mirror is broken - when the machines reconnect
a full mirror sync takes place. This is fine over gbit crossover and if
the size of the mirror is only a few 100Gb.

Is it feasible that when the connection to one of the mirror devices
breaks gmirror starts to log the changes to the mirror (obviously you
would need to configure up this mirror device as a 'lazy' mirror member
with a spare local device to write the changes to) - when the machines
reconnect gmirror would only then have to sync the actual changes.

This is sort of achieves a similar result to Live Network Backup on NetBSD
(http://kerneltrap.org/node/5058).

It could be used for laptop users mirroring their whole drive, allowing a
fast sync when they are on their local lan and should the laptop get lost
it would be possible to restore the whole machine with a simple dd. If
they were using a usb key as the device to log the changes while they were
disconnected from the network and they remember to unplug/plug this each
time they use the laptop then it could even be possible to recover the
data to the point they actually lost the machine.

It could also be used for asynchronous mirrors over slow links, if the log
device was always written to first then the write latency for long distant
links could be removed. Im not sure if it would be possible to achieve
this using just a modified ggatec instead which has a local device used
as a write cache.

Mike.

Mike Wolman

2007-04-09 23:09:27 UTC

Post by Mike Wolman
Hi,
Currently I am using gmirror and ggated to run a live network mirror.
Obviously this can cause problems if the server exporting the 'backup'
device is offline then the mirror is broken - when the machines reconnect a
full mirror sync takes place. This is fine over gbit crossover and if the
size of the mirror is only a few 100Gb.

Personally, I think this would be a very useful feature. This is sort of
like snapshots for GEOM, which would be useful in other ways too. GEOM
journaling, GEOM snapshots, and this lazy mirroring are all somewhat similar
it seems.
Are you proposing to write such a feature? Even if you can't code it,
writing up the design/specs might help someone actually implement it.
Eric

My initial thoughts were for fast recovery from a non failed mirror device
but realised it could be used for many other things from mobile users to
remote asyncs. I had not thought of using it for snapshots but i can see
where your coming from - by simply keeping the changes you could rewind
to any point in time without having to actually create a snapshot(s)
beforehand.

I only wish my os/file system/geom knowledge was up to even specifying
this, let alone my complete lack of c programming ability.

However I can try to put together a specification from a system admins
perspective if that is of any help?

And of course i would be more than willing to test test test.

Mike.

Alex Dupre

2007-04-07 07:39:44 UTC

Post by Pawel Jakub Dawidek
# mount
tank on / (zfs, local)
devfs on /dev (devfs, local)
but I need some time to implement it right.

I waited months for current ZFS implementation, I can wait more for root
support, now that I know it'll be possible :-) Thanks again.

--
Alex Dupre

Pawel Jakub Dawidek

2007-04-06 12:34:47 UTC

Post by Ricardo Correia
I'm interested in the cross-platform portability of ZFS pools, so I have
one question: did you implement the Solaris ZFS whole-disk support
(specifically, the creation and recognition of the EFI/GPT label)?
Unfortunately some tools in Linux (parted and cfdisk) have trouble
recognizing the EFI partition created by ZFS/Solaris..

I'm not yet setup to move disks between FreeBSD and Solaris, but my
first goal was to integrate it with FreeBSD's GEOM framework.
We support cache flushing operations on any GEOM provider (disk,
partition, slice, anything disk-like), so bascially currently I treat
everything as a whole disk (because I simply can), but don't do any
EFI/GPT labeling. I'll try to move data from Solaris' disk to FreeBSD
and see what happen.

First try:

GEOM: ad6: corrupt or invalid GPT detected.
GEOM: ad6: GPT rejected -- may not be recoverable.

:)

--
Pawel Jakub Dawidek http://www.wheel.pl
***@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Romain LE DISEZ

2007-04-06 12:44:54 UTC

Hello,

first of all, thank you for your great work.

When i tried to read my ZFS volume created under Solaris, I had the same
error. I simply used gpte (available in ports) to erase the partition
table and create a new one with the correct value. I hadn't lost any data
and now this error message has disappear. I can continue to read these
volume under Solaris and Linux, of course.

I can't try to read my ZFS volumes under FreeBSD because I get an error
when loading the module. "kldload zfs" return me an error about missing
files. (I will get back with the exact error later)

One time again : great job !

--
Romain LE DISEZ
06.78.77.99.18
http://www.ledisez.net/

Post by Ricardo Correia
I'm interested in the cross-platform portability of ZFS pools, so I

have

Post by Ricardo Correia
one question: did you implement the Solaris ZFS whole-disk support
(specifically, the creation and recognition of the EFI/GPT label)?
Unfortunately some tools in Linux (parted and cfdisk) have trouble
recognizing the EFI partition created by ZFS/Solaris..

GEOM: ad6: corrupt or invalid GPT detected.
GEOM: ad6: GPT rejected -- may not be recoverable.
:)
--
Pawel Jakub Dawidek http://www.wheel.pl
FreeBSD committer Am I Evil? Yes, I Am!

Romain LE DISEZ

2007-04-10 19:54:22 UTC

Hello,

I'm experiencing some problems due to this operation. When the partitions
table is fixed, the system run into the bug described here :
http://www.freebsd.org/cgi/query-pr.cgi?pr=72896&cat=

The only way to do a fresh install (or run sysinstall) after the fix is to
unplugged the fixed disks.

Post by Romain LE DISEZ
Hello,
first of all, thank you for your great work.
When i tried to read my ZFS volume created under Solaris, I had the same
error. I simply used gpte (available in ports) to erase the partition
table and create a new one with the correct value. I hadn't lost any data
and now this error message has disappear. I can continue to read these
volume under Solaris and Linux, of course.
I can't try to read my ZFS volumes under FreeBSD because I get an error
when loading the module. "kldload zfs" return me an error about missing
files. (I will get back with the exact error later)
One time again : great job !
--
Romain LE DISEZ
06.78.77.99.18
http://www.ledisez.net/

Post by Ricardo Correia
I'm interested in the cross-platform portability of ZFS pools, so I

have

GEOM: ad6: corrupt or invalid GPT detected.
GEOM: ad6: GPT rejected -- may not be recoverable.
:)
--
Pawel Jakub Dawidek http://www.wheel.pl
FreeBSD committer Am I Evil? Yes, I Am!

_______________________________________________
http://lists.freebsd.org/mailman/listinfo/freebsd-fs

Bruce M. Simpson

2007-04-06 21:09:56 UTC

This is most excellent work which is going to help everyone in a very
big way. Many thanks for working on this.

Ceri Davies

2007-04-06 21:07:06 UTC

This is fantastic news! At the risk of raking over ye olde arguments,
as the old saying goes: "Dual licensing? We don't need no stinkeen
dual licensing!". :-)

Actually, you might want to run that statement by a certain John Birrell
(***@FreeBSD.org) regarding the DTrace port and see what answer you get.

Ceri

--
That must be wonderful! I don't understand it at all.
-- Moliere

Gabor Kovesdan

2007-04-06 21:52:07 UTC

Post by Ceri Davies

This is fantastic news! At the risk of raking over ye olde arguments,
as the old saying goes: "Dual licensing? We don't need no stinkeen
dual licensing!". :-)

Actually, you might want to run that statement by a certain John Birrell

jhb@ is John Baldwin, John Birrel is jb@! :)

Regards,
Gabor

Bruno Damour

2007-04-06 22:39:14 UTC

Thanks, fantasticly interesting !

Post by Pawel Jakub Dawidek
Currently ZFS is only compiled as kernel module and is only available
for i386 architecture. Amd64 should be available very soon, the other
archs will come later, as we implement needed atomic operations.

I'm waiting eagerly to amd64 version....

Post by Pawel Jakub Dawidek
Missing functionality.
- There is no support for ACLs and extended attributes.

Is this planned ? Does that means I cannot use it as a basis for a
full-featured samba share ?

Thanks for your great work !!

Bruno DAMOUR

Pawel Jakub Dawidek

2007-04-07 00:57:05 UTC

Post by Bruno Damour
Thanks, fantasticly interesting !

I'm waiting eagerly to amd64 version....

Post by Pawel Jakub Dawidek
Missing functionality.
- There is no support for ACLs and extended attributes.

Is this planned ? Does that means I cannot use it as a basis for a full-featured samba share ?

It is planned, but it's not trivial. Does samba support NFSv4-style
ACLs?

--
Pawel Jakub Dawidek http://www.wheel.pl
***@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Rick Macklem

2007-04-08 21:41:04 UTC

Post by Bruno Damour

Post by Pawel Jakub Dawidek
- There is no support for ACLs and extended attributes.

Is this planned ? Does that means I cannot use it as a basis for a full-featured samba share ?

It is planned, but it's not trivial. Does samba support NFSv4-style
ACLs?

I don't know about samba, but my NFSv4 server can certainly use them.

I'll add my congratulations and thanks for the good work, to the list.
(Currently I know diddly about ZFS, but I'll try it someday.)

Good luck with it, rick

Bruno Damour

2007-04-06 23:06:26 UTC

Thanks, fantasticly interesting !

I'm waiting eagerly to amd64 version....

Post by Pawel Jakub Dawidek
Missing functionality.
- There is no support for ACLs and extended attributes.

Is this planned ? Does that means I cannot use it as a basis for a
full-featured samba share ?

Thanks for your great work !!

Bruno DAMOUR

Bernd Walter

2007-04-07 02:56:45 UTC

I got a kmem panic just by copying a recent ports.tgz (36M) onto a ZFS.
My sandbox just has 128MB RAM so kmem was set to ~40M.
After raising kmem to 80M it survived copying the file, but paniced
again while tar -xvzf the file into the same pool.
vfs.zfs.vdev.cache.size is unchanged at 10M.

--
B.Walter http://www.bwct.de http://www.fizon.de
***@bwct.de ***@bwct.de ***@fizon.de

Pawel Jakub Dawidek

2007-04-07 13:13:53 UTC

128MB RAM of suggested minimum in ZFS requirements, but it may be not
enough... Minimum of ARC is set to 1/8 of all memory or 64MB (whichever
is more). Could you locate these lines in
sys/contrib/opensolaris/uts/common/fs/zfs/arc.c file:

/* set min cache to 1/32 of all memory, or 64MB, whichever is more */
arc_c_min = MAX(arc_c / 4, 64<<20);

Change 64 to eg. 32, recompile and retest?

--
Pawel Jakub Dawidek http://www.wheel.pl
***@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Florian C. Smeets

2007-04-07 13:59:02 UTC

Post by Bernd Walter
My sandbox just has 128MB RAM so kmem was set to ~40M.
After raising kmem to 80M it survived copying the file, but paniced
again while tar -xvzf the file into the same pool.
vfs.zfs.vdev.cache.size is unchanged at 10M.

128MB RAM of suggested minimum in ZFS requirements, but it may be not
enough... Minimum of ARC is set to 1/8 of all memory or 64MB (whichever
is more). Could you locate these lines in
/* set min cache to 1/32 of all memory, or 64MB, whichever is more */
arc_c_min = MAX(arc_c / 4, 64<<20);
Change 64 to eg. 32, recompile and retest?

Hi Pawel,

i had the same problems like Bernd while trying to copy the src tree to
a ZFS volume. I have 384MB RAM but i got the same "kmem_map: too small"
panic. I compiled my kernel like you proposed and now i am able to copy
anything to the volume without panic :-)

Regards,
Florian

P.S. Thanks for all the work on ZFS!

Andrei Kolu

2007-04-07 14:39:34 UTC

Post by Ricardo Correia

Hi Pawel,
i had the same problems like Bernd while trying to copy the src tree to
a ZFS volume. I have 384MB RAM but i got the same "kmem_map: too small"
panic. I compiled my kernel like you proposed and now i am able to copy
anything to the volume without panic :-)

Why we can't use virtual memory?

Bernd Walter

2007-04-07 16:58:00 UTC

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Pawel,
i had the same problems like Bernd while trying to copy the src tree to
a ZFS volume. I have 384MB RAM but i got the same "kmem_map: too small"
panic. I compiled my kernel like you proposed and now i am able to copy
anything to the volume without panic :-)

I had increased RAM to 384 and still had a panic with default kmem
(IIRC around 100M) and even increasing kmem to 160M did help a long
time, but still produced the panic after a while.
I don't think 64M applies here as the real limit.

--
B.Walter http://www.bwct.de http://www.fizon.de
***@bwct.de ***@bwct.de ***@fizon.de

Bernd Walter

2007-04-07 18:03:19 UTC

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Pawel,
i had the same problems like Bernd while trying to copy the src tree to
a ZFS volume. I have 384MB RAM but i got the same "kmem_map: too small"
panic. I compiled my kernel like you proposed and now i am able to copy
anything to the volume without panic :-)

Now with 240M kmem it looks good, but I'm still unshure:
kstat.zfs.misc.arcstats.c_min: 67108864
kstat.zfs.misc.arcstats.c_max: 188743680
kstat.zfs.misc.arcstats.size: 87653376
c_max seemed to be increasing with kmem, but I did compare it with a
remebered value.
Should be good with:
vm.kmem_size: 251658240
But top shows wired memory which is roughly twice the size of
arcstats.size, so I'm still worried about kmem exhaustion if ARC runs
up to c_max.
Since the c_min/c_max values also influence the available RAM for other
purposes as well, can we have it at least a loader.conf tuneable?

Otherwise - the reboot after the panics where impressive.
No long fsck times or noticed data corruption - even with NFS clients.
All in all it is a great job.

--
B.Walter http://www.bwct.de http://www.fizon.de
***@bwct.de ***@bwct.de ***@fizon.de

Pawel Jakub Dawidek

2007-04-07 19:15:17 UTC

Post by Bernd Walter
kstat.zfs.misc.arcstats.c_min: 67108864
kstat.zfs.misc.arcstats.c_max: 188743680
kstat.zfs.misc.arcstats.size: 87653376
c_max seemed to be increasing with kmem, but I did compare it with a
remebered value.
vm.kmem_size: 251658240
But top shows wired memory which is roughly twice the size of
arcstats.size, so I'm still worried about kmem exhaustion if ARC runs
up to c_max.
Since the c_min/c_max values also influence the available RAM for other
purposes as well, can we have it at least a loader.conf tuneable?

Just committed a change. You can tune max and min ARC size via
vfs.zfs.arc_max and vfs.zfs.arc_min tunnables.

--
Pawel Jakub Dawidek http://www.wheel.pl
***@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Bernd Walter

2007-04-07 21:24:14 UTC

Just committed a change. You can tune max and min ARC size via
vfs.zfs.arc_max and vfs.zfs.arc_min tunnables.

Thanks - I'd set c_max to 80M now and will see what happens, since
I had such a panic again with 240M kmem.

I'm a bit confused about the calculation as such.
Lets asume a 4G i386 system.
arg_c = 512M
c_min = 512M
c_max = 3G
But isn't this KVA space, of which we usually can't have 3G on i386
without limiting userland to 1G?
Even 512M KVA sounds very much on a i386, since 4G systems usually
have more use for limited KVA.

--
B.Walter http://www.bwct.de http://www.fizon.de
***@bwct.de ***@bwct.de ***@fizon.de

Craig Boston

2007-04-10 00:38:37 UTC

512MB RAM. So I've been testing in a VMware instance with 512MB. My
vm.kmem_size is defaulting to 169758720.

Meh, wrong stat, I probably should have said that

vm.kmem_size_max: 335544320

Kris Kennaway

2007-04-10 01:11:25 UTC

512MB RAM. So I've been testing in a VMware instance with 512MB. My
vm.kmem_size is defaulting to 169758720.

Meh, wrong stat, I probably should have said that
vm.kmem_size_max: 335544320

Nah, you were right the first time :) Your system is defaulting to
160MB for the kmem_map, of which zfs will (by default) try to use up
to 3/4. Naturally this doesn't leave much for the rest of the kernel
(40MB), so you'll easily run the kernel out of memory.

For now, you probably want to increase vm.kmem_size a bit to allow
some more room for zfs, and set vfs.zfs.arc_max and arc_min to
something more reasonable like 64*1024*1024+1 (the +1 is needed
because zfs currently requires "greater than 64MB" for the arc).

Kris

Craig Boston

2007-04-10 01:30:35 UTC

Post by Kris Kennaway
Nah, you were right the first time :) Your system is defaulting to
160MB for the kmem_map, of which zfs will (by default) try to use up
to 3/4. Naturally this doesn't leave much for the rest of the kernel
(40MB), so you'll easily run the kernel out of memory.

Hmm, I had already reduced the maximum arc size to 64MB though, which I
figured (hoped?) would leave plenty of room.

So if kmem_size is the total size and it can't grow, what is
kmem_size_max? Is there a way to see a sum of total kmem allocation?
Even the vm.zone breakdown seems to be gone in current so apparently my
knowledge of such things is becoming obsolete :)

Post by Kris Kennaway
For now, you probably want to increase vm.kmem_size a bit to allow
some more room for zfs, and set vfs.zfs.arc_max and arc_min to
something more reasonable like 64*1024*1024+1 (the +1 is needed
because zfs currently requires "greater than 64MB" for the arc).

Yeah, I found that out the hard way after wondering why it was ignoring
the tunables :)

I ran out of kmem_map space once with it set to 64*1024*1024+1, then I
modified the source so that it would accept zfs_arc_max >= (64 << 20)
instead, just in case it was a power-of-2 thing.

Craig

Craig Boston

2007-04-10 01:42:33 UTC

Post by Craig Boston
Even the vm.zone breakdown seems to be gone in current so apparently my
knowledge of such things is becoming obsolete :)

But vmstat -m still works

...

solaris 145806 122884K - 15319671 16,32,64,128,256,512,1024,2048,4096
...

Whoa! That's a lot of kernel memory. Meanwhile...

kstat.zfs.misc.arcstats.size: 33554944
(which is just barely above vfs.zfs.arc_min)

So I don't think it's the arc cache (yeah I know that's redundant) that
is the problem. Seems like something elsewhere in zfs is allocating
large amounts of memory and not letting it go, and even the cache is
having to shrink to its minimum size due to the memory pressure.

It didn't panic this time, so when the tar finished I tried a "zfs
unmount /usr/ports". This caused the "solaris" entry to drop down to
about 64MB, so it's not a leak. It could just be that ZFS needs lots of
memory to operate if it keeps a lot of metadata for each file in memory.

The sheer # of allocations still seems excessive though. It was well
over 20 million by the time the tar process exited.

Craig

Kris Kennaway

2007-04-10 01:55:23 UTC

Post by Craig Boston
Even the vm.zone breakdown seems to be gone in current so apparently my
knowledge of such things is becoming obsolete :)

But vmstat -m still works
...
solaris 145806 122884K - 15319671 16,32,64,128,256,512,1024,2048,4096
...
Whoa! That's a lot of kernel memory. Meanwhile...
kstat.zfs.misc.arcstats.size: 33554944
(which is just barely above vfs.zfs.arc_min)
So I don't think it's the arc cache (yeah I know that's redundant) that
is the problem. Seems like something elsewhere in zfs is allocating
large amounts of memory and not letting it go, and even the cache is
having to shrink to its minimum size due to the memory pressure.
It didn't panic this time, so when the tar finished I tried a "zfs
unmount /usr/ports". This caused the "solaris" entry to drop down to
about 64MB, so it's not a leak. It could just be that ZFS needs lots of
memory to operate if it keeps a lot of metadata for each file in memory.
The sheer # of allocations still seems excessive though. It was well
over 20 million by the time the tar process exited.

That is a lifetime count of the # of operations, not the current
number allocated ("InUse").

It does look like there is something else using a significant amount
of memory apart from arc, but arc might at least be the major one due
to its extremely greedy default allocation policy.

Kris

Craig Boston

2007-04-10 02:04:55 UTC

Post by Kris Kennaway
That is a lifetime count of the # of operations, not the current
number allocated ("InUse").

Yes, perhaps I should have said "sheer number of allocations &
deallocations". I was just surprised that it seems to grab and release
memory much more often than anything else tracked by vmstat.

Post by Kris Kennaway
It does look like there is something else using a significant amount
of memory apart from arc, but arc might at least be the major one due
to its extremely greedy default allocation policy.

I wasn't going to post again until somebody suggested trying this, but I
think the name cache can be ruled out. I reduced vfs.zfs.dnlc.ncsize
from ~13000 to 4096 with no appreciable drop in total memory usage.

It seems to be stable with vm.kmem_size at 256MB, but the wired count
has come dangerously close a few times.

Craig

Pawel Jakub Dawidek

2007-04-10 02:38:57 UTC

Post by Craig Boston
Even the vm.zone breakdown seems to be gone in current so apparently my
knowledge of such things is becoming obsolete :)

ARC and ZIO are the biggest memory consumers and they are somehow
connected. I just committed changes that should stabilize ZFS in this
regard. Could you try them?

--
Pawel Jakub Dawidek http://www.wheel.pl
***@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Craig Boston

2007-04-10 04:04:31 UTC

Post by Pawel Jakub Dawidek
ARC and ZIO are the biggest memory consumers and they are somehow
connected. I just committed changes that should stabilize ZFS in this
regard. Could you try them?

Hrm, well I was attempting to but it panic'd in the middle of the kernel
build (/usr/src and obj are on the test zfs partition). Apparently
256MB isn't enough kmem either. I'll bump it up again and try
rebuilding, and lower it back to 256 for testing.

kmem_malloc(131072): kmem_map too small: 214921216 total allocated

Craig

Craig Boston

2007-04-10 04:36:10 UTC

Post by Pawel Jakub Dawidek
ARC and ZIO are the biggest memory consumers and they are somehow
connected. I just committed changes that should stabilize ZFS in this
regard. Could you try them?

Preliminary results with the latest -current kernel and
vm.kmem_size=268435456, disabling all my other loader.conf entries and
letting it autosize:

kstat.zfs.misc.arcstats.p: 15800320
kstat.zfs.misc.arcstats.c: 16777216
kstat.zfs.misc.arcstats.c_min: 16777216
kstat.zfs.misc.arcstats.c_max: 134217728
kstat.zfs.misc.arcstats.size: 18003456

solaris 43705 91788K - 4522887 16,32,64,128,256,512,1024,2048,4096

So it looks like it autosized the ARC to a 16M-128M range. I'm
currently doing a buildworld and am going to try untarring the ports
tree. The ARC size is tending to hover around 16-20M, probably due to
memory pressure. The "solaris" group appears to be taking up about 16M
less memory than it did before, which is consistent with the ARC being
16M smaller (I had changed the minimum to 32M before reverting to HEAD).

I may poke around in ZIO but judging from the complexity of the code I
don't have much of a chance of really understanding it anytime soon.

In my defense, the machine I was planning to use this on isn't _that_
old. It's a 2Ghz P4, which should be "okay" as far as checksum
calculations go. It just has a braindead motherboard that refuses to
accept more than 512MB of RAM.

Craig

Kris Kennaway

2007-04-10 01:48:23 UTC

Hmm, I had already reduced the maximum arc size to 64MB though, which I
figured (hoped?) would leave plenty of room.
So if kmem_size is the total size and it can't grow, what is
kmem_size_max? Is there a way to see a sum of total kmem allocation?
Even the vm.zone breakdown seems to be gone in current so apparently my
knowledge of such things is becoming obsolete :)

It's the cap used by the auto-sizing code, i.e. no matter how much RAM
the system has it will never use more than 320MB for kmem, by default.

Currently I think there is no exported way to view the amount of free
space in the map, but there should be.

Yeah, I found that out the hard way after wondering why it was ignoring
the tunables :)
I ran out of kmem_map space once with it set to 64*1024*1024+1, then I
modified the source so that it would accept zfs_arc_max >= (64 << 20)
instead, just in case it was a power-of-2 thing.

OK. Probably this is a sign that 160 - 64 = 96MB is not enough for
your kernel, i.e. you'd also get the panics if you turned down
vm.kmem_size to 96MB and didn't use zfs.

Kris

Craig Boston

2007-04-10 00:35:05 UTC

Post by Pawel Jakub Dawidek
Just committed a change. You can tune max and min ARC size via
vfs.zfs.arc_max and vfs.zfs.arc_min tunnables.

Thanks - I'd set c_max to 80M now and will see what happens, since
I had such a panic again with 240M kmem.

Hi, just wanted to chime in that I'm experiencing the same panic with
a fresh -CURRENT.

I'm seriously considering trying out ZFS on my home file server (this
should tell you how much I've come to trust pjd's work ;). Anyway,
since it's a repurposed desktop with a crappy board, it's limited to
512MB RAM. So I've been testing in a VMware instance with 512MB. My
vm.kmem_size is defaulting to 169758720.

Works fine up until the point I start copying lots of files onto the ZFS
partition. I tried the suggestion of reducing the tunables. After
modifying the source to accept these values, I have it set to:

kstat.zfs.misc.arcstats.p: 33554432
kstat.zfs.misc.arcstats.c: 67108864
kstat.zfs.misc.arcstats.c_min: 33554432
kstat.zfs.misc.arcstats.c_max: 67108864
kstat.zfs.misc.arcstats.size: 20606976

This is after a clean boot before trying anything. arcstats.size floats
right at the max for quite a while before the panic happens, so I
suspect something else is causing it to run out of kvm, perhaps the
normal buffer cache since I'm copying from a UFS filesystem.

panic: kmem_malloc(131072): kmem_map too small: 131440640 total
allocated

Though the backtrace (assuming I'm loading the module symbols correctly)
seems to implicate zfs.

#0 doadump () at pcpu.h:172
#1 0xc06bbaab in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
#2 0xc06bbd38 in panic (
fmt=0xc094f28c "kmem_malloc(%ld): kmem_map too small: %ld total allocated")
at /usr/src/sys/kern/kern_shutdown.c:563
#3 0xc0821e70 in kmem_malloc (map=0xc145408c, size=131072, flags=2)
at /usr/src/sys/vm/vm_kern.c:305
#4 0xc0819d56 in page_alloc (zone=0x0, bytes=131072, pflag=0x0, wait=2)
at /usr/src/sys/vm/uma_core.c:955
#5 0xc081bfcf in uma_large_malloc (size=131072, wait=2)
at /usr/src/sys/vm/uma_core.c:2709
#6 0xc06b0eb1 in malloc (size=131072, mtp=0xc0bd0080, flags=2)
at /usr/src/sys/kern/kern_malloc.c:364
#7 0xc0b66f67 in zfs_kmem_alloc (size=131072, kmflags=2)
at /usr/src/sys/modules/zfs/../../compat/opensolaris/kern/opensolaris_kmem.c:67
#8 0xc0bb23ad in zio_buf_alloc (size=131072)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zio.c:211
#9 0xc0ba4487 in vdev_queue_io_to_issue (vq=0xc3424ee4, pending_limit=Unhandled dwarf expression opcode 0x93
)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c:213
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c:312
#11 0xc0bc69fd in vdev_geom_io_done (zio=0xc4435400)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:412
#12 0xc0b6ad19 in taskq_thread (arg=0xc2dfa0cc)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/os/taskq.c:833
#13 0xc06a54ba in fork_exit (callout=0xc0b6ac18 <taskq_thread>,
arg=0xc2dfa0cc, frame=0xd62cdd38) at /usr/src/sys/kern/kern_fork.c:814
#14 0xc08a8c10 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:205

I haven't tried increasing kmem yet -- I'm a bit leery of devoting so
much memory (presumably nonpageable, nonreclaimable) to the kernel.

Admittedly I'm somewhat confused as to why ZFS needs its own special
cache rather than sharing the system's, or at least only use free
physical pages allocated as VM objects rather than precious kmem. But
I'm no VM guru :)

Craig

R. B. Riddick

2007-04-07 06:32:23 UTC

Post by Mike Wolman
It could also be used for asynchronous mirrors over slow links, if the log
device was always written to first then the write latency for long distant
links could be removed. Im not sure if it would be possible to achieve
this using just a modified ggatec instead which has a local device used
as a write cache.

Sounds like rsync can already do that (I am not sure right now, if rsync can
find updated areas within a large file, or if it just copies the while updated
file even if it is a large one)...

Furthermore the remote consumer of that gmirror couldnt be mounted RW, if it
uses UFS, because UFS doesnt allow multiple RW mounts at the same time...

-Arne

____________________________________________________________________________________
Food fight? Enjoy some healthy debate
in the Yahoo! Answers Food & Drink Q&A.
http://answers.yahoo.com/dir/?link=list&sid=396545367

Matthew Seaman

2007-04-07 07:34:09 UTC

Post by R. B. Riddick
Sounds like rsync can already do that (I am not sure right now, if rsync can
find updated areas within a large file, or if it just copies the while updated
file even if it is a large one)...

rsync will find an updated area within a big file. The algorithm is to
divide any such file into 100kB[*] chunks, calculate checksums of each of
those chunks and only transfer the chunks where the checksum differs
between source and destination. More detail here:

http://samba.org/rsync/how-rsync-works.html
http://rsync.samba.org/tech_report/tech_report.html

Cheers,

Matthew

[*] For some value of 100kB.

--
Dr Matthew J Seaman MA, D.Phil. 7 Priory Courtyard
Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey Ramsgate
Kent, CT11 9PW

Mike Wolman

2007-04-07 09:35:56 UTC

Rsync is a great tool however if you try to rsync a filesystem
with hundreds of thousand files in it the file list can use quite
a large amount of bandwidth even if only a single file has
changed - if you were keeping track of the blocks which had changed
then you do not need to generate this list and simply send over the
changed blocks.

I was not thinking the remote side would mount the image unless
the primary site was offline/unavailable.

Mike.

Post by R. B. Riddick

Sounds like rsync can already do that (I am not sure right now, if rsync can
find updated areas within a large file, or if it just copies the while updated
file even if it is a large one)...
Furthermore the remote consumer of that gmirror couldnt be mounted RW, if it
uses UFS, because UFS doesnt allow multiple RW mounts at the same time...
-Arne
____________________________________________________________________________________
Food fight? Enjoy some healthy debate
in the Yahoo! Answers Food & Drink Q&A.
http://answers.yahoo.com/dir/?link=list&sid=396545367

Jim Rees

2007-04-07 11:58:55 UTC

Mike Wolman wrote:

if you were keeping track of the blocks which had changed
then you do not need to generate this list and simply send over the
changed blocks.

Unison keeps a list of files at each end and only exchanges block lists for
files that have changed. I use it to sync 40GB (10K files) over a 1Mbps
link and it's very fast. It also will do two-way sync.

Mike Wolman

2007-04-07 16:00:08 UTC

Post by Mike Wolman
if you were keeping track of the blocks which had changed
then you do not need to generate this list and simply send over the
changed blocks.
Unison keeps a list of files at each end and only exchanges block lists for
files that have changed. I use it to sync 40GB (10K files) over a 1Mbps
link and it's very fast. It also will do two-way sync.

Unison and rsync both work on the filesystem level and not with
the blocks directly so would not be able to achieve the same result
as the live network backup on netbsd - ie allowing a simple dd
restore of a machine.

As this would be filesystem independent and if you are running zfs or
other snapshot capable filesystem i think rsync or unison would have a
problem working with the snapshots. i do use rsync with close to about
1Tb of data and a lot hard links for - but if the remote file changes you
have to store the entire copy of new file and not just the actual blocks
which have changed.

Mike.

Randall Stewart

2007-04-07 10:39:22 UTC

Great work Pawel...

I see you posted a quick start ... I will have
to move my laptop to use this as its non-root fs's :-D

R

--
Randall Stewart
NSSTG - Cisco Systems Inc.
803-345-0369 <or> 803-317-4952 (cell)

Jorn Argelo

2007-04-07 10:54:57 UTC

This is fantastic news! At the risk of raking over ye olde arguments,
as the old saying goes: "Dual licensing? We don't need no stinkeen
dual licensing!". :-)

First of all, thanks a lot for all the hard work of both the FreeBSD
developers as the ZFS developers. I can't wait to give it a go.

That leads me to one question though: Why is *BSD able to bring it into
the OS as where Linux has licensing problems with the CDDL? AFAIK Linux
users can only run it in userland mode and not in kernel mode because of
the licenses.

I don't really know the differences between all the licenses, so feel
free to correct me if I'm saying something stupid.

Thanks,

Jorn

Wilko Bulte

2007-04-07 14:17:37 UTC

On Sat, Apr 07, 2007 at 12:54:57PM +0200, Jorn Argelo wrote..

Post by Jorn Argelo

This is fantastic news! At the risk of raking over ye olde arguments,
as the old saying goes: "Dual licensing? We don't need no stinkeen
dual licensing!". :-)

First of all, thanks a lot for all the hard work of both the FreeBSD
developers as the ZFS developers. I can't wait to give it a go.
That leads me to one question though: Why is *BSD able to bring it into
the OS as where Linux has licensing problems with the CDDL? AFAIK Linux
users can only run it in userland mode and not in kernel mode because of
the licenses.

My guess(!) is that they do not want non-GPL-ed code in the standard kernel.

--
Wilko Bulte ***@FreeBSD.org

Jim Rees

2007-04-07 16:10:45 UTC

Wilko Bulte wrote:

My guess(!) is that they do not want non-GPL-ed code in the standard kernel.

Actually there is non-GPL code in linux, NFSv4 for example. Some licenses
are considered "incompatible." OpenAFS falls into this category, apparently
because of some problem with patents. I don't know what the situation is
with CDDL.

Ivan Voras

2007-04-10 09:11:18 UTC

Post by Wilko Bulte
On Sat, Apr 07, 2007 at 12:54:57PM +0200, Jorn Argelo wrote..

Post by Jorn Argelo

Post by Rich Teer
This is fantastic news! At the risk of raking over ye olde arguments,
as the old saying goes: "Dual licensing? We don't need no stinkeen
dual licensing!". :-)

First of all, thanks a lot for all the hard work of both the FreeBSD
developers as the ZFS developers. I can't wait to give it a go.
That leads me to one question though: Why is *BSD able to bring it into
the OS as where Linux has licensing problems with the CDDL? AFAIK Linux
users can only run it in userland mode and not in kernel mode because of
the licenses.

My guess(!) is that they do not want non-GPL-ed code in the standard kernel.

Sorry if I'm reiterating what someone maybe already explained, but I
don't see it on the lists I read:

FreeBSD can include GPL'ed code due to a "technicality" (literally): As
long as the code is in a separate kernel module and not in the default
shipped GENERIC kernel, it's considered "bundled" and not a part of the
kernel. As soon as the user loads a GPLed kernel module, presto-changeo!
his kernel "automagically" becomes GPLed. I believe the same holds for
CDDL. (I have no idea how to resolve the licensing issues of a kernel
with both GPL and CDDL parts :) ). This is less inconvenient than it
seems since kernel modules can be (pre)loaded at the same time the
kernel loads, and so we can have a ZFS root partition, etc.

The problem with DTrace in FreeBSD is twofold:

1. It's much more intertwined with the kernel.
2. Much of its usability comes from it being available in the default
shipped kernel - so that users can use it to troubleshoot problems "on
the fly" without having to recompile and install a new kernel (involves
rebooting).

AFAIK (not involved with its development), most of dtrace can reside in
a kernel module but some parts need to be in the kernel proper to
support this mode of operation, and *this* is where the licensing comes
in. Just a few files (AFAIK: mostly header files!) need to be
dual-licensed so they can be included in the default kernel build, and
the rest can be in the CDDL licensed kernel module.

Hartmut Brandt

2007-04-10 09:25:47 UTC

Post by Ivan Voras

Post by Wilko Bulte
On Sat, Apr 07, 2007 at 12:54:57PM +0200, Jorn Argelo wrote..

Post by Jorn Argelo

Post by Rich Teer
This is fantastic news! At the risk of raking over ye olde arguments,
as the old saying goes: "Dual licensing? We don't need no stinkeen
dual licensing!". :-)

First of all, thanks a lot for all the hard work of both the FreeBSD
developers as the ZFS developers. I can't wait to give it a go.
That leads me to one question though: Why is *BSD able to bring it
into the OS as where Linux has licensing problems with the CDDL?
AFAIK Linux users can only run it in userland mode and not in kernel
mode because of the licenses.

My guess(!) is that they do not want non-GPL-ed code in the standard kernel.

Sorry if I'm reiterating what someone maybe already explained, but I
As long as the code is in a separate kernel module and not in the
default shipped GENERIC kernel, it's considered "bundled" and not a
part of the kernel. As soon as the user loads a GPLed kernel module,
presto-changeo! his kernel "automagically" becomes GPLed. I believe
the same holds for CDDL. (I have no idea how to resolve the licensing
issues of a kernel with both GPL and CDDL parts :) ). This is less
inconvenient than it seems since kernel modules can be (pre)loaded at
the same

I had some discussion with folks at Sun (indirectly via another guy)
while they were in the process of making the CDDL: They said:
Modifications to CDDL code must be under CDDL. This means if you change
a CDDLed file, your changes are CDDL. If you add a line to the CDDL code
that calls a function in another, new file, you're free to put that
other file under any license as long as there is a compatibility the
other way 'round - you probably cannot put that file under GPL, but you
can put it under BSD. The new file is not a modification of the CDDLed code.

harti

Florent Thoumie

2007-04-07 12:15:00 UTC

Thanks for working on it Pawel!

We're now all waiting for 7.0-RELEASE :-)

--
Florent Thoumie
***@FreeBSD.org
FreeBSD Committer

Dag-Erling Smørgrav

2007-04-07 19:43:59 UTC

ZFS is now also available on pc98 and amd64.

DES

--
Dag-Erling Smørgrav - ***@des.no

Bernd Walter

2007-04-07 20:34:12 UTC

ZFS is now also available on pc98 and amd64.

Great to read - is it just atomic.S missing for the remaining
architectures?

--
B.Walter http://www.bwct.de http://www.fizon.de
***@bwct.de ***@bwct.de ***@fizon.de

Dag-Erling Smørgrav

2007-04-07 21:16:12 UTC

Post by Dag-Erling SmÃ¸rgrav
ZFS is now also available on pc98 and amd64.

Great to read - is it just atomic.S missing for the remaining
architectures?

Yes. Ideally, ZFS would use FreeBSD's atomic operations instead of
its own. I believe that the reason it doesn't is (at least in part)
that we don't have 64-bit atomic operations for i386. I have
unfinished patches for cleaning up the atomic operations on all
platforms; I'll dust them off and see what I can do.

DES

--
Dag-Erling Smørgrav - ***@des.no

David Schultz

2007-04-11 21:49:11 UTC

Post by Dag-Erling SmÃ¸rgrav
ZFS is now also available on pc98 and amd64.

Great to read - is it just atomic.S missing for the remaining
architectures?

As I recall, Solaris 10 targets PPro and later processors, whereas
FreeBSD supports everything back to a 486DX. Hence we can't
assume that cmpxchg8b is available. The last time I remember this
coming up, people argued that we had to do things slow way in the
default kernel for compatibility.

Any ideas how ZFS and GEOM are going to work out, given that ZFS
is designed to be the filesystem + volume manager in one?

Anyway, this looks like awesome stuff! Unfortunately, I won't have
any time to play with it much in the short term, but as soon as WD
sends me the replacement for my spare disk I'll at least install
ZFS and see how it goes.

Awesome work, once again. Thanks!

Dag-Erling Smørgrav

2007-04-11 22:30:08 UTC

Post by David Schultz
Any ideas how ZFS and GEOM are going to work out, given that ZFS
is designed to be the filesystem + volume manager in one?

Pawel has seveal years' experience writing GEOM classes, and ZFS plays
along nicely with GEOM. You can create zpools on any kind of GEOM
provider, and attach any kind of GEOM consumer to zvols.

DES

--
Dag-Erling Smørgrav - ***@des.no

Bernd Walter

2007-04-11 22:51:25 UTC

Post by David Schultz

Post by Dag-Erling SmÃ¸rgrav
ZFS is now also available on pc98 and amd64.

Great to read - is it just atomic.S missing for the remaining
architectures?

I already did a good cleanup of arm atomic functions based on your
work a while ago.

Post by David Schultz
As I recall, Solaris 10 targets PPro and later processors, whereas
FreeBSD supports everything back to a 486DX. Hence we can't
assume that cmpxchg8b is available. The last time I remember this
coming up, people argued that we had to do things slow way in the
default kernel for compatibility.

486 support is definitively needed, but it is very unlikely that many
real existing 486 system has enough RAM for ZFS.
AFAIK a ELAN520 can have up to 256MB, but I doubt that one would
spend so much RAM for such a system without better use for it.
Not shure about 586, this is more likely.
But I'm not very familar with x86 assembly, so I don't even know which
CPUs have cmpxchg8b.
If ZFS wouldn't be so greedy I might have used it on flash media for
x86 and ARM systems, but those boards usually don't have enough RAM.

Post by David Schultz
Any ideas how ZFS and GEOM are going to work out, given that ZFS
is designed to be the filesystem + volume manager in one?

Although you want to use ZFS RAID functionality GEOM has still many
goodies avalable, such as md, ggate, partition-parsing, encyption, etc.
There are other cool points, which I've found possible lately.
E.g. replace all RAIDZ drives with bigger ones, export/import the
pool and you have additional storage with the same number of drives.
You just need a single additional drive at the same time, which is
great in case you are short on drive bays.
In case you accidently added a drive you didn't want to, you can't
easily remove it, but you can workaround by replacing it with another
one, which is equal or bigger in size.
A short time workaround in such a case until you can backup/restore or
replace the wrong drive with a long standing drive, you can use sparse
md-vnode devices, ggate or gconcat ones.
You just have to be carefull with sparse files, since ZFS don't care
about it when filling with data, but you can at least detach your USB
or firewire drive and hopefully live with the situation a few days.
Today I tested a 6T Volume with sparse md files.
This all worked really great.

--
B.Walter http://www.bwct.de http://www.fizon.de
***@bwct.de ***@bwct.de ***@fizon.de

Louis Kowolowski

2007-04-11 23:10:46 UTC

On Thu, Apr 12, 2007 at 12:51:25AM +0200, Bernd Walter wrote:
...

Post by Bernd Walter
486 support is definitively needed, but it is very unlikely that many
real existing 486 system has enough RAM for ZFS.
AFAIK a ELAN520 can have up to 256MB, but I doubt that one would
spend so much RAM for such a system without better use for it.
Not shure about 586, this is more likely.
But I'm not very familar with x86 assembly, so I don't even know which
CPUs have cmpxchg8b.
If ZFS wouldn't be so greedy I might have used it on flash media for
x86 and ARM systems, but those boards usually don't have enough RAM.

I'm some people would be interested in being able to use ZFS with boxes like
Soekris for NAS (FreeNAS comes to mind) type stuff...

--
Louis Kowolowski KE7BAX ***@cryptomonkeys.com
Cryptomonkeys: http://www.cryptomonkeys.com/~louisk

Warning: Do not point laser at remaining eye!

Bernd Walter

2007-04-12 02:12:52 UTC

Post by Louis Kowolowski
...

I'm some people would be interested in being able to use ZFS with boxes like
Soekris for NAS (FreeNAS comes to mind) type stuff...

I'm currently running an NFS fileserver with 384M RAM, which seems to
work with some restrictions, but it is also putting pressure on the CPU,
which is a 700MHz PIII and this is not only while accessing compressed
data.
You might be able to get it running on a 256MB 4801, but don't expect
any speed wonders.
The upcoming 5501 might be a good candidate if populated with much RAM.
If I got the prototype picture on soekris.com right they have 512MBit
chips soldered, which gives 256MB only - more than enough for most
embedded use, but not with ZFS as it stands right now...
That said - I don't know what the default population really will be.

--
B.Walter http://www.bwct.de http://www.fizon.de
***@bwct.de ***@bwct.de ***@fizon.de

Dag-Erling Smørgrav

2007-04-12 09:39:35 UTC

Post by Louis Kowolowski
I'm some people would be interested in being able to use ZFS with boxes like
Soekris for NAS (FreeNAS comes to mind) type stuff...

I don't think a Soekris will cut the mustard. A NAS would need a
large case to hold the disks anyway, so you might as well use an EPIA
board; most C3 / C7 boards can take 1 GB, and they don't cost more
than a Soekris.

DES

--
Dag-Erling Smørgrav - ***@des.no

Peter Jeremy

2007-04-12 07:36:06 UTC

Post by David Schultz
As I recall, Solaris 10 targets PPro and later processors, whereas
FreeBSD supports everything back to a 486DX. Hence we can't
assume that cmpxchg8b is available.

There's a feature bit (CPUID_CX8) that advertises the availability of
cmpxchg8b (and maybe some related instructions). My pre-MMX 586 has
this bit set so I presume anything later than 486 will support it.
(I'm not sure about the low-end VIA, GEODE etc clones).

Post by David Schultz
The last time I remember this
coming up, people argued that we had to do things slow way in the
default kernel for compatibility.

I agree that GENERIC should run on lowest-common-denominator hardware
(the definition of that is a subject for a different thread). GENERIC
performance could be enhanced by using an indirect call for 8-byte
atomic instructions and selecting between the cmpxchg8b and
alternative implementation as part of the CPU startup (much like
i586_bcopy). If CPU_486 is not defined, you code could inline the
cmpxchg8b-based variant.

--
Peter Jeremy

Dag-Erling Smørgrav

2007-04-12 08:54:17 UTC

Post by Peter Jeremy
There's a feature bit (CPUID_CX8) that advertises the availability of
cmpxchg8b (and maybe some related instructions). My pre-MMX 586 has
this bit set so I presume anything later than 486 will support it.
(I'm not sure about the low-end VIA, GEODE etc clones).

The Geode is a 486, and does not support it.

The C3 however is a 586. The C3 Ezra and C3 Samuel / Samuel 2 do not
have CX8. I'm not sure about the C3 Nehemiah, I don't have one
running at the moment.

Post by Peter Jeremy
I agree that GENERIC should run on lowest-common-denominator hardware
(the definition of that is a subject for a different thread). GENERIC
performance could be enhanced by using an indirect call for 8-byte
atomic instructions and selecting between the cmpxchg8b and
alternative implementation as part of the CPU startup (much like
i586_bcopy). If CPU_486 is not defined, you code could inline the
cmpxchg8b-based variant.

Our native atomic operations are all defined as either macros or
static inline functions in machine/atomic.h, so we can easily make
this choice at compile time based on a config option.

DES

--
Dag-Erling Smørgrav - ***@des.no

Henrik Brix Andersen

2007-04-12 10:55:45 UTC

The Geode is a 486, and does not support it.

The Geodes found in both my net4801-50 and net4801-60 are both
i586-class CPUs with CX8 support:

$ dmesg | grep -A 2 ^CPU
CPU: Geode(TM) Integrated Processor by National Semi (266.65-MHz 586-class CPU)
Origin = "Geode by NSC" Id = 0x540 Stepping = 0
Features=0x808131<FPU,TSC,MSR,CX8,CMOV,MMX>

Regards,
Brix

--
Henrik Brix Andersen <***@brixandersen.dk>

Rick C. Petty

2007-04-12 16:06:03 UTC

Post by Dag-Erling SmÃ¸rgrav
Our native atomic operations are all defined as either macros or
static inline functions in machine/atomic.h, so we can easily make
this choice at compile time based on a config option.

Is there any way we could make the choice at boot time, by checking for
presence of the CX8 feature? Either as something like:

extern int feature_cx8; /* or MIB variable */
#define CMPXCHG8(a) (feature_cx8 ? { _asm "..." } : emulate_cmpxch8(a))

Otherwise something like ZFS which utilizes this feature a lot could
check the MIB variable and set different fn ptr in its device structure,
or something along those lines. Of course, that would require compiling
the same code twice essentially, but it had the advantage that it would
work on non-CX8 systems and that it would be fast on systems with CX8.

-- Rick C. Petty

Craig Boston

2007-04-12 18:51:59 UTC

Post by Rick C. Petty

Is there any way we could make the choice at boot time, by checking for
extern int feature_cx8; /* or MIB variable */
#define CMPXCHG8(a) (feature_cx8 ? { _asm "..." } : emulate_cmpxch8(a))

For something this low level my opinion is it's better to stay with
compile time options. After all, in the above example, cmpxchg8 is a
single machine instruction. How much overhead does it add to retrieve a
variable from memory and check it, then jump to the correct place?
Enough that it outweighs the benefit of using that instruction in the
first place?

For entire functions that have been optimized (bzero comes to mind) you
can always either use a function pointer or overwrite the code in memory
with the optimized version. The function call overhead presumably isn't
that much compared to the work that the function is doing.

Post by Rick C. Petty
Otherwise something like ZFS which utilizes this feature a lot could
check the MIB variable and set different fn ptr in its device structure,
or something along those lines. Of course, that would require compiling
the same code twice essentially, but it had the advantage that it would
work on non-CX8 systems and that it would be fast on systems with CX8.

I agree this makes sense for some things, but atomic operations are
supposed to be as fast as possible -- preferably single machine
instructions I can't think of anything short of JIT compiling the kernel
that wouldn't be a high price to pay.

Craig

Dag-Erling Smørgrav

2007-04-12 20:43:05 UTC

Post by Rick C. Petty
Is there any way we could make the choice at boot time, by checking for
extern int feature_cx8; /* or MIB variable */
#define CMPXCHG8(a) (feature_cx8 ? { _asm "..." } : emulate_cmpxch8(a))

I don't think it matters. Contrary to popular belief, atomic
operations are *expensive*. In the best case, on a UP machine, they
stall the pipeline. In the worst case, on an SMP machine, they stall
the entire memory bus.

DES

--
Dag-Erling Smørgrav - ***@des.no

Andrew Reilly

2007-04-13 03:14:22 UTC

Post by Craig Boston
For something this low level my opinion is it's better to stay with
compile time options. After all, in the above example, cmpxchg8 is a
single machine instruction. How much overhead does it add to retrieve a
variable from memory and check it, then jump to the correct place?
Enough that it outweighs the benefit of using that instruction in the
first place?

Apart from the fact that you are correct, how long is the
instruction encoding of cmpxchg8? Perhaps it could be patched
in at runtime, in place of the call to the emultaion, the way of
on-the-fly linking in shared libraries and some floating point
emulation/inline-ers?

--
Andrew

Bruce Evans

2007-04-13 07:34:56 UTC

Not for cmpxchg8b, at least. It is a remarkably slow instruction. On
AthlonXP's it has an execution latency of 39 cycles. cmpxchg only has an
cmpxchg only has an execution latency of 6 cycles (both without a lock
prefix). I don't know how to avoid using cmpxchg8b short of using a
mutex lock/unlock pair and slightly different semantics, or a generation
count and very different semantics, but without lock prefixes the
mutex pair would be much faster than the cmpxchg8b.

Post by Dag-Erling SmÃ¸rgrav
I don't think it matters.

I agree.

Post by Dag-Erling SmÃ¸rgrav
Contrary to popular belief, atomic
operations are *expensive*.

Doesn't everyone who uses atomic operations knows that they are expensive? :)

Post by Dag-Erling SmÃ¸rgrav
In the best case, on a UP machine, they
stall the pipeline. In the worst case, on an SMP machine, they stall
the entire memory bus.

In the UP case, the pipeline stall is tiny or null. Independent
instructions can still proceed, but CPUs (that have pipelines) usually
can't keep pipelines moving anyway, and atomic instructions just reduce
the chance that they can a little.

Bruce

Dag-Erling Smørgrav

2007-04-13 11:24:20 UTC

Post by Bruce Evans

Contrary to popular belief, atomic operations are *expensive*.

Doesn't everyone who uses atomic operations knows that they are
expensive? :)

Everyone who *uses* them, yes, but not everyone does. I recall an
interesting conversation at Poul-Henning's Varnish presentation in
Milan where someone in the audience relentlessly insisted that
[something or other] really wasn't as big an issue as Poul-Henning
claimed, since atomic operations were so cheap.

DES

--
Dag-Erling Smørgrav - ***@des.no

Dag-Erling Smørgrav

2007-04-13 15:59:52 UTC

Post by Andrew Reilly
Apart from the fact that you are correct, how long is the
instruction encoding of cmpxchg8?

Three bytes (0F C7 m64), four for "lock cmpxchg8" (F0 0F C7 m64). If
the top two bits of m64 are set, you may get "interesting" results :)

DES

--
Dag-Erling Smørgrav - ***@des.no

Rick C. Petty

2007-04-12 19:59:47 UTC

That's why I suggested the second method (to change fn pointers in the
device struct).

Post by Craig Boston
I agree this makes sense for some things, but atomic operations are
supposed to be as fast as possible -- preferably single machine
instructions I can't think of anything short of JIT compiling the kernel
that wouldn't be a high price to pay.

The problem is that ZFS would be compiled (by default) to work for many
platforms, and thus a majority of systems wouldn't get the nice
optimization. That's why I think we should do something along the lines of
doing a check for CX8 and changing the pointers in the vfsops and
vop_vector static structures, depending upon the availability of this
optimization.

I guess it really depends upon how much ZFS uses it; I got the sense that
it is "often".

-- Rick C. Petty

Max Laier

2007-04-08 17:10:36 UTC

ZFS is now also available on pc98 and amd64.

panic: lock "zfs:&zap->zap_f.zap_num_entries_mtx" 0xffffff006582c260
already initialized

While dump/restoreing /usr to zfs. kgdb trace attached. Let me know if
you need further information.

--
/"\ Best regards, | ***@freebsd.org
\ / Max Laier | ICQ #67774661
X http://pf4freebsd.love2party.net/ | ***@EFnet
/ \ ASCII Ribbon Campaign | Against HTML Mail and News

Max Laier

2007-04-08 18:13:59 UTC

Post by Max Laier

ZFS is now also available on pc98 and amd64.

panic: lock "zfs:&zap->zap_f.zap_num_entries_mtx" 0xffffff006582c260
already initialized
While dump/restoreing /usr to zfs. kgdb trace attached. Let me know
if you need further information.

The attached diff lets me survive the dump/restore. Not sure if this is
the right fix, but seems like the union messes with mutex initialization.

--
/"\ Best regards, | ***@freebsd.org
\ / Max Laier | ICQ #67774661
X http://pf4freebsd.love2party.net/ | ***@EFnet
/ \ ASCII Ribbon Campaign | Against HTML Mail and News

Dag-Erling Smørgrav

2007-04-08 18:20:44 UTC

Post by Max Laier
The attached diff lets me survive the dump/restore. Not sure if
this is the right fix, but seems like the union messes with mutex
initialization.

You need to track down where memory for the mutex (or rather zap) was
actually allocated, and stick the memset there. I suspect it
originates on the stack somewhere.

DES

--
Dag-Erling Smørgrav - ***@des.no

Max Laier

2007-04-08 18:43:13 UTC

Post by Max Laier
The attached diff lets me survive the dump/restore. Not sure if
this is the right fix, but seems like the union messes with mutex
initialization.

You need to track down where memory for the mutex (or rather zap) was
actually allocated, and stick the memset there. I suspect it
originates on the stack somewhere.

Well, I assume it is zeroed already, but on the way the other union
members are used which messes up the storage for the mutex. At least

Post by Dag-Erling SmÃ¸rgrav
$2 = {zap_objset = 0xffffff0001406410, zap_object = 12660, zap_dbuf =
0xffffff005ce892d0, zap_rwlock = {lock_object = { lo_name =
0xffffffff8081b416 "zfs:&zap->zap_rwlock", lo_type = 0xffffffff8081b416
"zfs:&zap->zap_rwlock", lo_flags = 41615360, lo_witness_data = {
lod_list = {stqe_next = 0x0}, lod_witness = 0x0}}, sx_lock =
18446742974215086080, sx_recurse = 0}, zap_ismicro = 0, zap_salt =
965910969, zap_u = {zap_fat = {zap_phys = 0xffffffff81670000,
zap_num_entries_mtx = {lock_object = {lo_name = 0x70000 <Address
0x70000 out of bounds>, lo_type = 0x0, lo_flags = 2155822976,
lo_witness_data = {lod_list = {stqe_next = 0x0}, lod_witness = 0x0}},
sx_lock = 1, sx_recurse = 0}, zap_block_shift = 0}, zap_micro =
{zap_phys = 0xffffffff81670000, zap_num_entries = 0, zap_num_chunks =
7, zap_alloc_next = 0, zap_avl = { avl_root = 0x0, avl_compar =
0xffffffff807f3f80 <mze_compare>, avl_offset = 0, avl_numnodes = 1,
avl_size = 0}}}}

--
/"\ Best regards, | ***@freebsd.org
\ / Max Laier | ICQ #67774661
X http://pf4freebsd.love2party.net/ | ***@EFnet
/ \ ASCII Ribbon Campaign | Against HTML Mail and News

Pawel Jakub Dawidek

2007-04-08 18:53:12 UTC

Post by Max Laier

ZFS is now also available on pc98 and amd64.

panic: lock "zfs:&zap->zap_f.zap_num_entries_mtx" 0xffffff006582c260
already initialized
While dump/restoreing /usr to zfs. kgdb trace attached. Let me know if
you need further information.

[...]

Post by Max Laier
#10 0xffffffff80295755 in panic (fmt=0xffffffff80481bc0 "lock \"%s\" %p already initialized") at /usr/src/sys/kern/kern_shutdown.c:547
#11 0xffffffff802bd72e in lock_init (lock=0x0, class=0xffffffff80a11000, name=0xa <Address 0xa out of bounds>,
type=0x1b1196 <Address 0x1b1196 out of bounds>, flags=1048064) at /usr/src/sys/kern/subr_lock.c:201
#12 0xffffffff807f092a in fzap_upgrade (zap=0xffffff006582c200, tx=0xffffff006591dd00)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zap.c:87
#13 0xffffffff807f42d3 in mzap_upgrade (zap=0xffffff006582c200, tx=0xffffff006591dd00)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:361
#14 0xffffffff807f4cd4 in zap_add (os=0x0, zapobj=18446744071572623360, name=0xffffff00060ebc19 "org.eclipse.jdt_3.2.1.r321_v20060905-R4CM1Znkvre9wC-",
integer_size=8, num_integers=1, val=0xffffffffaeeb6860, tx=0xffffff006591dd00)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:622
#15 0xffffffff80802d06 in zfs_link_create (dl=0xffffff0065554140, zp=0xffffff005ccfac08, tx=0xffffff006591dd00, flag=1)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c:564
#16 0xffffffff8080c01c in zfs_mkdir (ap=0xffffffffaeeb6960) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:1474
#17 0xffffffff804490f9 in VOP_MKDIR_APV (vop=0x12, a=0xffffffffaeeb6960) at vnode_if.c:1234
#18 0xffffffff80316195 in kern_mkdir (td=0xffffff000105e000, path=0x5149d1 <Address 0x5149d1 out of bounds>, segflg=15549312, mode=511) at vnode_if.h:653
#19 0xffffffff8041abd0 in syscall (frame=0xffffffffaeeb6c70) at /usr/src/sys/amd64/amd64/trap.c:825
#20 0xffffffff8040206b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:272
#21 0x000000080071969c in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) f 12
#12 0xffffffff807f092a in fzap_upgrade (zap=0xffffff006582c200, tx=0xffffff006591dd00)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zap.c:87
87 mutex_init(&zap->zap_f.zap_num_entries_mtx, NULL, MUTEX_DEFAULT, 0);
(kgdb) p zap
$1 = (zap_t *) 0xffffff006582c200
(kgdb) p *zap
$2 = {zap_objset = 0xffffff0001406410, zap_object = 12660, zap_dbuf = 0xffffff005ce892d0, zap_rwlock = {lock_object = {
lo_name = 0xffffffff8081b416 "zfs:&zap->zap_rwlock", lo_type = 0xffffffff8081b416 "zfs:&zap->zap_rwlock", lo_flags = 41615360, lo_witness_data = {
lod_list = {stqe_next = 0x0}, lod_witness = 0x0}}, sx_lock = 18446742974215086080, sx_recurse = 0}, zap_ismicro = 0, zap_salt = 965910969,
zap_u = {zap_fat = {zap_phys = 0xffffffff81670000, zap_num_entries_mtx = {lock_object = {lo_name = 0x70000 <Address 0x70000 out of bounds>,
lo_type = 0x0, lo_flags = 2155822976, lo_witness_data = {lod_list = {stqe_next = 0x0}, lod_witness = 0x0}}, sx_lock = 1, sx_recurse = 0},
zap_block_shift = 0}, zap_micro = {zap_phys = 0xffffffff81670000, zap_num_entries = 0, zap_num_chunks = 7, zap_alloc_next = 0, zap_avl = {
avl_root = 0x0, avl_compar = 0xffffffff807f3f80 <mze_compare>, avl_offset = 0, avl_numnodes = 1, avl_size = 0}}}}

fzap_upgrade() changes type from 'zap_micro' to 'zap_fat' and union is
used for this (see
sys/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h), that's why we
see this trash:

zap_num_entries_mtx = {lock_object = {lo_name = 0x70000 <Address 0x70000 out of bounds>,
lo_type = 0x0, lo_flags = 2155822976, lo_witness_data = {lod_list = {stqe_next = 0x0},
lod_witness = 0x0}}, sx_lock = 1, sx_recurse = 0},

I already use kmem_zalloc() (note _z_) for zap allocation in
zap_micro.c, so Max is right, that we have to clear this structure here.

I'm quite tired of tracking such problems, because our mechanism for
detecting already initialized locks is too simple (based on one bit), so
I'd prefer to improve it, or just add bzero() to mutex_init().

--
Pawel Jakub Dawidek http://www.wheel.pl
***@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Pawel Jakub Dawidek

2007-04-09 01:07:03 UTC

Post by Pawel Jakub Dawidek
fzap_upgrade() changes type from 'zap_micro' to 'zap_fat' and union is
used for this (see
sys/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h), that's why we
zap_num_entries_mtx = {lock_object = {lo_name = 0x70000 <Address 0x70000 out of bounds>,
lo_type = 0x0, lo_flags = 2155822976, lo_witness_data = {lod_list = {stqe_next = 0x0},
lod_witness = 0x0}}, sx_lock = 1, sx_recurse = 0},
I already use kmem_zalloc() (note _z_) for zap allocation in
zap_micro.c, so Max is right, that we have to clear this structure here.
I'm quite tired of tracking such problems, because our mechanism for
detecting already initialized locks is too simple (based on one bit), so
I'd prefer to improve it, or just add bzero() to mutex_init().

I just committed a fix. Now I do 13 bits check for already initialized
locks detection instead of standard 1 bit check. Could you repeat your
test?

--
Pawel Jakub Dawidek http://www.wheel.pl
***@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Max Laier

2007-04-09 01:59:24 UTC

Post by Pawel Jakub Dawidek
fzap_upgrade() changes type from 'zap_micro' to 'zap_fat' and union
is used for this (see
sys/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h), that's why
zap_num_entries_mtx = {lock_object = {lo_name = 0x70000 <Address
0x70000 out of bounds>, lo_type = 0x0, lo_flags = 2155822976,
lo_witness_data = {lod_list = {stqe_next = 0x0}, lod_witness = 0x0}},
sx_lock = 1, sx_recurse = 0},
I already use kmem_zalloc() (note _z_) for zap allocation in
zap_micro.c, so Max is right, that we have to clear this structure here.
I'm quite tired of tracking such problems, because our mechanism for
detecting already initialized locks is too simple (based on one bit),
so I'd prefer to improve it, or just add bzero() to mutex_init().

I just committed a fix. Now I do 13 bits check for already initialized
locks detection instead of standard 1 bit check. Could you repeat your
test?

Will do tomorrow. Thanks.

--
/"\ Best regards, | ***@freebsd.org
\ / Max Laier | ICQ #67774661
X http://pf4freebsd.love2party.net/ | ***@EFnet
/ \ ASCII Ribbon Campaign | Against HTML Mail and News

Max Laier

2007-04-09 14:13:39 UTC

...

Post by Max Laier

Post by Pawel Jakub Dawidek
I'm quite tired of tracking such problems, because our mechanism
for detecting already initialized locks is too simple (based on one
bit), so I'd prefer to improve it, or just add bzero() to
mutex_init().

I just committed a fix. Now I do 13 bits check for already
initialized locks detection instead of standard 1 bit check. Could
you repeat your test?

Will do tomorrow. Thanks.

Confirmed to work for my testcase.

--
/"\ Best regards, | ***@freebsd.org
\ / Max Laier | ICQ #67774661
X http://pf4freebsd.love2party.net/ | ***@EFnet
/ \ ASCII Ribbon Campaign | Against HTML Mail and News

Bruno Damour

2007-04-08 06:03:11 UTC

hello,

After csup, buildworld fails for me in libumem.
Is this due to zfs import ?
Or my config ?

Thanks for any clue, i'm dying to try your brand new zfs on amd64 !!

Bruno

FreeBSD vil1.ruomad.net 7.0-CURRENT FreeBSD 7.0-CURRENT #0: Fri Mar 23
07:33:56 CET 2007 ***@vil1.ruomad.net:/usr/obj/usr/src/sys/VIL1 amd64

make buildworld:

===> cddl/lib/libumem (all)
cc -O2 -fno-strict-aliasing -pipe -march=nocona
-I/usr/src/cddl/lib/libumem/../../../compat/opensolaris/lib/libumem
-D_SOLARIS_C_SOURCE -c /usr/src/cddl/lib/libumem/umem.c
/usr/src/cddl/lib/libumem/umem.c:197: error: redefinition of 'nofail_cb'
/usr/src/cddl/lib/libumem/umem.c:30: error: previous definition of
'nofail_cb' was here
/usr/src/cddl/lib/libumem/umem.c:199: error: redefinition of `struct
umem_cache'
/usr/src/cddl/lib/libumem/umem.c:210: error: redefinition of 'umem_alloc'
/usr/src/cddl/lib/libumem/umem.c:43: error: previous definition of
'umem_alloc' was here
/usr/src/cddl/lib/libumem/umem.c:233: error: redefinition of 'umem_zalloc'
/usr/src/cddl/lib/libumem/umem.c:66: error: previous definition of
'umem_zalloc' was here
/usr/src/cddl/lib/libumem/umem.c:256: error: redefinition of 'umem_free'
/usr/src/cddl/lib/libumem/umem.c:89: error: previous definition of
'umem_free' was here
/usr/src/cddl/lib/libumem/umem.c:264: error: redefinition of
'umem_nofail_callback'
/usr/src/cddl/lib/libumem/umem.c:97: error: previous definition of
'umem_nofail_callback' was here
/usr/src/cddl/lib/libumem/umem.c:272: error: redefinition of
'umem_cache_create'
/usr/src/cddl/lib/libumem/umem.c:105: error: previous definition of
'umem_cache_create' was here
/usr/src/cddl/lib/libumem/umem.c:291: error: redefinition of
'umem_cache_alloc'
/usr/src/cddl/lib/libumem/umem.c:124: error: previous definition of
'umem_cache_alloc' was here
/usr/src/cddl/lib/libumem/umem.c:321: error: redefinition of
'umem_cache_free'
/usr/src/cddl/lib/libumem/umem.c:154: error: previous definition of
'umem_cache_free' was here
/usr/src/cddl/lib/libumem/umem.c:332: error: redefinition of
'umem_cache_destroy'
/usr/src/cddl/lib/libumem/umem.c:165: error: previous definition of
'umem_cache_destroy' was here
/usr/src/cddl/lib/libumem/umem.c:364: error: redefinition of 'nofail_cb'
/usr/src/cddl/lib/libumem/umem.c:197: error: previous definition of
'nofail_cb' was here
/usr/src/cddl/lib/libumem/umem.c:364: error: redefinition of 'nofail_cb'
/usr/src/cddl/lib/libumem/umem.c:197: error: previous definition of
'nofail_cb' was here
/usr/src/cddl/lib/libumem/umem.c:366: error: redefinition of `struct
umem_cache'
/usr/src/cddl/lib/libumem/umem.c:377: error: redefinition of 'umem_alloc'
/usr/src/cddl/lib/libumem/umem.c:210: error: previous definition of
'umem_alloc' was here
/usr/src/cddl/lib/libumem/umem.c:377: error: redefinition of 'umem_alloc'
/usr/src/cddl/lib/libumem/umem.c:210: error: previous definition of
'umem_alloc' was here
/usr/src/cddl/lib/libumem/umem.c:400: error: redefinition of 'umem_zalloc'
/usr/src/cddl/lib/libumem/umem.c:233: error: previous definition of
'umem_zalloc' was here
/usr/src/cddl/lib/libumem/umem.c:400: error: redefinition of 'umem_zalloc'
/usr/src/cddl/lib/libumem/umem.c:233: error: previous definition of
'umem_zalloc' was here
/usr/src/cddl/lib/libumem/umem.c:423: error: redefinition of 'umem_free'
/usr/src/cddl/lib/libumem/umem.c:256: error: previous definition of
'umem_free' was here
/usr/src/cddl/lib/libumem/umem.c:423: error: redefinition of 'umem_free'
/usr/src/cddl/lib/libumem/umem.c:256: error: previous definition of
'umem_free' was here
/usr/src/cddl/lib/libumem/umem.c:431: error: redefinition of
'umem_nofail_callback'
/usr/src/cddl/lib/libumem/umem.c:264: error: previous definition of
'umem_nofail_callback' was here
/usr/src/cddl/lib/libumem/umem.c:431: error: redefinition of
'umem_nofail_callback'
/usr/src/cddl/lib/libumem/umem.c:264: error: previous definition of
'umem_nofail_callback' was here
/usr/src/cddl/lib/libumem/umem.c:439: error: redefinition of
'umem_cache_create'
/usr/src/cddl/lib/libumem/umem.c:272: error: previous definition of
'umem_cache_create' was here
/usr/src/cddl/lib/libumem/umem.c:439: error: redefinition of
'umem_cache_create'
/usr/src/cddl/lib/libumem/umem.c:272: error: previous definition of
'umem_cache_create' was here
/usr/src/cddl/lib/libumem/umem.c:458: error: redefinition of
'umem_cache_alloc'
/usr/src/cddl/lib/libumem/umem.c:291: error: previous definition of
'umem_cache_alloc' was here
/usr/src/cddl/lib/libumem/umem.c:458: error: redefinition of
'umem_cache_alloc'
/usr/src/cddl/lib/libumem/umem.c:291: error: previous definition of
'umem_cache_alloc' was here
/usr/src/cddl/lib/libumem/umem.c:488: error: redefinition of
'umem_cache_free'
/usr/src/cddl/lib/libumem/umem.c:321: error: previous definition of
'umem_cache_free' was here
/usr/src/cddl/lib/libumem/umem.c:488: error: redefinition of
'umem_cache_free'
/usr/src/cddl/lib/libumem/umem.c:321: error: previous definition of
'umem_cache_free' was here
/usr/src/cddl/lib/libumem/umem.c:499: error: redefinition of
'umem_cache_destroy'
/usr/src/cddl/lib/libumem/umem.c:332: error: previous definition of
'umem_cache_destroy' was here
/usr/src/cddl/lib/libumem/umem.c:499: error: redefinition of
'umem_cache_destroy'
/usr/src/cddl/lib/libumem/umem.c:332: error: previous definition of
'umem_cache_destroy' was here
*** Error code 1

Stop in /usr/src/cddl/lib/libumem.
*** Error code 1

Stop in /usr/src/cddl/lib.
*** Error code 1

Stop in /usr/src.
*** Error code 1

Stop in /usr/src.
*** Error code 1

Stop in /usr/src.
*** Error code 1

Stop in /usr/src.

Pawel Jakub Dawidek

2007-04-08 09:49:31 UTC

Post by Bruno Damour
hello,
After csup, buildworld fails for me in libumem.
Is this due to zfs import ?
Or my config ?
Thanks for any clue, i'm dying to try your brand new zfs on amd64 !!
Bruno
===> cddl/lib/libumem (all)
cc -O2 -fno-strict-aliasing -pipe -march=nocona -I/usr/src/cddl/lib/libumem/../../../compat/opensolaris/lib/libumem -D_SOLARIS_C_SOURCE -c /usr/src/cddl/lib/libumem/umem.c
/usr/src/cddl/lib/libumem/umem.c:197: error: redefinition of 'nofail_cb'
/usr/src/cddl/lib/libumem/umem.c:30: error: previous definition of 'nofail_cb' was here
/usr/src/cddl/lib/libumem/umem.c:199: error: redefinition of `struct umem_cache'
/usr/src/cddl/lib/libumem/umem.c:210: error: redefinition of 'umem_alloc'
/usr/src/cddl/lib/libumem/umem.c:43: error: previous definition of 'umem_alloc' was here

Did you use my previous patches? There is no cddl/lib/libumem/umem.c is
HEAD, it was it's old location and it was moved to
compat/opensolaris/lib/libumem/. Delete your entire cddl/ directory and
recsup.

--
Pawel Jakub Dawidek http://www.wheel.pl
***@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Matthew Dillon

2007-04-08 18:38:14 UTC

:Hi.
:
:I'm happy to inform that the ZFS file system is now part of the FreeBSD
:operating system. ZFS is available in the HEAD branch and will be
:available in FreeBSD 7.0-RELEASE as an experimental feature.

Congratulations on your excellent work, Pawel!

-Matt

Jeremie Le Hen

2007-04-09 17:57:40 UTC

Hi,

Thank you very much for the work Pawel. This is great news.

BTW, does anyone have preliminary performances tests ? I can't do them
has I have no spare disk currently.

Thank you.
Best regards,

--
Jeremie Le Hen
< jeremie at le-hen dot org >< ttz at chchile dot org >

Andrey V. Elsukov

2007-04-10 05:17:16 UTC

Post by Pawel Jakub Dawidek
Limitations.
Currently ZFS is only compiled as kernel module and is only available
for i386 architecture. Amd64 should be available very soon, the other
archs will come later, as we implement needed atomic operations.
Missing functionality.
- We don't have iSCSI target daemon in the tree, so sharing ZVOLs via
iSCSI is also not supported at this point. This should be fixed in
the future, we may also add support for sharing ZVOLs over ggate.
- There is no support for ACLs and extended attributes.
- There is no support for booting off of ZFS file system.
Other than that, ZFS should be fully-functional.

Hi, Pawel. Thanks for the great work!

1. I have an yesterday's CURRENT and I get a `kmem_map too small`
panic when try to copy /usr/src to ZFS partition with enabled
compression. (I have 512M of RAM)

2. I've tried snapshots. Seems that all work good. I have one
question: .zfs directory should be invisible? I can `cd .zfs`
and see it's content, but may be .zfs should be visible like
an ufs's .snap?

--
WBR, Andrey V. Elsukov

Kris Kennaway

2007-04-10 06:10:07 UTC

Hi, Pawel. Thanks for the great work!
1. I have an yesterday's CURRENT and I get a `kmem_map too small`
panic when try to copy /usr/src to ZFS partition with enabled
compression. (I have 512M of RAM)

See discussion in many other emails (e.g. mine). Also cvs update.

Post by Andrey V. Elsukov
2. I've tried snapshots. Seems that all work good. I have one
question: .zfs directory should be invisible? I can `cd .zfs`
and see it's content, but may be .zfs should be visible like
an ufs's .snap?

I think this is controlled by the 'snapdir' property, see p80 of the
admin guide.

Kris

Kris Kennaway

2007-04-10 06:38:17 UTC

Post by Kris Kennaway

I think this is controlled by the 'snapdir' property, see p80 of the
admin guide.

Isn't that default 'hidden' ?

I thought that is what the claim was, and the question was how to make
it visible :)

Kris

Andrey V. Elsukov

2007-04-10 07:03:49 UTC

Post by Kris Kennaway
I thought that is what the claim was, and the question was how to make
it visible :)

Yes, thanks for the answer. Now i've been locked up in the "zfs"
state :)

How to repeat:
# zfs set snapdir=visible media/disk3/src
# ls -la media/disk3/src/.zfs

--
WBR, Andrey V. Elsukov

Kris Kennaway

2007-04-10 07:06:28 UTC

Post by Kris Kennaway
I thought that is what the claim was, and the question was how to make
it visible :)

Yes, thanks for the answer. Now i've been locked up in the "zfs"
state :)
# zfs set snapdir=visible media/disk3/src
# ls -la media/disk3/src/.zfs

\o/

You might need to recompile with DEBUG_LOCKS and DEBUG_VFS_LOCKS and
do 'show lockedvnods', but maybe this is trivially reproducible.

Kris

Post by Andrey V. Elsukov
--
WBR, Andrey V. Elsukov
UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
0 1455 1127 0 -4 0 18032 3896 zfs D+ v0 0:00,01 mc
0 1462 1457 0 -4 0 3416 1236 zfs T+ p0 0:00,00 ls -lA
db> trace 1462
Tracing pid 1462 tid 100061 td 0xc29c81b0
sched_switch(c29c81b0,0,1) at sched_switch+0xc7
mi_switch(1,0) at mi_switch+0x1d4
sleepq_switch(c3404d18) at sleepq_switch+0x8a
sleepq_wait(c3404d18,50,0,0,c07315e4,...) at sleepq_wait+0x36
_sleep(c3404d18,c076f340,50,c2946f6f,0,...) at _sleep+0x24d
acquire(d3b2e728,80,60000,d3b2e708,d3b2e70c,...) at acquire+0x73
_lockmgr(c3404d18,3002,c3404d48,c29c81b0,c2941e7b,...) at _lockmgr+0x442
vop_stdlock(d3b2e770) at vop_stdlock+0x27
_VOP_LOCK_APV(c294abc0,d3b2e770) at _VOP_LOCK_APV+0x38
_vn_lock(c3404cc0,1002,c29c81b0,c2941e7b,c4,...) at _vn_lock+0xf8
domount(c29c81b0,c3404cc0,c2946f6f,c2ce87c0,d3b2e85c,...) at domount+0xfd
zfsctl_snapdir_lookup(d3b2eacc) at zfsctl_snapdir_lookup+0x1ac
VOP_LOOKUP_APV(c294adc0,d3b2eacc) at VOP_LOOKUP_APV+0x43
lookup(d3b2eb50) at lookup+0x4c0
namei(d3b2eb50) at namei+0x2d2
kern_lstat(c29c81b0,2821c268,0,d3b2ec24) at kern_lstat+0x47
lstat(c29c81b0,d3b2ed00) at lstat+0x1b
syscall(d3b2ed38) at syscall+0x29e
Xint0x80_syscall() at Xint0x80_syscall+0x20
--- syscall (190, FreeBSD ELF32, lstat), eip = 0x2818d267, esp = 0xbfbfe3fc, ebp
= 0xbfbfe498 ---
db> cont

Andrey V. Elsukov

2007-04-10 08:12:21 UTC

Post by Kris Kennaway
\o/
You might need to recompile with DEBUG_LOCKS and DEBUG_VFS_LOCKS and
do 'show lockedvnods', but maybe this is trivially reproducible.

I've rollbacked and destroyed this snapshot and now don't have this
problem. But i have several LOR.

--
WBR, Andrey V. Elsukov

Kris Kennaway

2007-04-10 18:58:21 UTC

Post by Kris Kennaway
\o/
You might need to recompile with DEBUG_LOCKS and DEBUG_VFS_LOCKS and
do 'show lockedvnods', but maybe this is trivially reproducible.

I've rollbacked and destroyed this snapshot and now don't have this
problem. But i have several LOR.

Some of these are already known, at least. Also please try to
recreate the deadlock.

Thanks,
Kris

Rong-en Fan

2007-04-10 06:35:25 UTC

Post by Kris Kennaway

Hi, Pawel. Thanks for the great work!
1. I have an yesterday's CURRENT and I get a `kmem_map too small`
panic when try to copy /usr/src to ZFS partition with enabled
compression. (I have 512M of RAM)

See discussion in many other emails (e.g. mine). Also cvs update.

I think this is controlled by the 'snapdir' property, see p80 of the
admin guide.

Isn't that default 'hidden' ?

Regards,
Rong-En Fan

Post by Kris Kennaway
Kris

Scot Hetzel

2007-04-10 06:28:25 UTC

:
Snapshots
:

File system snapshots can be accessed under the ".zfs/snapshot" direc-
tory in the root of the file system. Snapshots are automatically
mounted on demand and may be unmounted at regular intervals. The visi-
bility of the ".zfs" directory can be controlled by the "snapdir" prop-
erty.
:
snapdir=hidden | visible

Controls whether the ".zfs" directory is hidden or visible in the
root of the file system as discussed in the "Snapshots" section.
The default value is "hidden".

Scot

--
DISCLAIMER:
No electrons were mamed while sending this message. Only slightly bruised.

Oliver Fromme

2007-04-12 09:58:17 UTC

The Geode is a 486, and does not support it.

No, it's a 586-class processor. But you're right in
that it does not seem to support cmpxchg8b. I have an
old 233 MHz Geode currently running FreeBSD 4.6 (please
no comments, it's my standalone mp3 player at home and
not connected to the internet so I didn't care to update
it yet, but I certainly will update it when I have some
time). The kernel reports:

CPU: Cyrix GXm (232.74-MHz 586-class CPU)
Origin = "CyrixInstead" Id = 0x540 DIR=0x8246 Stepping=8 Revision=2

There's no "Features=" line, though. Maybe the Geode
does not support the cpuid at all. Whether it supports
cmpxchg8b is not 100% clear, but my guess would be "no".

Post by Dag-Erling SmÃ¸rgrav
The C3 however is a 586.

In fact it's a 686.

Post by Dag-Erling SmÃ¸rgrav
The C3 Ezra and C3 Samuel / Samuel 2 do not have CX8.
I'm not sure about the C3 Nehemiah, I don't have one
running at the moment.

I have a 1000 MHz C3 Nehemiah which is my home file server
(NFS and SMB), among other things (Squid, Apache, FW).
It does not support cmpxchg8b either, according to the
cpuid feature bits:

CPU: VIA C3 Nehemiah+RNG+AES (1002.28-MHz 686-class CPU)
Origin = "CentaurHauls" Id = 0x698 Stepping = 8
Features=0x381b83f<FPU,VME,DE,PSE,TSC,MSR,SEP,MTRR,PGE,CMOV,PAT,MMX,FXSR,SSE>

It's currently running 6-stable, but I would very much
like to update it to -current and use ZFS for the file
server volumes. I hope the absence of cmpxchg8b won't
make that impossible.

(It has 512 MB RAM, which should be sufficient to run
ZFS, right? The squid process also takes quite some
memory, but I've configured it to be rather small.
After all this is only a private home server. I'm not
planning to use compression, but maybe encryption (GELI)
for a small part of it.)

That wouldn't work on the C3 Nehemiah, I'm afraid. CPU_486
is not defined there (in fact I only have I686_CPU in my
kernel config), but it does not support cmpxchg8b according
to the dmesg output above. So the CPU class alone is not
sufficient to decide about the use of cmpxchg8b; you have
to check the actual CPU Features bit.

Best regards
Oliver
--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

C++: "an octopus made by nailing extra legs onto a dog"
-- Steve Taylor, 1998

Oliver Fromme

2007-04-12 10:18:00 UTC