From: Daniel Lezcano <daniel.lezcano@free•fr>
To: "Eric W. Biederman" <ebiederm@xmission•com>
Cc: Pavel Emelyanov <xemul@parallels•com>,
Sukadev Bhattiprolu <sukadev@linux•vnet.ibm.com>,
Serge Hallyn <serue@us•ibm.com>,
Linux Netdev List <netdev@vger•kernel.org>,
containers@lists•linux-foundation.org,
Netfilter Development Mailinglist
<netfilter-devel@vger•kernel.org>,
Ben Greear <greearb@candelatech•com>
Subject: Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
Date: Sat, 06 Mar 2010 15:47:55 +0100
Message-ID: <4B926B1B.5070207@free.fr>
In-Reply-To: <m1vdda1pmx.fsf@fess.ebiederm.org>
Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels•com> writes:
>
>
>>> 2 parallel enters? I meant you have pid 0 in the entered pid namespace.
>>> You have pid 0 because your pid simply does not map.
>>>
>> Oh, I see.
>>
>>
>>> There is nothing that makes two parallel enters impossible in that.
>>> Even today we have one thread per cpu that has task->pid == &init_struct_pid
>>> which is pid 0.
>>>
>> How about the forked processes then? Who will be their parent?
>>
>
> The normal rules of parentage apply. So the child will simply
> see its parent as ppid == 0. If that child daemonizes it will become
> a child of the pid namespace's init.
>
> This is a lot like something that gets started from call_usermodehelper. Its
> parent process is not a descendant of init either.
>
>
> The implementation of the join is to simply change current->nsproxy->pid_ns.
> Then to use it you simply fork to get a child in the target pid namespace.
>
If the normal rules of parentage apply, that means pid 0 has to wait
for its child.
Take the scenario of pid 0 with its child pid 1234: if we kill pid 1 of
the pid namespace, I suppose pid 1234 will be killed too.
Pid 0 will stay in the pid namespace and will be able to fork a new
pid 1.
I think Serge already reported that...
That sounds good :)
>>> For the case of unshare where we are designed to be used with PAM I don't
>>> think my proposed semantics work. For a join, needing an extra fork before
>>> you are really in the pid namespace should be minor.
>>>
>> Hm... One more proposal - can we adopt the planned new fork_with_pids system
>> call to fork the process right into a new pid namespace?
>>
>
> In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
> don't think anything I am doing fundamentally undermines it. The use
> case of doing things in fork is that there is automatic inheritance of
> everything. All of the namespaces and all of the control groups, and
> possibly also the parent process.
And also the rootfs for executing the command inside the container (e.g.
shutdown), the uid/gid (if there is a user namespace), the mount points, ...
But I suppose we can do the same with setns for all the namespaces and
by chrooting into the container rootfs.
The problem I see is with the tty. For example, if we cloneat the init
process of the container, which is usually /sbin/init, its tty is mapped
to /dev/console, so the output of the exec'ed command will go to the
console.
> It does have the high cost that the
> process we are copying from must be stopped because there are no locks
> that let us take everything. I haven't looked at the recent proposals
> to see if anyone has solved that problem cleanly.
>
Right.
> If we can do a sys_hijack/sys_cloneat style of join, that means we can
> afford a fork. At which point my proposed pid namespace semantics
> should be fine.
>
> aka:
> setns(NSTYPE_PID);
> pid = fork();
> if (pid == 0) {
>         getpid() == 2; /* Or whatever the first free pid is in the joined pid namespace */
>         getppid() == 0;
> } else {
>         pid == 6400; /* Or whatever the first free pid is in the original pid namespace */
>         waitpid(pid);
> }
>
>
>>> That doesn't handle the case of cached struct pids. A good example is
>>> waitpid, where it waits for a specific struct pid. Which means that
>>> allocating a new struct pid and changing task->pid will cause
>>> waitpid(pid) to wait forever...
>>>
>> OK. Good example. Thanks.
>>
>>
>>> To change struct pid would require the refcount on struct pid to show
>>> no references from anywhere except the task_struct.
>>>
>> I think this is OK to return -EBUSY for this. And fix the waitpid
>> respectively not to block this common case. All the others I think
>> can stay as is.
>>
>
> That would probably work. setsid() and setpgrp() have similar sorts
> of restrictions. That is both more challenging and more limiting than
> the semantics that come out of my unshare(CLONE_NEWPID) patch. So I
> would prefer to keep this sort of thing as a last resort.
>
>
>>> At the cost of a little memory we can solve that problem for unshare
>>> if we have an extra upid in struct pid, how we verify there is space
>>> in struct pid I'm not certain.
>>>
>>> I do think that at least until someone calls exec the namespace pids are
>>> reported to the process itself should not change. That is kill and
>>>
>> Wait a second - in that case the wait will be blocked too! No?
>>
>
> If all we do is populate an unused struct upid in struct pid there
> isn't a chance of a problem.
>
>
>>> waitpid etc. Which suggests an implementation the opposite of what
>>> I proposed. With ns_of_pid(task_pid(current)) being used as the
>>> pid namespace of children, and current->nsproxy->pid_ns not changing
>>> in the case of unshare.
>>>
>>> Shrug.
>>>
>>> Or perhaps this is a case where we can implement join with
>>> an extra process but we can't implement unshare, because the effect
>>> cannot be immediate.
>>>
>> Well, I'm talking only about the join now.
>>
>
> Overall it sounds like the semantics I have proposed with
> unshare(CLONE_NEWPID) are workable, and simple to implement. The
> extra fork is a bit surprising but it certainly does not
> look like a show stopper for implementing a pid namespace join.
>
I agree, it's some kind of "ghost" process.
IMO, with a bit of userspace code it would be possible to enter or exec
a command inside a container with nsfd, setns.
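As a rough illustration of that "bit of userspace code", here is a
minimal sketch written against setns(2) as it exists in current Linux
kernels, not against the nsfd/setns prototype from the patch under
discussion. The helper name run_in_pid_ns() and the way the namespace
path is passed in are assumptions made for the example. The caller needs
CAP_SYS_ADMIN, and only the pid namespace is joined here; the other
namespaces and the chroot are left out.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv inside the pid namespace named by
 * ns_path (typically "/proc/<pid>/ns/pid", chosen by the caller).
 * Returns the command's exit status, or -1 on any failure.
 */
int run_in_pid_ns(const char *ns_path, char *const argv[])
{
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* The caller's own pid does not change; only children forked
     * after this call are created in the target namespace. */
    if (setns(fd, CLONE_NEWPID) < 0) {
        close(fd);
        return -1;
    }
    close(fd);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: now a member of the target pid namespace, with a
         * parent that lives outside of it. */
        execvp(argv[0], argv);
        _exit(127);
    }

    /* Parent: the normal rules of parentage apply, so reap the child. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass something like "/proc/1234/ns/pid" together with
the command to run, where 1234 stands for the container's init as seen
from the host.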
+1 to test your patchset Eric :)
Just a mindless suggestion: wouldn't "nsopen" / "nsattach" be clearer
syscall names?
Jumping back, one question about the nsfd and polling it to wait for
the end of the namespace.
If we have an opened file descriptor on a specific namespace, we grab
a reference on it, so the namespace won't be destroyed until we close
the fd, which is the very fd used to poll for the end of the namespace,
no? Did I miss something?
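For reference, in current Linux kernels a descriptor opened on
/proc/<pid>/ns/<type> is documented to keep the namespace alive while
it stays open. The sketch below does not prove the pinning; it only
shows, under that assumption, that such a descriptor is a handle on the
namespace object itself, by comparing the (device, inode) pair seen
through the fd with the pair seen through the path. The helper name
ns_fd_matches_path() is made up for the example, and this is the /proc
interface, not the nsfd syscall proposed in the patch.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the open descriptor and the path
 * refer to the same namespace object, 0 if they differ, -1 on error.
 */
int ns_fd_matches_path(const char *ns_path)
{
    struct stat by_fd, by_path;
    int rc = -1;

    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* Two namespace handles are the same namespace when both the
     * device and the inode number match. */
    if (fstat(fd, &by_fd) == 0 && stat(ns_path, &by_path) == 0)
        rc = (by_fd.st_dev == by_path.st_dev &&
              by_fd.st_ino == by_path.st_ino);

    close(fd);
    return rc;
}
```

Calling it on "/proc/self/ns/pid" needs no privileges, since a process
can always open its own namespace files.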
Thanks
-- Daniel