public inbox for netdev@vger.kernel.org 
 help / color / mirror / Atom feed
From: Daniel Lezcano <daniel.lezcano@free•fr>
To: "Eric W. Biederman" <ebiederm@xmission•com>
Cc: Pavel Emelyanov <xemul@parallels•com>,
	Sukadev Bhattiprolu <sukadev@linux•vnet.ibm.com>,
	Serge Hallyn <serue@us•ibm.com>,
	Linux Netdev List <netdev@vger•kernel.org>,
	containers@lists•linux-foundation.org,
	Netfilter Development Mailinglist
	<netfilter-devel@vger•kernel.org>,
	Ben Greear <greearb@candelatech•com>
Subject: Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
Date: Sat, 06 Mar 2010 15:47:55 +0100	[thread overview]
Message-ID: <4B926B1B.5070207@free.fr> (raw)
In-Reply-To: <m1vdda1pmx.fsf@fess.ebiederm.org>

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels•com> writes:
>
>   
>>> 2 parallel enters?  I meant you have pid 0 in the entered pid namespace.
>>> You have pid 0 because your pid simply does not map.
>>>       
>> Oh, I see.
>>
>>     
>>> There is nothing that makes to parallel enters impossible in that.
>>> Even today we have one thread per cpu that has task->pid == &init_struct_pid
>>> which is pid 0.
>>>       
>> How about the forked processes then? Who will be their parent?
>>     
>
> The normal rules of parentage apply.   So the child will see simply
> see it's parent as ppid == 0.  If that child daemonizes it will become
> a child of the pid namespaces init.
>
> This is a lot like something that gets started from call_usermodehelper.  It's
> parent process is not a descendant of init either.
>
>
> The implementation of the join is to simply change current->nsproxy->pid_ns.
> Then to use it you simply fork to get a child in the target pid namespace.
>   
If the normal rules of parentage apply, that means pid 0 has to wait 
it's child.
If we are in the scenario of pid 0, it's child pid 1234 and we kill the 
pid 1 of the pid namespace, I suppose pid 1234 will be killed too.
The pid 0 will stay in the pid namespace and will able to fork again a 
new pid 1.

I think Serge already reported that...

That sounds good :)
>>> For the case of unshare where we are designed to be used with PAM I don't
>>> think my proposed semantics work.  For a join needed an extra fork before
>>> you are really in the pid namespace should be minor.
>>>       
>> Hm... One more proposal - can we adopt the planned new fork_with_pids system
>> call to fork the process right into a new pid namespace?
>>     
>
> In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
> don't think anything I am doing fundamentally undermines it.  The use
> case of doing things in fork is that there is automatic inheritance of
> everything.  All of the namespaces and all of the control groups, and
> possibly also the parent process.  
And also the rootfs for executing the command inside the container (eg. 
shutdown), the uid/gid (if there is a user namespace), the mount points, ...
But I suppose we can do the same with setns for all the namespaces and 
chrooting within the container rootfs.

What I see is a problem with the tty. For example, we cloneat the init 
process of the container which is usually /sbin/init but this one has 
its tty mapped to /dev/console, so the output of the exec'ed command 
will go to the console.
> It does have the high cost that the
> process we are copying from must be stopped because there are no locks
> that let us take everything.  I haven't looked at the recent proposals
> to see if anyone has solved that problem cleanly.
>   
Right.

> If we can do a sys_hijack/sys_cloneat style of join, that means we can
> afford a fork.  At which point the my proposed pid namespace semantics
> should be fine.
>
> aka:
> setns(NSTYPE_PID);
> pid = fork();
> if (pid == 0) {
> 	getpid() == 2; /* Or whatever the first free pid is joined pid namespace */
>         getppid() == 0;
> } else {
> 	pid == 6400; /* Or whatever the first free pid is in the original pid namespace */
> 	waitpid(pid);
> }
>
>   
>>> That doesn't handle the case of cached struct pids.  A good example is
>>> waitpid, where it waits for a specific struct pid.  Which means that
>>> allocating a new struct pid and changing task->pid will cause
>>> waitpid(pid) to wait forever...
>>>       
>> OK. Good example. Thanks.
>>
>>     
>>> To change struct pid would require the refcount on struct pid to show
>>> no references from anywhere except the task_struct.
>>>       
>> I think this is OK to return -EBUSY for this. And fix the waitpid
>> respectively not to block this common case. All the others I think
>> can be stayed as is.
>>     
>
> That would probably work.  setsid() and setpgrp() have similar sorts
> of restrictions.  That is both more challenging and more limiting than
> the semantics that come out of my unshare(CLONE_NEWPID) patch.  So I
> would prefer to keep this sort of thing as a last resort.
>
>   
>>> At the cost of a little memory we can solve that problem for unshare
>>> if we have a an extra upid in struct pid, how we verify there is space
>>> in struct pid I'm not certain.
>>>
>>> I do think that at least until someone calls exec the namespace pids are
>>> reported to the process itself should not change.  That is kill and
>>>       
>> Wait a second - in that case the wait will be blocked too! No?
>>     
>
> If all we do is populate an unused struct upid in struct pid there
> isn't a chance of a problem.  
>
>   
>>> waitpid etc.  Which suggests an implementation the opposite of what
>>> I proposed.  With ns_of_pid(task_pid(current)) being used as the
>>> pid namespace of children, and current->nsproxy->pid_ns not changing
>>> in the case of unshare.
>>>
>>> Shrug.
>>>
>>> Or perhaps this is a case where we use we can implement join with
>>> an extra process but we can't implement unshare, because the effect
>>> cannot be immediate.
>>>       
>> Well, I'm talking only about the join now.
>>     
>
> Overall it sounds like the semantics I have proposed with
> unshare(CLONE_NEWPID) are workable, and simple to implement.  The
> extra fork is a bit surprising but it certainly does not
> look like a show stopper for implementing a pid namespace join.
>   
I agree, it's some kind of "ghost" process.
IMO, with a bit of userspace code it would be possible to enter or exec 
a command inside a container with nsfd, setns.

+1 to test your patchset Eric :)

Just a mindless suggestion, the "nsopen" / "nsattach" syscall names 
should be more clear no ?

Jumping back, one question about the nsfd and the poll for waiting the 
end of the namespace.
If we have an openened file descriptor on a specific namespace, we grab 
a reference on this one, so the namespace won't be destroyed until we 
close the fd which is used to poll the end of the namespace, no ? Did I 
miss something ?

Thanks
  -- Daniel

  reply	other threads:[~2010-03-06 14:48 UTC|newest]

Thread overview: 94+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-01-14 14:05 RFC: netfilter: nf_conntrack: add support for "conntrack zones" Patrick McHardy
2010-01-14 15:05 ` jamal
2010-01-14 15:37   ` Patrick McHardy
2010-01-14 17:33     ` jamal
2010-01-15 10:15       ` Patrick McHardy
2010-01-15 15:19         ` jamal
2010-02-22 20:46           ` Eric W. Biederman
2010-02-22 21:55             ` jamal
2010-02-22 23:17               ` Eric W. Biederman
     [not found]                 ` <m1wry46es9.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-23 13:27                   ` jamal
2010-02-23 14:07                     ` Eric W. Biederman
2010-02-23 14:20                       ` jamal
2010-02-23 20:00                         ` Eric W. Biederman
2010-02-23 23:09                           ` jamal
2010-02-24  1:43                             ` Eric W. Biederman
2010-02-25 20:57                             ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Eric W. Biederman
2010-02-25 21:31                               ` Daniel Lezcano
2010-02-25 21:49                                 ` Eric W. Biederman
2010-02-25 22:13                                   ` Daniel Lezcano
2010-02-25 22:31                                     ` Eric W. Biederman
2010-02-26 20:35                                   ` Eric W. Biederman
2010-02-25 21:46                               ` Matt Helsley
2010-02-25 21:54                                 ` Eric W. Biederman
2010-02-26  0:53                                 ` Eric W. Biederman
2010-02-26  1:09                               ` Matt Helsley
2010-02-26  1:26                                 ` Eric W. Biederman
2010-02-26  3:15                               ` [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2 Eric W. Biederman
     [not found]                                 ` <m18wagy9f3.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-03 20:29                                   ` Jonathan Corbet
2010-03-03 20:50                                     ` Eric W. Biederman
     [not found]                               ` <m1pr3t2fvl.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-26 21:13                                 ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Pavel Emelyanov
2010-02-26 21:24                                   ` Eric W. Biederman
2010-02-26 21:34                                     ` Pavel Emelyanov
2010-02-26 21:42                                       ` Eric W. Biederman
2010-02-26 21:58                                         ` Oren Laadan
2010-02-26 22:16                                           ` Eric W. Biederman
2010-02-26 22:52                                             ` Oren Laadan
2010-02-26 23:13                                               ` Eric W. Biederman
2010-02-27  8:30                                         ` Pavel Emelyanov
2010-02-27  9:04                                           ` Eric W. Biederman
     [not found]                                             ` <m1mxyvrqvk.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-27  9:21                                               ` Pavel Emelyanov
2010-02-27  9:42                                                 ` Eric W. Biederman
2010-02-27 16:16                                                   ` Pavel Emelyanov
2010-02-27 19:08                                                     ` Eric W. Biederman
2010-02-27 19:29                                                       ` Pavel Emelyanov
2010-02-27 19:44                                                         ` Eric W. Biederman
2010-02-28 22:05                                                           ` Daniel Lezcano
2010-03-01 19:24                                                             ` Eric W. Biederman
2010-03-01 21:42                                                             ` Eric W. Biederman
2010-03-02 13:10                                                               ` Cedric Le Goater
2010-03-02 15:03                                                             ` Pavel Emelyanov
2010-03-02 15:14                                                               ` Jan Engelhardt
2010-03-02 21:45                                                                 ` Eric W. Biederman
2010-03-02 21:19                                                               ` Sukadev Bhattiprolu
2010-03-02 22:13                                                                 ` Eric W. Biederman
2010-03-03  0:07                                                                   ` Sukadev Bhattiprolu
2010-03-03  0:46                                                                     ` Eric W. Biederman
2010-03-03 15:38                                                                       ` Serge E. Hallyn
2010-03-03 19:47                                                                         ` Eric W. Biederman
2010-03-04 21:45                                                                           ` Eric W. Biederman
2010-03-04 22:55                                                                             ` Jan Engelhardt
2010-03-03 16:50                                                                       ` Pavel Emelyanov
2010-03-03 20:16                                                                         ` Eric W. Biederman
2010-03-05 19:18                                                                           ` Pavel Emelyanov
2010-03-05 20:26                                                                             ` Eric W. Biederman
2010-03-06 14:47                                                                               ` Daniel Lezcano [this message]
     [not found]                                                                                 ` <4B926B1B.5070207-GANU6spQydw@public.gmane.org>
2010-03-06 20:48                                                                                   ` Eric W. Biederman
     [not found]                                                                                     ` <m1aaulyy5c.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-06 21:26                                                                                       ` Daniel Lezcano
2010-03-08  8:32                                                                                         ` Eric W. Biederman
2010-03-08 16:54                                                                                           ` Daniel Lezcano
2010-03-08 17:29                                                                                             ` Eric W. Biederman
2010-03-08 19:57                                                                                               ` Daniel Lezcano
2010-03-08 20:24                                                                                                 ` Eric W. Biederman
2010-03-08 20:42                                                                                                   ` Daniel Lezcano
2010-03-08 20:47                                                                                                     ` Eric W. Biederman
2010-03-08 21:12                                                                                                       ` Daniel Lezcano
2010-03-08 21:25                                                                                                         ` Eric W. Biederman
2010-03-08 21:49                                                                                                           ` Serge E. Hallyn
2010-03-08 22:24                                                                                                             ` Eric W. Biederman
2010-03-09 10:03                                                                                                           ` Daniel Lezcano
2010-03-09 10:13                                                                                                             ` Eric W. Biederman
2010-03-09 10:26                                                                                                               ` Daniel Lezcano
2010-03-10 21:16                                                                                                           ` Daniel Lezcano
2010-03-08 17:07                                                                                           ` Serge E. Hallyn
2010-03-08 17:35                                                                                             ` Eric W. Biederman
2010-03-08 17:47                                                                                               ` Serge E. Hallyn
2010-03-03 20:59                                                             ` Oren Laadan
2010-03-03 21:05                                                               ` Eric W. Biederman
     [not found]                                     ` <m1bpfbwuze.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-26 21:35                                       ` Pavel Emelyanov
2010-02-26 21:49                                         ` Eric W. Biederman
2010-02-23 23:49                           ` RFC: netfilter: nf_conntrack: add support for "conntrack zones" Matt Helsley
2010-02-24  1:32                             ` Eric W. Biederman
2010-02-24  1:39                               ` Serge E. Hallyn
2010-01-14 18:32   ` Ben Greear
2010-01-15 15:03     ` jamal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4B926B1B.5070207@free.fr \
    --to=daniel.lezcano@free$(echo .)fr \
    --cc=containers@lists$(echo .)linux-foundation.org \
    --cc=ebiederm@xmission$(echo .)com \
    --cc=greearb@candelatech$(echo .)com \
    --cc=netdev@vger$(echo .)kernel.org \
    --cc=netfilter-devel@vger$(echo .)kernel.org \
    --cc=serue@us$(echo .)ibm.com \
    --cc=sukadev@linux$(echo .)vnet.ibm.com \
    --cc=xemul@parallels$(echo .)com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox