29 views
TL;DR ===== It appears ansible is not able to operate fully on FreeBSD, as(when?) `/var/tmp` has NFSv4 ACLs set. It can be [worked around](#Workaround) by trivially patching ansible, and presumably be bug-fixed in a similar manner. [toc] Initial problem description =========================== Failing task: ```yaml - name: "Create .ssh directory for {{ user }}" ansible.builtin.file: path: "/home/{{ user }}/.ssh" state: directory mode: 0700 become: true become_user: "{{ user }}" ``` Error message: ``` [loza.netizen.se] TASK: netizen-authorized_keys : Create .ssh directory for git (debug)> r fatal: [loza.netizen.se]: FAILED! => {"msg": "Failed to set permissions on the temporary files Ansible needs to create when becoming an unprivileged user (rc: 1, err: chmod: invalid file mode: A+user:git:rx:allow\n}). For information on working around this, see https://docs.ansible.com/ansible-core/2.14/playbook_guide/playbooks_privilege_escalation.html#risks-of-becoming-an-unprivileged-user#risks-of-becoming-an-unprivileged-user"} ``` Environment =========== The target ansible node is a freshly installed FreeBSD 14.0-RELEASE system, installed on a Hetzner cax11 vps from the iso they make available. It only has the default local system disk. No NFS mounts are involved. Permissions of `/var/tmp` has not knowingly been manually altered. A Debian Trixie system is used as ansible controller. This system is using a nfs mounted `/home`, which really shouldn't affect anything even if it did confuse me while being tired. Debugging Notes =============== Actual err: `chmod: invalid file mode: A+user:git:rx:allow ` Only search engine hit with that error message, where it happened on OpenBSD. https://github.com/ansible/ansible/issues/79602. No resolution, unfortunately. The documentation linked from the error message claims: > First, if setfacl is installed and available in the remote PATH, and the temporary directory on the remote host is mounted with POSIX.1e filesystem ACL support, Ansible will use POSIX ACLs to share the module file with the second unprivileged user. ...which appears to ~~not be true~~. (**update:** Sure, it is true. Yet it would make sense to try NFSv4 ACLs in addition to POSIX.1e ACLs.) ```shell % which setfacl /bin/setfacl % touch /tmp/foo % setfacl -m u:git:rwx:allow /tmp/foo % getfacl /tmp/foo # file: /tmp/foo # owner: cos # group: wheel user:git:rwx-----------:-------:allow owner@:rw-p--aARWcCos:-------:allow group@:r-----a-R-c--s:-------:allow everyone@:r-----a-R-c--s:-------:allow ``` The documentation then continues: > Next, if POSIX ACLs are not available or setfacl could not be run, Ansible will attempt to change ownership of the module file using chown for systems that support doing so as an unprivileged user. The error message mentions chmod, which suggests this step might be encountered. Yet the error message also shows a supposedly attempted argument to chmod which acl syntax I do not recognize. Further the documentation continues with suggestions on workarounds, which to me appear to be chmod 0777-like methods to attempt when ansible has attempted ~~Linux-only syntax on a BSD host?~~ (**Update:** contextually invalid syntax.) ~~I was hoping to learn how to debug what ansible is attempting, to see the python code running and failing.~~ (**Update:** I did get help on irc to navigate the code. Thanks!) Adding all the three v:s and a fourth -vvvv to ansible-playbook gives that the command sent actually **is** `/bin/sh -c chmod A+user:git:rx:allow <file>`, which is not valid syntax for either [freebsd][] nor [linux][]. It does however appear to be be valid syntax with [illumos]. As suggested on #~~illumos~~ansible (and in line with docs), some setfacl operations should be attempted already before the chmod. Studing the debug logs closely shows a `/bin/sh -c 'setfacl -m u:git:r-x ...'` with a failure of > setfacl: ...: branding mismatch; existing ACL is NFSv4, entry to be merged is POSIX.1e. In other words, there ought to be an attempt at using ~~BSD~~ NFSv4 valid syntax to setfacl in [\_fixup_perms2()][]. Quoting the brief descriptions from [setfacl][]: > A POSIX.1E ACL entry contains three colon-separated fields: an ACL tag, an ACL qualifier, and discretionary access permissions: > ... > An NFSv4 ACL entry contains four or five colon-separated fields: an ACL tag, an ACL qualifier (only for "user" and "group" tags), discretionary access permissions, ACL inheritance flags, and ACL type: Breaking the POSIX.1e acl down gives: | tag | qualifier | disc. access perm. | |-----|-----------|--------------------| | u | git | r-x | Corresponding, and extended, for NFSv4 that ought to become: | tag | qualifier | disc. access perm. | inherit. | type | |-----|-----------|--------------------|----------|-------| | u | git | r-x |d / dfiI ?| allow | Fugly patching /usr/lib/python3/dist-packages/ansible/plugins/shell as below is unfortunately not a sufficient fix: ```diff --- __init__.py.orig 2024-04-18 10:06:41.628285997 +0200 +++ __init__.py 2024-04-18 10:07:20.288285997 +0200 @@ -121,7 +121,8 @@ def set_user_facl(self, paths, user, mode): """Only sets acls for users as that's really all we need""" - cmd = ['setfacl', '-m', 'u:%s:%s' % (user, mode)] + # cmd = ['setfacl', '-m', 'u:%s:%s' % (user, mode)] + cmd = ['setfacl', '-m', 'u:%s:%s:dfiI:allow' % (user, mode)] cmd.extend(paths) cmd = [shlex.quote(c) for c in cmd] ``` In an attempt to get earlier error resolution, also /usr/lib/python3/dist-packages/ansible/plugins/action was patched like below: ```diff --- __init__.py.orig 2024-04-18 10:14:56.888285997 +0200 +++ __init__.py 2024-04-18 10:16:54.540285997 +0200 @@ -666,6 +666,13 @@ if res['rc'] == 0: return remote_paths + else: + raise AnsibleError( + 'Failed to set file mode or acl on remote temporary files ' + '(rc: {0}, err: {1})'.format( + res['rc'], + to_native(res['stderr']))) + # Step 3b: Set execute if we need to. We do this before anything else # because some of the methods below might work but not let us set +x ``` With the above patch, and `-vvvv`, it becomes apparent that the fugly patch was too naive as ansible seems to mix and match between uploading its jobfiles to both `/var/tmp/` and `~/.ansible/tmp/`. Thus instead attempting: ```diff --- __init__.py.orig 2024-04-18 10:06:41.628285997 +0200 +++ __init__.py 2024-04-18 10:27:11.012285997 +0200 @@ -127,6 +127,14 @@ return ' '.join(cmd) + def set_user_facl_nfs(self, paths, user, mode): + """Only sets acls for users as that's really all we need""" + cmd = ['setfacl', '-m', 'u:%s:%s:dfiI:allow' % (user, mode)] + cmd.extend(paths) + cmd = [shlex.quote(c) for c in cmd] + + return ' '.join(cmd) + def remove(self, path, recurse=False): path = shlex.quote(path) cmd = 'rm -f ' ``` \+ ```diff --- __init__.py.orig 2024-04-18 10:14:56.888285997 +0200 +++ __init__.py 2024-04-18 10:32:16.000285997 +0200 @@ -667,6 +667,22 @@ if res['rc'] == 0: return remote_paths + # Step 3a': Are we able to use setfacl to add user ACLs to the file if on nfs? + res = self._remote_set_user_facl_nfs( + remote_paths, + become_user, + setfacl_mode) + + if res['rc'] == 0: + return remote_paths + else: + raise AnsibleError( + 'Failed to set file mode or acl on remote temporary files ' + '(rc: {0}, err: {1})'.format( + res['rc'], + to_native(res['stderr']))) + + # Step 3b: Set execute if we need to. We do this before anything else # because some of the methods below might work but not let us set +x # as part of them. @@ -822,6 +838,14 @@ res = self._low_level_execute_command(cmd, sudoable=sudoable) return res + def _remote_set_user_facl_nfs(self, paths, user, mode, sudoable=False): + ''' + Issue a remote call to setfacl + ''' + cmd = self._connection._shell.set_user_facl_nfs(paths, user, mode) + res = self._low_level_execute_command(cmd, sudoable=sudoable) + return res + def _execute_remote_stat(self, path, all_vars, follow, tmp=None, checksum=True): ''' Get information from remote file. ``` Workaround ========== Applying the following two patches makes the problem go away, and results in successful executions in my environment. In their current format they also breaks functionality with that early return (added for shorter time to failure), but restoring the original flow is trivial. Debian Trixie: `/usr/lib/python3/dist-packages/ansible/plugins/shell/__init__.py` ```diff --- plugins/shell/__init__.py.orig 2024-04-18 10:06:41.628285997 +0200 +++ plugins/shell/__init__.py 2024-04-19 06:09:33.844285997 +0200 @@ -127,6 +127,14 @@ return ' '.join(cmd) + def set_user_facl_nfs(self, paths, user, mode): + """Only sets acls for users as that's really all we need""" + cmd = ['setfacl', '-m', 'u:%s:%s::allow' % (user, mode)] + cmd.extend(paths) + cmd = [shlex.quote(c) for c in cmd] + + return ' '.join(cmd) + def remove(self, path, recurse=False): path = shlex.quote(path) cmd = 'rm -f ' ``` Debian Trixie: `/usr/lib/python3/dist-packages/ansible/plugins/action/__init__.py` ```diff --- __init__.py.orig 2024-04-18 10:14:56.888285997 +0200 +++ __init__.py 2024-04-19 06:09:22.120285997 +0200 @@ -667,6 +667,22 @@ if res['rc'] == 0: return remote_paths + # Step 3a': Are we able to use setfacl to add user ACLs to the file if on nfs? + res = self._remote_set_user_facl_nfs( + remote_paths, + become_user, + chmod_mode) + + if res['rc'] == 0: + return remote_paths + else: + raise AnsibleError( + 'Failed to set file mode or acl on remote temporary files ' + '(rc: {0}, err: {1})'.format( + res['rc'], + to_native(res['stderr']))) + + # Step 3b: Set execute if we need to. We do this before anything else # because some of the methods below might work but not let us set +x # as part of them. @@ -822,6 +838,14 @@ res = self._low_level_execute_command(cmd, sudoable=sudoable) return res + def _remote_set_user_facl_nfs(self, paths, user, mode, sudoable=False): + ''' + Issue a remote call to setfacl + ''' + cmd = self._connection._shell.set_user_facl_nfs(paths, user, mode) + res = self._low_level_execute_command(cmd, sudoable=sudoable) + return res + def _execute_remote_stat(self, path, all_vars, follow, tmp=None, checksum=True): ''' Get information from remote file. ``` Proper Bugfix ============= Pending discussion. I hope to be able to polish the above proof-of-concept into actually mergable code, given that it doesn't get too process-heavy with needing to register for silly accounts or such admin stuff. Conversation on #ansible and [communication] suggests that creating accounts with Discourse and Microsoft GitHub are essential to contributing, so chances are I'll just maintain my [own branch][git] and wont bother being a good citizen, unless #ansible-devel can assist with a path to merge with minimum red-tape required. Just to be crystal clear, and overly verbose. I don't have an unreasonable beef with registering for either of the two required accounts. I have registered with both Discourses and GitHub in the past and, as one should, deleted myself once done. My threshold however is above having to go through the process just to land a PR in [the pool of hundreds][pulls] of already existing ones which will never get the attention they deserve. Personal Path ============= I'll migrate away from using ansible for anything under my control. This is far from the first time I've needed to spend more time on adapting to an ansible bug than what is reasonable, giving the value the tool actually provides for me. Most of the new recipies I've written during the past couple of years have offloaded most tasks to plain shell scripts creating `.<script>-has-run` files anyways, as that is both easier to debug and less of a moving target than ansible with its wont-fix bugs and ever changing interfaces. Debian Package ============== While still stuck with ansible, I've locally built a personal ansible-core 2.14.13-1cos1 package. On the odd chance it might help others in some way, I've also [attempted][debian-attempt] to rebase my fix as 2.16.5-1cos1, which is a bit troublesome since [VCS repository is not up to date][vcsinfo]. So I fail already at building even the baseline, 2.16.5-1, when attempting (knowing some args are overridden by others): ``` gbp buildpackage \ --git-upstream-branch='upstream/recreated' \ --git-debian-branch='debian/recreated' \ --git-keyid='|cos|' \ --sign-key='|cos|' \ --no-sign \ --sign-command='/bin/true' \ --git-no-sign-tags \ --git-no-pristine-tar \ --git-no-pristine-tar-commit ``` The above fails with unaccounted changes to my source tree. I've renamed the branches before pushing them, to communicate that they are failing, but without changes to them. If succeeding to build contemporary package versions, my aim is to offer the patch to the debian project. It is also likely that I publish to [apt.netizen.se][], if I get around to that before the bugfix gets accepted somewhere. [apt.netizen.se]: https://apt.netizen.se [vcsinfo]: https://qa.debian.org/cgi-bin/vcswatch?package=ansible-core [debian-attempt]: https://git.netizen.se/debian-ansible-core/ [pulls]: https://github.com/ansible/ansible/pulls [git]: https://git.netizen.se/ansible/commit/?h=fix/become_user%2bfreebsd-v2.4.13 [communication]: [https://docs.ansible.com/ansible/latest/community/communication.html [\_fixup_perms2()]: https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/action/__init__.py#L585 [setfacl]: https://man.freebsd.org/cgi/man.cgi?query=setfacl&apropos=0&sektion=0&manpath=FreeBSD+14.0-RELEASE+and+Ports&arch=default&format=html [freebsd]: https://man.freebsd.org/cgi/man.cgi?query=chmod&;sektion=1 [linux]: https://www.man7.org/linux/man-pages/man1/chmod.1.html [illumos]: https://man.omnios.org/man1/chmod.1#ACL_Operation [posix1e]: https://man.freebsd.org/cgi/man.cgi?query=posix1e