On 9/21/25 04:47, Alyssa Ross wrote: > Demi Marie Obenour writes: > >> Right now, the makefiles in host/rootfs, vm/sys/net, and img/app have >> manually-maintained lists of files and symlinks. These duplicate the >> information in the git repository and can easily get out of sync or >> cause unnecessary merge conflicts. Fix all of these issues by having >> the git repository be the source of truth, and using a script to >> generate the file lists. Developers can regenerate the lists before >> every commit, or even add a git hook to do that. >> >> Signed-off-by: Demi Marie Obenour >> --- > > I like where this is going. :) Yay! >> Changes in v3: >> - Only include the file list generator. Move the rest to separate patch >> series. >> - Remove the update-file-list make targets from img/app/Makefile and >> vm/sys/net/Makefile. >> - Link to v2: https://lore.kernel.org/r/20250910-genfiles-v2-0-37ebe07a3cdc@gmail.com >> >> Changes in v2: >> - Drop the last patch (switching to /etc/s6-rc/compiled) as it is >> controversial and should be reviewed separately. >> - Add missing copyright notices. >> - Use a wrapper shell script to make the awk code easier to read. >> - Improve documentation. >> - Add helper scripts for use in git hooks and rebasing. >> - Link to v1: https://spectrum-os.org/lists/archives/spectrum-devel/20250903-genfiles-v1-0-cc993fcb1e4c@gmail.com/ >> --- >> Documentation/development/built-in-vms.adoc | 17 ++++ >> host/rootfs/Makefile | 102 +---------------------- >> host/rootfs/file-list.mk | 99 +++++++++++++++++++++++ >> img/app/Makefile | 80 +++---------------- >> img/app/file-list.mk | 65 +++++++++++++++ >> lib/common.mk | 1 + >> scripts/genfiles.awk | 120 ++++++++++++++++++++++++++++ >> scripts/genfiles.sh | 29 +++++++ >> scripts/git-rebase | 17 ++++ >> scripts/pre-commit.sh | 11 +++ > > Let's take git-rebase and pre-commit.sh out of this patch, and focus on > the generated file lists first. Will change. >> vm/sys/net/Makefile | 50 ++---------- >> vm/sys/net/file-list.mk | 42 ++++++++++ >> 12 files changed, 422 insertions(+), 211 deletions(-) >> >> diff --git a/Documentation/development/built-in-vms.adoc b/Documentation/development/built-in-vms.adoc >> index e90009ee5a3c2c254a7ae11e36121576b819eee7..0addc7d1a2fd322fa12918656baa3d169478504d 100644 >> --- a/Documentation/development/built-in-vms.adoc >> +++ b/Documentation/development/built-in-vms.adoc > > Copyright header please! Will fix. Also, in the future you have permission to fix missing copyright headers when you commit. It's fine if you aren't comfortable doing that. >> @@ -44,6 +44,23 @@ NOTE: As a special convenience, it's not necessary to run `make clean` >> if the only change to the Nix files is modifying the packages >> installed in the VM. >> >> +The list of files used for the VM image is stored in a separate file, >> +`file-lists.mk`. To update it, run `scripts/genfiles.sh` > > Typo: file-list*s*.mk. Also, so far we haven't used code syntax for > file names. > > Maybe "used for images" would be better, since this also applies to > host/rootfs. (Obviously the ideal would be if this documentation wasn't > only written for VM images but that's out of scope. We'll get to it.) Will fix. >> +which will regenerate it from the output of `git ls-files`. Any >> +changes you made will be lost. This script uses uses Git's index to > > I think "Any changes you made will be lost." is a bit scary, because > it's not clear it only means changes to those files. The sentence could > probably just be dropped altogether — I think it's implied by "regenerate". Will fix. >> +generate the list, so you need to use `git add`, `git rm`, and `git mv` >> +to ensure that Git knows about your changes. It is not necessary to >> +commit the changes. > > "so only staged changes will be reflected"? All the extra stuff has > potential for confusion I think — for example "It is not necessary to > commit the changes." could be read as "when you make a commit, do not > include changes to file-list.mk". Will fix. >> diff --git a/lib/common.mk b/lib/common.mk >> index 277c3544036d9a9057f8ba4ad37fe2207548cc59..0a03ff440cc671264d2b859a2ae048db9252d047 100644 >> --- a/lib/common.mk >> +++ b/lib/common.mk >> @@ -1,5 +1,6 @@ >> # SPDX-License-Identifier: EUPL-1.2+ >> # SPDX-FileCopyrightText: 2021, 2023, 2025 Alyssa Ross >> +# SPDX-FileCopyrightText: 2025 Demi Marie Obenour >> >> BACKGROUND = background >> CPIO = cpio > > Accident? Yes. >> diff --git a/scripts/genfiles.awk b/scripts/genfiles.awk >> new file mode 100644 >> index 0000000000000000000000000000000000000000..6fe327fd0a314d226dbce23854aa8f119e9c8f34 >> --- /dev/null >> +++ b/scripts/genfiles.awk >> @@ -0,0 +1,120 @@ >> +#!/usr/bin/env -S LC_ALL=C LANGUAGE=C awk -E >> +# SPDX-License-Identifier: EUPL-1.2+ >> +# SPDX-FileCopyrightText: 2025 Demi Marie Obenour >> +BEGIN { >> + RS = "\n"; >> + FS = "\t"; >> + file_count = 0; >> + symlink_count = 0; >> + rc_count = 0; >> + is_rc = 0; >> + exit_code = 0; >> + done = 0; > > awk variables are implicitly initialized to 0 when you try to do > arithmetic on an undefined variable, so no need for these. GNU Awk can lint against that. I used its lint mode because it also warns against non-portable constructs. Also, an undefined awk variable used as an array subscript is treated as the empty string, not 0, which could lead to confusion. >> + modes["120000"] = "symlink"; >> + modes["040755"] = "directory"; >> + modes["100644"] = "regular"; >> + modes["100755"] = "regular"; >> +} >> + >> +function fail(msg, status) { >> + if (status ~ /^([1-9][0-9]?|1[0-9]{2}|2[0-4][1-9]|25[1-5])$/) { >> + exit_code = status; >> + } else { >> + exit_code = 1; >> + status = 1; >> + } >> + print ("FATAL: " msg) > "/dev/stderr"; >> + exit status; > > Do we ever want to exit something other than 1 from this function? Nope. >> +} >> +done { fail("Junk after DONE", 1); } >> +/^DONE$/ { >> + done = 1 >> + next >> +} >> + >> +# Make sure git produced valid output. >> +!/^[0-7]{6}\t[ -~]+$/ { >> + fail("git ls-files produced invalid output", 1); >> +} >> + > > This is very unlikely to happen, and if it does, it will be obvious from > the diff. Will drop. >> +# Extract data from built-in variables. >> +{ >> + filename = $2; >> + raw_mode = $1; >> + # awk autocreates empty string entries if the key is invalid, >> + # but the code exits in this case so that is okay. >> + mode = modes[raw_mode]; >> +} >> + >> +# Another check for a git bug. >> +filename ~ /^\/|((^|\/)\.{0,2}($|\/))/ { >> + fail("git ls-files output non-canonical or absolute path '" filename "'", 1); >> +} >> + > > If there are git bugs, we will notice and report them. We do not need > to be the test suite for git here. Okay, fair! >> +filename ~ /[^[:alnum:]_.+@/-]/ { >> + fail("filename '" filename "' has forbidden characters", 1); >> +} >> + >> +/\.license$/ { >> + if (raw_mode != "100644") { >> + fail("License file '" filename "' is executable or not regular file", 1); >> + } >> + next; >> +} > > This is also not really in scope for a script that does not care about > license files. Fair. I will leave that to the reuse check. >> + >> +mode == "directory" { next } > > Getting a directory from git ls-files would be sufficiently unexpected > that I don't think we should treat it any differently from an > unrecognized mode. Will fix. >> + >> +filename ~ /^image\/etc\/s6-rc\// { >> + if (mode != "regular") { >> + fail("s6-rc-compile input '" filename "' isn't a regular file"); >> + } >> + rc_count += 1; >> + rc_files[rc_count] = filename; > > rc_files[rc_count++] > > (will make it 0-indexed though so update the loops too) I think this might break without explicit variable initialization. >> + next; >> +} >> + >> +mode == "symlink" { >> + symlink_count += 1; >> + symlinks[symlink_count] = filename; >> + next; >> +} >> + >> +mode == "regular" { >> + file_count += 1; >> + files[file_count] = filename; >> + next; >> +} >> + >> +{ fail("File '" filename "' is not regular file, directory, or symlink (mode " raw_mode ")"); } >> + >> +END { >> + if (exit_code) { >> + exit exit_code; >> + } >> + if (!done) { >> + fail("Did not receive DONE line", 1); >> + } >> + printf ("# SPDX-License-Identifier: CC0-1.0\n" \ >> + "# SPDX-FileCopyrightText: 2025 Demi Marie Obenour \n" \ > > Okay, so, it's silly that this needs to have a copyright header on it at > all, but since we have to have one to make reuse happy, I think it > should be mine from 2021, because the comment about links is the closest > thing to creative expression in here. Will fix. >> + "# Generated by scripts/genfile.sh. Any changes will be overwritten.\n" \ >> + "FILES ::=") > out_file; > > I note the change to ::=. Do you think we should do that across the > board in our Makefiles? POSIX specifies ::= and it has better semantics in most cases, but I don't know if the BSD makes implement it. ::= causes the RHS to be expanded immediately, so subsequent changes in variables referenced by it do not affect the LHS. >> + for (array_index = 1; array_index <= file_count; array_index += 1) { >> + printf " \\\n\t%s", files[array_index] > out_file; >> + } >> + printf ("\n\n" \ >> +"# These are separate because they need to be included, but putting\n" \ >> +"# them as make dependencies would confuse make.\n" \ >> +"LINKS ::=") > out_file; >> + for (array_index = 1; array_index <= symlink_count; array_index += 1) { >> + printf " \\\n\t%s", symlinks[array_index] > out_file; >> + } >> + printf "\n\nS6_RC_FILES ::=" > out_file; >> + for (array_index = 1; array_index <= rc_count; array_index += 1) { >> + printf " \\\n\t%s", rc_files[array_index] > out_file; >> + } >> + printf "\n" > out_file; >> + if (close(out_file)) { >> + print ("Cannot close output file: " ERRNO "\n") > "/dev/stderr"; >> + exit 1; >> + } >> +} >> diff --git a/scripts/genfiles.sh b/scripts/genfiles.sh >> new file mode 100755 >> index 0000000000000000000000000000000000000000..77a8d95e88b6851be9447698556efe4f1eab174b >> --- /dev/null >> +++ b/scripts/genfiles.sh >> @@ -0,0 +1,29 @@ >> +#!/usr/bin/env -S LC_ALL=C LANGUAGE=C bash -- > > env -S is not portable, and I don't think anything here needs bash > specifically. $'\t' doesn't work with all shells, though I believe it is either part of the current POSIX standard or will be added. I'll use /usr/bin/env bash, which breaks if the script is renamed to something starting with '-'. > We can set the locale variables after the script starts, > because I don't think this wrapper script is going to do anything > locale-specific. (And shouldn't they be C.UTF-8?) The C locale is actually what I intended. The script does not rely on support for non-ASCII characters, and it does use the fact that negated character classes match all bytes. Admittedly, this will only be needed if there is a git bug. >> +set -euo pipefail >> +unset output_file astatus > > This is a bit overly defensive IMO. Both of these variables are > assigned before use, and if they weren't, the person making those > changes would be very unlikely to not notice because they had those > variables defined in their environment. Fair! >> +case $0 in >> +(/*) cd "${0%/*}/..";; >> +(*/*) cd "./${0%/*}/..";; >> +(*) cd ..;; >> +esac > > Perhaps we could use git rev-parse --show-toplevel? git ls-files doesn't have that option. >> +for i in host/rootfs img/app vm/sys/net; do >> + output_file=$i/file-list.mk >> + { >> + git -C "$i" -c core.quotePath=true ls-files $'--format=%(objectmode)\t%(path)' -- image | >> + sort -t $'\t' -k 2 > > TIL sort -t and -k! 🤯 > >> + echo DONE > > Why do we need this? To avoid producing any output file if the input is truncated. >> + } | >> + gawk -v "out_file=$output_file.tmp" -E scripts/genfiles.awk > > Why not stdout? The output file is created by awk so that it is only created if nothing went wrong. > And why gawk? I didn't immediately notice anything > non-POSIX, and as usual would prefer to stick to it. POSIX does not specify -E. I can use -f instead, though. >> + if [ -f "$output_file" ]; then >> + # Avoid changing output file if it is up to date, as that >> + # would cause unnecessary rebuilds. >> + if cmp -s -- "$output_file.tmp" "$output_file"; then >> + rm -- "$output_file.tmp" >> + continue >> + else >> + astatus=$? >> + if [ "$astatus" != 1 ]; then exit "$astatus"; fi > > Could avoid the need for the variable and multiple ifs. Up to you > whether you prefer it: > > set +e > cmp -s -- "$output_file.tmp" "$output_file" > set -e > case $? in > 0) > rm -- "$output_file.tmp" > continue > ;; > 1) > ;; > *) > exit $? > ;; > esac This might set $? to the return value of 'set -e' (0). Whether or not it actually does is at least not obvious from reading the code. >> + fi >> + fi >> + mv -- "$output_file.tmp" "$output_file" >> +done -- Sincerely, Demi Marie Obenour (she/her/hers)