Demi Marie Obenour writes: > Right now, the makefiles in host/rootfs, vm/sys/net, and img/app have > manually-maintained lists of files and symlinks. These duplicate the > information in the git repository and can easily get out of sync or > cause unnecessary merge conflicts. Fix all of these issues by having > the git repository be the source of truth, and using a script to > generate the file lists. Developers can regenerate the lists before > every commit, or even add a git hook to do that. > > Signed-off-by: Demi Marie Obenour > --- I like where this is going. :) > Changes in v3: > - Only include the file list generator. Move the rest to separate patch > series. > - Remove the update-file-list make targets from img/app/Makefile and > vm/sys/net/Makefile. > - Link to v2: https://lore.kernel.org/r/20250910-genfiles-v2-0-37ebe07a3cdc@gmail.com > > Changes in v2: > - Drop the last patch (switching to /etc/s6-rc/compiled) as it is > controversial and should be reviewed separately. > - Add missing copyright notices. > - Use a wrapper shell script to make the awk code easier to read. > - Improve documentation. > - Add helper scripts for use in git hooks and rebasing. > - Link to v1: https://spectrum-os.org/lists/archives/spectrum-devel/20250903-genfiles-v1-0-cc993fcb1e4c@gmail.com/ > --- > Documentation/development/built-in-vms.adoc | 17 ++++ > host/rootfs/Makefile | 102 +---------------------- > host/rootfs/file-list.mk | 99 +++++++++++++++++++++++ > img/app/Makefile | 80 +++---------------- > img/app/file-list.mk | 65 +++++++++++++++ > lib/common.mk | 1 + > scripts/genfiles.awk | 120 ++++++++++++++++++++++++++++ > scripts/genfiles.sh | 29 +++++++ > scripts/git-rebase | 17 ++++ > scripts/pre-commit.sh | 11 +++ Let's take git-rebase and pre-commit.sh out of this patch, and focus on the generated file lists first. > vm/sys/net/Makefile | 50 ++---------- > vm/sys/net/file-list.mk | 42 ++++++++++ > 12 files changed, 422 insertions(+), 211 deletions(-) > > diff --git a/Documentation/development/built-in-vms.adoc b/Documentation/development/built-in-vms.adoc > index e90009ee5a3c2c254a7ae11e36121576b819eee7..0addc7d1a2fd322fa12918656baa3d169478504d 100644 > --- a/Documentation/development/built-in-vms.adoc > +++ b/Documentation/development/built-in-vms.adoc Copyright header please! > @@ -44,6 +44,23 @@ NOTE: As a special convenience, it's not necessary to run `make clean` > if the only change to the Nix files is modifying the packages > installed in the VM. > > +The list of files used for the VM image is stored in a separate file, > +`file-lists.mk`. To update it, run `scripts/genfiles.sh` Typo: file-list*s*.mk. Also, so far we haven't used code syntax for file names. Maybe "used for images" would be better, since this also applies to host/rootfs. (Obviously the ideal would be if this documentation wasn't only written for VM images but that's out of scope. We'll get to it.) > +which will regenerate it from the output of `git ls-files`. Any > +changes you made will be lost. This script uses uses Git's index to I think "Any changes you made will be lost." is a bit scary, because it's not clear it only means changes to those files. The sentence could probably just be dropped altogether — I think it's implied by "regenerate". > +generate the list, so you need to use `git add`, `git rm`, and `git mv` > +to ensure that Git knows about your changes. It is not necessary to > +commit the changes. "so only staged changes will be reflected"? All the extra stuff has potential for confusion I think — for example "It is not necessary to commit the changes." could be read as "when you make a commit, do not include changes to file-list.mk". > diff --git a/lib/common.mk b/lib/common.mk > index 277c3544036d9a9057f8ba4ad37fe2207548cc59..0a03ff440cc671264d2b859a2ae048db9252d047 100644 > --- a/lib/common.mk > +++ b/lib/common.mk > @@ -1,5 +1,6 @@ > # SPDX-License-Identifier: EUPL-1.2+ > # SPDX-FileCopyrightText: 2021, 2023, 2025 Alyssa Ross > +# SPDX-FileCopyrightText: 2025 Demi Marie Obenour > > BACKGROUND = background > CPIO = cpio Accident? > diff --git a/scripts/genfiles.awk b/scripts/genfiles.awk > new file mode 100644 > index 0000000000000000000000000000000000000000..6fe327fd0a314d226dbce23854aa8f119e9c8f34 > --- /dev/null > +++ b/scripts/genfiles.awk > @@ -0,0 +1,120 @@ > +#!/usr/bin/env -S LC_ALL=C LANGUAGE=C awk -E > +# SPDX-License-Identifier: EUPL-1.2+ > +# SPDX-FileCopyrightText: 2025 Demi Marie Obenour > +BEGIN { > + RS = "\n"; > + FS = "\t"; > + file_count = 0; > + symlink_count = 0; > + rc_count = 0; > + is_rc = 0; > + exit_code = 0; > + done = 0; awk variables are implicitly initialized to 0 when you try to do arithmetic on an undefined variable, so no need for these. > + modes["120000"] = "symlink"; > + modes["040755"] = "directory"; > + modes["100644"] = "regular"; > + modes["100755"] = "regular"; > +} > + > +function fail(msg, status) { > + if (status ~ /^([1-9][0-9]?|1[0-9]{2}|2[0-4][1-9]|25[1-5])$/) { > + exit_code = status; > + } else { > + exit_code = 1; > + status = 1; > + } > + print ("FATAL: " msg) > "/dev/stderr"; > + exit status; Do we ever want to exit something other than 1 from this function? > +} > +done { fail("Junk after DONE", 1); } > +/^DONE$/ { > + done = 1 > + next > +} > + > +# Make sure git produced valid output. > +!/^[0-7]{6}\t[ -~]+$/ { > + fail("git ls-files produced invalid output", 1); > +} > + This is very unlikely to happen, and if it does, it will be obvious from the diff. > +# Extract data from built-in variables. > +{ > + filename = $2; > + raw_mode = $1; > + # awk autocreates empty string entries if the key is invalid, > + # but the code exits in this case so that is okay. > + mode = modes[raw_mode]; > +} > + > +# Another check for a git bug. > +filename ~ /^\/|((^|\/)\.{0,2}($|\/))/ { > + fail("git ls-files output non-canonical or absolute path '" filename "'", 1); > +} > + If there are git bugs, we will notice and report them. We do not need to be the test suite for git here. > +filename ~ /[^[:alnum:]_.+@/-]/ { > + fail("filename '" filename "' has forbidden characters", 1); > +} > + > +/\.license$/ { > + if (raw_mode != "100644") { > + fail("License file '" filename "' is executable or not regular file", 1); > + } > + next; > +} This is also not really in scope for a script that does not care about license files. > + > +mode == "directory" { next } Getting a directory from git ls-files would be sufficiently unexpected that I don't think we should treat it any differently from an unrecognized mode. > + > +filename ~ /^image\/etc\/s6-rc\// { > + if (mode != "regular") { > + fail("s6-rc-compile input '" filename "' isn't a regular file"); > + } > + rc_count += 1; > + rc_files[rc_count] = filename; rc_files[rc_count++] (will make it 0-indexed though so update the loops too) > + next; > +} > + > +mode == "symlink" { > + symlink_count += 1; > + symlinks[symlink_count] = filename; > + next; > +} > + > +mode == "regular" { > + file_count += 1; > + files[file_count] = filename; > + next; > +} > + > +{ fail("File '" filename "' is not regular file, directory, or symlink (mode " raw_mode ")"); } > + > +END { > + if (exit_code) { > + exit exit_code; > + } > + if (!done) { > + fail("Did not receive DONE line", 1); > + } > + printf ("# SPDX-License-Identifier: CC0-1.0\n" \ > + "# SPDX-FileCopyrightText: 2025 Demi Marie Obenour \n" \ Okay, so, it's silly that this needs to have a copyright header on it at all, but since we have to have one to make reuse happy, I think it should be mine from 2021, because the comment about links is the closest thing to creative expression in here. > + "# Generated by scripts/genfile.sh. Any changes will be overwritten.\n" \ > + "FILES ::=") > out_file; I note the change to ::=. Do you think we should do that across the board in our Makefiles? > + for (array_index = 1; array_index <= file_count; array_index += 1) { > + printf " \\\n\t%s", files[array_index] > out_file; > + } > + printf ("\n\n" \ > +"# These are separate because they need to be included, but putting\n" \ > +"# them as make dependencies would confuse make.\n" \ > +"LINKS ::=") > out_file; > + for (array_index = 1; array_index <= symlink_count; array_index += 1) { > + printf " \\\n\t%s", symlinks[array_index] > out_file; > + } > + printf "\n\nS6_RC_FILES ::=" > out_file; > + for (array_index = 1; array_index <= rc_count; array_index += 1) { > + printf " \\\n\t%s", rc_files[array_index] > out_file; > + } > + printf "\n" > out_file; > + if (close(out_file)) { > + print ("Cannot close output file: " ERRNO "\n") > "/dev/stderr"; > + exit 1; > + } > +} > diff --git a/scripts/genfiles.sh b/scripts/genfiles.sh > new file mode 100755 > index 0000000000000000000000000000000000000000..77a8d95e88b6851be9447698556efe4f1eab174b > --- /dev/null > +++ b/scripts/genfiles.sh > @@ -0,0 +1,29 @@ > +#!/usr/bin/env -S LC_ALL=C LANGUAGE=C bash -- env -S is not portable, and I don't think anything here needs bash specifically. We can set the locale variables after the script starts, because I don't think this wrapper script is going to do anything locale-specific. (And shouldn't they be C.UTF-8?) > +set -euo pipefail > +unset output_file astatus This is a bit overly defensive IMO. Both of these variables are assigned before use, and if they weren't, the person making those changes would be very unlikely to not notice because they had those variables defined in their environment. > +case $0 in > +(/*) cd "${0%/*}/..";; > +(*/*) cd "./${0%/*}/..";; > +(*) cd ..;; > +esac Perhaps we could use git rev-parse --show-toplevel? > +for i in host/rootfs img/app vm/sys/net; do > + output_file=$i/file-list.mk > + { > + git -C "$i" -c core.quotePath=true ls-files $'--format=%(objectmode)\t%(path)' -- image | > + sort -t $'\t' -k 2 TIL sort -t and -k! 🤯 > + echo DONE Why do we need this? > + } | > + gawk -v "out_file=$output_file.tmp" -E scripts/genfiles.awk Why not stdout? And why gawk? I didn't immediately notice anything non-POSIX, and as usual would prefer to stick to it. > + if [ -f "$output_file" ]; then > + # Avoid changing output file if it is up to date, as that > + # would cause unnecessary rebuilds. > + if cmp -s -- "$output_file.tmp" "$output_file"; then > + rm -- "$output_file.tmp" > + continue > + else > + astatus=$? > + if [ "$astatus" != 1 ]; then exit "$astatus"; fi Could avoid the need for the variable and multiple ifs. Up to you whether you prefer it: set +e cmp -s -- "$output_file.tmp" "$output_file" set -e case $? in 0) rm -- "$output_file.tmp" continue ;; 1) ;; *) exit $? ;; esac > + fi > + fi > + mv -- "$output_file.tmp" "$output_file" > +done