patches and low-level development discussion
From: Demi Marie Obenour <demiobenour@gmail.com>
To: Alyssa Ross <hi@alyssa.is>
Cc: Spectrum OS Development <devel@spectrum-os.org>
Subject: Re: [PATCH v3] Generate file lists from a script
Date: Sun, 21 Sep 2025 12:51:31 -0400
Message-ID: <e8a7ce72-7f2a-480a-b6ee-55dcc5e31bac@gmail.com>
In-Reply-To: <87bjn4b39h.fsf@alyssa.is>


On 9/21/25 04:47, Alyssa Ross wrote:
> Demi Marie Obenour <demiobenour@gmail.com> writes:
> 
>> Right now, the makefiles in host/rootfs, vm/sys/net, and img/app have
>> manually-maintained lists of files and symlinks.  These duplicate the
>> information in the git repository and can easily get out of sync or
>> cause unnecessary merge conflicts.  Fix all of these issues by having
>> the git repository be the source of truth, and using a script to
>> generate the file lists.  Developers can regenerate the lists before
>> every commit, or even add a git hook to do that.
>>
>> Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
>> ---
> 
> I like where this is going. :)

Yay!

>> Changes in v3:
>> - Only include the file list generator.  Move the rest to separate patch
>>   series.
>> - Remove the update-file-list make targets from img/app/Makefile and
>>   vm/sys/net/Makefile.
>> - Link to v2: https://lore.kernel.org/r/20250910-genfiles-v2-0-37ebe07a3cdc@gmail.com
>>
>> Changes in v2:
>> - Drop the last patch (switching to /etc/s6-rc/compiled) as it is
>>   controversial and should be reviewed separately.
>> - Add missing copyright notices.
>> - Use a wrapper shell script to make the awk code easier to read.
>> - Improve documentation.
>> - Add helper scripts for use in git hooks and rebasing.
>> - Link to v1: https://spectrum-os.org/lists/archives/spectrum-devel/20250903-genfiles-v1-0-cc993fcb1e4c@gmail.com/
>> ---
>>  Documentation/development/built-in-vms.adoc |  17 ++++
>>  host/rootfs/Makefile                        | 102 +----------------------
>>  host/rootfs/file-list.mk                    |  99 +++++++++++++++++++++++
>>  img/app/Makefile                            |  80 +++----------------
>>  img/app/file-list.mk                        |  65 +++++++++++++++
>>  lib/common.mk                               |   1 +
>>  scripts/genfiles.awk                        | 120 ++++++++++++++++++++++++++++
>>  scripts/genfiles.sh                         |  29 +++++++
>>  scripts/git-rebase                          |  17 ++++
>>  scripts/pre-commit.sh                       |  11 +++
> 
> Let's take git-rebase and pre-commit.sh out of this patch, and focus on
> the generated file lists first.

Will change.

>>  vm/sys/net/Makefile                         |  50 ++----------
>>  vm/sys/net/file-list.mk                     |  42 ++++++++++
>>  12 files changed, 422 insertions(+), 211 deletions(-)
>>
>> diff --git a/Documentation/development/built-in-vms.adoc b/Documentation/development/built-in-vms.adoc
>> index e90009ee5a3c2c254a7ae11e36121576b819eee7..0addc7d1a2fd322fa12918656baa3d169478504d 100644
>> --- a/Documentation/development/built-in-vms.adoc
>> +++ b/Documentation/development/built-in-vms.adoc
> 
> Copyright header please!

Will fix.  Also, in the future you have permission to fix missing copyright
headers when you commit.  It's fine if you aren't comfortable doing that.

>> @@ -44,6 +44,23 @@ NOTE: As a special convenience, it's not necessary to run `make clean`
>>  if the only change to the Nix files is modifying the packages
>>  installed in the VM.
>>  
>> +The list of files used for the VM image is stored in a separate file,
>> +`file-lists.mk`.  To update it, run `scripts/genfiles.sh`
> 
> Typo: file-list*s*.mk.  Also, so far we haven't used code syntax for
> file names.
> 
> Maybe "used for images" would be better, since this also applies to
> host/rootfs.  (Obviously the ideal would be if this documentation wasn't
> only written for VM images but that's out of scope.  We'll get to it.)

Will fix.

>> +which will regenerate it from the output of `git ls-files`.  Any
>> +changes you made will be lost.  This script uses Git's index to
> 
> I think "Any changes you made will be lost." is a bit scary, because
> it's not clear it only means changes to those files.  The sentence could
> probably just be dropped altogether — I think it's implied by "regenerate".

Will fix.

>> +generate the list, so you need to use `git add`, `git rm`, and `git mv`
>> +to ensure that Git knows about your changes.  It is not necessary to
>> +commit the changes.
> 
> "so only staged changes will be reflected"?  All the extra stuff has
> potential for confusion I think — for example "It is not necessary to
> commit the changes." could be read as "when you make a commit, do not
> include changes to file-list.mk".

Will fix.

>> diff --git a/lib/common.mk b/lib/common.mk
>> index 277c3544036d9a9057f8ba4ad37fe2207548cc59..0a03ff440cc671264d2b859a2ae048db9252d047 100644
>> --- a/lib/common.mk
>> +++ b/lib/common.mk
>> @@ -1,5 +1,6 @@
>>  # SPDX-License-Identifier: EUPL-1.2+
>>  # SPDX-FileCopyrightText: 2021, 2023, 2025 Alyssa Ross <hi@alyssa.is>
>> +# SPDX-FileCopyrightText: 2025 Demi Marie Obenour <demiobenour@gmail.com>
>>  
>>  BACKGROUND = background
>>  CPIO = cpio
> 
> Accident?

Yes.

>> diff --git a/scripts/genfiles.awk b/scripts/genfiles.awk
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..6fe327fd0a314d226dbce23854aa8f119e9c8f34
>> --- /dev/null
>> +++ b/scripts/genfiles.awk
>> @@ -0,0 +1,120 @@
>> +#!/usr/bin/env -S LC_ALL=C LANGUAGE=C awk -E
>> +# SPDX-License-Identifier: EUPL-1.2+
>> +# SPDX-FileCopyrightText: 2025 Demi Marie Obenour <demiobenour@gmail.com>
>> +BEGIN {
>> +	RS = "\n";
>> +	FS = "\t";
>> +	file_count = 0;
>> +	symlink_count = 0;
>> +	rc_count = 0;
>> +	is_rc = 0;
>> +	exit_code = 0;
>> +	done = 0;
> 
> awk variables are implicitly initialized to 0 when you try to do
> arithmetic on an undefined variable, so no need for these.

GNU Awk can lint against that.  I used its lint mode because it also
warns against non-portable constructs.  Also, an undefined awk
variable used as an array subscript is treated as the empty string,
not 0, which could lead to confusion.
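
As a small illustration of the subscript point (a sketch, not part of
the patch; `a` and `n` are made-up names), an unset variable used as a
subscript stores the element under the empty string rather than under 0:

```shell
# Illustration only: `a` and `n` are hypothetical names, not taken from
# genfiles.awk.  `n` is never assigned, so a[n] is really a[""].
result=$(awk 'BEGIN {
	a[n] = "x"
	# Report whether the element is stored under "" and under 0.
	printf "%d %d\n", ("" in a), (0 in a)
}')
echo "$result"   # prints "1 0"
```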

>> +	modes["120000"] = "symlink";
>> +	modes["040755"] = "directory";
>> +	modes["100644"] = "regular";
>> +	modes["100755"] = "regular";
>> +}
>> +
>> +function fail(msg, status) {
>> +	if (status ~ /^([1-9][0-9]?|1[0-9]{2}|2[0-4][0-9]|25[0-5])$/) {
>> +		exit_code = status;
>> +	} else {
>> +		exit_code = 1;
>> +		status = 1;
>> +	}
>> +	print ("FATAL: " msg) > "/dev/stderr";
>> +	exit status;
> 
> Do we ever want to exit something other than 1 from this function?

Nope.

>> +}
>> +done { fail("Junk after DONE", 1); }
>> +/^DONE$/ {
>> +	done = 1
>> +	next
>> +}
>> +
>> +# Make sure git produced valid output.
>> +!/^[0-7]{6}\t[ -~]+$/ {
>> +	fail("git ls-files produced invalid output", 1);
>> +}
>> +
> 
> This is very unlikely to happen, and if it does, it will be obvious from
> the diff.

Will drop.

>> +# Extract data from built-in variables.
>> +{
>> +	filename = $2;
>> +	raw_mode = $1;
>> +	# awk autocreates empty string entries if the key is invalid,
>> +	# but the code exits in this case so that is okay.
>> +	mode = modes[raw_mode];
>> +}
>> +
>> +# Another check for a git bug.
>> +filename ~ /^\/|((^|\/)\.{0,2}($|\/))/ {
>> +	fail("git ls-files output non-canonical or absolute path '" filename "'", 1);
>> +}
>> +
> 
> If there are git bugs, we will notice and report them.  We do not need
> to be the test suite for git here.

Okay, fair!

>> +filename ~ /[^[:alnum:]_.+@/-]/ {
>> +	fail("filename '" filename "' has forbidden characters", 1);
>> +}
>> +
>> +/\.license$/ {
>> +	if (raw_mode != "100644") {
>> +		fail("License file '" filename "' is executable or not regular file", 1);
>> +	}
>> +	next;
>> +}
> 
> This is also not really in scope for a script that does not care about
> license files.

Fair.  I will leave that to the reuse check.

>> +
>> +mode == "directory" { next }
> 
> Getting a directory from git ls-files would be sufficiently unexpected
> that I don't think we should treat it any differently from an
> unrecognized mode.

Will fix.

>> +
>> +filename ~ /^image\/etc\/s6-rc\// {
>> +	if (mode != "regular") {
>> +		fail("s6-rc-compile input '" filename "' isn't a regular file");
>> +	}
>> +	rc_count += 1;
>> +	rc_files[rc_count] = filename;
> 
> rc_files[rc_count++]
> 
> (will make it 0-indexed though so update the loops too)

I think this might break without explicit variable initialization.

>> +	next;
>> +}
>> +
>> +mode == "symlink" {
>> +	symlink_count += 1;
>> +	symlinks[symlink_count] = filename;
>> +	next;
>> +}
>> +
>> +mode == "regular" {
>> +	file_count += 1;
>> +	files[file_count] = filename;
>> +	next;
>> +}
>> +
>> +{ fail("File '" filename "' is not regular file, directory, or symlink (mode " raw_mode ")"); }
>> +
>> +END {
>> +	if (exit_code) {
>> +		exit exit_code;
>> +	}
>> +	if (!done) {
>> +		fail("Did not receive DONE line", 1);
>> +	}
>> +	printf ("# SPDX-License-Identifier: CC0-1.0\n" \
>> +	       "# SPDX-FileCopyrightText: 2025 Demi Marie Obenour <demiobenour@gmail.com>\n" \
> 
> Okay, so, it's silly that this needs to have a copyright header on it at
> all, but since we have to have one to make reuse happy, I think it
> should be mine from 2021, because the comment about links is the closest
> thing to creative expression in here.

Will fix.

>> +	       "# Generated by scripts/genfiles.sh.  Any changes will be overwritten.\n" \
>> +	       "FILES ::=") > out_file;
> 
> I note the change to ::=.  Do you think we should do that across the
> board in our Makefiles?

POSIX specifies ::= and it has better semantics in most cases, but I
don't know if the BSD makes implement it.  ::= causes the RHS to be
expanded immediately, so subsequent changes to variables referenced by
it do not affect the LHS.

>> +	for (array_index = 1; array_index <= file_count; array_index += 1) {
>> +		printf " \\\n\t%s", files[array_index] > out_file;
>> +	}
>> +	printf ("\n\n" \
>> +"# These are separate because they need to be included, but putting\n" \
>> +"# them as make dependencies would confuse make.\n" \
>> +"LINKS ::=") > out_file;
>> +	for (array_index = 1; array_index <= symlink_count; array_index += 1) {
>> +		printf " \\\n\t%s", symlinks[array_index] > out_file;
>> +	}
>> +	printf "\n\nS6_RC_FILES ::=" > out_file;
>> +	for (array_index = 1; array_index <= rc_count; array_index += 1) {
>> +		printf " \\\n\t%s", rc_files[array_index] > out_file;
>> +	}
>> +	printf "\n" > out_file;
>> +	if (close(out_file)) {
>> +		print ("Cannot close output file: " ERRNO "\n") > "/dev/stderr";
>> +		exit 1;
>> +	}
>> +}
>> diff --git a/scripts/genfiles.sh b/scripts/genfiles.sh
>> new file mode 100755
>> index 0000000000000000000000000000000000000000..77a8d95e88b6851be9447698556efe4f1eab174b
>> --- /dev/null
>> +++ b/scripts/genfiles.sh
>> @@ -0,0 +1,29 @@
>> +#!/usr/bin/env -S LC_ALL=C LANGUAGE=C bash --
> 
> env -S is not portable, and I don't think anything here needs bash
> specifically.

$'\t' doesn't work with all shells, though I believe it is either
part of the current POSIX standard or will be added.  I'll use
/usr/bin/env bash, which breaks if the script is renamed to something
starting with '-'.
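
For reference, a shell without $'\t' could still get a literal tab from
printf.  This is only a sketch of that alternative, not what the patch
does, and the `tab` and `sorted` variable names are made up:

```shell
# Hypothetical alternative to $'\t': command substitution only strips
# trailing newlines, so the tab produced by printf survives.
tab=$(printf '\t')

# Sort "mode<TAB>path" records by path, as genfiles.sh does with
# sort -t $'\t' -k 2.
sorted=$(printf '100644\tb\n100755\ta\n' | sort -t "$tab" -k 2)
printf '%s\n' "$sorted"
```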

> We can set the locale variables after the script starts,
> because I don't think this wrapper script is going to do anything
> locale-specific. (And shouldn't they be C.UTF-8?)

The C locale is actually what I intended.  The script does not rely
on support for non-ASCII characters, and it does use the fact that
negated character classes match all bytes.  Admittedly, this will
only be needed if there is a git bug.

>> +set -euo pipefail
>> +unset output_file astatus
> 
> This is a bit overly defensive IMO.  Both of these variables are
> assigned before use, and if they weren't, the person making those
> changes would be very unlikely to not notice because they had those
> variables defined in their environment.

Fair!

>> +case $0 in
>> +(/*) cd "${0%/*}/..";;
>> +(*/*) cd "./${0%/*}/..";;
>> +(*) cd ..;;
>> +esac
> 
> Perhaps we could use git rev-parse --show-toplevel?

git ls-files doesn't have that option.

>> +for i in host/rootfs img/app vm/sys/net; do
>> +    output_file=$i/file-list.mk
>> +    {
>> +	git -C "$i" -c core.quotePath=true ls-files $'--format=%(objectmode)\t%(path)' -- image |
>> +	sort -t $'\t' -k 2
> 
> TIL sort -t and -k! 🤯
> 
>> +	echo DONE
> 
> Why do we need this?

To avoid producing any output file if the input is truncated.
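
Roughly, the idea looks like this.  It is a cut-down sketch, not the
real genfiles.awk, and `check_complete` is a hypothetical helper name:

```shell
# Sketch of the sentinel idea.  Lines are buffered and only printed if
# the producer got far enough to emit the final DONE line.
check_complete() {
	awk '
		/^DONE$/ { done = 1; next }
		{ lines[++n] = $0 }
		END {
			if (!done) {
				print "FATAL: Did not receive DONE line" > "/dev/stderr"
				exit 1
			}
			for (i = 1; i <= n; i++) print lines[i]
		}
	'
}

# Complete input: the buffered lines are passed through.
complete=$( { printf 'a\nb\n'; echo DONE; } | check_complete )

# Truncated input (no DONE): nothing is printed and awk exits 1.
truncated_status=0
truncated=$(printf 'a\nb\n' | check_complete 2>/dev/null) || truncated_status=$?
```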

>> +    } |
>> +    gawk -v "out_file=$output_file.tmp" -E scripts/genfiles.awk
> 
> Why not stdout?

The output file is created by awk so that it is only created if
nothing went wrong.
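
A minimal comparison of the two approaches, assuming a made-up failure
condition (a line starting with BAD) and made-up file names:

```shell
tmpdir=$(mktemp -d)

# Shell redirection: the shell creates the file before awk even runs,
# so it exists (empty) although awk failed.
printf 'BAD\n' | awk '/^BAD/ { exit 1 } { print }' > "$tmpdir/redirect.out" || true

# awk opens the file itself, and only in END after all checks passed,
# so a failure leaves no file behind.
printf 'BAD\n' | awk -v "out_file=$tmpdir/awk.out" '
	/^BAD/ { failed = 1; exit 1 }
	{ lines[++n] = $0 }
	END {
		if (failed) exit 1
		for (i = 1; i <= n; i++) print lines[i] > out_file
	}
' || true

if [ -e "$tmpdir/redirect.out" ]; then redirect_exists=yes; else redirect_exists=no; fi
if [ -e "$tmpdir/awk.out" ]; then awk_exists=yes; else awk_exists=no; fi
echo "redirect: $redirect_exists, awk: $awk_exists"
rm -rf -- "$tmpdir"
```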

> And why gawk?  I didn't immediately notice anything
> non-POSIX, and as usual would prefer to stick to it.

POSIX does not specify -E.  I can use -f instead, though.

>> +    if [ -f "$output_file" ]; then
>> +	    # Avoid changing output file if it is up to date, as that
>> +	    # would cause unnecessary rebuilds.
>> +	    if cmp -s -- "$output_file.tmp" "$output_file"; then
>> +		    rm -- "$output_file.tmp"
>> +		    continue
>> +	    else
>> +		    astatus=$?
>> +		    if [ "$astatus" != 1 ]; then exit "$astatus"; fi
> 
> Could avoid the need for the variable and multiple ifs.  Up to you
> whether you prefer it:
> 
> set +e
> cmp -s -- "$output_file.tmp" "$output_file"
> set -e
> case $? in
> 0)
> 	rm -- "$output_file.tmp"
>         continue
>         ;;
> 1)
> 	;;
> *)
> 	exit $?
>         ;;
> esac

This sets $? to the exit status of 'set -e' (0), so the case would
always take the 0 branch.  At the very least, that is not obvious from
reading the code.
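
A quick way to check, using `false` as a stand-in for a cmp that found
a difference (the variable names here are made up):

```shell
# $? read after `set -e` reports the status of `set -e` itself.
after_set=$(set +e; false; set -e; echo "$?")

# Capturing $? immediately after the command keeps the real status.
captured=$(set +e; false; status=$?; set -e; echo "$status")

echo "after set -e: $after_set; captured: $captured"
# prints "after set -e: 0; captured: 1"
```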

>> +	    fi
>> +    fi
>> +    mv -- "$output_file.tmp" "$output_file"
>> +done


-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
