git.postgresql.org Git - postgresql.git/log

Adjust signature of cluster_rel() and its subroutines

cluster_rel() receives the OID of the relation to process, which it
opens and locks; but then its subroutine copy_table_data() also receives
the relation OID and opens it by itself.  This is a bit wasteful.  It's
better to have cluster_rel() receive the relation already open, and pass
it down to its subroutines as necessary; then cluster_rel closes the rel
before returning.  This simplifies things.

But a better motivation to make this change is that a future command to
do logical-decoding-based "concurrent VACUUM FULL" will need to release
all locks on the relation (and possibly on the clustering index) at some
point.  Since it makes little sense to keep the relation reference
without the lock, the cluster_rel() function will also close it (and
the index).  With this arrangement, neither the function nor its
subroutines need open extra references, which, again, makes things simpler.

Author: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/82651.1720540558@antos

Fix UNION planner datatype issue

66c0185a3 gave the planner the ability to have union child queries
provide the union planner with pre-sorted input so that UNION queries
could be more efficiently implemented using Merge Append.

That commit overlooked checking that the UNION target list and the union
child target list's types all match. In some corner cases, this could
result in the planner producing sorts using the sort operator of the
top-level UNION's target list type rather than of the union child's
target list's type. The implications of this range from silently
working correctly, despite using the wrong sort operator all the way up
to a segmentation fault.

Here we fix by adjusting the planner so it makes no attempt to have the
subquery produce pre-sorted results when the data type of the UNION
target list and the types from the subquery target list don't match
exactly.

Backpatch to 17, where 66c0185a3 was introduced.

Reported-by: Jason Smith <dqetool@126.com>
Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us>
Bug: 18764
Discussion: https://postgr.es/m/18764-63ad667ea26e877a%40postgresql.org
Backpatch-through: 17

Merge pgstat_count_io_op_n() and pgstat_count_io_op()

The pgstat_count_io_op() function, which counts a single I/O operation,
wraps pgstat_count_io_op_n() with a counter value of 1. The latter is
declared in pgstat.h and used nowhere in the code, so let's remove it in
favor of the former.

This change makes also the code more symmetric with
pgstat_count_io_op_time(), that already uses a similar set of arguments,
except that it counts also the I/O time. This will ease a bit the
integration of a follow-up patch that adds byte-level tracking in
pg_stat_io for some of its attributes, lifting the current restriction
based on BLCKSZ as all I/O operations are assumed to be block-based.

Author: Nazir Bilal Yavuz
Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/CAN55FZ32ze812=yjyZg1QeXhKvACUM_Nu0_gyPQcUKKuVHL5xA@mail.gmail.com

Refactor some code related to backend statistics

This commit changes the way pending backend statistics are tracked by
moving them into a new structure called PgStat_BackendPending, removing
PgStat_BackendPendingIO. PgStat_BackendPending currently only includes
PgStat_PendingIO for the pending I/O stats.

pgstat_flush_backend() is extended with a "flags" argument to control
which parts of the stats of a backend should be flushed.

With this refactoring, it becomes easier to plug into backend statistics
more data. A patch to add information related to WAL in this stats kind
is under discussion.

Author: Bertrand Drouvot
Discussion: https://postgr.es/m/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal

Fix an ALTER GROUP ... DROP USER error message.

This error message stated the privileges required to add a member
to a group even if the user was trying to drop a member:

postgres=> alter group a drop user b;
ERROR:  permission denied to alter role
DETAIL:  Only roles with the ADMIN option on role "a" may add members.

Since the required privileges for both operations are the same, we
can fix this by modifying the message to mention both adding and
dropping members:

postgres=> alter group a drop user b;
ERROR:  permission denied to alter role
DETAIL:  Only roles with the ADMIN option on role "a" may add or drop members.

Author: ChangAo Chen
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/tencent_FAA0D00E3514AAF0BBB6322542A6094FEF05%40qq.com
Backpatch-through: 16

Use @extschema:name@ notation in contrib transform modules.

Harden hstore_plperl, hstore_plpython, and ltree_plpython
against search-path-based attacks by using @extschema:name@
notation to refer to the underlying hstore or ltree data type.

This allows removal of the previous documentation warning
suggesting that they must be installed in the same schema as
the underlying data type. In passing, also improve a para in
extend.sgml to suggest using @extschema:name@ for such purposes.

Discussion: https://postgr.es/m/692480.1736021695@sss.pgh.pa.us

Simplify signature of RewriteTable

This function doesn't need the lockmode to be passed: it was being used
to lock the new heap, but that's bogus, because the only caller has
already obtained the appropriate lock on the new heap (which is
unimportant anyway, because the relation's creation is not yet committed
and so no other session can see it).

Noticed while reviewed Antonin Houska's patch to add VACUUM FULL
CONCURRENTLY.

doc: Clarify synchronous_standby_names parameter.

The synchronous_standby_names GUC allows specifying num_sync,
the number of synchronous standbys transactions must wait for
replies from. This value must be an integer greater than zero.
This commit updates the documentation to clarify this requirement.

Reported-by: Asphator <asphator@gmail.com>
Discussion: https://postgr.es/m/18663-b02f75cb919f1b60@postgresql.org

Fix SLRU bank selection code

The originally submitted code (using bit masking) was correct when the
number of slots was restricted to be a power of two -- but that
limitation was removed during development that led to commit
53c2a97a9266, which made the bank selection code incorrect.  This led to
always using a smaller number of banks than available.  Change said code
to use integer modulo instead, which works correctly with an arbitrary
number of banks.

It's likely that we could improve on this to avoid runtime use of
integer division.  But with this change we're, at least, not wasting
memory on unused banks, and more banks mean less contention, which is
likely to have a much higher performance impact than a single
instruction's latency.

Author: Yura Sokolov <y.sokolov@postgrespro.ru>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/9444dc46-ca47-43ed-9058-89c456316306@postgrespro.ru

Fix off_t overflow in pg_basebackup on Windows.

walmethods.c used off_t to navigate around a pg_wal.tar file that could
exceed 2GB, which doesn't work on Windows and would fail with misleading
errors. Use pgoff_t instead.

Back-patch to all supported branches.

Author: Davinder Singh <davinder.singh@enterprisedb.com>
Reported-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Discussion: https://postgr.es/m/CAKZiRmyM4YnokK6Oenw5JKwAQ3rhP0YTz2T-tiw5dAQjGRXE3Q%40mail.gmail.com

Provide 64-bit ftruncate() and lseek() on Windows.

Change our ftruncate() macro to use the 64-bit variant of chsize(), and
add a new macro to redirect lseek() to _lseeki64().

Back-patch to all supported releases, in preparation for a bug fix.

Tested-by: Davinder Singh <davinder.singh@enterprisedb.com>
Discussion: https://postgr.es/m/CAKZiRmyM4YnokK6Oenw5JKwAQ3rhP0YTz2T-tiw5dAQjGRXE3Q%40mail.gmail.com

Fix duplicate typedef from commit a2f17f004d.

Reported-by: Thomas Munro

Control collation behavior with a method table.

Previously, behavior branched based on the provider. A method table is
less error-prone and more flexible.

The ctype behavior will be addressed in an upcoming commit.

Reviewed-by: Andreas Karlsson
Discussion: https://postgr.es/m/2830211e1b6e6a2e26d845780b03e125281ea17b.camel%40j-davis.com

Move code for collation version into provider-specific files.

Author: Andreas Karlsson
Discussion: https://postgr.es/m/4548a168-62cd-457b-8d06-9ba7b985c477%40proxel.se

Disallow NAMEDTUPLESTORE RTEs in stored views, rules, etc.

A named tuplestore is necessarily a transient object, so it makes
no sense to reference one in a persistent object such as a view.
We didn't previously prevent that, with the result that if you
tried you would get some weird failure about how the executor
couldn't find the tuplestore.

We can mechanize a check for this case cheaply by making dependency
extraction complain if it comes across such an RTE.  This is a
plausible way of dealing with it since part of the problem is that we
have no way to make a pg_depend representation of a named tuplestore.

Report and fix by Yugo Nagata.  Although this is an old problem,
it's a very weird corner case and there have been no reports from
end users.  So it seems sufficient to fix it in master.

Discussion: https://postgr.es/m/20240726160714.e74d0db579f2c017e1ca0b7e@sraoss.co.jp

Set exit status for pgindent if pg_bsd_indent fails

Also document the exit codes in the script.

The new exit code is 3, and is not overridden by the exit code set in
--check mode.

Author: Ashutosh Bapat

Discussion: https://postgr.es/m/CAExHW5sPRSiFeLdP-u1Fa5ba7YS2f0gvLjmKOobopKadJwQ_GQ@mail.gmail.com

plpgsql: pure parser and reentrant scanner

The plpgsql scanner is a wrapper around the core scanner, which
already uses the flex %option reentrant.  This patch only pushes up a
few levels the place where the scanner handle is allocated.  Before,
it was allocated in pl_scanner.c in a global variable, so to the
outside the scanner was not reentrant.  Now, it is allocated in
pl_comp.c and is passed as an argument to yyparse(), similar to how it
is handled in other reentrant scanners.

Also use flex yyextra to handle context information, instead of global
variables.  Again, this uses the existing yyextra support in the core
scanner.  This complements the other changes to make the scanner
reentrant.

The bison option %pure-parser is used to make the generated parser
pure.  This happens in the usual way, since plpgsql has its own bison
parser definition.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/eb6faeac-2a8a-4b69-9189-c33c520e5b7b@eisentraut.org

Remove useless function declaration

This function apparently never existed.

pg_freespacemap: Fix declaration of pg_freespace(regclass)

This function called generate_series() without enforcing its input
argument types, making possible for an attacker to catch this call, by
defining for example a generate_series(int,bigint).

The internals of pg_freespace(regclass) are changed to force the use of
bigint for the inputs of generate_series(). A more consistent style is
applied for all its hardcoded values, while on it.

Issue introduced in 3f323eba89fb.

Reported-by: Noah Misch
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/20250106190428.ec.nmisch@google.com

ExecInitAgg: update aggstate->numaggs and ->numtrans earlier.

Functions hash_agg_entry_size() and build_hash_tables() make use of
those values for memory size estimates.

Because this change only affects memory estimates, don't backpatch.

Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com

nodeSetOp.c: missing additionalsize for BuildTupleHashTable().

Provide additionalsize argument, which can affect the calculations for
'nbuckets'. Also, future work for Hash Aggregation will rely on the
correct additionalsize.

Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com

Remove unused TupleHashTableData->entrysize.

Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com

Add missing typedefs.list entry for AggStatePerGroupData.

Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com

Use PqMsg_* macros in postgres.c.

Commit f4b54e1ed9, which introduced macros for protocol characters,
missed updating a couple of places in postgres.c.

Author: Dave Cramer
Reviewed-by: Fabrízio de Royes Mello
Discussion: https://postgr.es/m/CADK3HHJUVBPoVOmFesPB-fN8_dYt%2BQELV2UB6jxOW2Z40qF-qw%40mail.gmail.com
Backpatch-through: 17

Add passwordcheck.min_password_length.

This new parameter can be used to change the minimum allowed
password length (in bytes). Note that it has no effect if a user
supplies a pre-encrypted password.

Author: Emanuele Musella, Maurizio Boriani
Reviewed-by: Tomas Vondra, Bertrand Drouvot, Japin Li
Discussion: https://postgr.es/m/CA%2BugDNyYtHOtWCqVD3YkSVYDWD_1fO8Jm_ahsDGA5dXhbDPwrQ%40mail.gmail.com

Lower default value of autovacuum_worker_slots in initdb as needed.

Commit c758119e5b increased the default number of semaphores
required for autovacuum workers from 3 to 16.  Unfortunately, some
systems have very low default settings for SEMMNS, and this change
moved the minimum required for Postgres well beyond that limit (see
commit 38da053463 for more details).

With this commit, initdb will lower the default value for
autovacuum_worker_slots as needed, just like it already does for
parameters such as max_connections and shared_buffers.  We test
for (max_connections / 6) slots, which conveniently has the
following properties:

* For the initial max_connections default of 100, the default of
  autovacuum_worker_slots will be 16, which is its initial default
  value specified in the documentation and in guc_tables.c.

* For the lowest possible max_connections default of 25, the
  default of autovacuum_worker_slots will be 4, which means we only
  need one additional semaphore for autovacuum workers (as compared
  to before commit c758119e5b).  This leaves some wiggle room for
  new auxiliary workers, etc. on systems with low SEMMNS, and it
  ensures that the default number of slots will be greater than or
  equal to the default value of autovacuum_max_workers (3).

Reported-by: Tom Lane
Suggested-by: Andres Freund
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/1346002.1736198977%40sss.pgh.pa.us

Fix error message wording

The originals are ambiguous and a bit out of style.

Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/202412141243.efesjyyvzxsz@alvherre.pgsql

Fix meson detection of a couple of 64 bit builtins.

A couple of checks were missed by commit 962da900, so we would fail to
detect the features.

Reported-by: Юрий Соколов <y.sokolov@postgrespro.ru>
Discussion: https://postgr.es/m/42C25E2A-6519-4549-9F47-6B0686E83836%40postgrespro.ru

Remove unnecessary code to handle CONSTR_NOTNULL

Commit 14e87ffa5c54 needlessly added support for CONSTR_NOTNULL entries
to StoreConstraints. It's dead code, so remove it.

To make the situation regarding constraint creation clearer, change
comments in heap_create_with_catalog, StoreConstraints, MergeAttributes
to explain which types of constraint are used on each.

Author: 何建 (Jian He) <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxFxzqrCiUNfjJ0tQU+=nKQkQCGtGzUBude=SMOwj5VNjQ@mail.gmail.com

Improve nbtree unsatisfiable RowCompare detection.

Move nbtree's detection of RowCompare quals that are unsatisfiable due
to having a NULL in their first row element: rather than detecting these
cases at the point where _bt_first builds its insertion scan key, do so
earlier, during preprocessing proper. This brings the RowCompare case
in line every other case involving an unsatisfiable-due-to-NULL qual.

nbtree now consistently detects such unsatisfiable quals -- even when
they happen to involve a key that isn't examined by _bt_first at all.
Affected cases thereby avoid useless full index scans that cannot
possibly return any matching rows.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzmySVXst2hFrOATC-zw1Byg1XC-jYUS314=mzuqsNwk+Q@mail.gmail.com

nbtree: Simplify _bt_first parallel scan handling.

This new structure relieves _bt_first from having separate calls to
_bt_start_array_keys for the serial case and parallel case. This saves
code, and seems clearer.

Follow-up to work from commits 4e6e375b and b5ee4e52.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=XjUZjBjHJdhTvuH5MwoJObWGoM2RG2LyFg5WUdWyk=A@mail.gmail.com

Remove unused parameter in lookup_var_attr_stats

The parameter 'rel' in lookup_var_attr_stats was once used to draw an
ERROR when ANALYZE failed to acquire sufficient data to build extended
statistics.  bf2a691e0 changed the logic to raise a WARNING in the
caller instead.  As a result, this parameter is no longer needed and
can be removed.  Since this is a static function, we can always easily
reintroduce the parameter if it's ever needed in the future.

Author: Ilia Evdokimov
Reviewed-by: Fabrízio de Royes Mello
Discussion: https://postgr.es/m/b3880f22-5808-4206-88d4-1553a81c3440@tantorlabs.com

Allow changing autovacuum_max_workers without restarting.

This commit introduces a new parameter named
autovacuum_worker_slots that controls how many autovacuum worker
slots to reserve during server startup. Modifying this new
parameter's value does require a server restart, but it should
typically be set to the upper bound of what you might realistically
need to set autovacuum_max_workers. With that new parameter in
place, autovacuum_max_workers can now be changed with a SIGHUP
(e.g., pg_ctl reload).

If autovacuum_max_workers is set higher than
autovacuum_worker_slots, a WARNING is emitted, and the server will
only start up to autovacuum_worker_slots workers at a given time.
If autovacuum_max_workers is set to a value less than the number of
currently-running autovacuum workers, the existing workers will
continue running, but no new workers will be started until the
number of running autovacuum workers drops below
autovacuum_max_workers.

Reviewed-by: Sami Imseih, Justin Pryzby, Robert Haas, Andres Freund, Yogesh Sharma
Discussion: https://postgr.es/m/20240410212344.GA1824549%40nathanxps13

Remove duplicate definitions in proc.h

These are also present in procnumber.h

Reported-by: Peter Eisentraut
Discussion: https://www.postgresql.org/message-id/bd04d675-4672-4f87-800a-eb5d470c15fc@eisentraut.org

flex code modernization: Replace YY_EXTRA_TYPE define with flex option

Replace #define YY_EXTRA_TYPE with %option extra-type. The latter is
the way recommended by the flex manual (available since flex 2.5.34).

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/eb6faeac-2a8a-4b69-9189-c33c520e5b7b@eisentraut.org

doc: Clarify log level for VERBOSE messages in maintenance commands.

VERBOSE messages from ANALYZE, CLUSTER, REINDEX, and VACUUM are logged
at the INFO level, but this detail was missing from the documentation.
This commit updates the docs to mention the log level for these messages.

Author: Masahiro Ikeda
Reviewed-by: Yugo Nagata
Discussion: https://postgr.es/m/b4a4b7916982dccd9607c8efb3ce5116@oss.nttdata.com

Always use the caller-provided context for radix tree leaves

Previously, it would not have worked for a caller to pass a slab
context, since it would have been used for other things which likely
had incompatible size. In an attempt to be helpful and avoid possible
space wastage due to aset's power-of-two rounding, RT_CREATE would
create an additional slab context if the value type was fixed-length
and larger than pointer size. The problem was, we have since added
the bump context type, and the generation context was a possibility as
well, so silently overriding the caller's choice may actually be worse.

Commit e8a6f1f908d arranged so that the caller-provided context is
used only for leaves, so it's safe for the caller to use slab here
if they wish. As demonstration, use slab in one of the radix tree
regression tests.

Reviewed by Masahiko Sawada

Discussion: https://postgr.es/m/CANWCAZZDCo4k5oURg_pPxM6+WZ1oiG=sqgjmQiELuyP0Vtrwig@mail.gmail.com

Get rid of radix tree's general purpose memory context

Previously, this was notionally used only for the entry point of the
tree and as a convenient parent for other contexts.

For shared memory, the creator previously allocated the entry point
in this context, but attaching backends didn't have access to that,
so they just used the caller's context. For the sake of consistency,
allocate every instance of an entry point in the caller's context.

For local memory, allocate the control object in the caller's context
as well. This commit also makes the "leaf context" the notional parent
of the child contexts used for nodes, so it's a bit of a misnomer,
but a future commit will make the node contexts independent rather
than children, so leave it this way for now to avoid code churn.

The memory context parameter for RT_CREATE is now unused in the case
of shared memory, so remove it and adjust callers to match.

In passing, remove unused "context" member from struct TidStore,
which seems to have been an oversight.

Reviewed by Masahiko Sawada

Discussion: https://postgr.es/m/CANWCAZZDCo4k5oURg_pPxM6+WZ1oiG=sqgjmQiELuyP0Vtrwig@mail.gmail.com

Use caller's memory context for radix tree iteration state

Typically only one iterator is present at any time, so it's overkill
to devote an entire context for this. Get rid of it and use the
caller's context.

This is tidy-up work, so no backpatch in this form. However, a
hypothetical extension to v17 that tried to start iteration from
an attaching backend would result in a crash, so that'll be fixed
separately in a way that doesn't change behavior in core.

Patch by me, reported and reviewed by Masahiko Sawada

Discussion: https://postgr.es/m/CAD21AoBB2U47V=F+wQRB1bERov_of5=BOZGaybjaV8FLQyqG3Q@mail.gmail.com

Remove useless configure check

The test for "decltype" as a variant of "typeof" apparently never
worked (see also commit 3582b223d49), so remove it.

Discussion: https://www.postgresql.org/message-id/flat/795b1c54-c64a-47f9-8fa3-880dcab59975%40eisentraut.org

meson: Fix missing name arguments of cc.compiles() calls

Without it, the check won't show up in the meson setup/configure
output.

Discussion: https://www.postgresql.org/message-id/flat/795b1c54-c64a-47f9-8fa3-880dcab59975%40eisentraut.org

Document strange jsonb sort order for empty top level arrays

Slightly faulty logic in the original jsonb code (commit d9134d0a355)
results in an empty top level array sorting less than a json null. We
can't change the sort order now since it would affect btree indexes over
jsonb, so document the anomaly.

Backpatch to all live branches (13 .. 17)

In master, also add a code comment noting the anomaly.

Reported-by: Yan Chengpen
Reviewed-by: Jian He
Discussion: https://postgr.es/m/OSBPR01MB45199DD8DA2D1CECD50518188E272@OSBPR01MB4519.jpnprd01.prod.outlook.com

Ignore nullingrels when looking up statistics

When looking up statistical data about an expression, we do not need
to concern ourselves with the outer joins that could null the
Vars/PHVs contained in the expression.  Accounting for nullingrels in
the expression could cause estimate_num_groups to count the same Var
multiple times if it's marked with different nullingrels.  This is
incorrect, and could lead to "ERROR:  corrupt MVNDistinct entry" when
searching for multivariate n-distinct.

Furthermore, the nullingrels could prevent us from matching an
expression to expressional index columns or to the expressions in
extended statistics, leading to inaccurate estimates.

To fix, strip out all the nullingrels from the expression before we
look up statistical data about it.  There is one ensuing plan change
in the regression tests, but it looks reasonable and does not
compromise its original purpose.

This patch could result in plan changes, but it fixes an actual bug,
so back-patch to v16 where the outer-join-aware-Var infrastructure was
introduced.

Author: Richard Guo
Discussion: https://postgr.es/m/CAMbWs4-2Z4k+nFTiZe0Qbu5n8juUWenDAtMzi98bAZQtwHx0-w@mail.gmail.com

Fix outdated CHUNKHDRSZ value in nodeAgg.c

CHUNKHDRSZ was defined as 16 bytes, which was true when that code went in,
but since c6e0fe1f2, 8 is a more accurate value. Here we adjust it to use
sizeof(MemoryChunk), which is normally 8, or 16 for cassert builds.

c6e0fe1f2 first appeared in v16, so this is technically wrong in v16 up
to master, but let's apply this only to master as adjusting this does
influence the estimated number of batches in the aggregate costing code
and we don't want to cause plan instability in released versions.

Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAApHDvpMpRQvsTqZo3FinXkgytwxwF8sCyZm83xDj-1s_hLe+w@mail.gmail.com

Fix an assortment of spelling mistakes and typos

Author: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/5812a0b9-b0cf-4151-9a14-d9f00e4f2858@gmail.com

Update copyright for 2025

Backpatch-through: 13

Update obsolete reference to plpgsql's gram.y file.

This was evidently missed in 05346c131, which renamed that
file to pl_gram.y.

Japin Li

Discussion: https://postgr.es/m/ME0P300MB0445F7CA7456C2AC67D37A01B6092@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM

injection_points: Tweak variable-numbered stats to work with pending data

As coded, the module was not using pending entries to store its data
locally before doing a flush to the central dshash with a timed
pgstat_report_stat() call. Hence, the flush callback was defined, but
finished by being not used. As a template, this is more efficient than
the original logic of updating directly the shared memory entries as
this reduces the interactions that need to be done with the pgstats
hash table in shared memory.

injection_stats_flush_cb() was also missing a pgstat_unlock_entry(), so
add one, while on it.

Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/Z3JbLhKFFm6kKfT8@ip-10-97-1-34.eu-west-3.compute.internal

Fix memory leak in pgoutput with relation attribute map

pgoutput caches the attribute map of a relation, that is free()'d only
when validating a RelationSyncEntry. However, this code path is not
taken when calling any of the SQL functions able to do some logical
decoding, like pg_logical_slot_{get,peek}_changes(), leaking some memory
into CacheMemoryContext on repeated calls.

To address this, a relation's attribute map is allocated in
PGOutputData's cachectx, free()'d at the end of the execution of these
SQL functions when logical decoding ends. This is available down to 15.
v13 and v14 have a similar leak, which will be dealt with later.

Reported-by: Masahiko Sawada
Author: Vignesh C
Reviewed-by: Hou Zhijie
Discussion: https://postgr.es/m/CAD21AoDkAhQVSukOfH3_reuF-j4EU0-HxMqU3dU+bSTxsqT14Q@mail.gmail.com
Discussion: https://postgr.es/m/CALDaNm1hewNAsZ_e6FF52a=9drmkRJxtEPrzCB6-9mkJyeBBqA@mail.gmail.com
Backpatch-through: 15

Remove redundant wording in pg_statistic.h

Author: Junwang Zhao
Discussion: https://postgr.es/m/CAEG8a3JbMCHna=N5ZSx6huLnTDfW34kw7Pf2n8+3M-9UrrwesA@mail.gmail.com

Fix failures with incorrect epoch handling for 2PC files at recovery

At the beginning of recovery, an orphaned two-phase file in an epoch
different than the one defined in the checkpoint record could not be
removed based on the assumptions that AdjustToFullTransactionId() relies
on, assuming that all files would be either from the current epoch or
from the previous epoch.

If the checkpoint epoch was 0 while the 2PC file was orphaned and in the
future, AdjustToFullTransactionId() would underflow the epoch used to
build the 2PC file path. In non-assert builds, this would create a
WARNING message referring to a 2PC file with an epoch of "FFFFFFFF" (or
UINT32_MAX), as an effect of the underflow calculation, leaving the
orphaned file around.

Some tests are added with dummy 2PC files in the past and the future,
checking that these are properly removed.

Issue introduced by 5a1dfde8334b, that has switched two-phase state
files to use FullTransactionIds.

Reported-by: Vitaly Davydov
Author: Michael Paquier
Reviewed-by: Vitaly Davydov
Discussion: https://postgr.es/m/13b5b6-676c3080-4d-531db900@47931709
Backpatch-through: 17

Fix handling of orphaned 2PC files in the future at recovery

Before 728bd991c3c4, that has improved the support for 2PC files during
recovery, the initial logic scanning files in pg_twophase was done so as
files in the future of the transaction ID horizon were checked first,
followed by a check if a transaction ID is aborted or committed which
could involve a pg_xact lookup.  After this commit, these checks have
been done in reverse order.

Files detected as in the future do not have a state that can be checked
in pg_xact, hence this caused recovery to fail abruptly should an
orphaned 2PC file in the future of the transaction ID horizon exist in
pg_twophase at the beginning of recovery.

A test is added to check for this scenario, using an empty 2PC with a
transaction ID large enough to be in the future when running the test.
This test is added in 16 and older versions for now.  17 and newer
versions are impacted by a second bug caused by the addition of the
epoch in the 2PC file names.  An equivalent test will be added in these
branches in a follow-up commit, once the second set of issues reported
are fixed.

Author: Vitaly Davydov, Michael Paquier
Discussion: https://postgr.es/m/11e597-676ab680-8d-374f23c0@145466129
Backpatch-through: 13

contrib/pageinspect: Use SQL-standard function bodies.

In the same spirit as 969bbd0fa, 13e3796c9, 3f323eba8.

Tom Lane and Ronan Dunklau

Discussion: https://postgr.es/m/3316564.aeNJFYEL58@aivenlaptop

contrib/xml2: Use SQL-standard function bodies.

In the same spirit as 969bbd0fa, 13e3796c9, 3f323eba8.

Tom Lane and Ronan Dunklau

Discussion: https://postgr.es/m/3316564.aeNJFYEL58@aivenlaptop

contrib/citext: Use SQL-standard function bodies.

In the same spirit as 969bbd0fa, 13e3796c9, 3f323eba8.

Tom Lane and Ronan Dunklau

Discussion: https://postgr.es/m/3316564.aeNJFYEL58@aivenlaptop

Fix overly large values/nulls arrays

These arrays were sized with Natts_pg_trigger (19) when they should have
been sized with Natts_pg_event_trigger (7). We'd better fix this as
it's clearly a mistake and it could become problematic if
pg_event_trigger were to gain a dozen or so more columns in the future.

No backpatch as there's no actual bug and the column count on those
tables isn't going to change in released versions.

Author: Xin Zhang <zhanghien@qq.com>
Discussion: https://postgr.es/m/tencent_05AD0FB321A414EC3661204D2102AA6EF605@qq.com

Replace PGPROC.isBackgroundWorker with isRegularBackend.

Commit 34486b609 effectively redefined isBackgroundWorker as meaning
"not a regular backend", whereas before it had the narrower
meaning of AmBackgroundWorkerProcess(). For clarity, rename the
field to isRegularBackend and invert its sense.

Discussion: https://postgr.es/m/1808397.1735156190@sss.pgh.pa.us

Exclude parallel workers from connection privilege/limit checks.

Cause parallel workers to not check datallowconn, rolcanlogin, and
ACL_CONNECT privileges.  The leader already checked these things
(except for rolcanlogin which might have been checked for a different
role).  Re-checking can accomplish little except to induce unexpected
failures in applications that might not even be aware that their query
has been parallelized.  We already had the principle that parallel
workers rely on their leader to pass a valid set of authorization
information, so this change just extends that a bit further.

Also, modify the ReservedConnections, datconnlimit and rolconnlimit
logic so that these limits are only enforced against regular backends,
and only regular backends are counted while checking if the limits
were already reached.  Previously, background processes that had an
assigned database or role were subject to these limits (with rather
random exclusions for autovac workers and walsenders), and the set of
existing processes that counted against each limit was quite haphazard
as well.  The point of these limits, AFAICS, is to ensure the
availability of PGPROC slots for regular backends.  Since all other
types of processes have their own separate pools of PGPROC slots, it
makes no sense either to enforce these limits against them or to count
them while enforcing the limit.

While edge-case failures of these sorts have been possible for a
long time, the problem got a good deal worse with commit 5a2fed911
(CVE-2024-10978), which caused parallel workers to make some of these
checks using the leader's current role where before we had used its
AuthenticatedUserId, thus allowing parallel queries to fail after
SET ROLE.  The previous behavior was fairly accidental and I have
no desire to return to it.

This patch includes reverting 73c9f91a1, which was an emergency hack
to suppress these same checks in some cases.  It wasn't complete,
as shown by a recent bug report from Laurenz Albe.  We can also revert
fd4d93d26 and 492217301, which hacked around the same problems in one
regression test.

In passing, remove the special case for autovac workers in
CheckMyDatabase; it seems cleaner to have AutoVacWorkerMain pass
the INIT_PG_OVERRIDE_ALLOW_CONNS flag, now that that does what's
needed.

Like 5a2fed911, back-patch to supported branches (which sadly no
longer includes v12).

Discussion: https://postgr.es/m/1808397.1735156190@sss.pgh.pa.us

Reserve a PGPROC slot and semaphore for the slotsync worker process.

The need for this was missed in commit 93db6cbda, with the result
being that if we launch a slotsync worker it would consume one of
the PGPROCs in the max_connections pool.  That could lead to inability
to launch the worker, or to subsequent failures of connection requests
that should have succeeded according to the configured settings.

Rather than create some one-off infrastructure to support this,
let's group the slotsync worker with the existing autovac launcher
in a new category of "special worker" processes.  These are kind of
like auxiliary processes, but they cannot use that infrastructure
because they need to be able to run transactions.

For the moment, make these processes share the PGPROC freelist
used for autovac workers (which previously supplied the autovac
launcher too).  This is partly to avoid an ABI change in v17,
and partly because it seems silly to have a freelist with
at most two members.  This might be worth revisiting if we grow
enough workers in this category.

Tom Lane and Hou Zhijie.  Back-patch to v17.

Discussion: https://postgr.es/m/1808397.1735156190@sss.pgh.pa.us

In REASSIGN OWNED of a database, lock the tuple as mandated.

Commit aac2c9b4fde889d13f859c233c2523345e72d32b mandated such locking
and attempted to fulfill that mandate, but it missed REASSIGN OWNED.
Hence, it remained possible to lose VACUUM's inplace update of
datfrozenxid if a REASSIGN OWNED processed that database at the same
time.  This didn't affect the other inplace-updated catalog, pg_class.
For pg_class, REASSIGN OWNED calls ATExecChangeOwner() instead of the
generic AlterObjectOwner_internal(), and ATExecChangeOwner() fulfills
the locking mandate.

Like in GRANT, implement this by following the locking protocol for any
catalog subject to the generic AlterObjectOwner_internal().  It would
suffice to do this for IsInplaceUpdateOid() catalogs only.  Back-patch
to v13 (all supported versions).

Kirill Reshke.  Reported by Alexander Kukushkin.

Discussion: https://postgr.es/m/CAFh8B=mpKjAy4Cuun-HP-f_vRzh2HSvYFG3rhVfYbfEBUhBAGg@mail.gmail.com

Speedup tuple deformation with additional function inlining

This adjusts slot_deform_heap_tuple() to add special-case loops to
eliminate much of the branching that was done within the body of the
main deform loop.

Previously, while looping over each attribute to deform,
slot_deform_heap_tuple() would always recheck if the given attribute was
NULL by looking at HeapTupleHasNulls() and if so, went on to check the
tuple's NULL bitmap.  Since many tuples won't contain any NULLs, we can
just check HeapTupleHasNulls() once and when there are no NULLs, use a
more compact version of the deforming loop which contains no NULL checking
code at all.

The same is possible for the "slow" mode checking part of the loop.  That
variable was checked several times for each attribute, once to determine
if the offset to the attribute value could be taken from the attcacheoff,
and again to check if the offset could be cached for next time.

These "slow" checks can mostly be eliminated by instead having multiple
loops.  Initially, we can start in the non-slow loop and break out of
that loop if and only if we must stop caching the offset.  This
eliminates branching for both slow and non-slow deforming methods.  The
amount of code required for the no nulls / non-slow version is very
small.  It's possible to have separate loops like this due to the fact
that once we move into slow mode, we never need to switch back into
non-slow mode for a given tuple.

We have the compiler take care of writing out the multiple required
loops by having a pg_attribute_always_inline function which gets called
various times passing in constant values for the "slow" and "hasnulls"
parameters.  This allows the compiler to eliminate const-false branches
and remove comparisons for const-true ones.

This commit has shown overall query performance increases of around 5-20%
in deform-heavy OLAP-type workloads.

Author: David Rowley
Reviewed-by: Victor Yegorov
Discussion: https://postgr.es/m/CAGnEbog92Og2CpC2S8=g_HozGsWtt_3kRS1sXjLz0jKSoCNfLw@mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvo9e0XG71WrefYaRv5n4xNPLK4k8LjD0mSR3c9KR2vi2Q@mail.gmail.com

Improve handling of date_trunc() units for infinite input values

Previously, if an infinite value was passed to date_trunc(), then the
same infinite value would always be returned regardless of the field
unit given by the caller. This commit updates the function so that an
error is returned when an invalid unit is passed to date_trunc() with an
infinite value.

This matches the behavior of date_trunc() with a finite value and
date_part() with an infinite value, making the handling of interval,
timestamp and timestamptz more consistent across the board for these two
functions.

Some tests are added to cover all these new failure cases, with an
unsupported unit and infinite values for the three data types. There
were no test cases in core that checked all these patterns up to now.

Author: Joseph Koshakow
Discussion: https://postgr.es/m/CAAvxfHc4084dGzEJR0_pBZkDuqbPGc5wn7gK_M0XR_kRiCdUJQ@mail.gmail.com

Remove unused totalrows parameter in compute_expr_stats

The totalrows parameter in compute_expr_stats is unused, so remove it.
This is a static function, so the parameter can easily be added again if
it's ever needed.

Author: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Discussion: https://postgr.es/m/667b92d2-f953-4fcb-9377-3765f5b94187@tantorlabs.com

plpgsql: Rename a variable for clarity

Rename "core_yy_extra_type core_yy" to "core_yy_extra". The previous
name was a bit unclear and confusing. The new name matches the name
used elsewhere for the same purpose, for example in
src/backend/parser/gramparse.h.

Fix typo in comment of compute_return_type() in functioncmds.c

Author: Japin Li
Discussion: https://postgr.es/m/ME0P300MB0445D51BCFA8680F0B35FD6EB60C2@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM

meson: Export all libcommon functions in Windows builds

This fixes "unresolved external symbol" errors with extensions that
use functions from libpgport that need special CFLAGS to
compile. Currently, that includes the CRC-32 functions.

Commit 2571c1d5cc did this for libcommon, but I missed that libpqport
has the same issue.

Reported-by: Tom Lane
Backpatch-through: 16, where Meson was introduced
Discussion: https://www.postgresql.org/message-id/CAOdR5yF0krWrxycA04rgUKCgKugRvGWzzGLAhDZ9bzNv8g0Lag@mail.gmail.com

Add commit 301de6a6f60 to .git-blame-ignore-revs.

Partial pgindent of .l and .y files

Trying to clean up the code a bit while we're working on these files
for the reentrant scanner/pure parser patches. This cleanup only
touches the code sections after the second '%%' in each file, via a
manually-supervised and locally hacked up pgindent.

meson: Export all libcommon functions in Windows builds

This fixes "unresolved external symbol" errors with extensions that
use functions from libcommon. This was reported with pgvector.

Reported-by: Andrew Kane
Author: Vladlen Popolitov
Backpatch-through: 16, where Meson was introduced
Discussion: https://www.postgresql.org/message-id/CAOdR5yF0krWrxycA04rgUKCgKugRvGWzzGLAhDZ9bzNv8g0Lag@mail.gmail.com

guc: reentrant scanner

Use the flex %option reentrant to make the generated scanner
reentrant, and perhaps eventually thread-safe, but that will require
additional work.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/eb6faeac-2a8a-4b69-9189-c33c520e5b7b@eisentraut.org

jsonpath scanner: reentrant scanner

Use the flex %option reentrant to make the generated scanner
reentrant and thread-safe.  Note: The parser was already pure.

Simplify flex scan buffer management: Instead of constructing the
buffer from pieces and then using yy_scan_buffer(), we can just use
yy_scan_string(), which does the same thing internally.  (Actually, we
use yy_scan_bytes() here because we already have the length.)

Use flex yyextra to handle context information, instead of global
variables.  This complements the other changes to make the scanner
reentrant.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/eb6faeac-2a8a-4b69-9189-c33c520e5b7b@eisentraut.org

Fix nbtree symbol name comment reference.

Oversight in commit 5bf748b86b.

syncrep parser: pure parser and reentrant scanner

Use the flex %option reentrant and the bison option %pure-parser to
make the generated scanner and parser pure, reentrant, and
thread-safe.

Make the generated scanner use palloc() etc. instead of malloc() etc.
Previously, we only used palloc() for the buffer, but flex would still
use malloc() for its internal structures.  Now, all the memory is
under palloc() control.

Simplify flex scan buffer management: Instead of constructing the
buffer from pieces and then using yy_scan_buffer(), we can just use
yy_scan_string(), which does the same thing internally.

The previous code was necessary because we allocated the buffer with
palloc() and the rest of the state was handled by malloc().  But this
is no longer the case; everything is under palloc() now.

Use flex yyextra to handle context information, instead of global
variables.  This complements the other changes to make the scanner
reentrant.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/eb6faeac-2a8a-4b69-9189-c33c520e5b7b@eisentraut.org

replication parser: pure parser and reentrant scanner

Use the flex %option reentrant and the bison option %pure-parser to
make the generated scanner and parser pure, reentrant, and
thread-safe.

Make the generated scanner use palloc() etc. instead of malloc() etc.
Previously, we only used palloc() for the buffer, but flex would still
use malloc() for its internal structures.  As a result, there could be
some small memory leaks in case of uncaught errors.  Now, all the
memory is under palloc() control, so there are no more such issues.

Simplify flex scan buffer management: Instead of constructing the
buffer from pieces and then using yy_scan_buffer(), we can just use
yy_scan_string(), which does the same thing internally.

The previous code was necessary because we allocated the buffer with
palloc() and the rest of the state was handled by malloc().  But this
is no longer the case; everything is under palloc() now.

Use flex yyextra to handle context information, instead of global
variables.  This complements the other changes to make the scanner
reentrant.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Co-authored-by: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/eb6faeac-2a8a-4b69-9189-c33c520e5b7b@eisentraut.org

Remove pgrminclude and associated scripts

Per git log, the last time someone tried to do something with
pgrminclude was around 2011. And it's always had a tendency of
causing trouble when it was active. Also, pgcominclude is redundant
with headerscheck.

Discussion: https://www.postgresql.org/message-id/flat/2d4dc7b2-cb2e-49b1-b8ca-ba5f7024f05b%40eisentraut.org

Remove pgrminclude annotations

Per git log, the last time someone tried to do something with
pgrminclude was around 2011. Many (not all) of the "pgrminclude
ignore" annotations are of a newer date but seem to have just been
copied around during refactorings and file moves and don't seem to
reflect an actual need anymore.

There have been some parallel experiments with include-what-you-use
(IWYU) annotations, but these don't seem to correspond very strongly
to pgrminclude annotations, so there is no value in keeping the
existing ones even for that kind of thing.

So, wipe them all away. We can always add new ones in the future
based on actual needs.

Discussion: https://www.postgresql.org/message-id/flat/2d4dc7b2-cb2e-49b1-b8ca-ba5f7024f05b%40eisentraut.org

Fix race condition in TupleDescCompactAttr assert code

5983a4cff added CompactAttribute as an abbreviated alternative to
FormData_pg_attribute to allow more cache-friendly processing in tasks
related to TupleDescs.  That commit contained some assert-only code to
check that the CompactAttribute had been populated correctly, however,
the method used to do that checking caused the TupleDesc's
CompactAttribute to be zeroed before it was repopulated and compared to
the snapshot taken before the memset call.  This caused issues as the type
cache caches TupleDescs in shared memory which can be used by multiple
backend processes at the same time.  There was a window of time between
the zero and repopulation of the CompactAttribute where another process
would mistakenly think that the CompactAttribute is invalid due to the
memset.

To fix this, instead of taking a snapshot of the CompactAttribute and
calling populate_compact_attribute() and comparing the snapshot to the
freshly populated TupleDesc's CompactAttribute, refactor things so we
can just populate a temporary CompactAttribute on the stack.  This way
we don't touch the TupleDesc's memory.

Reported-by: Alexander Lakhin, SQLsmith
Discussion: https://postgr.es/m/ca3a256a-5d12-42db-aabe-a75a030d9fb9@gmail.com

Try to avoid semaphore-related test failures on NetBSD/OpenBSD.

These two platforms have a remarkably tight default limit on the
number of SysV semaphores in the system: SEMMNS is only 60
out-of-the-box.  Unless manual action is taken to raise that,
we'll only be able to allocate 3 sets of 16 usable semaphores
each, leading to initdb setting max_connections to just 20.
That's problematic because the core regression tests expect
to be able to launch 20 concurrent sessions, leaving us with
no headroom.  This seems to be the cause of intermittent
buildfarm failures on some machines.

While there's no getting around the fact that you'd better raise
SEMMNS for production use on these platforms, it does seem desirable
for "make check" to pass reliably without that.  We can make that
happen, at least for awhile longer, with two small changes:

* Change sysv_sema.c's SEMAS_PER_SET to 19, so that we can eat up
all of the available semas not just most of them.

* Change initdb to make the smallest max_connections value it will
consider be 25 not 20.

As of HEAD this will leave us with four free semaphores (using the
default values for other relevant parameters such as max_wal_senders).
So we won't need to consider this again until we've invented five
more background processes.  Maybe by then we can switch both these
platforms to some other semaphore API.

For the moment, do this only in master; there've not been field
complaints that might justify a back-patch.

Discussion: https://postgr.es/m/db2773a2-aca0-43d0-99c1-060efcd9954e@gmail.com

Reset btpo_cycleid in nbtree VACUUM's REDO routine.

Reset btpo_cycleid to 0 in btree_xlog_vacuum for consistency with
_bt_delitems_vacuum (the corresponding original execution code). This
makes things neater.

There might be some performance benefit to being consistent like this.
When btvacuumpage doesn't call _bt_delitems_vacuum, it can still
proactively reset btpo_cycleid to 0 via a separate hint-like update
mechanism (it does so whenever it sees that it isn't already set to 0).
And so it's possible that being consistent about resetting btpo_cycleid
like this will save work later on, after standby promotion: subsequent
VACUUMs won't need to clear btpo_cycleid using the hint-like update
mechanism as often as they otherwise would.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CAH2-Wz=+LDFxn9NZyEsCo8ifcyKt6+n-VLyygySEHgMz+oynqw@mail.gmail.com

postgres_fdw: re-issue cancel requests a few times if necessary.

Despite the best efforts of commit 0e5c82380, we're still seeing
occasional failures of postgres_fdw's query_cancel test in the
buildfarm.  Investigation suggests that its 100ms timeout is
still not enough to reliably ensure that the remote side starts
the query before receiving the cancel request --- and if it
hasn't, it will just discard the request because it's idle.

We discussed allowing a cancel request to kill the next-received
query, but that would have wide and perhaps unpleasant side-effects.
What seems safer is to make postgres_fdw do what a human user would
likely do, which is issue another cancel request if the first one
didn't seem to do anything.  We'll keep the same overall 30 second
grace period before concluding things are broken, but issue additional
cancel requests after 1 second, then 2 more seconds, then 4, then 8.
(The next one in series is 16 seconds, but we'll hit the 30 second
timeout before that.)

Having done that, revert the timeout in query_cancel.sql to 10 ms.
That will still be enough on most machines, most of the time, for
the remote query to start; but now we're intentionally risking the
race condition occurring sometimes in the buildfarm, so that the
repeat-cancel code path will get some testing.

As before, back-patch to v17.  We might eventually contemplate
back-patching this further, and/or adding similar logic to dblink.
But given the lack of field complaints to date, this feels like
mostly an exercise in test case stabilization, so v17 is enough.

Discussion: https://postgr.es/m/colnv3lzzmc53iu5qoawynr6qq7etn47lmggqr65ddtpjliq5d@glkveb4m6nop

Don't allow GetTransactionSnapshot() in logical decoding

A historic snapshot should only be used for catalog access, not
general queries. We never call GetTransactionSnapshot() during logical
decoding, which is good because it wouldn't be very sensible, so the
code to deal with that was unreachable and untested. Turn it into an
error, to avoid doing that in the future either.

Discussion: https://www.postgresql.org/message-id/a868fe78-ddb4-4b0a-9b96-873d91d93cfd@iki.fi

Remove unnecessary GetTransactionSnapshot() calls

In get_database_list() and get_subscription_list(), the
GetTransactionSnapshot() call is not required because the catalog
table scans use the catalog snapshot, which is held until the end of
the scan. See table_beginscan_catalog(), which calls
RegisterSnapshot(GetCatalogSnapshot(relid)).

In InitPostgres, it's a little less obvious that it's not required,
but still true I believe. All the catalog lookups in InitPostgres()
also use the catalog snapshot, and the looked up values are copied
while still holding the snapshot.

Furthermore, as the removed FIXME comments said, calling
GetTransactionSnapshot() didn't really prevent MyProc->xmin from being
reset anyway.

Discussion: https://www.postgresql.org/message-id/7c56f180-b9e1-481e-8c1d-efa63de3ecbb@iki.fi

Fix incorrect source filename references

Jian He reported the src/include/utility/tcop.h one and the remainder
were found by using a script to look for src/* and check that we have a
filename or directory of that name.

In passing, fix some out-date comments.

Reported-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CACJufxGoE3H-7VgO02=PrR4SNuVWDVbfTyUnwO0HvS-Lxurnog@mail.gmail.com

Fix some comments related to library unloading

Library unloading has never been supported with its code removed in
ab02d702ef08, and there were some comments still mentioning that it was
a possible operation.

ChangAo has noticed the incorrect references in dfmgr.c, while I have
noticed the other ones while scanning the rest of the tree for similar
mistakes.

Author: ChangAo Chen, Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/tencent_1D09840A1632D406A610C8C4E2491D74DB0A@qq.com

Update TransactionXmin when MyProc->xmin is updated

GetSnapshotData() set TransactionXmin = MyProc->xmin, but when
SnapshotResetXmin() advanced MyProc->xmin, it did not advance
TransactionXmin correspondingly. That meant that TransactionXmin could
be older than MyProc->xmin, and XIDs between than TransactionXmin and
the real MyProc->xmin could be vacuumed away. One known consequence is
in pg_subtrans lookups: we might try to look up the status of an XID
that was already truncated away.

Back-patch to all supported versions.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/d27a046d-a1e4-47d1-a95c-fbabe41debb4@iki.fi

Optimize alignment calculations in tuple form/deform

Here we convert CompactAttribute.attalign from a char, which is directly
derived from pg_attribute.attalign into a uint8, which stores the number
of bytes to align the column's value by in the tuple.

This allows tuple deformation and tuple size calculations to move away
from using the inefficient att_align_nominal() macro, which manually
checks each TYPALIGN_* char to translate that into the alignment bytes
for the given type. Effectively, this commit changes those to TYPEALIGN
calls, which are branchless and only perform some simple arithmetic with
some bit-twiddling.

The removed branches were often mispredicted by CPUs, especially so in
real-world tables which often contain a mishmash of different types
with different alignment requirements.

Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com

Mark CatalogSnapshotData static

Like CurrentSnapshotData, it should not be accessed directly outside
snapmgr.c.

Fix variable reference in comment

This used to say "nsubxcnt isn't decreased when subtransactions
abort", but there's no variable called nsubxcnt. Commit 8548ddc61b
changed it to "subxcnt", among other typo fixes, but that was wrong
too: the comment actually talks about txn->nsubtxns. That's the field
that's incremented but never decremented and is used for the
allocation earlier in the function.

Fix overflow danger in SampleHeapTupleVisible(), take 2

28328ec87b45725 addressed one overflow danger in
SampleHeapTupleVisible() but introduced another, albeit a less likely
one. Modify the binary search code to remove this danger.

Reported-by: Richard Guo
Reviewed-by: Richard Guo, Ranier Vilela
Discussion: https://postgr.es/m/CAMbWs4_bE%2BNscChbKWzw6HZOipCUyXfA5133qvoXQ654D3B2gQ%40mail.gmail.com

Fix corruption when relation truncation fails.

RelationTruncate() does three things, while holding an
AccessExclusiveLock and preventing checkpoints:

1. Logs the truncation.
2. Drops buffers, even if they're dirty.
3. Truncates some number of files.

Step 2 could previously be canceled if it had to wait for I/O, and step
3 could and still can fail in file APIs.  All orderings of these
operations have data corruption hazards if interrupted, so we can't give
up until the whole operation is done.  When dirty pages were discarded
but the corresponding blocks were left on disk due to ERROR, old page
versions could come back from disk, reviving deleted data (see
pgsql-bugs #18146 and several like it).  When primary and standby were
allowed to disagree on relation size, standbys could panic (see
pgsql-bugs #18426) or revive data unknown to visibility management on
the primary (theorized).

Changes:

* WAL is now unconditionally flushed first
* smgrtruncate() is now called in a critical section, preventing
   interrupts and causing PANIC on file API failure
* smgrtruncate() has a new parameter for existing fork sizes,
   because it can't call smgrnblocks() itself inside a critical section

The changes apply to RelationTruncate(), smgr_redo() and
pg_truncate_visibility_map().  That last is also brought up to date with
other evolutions of the truncation protocol.

The VACUUM FileTruncate() failure mode had been discussed in older
reports than the ones referenced below, with independent analysis from
many people, but earlier theories on how to fix it were too complicated
to back-patch.  The more recently invented cancellation bug was
diagnosed by Alexander Lakhin.  Other corruption scenarios were spotted
by me while iterating on this patch and earlier commit 75818b3a.

Back-patch to all supported releases.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reported-by: rootcause000@gmail.com
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/18146-04e908c662113ad5%40postgresql.org
Discussion: https://postgr.es/m/18426-2d18da6586f152d6%40postgresql.org

Remove pg_attribute.attcacheoff column

The column is no longer needed as the offset is now cached in the
CompactAttribute struct per commit 5983a4cff.

Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com

Relax regression test for fsync check of backend-level stats

One test added in 9aea73fc61d4 did not take into account that the
backend may have some fsync even after a checkpoint. Let's relax it to
be more flexible.

Per report from buildfarm member grassquit, via Alexander Lakhin.

Author: Bertrand Drouvot
Discussion: https://postgr.es/m/6143ab0a-9e88-4790-8d9d-50ba45657761@gmail.com

Introduce CompactAttribute array in TupleDesc, take 2

The new compact_attrs array stores a few select fields from
FormData_pg_attribute in a more compact way, using only 16 bytes per
column instead of the 104 bytes that FormData_pg_attribute uses.  Using
CompactAttribute allows performance-critical operations such as tuple
deformation to be performed without looking at the FormData_pg_attribute
element in TupleDesc which means fewer cacheline accesses.

For some workloads, tuple deformation can be the most CPU intensive part
of processing the query.  Some testing with 16 columns on a table
where the first column is variable length showed around a 10% increase in
transactions per second for an OLAP type query performing aggregation on
the 16th column.  However, in certain cases, the increases were much
higher, up to ~25% on one AMD Zen4 machine.

This also makes pg_attribute.attcacheoff redundant.  A follow-on commit
will remove it, thus shrinking the FormData_pg_attribute struct by 4
bytes.

Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com

Remove final mention of FREEZE_PAGE from comments

b7493e1ab35 removed leftover mentions of XLOG_HEAP2_FREEZE_PAGE records
from comments but neglected to remove one mention of FREEZE_PAGE.

Reported off-list by Alexander Lakhin

Get rid of old version of BuildTupleHashTable().

It was reasonable to preserve the old API of BuildTupleHashTable()
in the back branches, but in HEAD we should actively discourage use
of that version.  There are no remaining callers in core, so just
get rid of it.  Then rename BuildTupleHashTableExt() back to
BuildTupleHashTable().

While at it, fix up the miserably-poorly-maintained header comment
for BuildTupleHashTable[Ext].  It looks like more than one patch in
this area has had the opinion that updating comments is beneath them.

Discussion: https://postgr.es/m/538343.1734646986@sss.pgh.pa.us

Use ExecGetCommonSlotOps infrastructure in more places.

Append, MergeAppend, and RecursiveUnion can all use the support
functions added in commit 276279295.  The first two can report a
fixed result slot type if all their children return the same fixed
slot type.  That does nothing for the append step itself, but might
allow optimizations in the parent plan node.  RecursiveUnion can
optimize tuple hash table operations in the same way as SetOp now
does.

Patch by me; thanks to Richard Guo and David Rowley for review.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us

Improve planner's handling of SetOp plans.

Remove the code for inserting flag columns in the inputs of a SetOp.
That was the only reason why there would be resjunk columns in a
set-operations plan tree, so we can get rid of some code that
supported that, too.

Get rid of choose_hashed_setop() in favor of building Paths for
the hashed and sorted alternatives, and letting them fight it out
within add_path().

Remove set_operation_ordered_results_useful(), which was giving wrong
answers due to examining the wrong ancestor node: we need to examine
the immediate SetOperationStmt parent not the topmost node.  Instead
make each caller of recurse_set_operations() pass down the relevant
parent node.  (This thinko seems to have led only to wasted planning
cycles and possibly-inferior plans, not wrong query answers.  Perhaps
we should back-patch it, but I'm not doing so right now.)

Teach generate_nonunion_paths() to consider pre-sorted inputs for
sorted SetOps, rather than always generating a Sort node.

Patch by me; thanks to Richard Guo and David Rowley for review.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us

Convert SetOp to read its inputs as outerPlan and innerPlan.

The original design for set operations involved appending the two
input relations into one and adding a flag column that allows
distinguishing which side each row came from.  Then the SetOp node
pries them apart again based on the flag.  This is bizarre.  The
only apparent reason to do it is that when sorting, we'd only need
one Sort node not two.  But since sorting is at least O(N log N),
sorting all the data is actually worse than sorting each side
separately --- plus, we have no chance of taking advantage of
presorted input.  On top of that, adding the flag column frequently
requires an additional projection step that adds cycles, and then
the Append node isn't free either.  Let's get rid of all of that
and make the SetOp node have two separate children, using the
existing outerPlan/innerPlan infrastructure.

This initial patch re-implements nodeSetop.c and does a bare minimum
of work on the planner side to generate correctly-shaped plans.
In particular, I've tried not to change the cost estimates here,
so that the visible changes in the regression test results will only
involve removal of useless projection steps and not any changes in
whether to use sorted vs hashed mode.

For SORTED mode, we combine successive identical tuples from each
input into groups, and then merge-join the groups.  The tuple
comparisons now use SortSupport instead of simple equality, but
the group-formation part should involve roughly the same number of
tuple comparisons as before.  The cross-comparisons between left and
right groups probably add to that, but I'm not sure to quantify how
many more comparisons we might need.

For HASHED mode, nodeSetop's logic is almost the same as before,
just refactored into two separate loops instead of one loop that
has an assumption that it will see all the left-hand inputs first.

In both modes, I added early-exit logic to not bother reading the
right-hand relation if the left-hand input is empty, since neither
INTERSECT nor EXCEPT modes can produce any output if the left input
is empty.  This could have been done before in the hashed mode, but
not in sorted mode.  Sorted mode can also stop as soon as it exhausts
the left input; any remaining right-hand tuples cannot have matches.

Also, this patch adds some infrastructure for detecting whether
child plan nodes all output the same type of tuple table slot.
If they do, the hash table logic can use slightly more efficient
code based on assuming that that's the input slot type it will see.
We'll make use of that infrastructure in other plan node types later.

Patch by me; thanks to Richard Guo and David Rowley for review.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us

Remove extra prefetch iterator setup for Bitmap Table Scan

1a0da347a7ac98db replaced Bitmap Table Scan's separate private and
shared bitmap iterators with a unified iterator. It accidentally set up
the prefetch iterator twice for non-parallel bitmap table scans. Remove
the extra set up call to tbm_begin_iterate().

Fix bitmap table scan crash on iterator release

1a0da347a7ac98db replaced Bitmap Table Scan's individual private and
shared iterators with a unified iterator. It neglected, however, to
check if the iterator had already been cleaned up before doing so on
rescan. Add this check both on rescan and end scan to be safe.

Reported-by: Richard Guo
Author: Richard Guo
Discussion: https://postgr.es/m/CAMbWs48nrhcLY1kcd-u9oD%2B6yiS631F_8Fx8ZGsO-BYDwH%2Bbyw%40mail.gmail.com