The following is a summary of some of the changes that have been made to the School's general-use Linux environment and how those changes might relate to problems that were encountered in 2023. Further details as well as links to the relevant RT tickets are also provided.
Winter 2023 memory shortage on the linux.student.cs servers. In response, the Infrastructure Group has increased the amount of RAM in the linux.student.cs environment by approximately 400%.
A closer examination showed that the memory use of individual users at
the time was reasonable and that there were simply more users using
significant amounts of memory than the systems could collectively
handle[1]. In response, the Infrastructure Group purchased and
deployed several new servers with approximately 3x as much RAM as
the previous generation. We also purchased additional RAM
and installed it in several of the existing servers. I don't have
hard figures for the total RAM available in the W2023 environment,
but the current environment totals over 6TB, which is more than triple
the previous peak usage[2].
[1]
https://rt.uwaterloo.ca/Ticket/Display.html?id=1280253#txn-31897224
[2]
https://rt.uwaterloo.ca/Ticket/Display.html?id=1280253#txn-31897227
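For readers who want to see how memory use is distributed across users on a shared machine, the following is a minimal sketch that sums resident memory per user from `ps` output. It is not the tooling the Infrastructure Group used; the function names (`rss_by_user`, `top_users`, `collect_ps_output`) are hypothetical, and summing RSS overcounts memory shared between processes, so the totals are only a rough guide.

```python
import subprocess
from collections import defaultdict


def rss_by_user(ps_output):
    """Sum resident set size (KiB) per user.

    Expects one "user rss" pair per line, as printed by
    `ps -e -o user= -o rss=`. RSS counts shared pages once per
    process, so these totals are an upper bound, not exact usage.
    """
    totals = defaultdict(int)
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        user, rss_kib = parts
        totals[user] += int(rss_kib)
    return dict(totals)


def top_users(totals, n=5):
    """Return the n (user, KiB) pairs with the largest totals."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def collect_ps_output():
    """Run ps on the local machine (assumes a procps-style ps is installed)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "user=", "-o", "rss="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `top_users(rss_by_user(collect_ps_output()))` on a login server lists the heaviest users at that moment. A pattern of many users each holding a moderate amount, rather than one user holding most of the memory, would match the situation described above.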
Poor home directory performance. Since April 2023 several large changes have been made to the CephFS-based home directory service, resulting in order-of-magnitude reductions in latency across the board.
Reverting to the default I/O scheduler resulted in an order of magnitude
improvement in median data read/write latency (60ms -> 1ms). This also
resulted in a 20x improvement to the cluster recovery rate, allowing
client data to be fully replicated after hardware failure within hours
as opposed to weeks/months[3]. Investigations into undocumented
scalability limits of the cluster's Metadata layer resulted in the
deployment of additional Metadata Servers, proportionally increasing
throughput of Metadata operations and reducing the impact of asymmetries
in our client workload[4]. Further improvements to Metadata performance
were made by identifying undocumented scalability problems in the
underlying storage for Metadata and mitigating them by re-deploying
that hardware in a new configuration to enable parallelism[5].
The end result of these two Metadata modifications was a significant reduction
in latency under load and a dramatic reduction in tail latency.
[3]
https://rt.uwaterloo.ca/Ticket/Display.html?id=1288524#txn-32316080
[4]
https://rt.uwaterloo.ca/Ticket/Display.html?id=1288288#txn-32308938
[5]
https://rt.uwaterloo.ca/Ticket/Display.html?id=1275465#txn-31877979
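The distinction between median latency and tail latency matters here because a service can have a fast median while a small fraction of requests remain very slow. The sketch below computes both from a list of latency samples. The function name `latency_summary` and the sample values in the usage note are hypothetical illustrations, not measurements from the cluster.

```python
import statistics


def latency_summary(samples_ms):
    """Return the median, 99th percentile, and maximum of latency samples.

    The 99th percentile uses the nearest-rank method: the smallest
    sample such that at least 99% of samples are less than or equal to it.
    """
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # ceil(0.99 * n) using integer arithmetic
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }
```

With hypothetical input of 98 requests at 1 ms plus one at 60 ms and one at 500 ms, the median is 1 ms while the 99th percentile is 60 ms. Users notice the slow requests even though the typical request is fast, which is why reducing tail latency is reported separately from the median.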
Intermittent crashes of linux.student.cs machines. These crashes happened several times a week (more frequently under load) and were caused by invalid memory accesses in the kernel.
They have been eliminated by deploying our own build of the stable Long
Term Support (LTS) kernel (5.15) with an appropriately modified and
backported patch from upstream Linux (6.2). The Infrastructure Group is
currently working to upstream that patch to the LTS line of kernels[6].
[6]
https://rt.uwaterloo.ca/Ticket/Display.html?id=1304328
Regards,
Anthony Brennan
Information Technology Specialist, CSCF Infrastructure Group