CSCF: Infrastructure Technology Internal Notes



Greetings, here are some highlights from the 2023 calendar year:

The following is a summary of some of the changes that have been made to the School's general-use Linux environment and how those changes might relate to problems that were encountered in 2023. Further details as well as links to the relevant RT tickets are also provided.

  1. Winter 2023 memory shortage on the linux.student.cs servers. In response, the Infrastructure Group has increased the amount of RAM in the linux.student.cs environment by approximately 400%.

    A closer examination showed that the memory use of individual users at the time was reasonable; there were simply more users consuming significant amounts of memory than the systems could collectively handle[1]. In response, the Infrastructure Group purchased and deployed several new servers, each with approximately 3x the RAM of the previous generation. We also purchased additional RAM and installed it in several of the existing servers. I don't have hard figures for the total RAM available in the W2023 environment, but the current environment totals over 6TB, which is more than triple the previous peak usage[2].

    [1] https://rt.uwaterloo.ca/Ticket/Display.html?id=1280253#txn-31897224
    [2] https://rt.uwaterloo.ca/Ticket/Display.html?id=1280253#txn-31897227

  2. Poor home directory performance. Since April 2023, several large changes have been made to the CephFS-based home directory service, resulting in order-of-magnitude reductions in latency across the board.

    Reverting to the default I/O scheduler resulted in an order-of-magnitude improvement in median data read/write latency (60ms -> 1ms). It also resulted in a 20x improvement to the cluster recovery rate, allowing client data to be fully replicated after a hardware failure within hours rather than weeks or months[3]. Investigations into undocumented scalability limits of the cluster's Metadata layer led to the deployment of additional Metadata Servers, proportionally increasing the throughput of Metadata operations and reducing the impact of asymmetries in our client workload[4]. Metadata performance was improved further by identifying undocumented scalability problems in the underlying storage for Metadata and mitigating them by re-deploying that hardware in a new configuration that enables parallelism[5]. The end result of these two Metadata changes was a significant reduction in latency under load and a dramatic reduction in tail latency.

    [3] https://rt.uwaterloo.ca/Ticket/Display.html?id=1288524#txn-32316080
    [4] https://rt.uwaterloo.ca/Ticket/Display.html?id=1288288#txn-32308938
    [5] https://rt.uwaterloo.ca/Ticket/Display.html?id=1275465#txn-31877979
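
    As a rough illustration of why median and tail latency are reported separately above, the sketch below summarizes a list of latency samples by its median and its 99th percentile. The function name latency_summary, the nearest-rank percentile method, and the sample values in the usage note are assumptions made for illustration only; they are not taken from the cluster's monitoring or from the tickets referenced here.

```python
import statistics


def latency_summary(samples_ms):
    """Return (median, p99) for a non-empty list of latency samples in ms.

    Hypothetical helper for illustration; not part of any Ceph tooling.
    """
    ordered = sorted(samples_ms)
    median = statistics.median(ordered)
    # Nearest-rank 99th percentile: the smallest sample such that at least
    # 99% of all samples are at or below it. Integer arithmetic computes
    # ceil(0.99 * n) without floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    p99 = ordered[rank - 1]
    return median, p99
```

    For example, with made-up data in which 98 of 100 operations take 1ms and 2 take 500ms, latency_summary([1.0] * 98 + [500.0] * 2) gives a median of 1.0 and a 99th percentile of 500.0. A healthy median can hide slow outliers, which is why the tail is tracked as well.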

  3. Intermittent crashes of linux.student.cs machines. These crashes happened several times a week (more frequently under load) and were caused by invalid memory accesses in the kernel.

    They have been eliminated by deploying our own build of the stable Long Term Support (LTS) kernel (5.15) with an appropriately modified and backported patch from upstream Linux (6.2). The Infrastructure Group is currently working to upstream that patch to the LTS line of kernels[6].

    [6] https://rt.uwaterloo.ca/Ticket/Display.html?id=1304328

Work to further improve performance and reliability is ongoing, starting with the deployment of solid-state storage hardware for home directory data, to be completed as hardware delivery and available staff time allow.

Regards,
Anthony Brennan
Information Technology Specialist, CSCF Infrastructure Group