Info Guide NET002:
NetWare Server Tuning and Optimisation

This document was written to give an overview of the factors that affect the performance of a server, and the changes that can be made in both hardware and software to optimise the performance of the server. It is NOT intended to be an in-depth guide, and does not cover all the factors that may relate to performance-tuning a server. It is intended that this document be used as a guide, with reference being made to the relevant product manuals for a more in-depth description of the options available. Also worth reading are a number of Novell Application Notes that have been issued on server tuning and performance optimisation.

The document is split into three sections: the server hardware, the operating system tuning parameters, and the application software in use. Not covered are aspects such as the workstation hardware, the network topology and the protocols in use, although these also play an important part in the performance of the system.

1. Hardware

    1.1. CPU
    1.2. Memory
    1.3. Network Adapter
    1.4. Disk Sub-System
    1.5. Summary

2. Operating System

    2.1. File Cache Parameters
    2.2. Directory Cache Parameters
    2.3. Block Sizes
    2.4. Name Space Considerations
    2.5. Packet Receive Buffers
    2.6. NCP Packet Signatures
    2.7. Some Server Monitoring Tools

3. User Software

    3.1 Introduction
    3.2 Some Examples
    3.3 Upgrading Software



(15th June 1994)



1. Hardware

The first stage of server performance tuning and optimisation is to identify potential bottlenecks in the hardware of the server in question. These can be the CPU, the memory, the network adapter, and/or the disk sub-system.

During the following discussions, the basic file server performance graph below is used to illustrate the various aspects of server tuning. The graph assumes that, for a given configuration, the server hardware is optimised. It shows that as server usage increases, the number of transactions per second also rises, until a threshold is reached where the disk cache hit rate and the throughput of the network adapter are both at their maximum. Beyond this threshold, transactions per second start to fall, as the disk cache hit rate drops and more and more data has to be read directly from the disk sub-system.

1.1. CPU

A general statement made by Novell regarding CPUs under NetWare is that ‘the faster the CPU the better, especially when the server is acting as a database server, where a 486/66 or greater is recommended’. Having said that, not all server operations are CPU-intensive, and there are a number of steps that can be taken to reduce CPU loading, reducing the need for faster processors in the server. The most important of these steps is the use of bus-mastering or parallel tasking adapters for the disk and network interfaces.

The CPU must execute instructions for each user I/O request, so insufficient processor speed degrades the server’s I/O performance. Increasing processor speed beyond what is required will have little impact on response time and performance; it does, however, improve the capacity to add extra users. Also, as the I/O performance of the network and disk sub-systems improves, the load on the CPU increases, requiring either a faster CPU or some means of reducing its workload.

Bus-mastering cards reduce the CPU loading on servers dramatically. Tests by IBM show that, compared with conventional non-bus-mastering adapters, the typical CPU load for a given operation can be reduced to less than 20% by implementing bus-mastering technology in servers. On most servers the effect of adding bus-mastering cards is increased user capacity, as less CPU time is spent on disk and network I/O; on a server that is already short of CPU resources, there should be a noticeable increase in performance, as the processor spends fewer cycles executing I/O instructions.

Parallel tasking LAN cards, such as the 3C5x9 EtherLink III range from 3Com, reduce CPU loading by performing separate tasks in parallel to transfer data more efficiently. Tests carried out by LANQuest Labs on an IBM PS/2 Model 95 show that for most typical workloads the parallel tasking 3C529 will out-perform the single tasking 3C527 (a bus-mastering card) by up to 52%, only falling behind when the network is near saturation (when the 3C527 is 10% faster).

In general, most currently installed 486-based servers have ample CPU performance; the typical bottlenecks are in the network and disk sub-systems.

1.2. MEMORY

Adding memory to a NetWare server improves the disk cache hit rate, temporarily improving the sustained throughput rate of the server. The performance improvement will, however, degrade as additional users are added. Also, the benefits of extra memory are dependent on other server hardware factors. A higher disk cache hit rate implies an increase in network utilisation, so a slow network adapter will minimise the impact of adding memory. A higher disk cache hit rate also implies higher CPU utilisation, so poor CPU performance can likewise reduce the impact of adding extra memory.

1.3. NETWORK ADAPTER

Adding faster network adapters affects server performance by improving the maximum peak transaction rate. Under heavy loads, however, the improvement will be slight, as the server will be disk-bound. Also, as network adapter performance improves, more CPU time is required to service the additional requests, so an increase in server CPU performance is required. Using bus-mastering or parallel tasking network adapters will reduce the CPU impact while still preserving the performance benefits of faster cards.

1.4. DISK SUB-SYSTEM

Adding a faster disk sub-system has the greatest impact on a server under heavy loads, as it improves the minimum sustained transfer rate. It will, however, have minimal impact under light server loads: most requests are serviced directly from the disk cache, so disk transfer times are hidden, and network transfer times form a relatively large component of the overall performance of a lightly loaded server.

Ways of improving the performance of the disk sub-system include upgrading to physically faster disks, and implementing data striping (in simple terms, spanning a NetWare volume over multiple disks) with multiple controller cards. If a server is currently running with disk mirroring (two sets of disks off one disk controller), upgrading this to a duplexed configuration (two sets of disks and two disk controllers, one set of disks off each controller) will maintain data integrity without loss of performance (as the disk controller does not have to issue two sets of commands for each disk write).

As server disk performance improves, increased network adapter performance and CPU performance are required to support greater disk I/O transaction rates. Using bus-master disk controller cards will reduce the impact of faster controllers on CPU loading.

1.5. SUMMARY

Overall, it can be seen that upgrading just one of the sub-systems in a server will not, on its own, improve all aspects of the server’s performance. It is only when multiple improvements are made that overall performance gains are achieved. However, selective hardware upgrading can help a poorly performing server, so long as careful consideration is given to where the bottlenecks are and which applications are suffering as a result.



2. Operating System

What follows is an explanation of some of the NetWare tuneable SET parameters, and how changing them can affect server performance. The list is NOT exhaustive, but it should illustrate how the various parameters can be used to optimise the performance of NetWare for particular environments. For full details, please consult the relevant NetWare manuals.

WARNING.....
The following tuning parameters come with a Novell and Apricot health warning! Ensure you know your server workload extremely well before using any of these. For reference and further information, read the June 92, May 93, June 93 and October 93 Novell AppNotes, and the relevant sections in the NetWare documentation. Also, remember that NetWare auto-tunes itself!

2.1. FILE CACHE PARAMETERS

2.1.1. Set Cache Buffer Size
NetWare cache performance is at its optimum when the cache buffer size is the same as the block size. In a multi-volume environment, the cache buffer size should be the same as the smallest volume block size in use.

By default, under NetWare 3.x, the block size is 4K and the cache buffer size is also 4K.
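As an illustration only (the value shown is an example, not a recommendation; the defaults and valid values are listed in the SET section of the NetWare System Administration manual), a server whose volumes all use a 16K block size might carry the following line, placed in STARTUP.NCF because the cache buffer size cannot be changed once the volumes are mounted:

    SET CACHE BUFFER SIZE = 16384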

2.1.2. Set Dirty Disk Cache Delay Time
This parameter sets the minimum amount of time the system waits before writing a dirty cache buffer (one where the data in memory differs from the data on the disk) to the disk. On a system where the number or size of data writes is small, performance may be improved by increasing this value.

2.1.3. Set Maximum Concurrent Disk Cache Writes
This value controls the maximum number of disk cache writes that can occur at the same time. The value should be increased for write-intensive environments and decreased for read-intensive environments. Increasing this value improves disk write performance at the expense of disk reads.
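Purely as an illustration of how these two file cache parameters might be adjusted on a write-intensive server (the figures are examples only, not recommendations; check the SET documentation for the defaults and valid ranges on your NetWare version), lines such as the following could be placed in AUTOEXEC.NCF or typed at the server console:

    SET DIRTY DISK CACHE DELAY TIME = 5
    SET MAXIMUM CONCURRENT DISK CACHE WRITES = 100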

2.2. DIRECTORY CACHE PARAMETERS

2.2.1. Set Minimum and Maximum Directory Cache Buffers
These parameters control the minimum and maximum number of directory cache buffers that can be allocated by the system. Increasing these values may give improved performance if the primary server workload involves much directory browsing and searching. Applications that would benefit are file-oriented ones such as word processors or spreadsheets, whereas record-oriented applications such as databases would see little benefit.

2.2.2. Set Maximum Concurrent Directory Cache Writes
This parameter controls the maximum number of directory cache writes that can be active at any one time. Increasing this value may improve performance if many thousands of small files are accessed regularly (assuming that the disk sub-system can cope).

2.2.3. Set Dirty Directory Cache Delay Time
This parameter controls how long NetWare will wait before writing a dirty directory cache buffer to the disk. Increasing this value improves the probability of subsequent changes to the same directory being made in memory rather than to the disk, therefore improving the chance of multiple write requests being satisfied by one physical write.

2.2.4. Set Directory Cache Buffer Nonreferenced Delay
This parameter controls how long NetWare will leave a directory cache buffer in memory before overwriting it. Increasing the value causes the server to retain directory cache buffer information for longer, resulting in more requests being serviced from the cache. Decreasing this value can help servers that are low on cache buffers, although if this is the case, it is preferable to add more memory to the server.
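Again purely as an illustration, a server whose primary workload involves heavy directory browsing across many thousands of small files might use settings along the following lines (example figures only; verify the defaults and valid ranges for your NetWare version before changing anything, and remember the health warning above):

    SET MINIMUM DIRECTORY CACHE BUFFERS = 40
    SET MAXIMUM DIRECTORY CACHE BUFFERS = 1000
    SET MAXIMUM CONCURRENT DIRECTORY CACHE WRITES = 25
    SET DIRTY DIRECTORY CACHE DELAY TIME = 1
    SET DIRECTORY CACHE BUFFER NONREFERENCED DELAY = 10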

2.3. BLOCK SIZES

2.3.1. NetWare 3.X
The default block size under NetWare 3.x is 4K. The main advantage of increasing this value is that performance is improved when doing read-ahead operations on sequential files. Also, a larger block size reduces the FAT memory requirements for a volume. The main disadvantage is that a larger block size will result in wasted disk space with small files.

Also, remember that the block size can only be set when the volume is created. It cannot be changed later!

2.3.2. NetWare 4.X
The default block size under NetWare 4.x is determined by the volume size. It is however recommended that a 64K block size is used. The advantages for this are the same as with NetWare 3.x, but because of the use of block sub-allocation under NetWare 4.x, the impact of small files is dramatically reduced.

2.4. NAME SPACE CONSIDERATIONS

Adding name space support to a volume creates an additional file directory entry for each additional name space loaded for every file on that volume. This results in directory writes on a volume with added name space support taking longer than on a volume without added name space support. Therefore, added name spaces should only be used on volumes that actually need them. This is without taking into consideration the extra RAM required for the FATs of a volume with added name space support.

2.5. PACKET RECEIVE BUFFERS

2.5.1. Set Maximum and Minimum Packet Receive Buffers
As a ‘rule of thumb’, there should be 2 packet receive buffers defined for each connection to a NetWare server. This value may however need to be increased for heavy communication based products such as NetWare for SAA.

In a NetWare 4.x environment, the server has increased priority for packet routing, so the packet receive buffer requirements are lower than those for an equivalent NetWare 3.x configuration.

Servers routing between fast and slow network topologies (such as between LAN and WAN, or between 16 and 4Mb/s Token Ring) will have increased needs for receive buffers, and so again, the maximum receive buffer allocation may need to be increased.

Setting the MINIMUM PACKET RECEIVE BUFFER value higher than default has the effect of eliminating any delay before the operating system allocates packet receive buffers. Setting the MAXIMUM PACKET RECEIVE BUFFER value has the effect of limiting the amount of server memory allocated to the receive buffers.

2.5.2. Set New Packet Receive Buffer Wait Time
Under NetWare 3.x, once memory has been removed from the available disk cache memory pool to service other requirements, it does not get returned. This can cause the gradual slow-down of a server where having the maximum available amount of disk cache is of benefit. Increasing the NEW PACKET RECEIVE BUFFER WAIT TIME has the effect of preventing unusual network traffic spikes from forcing the server to allocate extra packet receive buffers when they may not be needed for the bulk of the time the server is operational.
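As a sketch of how the above might look on, say, a 100-user server that also routes between a fast and a slow topology (figures illustrative only, based on the ‘two buffers per connection’ rule of thumb plus headroom; on NetWare 3.x the minimum is normally set in STARTUP.NCF, so check the manual for where each parameter belongs):

    SET MINIMUM PACKET RECEIVE BUFFERS = 200
    SET MAXIMUM PACKET RECEIVE BUFFERS = 500
    SET NEW PACKET RECEIVE BUFFER WAIT TIME = 1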

2.6. NCP PACKET SIGNATURES

As of the release of NetWare 3.12 (and optionally in NetWare 3.11 if Packet Burst is enabled), NetWare adds a unique signature to each NCP packet sent across the network, in an attempt to make the system more secure. Each network request carrying an NCP packet signature requires an additional 400 instructions on the server to sign and validate the packet. In an environment where this extra level of security is not vital, turning off the NCP Packet Signature option can reclaim vital CPU resources on a busy server. This needs to be set on the server (SET NCP PACKET SIGNATURE OPTION = 0 or 1) and in the NET.CFG file of each workstation using VLMs (SIGNATURE LEVEL = 0).
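For example, to turn signatures off completely at both ends (only where the reduced security is acceptable), the server and a VLM workstation might be configured as shown below. The NET.CFG section heading shown is the usual ‘NetWare DOS Requester’ section for the VLM client; check the client documentation for your DOS Requester version. Note also that on some NetWare versions the signature level may only be raised, not lowered, at a running server, so the setting is safest placed in the NCF files.

    Server (AUTOEXEC.NCF or at the console):

        SET NCP PACKET SIGNATURE OPTION = 0

    Workstation (NET.CFG):

        NetWare DOS Requester
            SIGNATURE LEVEL = 0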

2.7. SOME SERVER MONITORING TOOLS

NetWare 3.x and above is supplied as standard with MONITOR.NLM. This tool can be used to watch a number of the factors that affect server performance, such as CPU utilisation, the number of service processes, cache buffer values, packet receive buffers, and so on. The main disadvantage of using MONITOR is that the values shown are those at a particular moment in time, rather than displaying historical trends (which are more useful when charting overall server utilisation).
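MONITOR is loaded from the server console (or from AUTOEXEC.NCF) with:

    LOAD MONITOR

The Utilization figure and the cache buffer and packet receive buffer statistics on its main screen correspond to the parameters discussed in this section, but, as noted above, they are spot values rather than historical trends.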

To get a better feel for the trends of the various performance statistics, software such as Novell’s NetWare Management Services (NMS) can be used. This allows the trends of multiple servers to be monitored from a central location, and can output the data in a format that can be read directly into spreadsheets. The main disadvantage of this approach is cost.

In conjunction with the above, LANalyzer for Windows, or the NetWare LANalyzer Agents for NMS can be used to give an indication of the overall network bandwidth utilisation, and can allow a more detailed look at the types of server requests being made across the network so that application behaviour can be more effectively monitored (to see for example whether there are more read than write requests during a large file update).



3. User Software

3.1 INTRODUCTION

The third factor in optimising the performance of a server is understanding how the applications used on the network affect the performance of that network. The impact of software on server performance should not be underestimated, as poorly designed software can have a major effect on the overall performance of the system.

In real terms, understanding the behaviour of the software in use on the network should be the first step in optimising the performance of a server, as the steps needed to improve performance will vary depending on the workload placed on the server by the applications in use. Adding cache RAM to a server where the primary applications do not take advantage of cache technology will not improve performance dramatically, whereas speeding up the disk sub-system or the LAN interface could offer better overall performance gains.

3.2 SOME EXAMPLES

3.2.1 DOS Copy
The DOS COPY command is still widely used for file server performance analysis: it moves a known amount of data, and the time taken to copy the file gives a measure of the performance of the network.

The DOS COPY command works by issuing 64Kb (128 sector) sequential I/O requests, which does not reflect the way most popular applications operate. Because of the way the COPY command works, artificially high performance figures can easily be produced by using read-ahead caching disk controllers. These controllers are, however, very likely to perform differently when placed in a ‘real application’ environment.
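A crude sketch of such a test (the file name and size are hypothetical): copy a large file of known size from a server drive to a local drive, time the operation, and divide the size by the elapsed time:

    COPY F:\TESTDATA\TEST10MB.DAT C:\TEST10MB.DAT

If a 10,240Kb file copies in 20 seconds, the apparent throughput is roughly 512Kb per second. As explained above, this figure reflects 64Kb sequential requests, which read-ahead caching controllers handle particularly well, rather than typical application behaviour.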

3.2.2 Lotus 1-2-3
Lotus 1-2-3 loads files in forward sequential blocks that are less than 2048 bytes in size, alternating between large and small blocks, always with uneven block boundaries, and file saves are made in a similar way. These small requests are sequential, and in a NetWare environment, the majority of the I/O requests will be serviced by NetWare’s disk caching algorithms, so the installation of disk caching controllers will have a minimal performance impact.

As the majority of requests are NetWare Disk Cache hits, the primary performance bottleneck is related to getting the data onto the network, therefore adding higher performance network cards, or splitting the network load between multiple network cards would offer the best performance gains.

3.2.3 WordPerfect
When reading a file, WordPerfect loads the file in 2Kb blocks, but in REVERSE, reading from the end to the beginning, although when saving a file, it writes sequentially in blocks of more than 8Kb.

If a read-ahead cache is in use, a read request from WordPerfect would be SLOWER than on a non-cached system, as for each block read, the cache would proceed to read in the next block of data (which WordPerfect has already read), causing the number of disk reads to increase.

Also, when writing, WordPerfect uses large sequential blocks, and so the effect of having a write-cache would be limited, as the data throughput would be sustained at a high level for the duration of the write.

This type of application would benefit most by increasing the throughput of the disk sub-system.

3.2.4 cc:Mail
As an example, we will assume that a 3K mail message is to be sent. cc:Mail will read the message from the disk, write an encrypted note back to the disk, then read the encrypted note and append it to a mail log file. It will also have to read the database of mail users to get the correct addressing information, and add that information to the log file as well. During the course of all this activity there will be a total of six different files open and four lock operations carried out, and each individual I/O request will be small.

The overall result of all these different requests for a single transaction is that CPU utilisation has much more influence on performance than other aspects. As an example, moving a defined block of data via cc:Mail may put the server CPU utilisation at 100%, whereas an equivalent amount of data under WordPerfect may only cause 20% CPU utilisation.

In order to optimise the server performance for this kind of application, improving the server CPU performance (to counter the large number of I/O requests) and network adapter performance (to counter the small network frames) should be considered.

3.3 UPGRADING SOFTWARE

Upgrading software versions can also have a marked effect on the performance of a network. Although the upgraded software may have more functionality, the performance cost of the upgrade may outweigh the benefits gained from the new features.

3.3.1 DOS vs Windows Lotus 1-2-3
As a quick example, upgrading from Lotus 1-2-3 Rel 3 for DOS to Lotus 1-2-3 Rel 4 for Windows has the following effect for a sample operation....

    Measure                     Lotus 1-2-3 Rel 3 for DOS    Lotus 1-2-3 Rel 4 for Windows
    I/O requests                211                          331
    Read/write ratio            61:39                        79:21
    Files opened/closed         1                            13
    Small blocks (<2Kb) used    100%                         52%

In summary, 57% more requests are made through the network to perform a load/save operation under Windows than under DOS.

The bulk of the increased overhead is because of the use of DLL and font files under Windows, as the files are only accessed when needed.

3.3.2 Quattro V1.01 to Quattro Pro V4.0
Quattro v1.01 was a character based DOS application. Quattro Pro v4.0 is still a DOS based application, but includes its own enhanced graphical user interface. The resultant changes in resource utilisation with this upgrade are as follows.....

    Measure                     Quattro v1.01                Quattro Pro v4.0
    I/O requests                51                           1576
    Read/write ratio            52:48                        90:10
    Files opened/closed         2                            6
    Block sizes                 96% large blocks (>8Kb)      61% small blocks (<2Kb)

In summary, there are 31 times more requests through the network to perform a load/save operation under version 4.0 compared to version 1.01.

3.3.3 DOS vs Windows WordPerfect
Again, comparing WordPerfect 5.1 for DOS with WordPerfect 5.2 for Windows, the following is found.....

    Measure                     WordPerfect 5.1 for DOS      WordPerfect 5.2 for Windows
    I/O requests                112                          203
    Read/write ratio            85:15                        87:13
    Files opened/closed         3                            18
    Small blocks (<2Kb) used    77%                          62%

For a given load/save operation, the DOS version of WordPerfect generates around 45% fewer network requests than the Windows version (112 against 203). Note that the Windows version of WordPerfect also reads files backwards!
