Tuning for high load under Linux

HowTos, FAQs, Tips & Tricks, & Guides
jogger
Posts: 45
Joined: 19 Feb 2018 09:00

Tuning for high load under Linux

Post by jogger »

This guide is for you if you run i2p 24/7 at a constant 50-60% or more of total CPU. Others will see little benefit. Target systems are most likely ARM and low-end Intel.

One common piece of advice for Java under Linux is to set vm.swappiness to 0. Only do that if you are sure you will never exceed physical memory; otherwise you are in for crashes. For stable 24/7 operation use vm.swappiness=1 with 1 GiB of fast swap, tested over a long period.
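To make the setting survive reboots it can go into a sysctl drop-in; the file name below is just a suggestion:

```
# /etc/sysctl.d/99-i2p-tuning.conf  (file name is a suggestion)
# swappiness 1 instead of 0, per the note above; apply with "sudo sysctl --system"
vm.swappiness = 1
```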

Use Java/OpenJDK >= 9; it is much faster than previous versions. I am using OpenJDK 9: I saw more crashes under 10, and 11 uses more memory and spends more time in system calls.

After at least 2 days of uptime, use "top -H" and "ps hH -C java -o 'cp tid comm' | sort -rn" to become familiar with your CPU usage. "NTCP Pumper" and "UDP packet pusher" should come out on top.

These are the bottlenecks we are targeting. If you run a big.LITTLE design, then - after working through this guide - use "taskset" to pin both of these threads to your fast cores. Assign "SimpleTimer" to the slower cores; this will propagate to the less critical threads that SimpleTimer starts.
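A sketch of the pinning, assuming cores 4-7 are the fast cluster and 0-3 the slow one (core numbers vary per SoC; check yours first). Thread names are matched against the truncated comm field from the ps command above:

```shell
# Pin the two hot threads to the fast cores (4-7 assumed here):
for tid in $(ps hH -C java -o 'tid comm' | grep -e 'NTCP Pump' -e 'packet push' | awk '{print $1}')
do
    sudo taskset -cp 4-7 "$tid"
done
# Pin SimpleTimer to the slow cores (0-3 assumed); threads it starts inherit this:
for tid in $(ps hH -C java -o 'tid comm' | grep SimpleTimer | awk '{print $1}')
do
    sudo taskset -cp 0-3 "$tid"
done
```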

The rest of this guide is motivated by the interaction between the JVM and the Linux scheduler (CFS), which is known not to be ideal. You can google the details, but the basic facts are: CFS guarantees every runnable thread runs at least once every 100 ms, and within that window hands out dynamically calculated time slices between 0.75 ms and 6 ms. A switched-out thread will most likely wait at least 0.5 ms before being migrated to a different core when one is free. All these values are hardcoded; with CFS there is no way to peek into the kernel and get a clue about the actual time slices.

We can take accounting pressure off CFS by allowing time slices to be calculated over larger time intervals (this does not alter any of the values above):
su
echo 1000000 > /proc/sys/kernel/sched_cfs_bandwidth_slice_us

If a thread does not yield the CPU or block on I/O (a voluntary context switch) within its time slice, a non-voluntary context switch occurs. This is really bad: processing of a network packet is interrupted, only for another thread to begin processing another packet, which will also be preempted, and so forth.

You can view context switches via "cat /proc/$javapid/task/$tid/status". It is not uncommon to see more than 200 non-voluntary context switches per second per thread for the major packet-processing threads. Given the CFS parameters above, it is very likely that 20-40% of CPU time is simply inaccessible to any thread, also taking into account stop-the-world garbage collection pauses. The more cores you have, the more you will suffer.
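The two counters sit at the bottom of each task's status file and can be pulled with grep. Shown here against the current shell ($$) so it runs anywhere; substitute $(pidof java) for the router:

```shell
# Print voluntary_ctxt_switches and nonvoluntary_ctxt_switches for every
# task of a process; $$ (this shell) stands in for $(pidof java).
pid=$$
grep -H -e voluntary_ctxt /proc/$pid/task/*/status
```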

The only way to reduce the impact of non-voluntary context switches on throughput is to extend the effective time slices. This works by reducing the number of threads and by prioritizing the important ones. First look outside i2p: this usually comes down to whether you need a GUI on the machine; shutting it down helps a bit. Inside i2p you can cut down on the number of snark instances and disable DHT or open trackers if you do not need them. Components for interactive use, like the proxy and susimail, can be moved to a different machine.

Next we need to go into the source to cut down on the number of concurrent threads responsible for most non-voluntary context switches. They were introduced to improve performance by running important functions more frequently (but at random points in time). This added overhead and gained throughput, but it looks like poor man's scheduling. I cut down the thread counts so that no thread comes out above "NTCP Pumper" and "UDP packet pusher".

This means one job runner (look near "int runners"), one build handler (set in advanced config), one NTCP writer (MAX_CONCURRENT_WRITERS), three NTCP readers (MAX_CONCURRENT_READERS), two GW pumpers (MAX_PUMPERS) and one UDP receiver (MAX_THREADS). Using the "ps" command above, your result should be similar. Do "ant updater" and update.

You will see fewer non-voluntary context switches, 20% less garbage collection and lower memory consumption.

But performance metrics will still not be satisfactory, and CFS knows nothing about priorities, so it is time to play "nice". CFS uses a hardcoded nice table where 5 nice levels correspond to a factor-3 difference in CPU share. So "sudo nice -n -5 i2p/i2prouter start" will make sure a competing tar or make job does not affect i2p performance much. But we will also prioritise within i2p itself. Example: advancing the nice level of the build handler by 5 is like having three build handlers running at regular intervals while preempting less.
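The factor 3 falls out of the kernel's weight table: each nice step scales a thread's CFS weight by about 1.25, so five steps compound to roughly 3:

```shell
# 1.25 per nice level, compounded over 5 levels: 1.25^5 = 3.05
awk 'BEGIN { w = 1; for (i = 0; i < 5; i++) w *= 1.25; printf "%.2f\n", w }'
```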

I put this into the attached script "nicer.sh". It assigns fixed nice levels to the job runner and build handler, and every 30 seconds rebalances the nice levels of the other important threads as groups, based on non-voluntary context switches. It survives i2p restarts and can be run interactively or from cron @reboot (do not forget to redirect output, e.g. >/dev/null 2>&1). Even if you do not want to change the i2p source, "nicer.sh" will bring a noticeable performance improvement.

I estimate the total effect of this guide at about 20% better throughput.

Last word: do not try the FIFO scheduler (via chrt or schedtool) on single threads or on i2p as a whole. You can improve some metrics, but overall performance will suffer, let alone the crashes.
jogger

Re: Tuning for high load under Linux

Post by jogger »

This is the mentioned "nicer.sh":

#!/bin/bash
# nicer.sh: assigns fixed nice levels to the job runner and build handler and
# rebalances the nice levels of the NTCP/UDP/GW thread groups every 30 seconds
# based on non-voluntary context switches. Survives i2p restarts.

oldpid=99999
initlevel=-5      # baseline nice level for all java threads
jobrunlevel=-15   # fixed level for the job runner
buildlevel=-11    # fixed level for the build handler
creditlimit=44    # warm-up rounds before priorities are eased back down

while true
do
    javapid=$(pidof java)
    if [ -z "$javapid" ]
    then
        sleep 300
        continue
    fi

    # i2p was (re)started: reinitialise all bookkeeping
    if [ "$javapid" -ne "$oldpid" ]
    then
        credits=0
        unset nonvol prio sum newsum names
        declare -A nonvol   # last non-voluntary switch count per thread
        declare -A prio     # current nice level per thread group
        declare -A sum      # previous totals per thread group
        declare -A newsum   # running totals per thread group
        declare -A names    # thread id -> group name (first 11 chars of comm)

        sleep 30
        sudo renice $initlevel $(ps hH -C java -o 'tid') >/dev/null 2>&1
        sudo renice $jobrunlevel $(ps H -C java -o 'tid comm' | grep JobQ | sed 's/^ *//' | cut -f 1 -d " ") >/dev/null 2>&1
        sudo renice $buildlevel $(ps H -C java -o 'tid comm' | grep Build | sed 's/^ *//' | cut -f 1 -d " ") >/dev/null 2>&1

        while read tid comm
        do
            name=${comm:0:11}
            names[$tid]="$name"
            sum["$name"]=0
            # the last field of /proc/<pid>/task/<tid>/status is the
            # nonvoluntary_ctxt_switches count
            buffer=($(< /proc/$javapid/task/$tid/status))
            nonvol[$tid]=${buffer[-1]}
            prio["$name"]=$initlevel
        done < <(ps H -C java -o 'tid comm' | grep -e NTCP -e UDP -e GW)

        for tid in "${!names[@]}"
        do
            index="${names[$tid]}"
            sum["$index"]=$((sum["$index"] + nonvol[$tid]))
            newsum["$index"]=${sum["$index"]}
        done

        oldpid=$javapid
        continue
    fi

    sleep 30

    max=0
    min=99999999
    total=0

    # accumulate the non-voluntary switches of the last interval per group
    for tid in "${!names[@]}"
    do
        old=${nonvol[$tid]}
        buffer=($(< /proc/$javapid/task/$tid/status))
        cur=${buffer[-1]}
        diff=$((cur - old))
        nonvol[$tid]=$cur
        total=$((total + diff))
        index="${names[$tid]}"
        newsum["$index"]=$((newsum["$index"] + diff))
    done

    # find the groups with the most and the fewest switches
    for name in "${!sum[@]}"
    do
        diff=$((newsum["$name"] - sum["$name"]))
        if [ $diff -lt $min ] && [ ${prio["$name"]} -lt $initlevel ]
        then
            min=$diff
            minname=$name
        fi
        if [ $diff -gt $max ]
        then
            max=$diff
            maxname=$name
        fi
        sum["$name"]=${newsum["$name"]}
    done

    # raise the priority of the busiest group by one nice level
    newprio=$((prio["$maxname"] - 1))
    for tid in "${!names[@]}"
    do
        if [ "$maxname" == "${names[$tid]}" ]
        then
            sudo renice $newprio $tid >/dev/null 2>&1
        fi
    done

    echo "Up: " $newprio $maxname " Max:" $max "of" $total
    prio["$maxname"]=$newprio

    # after the warm-up of $creditlimit rounds, ease the quietest
    # already-boosted group back down by one nice level each round
    if [ $credits -eq $creditlimit ]
    then
        newprio=$((prio["$minname"] + 1))
        for tid in "${!names[@]}"
        do
            if [ "$minname" == "${names[$tid]}" ]
            then
                sudo renice $newprio $tid >/dev/null 2>&1
            fi
        done
        prio["$minname"]=$newprio
        echo "Down: " $newprio $minname " Min:" $min
    else
        credits=$((credits + 1))
    fi

done
jogger

Re: Tuning for high load under Linux

Post by jogger »

The following script gives you 10-second views of the performance-impacting context switches:

#!/bin/bash
# 10-second view of context switches per thread group: prints
# "<group> <non-voluntary> <voluntary>" for the NTCP/UDP/GW threads

oldpid=99999

while true
do
    javapid=$(pidof java)
    if [ -z "$javapid" ]
    then
        sleep 300
        continue
    fi

    # i2p was (re)started: reinitialise all bookkeeping
    if [ "$javapid" -ne "$oldpid" ]
    then
        unset nonvol vol sum volsum newsum newvolsum names
        declare -A nonvol     # last non-voluntary count per thread
        declare -A vol        # last voluntary count per thread
        declare -A volsum     # previous voluntary totals per group
        declare -A newvolsum  # running voluntary totals per group
        declare -A sum        # previous non-voluntary totals per group
        declare -A newsum     # running non-voluntary totals per group
        declare -A names      # thread id -> group name

        while read tid comm
        do
            name=${comm:0:11}
            names[$tid]="$name"
            sum["$name"]=0
            volsum["$name"]=0
            # in /proc/.../status the last field is the non-voluntary count,
            # the third-to-last the voluntary count
            buffer=($(< /proc/$javapid/task/$tid/status))
            nonvol[$tid]=${buffer[-1]}
            vol[$tid]=${buffer[-3]}
        done < <(ps H -C java -o 'tid comm' | grep -e NTCP -e UDP -e GW)

        for tid in "${!names[@]}"
        do
            index="${names[$tid]}"
            sum["$index"]=$((sum["$index"] + nonvol[$tid]))
            volsum["$index"]=$((volsum["$index"] + vol[$tid]))
            newsum["$index"]=${sum["$index"]}
            newvolsum["$index"]=${volsum["$index"]}
        done

        oldpid=$javapid
        continue
    fi

    sleep 10

    # accumulate both kinds of switches of the last interval per group
    for tid in "${!names[@]}"
    do
        index="${names[$tid]}"
        buffer=($(< /proc/$javapid/task/$tid/status))
        old=${nonvol[$tid]}
        cur=${buffer[-1]}
        diff=$((cur - old))
        nonvol[$tid]=$cur
        newsum["$index"]=$((newsum["$index"] + diff))
        old=${vol[$tid]}
        cur=${buffer[-3]}
        diff=$((cur - old))
        vol[$tid]=$cur
        newvolsum["$index"]=$((newvolsum["$index"] + diff))
    done

    for name in "${!sum[@]}"
    do
        echo $name $((newsum["$name"] - sum["$name"])) $((newvolsum["$name"] - volsum["$name"]))
        sum["$name"]=${newsum["$name"]}
        volsum["$name"]=${newvolsum["$name"]}
    done

done