Tuesday, September 13, 2016

DCMTK and 10 Gb/s

During testing on a 10 Gb/s network with the tools mentioned in the previous post I saw very low bandwidth utilization: ~1.5 Gb/s instead of the ~8 Gb/s I had measured with iperf and with my own Java tool for network transfer speed measurement.

Investigation showed that DCMTK unfortunately always sets the TCP send and receive buffer sizes via setsockopt:

bufLen = 65536; // a socket buffer size of 64K gives best throughput for image transmission
...

(void) setsockopt(sock, SOL_SOCKET, SO_SNDBUF, (char *) &bufLen, sizeof(bufLen));
(void) setsockopt(sock, SOL_SOCKET, SO_RCVBUF, (char *) &bufLen, sizeof(bufLen));

This was probably a good approach before TCP autotuning was introduced, but on newer kernels (>2.6), where autotuning is available, setting SO_SNDBUF or SO_RCVBUF switches the autotune feature off, which results in terrible performance over 10 Gb/s (see https://www.psc.edu/index.php/networking/641-tcp-tune).


When I commented out these setsockopt lines in
./dcmnet/libsrc/dulfsm.cc
./dcmnet/libsrc/dul.cc
the newly built tools could reach ~5.2 Gb/s bandwidth utilization.
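
A quick way to check what buffer sizes a connection actually ends up with is to read them back with getsockopt() on an established socket: with the explicit 64K calls in place the values stay fixed, while with autotuning left alone they can grow much larger. A small diagnostic sketch (my own code, not part of DCMTK):

#include <stdio.h>
#include <sys/socket.h>

/* Print the send/receive buffer sizes the kernel currently uses for a
   socket. Note that Linux reports roughly double the value that was
   requested via setsockopt(). */
static void print_socket_buffers(int sock)
{
    int sndbuf = 0, rcvbuf = 0;
    socklen_t len = sizeof(sndbuf);

    getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);
    len = sizeof(rcvbuf);
    getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
    printf("SO_SNDBUF=%d SO_RCVBUF=%d\n", sndbuf, rcvbuf);
}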

DCMTK adds 1 sec delay

When I measured transfer speed with a simple DCMTK server (C-MOVE SCP/C-STORE SCU) and client (C-MOVE SCU/C-STORE SCP), I was surprised to see that there is always a 1-second delay before the actual C-STORE transfer starts.

Checking the source code revealed that in dimmove.cc the selectReadable function hard-codes the timeout:

timeout = 1; /* poll wait until an assoc req or move rsp */

and then calls, with both the association and the sub-association in assocList:

ASC_selectReadableAssociation(assocList, assocCount, timeout)

This ends up in DcmTransportConnection::fastSelectReadableAssociation, which, instead of issuing a single select on both the association and the sub-association, first selects on the association and only afterwards, once the 1-second timeout has expired, on the sub-association.
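
The effect is easy to reproduce with plain sockets: polling two descriptors one after the other, each with a 1-second select() timeout, makes data arriving on the second one wait out the first timeout, while a single select() over both wakes up immediately. An illustrative sketch in plain POSIX (not DCMTK code):

#include <sys/select.h>

/* Sequential polling: if only fd2 has data, we still sit in the first
   select() for the full 1-second timeout before looking at fd2. */
int wait_sequential(int fd1, int fd2)
{
    fd_set rd;
    struct timeval tv;

    FD_ZERO(&rd); FD_SET(fd1, &rd);
    tv.tv_sec = 1; tv.tv_usec = 0;
    if (select(fd1 + 1, &rd, NULL, NULL, &tv) > 0) return fd1;

    FD_ZERO(&rd); FD_SET(fd2, &rd);   /* only reached ~1 second later */
    tv.tv_sec = 1; tv.tv_usec = 0;
    if (select(fd2 + 1, &rd, NULL, NULL, &tv) > 0) return fd2;
    return -1;
}

/* Combined polling: one select() over both descriptors returns as soon
   as either of them becomes readable. */
int wait_combined(int fd1, int fd2)
{
    fd_set rd;
    struct timeval tv;
    int maxfd = (fd1 > fd2) ? fd1 : fd2;

    FD_ZERO(&rd); FD_SET(fd1, &rd); FD_SET(fd2, &rd);
    tv.tv_sec = 1; tv.tv_usec = 0;
    if (select(maxfd + 1, &rd, NULL, NULL, &tv) > 0)
        return FD_ISSET(fd1, &rd) ? fd1 : fd2;
    return -1;
}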

This one-second select timeout shows up as one second of extra latency if you rely on the callback implementation for the C-STORE SCP on the client side.

When I implemented the C-STORE SCP in its own thread (instead of using the callback in DIMSE_moveUser), this 1-second delay obviously went away.
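
A rough sketch of that structure, with run_store_scp() as a hypothetical stand-in for the code that accepts the incoming C-STORE sub-association (plain pthreads, not a DCMTK API):

#include <pthread.h>

/* Hypothetical worker: listens on the given port, accepts the C-STORE
   sub-association and receives the images; the real body would use the
   usual DCMTK association/DIMSE calls. */
static void *run_store_scp(void *arg)
{
    int port = *(int *) arg;
    /* ... accept the association on `port` and handle C-STORE requests ... */
    (void) port;
    return NULL;
}

int main(void)
{
    pthread_t scp_thread;
    int port = 11113;   /* example listening port for the C-STORE SCP */

    /* Start the C-STORE SCP first, then issue the C-MOVE request from the
       main thread without passing a store callback to DIMSE_moveUser. */
    pthread_create(&scp_thread, NULL, run_store_scp, &port);

    /* ... send the C-MOVE request here ... */

    pthread_join(scp_thread, NULL);
    return 0;
}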

Thursday, February 4, 2016

DCMTK through localhost is slow

Recently I created a small DICOM server application for testing purposes, building it on 64-bit Ubuntu 15.10. As a first step I created a C-STORE SCU and SCP pair of applications and I was happy with the performance I measured with the SCU accessing the SCP through the loopback interface (127.0.0.1): 1000 CT images in uncompressed Explicit VR Little Endian format were received in less than a second. So far so good; next I created the C-MOVE SCU and SCP applications (which included the C-STORE parts as well) and executed the same tests. To my surprise it was extremely slow: it took about 5 minutes to receive the same data.

First I was concerned that I was doing something wrong in my code, having not much experience with DCMTK. After spending too much time reviewing my code without any results, I moved the C-STORE SCP part to a separate process and, boom, got excellent performance. Another try, running the C-MOVE SCP and SCU on separate machines (or the C-MOVE SCU in a Docker container), and boom, again excellent performance. Plus, if I used another C-MOVE SCU tool like gdcmscu, it also gave excellent performance over the loopback interface...

But DCMTK's movescu tool connecting to my C-MOVE SCP gave the very same slow result, so it seems to be a strange DCMTK-SCU-to-DCMTK-SCP behavior over the loopback interface.

All in all, after trying different things like disabling multi-threading support in DCMTK and profiling my SCU/SCP pair (those did not lead anywhere), my Google searches focused on people reporting slowness over the loopback interface, and I found this link useful:

http://stackoverflow.com/questions/5832308/linux-loopback-performance-with-tcp-nodelay-enabled

After setting

export TCP_NODELAY=0

for DCMTK (see ./dcmtk/src/dcmtk/config/docs/envvars.txt for the environment variables configuring the DCMTK runtime), the performance through loopback became significantly better (1 min instead of 4 mins), but still not as good as with the plain C-STORE SCP/SCU pair (1 sec).
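
The environment variable presumably maps to the TCP_NODELAY socket option, i.e. whether Nagle's algorithm is disabled on the connection. At the socket level the option is toggled like this (generic POSIX sketch, not DCMTK code):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable (on=1) or leave disabled (on=0) the TCP_NODELAY option on a
   connected socket. on=1 turns Nagle's algorithm off; keeping it at 0
   leaves Nagle enabled, which is what helped over loopback here. */
static int set_tcp_nodelay(int sock, int on)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
                      (const char *) &on, sizeof(on));
}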

After unsetting TCP_NODELAY I tried increasing the TCP buffer length instead; envvars.txt documents it as follows:

"TCP_BUFFER_LENGTH Affected: dcmnet Explanation: By default, DCMTK uses a TCP send and receive buffer length of 64K. If the environment variable TCP_BUFFER_LENGTH is set, it specifies an override for the TCP buffer length. The value is specified in bytes, not in Kbytes."

Long story short: after increasing the buffer length with

export TCP_BUFFER_LENGTH=129076

(129075 still results in slow performance), performance became excellent: 1 second for 1000 CT images (instead of 5 minutes).
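
For experimenting with different values, the pattern behind such an override is simple: read the environment variable and apply it to both socket buffers, leaving them untouched (and therefore autotuned) when it is unset. A minimal sketch of that pattern (my own code, not necessarily how DCMTK implements it):

#include <stdlib.h>
#include <sys/socket.h>

/* Apply a TCP buffer length taken from the TCP_BUFFER_LENGTH environment
   variable (value in bytes). If the variable is unset or invalid, leave
   the socket buffers alone so the kernel can autotune them. */
static void apply_tcp_buffer_length(int sock)
{
    const char *env = getenv("TCP_BUFFER_LENGTH");
    int bufLen;

    if (env == NULL)
        return;
    bufLen = atoi(env);
    if (bufLen <= 0)
        return;
    (void) setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufLen, sizeof(bufLen));
    (void) setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufLen, sizeof(bufLen));
}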
The root cause is not clear yet. Why does it happen over the loopback interface and seemingly not over eth0?

Friday, January 1, 2016

Docear integration with PDFXCviewer through PlayOnLinux

I started exploring Docear (http://www.docear.org/), which is "a unique solution to academic literature management, i.e. it helps you organizing, creating, and discovering academic literature". Downloading and starting it on my Ubuntu 15.10 went without problems. Configuring it to use the recommended PDF viewer application (PDFXCviewer) needed some tweaking, maybe because I had installed the viewer with PlayOnLinux.

Configuring it in Docear's preferences under PDF Management requires the following command:

playonlinux*--run*PDFXCview*/A*page=$PAGE*`winepath -w "$FILE"`


Monday, July 16, 2012

Maximum number of processes/threads in Linux

In Linux there are three variables that define how many threads one process can create:
  1. /proc/sys/kernel/threads-max
  2. /proc/sys/kernel/pid_max
  3. /proc/sys/vm/max_map_count

threads-max

threads-max defines how many threads (and processes) can be created system-wide (if the other two parameters do not limit it to a lower number). The default value is calculated at boot time as

max_threads = totalram_pages / (8 * THREAD_SIZE / PAGE_SIZE) 
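
As a worked example, assuming 8 GiB of RAM, 4 KiB pages and an 8 KiB THREAD_SIZE (typical for x86_64 kernels of that era; the numbers are illustrative only):

totalram_pages = 8 GiB / 4 KiB = 2,097,152
max_threads    = 2,097,152 / (8 * 8192 / 4096) = 2,097,152 / 16 = 131,072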

pid_max 

This setting simply sets the maximum value that can be used as a process identifier (PID). Since threads are also processes (lightweight processes, to be precise), they also use up PIDs. This means that if the other two parameters are not limiting, this parameter defines the maximum number of threads that can be created in the system. Of course, this is a system-wide limitation.

Reaching this limit means the system can no longer clone new tasks (whether processes or threads), which leads to unstable system behavior.

 

max_map_count

This parameter sets the maximum number of virtual memory areas (VMAs) that one process can own. A VMA is a contiguous area of virtual address space; you can inspect a process's VMAs with cat /proc/PID/maps. Each new thread needs a new stack, which is typically allocated implicitly (via malloc/mmap), and each such allocation requires a VMA in the process. If the other two parameters are not limiting, this parameter defines the maximum number of threads one process can start.
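
To see where a particular box stands, all three limits can be read straight from /proc; a small sketch:

#include <stdio.h>

/* Print the current value of a single /proc tunable. */
static void print_limit(const char *path)
{
    FILE *f = fopen(path, "r");
    long value;

    if (f == NULL)
        return;
    if (fscanf(f, "%ld", &value) == 1)
        printf("%s = %ld\n", path, value);
    fclose(f);
}

int main(void)
{
    print_limit("/proc/sys/kernel/threads-max");
    print_limit("/proc/sys/kernel/pid_max");
    print_limit("/proc/sys/vm/max_map_count");
    return 0;
}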

More information:

http://www.novell.com/support/kb/doc.php?id=7000830
http://www.redhat.com/magazine/001nov04/features/vm/

Sunday, June 3, 2012

Huawei E220 and USSD

When connected, the Huawei E220 modem creates two devices:

- /dev/ttyUSB0
- /dev/ttyUSB1

In order to send a USSD request (e.g. a balance check) and read the response, execute:

# echo "AT+CUSD=1,*102#^M" > /dev/ttyUSB0; cat /dev/ttyUSB1

IMPORTANT: ^M (Carriage Return character) is typed as CTRL+v and CTRL+m

Let's create a one-liner that returns the balance check response as a single line:

# echo "AT+CUSD=1,*102#,15^M" > /dev/ttyUSB0; for i in `seq 15`; \
 do read line; if echo $line | grep '+CUSD'; then break; fi ; done < /dev/ttyUSB1

I use its output as the content of the monthly SMS my server sends to my mobile phone.