Overrun when doing sync meta transfers TX and RX

Discussions related to schematic capture, PCB layout, signal integrity, and RF development

Moderator: robert.ghilduta

Bcd
Posts: 14
Joined: Tue Jun 06, 2017 3:00 am

Overrun when doing sync meta transfers TX and RX

Post by Bcd » Thu Dec 07, 2017 12:28 pm

Hi there,

I am working on a SDR program that requires using both RX and TX at specific timestamps. In testing the program, sometimes I get overruns and discontinued timestamps on RX.

I reproduced the bug with a small program by modifying the sync_rxtx.c (in order to reduce the clutter) and, yes, I am able to see the same overrun behaviour.

Looking at dmesg I see that the USB device is reset due to LPM from time to time. I correlate the messages with the moments when the reset occurs but I am not sure about that.

One strange thing is that the overrun occurs when I create the timestam for TX based on the one from RX like this:
meta_tx.timestamp = meta_rx.timestamp + 10000;

but it does not occur when I create it like this:

meta_tx.timestamp += 10000;

I am using Ubuntu 16.04.3 with kernel 4.10.

Is there something I am missing, is there something I need to do?

output from console:

.........
TX actual_count: 0, timestamp: 81667826
RX actual_count: 10000, timestamp: 81667826
TX actual_count: 0, timestamp: 81677826
RX actual_count: 10000, timestamp: 81677826
TX actual_count: 0, timestamp: 81687826
RX actual_count: 10000, timestamp: 81687826
TX actual_count: 0, timestamp: 81697826
RX actual_count: 10000, timestamp: 81697826
TX actual_count: 0, timestamp: 81707826
RX actual_count: 10000, timestamp: 81707826
TX actual_count: 0, timestamp: 81717826
RX actual_count: 10000, timestamp: 81717826
TX actual_count: 0, timestamp: 81727826
RX actual_count: 10000, timestamp: 81727826
TX actual_count: 0, timestamp: 81737826
RX overrun, count: 8792, timestamp: 81737826
RX actual_count: 8792, timestamp: 81737826
TX actual_count: 0, timestamp: 81747826
RX actual_count: 10000, timestamp: 81747646
Closing the device...

dmesg:

[70124.812673] usb 1-4: Product: bladeRF
[70124.812676] usb 1-4: Manufacturer: Nuand
[70124.812678] usb 1-4: SerialNumber: 9eeeeecf610343cba827a9889dcc9c23
[70135.899122] usb 1-4: reset high-speed USB device number 5 using xhci_hcd
[70189.999920] usb 1-4: USB disconnect, device number 5
[70266.959216] usb 2-2: new SuperSpeed USB device number 9 using xhci_hcd
[70266.979724] usb 2-2: LPM exit latency is zeroed, disabling LPM.
[70266.980198] usb 2-2: New USB device found, idVendor=2cf0, idProduct=5246
[70266.980203] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[70266.980207] usb 2-2: Product: bladeRF
[70266.980210] usb 2-2: Manufacturer: Nuand
[70266.980213] usb 2-2: SerialNumber: 9eeeeecf610343cba827a9889dcc9c23
[70276.919235] usb 2-2: reset SuperSpeed USB device number 9 using xhci_hcd
[70276.939527] usb 2-2: LPM exit latency is zeroed, disabling LPM.
[70317.299166] usb 2-2: reset SuperSpeed USB device number 9 using xhci_hcd
[70317.319220] usb 2-2: LPM exit latency is zeroed, disabling LPM.
[70376.383426] usb 2-2: reset SuperSpeed USB device number 9 using xhci_hcd
[70376.403448] usb 2-2: LPM exit latency is zeroed, disabling LPM.
[70821.463588] usb 2-2: reset SuperSpeed USB device number 9 using xhci_hcd
[70821.483728] usb 2-2: LPM exit latency is zeroed, disabling LPM.
[70869.671442] usb 2-2: reset SuperSpeed USB device number 9 using xhci_hcd
[70869.691695] usb 2-2: LPM exit latency is zeroed, disabling LPM.

Link to download the sync_rxtx.c

https://we.tl/ZUJuq92pOy

Thanks

Bcd
Posts: 14
Joined: Tue Jun 06, 2017 3:00 am

Re: Overrun when doing sync meta transfers TX and RX

Post by Bcd » Sat Dec 09, 2017 11:35 am

Since last post I did more investigation on the USB problem and I enabled USB logging in kernel and attach now a complete dmesg with all debug messages. Seems the EP is reset from time to time, not sure if this is ok or not or if it affects the transfer.

Also I attach also the complete log from the example_meta_rxtx.c app

Am am not able to attach logs. Is there a way for enabling attachments instead of external storage and post link here?

dmesg - https://we.tl/187wzTP9aS
run log - https://we.tl/GmOO7ODV2g

Thanks

Bcd
Posts: 14
Joined: Tue Jun 06, 2017 3:00 am

Re: Overrun when doing sync meta transfers TX and RX

Post by Bcd » Wed Dec 13, 2017 6:41 am

More news on the problem.

I moved the bladerf to another PC, having a different USB 3.0 controller. I am using the same software stack and FW/FPGA versions: 0.6, 2.0.0

The result is the same, the app crashes as there is overrun. From logs it seems that the driver restarts and hensce the discontinuity in timestamps after reboot..


Anyone willing to help me with running the app (see above posts) to help me identify if this is a failing BladeRF?

Thanks

Bcd
Posts: 14
Joined: Tue Jun 06, 2017 3:00 am

Re: Overrun when doing sync meta transfers TX and RX

Post by Bcd » Wed Dec 13, 2017 6:56 am

Well mixed news and feelings !!! I did a test by reducing the temperature of the BladeRF by taking it outside of the window. There are 12 degree C outside and I am using the plastic case that I bought from Nuand together with bladeRF.

Seems the problem disappears!!! And there is no overrun for long time. The app ran for like 10 hours without overrun and crashing. When I bring it inside and the temperature rises it starts giving overruns and I return back to the usual behavior.

Anyone owning a bladeRF wanting to test the same app to see if I have a defective unit? The unit is new and I just bought it after I returned the previous one due to being deaf (the first one was a new unit too).

Thanks

Bcd
Posts: 14
Joined: Tue Jun 06, 2017 3:00 am

Re: Overrun when doing sync meta transfers TX and RX

Post by Bcd » Wed Dec 13, 2017 9:44 am

So I add a modified sync_rxtx.c without prints on the console, for those who want to test:

https://we.tl/Givbyq11vL

and also example_common.h - it defines the samplerate and frequencyes:

https://we.tl/YNyqv0g3QS

and example_common.c

https://we.tl/oYCQfrdyPo
Last edited by Bcd on Wed Dec 13, 2017 12:51 pm, edited 1 time in total.

bglod
Posts: 103
Joined: Thu Jun 18, 2015 6:10 pm

Re: Overrun when doing sync meta transfers TX and RX

Post by bglod » Wed Dec 13, 2017 11:01 am

Fascinating. If it's temperature dependent, it sounds very much like a timing issue. Can you open up Quartus' TimeQuest and report failing paths? That might tell you something. If you're using the stock image, I can take a gander at it.
Electrical Engineer
Nuand, LLC.

Bcd
Posts: 14
Joined: Tue Jun 06, 2017 3:00 am

Re: Overrun when doing sync meta transfers TX and RX

Post by Bcd » Wed Dec 13, 2017 12:58 pm

bglod wrote:
Wed Dec 13, 2017 11:01 am
Fascinating. If it's temperature dependent, it sounds very much like a timing issue. Can you open up Quartus' TimeQuest and report failing paths? That might tell you something. If you're using the stock image, I can take a gander at it.
I do not have Quartus installed and did not use it before. I might try when have some time.
You can reproduce the problem using the above files. Your colleague Ray was able to reproduce it so far.

bglod
Posts: 103
Joined: Thu Jun 18, 2015 6:10 pm

Re: Overrun when doing sync meta transfers TX and RX

Post by bglod » Wed Dec 13, 2017 1:05 pm

I gotcha. It would help if I checked IRC, huh? :oops:
Electrical Engineer
Nuand, LLC.

User avatar
rtucker
Posts: 36
Joined: Sun Jan 25, 2015 10:38 am

Re: Overrun when doing sync meta transfers TX and RX

Post by rtucker » Wed Dec 13, 2017 4:49 pm

So yep, was working with Bcd via IRC today...

The application in question alternates between doing a bladerf_sync_rx of 10000 samples, followed by a bladerf_sync_tx of 10000 samples, scheduled for 1024 ticks after the RX sample timestamp (meta_tx.timestamp = meta_rx.timestamp + 1024).

The parameters:
num_buffers = 256
buffer_size = 8192
num_transfers = 32
sample rate = 7.68 Msps

How things break down: discontinuities in the RX sample timestamps occur, and eventually a BLADERF_ERR_TIME_PAST is thrown because bladerf_sync_tx is called with a timestamp in the past.

Things that don't seem to make a difference on my end:
  • FPGA 0.6.0 vs 0.7.1
  • libbladeRF 1.7.2 vs 1.9.0[
  • Temperature (tried 5, 20, and 40 degrees C, no big difference)
Things that do seem to make a difference:
  • buffer_size (I cannot get it to fail with 8192, but I can with 1024)
  • samples_len (if sample_len is bigger than buffer_size, it seems more likely to end on a BLADERF_ERR_TIME_PAST)
  • I/O load on host computer
The I/O load one was a bit of a surprise. I had a disk-intensive background task running on my laptop while working with this earlier, and then stepped away for a few hours. That disk-intensive process was done, and sure enough, I could no longer reproduce the problem. But if I did a "dd if=/dev/zero of=largefile bs=1024 count=4000000", the discontinuities start up again and eventually it will crash out. (Note that the original background task and dd are both CPU-intensive, so it could be that too.)

The textbook way to end up in this situation is to have too long of a gap between calls to bladerf_sync_rx(). So, I threw in some instrumentation.

First, up top:

Code: Select all

#include <sys/time.h>

#define RX_TIME_USEC 1500
#define TX_TIME_USEC 1500

// https://www.gnu.org/software/libc/manual/html_node/Elapsed-Time.html
int
timeval_subtract (struct timeval *result, struct timeval *x, struct timeval *y)
{
  /* Perform the carry for the later subtraction by updating y. */
  if (x->tv_usec < y->tv_usec) {
    int nsec = (y->tv_usec - x->tv_usec) / 1000000 + 1;
    y->tv_usec -= 1000000 * nsec;
    y->tv_sec += nsec;
  }
  if (x->tv_usec - y->tv_usec > 1000000) {
    int nsec = (x->tv_usec - y->tv_usec) / 1000000;
    y->tv_usec += 1000000 * nsec;
    y->tv_sec -= nsec;
  }

  /* Compute the time remaining to wait.
     tv_usec is certainly positive. */
  result->tv_sec = x->tv_sec - y->tv_sec;
  result->tv_usec = x->tv_usec - y->tv_usec;

  /* Return 1 if result is negative. */
  return x->tv_sec < y->tv_sec;
}
Then, the rxtx_loop section looks like this... each of the lines containing a reference to "tv_" is added for instrumentation. I think I put a " // <<<" at the end of the new lines.

Code: Select all

    /** [rxtx_loop] */

    struct bladerf_metadata meta_rx;
    struct bladerf_metadata meta_tx;
    struct bladerf_metadata meta;

    struct timeval tv_now, tv_start, tv_end, tv_delta, tv_last_rx, tv_last_tx; // <<<

    memset(&meta_rx, 0, sizeof(meta_rx));
    memset(&meta_tx, 0, sizeof(meta_tx));
    memset(&meta, 0, sizeof(meta));

    meta_tx.flags = BLADERF_META_FLAG_TX_BURST_START;
    meta.flags = BLADERF_META_FLAG_TX_BURST_START;

    meta_rx.flags = BLADERF_META_FLAG_RX_NOW;
    status = bladerf_sync_rx(dev, rx_samples, samples_len, &meta_rx, timeout_rx);
    if (status != 0)
        printf("Failed to RX count: %u, timestamp: %lu\n", meta_rx.actual_count, meta_rx.timestamp);
    else if (!status && meta_rx.status == BLADERF_META_STATUS_OVERRUN)
        printf("RX overrun, count: %u, timestamp: %lu\n", meta_rx.actual_count, meta_rx.timestamp);

    gettimeofday(&tv_last_rx, NULL); // <<<

#if 1
    // obtain current timestamp
    status = bladerf_get_timestamp(dev, BLADERF_MODULE_TX, &meta_tx.timestamp);
    if (status != 0) {
        fprintf(stderr, "Failed to get current TX timestamp: %s\n", bladerf_strerror(status));
        return status;
    } else {
        printf("Current TX timestamp: %lu\n", meta_tx.timestamp);
    }
#endif

    gettimeofday(&tv_last_tx, NULL); // <<<

    // an initial delay
    meta_tx.timestamp = meta_rx.timestamp;;

    while (status == 0 && !done) {
        gettimeofday(&tv_start, NULL); // <<<

#if 1   
        /* Receive samples */
        status = bladerf_sync_rx(dev, rx_samples, samples_len, &meta_rx, timeout_rx);
    if (status != 0)
        printf("Failed to RX count: %u, timestamp: %lu\n", meta_rx.actual_count, meta_rx.timestamp);
    else if (!status && meta_rx.status == BLADERF_META_STATUS_OVERRUN) {
        printf("RX overrun, count: %u, timestamp: %lu\n", meta_rx.actual_count, meta_rx.timestamp);
    }

    gettimeofday(&tv_now, NULL); // <<<
    timeval_subtract(&tv_delta, &tv_now, &tv_last_rx); // <<<
    if (tv_delta.tv_sec > 0 || tv_delta.tv_usec > RX_TIME_USEC) { // <<<
        printf("%ld.%06ld since last RX (current ts = %lu)\n", tv_delta.tv_sec, tv_delta.tv_usec, meta_rx.timestamp); // <<<
    } // <<<
    gettimeofday(&tv_last_rx, NULL); // <<<

    //printf("RX actual_count: %u, timestamp: %lu\n", meta_rx.actual_count, meta_rx.timestamp);
#endif

    status = 0;
        if (status == 0) {

            /* Process these samples, and potentially produce a response
             * to transmit */

            //done = do_work(rx_samples, samples_len, &have_tx_data, tx_samples, samples_len);
            done = false;
            have_tx_data = 1;

            if (!done && have_tx_data) {

        meta_tx.timestamp = meta_rx.timestamp + 10000;
        //meta_tx.timestamp += 10000;

                /* Transmit a response */
                status = bladerf_sync_tx(dev, tx_samples, samples_len, &meta_tx, timeout_tx);
                if (status != 0) {
                    fprintf(stderr, "Failed to TX samples: %s\n",
                            bladerf_strerror(status));
                }
        else if (!status && meta_tx.status == BLADERF_META_STATUS_UNDERRUN)
            printf("TX actual_count: %u, timestamp: %lu\n", meta_tx.actual_count, meta_tx.timestamp);

        gettimeofday(&tv_now, NULL); // <<<
        timeval_subtract(&tv_delta, &tv_now, &tv_last_tx); // <<<
        if (tv_delta.tv_sec > 0 || tv_delta.tv_usec > TX_TIME_USEC) { // <<<
            printf("%ld.%06ld since last TX (current ts = %lu)\n", tv_delta.tv_sec, tv_delta.tv_usec, meta_tx.timestamp);  // <<<
        } // <<<
        gettimeofday(&tv_last_tx, NULL); // <<<

        meta_tx.flags = BLADERF_META_FLAG_TX_UPDATE_TIMESTAMP;


        //meta_tx.timestamp += samples_len + 10;

            }
        } else {
            fprintf(stderr, "Failed to RX samples: %s\n",
                    bladerf_strerror(status));
        }

        //done = (meta_tx.timestamp > 50000000);

        gettimeofday(&tv_end, NULL); // <<<

    }

    timeval_subtract(&tv_delta, &tv_end, &tv_start); // <<<
    printf("last loop took %ld.%06ld seconds\n", tv_delta.tv_sec, tv_delta.tv_usec); // <<<

    /** [rxtx_loop] */
With samples_len = 8192, at 7.68 Msamples/s, the time between each call to RX or TX should average out to about 0.001067 seconds. If it starts running substantially longer than this on a regular basis, an overflow can occur.

So, with this in mind, I would recommend making sure your system is as quiet as possible (CPU, disk, network, video, etc). Playing around with gprof might prove interesting, too. I don't think it's a hardware problem, although I could see this being exacerbated by a weird USB signal integrity issue. Try a different USB cable? I'd think that kind of issue would show up with slower sample rates as well, though.
Rey Tucker (she/her)
Systems Engineer, Nuand LLC
Rochester, NY, USA

Bcd
Posts: 14
Joined: Tue Jun 06, 2017 3:00 am

Re: Overrun when doing sync meta transfers TX and RX

Post by Bcd » Thu Dec 14, 2017 3:29 am

Thanks for your effort Rey!

Below I am going through things I consider worth investigating:

- Where do I get the latest FPGA? You mentioned 0.7.1, I have 0.6.0

- I/O for me does not matter much. I did the dd test and there is no noticeable difference. Maybe because on my laptop the USB3.0 controller does not have another device connected to it. How is it in your case? lsusb -t shows the tree.

- I integrated the changes in your comments above and the result is that the average time for RX/TX is about 2.2ms so I needed to change the defines to have 2500 instead of 1500. Hmm, what is the actual calculus of transfer for 10000bytes RX/RX taking in consideration libbladerf, libusb and controllers? I will do later the same test on a PC where I had the same problems.

- the temperature matters a lot in my case. At the room temp (23 degree C) there happens the sample discontinuity that I believe it is due to a reset of device during transfer. If I take the bladeRF board outside in 12 degree C, there is no timestamp discontinuity and the timediff is kept below 2.5 and the app can run for long time (half an hour at least for now). See the next point regarding the reset.

See below an excerpt from a quick run and crash at room temp:

Running...
[VERBOSE @ nios_access.c:298] nios_config_read: Read 0x0003002f
[VERBOSE @ nios_access.c:309] nios_config_write: Wrote 0x0003002f
[VERBOSE @ nios_access.c:298] nios_config_read: Read 0x0003002f
[VERBOSE @ nios_access.c:309] nios_config_write: Wrote 0x0003002f
[DEBUG @ bladerf.c:352] Enable Module: RX - True
[VERBOSE @ nios_access.c:409] nios_lms6_read: Read 0x00 from addr 0x70
[VERBOSE @ nios_access.c:421] nios_lms6_write: Wrote 0x01 to addr 0x70
[DEBUG @ bladerf.c:352] Enable Module: TX - True
[VERBOSE @ nios_access.c:409] nios_lms6_read: Read 0x00 from addr 0x40
[VERBOSE @ nios_access.c:421] nios_lms6_write: Wrote 0x02 to addr 0x40
[DEBUG @ sync.c:326] sync_rx: Worker is idle. Going to reset buf mgmt.
[DEBUG @ sync.c:346] sync_rx: Reset buf_mgmt consumer index
[DEBUG @ sync.c:361] sync_rx: Worker is now running.
[VERBOSE @ nios_access.c:370] nios_get_timestamp: Read TX timstamp: 0x140720752595968
Current TX timestamp: 18004
[DEBUG @ sync.c:849] sync_tx: Worker is now running.
0.004856 since last TX (current ts = 181036)
0.004253 since last RX (current ts = 190012)
0.002671 since last RX (current ts = 1820012)
0.002722 since last TX (current ts = 1821036)
[DEBUG @ sync.c:485] Sample discontinuity detected @ buffer 140, message 4: Expected t=48997628, got t=317433084
RX overrun, count: 7616, timestamp: 48990012
[DEBUG @ sync.c:485] Sample discontinuity detected @ buffer 140, message 5: Expected t=317433592, got t=48998136
RX overrun, count: 508, timestamp: 317433084
[DEBUG @ sync.c:704] Provided timestamp=48999160 is in past: current=317444108
Failed to TX samples: Requested timestamp is in the past
last loop took 0.000988 seconds
[DEBUG @ bladerf.c:352] Enable Module: RX - False
[VERBOSE @ nios_access.c:409] nios_lms6_read: Read 0x01 from addr 0x70


- there is a reset that happen at the device every times the app exists due to overruns. I can see it in dmesg (how is it in your case?).

[ 93.186439] usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
[ 93.206379] usb 2-2: LPM exit latency is zeroed, disabling LPM.
[ 294.510555] usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
[ 294.530444] usb 2-2: LPM exit latency is zeroed, disabling LPM.
[ 354.166147] usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
[ 354.186369] usb 2-2: LPM exit latency is zeroed, disabling LPM.

The reset happens during transfer and I think it is the cause of the bladerf timestamp mismatch after which the app exits. If the reset does not happen, even though the RX and TX tv timestamp diff go sometimes above 2.5ms there is no problem and the app keeps going. Could it be the FPGA or FX3 resetting? Is there a log for the FPGA/FX3 to check?

- I ordered a quality cable, I'll see how it goes.


Thanks

Post Reply