[OpenOCD-devel] Debug probe hardware that can read/write target memory directly

Discussion:

Fewell, Edward

2017-05-31 12:59:34 UTC

I am currently implementing the Texas Instruments XDS110 debug probe in OpenOCD. The firmware of this probe includes APIs to read and write memory via the target's DAP for ARM Cortex-based devices. The OpenOCD software stack does not appear to support this and can only execute DAP register transactions via the probe.

Any ideas on how best to implement this feature from within OpenOCD? The performance gains are huge for this probe. We see almost 40x improvement between the CMSIS-DAP interface vs using our built-in memory calls on the XDS110.

The memory calls within the ARM Cortex architecture code would need to check if the probe supports direct memory calls then route the call to be handled entirely by the probe firmware. Otherwise, it could continue on to do the memory transaction via DAP register calls.
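The routing described above could be sketched as follows. This is only an illustration of the check-then-fall-back idea; the class names, the `supports_direct_memory` flag, and both read methods are hypothetical and are not OpenOCD's actual driver API.

```python
class DapOnlyProbe:
    """Hypothetical probe that only offers DAP register transactions."""

    supports_direct_memory = False

    def dap_read_memory(self, ap, addr, size):
        # Stand-in for the existing path built from DAP register accesses.
        return ("dap-registers", ap, addr, size)


class DirectMemoryProbe(DapOnlyProbe):
    """Hypothetical probe whose firmware also implements memory reads."""

    supports_direct_memory = True

    def direct_read_memory(self, ap, addr, size):
        # Stand-in for one firmware call that performs the whole transfer.
        return ("probe-firmware", ap, addr, size)


def read_memory(probe, ap, addr, size):
    """Use the probe's memory call if it has one, else fall back."""
    if getattr(probe, "supports_direct_memory", False):
        return probe.direct_read_memory(ap, addr, size)
    return probe.dap_read_memory(ap, addr, size)
```

The caller never needs to know which path was taken, which is what would let existing probes keep working unchanged.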

Edward Fewell
Texas Instruments, Inc

Fewell, Edward

2017-05-31 14:19:10 UTC

This is code running on the debug probe's CPU, not the target. It's flashed into the XDS110 as part of its firmware. So the probe has the "smarts" to do ARM DAP memory transactions directly.

The current software stack issues DAP register accesses to do a memory transaction (e.g. setting CSW, TAR, and SELECT, reading the AP, etc.). The firmware in our probe does all that for us. So the host-side code just needs to pass in a request like "read 4 bytes from address 0x20000000 on AP #0", and the XDS110 does all the work. The result is fewer USB packets needed, and because this is a full speed USB device, that ends up being a significant performance boost. This is a low-cost probe. The base CPU chip only has full speed support. High speed requires external parts not included in most XDS110 implementations. The purpose of this design was something cheap that could be essentially given away for "free" as part of an evaluation board.
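To make the difference concrete, here is a simplified comparison of the two request shapes for one 32-bit read. The register list follows the usual ADIv5 MEM-AP flow, but it is an illustration only: the tuple format and the `mem_read` request are hypothetical, the CSW value is left unspecified, and a real host may queue or skip some of these accesses.

```python
def dap_register_sequence(ap, addr):
    """Simplified register accesses for one 32-bit MEM-AP read."""
    return [
        ("write", "DP.SELECT", ap << 24),  # pick the AP (APSEL, bits 31:24)
        ("write", "AP.CSW", None),         # size/increment setup, value omitted
        ("write", "AP.TAR", addr),         # target address
        ("read", "AP.DRW", None),          # posted read of the data word
        ("read", "DP.RDBUFF", None),       # collect the posted result
    ]


def probe_memory_request(ap, addr, size):
    """Hypothetical single request handled entirely by probe firmware."""
    return [("mem_read", ap, addr, size)]


host_driven = dap_register_sequence(0, 0x20000000)
probe_driven = probe_memory_request(0, 0x20000000, 4)
print(len(host_driven), "register accesses vs", len(probe_driven), "request")
```

How many USB packets each list turns into depends on how well the host batches the register accesses, which is the point debated later in the thread.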

Edward

From: Steven Stallion [mailto:***@squareup.com]
Sent: Wednesday, May 31, 2017 8:55 AM
To: Fewell, Edward
Cc: openocd-***@lists.sourceforge.net
Subject: Re: [OpenOCD-devel] Debug probe hardware that can read/write target memory directly

Hi Edward,

It sounds like you're describing an algorithm (code that is executed on the target). Is this something that is resident only when firmware is loaded? Is this something that's baked into every firmware image or is it copied to RAM during debugging?

Cheers,
Steve

_______________________________________________
OpenOCD-devel mailing list
OpenOCD-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openocd-devel

Forest Crossman

2017-06-03 21:34:11 UTC

Check out the HLA (high-level adapter) API--it specifies functions for bulk
memory reads and writes, so it might work for your use case. If you'd like
some examples on how to use it, take a look at the stlink_usb.c and
ti_icdi_usb.c drivers in src/jtag/drivers.

Andreas Fritiofson

2017-06-03 22:39:22 UTC

Hi and thanks for working on OpenOCD!

OpenOCD's architecture is not flexible enough to easily support generic
adapters with arbitrary adapter-specific optimizations (nor do I think it
should be). We have the "High-Level Adapters", ST-Link and ICDI, which have
similar high-level primitives for memory access, except that those lack any
low-level operations at all. The support for those adapters exists at the
target layer (!) because that's the best fit. It's a horrible hack and it
makes these adapters unsuitable for low-level debug work. You probably
don't want to go that route, and you probably can't if the firmware doesn't
include complete "target-level" operations.

Post by Fewell, Edward
Any ideas on how best to implement this feature from within OpenOCD? The
performance gains are huge for this probe. We see almost 40x improvement
between the CMSIS-DAP interface vs using our built-in memory calls on the
XDS110.

I'm going to have to be a bit skeptical here. Without knowing the details
of what you have done or the XDS110 internals, I don't think you can get
such significant speedup on memory transfers simply by using custom
commands instead of the CMSIS-DAP Transfer commands *when both options are
optimally implemented at both ends*. The CMSIS-DAP protocol itself has very
little overhead for bulk memory access, just 4 or 5 bytes per command which
can be up to 64K transfers. I doubt you can do noticeably better than that
by changing protocol. The MEM-AP overhead is also small since a bulk memory
access is basically just streaming words at least a KiB at a time with a
few register setups in between blocks. AP registers are only updated when
needed by OpenOCD so several smaller accesses do not require significantly
more AP writes than the inevitable transmission of the new address.

If you can actually get 40x speedup (in what sense?), you could potentially
suffer from a bad CMSIS-DAP implementation on either side (OpenOCD is
probably not optimal there) or more likely, be using an entirely different
USB configuration/interface/altsetting with bulk endpoints instead of the
HID-mandated interrupt endpoints that kill the potential of the CMSIS-DAP
protocol for no good reason at all.

So I'd guess that the speedup from higher-level memory transaction "smarts"
in the XDS110 FW is a red herring and that the performance gain is more or
less only due to using bulk instead of interrupt transfers. Then, one could
ask, why not simply transport the CMSIS-DAP protocol over the bulk
endpoints instead of using a different protocol. It would fit neatly into
OpenOCD (and other existing debuggers that already speak CMSIS-DAP) with
just a minor change in the low-level USB transport code...

Post by Fewell, Edward
The memory calls within the ARM Cortex architecture code would need to
check if the probe supports direct memory calls then route the call to be
handled entirely by the probe firmware. Otherwise, it could continue on to
do the memory transaction via DAP register calls.

What does the memory access fall back to if it's not supported? How are
non-memory access operations implemented? Are the "DAP register calls" also
implemented over the API that provides the memory access functions? What
else does that API provide?

Perhaps if you really wanted to use the probe's memory access functions
(which, if I guess correctly, by themselves won't play a major role in the
performance gains), you could detect memory reads by interpreting the
access pattern of AP registers and transform them to the corresponding
memory accesses.
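One way to picture that last suggestion is a small pass over the queued AP accesses that folds "write TAR, then N reads of DRW" into a single block read. This is a rough sketch only: the tuple format is made up for illustration, and it assumes the AP is already set up for auto-incrementing word accesses.

```python
def coalesce_reads(ops, word_size=4):
    """Fold 'write TAR, read DRW xN' runs into (address, byte_count) reads.

    ops is a list of ("w", "TAR", addr) or ("r", "DRW") tuples
    (a hypothetical format, not OpenOCD's queue representation).
    """
    reads = []
    start = None  # address of the run being accumulated
    count = 0     # DRW reads seen since the last TAR write
    for op in ops:
        if op[0] == "w" and op[1] == "TAR":
            if count:
                reads.append((start, count * word_size))
            start = op[2]
            count = 0
        elif op[0] == "r" and op[1] == "DRW":
            if start is None:
                raise ValueError("DRW read before any TAR write")
            count += 1
        else:
            raise ValueError(f"unhandled access: {op!r}")
    if count:
        reads.append((start, count * word_size))
    return reads


ops = [("w", "TAR", 0x20000000), ("r", "DRW"), ("r", "DRW"),
       ("w", "TAR", 0x20001000), ("r", "DRW")]
print(coalesce_reads(ops))
```

Each resulting pair could then be handed to the probe's block-read call, while anything that does not match the pattern would still go out as ordinary register accesses.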

/Andreas

Fewell, Edward

2017-06-04 15:46:53 UTC

You're right, I don't think the XDS110 will qualify as an HLA adapter. Its only optimization is to read/write memory as block accesses. It has no further knowledge or APIs to do debug on the target.

Keep in mind this is a full speed device, not high speed. Cutting down on the number of USB packets needed does help considerably. I don't doubt that the firmware's CMSIS-DAP code is not particularly well optimized. We took ARM's code for CMSIS-DAP and did the minimum porting effort they recommended to make it work. Other than updating it one time to pick up their fixes, we didn't touch it further. Nor are we likely to now. The intent from here on is to focus on OpenOCD.

The XDS110's APIs include DAP register calls that are carried over a bulk endpoint. Using that gives me about a 5x performance improvement over the CMSIS-DAP implementation. I intend to issue a patch to implement this level of support soon. However, that's still considerably slower than what I saw when I hacked up the target layer to call into the XDS110 memory APIs directly. Perhaps I'll build in some logging to demonstrate the difference in USB traffic when doing it that way.

Performance was measured by noting download times for a 75k program to RAM on the target, nothing particularly sophisticated.

The XDS110 does provide full low-level JTAG and SWD support. So debug can work over the low-level interfaces when the host code doesn't use the memory API optimizations.

Edward

Duane Ellis

2017-06-04 16:51:35 UTC

ed>> Keep in mind this is a full speed device, not high speed. Cutting down on the number of USB packets needed does help considerably.

Full vs. high speed is not important here.

Better: draw a timeline of the actual USB traffic captured on the wire and work on fixing that.

I am not talking about one of those “software sniffers”; these do not capture the NAKs and other bus traffic.
You need a capture from a real hardware sniffer (TotalPhase Beagle, Ellisys, or CATC USB Chief).

In theory, a USB 1.1 device transfers at 12 Mbit/s, or 1.5 MB/s, but USB has overhead, so you only get about 2/3 of the
total data rate, or about 1 MB/s. But how do you get that? That’s the key.

Let’s assume that the command/data overhead leaves another 2/3 of that, so roughly 600 KB/s would be unbelievable.

Nothing is currently that fast.
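The back-of-the-envelope numbers above work out as follows. The two 2/3 factors are the rough estimates from this post, not measured values.

```python
RAW_BITS_PER_SECOND = 12_000_000  # full-speed USB signalling rate

raw_bytes = RAW_BITS_PER_SECOND // 8                  # 1,500,000 B/s
after_usb_overhead = raw_bytes * 2 // 3               # about 1 MB/s
after_command_overhead = after_usb_overhead * 2 // 3  # about 667 KB/s

print(raw_bytes, after_usb_overhead, after_command_overhead)
```

The strict result is closer to 667 KB/s; the 600 KB/s quoted above is that figure rounded down.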

I like to describe the problem as train “box cars”. Periodically, USB sends a packet (think of the packet as a train box car).
The true limitation is not bit rate but packet rate, aka box cars per second. If you send each box car out the door with
1 box, leaving room for 63 other boxes, your efficiency is horrible.
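A rough model of the box-car effect is sketched below. The packet rate of 1000 packets per second and the reading of "75k" as 75 KiB are assumptions chosen only to make the arithmetic easy to follow; real rates depend on the host controller, the endpoint type, and the device.

```python
import math


def fill_ratio(payload_per_packet, packet_size=64):
    """Fraction of each 64-byte 'box car' that carries useful data."""
    return payload_per_packet / packet_size


def transfer_seconds(total_bytes, payload_per_packet, packets_per_second):
    """Time to move total_bytes when the packet rate is the limit."""
    packets = math.ceil(total_bytes / payload_per_packet)
    return packets / packets_per_second


program = 75 * 1024  # assumed size of the 75k download mentioned earlier
print(fill_ratio(1))                        # one box per car
print(transfer_seconds(program, 4, 1000))   # one 32-bit word per packet
print(transfer_seconds(program, 64, 1000))  # full packets
```

At a fixed packet rate the transfer time scales directly with how empty each packet is, which is why filling the packets matters more than the raw bit rate.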

Second, if you send a command and that command requires a reply, these two transfers often become very inefficient
USB transfers (i.e. the box cars are mostly empty).

Third, if there is any delay in preparing the *next* packet, your time slot for that packet is LOST; effectively you just
pushed an empty box car down the train track.

This is why I say: draw a timeline of actual USB traffic, as captured by an “on-the-wire” USB protocol analyzer.

You need to include the NAKs in that timeline graph and work on getting rid of those. You also need to include the SOF counts with dead time between them. Each one of those is a missed opportunity to send data; effectively a box car with zero boxes just rolled down the track.

-Duane.

Fewell, Edward

2017-06-04 19:41:42 UTC

Thanks for that info. The "box cars" analogy helps with describing the issue.

Yes, we are very, very guilty of sending out trains with just 1 box. The direct memory API was essentially a patch to our code base to correct that issue, especially for large memory block transfers. Our current target code base was developed, literally, decades ago for ISA and PCI bus probes where this was not an issue. And our first USB-based probes ran our entire software stack on the probe's CPU, so it still wasn't much of an issue. The XDS110 firmware design is constrained in that the target code always wants a response to each and every request, limiting our ability to send multiple "boxes" on each "train." The memory APIs help with this issue by filling the "boxes" with memory data going up or down the pipe.

So if I understand correctly, the target code in OpenOCD optimizes this by queueing up DAP register accesses and then calling the execute queue command to run them all at once. I can add APIs to the XDS110 firmware to enable that support to help speed things up. That should be sufficient and be fully compatible with OpenOCD.
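The queue-then-execute idea might look roughly like this on the host side. It is a toy model, not OpenOCD's queue or the XDS110 protocol: the class, the one-byte request encoding, and the 64-byte packet size are all assumptions for illustration.

```python
class DapQueue:
    """Toy queue: collect register accesses, send them in as few packets as fit."""

    PACKET_SIZE = 64  # assumed bytes per full-speed bulk packet

    def __init__(self):
        self.pending = []      # encoded requests not yet sent
        self.packets_sent = 0

    def queue_write(self, reg, value):
        # Hypothetical encoding: 1 request byte + 4 data bytes.
        self.pending.append(bytes([reg]) + value.to_bytes(4, "little"))

    def queue_read(self, reg):
        # Hypothetical encoding: 1 request byte; data comes back later.
        self.pending.append(bytes([reg]))

    def execute(self):
        """Pack all pending requests into packets and return them."""
        packets, current = [], b""
        for req in self.pending:
            if len(current) + len(req) > self.PACKET_SIZE:
                packets.append(current)
                current = b""
            current += req
        if current:
            packets.append(current)
        self.pending.clear()
        self.packets_sent += len(packets)
        return packets


q = DapQueue()
for word in range(12):
    q.queue_write(0x0C, word)  # 12 writes x 5 bytes = 60 bytes
print(len(q.execute()))        # all twelve fit in one packet
```

Sent one at a time, the same twelve writes would cost twelve packets plus a reply for each, which is the "one box per train" pattern described above.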

In the meantime, though, I would still like to issue a patch to add support for the XDS110 with the current firmware. Even after adding new APIs, I think we still need to support users that haven't updated their XDS110s yet, but issue them a warning that they could get better performance if they get the firmware update. Does that sound reasonable?

Edward

Duane Ellis

2017-06-04 20:42:20 UTC

If I were to rewrite a USB protocol for OpenOCD, I would do exactly this.

Look very carefully at the SWD commands.

The commands are exactly 1 byte + a 4-byte read, or 1 byte + a 4-byte write.

The first SWD byte always has bits 7 and 0 set (START + PARK),
thus there are really only (256/4)=64 possible values for an SWD command.
That leaves lots of other values for other commands.

Hence, any byte that comes down the pipe outside of that range is not an SWD command.
For an SWD command, look at the RW bit and determine whether there are 4 data bytes to write
or 4 data bytes to read…
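A minimal sketch of that classification is below. It assumes the request byte is stored the way it goes out on the wire, least significant bit first, so START is bit 0, RnW is bit 2, and PARK is bit 7. The function names are made up for illustration. Note that a real SWD request also has a fixed stop bit and a parity bit, so fewer than 64 byte values are valid requests; the sketch follows the looser test described above.

```python
START_AND_PARK = 0x81  # bit 0 (START) and bit 7 (PARK)
RNW_BIT = 0x04         # bit 2: 1 = read, 0 = write


def looks_like_swd_request(byte):
    """True if both the START and PARK bits are set."""
    return (byte & START_AND_PARK) == START_AND_PARK


def data_phase(byte):
    """Return (bytes to take from host, bytes to return to host)."""
    if not looks_like_swd_request(byte):
        raise ValueError("not an SWD request byte")
    return (0, 4) if byte & RNW_BIT else (4, 0)


candidates = [b for b in range(256) if looks_like_swd_request(b)]
print(len(candidates))   # 64 values reserved for SWD requests
print(data_phase(0xA5))  # a read: nothing to consume, 4 bytes to return
```

Every byte value that fails the first test is free to be used as an opcode for something else.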

Perform the SWD transaction and, for a read, write (buffer!) the data back to the host.

Key point: the data back to the host is just queued, not sent automatically, until either (a) the endpoint is full
or (b) the host sends a special command to *flush* the IN endpoint.

The other unused command byte values can be used to:

enable switch modes
JTAG operations
configure SWD or JTAG frequencies
Probably a command to clear any sticky error/status bit
And a command to read status bits.
And maybe some GPIO commands to set/clear reset pins
Maybe a delay (N) milliseconds so you can set reset, delay, then clear reset.
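Putting the pieces together, the probe-side loop could be sketched like this. It is a simulation in Python rather than firmware, and everything specific in it is an assumption: the opcode values, the 64-byte endpoint size, and the `swd_transfer` callback that stands in for the actual wire transaction.

```python
CMD_FLUSH = 0x00     # hypothetical opcode outside the SWD request range
CMD_DELAY_MS = 0x02  # hypothetical opcode, followed by one byte N
ENDPOINT_SIZE = 64   # assumed IN endpoint size in bytes


class ProbeSketch:
    def __init__(self, swd_transfer):
        # swd_transfer(request, write_data) returns 4 bytes for a read.
        self.swd_transfer = swd_transfer
        self.in_buffer = bytearray()  # replies queued for the host
        self.sent = []                # packets "sent" on the IN endpoint

    def _send(self, count):
        self.sent.append(bytes(self.in_buffer[:count]))
        del self.in_buffer[:count]

    def _queue_reply(self, data):
        self.in_buffer += data
        while len(self.in_buffer) >= ENDPOINT_SIZE:  # (a) endpoint is full
            self._send(ENDPOINT_SIZE)

    def feed(self, stream):
        """Process a byte stream received from the host."""
        i = 0
        while i < len(stream):
            byte = stream[i]
            i += 1
            if (byte & 0x81) == 0x81:  # START + PARK set: SWD request
                if byte & 0x04:        # RnW set: read, buffer the reply
                    self._queue_reply(self.swd_transfer(byte, None))
                else:                  # write: consume 4 data bytes
                    self.swd_transfer(byte, stream[i:i + 4])
                    i += 4
            elif byte == CMD_FLUSH:    # (b) host asks for a flush
                if self.in_buffer:
                    self._send(len(self.in_buffer))
            elif byte == CMD_DELAY_MS:
                i += 1                 # skip N; firmware would wait here
            else:
                raise ValueError(f"unknown command byte {byte:#04x}")
```

With this shape the host can stream many requests back to back and only pay for a reply packet when the buffer fills or when it explicitly asks for one.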

I can explain a lot more if you are interested.
The code to do this is SUPER TINY!

-Duane.