
Implement Fast Convolution algorithm #80

Open
AXP opened this issue Nov 16, 2020 · 18 comments

@AXP
Contributor

AXP commented Nov 16, 2020

I'd like to add a fast convolution support to DaisySP. Let's discuss the requirements.

1. Definition
In short, it is an algorithm that computes the linear convolution of a stream of data with a constant (usually long) vector in logarithmic rather than linear time per sample. It exploits the fact that convolution in the time domain is equivalent to multiplication in the frequency domain, and the latter can be computed efficiently with the Fast Fourier Transform in O(N*logN) time.
Typical applications of linear convolution are FIR filtering and reverberation.
A naive implementation immediately faces two problems:

  • The Fourier transform performs cyclic convolution rather than the required linear convolution
  • The FFT is inherently block-based and thus introduces latency

There are elegant solutions to both problems. The linear convolution may be redefined via cyclic convolution with a smart choice of FFT size and input signal padding, usually referred to as the "Overlap and Add" and "Overlap and Save" algorithms (the latter will be more appropriately called "Overlap and Discard" in this issue). The latency may be made arbitrarily small, or even avoided completely, by partitioning the filter kernel into a series of segments that are processed separately and recombined at the output, which is possible since convolution is a linear operation.

2. High level requirements
2.1. The algorithm should be functionally equivalent to the existing arm_fir_f32 function
2.2. The algorithm should produce valid results for any filter kernel length (including non-power-of-2 and odd values)
2.3. There is no preference between the Overlap-and-Add and Overlap-and-Discard algorithms (initial profiling on Daisy shows that O&D has a slight edge)
2.4. The algorithm should reuse existing arm_* functions as much as possible, unless a custom implementation can be shown to give at least a 10% overall performance increase
2.5. The algorithm should not use dynamic memory allocation
2.6. The algorithm should support the following channel mappings:
2.6.1. Mono -> Mono (1 filter kernel) [mandatory]
2.6.2. Mono -> Stereo (2 filter kernels) [optional]
2.6.3. Stereo -> Stereo (2 filter kernels L->L, R->R) [optional]
2.6.4. Stereo -> Stereo (4 filter kernels L->L, L->R, R->L, R->R) [optional]
2.7. The algorithm should allow multiple instances
2.8. The algorithm should use single precision floating point data
2.9. The algorithm should support single sample processing API
2.10. The algorithm should support block sample processing API for arbitrary block sizes
2.11. The algorithm should have the following unit tests:
2.11.1. Verification of processing results against arm_fir_f32 for various filter lengths, including non-power-of-2 and odd values. The target metric is the root mean squared error between the reference and the device under test, and it should not exceed -90 dB
2.11.2. Performance profiling (processing time per sample vs. filter kernel), including comparison with arm_fir_f32
2.11.3. Performance profiling (processing time per sample vs. block length) including comparison with arm_fir_f32
2.12. The unit tests should be self-contained and finish with a clear verification result (e.g. "PASS" or "FAIL") printed to the console over USB UART.
2.13. The algorithm should be accompanied by an example application (e.g. spring reverb, since it's hard to do algorithmically)
2.14. The algorithm should be implemented in object-oriented fashion and follow DaisySP coding style

3. Implementation details
3.1. I propose to have two separate implementations: one is pure FFT convolution with latency, and the other is partitioned convolution built on top of it. In some applications the latency can be tolerated, and partitioning would then introduce unnecessary processing overhead.
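For reference, the structure of a uniform partitioned scheme can be sketched as follows. This is a pure-Python illustration that uses a naive O(N^2) DFT as a stand-in for a real FFT; all names are illustrative, not the proposed API:

```python
# Uniform partitioned convolution sketch: the kernel is split into equal
# blocks, and each block's spectrum multiplies a progressively delayed copy
# of the input spectrum (a "frequency-domain delay line"), so only one
# forward and one inverse transform run per input block.
import cmath

def dft(x, sign=-1):
    # Naive O(N^2) DFT, standing in for a real FFT in this sketch.
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(x):
    n = len(x)
    return [v / n for v in dft(x, sign=+1)]

def partitioned_conv(signal, kernel, part_len):
    B = part_len
    P = -(-len(kernel) // B)                       # number of partitions
    kpad = list(kernel) + [0.0] * (P * B - len(kernel))
    H = [dft([complex(c) for c in kpad[p * B:(p + 1) * B]] + [0j] * B)
         for p in range(P)]
    fdl = [[0j] * (2 * B) for _ in range(P)]       # spectrum delay line
    hist = [0.0] * B                               # previous input block
    out = []
    for start in range(0, len(signal), B):
        block = list(signal[start:start + B])
        block += [0.0] * (B - len(block))
        fdl.insert(0, dft([complex(v) for v in hist + block]))
        fdl.pop()
        acc = [sum(fdl[p][k] * H[p][k] for p in range(P)) for k in range(2 * B)]
        y = idft(acc)
        out.extend(v.real for v in y[B:])          # overlap-save: keep last B
        hist = block
    return out[:len(signal)]
```

The algorithmic latency here is one partition (B samples); the zero-latency variants mentioned above additionally compute the first partition by direct convolution.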
3.2. In general, FFT convolution algorithms allow some freedom in selecting the FFT size to decrease the average processing cost at the expense of even more latency. I propose to limit the FFT size to 2x the filter length (rounded up to the nearest power of 2, of course). This is a reasonable compromise and also the most useful choice for later inclusion into the partitioned convolution scheme.
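As a back-of-the-envelope check on that compromise, one can count complex multiplies per fresh output sample for different FFT sizes. This is a rough model (ignoring constants, real-FFT symmetry and memory effects), with hypothetical names:

```python
# Rough per-sample cost model for FFT convolution, not a benchmark.
# For FFT size N and kernel length M, each block yields N - M + 1 fresh
# samples and costs roughly 2 * N * log2(N) (FFT + iFFT) plus N (spectral
# multiply) operations.
import math

def cost_per_sample(fft_size, kernel_len):
    fresh = fft_size - kernel_len + 1
    if fresh <= 0:
        return math.inf
    ops = 2 * fft_size * math.log2(fft_size) + fft_size
    return ops / fresh

M = 100  # example kernel length
for N in (128, 256, 512, 1024, 2048):
    print(N, round(cost_per_sample(N, M), 1))
```

The cost keeps dropping slowly past 2x the kernel length, but the latency doubles with every step, which is why 2x is a sensible default.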
3.3. So far I haven't decided whether to allow run-time filter length selection (up to a pre-configured limit). Fixing it at compile time allows significant optimization (about 15% higher throughput).
3.4. ARM filtering routines require the filter coefficients to be reversed in time (i.e. late-first order). I propose not to mimic this behavior, as it is counter-intuitive, especially for reverbs. But I can add a boolean parameter to flip the impulse response if required.
3.5. The ARM FFT routines inconveniently require out-of-place execution, while also modifying (i.e. corrupting) the input buffer. Otherwise several further optimizations would be possible.
3.6. Currently I propose to have the algorithm contain all necessary memory buffers within its class. Are there any arguments for user-provided memory storage?
3.7. Linking the ARM CMSIS libraries for the FFT and FIR routines significantly bloats the resulting binary; I can currently barely fit into 128 kB of flash. I haven't investigated in depth, but it looks like the precomputed tables from arm_common_tables.c are the culprit.
3.8. The straightforward partitioned convolution has the individual FFTs lined up in a way that results in a very spiky computation cost, which could easily cause audio drop-outs. Maybe I'll need to come up with some other scheme that is less efficient on average but has a more even load.
3.9. The partitioned convolution can be processed very efficiently in multiple parallel threads. I'm currently thinking about fake parallelism, where the processing API would be split into two parts: the bulk of the work (i.e. the FFTs) would be executed in the main application loop, while the time-critical input/output handling would take place in the audio callback (i.e. in the ISR).
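The split described here can be sketched with two ring buffers: the callback only moves samples, and the main loop runs the heavy processing whenever a full block has accumulated. This is a hypothetical structure (class and method names are made up), shown in Python for clarity:

```python
# "Fake parallelism" sketch: O(1) work in the audio callback, the FFT bulk
# deferred to the main loop. Not the proposed DaisySP API.
from collections import deque

class SplitConvolver:
    def __init__(self, block_size):
        self.block = block_size
        self.in_fifo = deque()
        # Pre-fill one block of zeros: the latency budget that gives the
        # main loop a whole block period to finish the heavy work.
        self.out_fifo = deque([0.0] * block_size)

    def audio_callback(self, sample):
        # Time-critical part (ISR): just move one sample in and one out.
        self.in_fifo.append(sample)
        return self.out_fifo.popleft() if self.out_fifo else 0.0  # underrun -> silence

    def main_loop_task(self, heavy_process):
        # Non-critical part: run the FFT bulk once a block is ready.
        if len(self.in_fifo) >= self.block:
            block = [self.in_fifo.popleft() for _ in range(self.block)]
            self.out_fifo.extend(heavy_process(block))
```

The obvious hazard on real hardware is that the two parts race on the FIFOs, so the embedded version would need interrupt-safe buffers rather than plain deques.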
3.10. Just FYI, FFT convolution seems to overtake direct FIR filtering at filter lengths of about 64 samples on Daisy.
3.11. I haven't decided on the naming yet. My current class is called FastConv. I'm not sure whether to make two classes to implement 3.1, or to simply add a compile-time configuration parameter. In the former case, I'd need a name for the second class.

Most probably I'm forgetting something, but I'd like to get the discussion going as soon as possible, so I can adjust my plans.

@tschrama

Fast convolution would be great. There is a fast convolution function built for the Teensy MCU. You might get some more insight reading the Teensy forum.

Ps
What does ‘logarithmic’ have to do with FFT convolution?

@AXP
Contributor Author

AXP commented Nov 16, 2020

Fast convolution would be great. There is a fast convolution function built for the Teensy MCU. You might get some more insight reading the Teensy forum.

Thanks for the pointer, I wasn't keeping track so that's good to know.
I've skimmed through it. So far it seems like their implementation is rather cumbersome and contains GPL code. I believe I can do better. I will go through their discussion again later to see if there's still some knowledge to pick up.

Ps
What does ‘logarithmic’ have to do with FFT convolution?

I was talking in terms of the so-called big-O notation. Conventional convolution of an input sequence with a filter kernel of length N requires N multiplications per output sample, hence I called it linear time. If we are talking about processing a block of length M, it would take M*N multiplications.
If the FFT is used instead, it becomes the dominant operation. FFT computational complexity is proportional to N*log2(N), and that computes a whole block of N samples. So per sample it takes about log2(N) operations. Hence I called it "logarithmic time".
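In code, the per-output-sample multiply counts (constants ignored) look like this; the function names are just for illustration:

```python
# Per-sample multiply counts: direct FIR needs N, FFT convolution roughly
# log2(N), amortized over an N-sample block. Constant factors are ignored.
import math

def per_sample_direct(n):
    return n                # N multiplies per output sample

def per_sample_fft(n):
    return math.log2(n)     # ~log2(N) per sample, amortized per block

for n in (64, 1024, 16384):
    print(n, per_sample_direct(n), per_sample_fft(n))
```

In practice the FFT's constant factor is larger, which is why the crossover versus direct FIR filtering happens at some finite filter length rather than immediately.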

Here's an illustration:

[image: linear vs. logarithmic time complexity illustration]
Source: https://towardsdatascience.com/linear-time-vs-logarithmic-time-big-o-notation-6ef4227051fb

@stephenhensley
Collaborator

Woah! Thanks for sharing so much detail, this is really great. I've been wanting to add convolution, but I have had little to no time to even consider the approach so this is truly exciting!

Your plan sounds solid to me.

There are definitely some oddities with the ARM code wrt code size. The common tables get brought in through the init functions for all table sizes, including the ones that are not used. I'm open to solutions on this front.

Related to that, this would be the first thing that relies on functions from the CMSIS DSP lib. While I do agree that we should use as much of that as possible, I am hoping that despite being focused on an embedded platform, this library could be portable enough to test/use on PCs as well as embedded environments.

@tschrama

ah. Now I understand😄. Thx

@recursinging

Now this is an issue description, awesome.

Unfortunately, my years of slogging out business software in Java don't afford me the knowledge to address your implementation strategy specifically, but if I may anyway add my .02€, I'd like to echo @stephenhensley

I am hoping that despite being focused on an embedded platform, this library could be portable enough to test/use on PCs as well as embedded environments.

Please consider this when designing your implementation. It'd be great to have both a reference implementation, intuitively designed and free from platform-specific oddities, and an optimized platform-specific implementation. Optimally, the implementations could be compared for conformance during testing.

I'm looking forward to your contribution.

@AXP
Contributor Author

AXP commented Nov 19, 2020

There are definitely some oddities with the ARM code wrt code size. The common tables get brought in through the init functions for all table sizes, including the ones that are not used. I'm open to solutions on this front.

My further experiments show that the linker effectively purges unreferenced symbols. For example, if the fixed-point (Q15, Q31) FFTs are not used, then the corresponding tables are not included in the executable. What does get included are the tables for all FFT sizes for a given format, since the FFT size is passed to the init function as a run-time parameter and thus in general cannot be determined at build time. By the way, someone has commented out most of the FFT sizes in libdaisy, apparently to bring the size down.

My current solution is rather simple: I've created a replacement source file that I compile and link instead of the existing arm_common_tables.c. The difference is that all tables are placed into the BSS SDRAM section, and there's an "init" function to fill them in at run time.
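The idea behind such a run-time init is just to synthesize the twiddle factors at startup instead of storing them in flash. A pure-Python illustration of what the init would compute (the function name is hypothetical, and the actual CMSIS table layout differs and covers multiple radices, so this only shows the principle):

```python
# Fill a twiddle-factor table at run time instead of storing it in flash.
import math

def init_twiddle_table(fft_size):
    # Interleaved re/im factors e^(-2*pi*i*k/N) for k = 0 .. N/2 - 1,
    # the values a radix-2 butterfly stage consumes.
    table = []
    for k in range(fft_size // 2):
        ang = -2.0 * math.pi * k / fft_size
        table.append(math.cos(ang))
        table.append(math.sin(ang))
    return table
```

The cost is a one-time burst of sin/cos evaluations at boot plus RAM for the tables, in exchange for the flash savings described above.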

Otherwise, if we want to support all sizes (and for binary partitioned convolution they are all required) the resulting twiddle factor tables occupy almost all available internal flash, so there's practically no more room for the user application code.

Related to that, this would be the first thing that relies on functions from the CMSIS DSP lib. While I do agree that we should use as much of that as possible, I am hoping that despite being focused on an embedded platform, this library could be portable enough to test/use on PCs as well as embedded environments.

Thanks, that was actually one of the questions I had, but forgot to write down.
So far my plan of record was to have `#if !defined(__arm__) #error "DIY"` ;)
Since we want to have this to work cross-platform:

  1. FFT needs to be included in DaisySP. The implementation I'm using on a PC is FFTReal by Laurent de Soras (http://ldesoras.free.fr/prod.html#src_audio). I can profile it on Daisy to compare with ARM's own implementation. We can then decide if we want to pull that in. It would be nice because a) it comes with a WTFPL license b) Laurent is a Daisy user himself.
  2. How come all existing examples require Daisy's board libraries? Shouldn't it then be something like ".../examples/daisy/...", ".../examples/x86/...", etc.? Maybe I'm missing a guide on how to build the library for other targets? It would be nice to be able to do something like `make all TARGET=PC` and get a console app as a result.

@AXP
Contributor Author

AXP commented Nov 19, 2020

Now this is an issue description, awesome.

Thanks, but I bet you'd rather have a working code without issue description than the other way round :)

I am hoping that despite being focused on an embedded platform, this library could be portable enough to test/use on PCs as well as embedded enviornments.

Please consider this when designing your implementation. It'd be a great to have both a reference implementation, intuitively designed, and free from platform specific oddities, and an optimized platform-specific implementation. Optimally the implementations could be compared during testing for conformance.

As we all know, trying to make something universal usually ends up making it inefficient or unreadable. Usually both.
In general I agree, especially since it's clearly spelled out in the contribution guidelines.
I'll see what I can do. Adding a reference implementation is definitely an option. I also have some ideas on how to efficiently externalize the FFT / complex multiplication / iFFT sandwich (which is very reference-y and readable) and separate it from all the addressing magic (which is cross-platform anyway).

I'm looking forward to your contribution.

Thanks! The plain spectrum-domain convolution (the one that still has the block latency) is already working perfectly. I'm finessing all the unit and performance tests. Next I will work on the partitioned mode (zero latency, which is the final goal); it will still take several weeks to complete since I'm only working on this in my spare time.

@AXP
Contributor Author

AXP commented Nov 20, 2020

My current solution is rather simple, I've created a replacement source file that I compile and link instead of the existing arm_common_tables.c. The difference is that all tables are placed into the BSS SDRAM section and there's an "init" function to fill them in at run-time.

One caveat, though, is that SDRAM latency is higher than that of internal SRAM.
For comparison, larger FFTs are 25% slower if the tables are in SDRAM, or 20% faster if they are in the tightly coupled memory (DTCM).

@EleonoreMizo

EleonoreMizo commented Nov 20, 2020

Having an efficient fast convolution with low latency would be very nice, indeed.

If it can help, I can add a second API to FFTReal to split the processing into smaller chunks. This would help spread the load over several blocks in the processing thread without using another (simulated) thread, resulting in a more predictable overall load.

@AXP
Contributor Author

AXP commented Nov 20, 2020

Having an efficient fast convolution with low latency would be very nice, indeed.

If it can help, I can add a second API to FFTReal to split the processing into smaller chunks. This would help spread the load over several blocks in the processing thread without using another (simulated) thread, resulting in a more predictable overall load.

Welcome to the party!
That would be awesome. I missed that for a long time, even on a PC. With such an API I could indeed create a deterministic scheduler and not rely on OS threading or its imitation.

@AXP
Contributor Author

AXP commented Nov 22, 2020

I've done some naive profiling for FFTReal already (pulled the latest out of the Pedale Vite repository).
So far it's 2x-3x slower than ARM's. The fixed-length version is better, and peculiarly bounded between the normal FFTReal and the dynamic-size ARM implementation.
This is without scaling, which is a separate step in FFTReal; for my use case it doesn't matter, since it will be factored out of real-time operation.
I'll do a deeper comparison later to see the impact on the whole algorithm.

The units are Daisy ticks (200 MHz)
Note: the plot is log-log
[image: log-log plot of FFT processing time, in Daisy ticks, vs. FFT size]

@EleonoreMizo

Sure. FFTReal was designed 25 years ago and optimized for the CPUs of that era, and it hasn’t really evolved since. I don’t expect it to compete with modern designs on current CPUs.

@jfsantos

Were there any advances in this direction? I'm looking into implementing reverbs and cabinet simulators for guitar using the Daisy, so that would be really handy :)

@AXP
Contributor Author

AXP commented Jul 19, 2021

Hi, unfortunately I have been very busy for the last half-year.
So the status is as before: I have a working prototype, which I need to tune, clean up and test before I'm comfortable submitting it.
One of the key questions was which FFT library should be used in Daisy. I was in favour of supporting both FFTReal and ARM's implementations.
Now I would also need to catch up with any advancements in the Daisy code base.
Hopefully I'm able to work on this soon!

@jfsantos

@AXP do you have a working repo that is public so we could check it out and help test/contribute? I have some experience using kissfft and wouldn't mind adding it to whatever profiling you already have. I figure that people using this on the Daisy boards will want to use the ARM FFT, but those using DaisySP without a Daisy board might benefit from a more efficient FFT library.

@michaelbbaker

@AXP I'm so interested in this! Any progress? Is this abandoned? Is there a possibility of you sharing what you've developed so others can contribute?

@ahause23

ahause23 commented Jun 3, 2022

I'm keen to look into IR stuff for guitar pedals too, mainly speaker emulation and EQ profiles. Has there been any progress on this front?

@AXP
Contributor Author

AXP commented Jun 3, 2022

@AXP I'm so interested in this! Any progress? Is this abandoned? Is there a possibility of you sharing what you've developed so others can contribute?

I'm keen to look into IR stuff for guitar pedals too, mainly speaker emulation and EQ profiles. Has there been any progress on this front?

Hello everybody. I'm sorry, but currently there's very little chance I can continue working on this. I have two jobs, and I'm in the process of relocating to a different country. But I will try to find time to upload what I've already done to a public repo so others can pick it up.

8 participants