Hi,
Just out of sheer curiosity I decided to build a little proggy to see what the differences in speed would be between the various available ways to allocate memory. For this test I decided, per method, to build a list with 65536 nodes and subsequently deallocate every node again, 16 times in succession. The results were quite 'staggering', to say the least:
INFO : Started
INFO : -Iteration [ 0 ]
INFO : -- malloc
INFO : -- AllocVec
INFO : -- AllocVecPooled nice
INFO : -- AllocVecPooled brute
INFO : -- ItemPoolAlloc nice
INFO : -- ItemPoolAlloc brute
[ iterations 1 through 15 print the same progress lines ]

 : Method          Start          Build          Remove         Done           End            Total
 : malloc          [ 53184056 ] -- [ 53184109 ] -- [ 53365143 ] -- [ 53595826 ] -- [ 53595845 ] == [  411789 ]
 : AllocVec        [ 53623861 ] -- [ 53623898 ] -- [ 54376149 ] -- [ 54939848 ] -- [ 54939870 ] == [ 1316009 ]
 : MemPool nice    [ 54969963 ] -- [ 54970181 ] -- [ 56333899 ] -- [ 57580586 ] -- [ 57580748 ] == [ 2610785 ]
 : MemPool brute   [ 57614524 ] -- [ 57614703 ] -- [ 58976719 ] -- [ 58976737 ] -- [ 59514870 ] == [ 1900346 ]
 : ItemPool nice   [ 59547492 ] -- [ 59547733 ] -- [ 59742363 ] -- [ 59826589 ] -- [ 59869667 ] == [  322175 ]
 : ItemPool brute  [ 60219450 ] -- [ 60400551 ] -- [ 60642973 ] -- [ 60642993 ] -- [ 60673755 ] == [  454305 ]
INFO : Finished
The results for both variants of 'ItemPool' are surprising: deallocating all nodes on a one-by-one basis, the nice way, seems to be faster than the 'brute' way, with the latter taking around 40% MORE time to complete: 454305 for brute vs. 322175 for nice.
The program is this one:
/*
**
** main.c - AllocationSpeed
**
** Jan 23, 2026: First incarnation
**
**************************************/

#include <stdlib.h>
#include <time.h>

#include <proto/dos.h>
#include <proto/exec.h>

int main(void)
{
    clock_t TimeArray[6][5] = {0};   /* six methods, five timestamps each */
    uint16 AlloCount;
    struct List *L;
    struct Node *N, *R;

    IDOS->Printf("INFO : Started\n");

    for (uint16 i = 0; i < 16; i++) {
        IDOS->Printf("INFO : -Iteration [ %ld ]\n", i);

        IDOS->Printf("INFO : -- malloc\n");
        {
            TimeArray[0][0] += clock();
            AlloCount = 0;
            if ((L = malloc(sizeof(struct List)))) {
                IExec->NewList(L);
                TimeArray[0][1] += clock();
                do {
                    if ((N = malloc(sizeof(struct Node)))) {
                        IExec->AddTail(L, N);
                    }
                } while (++AlloCount);   /* uint16 wraps: 65536 allocations */
                TimeArray[0][2] += clock();
                N = IExec->GetHead(L);
                while (N) {
                    R = N;
                    N = IExec->GetSucc(N);
                    IExec->Remove(R);
                    free(R);
                }
                TimeArray[0][3] += clock();
                free(L);
            }
            TimeArray[0][4] += clock();
        }

        /* --------------------------------------------------------------- */

        IDOS->Printf("INFO : -- AllocVec\n");
        {
            TimeArray[1][0] += clock();
            AlloCount = 0;
            if ((L = IExec->AllocVecTags(sizeof(struct List), TAG_END))) {
                IExec->NewList(L);
                TimeArray[1][1] += clock();
                do {
                    if ((N = IExec->AllocVecTags(sizeof(struct Node), TAG_END))) {
                        IExec->AddTail(L, N);
                    }
                } while (++AlloCount);
                TimeArray[1][2] += clock();
                N = IExec->GetHead(L);
                while (N) {
                    R = N;
                    N = IExec->GetSucc(N);
                    IExec->Remove(R);
                    IExec->FreeVec(R);
                }
                TimeArray[1][3] += clock();
                IExec->FreeVec(L);
            }
            TimeArray[1][4] += clock();
        }

        /* --------------------------------------------------------------- */

        {
            IDOS->Printf("INFO : -- AllocVecPooled nice\n");
            TimeArray[2][0] += clock();
            AlloCount = 0;
            APTR MemPool = IExec->AllocSysObjectTags(ASOT_MEMPOOL,
                               ASOPOOL_MFlags, MEMF_CLEAR, TAG_END);
            if (MemPool) {
                if ((L = IExec->AllocVecPooled(MemPool, sizeof(struct List)))) {
                    IExec->NewList(L);
                    TimeArray[2][1] += clock();
                    do {
                        if ((N = IExec->AllocVecPooled(MemPool, sizeof(struct Node)))) {
                            IExec->AddTail(L, N);
                        }
                    } while (++AlloCount);
                    TimeArray[2][2] += clock();
                    N = IExec->GetHead(L);
                    while (N) {
                        R = N;
                        N = IExec->GetSucc(N);
                        IExec->Remove(R);
                        IExec->FreeVecPooled(MemPool, R);
                    }
                    TimeArray[2][3] += clock();
                    IExec->FreeVecPooled(MemPool, L);
                }
                IExec->FreeSysObject(ASOT_MEMPOOL, MemPool);
            }
            TimeArray[2][4] += clock();
        }

        /* --------------------------------------------------------------- */

        {
            IDOS->Printf("INFO : -- AllocVecPooled brute\n");
            TimeArray[3][0] += clock();
            AlloCount = 0;
            APTR MemPool = IExec->AllocSysObjectTags(ASOT_MEMPOOL,
                               ASOPOOL_MFlags, MEMF_CLEAR, TAG_END);
            if (MemPool) {
                if ((L = IExec->AllocVecPooled(MemPool, sizeof(struct List)))) {
                    IExec->NewList(L);
                    TimeArray[3][1] += clock();
                    do {
                        if ((N = IExec->AllocVecPooled(MemPool, sizeof(struct Node)))) {
                            IExec->AddTail(L, N);
                        }
                    } while (++AlloCount);
                    TimeArray[3][2] += clock();
                    /* "brute": no per-node frees, the pool is dropped whole */
                    TimeArray[3][3] += clock();
                    IExec->FreeVecPooled(MemPool, L);
                }
                IExec->FreeSysObject(ASOT_MEMPOOL, MemPool);
            }
            TimeArray[3][4] += clock();
        }

        /* --------------------------------------------------------------- */

        {
            IDOS->Printf("INFO : -- ItemPoolAlloc nice\n");
            TimeArray[4][0] += clock();
            AlloCount = 0;
            APTR MemPool = IExec->AllocSysObjectTags(ASOT_MEMPOOL,
                               ASOPOOL_MFlags, MEMF_CLEAR, TAG_END);
            if (MemPool) {
                if ((L = IExec->AllocVecPooled(MemPool, sizeof(struct List)))) {
                    IExec->NewList(L);
                    APTR ItemPool = IExec->AllocSysObjectTags(ASOT_ITEMPOOL,
                                        ASOITEM_MFlags, MEMF_CLEAR,
                                        ASOITEM_ItemSize, sizeof(struct Node),
                                        TAG_END);
                    if (ItemPool) {
                        TimeArray[4][1] += clock();
                        do {
                            if ((N = IExec->ItemPoolAlloc(ItemPool))) {
                                IExec->AddTail(L, N);
                            }
                        } while (++AlloCount);
                        TimeArray[4][2] += clock();
                        N = IExec->GetHead(L);
                        while (N) {
                            R = N;
                            N = IExec->GetSucc(N);
                            IExec->Remove(R);
                            IExec->ItemPoolFree(ItemPool, R);
                        }
                        TimeArray[4][3] += clock();
                        IExec->FreeSysObject(ASOT_ITEMPOOL, ItemPool);
                    }
                    IExec->FreeVecPooled(MemPool, L);
                }
                IExec->FreeSysObject(ASOT_MEMPOOL, MemPool);
            }
            TimeArray[4][4] += clock();
        }

        /* --------------------------------------------------------------- */

        {
            IDOS->Printf("INFO : -- ItemPoolAlloc brute\n");
            TimeArray[5][0] += clock();
            AlloCount = 0;
            APTR MemPool = IExec->AllocSysObjectTags(ASOT_MEMPOOL,
                               ASOPOOL_MFlags, MEMF_CLEAR, TAG_END);
            if (MemPool) {
                if ((L = IExec->AllocVecPooled(MemPool, sizeof(struct List)))) {
                    IExec->NewList(L);
                    APTR ItemPool = IExec->AllocSysObjectTags(ASOT_ITEMPOOL,
                                        ASOITEM_MFlags, MEMF_CLEAR,
                                        ASOITEM_ItemSize, sizeof(struct Node),
                                        TAG_END);
                    if (ItemPool) {
                        TimeArray[5][1] += clock();
                        do {
                            if ((N = IExec->ItemPoolAlloc(ItemPool))) {
                                IExec->AddTail(L, N);
                            }
                        } while (++AlloCount);
                        TimeArray[5][2] += clock();
                        /* "brute": drop the whole ItemPool in one call */
                        TimeArray[5][3] += clock();
                        IExec->FreeSysObject(ASOT_ITEMPOOL, ItemPool);
                    }
                    IExec->FreeVecPooled(MemPool, L);
                }
                IExec->FreeSysObject(ASOT_MEMPOOL, MemPool);
            }
            TimeArray[5][4] += clock();
        }
    } /* for() */

    static const char *Names[6] = {
        "malloc        ", "AllocVec      ", "MemPool nice  ",
        "MemPool brute ", "ItemPool nice ", "ItemPool brute"
    };

    IDOS->Printf("\n : Method          Start  Build  Remove  Done  End  Total\n");
    for (uint16 m = 0; m < 6; m++) {
        IDOS->Printf(" : %s [ %lld ] -- [ %lld ] -- [ %lld ] -- [ %lld ] -- [ %lld ] == [ %8lld ]\n",
                     Names[m],
                     TimeArray[m][0], TimeArray[m][1], TimeArray[m][2],
                     TimeArray[m][3], TimeArray[m][4],
                     TimeArray[m][4] - TimeArray[m][0]);
    }
    IDOS->Printf("INFO : Finished\n");

    return 0;
}
And if you want to try it out for yourself, here's the makefile used:
PROGNAME = AllocationSpeed

DEFINES += -DDEBUG \
           -DPROGNAME=$(PROGNAME)
#          -DUSE_ARGOPTS \
#          -DUSE_DATETIME

$(PROGNAME) : main.c
	gcc -o $(PROGNAME) main.c $(DEFINES)
	@Echo "==== Done compiling =========="
Surprising, no?
OldFart

Interesting tests. If I read the results correctly, malloc is the second fastest way, and AllocVec is more than three times slower, right?
@walkero
And a bit disturbing at the same time!
Like you stated: Interesting tests.
OldFart
Interesting results indeed, and not what one would expect. However, looking over the code I see some things that might affect the results:
o The pooled allocations use MEMF_CLEAR, while malloc() and AllocVec() do not. It takes time to clear the memory as it's allocated, so you're kind of comparing apples and oranges when comparing cleared against uncleared allocations.
o In the ItemPool tests a MemPool is created for a single List allocation. That's rather inefficient, and the overhead of setting up and freeing the MemPool might to some extent mask the speed of the ItemPool. What do you see if you use AllocVecTags() instead to allocate the List, which is more like the unpooled tests?
o You're depending on default puddle sizes for the pooled memory, which might not be optimal for this particular test, or at least might be different between the MemPool test and the ItemPool test.
o Standard libraries for C tend to do their own memory management, on top of whatever the OS provides. So the results for malloc() will likely depend on which C library was used. You seem to be using the default, which would be newlib. The results may be different for clib2/clib4.
o The presence of virtual memory can make tests like this difficult to do, as there might be a bunch of virtual memory operations going on underneath the allocations you're doing. Do you get similar results if you run the test program multiple times? What if you reverse the order of the individual tests (ItemPool brute first, malloc last)?
o Then there's the effect of the CPU's memory cache on the results, which might vary from one system to another. What Amiga are you running the test on?
Trying to look a little deeper at the results, I used the figures from your test and tried to filter out some of the overhead and look only at the time required to perform the allocations (the difference between step 2 and step 1), the time required to perform the frees (the difference between step 3 and step 2), as well as the sum of those:
Test          Allocations   Frees      Total
              (2 - 1)       (3 - 2)    (Alloc + Free)
----          -----------   --------   --------------
malloc             181034     230683          411717
AllocVec           752251     563699         1315950
Pool (nice)       1363718    1246687         2610405
Pool (brute)      1362016         18*        1362034
Item (nice)        194630      84226          278856
Item (brute)       242422         20*         242442

* The time required to free the pools was not recorded, so the time shown is just the time between two calls to clock(). It would be more informative to free the pools between the two calls to clock() for the brute tests.
As noted above, the pooled tests had to clear the memory after allocating it, so naturally that makes them take longer. It's also interesting to note that the allocation time for the "nice" and "brute" ItemPool tests is significantly different, even though the code is exactly the same.
Still, the results make a little more sense. malloc() seems to be quite fast compared to AllocVec(). AllocVecPooled() seems rather slow, but to be fair we'd need to see it without MEMF_CLEAR. ItemPools are quite fast, as expected, and would be even faster without MEMF_CLEAR.
@msteed
Some valid points you bring to the table.
I tweaked the code for both the 'MemPool' and 'ItemPool' variants into 4 scenarios: 'nice & clean', 'nice, but dirty', 'brute, but clean' and 'brute & dirty', where 'clean' and 'dirty' indicate use of the MEMF_CLEAR flag.
As this happens only once per test loop, I don't see it as much of a hindrance.
Don't the autodocs steer you into this?
Most coders will use newlib, no?
During test runs the results differ by just a few percentage points, so they can be considered very consistent. For that very reason the outer loop runs 16 times; that should even out some external influences.
System of choice, my only option, is an X5000/20.
In the first published list of results, the final line got mangled somehow and reflected wrong figures.
After tweaking, I got these results:
OldFart
Not really. Most people that are porting software from other systems started already using clib4 for some time now.
Do you have the latest code available somewhere? I am curious to test it on my systems.
I don't expect it to make much difference either, but since the result of the tests was not what we expected, I was casting about for something that might not work the way we expected.
Because you're allocating Nodes, which are rather small, I actually expect the default pool parameters (whatever they are) to work well. But again, I was looking for something that might not work the way we expect.
The 'PoolStat' command should show the parameters for the pool created by the test, but you'd need a special version of the test that paused long enough before deleting the pool that you could run the command.
For writing native OS4 programs newlib will likely be the choice. But as walkero pointed out, when porting code from other platforms clib4 is a better choice.
Good; that tends to rule out interference from other things.
Your revised test results mostly make sense, with the notable exception of MemoryPools:
o Given the additional memory management done by the C library, malloc() is in effect using memory pools (its own, not those of Exec), so it's not surprising that it's faster than AllocVec() (though the difference is more than I might have guessed).
o It takes longer to clear memory after allocating it, but not very much longer (it doesn't take much time to clear something as small as a Node; the difference would be greater with larger allocations). The faster the allocation itself is, the more significant the slowdown due to clearing the memory becomes.
o Freeing a pool in one operation is significantly faster than freeing all the individual allocations in the pool one at a time.
o ItemPools are the fastest, as you'd expect given their limitations.
The real mystery is why MemoryPools are so slow. Faster allocation is supposed to be one of the main reasons to use MemoryPools, but they actually seem to be twice as slow as just using AllocVec(). And as you can see if you run 'PoolStat' after the system has been active for a while, MemoryPools get used a lot.
Maybe the memory manager in OS4 has been improved to the point that it's now faster than MemoryPools? Though the OS4 Wiki background section on MemoryPools seems to suggest that they were enhanced in OS4 as well.
I'd be curious to try it on my X1000 too.
Thanks to OldFart for sending me a copy of the program so I could run it on my X1000. Here's what I got (stripping out all the intermediate times to make it easier to read):
Since all the percentages are relative to malloc(), the fact that malloc() is somewhat slower relative to the other methods on the X1000 makes the percentages a little lower for most of the other methods than on the X5000. But the overall trend is the same: AllocVecPooled() is roughly six times slower than malloc() and twice as slow as AllocVec(), and ItemPoolAlloc() is roughly twice as fast as malloc(), six times as fast as AllocVec(), and a whopping 8 to 10 times as fast as AllocVecPooled(). So whatever is going on with MemoryPools that makes them so slow is consistent, happening on both the X5000 and the X1000.
I couldn't resist comparing the speed of the X1000 to that of the X5000, using OldFart's results for the X5000 figures. Here the percentage figure is how much longer the X1000 takes to run the test (assuming that clock() is measuring the same thing on both machines, which is supposed to be the case).
As in past benchmarks I've seen, the difference -- the X1000 is 20% to 50% slower -- is greater than the difference in clock speed between the two machines.