Why is this code 6.5x slower with optimizations enabled?Unit Testing C CodeWith arrays, why is it the case...
How can I hide my bitcoin transactions to protect anonymity from others?
The use of multiple foreign keys on same column in SQL Server
Validation accuracy vs Testing accuracy
Email Account under attack (really) - anything I can do?
"which" command doesn't work / path of Safari?
What would happen to a modern skyscraper if it rains micro blackholes?
whey we use polarized capacitor?
Patience, young "Padovan"
Why doesn't Newton's third law mean a person bounces back to where they started when they hit the ground?
Why are 150k or 200k jobs considered good when there are 300k+ births a month?
Is it possible to do 50 km distance without any previous training?
GPS Rollover on Android Smartphones
Why is this code 6.5x slower with optimizations enabled?
Why has Russell's definition of numbers using equivalence classes been finally abandoned? ( If it has actually been abandoned).
How can the DM most effectively choose 1 out of an odd number of players to be targeted by an attack or effect?
I probably found a bug with the sudo apt install function
"You are your self first supporter", a more proper way to say it
Draw simple lines in Inkscape
N.B. ligature in Latex
How long does it take to type this?
What is the command to reset a PC without deleting any files
How old can references or sources in a thesis be?
Can I interfere when another PC is about to be attacked?
What defenses are there against being summoned by the Gate spell?
Why is this code 6.5x slower with optimizations enabled?
Unit Testing C CodeWith arrays, why is it the case that a[5] == 5[a]?Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?Why are elementwise additions much faster in separate loops than in a combined loop?What is “:-!!” in C code?Why is my program slow when looping over exactly 8192 elements?Obfuscated C Code Contest 2006. Please explain sykes2.cWhy does the C preprocessor interpret the word “linux” as the constant “1”?Why does GCC generate 15-20% faster code if I optimize for size instead of speed?How is the linking done for string functions in C?
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}
I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i) {
s[strlen(s)] = 'A';
}
clock_t end = clock();
printf("%lldn", (long long)(end-start));
return 0;
}
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
c gcc glibc
add a comment |
I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i) {
s[strlen(s)] = 'A';
}
clock_t end = clock();
printf("%lldn", (long long)(end-start));
return 0;
}
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
c gcc glibc
With gcc-8.2 debug version takes 51334, release 8246. Release compiler options-O3 -march=native -DNDEBUG
– Maxim Egorushkin
2 hours ago
Please report it to gcc's bugzilla.
– Marc Glisse
1 hour ago
Using-fno-builtinmakes the problem go away. So presumably the issue is that in this particular instance, GCC's builtinstrlenis slower than the library's.
– David Schwartz
1 hour ago
It is generatingrepnz scasbfor strlen at -O1.
– Marc Glisse
1 hour ago
@MarcGlisse and for-O2and-O3, it's loading and comparing thechars as integers. Unfortunately, the naive-O0uses the library function which uses vector-instructions that beat this optimization easily.
– EOF
1 hour ago
add a comment |
I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i) {
s[strlen(s)] = 'A';
}
clock_t end = clock();
printf("%lldn", (long long)(end-start));
return 0;
}
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
c gcc glibc
I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i) {
s[strlen(s)] = 'A';
}
clock_t end = clock();
printf("%lldn", (long long)(end-start));
return 0;
}
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
c gcc glibc
c gcc glibc
edited 2 hours ago
Fei Xiang
2,1634822
2,1634822
asked 2 hours ago
TsarNTsarN
3814
3814
With gcc-8.2 debug version takes 51334, release 8246. Release compiler options-O3 -march=native -DNDEBUG
– Maxim Egorushkin
2 hours ago
Please report it to gcc's bugzilla.
– Marc Glisse
1 hour ago
Using-fno-builtinmakes the problem go away. So presumably the issue is that in this particular instance, GCC's builtinstrlenis slower than the library's.
– David Schwartz
1 hour ago
It is generatingrepnz scasbfor strlen at -O1.
– Marc Glisse
1 hour ago
@MarcGlisse and for-O2and-O3, it's loading and comparing thechars as integers. Unfortunately, the naive-O0uses the library function which uses vector-instructions that beat this optimization easily.
– EOF
1 hour ago
add a comment |
With gcc-8.2 debug version takes 51334, release 8246. Release compiler options-O3 -march=native -DNDEBUG
– Maxim Egorushkin
2 hours ago
Please report it to gcc's bugzilla.
– Marc Glisse
1 hour ago
Using-fno-builtinmakes the problem go away. So presumably the issue is that in this particular instance, GCC's builtinstrlenis slower than the library's.
– David Schwartz
1 hour ago
It is generatingrepnz scasbfor strlen at -O1.
– Marc Glisse
1 hour ago
@MarcGlisse and for-O2and-O3, it's loading and comparing thechars as integers. Unfortunately, the naive-O0uses the library function which uses vector-instructions that beat this optimization easily.
– EOF
1 hour ago
With gcc-8.2 debug version takes 51334, release 8246. Release compiler options
-O3 -march=native -DNDEBUG– Maxim Egorushkin
2 hours ago
With gcc-8.2 debug version takes 51334, release 8246. Release compiler options
-O3 -march=native -DNDEBUG– Maxim Egorushkin
2 hours ago
Please report it to gcc's bugzilla.
– Marc Glisse
1 hour ago
Please report it to gcc's bugzilla.
– Marc Glisse
1 hour ago
Using
-fno-builtin makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtin strlen is slower than the library's.– David Schwartz
1 hour ago
Using
-fno-builtin makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtin strlen is slower than the library's.– David Schwartz
1 hour ago
It is generating
repnz scasb for strlen at -O1.– Marc Glisse
1 hour ago
It is generating
repnz scasb for strlen at -O1.– Marc Glisse
1 hour ago
@MarcGlisse and for
-O2 and -O3, it's loading and comparing the chars as integers. Unfortunately, the naive -O0 uses the library function which uses vector-instructions that beat this optimization easily.– EOF
1 hour ago
@MarcGlisse and for
-O2 and -O3, it's loading and comparing the chars as integers. Unfortunately, the naive -O0 uses the library function which uses vector-instructions that beat this optimization easily.– EOF
1 hour ago
add a comment |
1 Answer
1
active
oldest
votes
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0or without optimisations, the generated code call the C library functionstrlen
- at
-O1the generated code uses a simple inline expansion using arep scasbinstruction. - at
-O2and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1code is much slower than the C library implementation:32240vs3090
- the
-O2code is faster than the-O1but still substantially slower than the C ibrary code:8570vs3090.
This behavior is specific to gcc and the glibc. The same test on OS/X with clang and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0 to 100 at -O0 and -O2 but still a much worse performance at -O1, 3 times slower.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen) {
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++) {
for (int i = minlen; i < maxlen; ++i) {
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
}
}
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
}
int main() {
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
}
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
Wouldn't it still be better for the inlined version to use the same optimizations as the librarystrlen, giving the best of both worlds?
– Daniel H
1 hour ago
1
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
1 hour ago
Does it change if you use-march=native -mtune=native?
– Deduplicator
18 mins ago
Note that the GNU C library function forstrlen()is likely optimised for extremely large strings (that no sane programmer will care about) at the expense of performance for small strings (that are extremely common); and the optimisations done by the library version should never be done. The problem here is that the OP's code doesn't keep track of the string's length itself (e.g. with anint len;variable) and should not have usedstrlen()at all, making the code so bad for performance that "optimised for something no sane programmer would care about" actually helped.
– Brendan
15 mins ago
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55563598%2fwhy-is-this-code-6-5x-slower-with-optimizations-enabled%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0or without optimisations, the generated code call the C library functionstrlen
- at
-O1the generated code uses a simple inline expansion using arep scasbinstruction. - at
-O2and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1code is much slower than the C library implementation:32240vs3090
- the
-O2code is faster than the-O1but still substantially slower than the C ibrary code:8570vs3090.
This behavior is specific to gcc and the glibc. The same test on OS/X with clang and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0 to 100 at -O0 and -O2 but still a much worse performance at -O1, 3 times slower.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen) {
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++) {
for (int i = minlen; i < maxlen; ++i) {
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
}
}
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
}
int main() {
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
}
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
Wouldn't it still be better for the inlined version to use the same optimizations as the librarystrlen, giving the best of both worlds?
– Daniel H
1 hour ago
1
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
1 hour ago
Does it change if you use-march=native -mtune=native?
– Deduplicator
18 mins ago
Note that the GNU C library function forstrlen()is likely optimised for extremely large strings (that no sane programmer will care about) at the expense of performance for small strings (that are extremely common); and the optimisations done by the library version should never be done. The problem here is that the OP's code doesn't keep track of the string's length itself (e.g. with anint len;variable) and should not have usedstrlen()at all, making the code so bad for performance that "optimised for something no sane programmer would care about" actually helped.
– Brendan
15 mins ago
add a comment |
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0or without optimisations, the generated code call the C library functionstrlen
- at
-O1the generated code uses a simple inline expansion using arep scasbinstruction. - at
-O2and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1code is much slower than the C library implementation:32240vs3090
- the
-O2code is faster than the-O1but still substantially slower than the C ibrary code:8570vs3090.
This behavior is specific to gcc and the glibc. The same test on OS/X with clang and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0 to 100 at -O0 and -O2 but still a much worse performance at -O1, 3 times slower.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen) {
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++) {
for (int i = minlen; i < maxlen; ++i) {
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
}
}
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
}
int main() {
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
}
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
Wouldn't it still be better for the inlined version to use the same optimizations as the librarystrlen, giving the best of both worlds?
– Daniel H
1 hour ago
1
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
1 hour ago
Does it change if you use-march=native -mtune=native?
– Deduplicator
18 mins ago
Note that the GNU C library function forstrlen()is likely optimised for extremely large strings (that no sane programmer will care about) at the expense of performance for small strings (that are extremely common); and the optimisations done by the library version should never be done. The problem here is that the OP's code doesn't keep track of the string's length itself (e.g. with anint len;variable) and should not have usedstrlen()at all, making the code so bad for performance that "optimised for something no sane programmer would care about" actually helped.
– Brendan
15 mins ago
add a comment |
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0or without optimisations, the generated code call the C library functionstrlen
- at
-O1the generated code uses a simple inline expansion using arep scasbinstruction. - at
-O2and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1code is much slower than the C library implementation:32240vs3090
- the
-O2code is faster than the-O1but still substantially slower than the C ibrary code:8570vs3090.
This behavior is specific to gcc and the glibc. The same test on OS/X with clang and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0 to 100 at -O0 and -O2 but still a much worse performance at -O1, 3 times slower.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen) {
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++) {
for (int i = minlen; i < maxlen; ++i) {
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
}
}
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
}
int main() {
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
}
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0or without optimisations, the generated code call the C library functionstrlen
- at
-O1the generated code uses a simple inline expansion using arep scasbinstruction. - at
-O2and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1code is much slower than the C library implementation:32240vs3090
- the
-O2code is faster than the-O1but still substantially slower than the C ibrary code:8570vs3090.
This behavior is specific to gcc and the glibc. The same test on OS/X with clang and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0 to 100 at -O0 and -O2 but still a much worse performance at -O1, 3 times slower.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen) {
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++) {
for (int i = minlen; i < maxlen; ++i) {
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
}
}
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
}
int main() {
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
}
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
edited 28 mins ago
answered 1 hour ago
chqrliechqrlie
62.9k848107
62.9k848107
Wouldn't it still be better for the inlined version to use the same optimizations as the librarystrlen, giving the best of both worlds?
– Daniel H
1 hour ago
1
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
1 hour ago
Does it change if you use-march=native -mtune=native?
– Deduplicator
18 mins ago
Note that the GNU C library function forstrlen()is likely optimised for extremely large strings (that no sane programmer will care about) at the expense of performance for small strings (that are extremely common); and the optimisations done by the library version should never be done. The problem here is that the OP's code doesn't keep track of the string's length itself (e.g. with anint len;variable) and should not have usedstrlen()at all, making the code so bad for performance that "optimised for something no sane programmer would care about" actually helped.
– Brendan
15 mins ago
add a comment |
Wouldn't it still be better for the inlined version to use the same optimizations as the librarystrlen, giving the best of both worlds?
– Daniel H
1 hour ago
1
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
1 hour ago
Does it change if you use-march=native -mtune=native?
– Deduplicator
18 mins ago
Note that the GNU C library function forstrlen()is likely optimised for extremely large strings (that no sane programmer will care about) at the expense of performance for small strings (that are extremely common); and the optimisations done by the library version should never be done. The problem here is that the OP's code doesn't keep track of the string's length itself (e.g. with anint len;variable) and should not have usedstrlen()at all, making the code so bad for performance that "optimised for something no sane programmer would care about" actually helped.
– Brendan
15 mins ago
Wouldn't it still be better for the inlined version to use the same optimizations as the library
strlen, giving the best of both worlds?– Daniel H
1 hour ago
Wouldn't it still be better for the inlined version to use the same optimizations as the library
strlen, giving the best of both worlds?– Daniel H
1 hour ago
1
1
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in
<string.h> and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.– chqrlie
1 hour ago
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in
<string.h> and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.– chqrlie
1 hour ago
Does it change if you use
-march=native -mtune=native?– Deduplicator
18 mins ago
Does it change if you use
-march=native -mtune=native?– Deduplicator
18 mins ago
Note that the GNU C library function for
strlen() is likely optimised for extremely large strings (that no sane programmer will care about) at the expense of performance for small strings (that are extremely common); and the optimisations done by the library version should never be done. The problem here is that the OP's code doesn't keep track of the string's length itself (e.g. with an int len; variable) and should not have used strlen() at all, making the code so bad for performance that "optimised for something no sane programmer would care about" actually helped.– Brendan
15 mins ago
Note that the GNU C library function for
strlen() is likely optimised for extremely large strings (that no sane programmer will care about) at the expense of performance for small strings (that are extremely common); and the optimisations done by the library version should never be done. The problem here is that the OP's code doesn't keep track of the string's length itself (e.g. with an int len; variable) and should not have used strlen() at all, making the code so bad for performance that "optimised for something no sane programmer would care about" actually helped.– Brendan
15 mins ago
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55563598%2fwhy-is-this-code-6-5x-slower-with-optimizations-enabled%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
With gcc-8.2 debug version takes 51334, release 8246. Release compiler options
-O3 -march=native -DNDEBUG– Maxim Egorushkin
2 hours ago
Please report it to gcc's bugzilla.
– Marc Glisse
1 hour ago
Using
-fno-builtinmakes the problem go away. So presumably the issue is that in this particular instance, GCC's builtinstrlenis slower than the library's.– David Schwartz
1 hour ago
It is generating
repnz scasbfor strlen at -O1.– Marc Glisse
1 hour ago
@MarcGlisse and for
-O2and-O3, it's loading and comparing thechars as integers. Unfortunately, the naive-O0uses the library function which uses vector-instructions that beat this optimization easily.– EOF
1 hour ago