Does the C++ standard mandate poor performance for iostreams, or am I just dealing with a poor implementation?

hot-time 2020. 5. 12. 08:04


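Every time I mention the slow performance of the C++ standard library iostreams, I am met with a wave of disbelief. Yet I have profiler results showing large amounts of time spent in iostream library code (with full compiler optimizations), and switching from iostreams to OS-specific I/O APIs and custom buffer management gave a large improvement.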

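What extra work is the C++ standard library doing, is it required by the standard, and is it useful in practice? Or do some compilers provide iostream implementations that are competitive with manual buffer management?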

Benchmarks

To get the matter moving, I've written a few short programs to exercise the iostreams internal buffering.

Note that the ostringstream and stringbuf versions run fewer iterations because they are so much slower.

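On ideone, ostringstream is about 3 times slower than std::copy + back_inserter + std::vector, and about 15 times slower than memcpy into a raw buffer. This feels consistent with the before-and-after profiling from when I switched my real application to custom buffering.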
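The programs themselves are not reproduced in this post. As a rough illustration of what is being compared, here is a minimal sketch, not the original benchmark code. The function names, the choice of appending one 4-byte int per call, and the chrono-based timer are all assumptions made for the example.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helpers (not the original benchmark): each appends `count`
// ints, one small write at a time, to an in-memory buffer and returns the bytes.

std::string fill_ostringstream(int count) {
    std::ostringstream os;
    for (int i = 0; i < count; ++i)
        os.write(reinterpret_cast<const char*>(&i), sizeof i);
    return os.str();
}

std::string fill_stringbuf(int count) {
    std::stringbuf sb(std::ios_base::out);
    for (int i = 0; i < count; ++i)
        sb.sputn(reinterpret_cast<const char*>(&i), sizeof i);
    return sb.str();
}

std::string fill_vector_back_inserter(int count) {
    std::vector<char> v;
    for (int i = 0; i < count; ++i) {
        const char* p = reinterpret_cast<const char*>(&i);
        std::copy(p, p + sizeof i, std::back_inserter(v));
    }
    return std::string(v.begin(), v.end());
}

std::string fill_raw_buffer(int count) {
    // Stands in for a preallocated char[]: sized once, then memcpy only.
    std::vector<char> storage(static_cast<std::size_t>(count) * sizeof(int));
    char* out = storage.data();
    for (int i = 0; i < count; ++i) {
        std::memcpy(out, &i, sizeof i);
        out += sizeof i;
    }
    return std::string(storage.begin(), storage.end());
}

// Average wall-clock milliseconds per call of f over `iterations` calls.
// A real benchmark would also have to stop the optimizer from discarding
// the work; this sketch does not attempt that.
template <class F>
double ms_per_iteration(F f, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int k = 0; k < iterations; ++k)
        f();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / iterations;
}
```

All four helpers produce the same bytes, so any timing difference comes from the buffering path and not from the data. Something like `ms_per_iteration([]{ fill_ostringstream(1000000); }, 10)` gives a per-iteration figure comparable in kind, though not in value, to the timings listed later.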

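These are all in-memory buffers, so the slowness of iostreams can't be blamed on slow disk I/O, too much flushing, synchronization with stdio, or any of the other things people use to excuse the observed slowness of the C++ standard library iostreams.

It would be nice to see benchmarks on other systems, and commentary on what common implementations do (such as gcc's libstdc++, Visual C++, Intel C++) and how much of the overhead is mandated by the standard.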

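Rationale for this test

A number of people have correctly pointed out that iostreams are more commonly used for formatted output. However, they are also the only modern API provided by the C++ standard for binary file access. But the real reason for doing performance tests on the internal buffering applies to typical formatted I/O as well: if iostreams can't keep the disk controller supplied with raw data, how can they possibly keep up when they are responsible for formatting too?

Benchmark Timing

All of these are per iteration of the outer (k) loop.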

On ideone (gcc-4.3.4, unknown OS and hardware):

  • ostringstream: 53 ms
  • stringbuf: 27 ms
  • vector<char> and back_inserter: 17.6 ms
  • vector<char> with ordinary iterator: 10.6 ms
  • vector<char> iterator and bounds check: 11.4 ms
  • char[]: 3.7 ms

On my laptop (Visual C++ 2010 x86, cl /Ox /EHsc, Windows 7 Ultimate 64-bit, Intel Core i7, 8 GB RAM):

  • ostringstream: 73.4 ms, 71.6 ms
  • stringbuf: 21.7 ms, 21.3 ms
  • vector<char> and back_inserter: 34.6 ms, 34.4 ms
  • vector<char> with ordinary iterator: 1.10 ms, 1.04 ms
  • vector<char> iterator and bounds check: 1.11 ms, 0.87 ms, 1.12 ms, 0.89 ms, 1.02 ms, 1.14 ms
  • char[]: 1.48 ms, 1.57 ms

Visual C++ 2010 x86, with Profile-Guided Optimization (cl /Ox /EHsc /GL /c, link /ltcg:pgi, run, link /ltcg:pgo, measure):

  • ostringstream: 61.2 ms, 60.5 ms
  • vector<char> with ordinary iterator: 1.04 ms, 1.03 ms

Same laptop, same OS, using cygwin gcc 4.3.4, g++ -O3:

  • ostringstream: 62.7 ms, 60.5 ms
  • stringbuf: 44.4 ms, 44.5 ms
  • vector<char> and back_inserter: 13.5 ms, 13.6 ms
  • vector<char> with ordinary iterator: 4.1 ms, 3.9 ms
  • vector<char> iterator and bounds check: 4.0 ms, 4.0 ms
  • char[]: 3.57 ms, 3.75 ms

Same laptop, Visual C++ 2008 SP1, cl /Ox /EHsc:

  • ostringstream: 88.7 ms, 87.6 ms
  • stringbuf: 23.3 ms, 23.4 ms
  • vector<char> and back_inserter: 26.1 ms, 24.5 ms
  • vector<char> with ordinary iterator: 3.13 ms, 2.48 ms
  • vector<char> iterator and bounds check: 2.97 ms, 2.53 ms
  • char[]: 1.52 ms, 1.25 ms

Same laptop, Visual C++ 2010 64-bit compiler:

  • ostringstream: 48.6 ms, 45.0 ms
  • stringbuf: 16.2 ms, 16.0 ms
  • vector<char> and back_inserter: 26.3 ms, 26.5 ms
  • vector<char> with ordinary iterator: 0.87 ms, 0.89 ms
  • vector<char> iterator and bounds check: 0.99 ms, 0.99 ms
  • char[]: 1.25 ms, 1.24 ms

EDIT: Ran all twice to see how consistent the results were. Pretty consistent IMO.

NOTE: On my laptop, since I can spare more CPU time than ideone allows, I set the number of iterations to 1000 for all methods. This means that ostringstream and vector reallocation, which takes place only on the first pass, should have little impact on the final results.

EDIT: Oops, found a bug in the vector-with-ordinary-iterator, the iterator wasn't being advanced and therefore there were too many cache hits. I was wondering how vector<char> was outperforming char[]. It didn't make much difference though, vector<char> is still faster than char[] under VC++ 2010.

Conclusions

Buffering of output streams requires three steps each time data is appended:

  • Check that the incoming block fits the available buffer space.
  • Copy the incoming block.
  • Update the end-of-data pointer.
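The three steps above can be sketched as follows. This is a minimal illustration, not code from any of the benchmarks: the FixedBuffer type, its member names, and the 4096-byte capacity are hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical fixed-capacity output buffer (illustrative names only).
struct FixedBuffer {
    char data[4096];
    std::size_t used = 0;  // end-of-data position

    // Returns false when the block does not fit. A file stream would
    // flush and reuse the buffer at that point instead of failing.
    bool append(const void* src, std::size_t n) {
        if (n > sizeof data - used)        // 1. check that the block fits
            return false;
        std::memcpy(data + used, src, n);  // 2. copy the incoming block
        used += n;                         // 3. update the end-of-data pointer
        return true;
    }
};
```

Per call this costs one comparison, one copy, and one addition, which is the baseline that the stream classes are being measured against.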

The latest code snippet I posted, "vector<char> simple iterator plus bounds check" not only does this, it also allocates additional space and moves the existing data when the incoming block doesn't fit. As Clifford pointed out, buffering in a file I/O class wouldn't have to do that, it would just flush the current buffer and reuse it. So this should be an upper bound on the cost of buffering output. And it's exactly what is needed to make a working in-memory buffer.
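A growable buffer in the spirit of that vector<char> variant might look like the sketch below. It is written under assumptions: append_growing, the doubling growth policy, and the separate `used` counter are illustrative choices, not the posted snippet.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical growable buffer: `buf` is the storage and `used` the write
// position. The caller keeps used <= buf.size().
void append_growing(std::vector<char>& buf, std::size_t& used,
                    const char* src, std::size_t n) {
    if (n > buf.size() - used) {
        // Bounds check failed: grow the storage. resize() allocates a larger
        // block and moves the existing data into it.
        buf.resize(std::max(buf.size() * 2, used + n));
    }
    std::copy(src, src + n, buf.begin() + used);
    used += n;
}
```

Once the vector has reached its final size, every call takes only the check, copy, and update path, so the growth branch should matter little over many iterations.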

So why is stringbuf 2.5x slower on ideone, and at least 10 times slower when I test it? It isn't being used polymorphically in this simple micro-benchmark, so that doesn't explain it.


Not answering the specifics of your question so much as the title: the 2006 Technical Report on C++ Performance has an interesting section on IOStreams (p.68). Most relevant to your question is in Section 6.1.2 ("Execution Speed"):

Since certain aspects of IOStreams processing are distributed over multiple facets, it appears that the Standard mandates an inefficient implementation. But this is not the case — by using some form of preprocessing, much of the work can be avoided. With a slightly smarter linker than is typically used, it is possible to remove some of these inefficiencies. This is discussed in §6.2.3 and §6.2.5.

Since the report was written in 2006 one would hope that many of the recommendations would have been incorporated into current compilers, but perhaps this is not the case.

As you mention, facets may not feature in write() (but I wouldn't assume that blindly). So what does feature? Running GProf on your ostringstream code compiled with GCC gives the following breakdown:

  • 44.23% in std::basic_streambuf<char>::xsputn(char const*, int)
  • 34.62% in std::ostream::write(char const*, int)
  • 12.50% in main
  • 6.73% in std::ostream::sentry::sentry(std::ostream&)
  • 0.96% in std::string::_M_replace_safe(unsigned int, unsigned int, char const*, unsigned int)
  • 0.96% in std::basic_ostringstream<char>::basic_ostringstream(std::_Ios_Openmode)
  • 0.00% in std::fpos<int>::fpos(long long)

So the bulk of the time is spent in xsputn, which eventually calls std::copy() after lots of checking and updating of cursor positions and buffers (have a look in c++\bits\streambuf.tcc for the details).

My take on this is that you've focused on the worst-case situation. All the checking that is performed would be a small fraction of the total work done if you were dealing with reasonably large chunks of data. But your code is shifting data in four bytes at a time, and incurring all the extra costs each time. Clearly one would avoid doing so in a real-life situation: consider how negligible the penalty would have been if write were called once on an array of 1m ints instead of 1m times on one int. And in a real-life situation one would really appreciate the important features of IOStreams, namely its memory-safe and type-safe design. Such benefits come at a price, and you've written a test which makes these costs dominate the execution time.
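To make the contrast concrete, here is a small sketch of the two call patterns. The helper names are hypothetical and this is not the questioner's code; both functions produce identical bytes and differ only in how often the per-call overhead is paid.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: one write() call per int, so the sentry and
// buffer checks run values.size() times.
std::string write_one_at_a_time(const std::vector<int>& values) {
    std::ostringstream os;
    for (std::size_t i = 0; i < values.size(); ++i)
        os.write(reinterpret_cast<const char*>(&values[i]), sizeof(int));
    return os.str();
}

// Hypothetical helper: a single write() call for the whole array, so the
// per-call overhead is paid once.
std::string write_in_one_call(const std::vector<int>& values) {
    std::ostringstream os;
    os.write(reinterpret_cast<const char*>(values.data()),
             static_cast<std::streamsize>(values.size() * sizeof(int)));
    return os.str();
}
```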


I'm rather disappointed in the Visual Studio users out there, who rather had a gimme on this one:

  • In the Visual Studio implementation of ostream, the sentry object (which is required by the standard) enters a critical section protecting the streambuf (which is not required). This doesn't seem to be optional, so you pay the cost of thread synchronization even for a local stream used by a single thread, which has no need for synchronization.

This hurts code that uses ostringstream to format messages pretty severely. Using the stringbuf directly avoids the use of sentry, but the formatted insertion operators can't work directly on streambufs. For Visual C++ 2010, the critical section is slowing down ostringstream::write by a factor of three vs the underlying stringbuf::sputn call.
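The split between the two paths can be illustrated with a short sketch. The build_message helper is hypothetical: raw bytes go straight to the stringbuf with sputn, while the formatted insertion has to go through an ostream layered on the same buffer, and that is where the sentry gets constructed.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper mixing both paths on one buffer.
std::string build_message(int code) {
    std::stringbuf sb(std::ios_base::out);
    sb.sputn("code=", 5);  // unformatted: direct streambuf call, no sentry
    std::ostream os(&sb);  // formatted insertion needs an ostream on top
    os << code;            // operator<< constructs a sentry on each call
    return sb.str();
}
```

With this arrangement, `build_message(42)` yields the string `code=42`. Only the formatted part pays for whatever the implementation's sentry does.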

Looking at beldaz's profiler data on newlib, it seems clear that gcc's sentry doesn't do anything crazy like this. ostringstream::write under gcc only takes about 50% longer than stringbuf::sputn, but stringbuf itself is much slower than under VC++. And both still compare very unfavorably to using a vector<char> for I/O buffering, although not by the same margin as under VC++.


The problem you see is all in the overhead around each call to write(). Each level of abstraction that you add (char[] -> vector -> string -> ostringstream) adds a few more function call/returns and other housekeeping guff that - if you call it a million times - adds up.

I modified two of the examples on ideone to write ten ints at a time. The ostringstream time went from 53 to 6 ms (almost 10 x improvement) while the char loop improved (3.7 to 1.5) - useful, but only by a factor of two.

If you're that concerned about performance then you need to choose the right tool for the job. ostringstream is useful and flexible, but there's a penalty for using it the way you're trying to. char[] is harder work, but the performance gains can be great (remember that gcc will probably inline the memcpys for you as well).

In short, ostringstream isn't broken, but the closer you get to the metal the faster your code will run. Assembler still has advantages for some folk.


To get better performance you have to understand how the containers you are using work. In your char[] array example, the array of the required size is allocated in advance. In your vector and ostringstream example you are forcing the objects to repeatedly allocate and reallocate and possibly copy data many times as the object grows.

With std::vector this is easily resolved by initialising the size of the vector to the final size, as you did with the char array; instead you rather unfairly cripple the performance by resizing to zero! That is hardly a fair comparison.
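One way to act on that suggestion while keeping the back_inserter style is to reserve the final capacity up front. The sketch below is an assumption about how that could look, with hypothetical helper names; it is not the code from the question.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: appends `count` ints, one at a time, via back_inserter.
void append_ints(std::vector<char>& v, int count) {
    for (int i = 0; i < count; ++i) {
        const char* p = reinterpret_cast<const char*>(&i);
        std::copy(p, p + sizeof i, std::back_inserter(v));
    }
}

// Hypothetical helper: one allocation before the loop. reserve() changes the
// capacity but leaves size() at zero, so push_back never has to reallocate.
std::vector<char> fill_reserved(int count) {
    std::vector<char> v;
    v.reserve(static_cast<std::size_t>(count) * sizeof(int));
    append_ints(v, count);
    return v;
}
```

Because the capacity is already sufficient, the vector's storage address stays the same for the whole loop, which can be checked by comparing `v.data()` before and after the appends.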

With respect to ostringstream, preallocating the space is not possible; I would suggest that it is an inappropriate use. The class has far greater utility than a simple char array, but if you don't need that utility, then don't use it, because you will pay the overhead in any case. Instead it should be used for what it is good for: formatting data into a string. C++ provides a wide range of containers and an ostringstream is amongst the least appropriate for this purpose.

In the case of the vector and ostringstream you get protection from buffer overrun, you don't get that with a char array, and that protection does not come for free.

Reference URL: https://stackoverflow.com/questions/4340396/does-the-c-standard-mandate-poor-performance-for-iostreams-or-am-i-just-deali
