mmap() vs. reading blocks

hot-time 2020. 5. 26. 07:48

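I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable-length records. I've got a first implementation up and running and am now looking at improving performance, particularly at doing I/O more efficiently, since the input file gets scanned many times.

Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.

The mmap() code could potentially get very messy, since mmap'd blocks need to lie on page-sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page-sized boundaries.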

How can I decide between these two options without actually writing up a complete implementation first? Are there any rules of thumb (e.g., mmap() is 2x faster) or simple tests?

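I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains why mmap or read might be faster or slower.

A call to mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors, for the same reasons that switching between different processes is expensive.
The IO system can already use the disk cache, so if you read a file, you'll hit the cache or miss it no matter what method you use.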

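However,

Memory maps are generally faster for random access, especially if your access patterns are sparse and unpredictable.
Memory maps allow you to keep using pages from the cache until you are done. This means that if you use a file heavily for a long period of time, then close it and reopen it, the pages will still be cached. With read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache, and this kind of foolery rarely helps system performance.)
Reading a file directly is very simple and fast.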

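The discussion of mmap / read reminds me of two other performance discussions:

Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which makes perfect sense if you know that nonblocking I/O requires making more syscalls.
Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.

Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real-world cases there's no sure way to show that one is faster without testing your actual application, NOT a benchmark.

(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)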

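The main performance cost is going to be disk I/O. "mmap()" is certainly quicker than istream, but the difference might not be noticeable because the disk I/O will dominate your run time.

I tried Ben Collins's code fragment (see above/below) to test his assertion that "mmap() is way faster" and found no measurable difference. See my comments on his answer.

I would certainly not recommend separately mmap'ing each record in turn unless your "records" are huge. That would be horribly slow, requiring 2 system calls for each record and possibly losing the page from the disk-memory cache.

In your case I think mmap(), istream and the low-level open()/read() calls will all be about the same. I would recommend mmap() in these cases:

There is random access (not sequential) within the file, AND
the whole thing fits comfortably in memory, OR there is locality of reference within the file so that certain pages can be mapped in and other pages mapped out. That way the operating system uses the available RAM to maximum benefit.
OR if multiple processes are reading/working on the same file, then mmap() is fantastic because the processes all share the same physical pages.

(btw - I love mmap() / MapViewOfFile()).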

mmap is way faster. You might write a simple benchmark to prove it to yourself:

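#include <fstream>

int main()
{
  char data[0x1000];
  std::ifstream in("file.bin", std::ios::binary);

  // read() sets failbit on a short final block, so also check gcount()
  while (in.read(data, sizeof data) || in.gcount() > 0)
  {
    // do something with the first in.gcount() bytes of data
  }
}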

versus:

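#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
  const off_t page_size = 0x1000;
  int fd = open("filename.bin", O_RDONLY);
  struct stat st;
  if (fd < 0 || fstat(fd, &st) != 0) return 1;  // st.st_size is the file size

  for (off_t off = 0; off < st.st_size; off += page_size)
  {
    // flags must include MAP_PRIVATE or MAP_SHARED; passing 0 fails with EINVAL
    void *data = mmap(NULL, page_size, PROT_READ, MAP_PRIVATE, fd, off);
    if (data == MAP_FAILED) return 1;
    // do stuff with data
    munmap(data, page_size);
  }
  close(fd);
}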

Clearly, I'm leaving out details (such as how to determine when you reach the end of the file if its size isn't a multiple of page_size), but it really shouldn't be much more complicated than this.
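
One of the omitted details, the short final chunk, could be handled with a small helper along these lines. This is a sketch only: `chunk_length` is a hypothetical name, and it assumes the file size was obtained separately (for example with fstat()).

```cpp
#include <algorithm>
#include <cstdint>

// Number of valid file bytes in the chunk that starts at `off`, for a file of
// `file_size` bytes processed in steps of `chunk` bytes.
// Returns 0 at or past the end of the file. (Hypothetical helper.)
std::uint64_t chunk_length(std::uint64_t file_size, std::uint64_t off, std::uint64_t chunk)
{
    if (off >= file_size) return 0;
    return std::min(chunk, file_size - off);
}
```

For a 10000-byte file and 4096-byte chunks, the chunk at offset 8192 holds only 1808 bytes of file data; with mmap(), the remainder of that last page is zero-filled rather than part of the file.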

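If you can, you might try to break up your data into multiple files that can be mmap()-ed in whole instead of in part (much simpler).

A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, I deleted an archive of old unfinished projects a few weeks ago, and that was one of the victims.

Update: I should also add the caveat that this benchmark would look quite different on Windows, because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. That is, for frequently accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would have already done a memory mapping for you, and it's transparent.

Final update: Look, people: across a lot of different platform combinations of OS, standard libraries, disks and memory hierarchies, I can't say for certain that the system call mmap, viewed as a black box, will always always always be substantially faster than read. That wasn't exactly my intent, even if my words could be construed that way. Ultimately, my point was that memory-mapped I/O is generally faster than byte-based I/O; this is still true. If you find experimentally that there's no difference between the two, then the only explanation that seems reasonable to me is that your platform implements memory mapping under the covers in a way that is advantageous to the performance of calls to read. The only way to be absolutely certain that you're using memory-mapped I/O in a portable way is to use mmap. If you don't care about portability and you can rely on the particular characteristics of your target platforms, then using read may be suitable without measurably sacrificing any performance.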

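Edit to clean up the answer list: @jbl:

the sliding window mmap sounds interesting. Can you say a little more about it?

Sure - I was writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem to this: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).

Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to 2^(wordsize). On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file I/O. Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by just inheriting std::filebuf into a mapped_filebuf and, similarly, inheriting std::fstream into a mapped_fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.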

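There are lots of good answers here already that cover many of the salient points, so I'll just add a couple of issues I didn't see addressed directly above. That is, this answer shouldn't be considered a comprehensive list of the pros and cons, but rather an addendum to the other answers here.

mmap seems like magic

Taking the case where the file is already fully cached¹ as the baseline², mmap might seem pretty much like magic:

mmap only requires 1 system call to (potentially) map the entire file, after which no more system calls are needed.
mmap doesn't require a copy of the file data from kernel to user-space.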
mmap allows you to access the file "as memory", including processing it with whatever advanced tricks you can do against memory, such as compiler auto-vectorization, SIMD intrinsics, prefetching, optimized in-memory parsing routines, OpenMP, etc.
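
A minimal sketch of the "file as memory" idea in the list above: map a whole file read-only and run an ordinary standard-library algorithm over it. `count_byte_mmap` is a hypothetical helper; it assumes a POSIX system and a regular file small enough to map in one piece.

```cpp
#include <algorithm>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Counts occurrences of `byte` in the file at `path` by mapping the whole
// file and scanning it like an in-memory array. Returns -1 on any error.
long long count_byte_mmap(const char* path, char byte)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects length 0

    const std::size_t len = static_cast<std::size_t>(st.st_size);
    void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;

    const char* begin = static_cast<const char*>(p);
    const long long n = std::count(begin, begin + len, byte);
    munmap(p, len);
    return n;
}
```

After the single mmap() call there are no further system calls in the scan, and std::count works directly on the page cache contents without a copy into a user buffer.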

In the case that the file is already in the cache, it seems impossible to beat: you just directly access the kernel page cache as memory and it can't get faster than that.

Well, it can.

mmap is not actually magic because...

mmap still does per-page work

A primary hidden cost of mmap vs read(2) (which is really the comparable OS-level syscall for reading blocks) is that with mmap you'll need to do "some work" for every 4K page in user-space, even though it might be hidden by the page-fault mechanism.

For example, a typical implementation that just mmaps the entire file will need to fault in 100 GB / 4K = 25 million pages to read a 100 GB file. Now, these will be minor faults, but 25 million page faults is still not going to be super fast. The cost of a minor fault is probably in the 100s of nanos in the best case.
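
The arithmetic can be checked with a few lines. This is a back-of-the-envelope sketch: the 100 ns per fault used in the note below is the rough best-case figure quoted above, not a measurement, and `page_count` and `fault_seconds` are hypothetical helpers.

```cpp
#include <cstdint>

// Number of pages needed to cover `bytes` bytes with pages of `page_size` bytes.
constexpr std::uint64_t page_count(std::uint64_t bytes, std::uint64_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

// Total fault-handling time in seconds if every page costs `ns_per_fault`.
constexpr double fault_seconds(std::uint64_t pages, double ns_per_fault)
{
    return static_cast<double>(pages) * ns_per_fault * 1e-9;
}
```

For a 100 GiB file and 4 KiB pages, page_count gives 26,214,400 pages; at 100 ns each, fault_seconds comes to roughly 2.6 seconds of fault handling per pass over the file, before any fault-around batching.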

mmap relies heavily on TLB performance

Now, you can pass MAP_POPULATE to mmap to tell it to set up all the page tables before returning, so there should be no page faults while accessing it. Now, this has the little problem that it also reads the entire file into RAM, which is going to blow up if you try to map a 100GB file - but let's ignore that for now³. The kernel needs to do per-page work to set up these page tables (shows up as kernel time). This ends up being a major cost in the mmap approach, and it's proportional to the file size (i.e., it doesn't get relatively less important as the file size grows)⁴.
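
A sketch of how MAP_POPULATE might be combined with the smaller windows mentioned in footnote ³, so that the whole file never has to be resident at once. MAP_POPULATE is Linux-specific; `sum_bytes_windowed` is a hypothetical helper, and summing the bytes is just a stand-in for real processing.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Sums all bytes of the file at `path`, mapping it one window at a time with
// MAP_POPULATE so the kernel sets up each window's page tables up front.
// `window` must be a non-zero multiple of the page size. Returns false on error.
bool sum_bytes_windowed(const char* path, std::size_t window, std::uint64_t& sum)
{
    sum = 0;
    if (window == 0) return false;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }

    const std::uint64_t size = static_cast<std::uint64_t>(st.st_size);
    bool ok = true;
    for (std::uint64_t off = 0; off < size; off += window) {
        // The last window may be shorter than `window`.
        const std::size_t len = static_cast<std::size_t>(
            std::min<std::uint64_t>(window, size - off));
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE | MAP_POPULATE,
                       fd, static_cast<off_t>(off));
        if (p == MAP_FAILED) { ok = false; break; }
        const unsigned char* bytes = static_cast<const unsigned char*>(p);
        for (std::size_t i = 0; i < len; ++i) sum += bytes[i];
        munmap(p, len);
    }
    close(fd);
    return ok;
}
```

This costs two system calls per window instead of one for the whole file, so the window size trades memory footprint against syscall count.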

Finally, even in user-space accessing such a mapping isn't exactly free (compared to large memory buffers not originating from a file-based mmap) - even once the page tables are set up, each access to a new page is going to, conceptually, incur a TLB miss. Since mmaping a file means using the page cache and its 4K pages, you again incur this cost 25 million times for a 100GB file.

Now, the actual cost of these TLB misses depends heavily on at least the following aspects of your hardware: (a) how many 4K TLB entries you have and how the rest of the translation caching performs, (b) how well hardware prefetch deals with the TLB (e.g., can prefetch trigger a page walk?), and (c) how fast and how parallel the page walking hardware is. On modern high-end x86 Intel processors, the page walking hardware is in general very strong: there are at least 2 parallel page walkers, a page walk can occur concurrently with continued execution, and hardware prefetching can trigger a page walk. So the TLB impact on a streaming read load is fairly low, and such a load will often perform similarly regardless of the page size. Other hardware is usually much worse, however!

read() avoids these pitfalls

The read() syscall, which is what generally underlies the "block read" type calls offered e.g., in C, C++ and other languages has one primary disadvantage that everyone is well-aware of:

Every read() call of N bytes must copy N bytes from kernel to user space.

On the other hand, it avoids most of the costs above - you don't need to map 25 million 4K pages into user space. You can usually malloc a single small buffer in user space and re-use it repeatedly for all your read calls. On the kernel side, there is almost no issue with 4K pages or TLB misses because all of RAM is usually linearly mapped using a few very large pages (e.g., 1 GB pages on x86), so the underlying pages in the page cache are covered very efficiently in kernel space.
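
The read() pattern described here, one small buffer allocated once and reused for every call, might look like the sketch below. `sum_bytes_read` is a hypothetical helper, and summing the bytes again stands in for real processing.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

// Sums all bytes of the file at `path` using read(2) into one reusable buffer
// of `buf_size` bytes. Returns false on error.
bool sum_bytes_read(const char* path, std::size_t buf_size, std::uint64_t& sum)
{
    sum = 0;
    if (buf_size == 0) return false;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    std::vector<unsigned char> buf(buf_size);  // allocated once, reused for every call
    bool ok = true;
    for (;;) {
        const ssize_t n = read(fd, buf.data(), buf.size());
        if (n == 0) break;                     // end of file
        if (n < 0) {
            if (errno == EINTR) continue;      // interrupted, retry
            ok = false;
            break;
        }
        // read() may return fewer bytes than requested; use only the first n.
        for (ssize_t i = 0; i < n; ++i) sum += buf[static_cast<std::size_t>(i)];
    }
    close(fd);
    return ok;
}
```

The user-space working set is just `buf_size` bytes no matter how large the file is, which is why this approach sidesteps the per-page mapping costs at the price of copying every byte.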

So basically you have the following comparison to determine which is faster for a single read of a large file:

Is the extra per-page work implied by the mmap approach more costly than the per-byte work of copying file contents from kernel to user space implied by using read()?

On many systems, they are actually approximately balanced. Note that each one scales with completely different attributes of the hardware and OS stack.

In particular, the mmap approach becomes relatively faster when:

The OS has fast minor-fault handling and especially minor-fault bulking optimizations such as fault-around.
The OS has a good MAP_POPULATE implementation which can efficiently process large maps in cases where, for example, the underlying pages are contiguous in physical memory.
The hardware has strong page translation performance, such as large TLBs, fast second level TLBs, fast and parallel page-walkers, good prefetch interaction with translation and so on.

... while the read() approach becomes relatively faster when:

The read() syscall has good copy performance. E.g., good copy_to_user performance on the kernel side.
The kernel has an efficient (relative to userland) way to map memory, e.g., using only a few large pages with hardware support.
The kernel has fast syscalls and a way to keep kernel TLB entries around across syscalls.

The hardware factors above vary wildly across different platforms, even within the same family (e.g., within x86 generations and especially market segments) and definitely across architectures (e.g., ARM vs x86 vs PPC).

The OS factors keep changing as well, with various improvements on both sides causing a large jump in the relative speed for one approach or the other. A recent list includes:

Addition of fault-around, described above, which really helps the mmap case without MAP_POPULATE.
Addition of fast-path copy_to_user methods in arch/x86/lib/copy_user_64.S, e.g., using REP MOVSQ when it is fast, which really help the read() case.

Update after Spectre and Meltdown

The mitigations for the Spectre and Meltdown vulnerabilities considerably increased the cost of a system call. On the systems I've measured, the cost of a "do nothing" system call (which is an estimate of the pure overhead of the system call, apart from any actual work done by the call) went from about 100 ns on a typical modern Linux system to about 700 ns. Furthermore, depending on your system, the page-table isolation fix specifically for Meltdown can have additional downstream effects apart from the direct system call cost due to the need to reload TLB entries.

All of this is a relative disadvantage for read() based methods as compared to mmap based methods, since read() methods must make one system call for each "buffer size" worth of data. You can't arbitrarily increase the buffer size to amortize this cost, because large buffers usually perform worse: you exceed the L1 size and hence constantly suffer cache misses.

On the other hand, with mmap, you can map in a large region of memory with MAP_POPULATE and then access it efficiently, at the cost of only a single system call.

¹ This more-or-less also includes the case where the file wasn't fully cached to start with, but where the OS read-ahead is good enough to make it appear so (i.e., the page is usually cached by the time you want it). This is a subtle issue though because the way read-ahead works is often quite different between mmap and read calls, and can be further adjusted by "advise" calls as described in ².

² ... because if the file is not cached, your behavior is going to be completely dominated by IO concerns, including how sympathetic your access pattern is to the underlying hardware - and all your effort should be in ensuring such access is as sympathetic as possible, e.g. via use of madvise or fadvise calls (and whatever application level changes you can make to improve access patterns).
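
As a rough illustration of the "advise" calls mentioned in this footnote, the sketch below passes a sequential-access hint for both a file descriptor and an existing mapping. `open_sequential` and `advise_sequential` are hypothetical wrapper names; posix_fadvise() and madvise() are the underlying calls, and whether the hints help depends on the kernel and the workload.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Opens `path` read-only and tells the kernel the file will be read
// sequentially (offset 0 with length 0 covers the whole file).
// Returns the descriptor, or -1 if open() fails. The hint is advisory, so a
// failing posix_fadvise() is deliberately ignored here.
int open_sequential(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd >= 0) (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    return fd;
}

// Same hint for an existing mapping. Returns true if the kernel accepted it.
bool advise_sequential(void* addr, std::size_t len)
{
    return madvise(addr, len, MADV_SEQUENTIAL) == 0;
}
```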

³ You could get around that, for example, by sequentially mmaping in windows of a smaller size, say 100 MB.

⁴ In fact, it turns out the MAP_POPULATE approach is (at least on some hardware/OS combinations) only slightly faster than not using it, probably because the kernel is using fault-around - so the actual number of minor faults is reduced by a factor of 16 or so.

I'm sorry Ben Collins lost his sliding windows mmap source code. That'd be nice to have in Boost.

Yes, mapping the file is much faster. You're essentially using the OS virtual memory subsystem to associate memory-to-disk and vice versa. Think about it this way: if the OS kernel developers could make it faster they would. Because doing so makes just about everything faster: databases, boot times, program load times, et cetera.

The sliding window approach really isn't that difficult, as multiple contiguous pages can be mapped at once. So the size of the record doesn't matter so long as the largest single record will fit into memory. The important thing is managing the book-keeping.

If a record doesn't begin on a getpagesize() boundary, your mapping has to begin on the previous page. The length of the region mapped extends from the first byte of the record (rounded down if necessary to the nearest multiple of getpagesize()) to the last byte of the record (rounded up to the nearest multiple of getpagesize()). When you're finished processing a record, you can munmap() it and move on to the next.
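
The book-keeping described above can be sketched as a pure calculation. `Window` and `window_for_record` are hypothetical names used only for illustration; `page` would be the value returned by getpagesize().

```cpp
#include <cstdint>

struct Window {
    std::uint64_t map_offset;  // page-aligned file offset to pass to mmap()
    std::uint64_t map_length;  // bytes to map, rounded up to whole pages
    std::uint64_t delta;       // where the record starts inside the mapping
};

// Computes the page-aligned mapping that covers the record occupying
// [record_offset, record_offset + record_length) in the file.
Window window_for_record(std::uint64_t record_offset, std::uint64_t record_length,
                         std::uint64_t page)
{
    const std::uint64_t start = record_offset - record_offset % page;  // round down
    const std::uint64_t end = record_offset + record_length;
    const std::uint64_t aligned_end = (end + page - 1) / page * page;  // round up
    return Window{start, aligned_end - start, record_offset - start};
}
```

For a 100-byte record at offset 5000 with 4096-byte pages, this yields a mapping at offset 4096 of length 4096, with the record starting 904 bytes into it. The record pointer is then the mmap() result plus `delta`.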

This all works just fine under Windows too using CreateFileMapping() and MapViewOfFile() (and GetSystemInfo() to get SYSTEM_INFO.dwAllocationGranularity --- not SYSTEM_INFO.dwPageSize).

mmap should be faster, but I don't know how much. It very much depends on your code. If you use mmap, it's best to mmap the whole file at once; that will make your life a lot easier. One potential problem is that if your file is bigger than 4GB (or in practice the limit is lower, often 2GB) you will need a 64-bit architecture. So if you're using a 32-bit environment, you probably don't want to use it.

Having said that, there may be a better route to improving performance. You said the input file gets scanned many times; if you can read it in one pass and then be done with it, that could potentially be much faster.

Perhaps you should pre-process the files, so each record is in a separate file (or at least that each file is a mmap-able size).

Also could you do all of the processing steps for each record, before moving onto the next one? Maybe that would avoid some of the IO overhead?

I agree that mmap'd file I/O is going to be faster, but while you're benchmarking the code, shouldn't the counterexample be somewhat optimized?

Ben Collins wrote:

char data[0x1000];
std::ifstream in("file.bin");

while (in)
{
    in.read(data, 0x1000);
    // do something with data 
}

I would suggest also trying:

char data[0x1000];
std::ifstream ifile( "file.bin");
std::istream  in( ifile.rdbuf() );

while( in )
{
    in.read( data, 0x1000);
    // do something with data
}

And beyond that, you might also try making the buffer size the same size as one page of virtual memory, in case 0x1000 is not the size of one page of virtual memory on your machine... IMHO mmap'd file I/O still wins, but this should make things closer.
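
Rather than hard-coding 0x1000, the page size can be queried at run time. A minimal sketch for POSIX systems follows; `vm_page_size` is a hypothetical helper, and the 4096 fallback is only a guess used if the query fails.

```cpp
#include <cstddef>
#include <unistd.h>

// Returns the virtual memory page size reported by the system.
std::size_t vm_page_size()
{
    const long n = sysconf(_SC_PAGESIZE);
    return n > 0 ? static_cast<std::size_t>(n) : 4096;  // fallback is a guess
}
```

The result could then size the read buffer, for example with a std::vector<char> of vm_page_size() bytes in place of the fixed-size array.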

To my mind, using mmap() "just" unburdens the developer from having to write their own caching code. In a simple "read through the file exactly once" case, this isn't going to be hard (although, as mlbrock points out, you still save the memory copy into process space), but if you're going back and forth in the file or skipping bits and so forth, I believe the kernel developers have probably done a better job implementing caching than I can...

I remember mapping a huge file containing a tree structure into memory years ago. I was amazed by the speed compared to normal de-serialization, which involves a lot of work in memory, like allocating tree nodes and setting pointers. So in fact I was comparing a single call to mmap (or its counterpart on Windows) against many (MANY) calls to operator new and constructors. For this kind of task, mmap is unbeatable compared to de-serialization. Of course, one should look into Boost's relocatable pointers for this.

This sounds like a good use case for multi-threading... I'd think you could pretty easily set up one thread to read data while the other(s) process it. That may be a way to dramatically increase the perceived performance. Just a thought.

I think the greatest thing about mmap is the potential for asynchronous reading with:

    addr1 = NULL;
    while( size_left > 0 ) {
        r = min(MMAP_SIZE, size_left);
        addr2 = mmap(NULL, r,
            PROT_READ, MAP_FLAGS,
            0, pos);
        if (addr1 != NULL)
        {
            /* process mmap from prev cycle */
            feed_data(ctx, addr1, MMAP_SIZE);
            munmap(addr1, MMAP_SIZE);
        }
        addr1 = addr2;
        size_left -= r;
        pos += r;
    }
    feed_data(ctx, addr1, r);
    munmap(addr1, r);

The problem is that I can't find the right MAP_FLAGS to hint that this memory should be synced from the file as soon as possible. I hope that MAP_POPULATE gives the right hint for mmap (i.e., that it will not try to load all contents before returning from the call, but will do so asynchronously with feed_data). At least it gives better results with this flag, even though the manual states that it does nothing without MAP_PRIVATE since 2.6.23.

Reference URL: https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks
