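1. My understanding of the Linux I/O layer began with the standard I/O library. From The C Programming Language (TCPL) I learned about stdio.h and the simple FILE * interface for reading and writing files.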
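
As a reminder of what the FILE * interface looks like, here is a minimal sketch. The helper names stdio_write_text and stdio_read_text are made up for this illustration and do not come from TCPL. The point is that fwrite only fills a user-space buffer owned by the stream; the data reaches the kernel when that buffer fills up, or on fflush or fclose.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write `text` to `path` through a FILE * stream.
 * fwrite() copies into the stream's user-space buffer; fclose() flushes
 * that buffer to the kernel. Returns 0 on success, -1 on failure. */
int stdio_write_text(const char *path, const char *text)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    size_t len = strlen(text);
    int ok = (fwrite(text, 1, len, fp) == len);

    /* fclose() can fail too, because it performs the final flush. */
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: read at most cap-1 bytes from `path` into `buf`
 * and NUL-terminate. Returns the number of bytes read, or -1 on failure. */
long stdio_read_text(const char *path, char *buf, size_t cap)
{
    if (cap == 0)
        return -1;

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    size_t n = fread(buf, 1, cap - 1, fp);
    int failed = ferror(fp);
    fclose(fp);
    if (failed)
        return -1;

    buf[n] = '\0';
    return (long)n;
}
```

A caller would write a string with stdio_write_text and read it back with stdio_read_text into a caller-owned buffer.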


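3. After a close reading of Advanced Programming in the UNIX Environment (APUE), I learned that read and write do not write to the device directly either. They copy data from user-space memory into a kernel buffer (the page cache), or the other way around, so this is a second layer of buffering. The kernel has to merge reads and writes coming from many processes and place them on a write queue. From this point on I understood that both stdio.h and read/write are synchronous I/O. Asynchronous I/O also exists, but as of this writing there is no mature asynchronous I/O library on Linux. On asynchronous I/O, see the article 《Linux kernel AIO这个奇葩》.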
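
The second layer of buffering can be made visible with a small sketch. The names write_all and write_durably are invented for this example. Under the usual Linux behavior, write returns as soon as the bytes have been copied into the page cache, and it takes an explicit fsync to wait until the kernel has pushed them to the device.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all `len` bytes are
 * accepted. Each successful write() only means the data was copied from
 * user memory into the kernel's page cache. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal, retry */
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: write `text` to `path` and wait for it to reach
 * stable storage. Returns 0 on success, -1 on failure. */
int write_durably(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int rc = write_all(fd, text, strlen(text));

    /* fsync() blocks until the dirty pages of this file are written out. */
    if (rc == 0 && fsync(fd) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Without the fsync call the function would usually return sooner, because the kernel writes dirty pages back later on its own schedule.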









[*] Put another way, this is ***


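The advantage of this set of interfaces is that data is moved, not copied, between the page cache layer and user-space memory. By memory-mapping into the user process's address space and modifying the page tables, it achieves a zero-copy effect. In practice, Linux has essentially implemented readahead and mmap, while the envisioned mwrite and fdatasync_area have not been implemented.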
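
Of the interfaces named above, mmap is the one that is easy to try today. The following is a minimal sketch only; count_newlines_mmap is a name made up for this example, and readahead, mwrite and fdatasync_area are not shown. The file's page cache pages are mapped into the process, so the loop reads them in place with no read call and no copy into a user buffer.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count '\n' bytes in a file by mapping it read-only.
 * Returns the count, or -1 on failure. */
long count_newlines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    const char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    /* Each first touch of a page may fault and install a page table entry. */
    long count = 0;
    for (size_t i = 0; i < len; i++) {
        if (p[i] == '\n')
            count++;
    }

    munmap((void *)p, len);
    return count;
}
```

The page faults and page table updates hidden inside that loop are exactly the cost that the discussion further down weighs against a plain memcpy.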

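So why has Linus consistently rejected O_DIRECT, this "efficient" way of bypassing the page cache, as a means of doing synchronous I/O? He went on to give three reasons behind the page cache design: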

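Linus argues that simply bypassing the page cache is mediocre, and that people focus too much on the idea of "bypassing the cache and going straight to disk". Going further into the performance gap between read/write and mmap, Linus said: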

Yes. However, it's even _nicer_ if you don't need to walk the page tables at all. Quite a lot of operations could be done directly on the page cache. I'm not a huge fan of mmap() myself - the biggest advantage of mmap is when you don't know your access patterns, and you have reasonably good locality. In many other cases mmap is just a total loss, because the page table walking is often more expensive than even a memcpy(). That's _especially_ true if you have to move mappings around, and you have to invalidate TLB's. memcpy() often gets a bad name. Yeah, memory is slow, but especially if you copy something you just worked on, you're actually often better off letting the CPU cache do its job, rather than walking page tables and trying to be clever. Just as an example: copying often means that you don't need nearly as much locking and synchronization - which in turn avoids one whole big mess (yes, the memcpy() will look very hot in profiles, but then doing extra work to avoid the memcpy() will cause spread-out overhead that is a lot worse and harder to think about). This is why a simple read()/write() loop often _beats_ mmap approaches. And often it's actually better to not even have big buffers (ie the old "avoid system calls by aggregation" approach) because that just blows your cache away. Right now, the fastest way to copy a file is apparently by doing lots of ~8kB read/write pairs (that data may be slightly stale, but it was true at some point). Never mind the system call overhead - just having the extra buffer stay in the L1 cache and avoiding page faults from mmap is a bigger win. And I don't think mmap _can_ beat that. It's fundamental. In contrast, direct page cache accesses really can do so. Exactly because they don't touch any page tables at all, and because they can take advantage of internal kernel data structure layout and move pages around without any cost..

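In other words, memcpy has a bad reputation because memory is slow, but in practice most of memcpy's work is served from the CPU's L1 cache. mmap, by contrast, has to walk the page tables, and every page fault traps into the kernel. So copying with read/write in 8 KB chunks is often faster than mmap, as long as those 8 KB stay in the L1 cache. If the smart mwrite that Linus describes were implemented, though, the page tables could be avoided entirely and the work could be done by the page cache alone.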
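
The "lots of ~8kB read/write pairs" approach from the quote can be sketched as follows. The function name copy_file_8k and the constant CHUNK are invented for this example, and the code makes no claim about being the fastest copy on any particular machine; it only shows the shape of the loop, where one small buffer is reused so that it can stay hot in the CPU cache.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

enum { CHUNK = 8 * 1024 };   /* the ~8 KB figure mentioned in the quote */

/* Hypothetical helper: copy `src` to `dst` with a plain read()/write() loop
 * over one small reusable buffer. Returns bytes copied, or -1 on failure. */
long copy_file_8k(const char *src, const char *dst)
{
    char buf[CHUNK];
    long total = 0;
    int failed = 0;

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    while (!failed) {
        ssize_t n = read(in, buf, sizeof buf);
        if (n == 0)
            break;                       /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;                /* interrupted, retry the read */
            failed = 1;
            break;
        }

        /* write() may accept fewer bytes than asked, so loop until done. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                failed = 1;
                break;
            }
            off += w;
        }
        if (!failed)
            total += n;
    }

    close(in);
    if (close(out) != 0)
        failed = 1;
    return failed ? -1 : total;
}
```

Changing CHUNK to a much larger value is an easy way to test the claim in the quote that big buffers can hurt by evicting useful data from the cache.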


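In this thread I also discovered the Linux system calls splice/vmsplice, which can move data between two file descriptors as fast as possible; see 《splice系列系统调用》 for details. On the page cache, I also found a very good article, 《Linux Cache 机制探究》.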
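
Here is a rough sketch of how splice can be used to copy between two file descriptors. The function name copy_fd_splice and the 64 KB request size are choices made for this example. splice requires one side of each call to be a pipe, so the data goes file to pipe and then pipe to file, and it stays inside the kernel the whole time. The code assumes Linux with glibc and that `out` was not opened with O_APPEND.

```c
#define _GNU_SOURCE          /* splice() is a Linux-specific glibc extension */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: copy everything from `in` to `out` through a pipe
 * using splice(). Returns bytes copied, or -1 on failure. */
long copy_fd_splice(int in, int out)
{
    int p[2];
    long total = 0;
    int failed = 0;

    if (pipe(p) != 0)
        return -1;

    while (!failed) {
        /* Step 1: input file -> pipe. No user-space buffer is involved. */
        ssize_t n = splice(in, NULL, p[1], NULL, 64 * 1024, SPLICE_F_MOVE);
        if (n == 0)
            break;                       /* end of input */
        if (n < 0) {
            failed = 1;
            break;
        }

        /* Step 2: pipe -> output file, draining exactly what was queued. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(p[0], NULL, out, NULL, (size_t)left,
                               SPLICE_F_MOVE);
            if (m <= 0) {
                failed = 1;
                break;
            }
            left -= m;
        }
        if (!failed)
            total += n;
    }

    close(p[0]);
    close(p[1]);
    return failed ? -1 : total;
}
```

The caller opens both descriptors, calls copy_fd_splice(in, out), and closes them afterwards. SPLICE_F_MOVE is only a hint to the kernel and may have no effect.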





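d. Larry McVoy appears on the mailing list. I later found out that he is a fairly well-known kernel developer who also created the commercial version control software BitKeeper, which Linus used to manage the Linux kernel source. The two later parted ways; see 《BitKeeper姻缘了断》 for details. Linus then developed his own version control system, git.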


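7. I ran some experiments comparing the performance of fwrite, write, and mmap. The conclusions are quite interesting, but I have not written them up yet, so stay tuned for the next post.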

