Quote:
An important thing to understand about Working Set (or better, Working Set Private) is that it is only trimmed (i.e. reflects recent process usage) if you are already low on physical memory.
If you have enough physical memory, Windows may never eject any pages from your Working Set, so you may think your process is constantly running through 90MB of physical memory when it is really only using 1 or 2MB continuously.
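If you want to watch this happen, Windows lets a process empty its own working set on demand via the documented SetProcessWorkingSetSize call. A minimal sketch, assuming a Windows build environment, with error handling kept short:

    /* Minimal sketch: ask Windows to trim (empty) this process's working
       set.  Afterwards, the Working Set counter reflects only pages the
       process actually touches again, not pages Windows never bothered
       to evict. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Passing (SIZE_T)-1 for both limits means "empty the working set". */
        if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                      (SIZE_T)-1, (SIZE_T)-1))
            fprintf(stderr, "trim failed: error %lu\n", GetLastError());
        return 0;
    }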
I'm not talking about the Windows cache -- I'm talking about the L2 or L3 cache on your CPU chip.
A 6-core Xeon has 12MB of L3 processor cache, while a 4- or 6-core Core i7 has 8MB of L3 cache or less. The exact numbers vary by CPU generation and type, but as an example:
The Xeon X5660 and X5680 both have 6 cores (they differ by clock speed). Each core has its own 32KB L1 data and 32KB L1 instruction caches and its own 256KB L2 cache, and all cores share a 12MB L3 cache; beyond that, accesses go to main memory. When the CPU needs data, the ballpark latencies are:
L1 cache - about 4 CPU cycles (at 1GHz a cycle is 1 nanosecond, so at 2-4GHz, 4 cycles is roughly 1-2ns)
L2 cache - about 10ns
L3 cache - about 15-35ns
main memory - about 85-100+ns, depending on the memory speed
A modern chip yields different numbers (more L3, but slower L3 access, and faster main memory as well), but this gives you an idea:
http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/11
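You can get a feel for these numbers yourself with a pointer-chasing loop: walk a randomly permuted cycle of pointers (so the hardware prefetcher can't help) and vary the buffer size. A rough sketch -- illustrative only, the exact figures depend on your CPU:

    /* Pointer-chase probe: each load depends on the previous one, so the
       average time per hop approximates the latency of whatever level of
       the hierarchy the buffer fits in.  At 16MB (bigger than a 12MB L3)
       you should see main-memory-like latencies; shrink n to fit in
       L3/L2/L1 and the ns/hop figure drops accordingly. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define HOPS 10000000L

    int main(void)
    {
        size_t n = (16u << 20) / sizeof(void *);    /* 16MB buffer */
        void **buf = malloc(n * sizeof(void *));
        size_t *idx = malloc(n * sizeof(size_t));
        size_t i;

        /* Fisher-Yates shuffle to build a random cyclic permutation
           (two rand() calls combined, since RAND_MAX may be only 32767). */
        for (i = 0; i < n; i++) idx[i] = i;
        srand(1);
        for (i = n - 1; i > 0; i--) {
            size_t j = (((size_t)rand() << 15) ^ (size_t)rand()) % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (i = 0; i < n; i++)
            buf[idx[i]] = &buf[idx[(i + 1) % n]];

        /* Chase the cycle: every iteration is a dependent load. */
        void **p = &buf[idx[0]];
        clock_t t0 = clock();
        for (long h = 0; h < HOPS; h++)
            p = (void **)*p;
        clock_t t1 = clock();

        printf("%.1f ns/hop (sink %p)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / HOPS, (void *)p);
        free(idx); free(buf);
        return 0;
    }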
So if you are running a single program executing a tight loop that fits in < 512K of code and data, it can take 10 times as long if it is interrupted and has to refetch all of its data from main memory (~100ns vs. ~10ns per access). Even just executing out of the on-chip L3 cache is about 3x faster than executing from main memory.
These caches are managed automatically by the chip, with almost no options for OS control at this time. The only way your program stays in the L3 cache is to make sure it uses < 12MB during its runtime. Every interruption by another program eats away at that 12MB, which is why benchmarks are timed on dedicated machines with nothing else running.
You can force a memory trim to see how fast a process's working set grows back in a given time. I don't know if Process Explorer has that option (I used to use Process Explorer, but one of the users on the PE forum developed Process Hacker -- not a hacking tool, despite the name -- which is a superset of Process Explorer's functionality, and it does have the trim option). If you force a trim, you can see how much total new memory a process touches in a certain time. For DF, that was 3MB of new memory per second -- meaning a 12MB L3 cache won't hold it. Ever. Period.

The only way to keep DF from affecting all your other programs is for it to *sleep* when it isn't doing anything. Instead, it is busy waiting (see https://en.wikipedia.org/wiki/Busy_waiting). Busy waiting is generally considered harmful to OS and CPU performance and to energy usage.
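The difference looks roughly like this -- have_work(), do_work(), and the 10ms poll interval are hypothetical placeholders, not DF's actual code:

    /* Busy waiting vs. sleeping -- a sketch, not DF's real loop. */
    #include <windows.h>

    extern int have_work(void);   /* hypothetical: anything to do? */
    extern void do_work(void);    /* hypothetical: do it */

    void busy_wait_loop(void)     /* burns CPU and cache even when idle */
    {
        for (;;) {
            if (have_work())
                do_work();
            /* no yield: re-polls at full speed, evicting other
               programs' data from the shared L3 the whole time */
        }
    }

    void sleeping_loop(void)      /* near-zero CPU while idle */
    {
        for (;;) {
            while (!have_work())
                Sleep(10);        /* yield the CPU between polls */
            do_work();
        }
    }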
Quote:
I am using Process Explorer and I see DF running about 94MB WS Private and 0.4 to 0.8 CPU%. PE appears to follow Task Manager and cap total CPU at 100%, so given my 4 core HT processor, DF is using up to 0.1% of a single (HT) core, which doesn't seem unreasonable considering I have both a secondary taskbar and window buttons enabled, as well as global triggers.
---
Sorry, I must have been confusing, because that's exactly backwards. If you are using 0.4-0.8% of 4 cores (4 cores provide 4 CPU-seconds per real second), then you are using 0.4-0.8% of 4 CPU-seconds/real-second, i.e. 0.016-0.032 CPU-seconds per real (wall clock) second. If you had a quarter as many cores (1 core), that same work would be 4 times the percentage of that one core, right? So you multiply by the number of cores: the single-core usage would be 1.6-3.2% of one core.
In the memory-latency article I linked above, the new Haswell CPU has up to 18 cores. On that machine, CPU usage expressed as a percentage of 18 cores would be 18 times smaller than the same usage expressed as a percentage of 1 core. So your 1.6-3.2% of one core would show up as 0.09-0.18% on the current higher-end CPUs.
That's why I promote the idea of monitoring programs expressing CPU% as a percentage of one core: if they don't, you won't have any idea what a single-threaded, one-core program is doing once core counts get insanely high. Using the one-core number, you can compare a program's usage across different machines with some meaning. A program that can drive 4 cores at 100% each would show 400% single-core usage, with a 6-core machine able to provide up to 600% -- six times the CPU of a 1-core machine. The conversion itself is a single multiplication, sketched below.
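    /* Toy example of the conversion described above: "% of the whole
       machine" times the core count gives "% of one core", and dividing
       goes the other way.  The numbers are the ones from this thread. */
    #include <stdio.h>

    static double pct_of_one_core(double pct_of_machine, int cores)
    {
        return pct_of_machine * cores;
    }

    int main(void)
    {
        printf("0.8%% of a 4-core machine = %.1f%% of one core\n",
               pct_of_one_core(0.8, 4));        /* 3.2 */
        printf("3.2%% of one core = %.2f%% of an 18-core machine\n",
               3.2 / 18.0);                     /* 0.18 */
        return 0;
    }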
Did my explanation make anything clearer? I know it was dipping into technobabble, but I tried to make the concepts more concrete.