-
Notifications
You must be signed in to change notification settings - Fork 5k
Add Runtime Performance Counters #11227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I believe it is pretty fair to say that most of our .NET Framework counters are not very useful. Thus we should only add the ones we feel are known to be valuable (we can add others later) Also a number of our counters are better conceptually as events. Thing that happen rarely are typically better done as events (many of the GC performance counters fall into this category, as do exceptions …) Counters make sense in the following cases
Note that my expecation is that all information that is exposed through a counter should be already exposed to managed code without the counter in some way. For example all the process level stuff (CPU time Memory ...), is available via System.Diagnostics.Process. BytesInAllHeaps is exposed via GC.GetTotalMemory(). We should keep this pattern (e.g. if we wish to expose the JIT counters we should have a System.Runtime.JIT class with APIs that let you get at the numbers. APIs are much easer to deal with than subscribing to things that polll and serialize (and you have to deserialize), We don't want that for in-proc cases. I think we should try hard to keep as much of the logic in managed code. This should be easy because we have already exposed all the information via managed APIs already. Thus all we need is for (managed) RuntimeEventSource to propagate the values to the EventCounter logic. We will need an event on EventCounter that fires just before an EventCounter will log its payload. This allowssomething like RuntimeEventSource and update the variables just before that. Here is a strawman list of counters we shoudl start off with
There are others we want, probably NetworkBytes, # HTTP requests, and some counters at the ASP.NET level. There are undoubtely more we should add them here. If we have more than 50 we have probably made to many. |
Looking a little closer at how EventCounter is implemented I agree, it would be nice to reuse more of that than I was initially thinking and surfacing just the managed API query for the raw number isn't hard. I'll update my task list at the top. I see you are also proposing that EventCounter effectively adds a pull-model allowing metrics to be updated just in time in response to an event. I like this. Building on your idea I suggest rather than doing a generic event on EventCounter, what if we let each counter have an update mechanism:
This allows us to do a fine-grained pull which could let us efficiently support returning subsets of the metrics, or returning some metrics at higher polling rates than others. It also seems like an easy programming model rather than requiring the user to declare the counter, register/unregister for an event, then implement an event handler that computes the metric and calls SetMetric to push the update.
And slightly separately, any interest in renaming EventCounter to something more 'metric'ish? EventCounter appears to be an arbitrary tuple (name,float) and it is entirely up to the user whether or not they use it for counting events. Maybe that ship has already sailed, but it feels weird to call a metric that is tracking total size of the GC heap as an 'Event Counter'. If we are doing other work to add multi-dimensional metrics maybe that would be an opporuntiy to create a more genericly named type? |
Please also see https://github.com/dotnet/diagnostics/issues/83 for a discussion of counters that are requested. My recommendation is to start with a few of the most uncontroversial ones and then create batches that we sign off on before you go implement them. I like @noahfalk's design (adding a EventCounter constructor that takes a fetcher delegate), as a way of implementing counters that can be pulled. |
To @noahfalk's comment, I agree that EventMetric is a better name than EventCounter. Our naming was influenced by Windows (which calls them counters (they really are metrics as well)). I would not deal with trying to rename it at this point (at any point, if we decide the name is just too problematic we can make a superclass that effectively 'renames' the type, but I think that having two names makes things worse. Either way, I would not address it now. |
Several of the first Metrics should be % Time in GC, Allocation Rate, so our first step is to expose these. We can simply expose these as properties on the GC class, but these values can be highly misleading becasue they are only updated when GCs happen. One way of solving this is, instead of exposing properties, you have a EVENT 'GarbageCollection' which you can subscribe to that gets called when garbage collections happen (it will be triggered by the finalizer thread after a GC). This event can have a GCInformation class passed to it which contains things like the CPU time consumed (in and out of the GC), and the number of allocations and the time span since the last GC etc. It can also have things like PercentTimeInGC which does the math on the other numbers (and is documented to do that). The result is more accurate and won't 'lie' as much @Maoni0 do you have any comments on the idea of a managed callback when GC's have happen (for performance monitoring purposes?) Including @terrajobst @jkotas @stephentoub in case they wish to comment on the idea of having
We can quibble about the details. The main point is that we have an event that fires on every GC and it has information that allows you to compute 'counter' values. |
@vancem I would like to clarify the relation-ship of the I think that it is fine to have direct managed APIs for the most basic metrics, even if the APIs are redundant with tracing. |
Generally speaking, monitoring infrastructure like EventSource, EventCounters, diagnosticSource ... are actually relatively inconvinient to use as a way of getting information in process. This is because they serialize/stanardize the information and thus you have to do some sort of conversion/deserialization to get at the information. This is fine when such stanardization/serialization served a purpose (e.g. getting out the the process), but is a pretty inconvinient in-proc compared to 'normal' APIs. Moreover since EventCounter is managed, there needs to be SOME way of getting the raw data into managed code. These two obvervations lead to a design where raw data is always exposed as a managed API (e.g. GC.GetTotalMemory) and then we simply use a EventCounter to also 'publish' this data in that way. If you look at EventCounter, is main 'value' is
So arguably there is no redundancy. The 'raw' API provides the actual values, and the EventCounter/EventSource provide a uniform way of pushing these values to other places (typically out of process). You COULD imagine making the raw data APIs private and thus only allow the data to be exposed via EventCounter/EventSource, but as mentioned before, this makes getting at the data in-proc pretty inconvenient, so I think the better pattern is to make the raw APIs public. (And many of the needed raw APIs already exist (for this reason). The case of GC is a bit more complex because the raw data that the GC generates is actually more event-like then 'metric' like because it is update relatively rarely (during GCs). Thus two designs are possible
I could live with either of these. but arguably (2) is more accurate and gives more information (you actually know when the GCs are happening). If I had to choose I would pick (2) but as I said, I could live with either, so I want to get other's perspectives. |
Make sense. This should just go through the same path like any other new public APIs: https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md |
Yes. We definately would be going through the review process. It is best to have a preferred option settled by the time you come in for review. Do you have an opinion on whether a flat AP or a callback on each GC is preferred? |
I think the callback would be better. |
There are some non-trivial design decisions to be made. I have thought about the issues a bit, and I am proposing straw-man solutions below. Some issues are
Here is the strawman:
|
We have been using the reflection hacks in a very targeted way to break really bad legacy dependency cycles. This is forward looking design that is taking a major bet on them. It does not sound right to me. These hacks are incompatible with tree-shaking and they make the system hard to reason about. I think that we should do this cleanly without reflection hacks. It does not look that bad actually:
|
sorry I'm OOF so haven't been catching up on this thread. I will take a look when I'm back in Jan. thanks. |
@vancem @jkotas Sorry I was out last week due to a family emergency so I couldn't respond to this thread earlier. Quick question - I saw that dotnet/coreclr#11036 removed EventCounters from System.Private.CoreLib and moved it to System.Diagnostics.Tracing. Was there any reason for doing this? (The PR doesn't mention why we did it). I was wondering if there's anything we'd be breaking if we move it back to System.Private.CoreLib |
I assume that it was done to make the code sharable between runtimes and NuGet packages. The code had to live in CoreFX to be shareable between runtimes and NuGet packages. We have found that it is too constraining and relaxed this restriction by inventing CoreLib shared partition: https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/README.md . You should be able to move |
Still out-of-office, but catching up on this thread. @jkotas I too am not particularly happy with the use of reflection. As you point out there is a 'straightforward' solution (pull the process functionality down into the core, as well as the EventCounter).
Now these are not 'killer' problems, but it at least made me pause. The other notion in play here is that this problem of dependency inversion seems like a problem worth solving in a generic way. Note that the contract from the runtime to other things it very simple (activate something), can be optional (it can failure at runtime is not runtime's fault), and is conceptually similar to any other config that conceptually needs to run code to do its job (they too are doing reflection). In particular, our ideal situation is that if users ask for it (by building things just so) we would like to support a HTTP 'port' that supports diagnostics as REST apis. (that would all the UI to literally be a web page that gets its data from this port). This feature has this dependency inversion issue on steroids (we need all the HTTP security goo, as well as a little HTTP server ...). Clearly we don't want all that in the bowels of the runtime, we just want it to be 'activatable' from the base runtime if the users configures his app appropriately. There was a reason I labeled my proposal above a 'strawman'. I am ambivalent about it myself. Nevertheless we had to start SOMEWHERE, and this conversation is part of making progress. @jkotas given this new information, what do you think? |
Yes, we should be optimizing for making things "cleaner" for .NET Core where the future platform innovation is going to be. It is different from what we have been doing in the past where both design and agility were hampered by optimizing for backporting to existing runtimes. Here is similar discussion in different context: https://github.com/dotnet/corefx/issues/32640#issuecomment-427618911
Could you please elaborate? I assume that sockets can register its own event counters when you use them for the first time. We should not need to have anything that knows about networking event counters in corelib.
I am fine with injection of dependencies that the app explicitly asks for, e.g. by initializing the REST api in the app Main method; or via the startup hooks that we have added recently. It is different from the baseline EventCounters. The baseline EventCounters are included by default, without explicit opt-in. |
@vancem a lot of the things you mentioned for GC are already available via native ETW events (not CPU time - GetThreadTimes is not cheap and since we can already calculate this with CPU sample events we didn't go the GetThreadTimes route; if you want to debug it you'll need to collect CPU samples anyway). are you proposing that we should include all this info in one event that's fired after a GC, in addition to the events we already fire today? sorry if I misunderstood, I have to admit I did not read this (long) thread in its entirety. |
I feel a bit leary growing the scope to include adding semantic hints that suggest how viewers should summarize data over a period of time. In a simple world we could provide no hints and viewers could do any of: My goal is to simplify and this seems like a spot that we can invest in later as needed. |
Sorry for the delay. @jkotas - I want to follow up with you offline to understand our versioning options. Suffice it to say here that your suggestion to pull EventCounters into the framework is 'on the table'. @Maoni0 - The overarching goal is to simply implement some EventCounters that expose GC information. EventCounters do more than just expose information, they do it in a pretty uniform way (as a named stream of numbers. Unlike events, they are very uniform and thus the presentation of the data is uniform (a graph). There are already tools/dashboards (like the azure portal) that know how to deal with data in this form as well as an expectation by users to have such things . Thus the idea is to have a set of metrics that pretty much mirror the useful perf-counter GC metrics we have on Windows. The issue is how to do that. All this GC information starts out in native code, but the EventCounter logic is in managed code. So we have to deal with that somehow. Also arguably, users do care about GC statistics (self-monitoring apps are not uncommon), and frankly the most natural way of doing this is to have a 'normal' managed API for it (in general exposing information to managed code is a good thing). This is what the 'GCInformation' proposal is. It is meant to simply expose what is going out as an ETW/EventPipe event to managed code in the natural way) (Ignore the CPU metrics, I did not realize we did not already do that).
So in a word, yes, that is the proposal. With this information exposed to managed code, it is very straightforward to push the information out via EventCoutners.
I too would feel more comfortable if EventCounters only provide the data and 'something else' dealt with how it is presented. However leaving this information out seems to be a disservice to users (it is strictly less convenient), with not a lot of gain (we can feel good about being 'minimal'). As a practical matter, a particular metric really does know if it is useful to simply display the value (e.g. % time in GC), or aggregate it over some time interval (e.g. RequestCount). By not allowing some way of passing this 'meta data' you end up forcing the end tool to guess at it (either by naming convention or forcing the user to tell them). It seems to me having meta-data associated with a counter (in addition to its name, which arguably is a piece of meta-data), seems very useful and frankly not very problematic for us (our job is to simply pass it though, what it does on the other end is not our concern). It seems like a worthy request. What do you think? |
Thanks Vance! These are a few concerns that came to mind when we start adding display hints. Certainly all of these issues below could be tackled with appropriate diligence, it just made me feel that this wasn't entirely trivial and I wasn't sold that the value was that high. I don't recall seeing typical counter viewers/dash boards that were automatically computing these derived interval statistics on abitrary counters without requiring user configuration. Do you know of a typical viewer that does this? a) You suggested that all counters would either be averaged or summed over an interval, but there are counters for which neither of these operations seems appropriate. For example imagine a # of exceptions counter where each data point emitted is the absolute number of exceptions that have been thrown since the app began. Presumably the customer either wants the most recent value (the total exceptions) or they want the number of exceptions thrown in a specific interval. Computing that latter option is neither a sum nor an average, it is end_interval_value - begin_interval_value. |
@vancem got it. @noahfalk and I also chatted about this. if we mostly just want to provide the kind of info that perf counters provide, the code is mostly in gc\gcee.cpp -
the info is already conveniently stored in the datastructure provided by perf counters - and this datastructure is only updated during pre/post GC so that means the values remain unchanged inbetween GCs and you can freely read it after a GC is done. let me know if you need more info. as pointed out already, out of these threads the SVR GC threads are the only ones not managed threads. |
@noahfalk - The short answer: I am willing to drop the 'isAggregation feature. Longer answer below. Fundamentally, I am just trying to provide the information to make a dashboard much like https://portal.azure.com, (which has things like Server Response Time (Avg), Server Requests (Sum)). Notice that this portal has to do different things for these two metrics (one is averaged the other summed over a interval of time). The question is how did the dashboard decide that the first one should be averaged and the second summed. Clearly users can change this so it is just about providing defaults. You can also obviously live without it at the expense of more work setting up the dashboards. Since most dashboards are actually inherited anyway, I agree the value is small. I looked at ApplicationInsights APIs for doing Metrics and it seems they do not provide any meta-data so they must force the user to provide this (or guess at it via names or something), so I think it is OK that we do the same (we can add the feature later). |
@vancem @noahfalk Not to get too sidetracked from this discussion but it seems like this whole uncertainty about what EventCounter should look like and what other APIs we should expose in it probably means we should allow it to be updated out-of-band... That makes me kinda pause working on this (which I was doing based on the assumption that we would be moving EventCounter back to CoreLib). Would it be possible to chat about this offline some time next week? |
Regardless of what happens with EventCounters, there is a chunk of work that I don't think is contentious to just create managed APIs that expose the data we need in-proc. It is item 8 in Vance's proposal above. I'd suggest starting there, and you could easily build tests/demos for the APIs that have no dependency on EventCounter.
Happy to chat on it offline (and I assume Vance is too).
I think aside from shipping in CoreLib, there are two options you might have been refering to as OOB: a) Code isn't in CoreLib assembly, but some other assembly that is part of the .Net Core runtime package I'm not convinced either (a) or (b) buys us all that much in terms of agility, so I still lean towards Jan's position that we should put it in CoreLib and put the source in the shared partition. In particular:
|
The thread ones look like they wouldn't be too difficult to implement, I'm thinking something like below, with some questions/comments:
Some of these we don't have in ETW event data, or their meaning may be different. Is it an intention to have all data exposed in this fasion to also be included in ETW events for finer debugging? |
Also it seems like some tracking may be appropriate to enable only while the event counter is enabled. For For an API to be able to poll at any time, in info would have to be tracked all the time. For example if we wanted to track duration of waits for a lock, getting two timestamps may not be negligible overhead on the wait path if the wait does not last too long, and it seems to be a bit slower on Linux. Do you think it would be worth including APIs to enable/disable tracking, maybe on a case-by-case basis or by area or something like that? Tracking |
https://github.com/dotnet/coreclr/issues/20372#issuecomment-449193835 suggested that this should be the same number as |
@kouvel - first you should know that we get to define what these things are, and so it is completely reasonable to have a conversation as the implementation proceeds as to what the exact meaning is. Some general principles are
Note that all these values will be PULL (we probably just have a struct with poitners on both the maanged and native side. We update it on the native side and probe it from the managed side. On the specific questions:
Yes, this should be the number of threads 'owned' by the Threadpool (regardless of what they are doing).
Yes. I too worry about its cost. However polling should happen at reasonable (e.g. 1 sec or less) interval. If you have ideas on a better metric (that is cheaper but lets you know if work is backing up or not) we should discuss.
Probably should be a long (unless it is problematic. Note that the EventCounter will do a delta (it will compute the number of CompletedWorkItems since the last counter Write). This is what users want to see (a monotonically increasing number is not what users want to see), but as a static variable delta's don't work (you don't know if others have looked at the variable), so a cumulative one is reasonable (Note that overflow is not that big of a deal (because delta's still make sense)).
Again could be a long, but OK if an int is significantly easier. It should be like the Desktop (thus only managed locks, just a count. Having total duration may be useful, but we don't need it right now (we are mostly trying to get to parity with Desktop).
I am trying to avoid this for simplicity's sake. The intent is that we may expose the APIs above publically and they have no notion of 'turned on'. We could invent one, but it would be a stumbling block... |
Ok thanks, some notes:
|
I send e-mail to @sywhang but I wanted to log it here as well. The EventSource that we should hand the GC and JIT EventCounters on should be a new one called System.Runtime. In particular its declaration should be.
The name of the class is not important. RuntimeEventSource is already taken by the EventSource that supports the Native events which is why I called this one ‘ManagedRuntimeEventSource’. Note that I would actually prefer if you renamed the current one 'NativeRntimeEventSource, and then you can use 'RuntimeEventSource' to be the managed one. |
RE Both values could be useful when available. For now, couple of options:
I'm leaning towards exposing Thoughts? |
I think it is possible to track the number of existing threads in the thread pool, and it may be cheaper than to track the active thread count, taking a closer look |
I'd like to add another counter and API to this list. We need the ability to get the number of timers in the timer queue (all queues) and there should also be a counter for that. |
I'll take a look. Not sure if it's too late for API additions, but will see. |
@kouvel Worst case, it doesn't need to be a public API (like 80% of the GC counters 😄) |
Active timer counter has been added via dotnet/coreclr#25186 and dotnet/diagnostics#342 |
cc @preethikurup to make sure she's aware of this for 1st party purposes. |
The remaining work on this PR is to add managed API's for .NET core counters. This is tracked here: #36571. |
Historically .NET supported emiting certain default performance counter data into the Windows Performance Counter system. On non-windows systems there is no such facility, but in its place we now have EventCounter for counters and EventPipe as an underlying transmission method. The goal is to instrument the runtime with similar counters that we had in the past, and have that data emitted via EventPipe. It is a non-goal (of this issue at least) to transmit the data to the Windows Performance Counter system.
This task tracks:
Proposed list of runtime counters for .NET Core 3 (see https://github.com/dotnet/diagnostics/issues/83#issuecomment-446729009):
System Counters
Exception performance counters
Thread performance counters (see https://github.com/dotnet/corefx/issues/35500)
Memory
JIT
Loader
The text was updated successfully, but these errors were encountered: