Wednesday, February 8, 2012

Garbage Collection in .NET


I believe in simplifying things as far as possible. If there are n ways of doing something, then the best one is the simplest of all. So what makes something simple? Well, the approach which requires the fewest other things to be understood before you can understand it can be called the simplest one. The one which has the fewest parameters could be considered the simplest one.
So here I am trying to present the story of garbage collection in the .NET environment in the simplest possible way (??).
Let us start with the cause which led to the effect of garbage collection. We know necessity is the mother of all inventions. So what was the necessity to invent garbage collection?

Unmanaged Vs Managed Environment:

For those of you who have written programs in an unmanaged environment like C/C++, you might remember the unpredictable bugs which used to creep into your code, either because you forgot to free some allocated chunk of memory or because you tried to access a memory location which had already been freed.

For those of you who have not written programs in an unmanaged environment: the unmanaged programming style was to deallocate memory programmatically when the data residing in it was no longer required by the program. Failure to do so caused a memory leak, which meant that memory was being wasted on data which was no longer required (this is what we call garbage data!). There were also problems when a programmer accidentally tried to access data after deallocating its memory, thereby causing a crash at run time. Remember applications crashing saying, 'The instruction at "0x7c9105f8" referenced memory at "0x025f0010". The memory could not be "read."'? Why could the memory not be read? Most probably because it had already been freed :-)

Enter the era of Java, which can be called C++ Minus Pointers in terms of features (in fact there are many other differences too, for example C++ allows multiple inheritance while Java does not). Java did not allow the programmer to access any and every memory location. In fact there are no pointers in Java. All the programmer has access to are references to the objects he created. The only way of accessing objects in a managed environment is via object references. So if you create an object which takes 12 bytes of space in Java, there is no way that you can try to access the 13th byte, for the simple reason that there are no pointers in Java with which you could take the object's address p and say *(p+12).

The major advantage of removing pointers was that the runtime can do memory management. Since an object can only be accessed through its references, the runtime can always clean up the memory used by an object once no reference to it can be reached from your running code any more. If an object has no such references, it can never be accessed by any of your code, which means that the memory used by it is eligible for garbage collection!

Garbage Collector:

In this article I will be talking in depth about how garbage collection works in dotnet.

In dotnet the developers never explicitly release memory. Instead, this job is done by Garbage Collector. In the rest of this article I will call Garbage Collector by its short form GC.

So what is GC? It is a part of the runtime, which you can loosely think of as a thread that wakes up when needed. What does it do? To summarize, GC frees up the memory used by those objects which can no longer be accessed by your running code.
For instance, consider the two functions below:
void A()
{
    B();
    int i = 10;
}

void B()
{
    C c = new C();
    Console.WriteLine(c.Name);
}
In the above example, once control reaches the line int i = 10; in function A(), there is no way for the program to access the object c created in function B(). This object is eligible for garbage collection. So the next time the GC runs, it can clean up the memory allotted to the object c. (NOTE: memory for the object c is allotted when you create an instance of C using the new keyword.)
When does the GC clean up the memory used by the object c? Well, exactly when cannot be guaranteed. Garbage collection is a costly process, so the GC runs only when the runtime judges it necessary: typically when an attempt to create a new object cannot be satisfied from the memory currently set aside for new objects, when the system is running low on memory, or when the program explicitly requests a collection by calling the function GC.Collect().
Considering the amount of physical memory which systems today usually have (I suppose 512 MB RAM is a common configuration today), standalone applications (like Console Applications or Windows Forms Applications) may never really face a situation where the GC has to be called due to a memory shortage!
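To make the "when does it run" question a little more concrete, here is a minimal sketch that counts how many collections happen while a loop produces a lot of short-lived garbage, and then requests one explicitly. The class and method names (GcTriggerDemo, CountCollectionsDuring) are made up for this illustration; GC.CollectionCount() and GC.Collect() are the real framework calls. The numbers printed depend on the machine and the runtime, so none are shown here.

```csharp
using System;

static class GcTriggerDemo
{
    // Keeping the latest buffer in a static field makes sure each allocation
    // really happens on the managed heap; the previous buffer becomes garbage.
    private static byte[] lastBuffer;

    // Allocates many short-lived arrays and returns how many generation 0
    // collections the runtime performed while doing so.
    public static int CountCollectionsDuring(int allocations)
    {
        int before = GC.CollectionCount(0);
        for (int i = 0; i < allocations; i++)
        {
            lastBuffer = new byte[1024];
        }
        return GC.CollectionCount(0) - before;
    }

    static void Main()
    {
        // Roughly 1 GB of garbage in total, far more than the runtime
        // sets aside for new objects, so it has to collect along the way.
        int automatic = CountCollectionsDuring(1000000);
        Console.WriteLine("Collections triggered by allocation: " + automatic);

        int beforeForced = GC.CollectionCount(0);
        GC.Collect(); // explicit request; normally you leave this to the runtime
        int forced = GC.CollectionCount(0) - beforeForced;
        Console.WriteLine("Collections after calling GC.Collect(): " + forced);
    }
}
```

Nothing in the loop asks for a collection, yet the first counter is normally greater than zero: the runtime decided on its own when to collect.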

But think of background processes like Windows services, which are theoretically supposed to run forever on Windows Server machines. In such scenarios, even a memory leak of 10 bytes per hour adds up to a significant amount over a period of time, and the process may finally run OUT OF MEMORY! But just a while back I said that in dotnet developers do not do memory management and that it is done by the runtime using the Garbage Collector, right? So how can there be a memory leak?

Well, this was what even I wondered when the first windows service which I wrote (in C#) bombed in the Test Server after two days citing Out Of Memory Error and we had to reboot the server altogether!! The server even refused to open a small text file in Notepad till we rebooted it!!

The reason is that the GC can reclaim memory only from those objects which are no longer accessible in your code through any reference. As you know, a given object can have multiple references pointing to it. For the GC to collect the memory used by an object, all these references should either go out of scope (like what happened to the object c in the above code sample once the execution of function B() was over) or be removed explicitly.

The former case happens automatically and there is no need for the developer to worry about it, because once a function returns, all the objects referenced only by its local variables become eligible for garbage collection, unless of course references to them have been passed outside the function.
In the case of the Windows service which I wrote, the latter was not happening, i.e. I had objects in my service which were no longer required but still had references pointing to them. So for them to become eligible for garbage collection, I had to remove all the references to these objects programmatically.
See the code snippet below:
class A
{
    private B b; //line 3

    public void RequireB()
    {
        b = new B(); //line 6
        b.DoSomeWork();
    }
}
In the above example, initially we have just declared a reference of type B called b. This reference is not yet pointing to any object. In line 6 we create an object of class B by calling new B() and set the reference b to this object. So now b is pointing to the object created by new B().

We require this object only inside the function RequireB(). But since the reference is at the class level, the object pointed to by b will not be eligible for garbage collection even after the execution of RequireB() is completed, for as long as the enclosing object of class A is itself reachable. This is because that object is still accessible via its reference b! So to make the object eligible for garbage collection after RequireB() has executed, we need to remove the reference to the object by setting the reference to null as below.
class A
{
    private B b; //line 3

    public void RequireB()
    {
        b = new B(); //line 6
        b.DoSomeWork();
        b = null;
    }
}
See the same code snippet below, where I have created multiple references to the same object. The garbage collector will not be able to reclaim the memory of the object till all the references to it are removed.
class A
{
    private B b;  //reference currently points to nothing; using it directly will throw a NullReferenceException
    private B b2; //another empty reference which points to nothing

    public void RequireB()
    {
        b = new B(); //set this reference to an object
        b2 = b;      //also set b2 as a reference to the same object pointed to by b
        //now the object created above by calling new B() has 2 references, b and b2
        b2.DoSomeWork(); //b2.DoSomeWork() is the same as b.DoSomeWork() as both point to the same object
        b = null;  //this alone will not make the object eligible for garbage collection, because b2 still points to it
        b2 = null; //so you need this line too, else there will be a memory leak
    }
}
NOTE that the above code snippet assumes that the object created inside the RequireB() function is not required outside that function; otherwise you would not set the references to null.
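If you want to see this effect rather than take my word for it, the sketch below tracks the shared object through a WeakReference, which lets us ask whether the object is still alive without keeping it alive ourselves. ReferenceDemo and its helper methods are names invented for this illustration, and static fields stand in for the class-level references b and b2. The second result depends on the collector actually reclaiming the object when asked, which is what normally happens but is not something the runtime promises.

```csharp
using System;

class B { }

static class ReferenceDemo
{
    // Two long-lived references, like the class-level fields b and b2.
    private static B b;
    private static B b2;

    // Creates one object, points both references at it and returns a
    // weak reference, which does not keep the object alive.
    public static WeakReference CreateSharedObject()
    {
        b = new B();
        b2 = b;
        return new WeakReference(b);
    }

    public static void ClearFirst() { b = null; }
    public static void ClearSecond() { b2 = null; }

    static void Main()
    {
        WeakReference tracker = CreateSharedObject();

        ClearFirst();
        GC.Collect();
        // b2 still points to the object, so it must survive the collection.
        Console.WriteLine("Alive after clearing b only: " + tracker.IsAlive);

        ClearSecond();
        GC.Collect();
        // No strong reference is left, so the object can now be reclaimed.
        Console.WriteLine("Alive after clearing b and b2: " + tracker.IsAlive);
    }
}
```

The first line always reports True, because a reachable object is never collected. The second line normally reports False.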

Coming back to GC again, we know that every attempt to create an instance of a class using the new keyword first tries to allocate the number of bytes required by the object being created. When the process has already created too many objects (most of which might already be eligible for garbage collection by now), a situation might arise where an attempt to create a new object fails!!

Memory Allocation:

Look at the animation below, which tries to explain memory allocation in a dotnet managed heap. The managed heap is the chunk of memory allocated to your process in which all managed objects created in your process are allotted memory. Managed objects are objects which are instances of classes written entirely in dotnet languages (like C# or VB.Net). So can we have unmanaged objects in dotnet? Yes, but they are created in a separate heap in the process called the unmanaged heap, and the dotnet garbage collector never even touches this heap. Any objects which we create in dotnet using the Win32 API or through COM+ component wrappers are unmanaged ones, as the code for these objects is not written to dotnet specifications or in dotnet languages. Simply put, there is no metadata available for these objects.
Initially, when your program starts, the managed heap is empty and there is a next object pointer pointing to the 0th location on the managed heap. The next object pointer is an internal reference used by the runtime to identify the location where the next object has to be created.
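The idea of a next object pointer can be sketched with a toy model. This is only an illustration of the concept described above and not how the CLR is actually implemented; ToyHeap and its members are invented names, and a failed allocation simply returns -1 where a real runtime would start a garbage collection.

```csharp
using System;

// Toy model of a managed heap: a block of bytes plus a pointer that
// marks where the next object will be placed.
class ToyHeap
{
    private readonly byte[] memory;
    private int nextObjectPointer; // starts at the 0th location

    public ToyHeap(int sizeInBytes)
    {
        memory = new byte[sizeInBytes];
        nextObjectPointer = 0;
    }

    public int NextObjectPointer { get { return nextObjectPointer; } }

    // Returns the offset at which the new object is placed, or -1 when the
    // heap has no room left (the point where a real GC would be triggered).
    public int Allocate(int objectSize)
    {
        if (nextObjectPointer + objectSize > memory.Length)
        {
            return -1;
        }
        int address = nextObjectPointer;
        nextObjectPointer += objectSize; // simply move the pointer forward
        return address;
    }
}

static class ToyHeapDemo
{
    static void Main()
    {
        ToyHeap heap = new ToyHeap(32);
        Console.WriteLine(heap.Allocate(12)); // 0
        Console.WriteLine(heap.Allocate(12)); // 12
        Console.WriteLine(heap.Allocate(12)); // -1, only 8 bytes are left
    }
}
```

Allocation in this model is nothing more than a bounds check and an addition, which is why creating objects on a managed heap is cheap as long as there is room.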
[Animation: step-by-step view of memory allocation and garbage collection on the managed heap]

I strongly suggest viewing the animation above completely by clicking the next button and understanding the simplified version of Garbage Collection process in dotnet. This will help you to easily understand the things which I am going to tell now.

Assuming that you have gone through the above animation, we will continue with the remaining part of the story.

Behind the Scenes:

There is one obvious difference between writing Object Oriented Programs in managed and unmanaged environments. C++ programmers, do you remember destructors in classes?

For those who don't know what destructors are, they are functions which are called to perform the last rites of an object just before it is freed from memory. In C++, the developer used to free the memory voluntarily by calling delete on the object, but in a managed environment like dotnet memory is freed by the Garbage Collector and god knows when that will happen!! So we cannot have destructors in dotnet!
This is because destructors are usually used to clean up resources used by the object, like closing an open database connection or an open file held by the object being cleaned up. Delaying this till garbage collection happens means keeping all these resources open even when they are not being used. And who knows, if a program never faces a memory crunch then the GC may not be called at all, thereby holding these resources forever!!

So in managed environment there is no concept of destructors.
Now then what do we do about resources which need to be cleaned when an object is no longer required?

Well, for this the design pattern in dotnet is to implement the IDisposable interface on such classes. This interface has only one method, called Dispose(), and in this method we are supposed to do all the cleanup activities which would otherwise have been done inside the destructor.
So all code which was supposed to go into the destructor (if it were an unmanaged environment) will now go into the Dispose() method in dotnet managed environment.

Now who calls this Dispose method and when? The catch is Dispose() method has to be called by the developer once he feels that an object is no longer required. Usually this is done just before an object goes out of scope or just before the last reference to an object is set to null.
Note that the dotnet runtime knows nothing about the Dispose() method. As I mentioned earlier, this is just a design pattern. There is no hard and fast rule that you have to do cleanup only by implementing the IDisposable interface and writing the cleanup code in its Dispose() method. You can just add a method to your class called Cleanup() or Clear() or whatever and write the cleanup code in that method, as long as you ensure that you always call it to perform the cleanup. But other developers might be using your class, and IDisposable is a common convention: any developer who sees this interface being implemented by a class automatically understands that he needs to call the Dispose() method on objects of this class when they are no longer required.
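As a small sketch of this convention, consider a class-level reference to a disposable object. FakeConnection and Worker are hypothetical classes made up for this example; the connection only flips a flag where a real class would close a database connection or a file.

```csharp
using System;

// Hypothetical stand-in for a class that holds a scarce resource.
class FakeConnection : IDisposable
{
    public bool IsOpen { get; private set; }

    public FakeConnection()
    {
        IsOpen = true; // pretend the resource was acquired here
    }

    public void Dispose()
    {
        IsOpen = false; // pretend the resource is released here
    }
}

class Worker
{
    private FakeConnection connection; // class-level reference

    public bool HasConnection { get { return connection != null; } }

    public void Start()
    {
        connection = new FakeConnection();
    }

    public void Stop()
    {
        if (connection != null)
        {
            connection.Dispose(); // release the resource first...
            connection = null;    // ...then drop the last reference
        }
    }
}

static class DisposeDemo
{
    static void Main()
    {
        Worker worker = new Worker();
        worker.Start();
        Console.WriteLine("Connection held: " + worker.HasConnection); // True
        worker.Stop();
        Console.WriteLine("Connection held: " + worker.HasConnection); // False
    }
}
```

The order inside Stop() matters: once the reference has been set to null there is nothing left to call Dispose() on, and the resource would stay open until the GC gets around to the object, if it ever does.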

Note that the dotnet FCL (the Framework Class Library) has a lot of classes, especially in ADO.Net (SqlConnection, for example), which implement the IDisposable interface.
Another advantage for C# developers of implementing the IDisposable interface instead of any alternative is that C# has a special using statement (not the using directive that imports namespaces) which automatically calls the Dispose() method on the object named in it once its block ends.
See the code snippet below.
B b = new B();
using (b)
{
    // use b
} // here the compiler-generated code calls Dispose() on b automatically
Note that for an object reference to be specified alongside the using keyword, its class (in this case class B) MUST have implemented the IDisposable interface so that the Dispose() method could be called once the scope of the using keyword ends. Else you will get a compile time error.
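It may help to see roughly what the using block stands for. The sketch below puts a using block next to a hand-written try/finally that behaves the same way for this simple case; TrackedResource and UsingDemo are invented names, and the class just records whether Dispose() was called.

```csharp
using System;

// Hypothetical class that only records whether Dispose() was called.
class TrackedResource : IDisposable
{
    public bool Disposed { get; private set; }

    public void Dispose()
    {
        Disposed = true;
    }
}

static class UsingDemo
{
    public static TrackedResource WithUsing()
    {
        TrackedResource resource = new TrackedResource();
        using (resource)
        {
            // use resource
        }
        return resource;
    }

    // Approximately what the using block above amounts to.
    public static TrackedResource WithTryFinally()
    {
        TrackedResource resource = new TrackedResource();
        try
        {
            // use resource
        }
        finally
        {
            if (resource != null)
            {
                resource.Dispose(); // runs even if the body throws
            }
        }
        return resource;
    }

    static void Main()
    {
        Console.WriteLine(WithUsing().Disposed);      // True
        Console.WriteLine(WithTryFinally().Disposed); // True
    }
}
```

The finally block is the important part: Dispose() is called even when the code inside the block throws an exception.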
Now that I have spoken about C#, C# developers might be wondering then what about the functions which they can write in C# classes whose syntax is similar to those of C++ destructors. I already said there are no destructors in dotnet. Then what are these?

Well, the dotnet framework also provides a method called Finalize() on every class (it is declared on System.Object) which, when overridden with some code, will be called by the Garbage Collector before the object's memory is reclaimed!
Now why is this method required? Well, as I said earlier, Dispose() must be called programmatically by developers to clean up resources if the class being used requires any such cleanup. (In C# one can of course use the using statement, but that cannot be used when the scope of an object is, say, at the class level.) What if the developer who is creating objects of such a class by chance forgets to call the Dispose() method on such an object? (Even though this mistake is unpardonable :-)

Well, so in that case the developer of the class implementing the IDisposable interface can call the Dispose() method in the Finalize() method so that the cleanup is at least done when the Garbage Collector is invoked by the system. As I said earlier GC calls Finalize methods on objects which have some code implemented in these methods. (NOTE again there is no guarantee that GC calls Finalize() methods on all objects which implement the method! Pretty confusing? I'll come to this later).

Coming back to C#, functions with destructor syntax are nothing but Finalize() methods, of course with a call to base class Finalize() method.
In the C# example below:
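class K
{
    K() {}
    ~K() {}
}

The C# compiler translates the above code into roughly the following:

class K
{
    K() {}

    // Conceptual expansion only: C# does not let you write this override by hand
    protected override void Finalize()
    {
        try
        {
            // body of ~K() goes here
        }
        finally
        {
            base.Finalize();
        }
    }
}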
Now imagine a situation where you are calling the Dispose() method in your Finalize() method to ensure that Dispose() gets called at least during garbage collection, even if the developer using your class forgets to call it. What if the developer also calls your Dispose() method? Then when the GC comes and calls Dispose() again via Finalize(), won't there be a problem because Dispose() has already been called?
There are two things to be noted here.
#1: You should implement your Dispose() method in such a way that it throws no exceptions no matter how many times it is called. That way, if the developer using your class accidentally calls Dispose() multiple times, it does not throw an exception from the second call onwards. This matters even more when Dispose() is called from Finalize(): from .NET 2.0 onwards, an unhandled exception thrown from a Finalize() method terminates the process by default.
#2: Making the runtime call the Finalize() method is a costly operation. GC usually occurs when there is a memory crunch, and while a collection is taking place the other managed threads may be suspended until it completes. On top of that, objects which implement Finalize() have to survive an extra GC cycle before their memory is recovered, compared to objects which do not implement Finalize()! (Will come to this later)
So what's the solution to our problem? Well, now that we are calling Dispose() from the Finalize() method too, there is a way to prevent Finalize() from being called if Dispose() has already been called by the developer. All you have to do is call GC.SuppressFinalize(this) at the end of your Dispose() method. This instructs the dotnet runtime not to call the Finalize() method on this object even though its class has implemented Finalize(). Here's a sample code snippet to achieve the same.
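using System;

class B : IDisposable
{
    public void Dispose()
    {
        //Do all cleanup activities
        GC.SuppressFinalize(this); //tell the runtime that Finalize() need not be called for this object
    }

    ~B()
    {
        Dispose(); //safety net in case the caller forgot to call Dispose()
    }
}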
In the above code snippet, we have ensured two things:
#1: In case the developer using our class forgets to call the Dispose() method on its object, at least dotnet runtime will call the Dispose() method when it calls the Finalize() method (Remember C# destructor is nothing but the Finalize() method in disguise!)
#2: In case the developer using our class correctly calls Dispose() method on its objects then, we have ensured that the dotnet runtime does not call Finalize() method on such objects.
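
Point #1 from the earlier list (Dispose() must be safe to call any number of times) can be covered with a simple flag. The following is a minimal sketch of that idea, not the only way to do it; SafeB, the disposed flag and the CleanupCount counter are names invented for this illustration, and the counter merely stands in for the real cleanup work.

```csharp
using System;

class SafeB : IDisposable
{
    private bool disposed; // remembers whether cleanup has already run

    // Counts how often the real cleanup ran, for demonstration only.
    public int CleanupCount { get; private set; }

    public void Dispose()
    {
        if (!disposed)
        {
            CleanupCount++; // stands in for closing files, connections etc.
            disposed = true;
        }
        // Cleanup is done, so the runtime need not call Finalize() later.
        GC.SuppressFinalize(this);
    }

    ~SafeB()
    {
        Dispose(); // safety net if the caller never called Dispose()
    }
}

static class SafeDisposeDemo
{
    static void Main()
    {
        SafeB b = new SafeB();
        b.Dispose();
        b.Dispose(); // second call is harmless
        Console.WriteLine("Cleanup ran " + b.CleanupCount + " time(s)"); // Cleanup ran 1 time(s)
    }
}
```

The framework design guidelines go one step further and route both paths through a protected virtual Dispose(bool disposing) method, so that the finalizer path can skip touching other managed objects. The flag-based version above is just the smallest form of the idea.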
