Even though automatic hand gesture recognition technology has been applied to real-world applications with relative success, several problems still need to be addressed before Human Computer Interaction (HCI) can be applied more widely. One such problem in hand gesture recognition is extracting (spotting) meaningful gestures from a continuous sequence of hand motions. Another problem arises from the considerable variability (e.g., in shape, trajectory, and duration) within the same gesture, even for the same person. Throughout the literature, the backward spotting technique is used: it first detects the end point of a gesture and then tracks back along the optimal path to discover its start point. Once the start and end points are detected, the trajectory between them is sent to the recognizer. Consequently, a time delay occurs between meaningful gesture spotting and recognition, and this delay is unacceptable for online applications. Furthermore, because non-gesture patterns (movements that do not correspond to any gesture) vary widely relative to the gestures themselves, modeling them is a vital issue, as there are infinitely many non-gesture patterns to accommodate. In this thesis, a forward gesture spotting system is proposed that handles hand gesture spotting and recognition simultaneously in stereo color image sequences without time delay. In addition, color information and a depth map, obtained by passive stereo measurement based on the mean absolute difference and the known calibration data of the camera, are used to localize the hands. Moreover, the hand trajectory is obtained using the mean-shift algorithm in conjunction with the depth map. This structure correctly extracts a set of hand postures to track the hand motion and achieves accurate and robust hand tracking with a stereo camera as the input device.
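The mean-shift hand tracking mentioned above can be illustrated with a minimal sketch. It assumes a per-pixel weight map (e.g., skin-color likelihood gated by the depth map) has already been computed; the function name, window representation, and parameters are illustrative and not taken from the thesis:

```python
import numpy as np

def mean_shift(weights, window, n_iter=20, eps=1e-3):
    """Shift a rectangular search window to the centroid of the weight
    map under it, repeating until the shift is negligible.

    weights : 2-D array of per-pixel hand likelihoods (assumed to
              combine skin-color probability and depth proximity)
    window  : (row, col, height, width) initial search window
    """
    r, c, h, w = window
    for _ in range(n_iter):
        patch = weights[r:r + h, c:c + w]
        total = patch.sum()
        if total == 0:  # no support under the window; stop
            break
        ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
        cy = (ys * patch).sum() / total          # weighted centroid
        cx = (xs * patch).sum() / total
        dr = cy - (patch.shape[0] - 1) / 2.0     # shift toward centroid
        dc = cx - (patch.shape[1] - 1) / 2.0
        r = int(round(min(max(r + dr, 0), weights.shape[0] - h)))
        c = int(round(min(max(c + dc, 0), weights.shape[1] - w)))
        if abs(dr) < eps and abs(dc) < eps:      # converged
            break
    return r, c, h, w
```

Gating the weight map with depth (zeroing weights outside the hand's depth range) is what lets the tracker ignore skin-colored background regions, which is the role the depth map plays in the thesis.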
One of the main contributions of this work is to examine the capabilities of the combined features of location, orientation, and velocity for gesture recognition with respect to Cartesian and polar coordinates. Furthermore, the k-means clustering algorithm is used to quantize the extracted features into codewords for Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs). The effectiveness of these features yields reasonable recognition rates. In this work, isolated gestures are handled by two different classification techniques, a generative model (HMMs) and discriminative models (CRFs, Hidden Conditional Random Fields (HCRFs), and Latent-Dynamic Conditional Random Fields (LDCRFs)), to determine which yields the best recognition results. To spot meaningful gestures accurately, a stochastic method for designing a non-gesture model with HMMs versus CRFs is proposed that requires no additional training data. The non-gesture model provides a confidence measure that serves as an adaptive threshold for finding the start and end points of meaningful gestures embedded in the input video stream. The number of states of the HMM non-gesture model grows with the number of gesture models, and this growth wastes both time and space. To alleviate this problem, a relative entropy measure is used to merge states with similar probability distributions, which saves time and space and increases the spotting speed. On the other hand, the CRF non-gesture model is improved by adding a short-gesture detector to further increase gesture spotting accuracy and to tolerate errors caused by spatio-temporal variability. Another contribution is the use of a forward spotting scheme in conjunction with a sliding window mechanism to handle hand gesture segmentation and recognition at the same time.
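The feature-extraction and quantization pipeline above can be sketched as follows. The exact feature layout and the choice of Lloyd's k-means are assumptions for illustration; the thesis combines location, orientation, and velocity features and quantizes them into HMM/CRF codewords:

```python
import numpy as np

def trajectory_features(points):
    """Per-frame combined features from a 2-D hand trajectory:
    location relative to the trajectory centroid, motion orientation,
    and velocity magnitude (layout assumed for illustration)."""
    pts = np.asarray(points, dtype=float)
    loc = pts[1:] - pts.mean(axis=0)            # location feature
    d = np.diff(pts, axis=0)                    # frame-to-frame motion
    orient = np.arctan2(d[:, 1], d[:, 0])       # orientation feature
    vel = np.hypot(d[:, 0], d[:, 1])            # velocity feature
    return np.column_stack([loc, orient, vel])

def kmeans_codewords(features, k, n_iter=50, seed=0):
    """Lloyd's k-means: returns cluster centers and the per-frame
    codeword index (nearest center) used as the discrete symbol
    sequence fed to the HMM/CRF classifiers."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([features[labels == j].mean(axis=0)
                        if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```

The resulting integer codeword sequence is what a discrete-observation HMM or a CRF consumes, which is why the quantization step sits between feature extraction and classification.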
This scheme eliminates the time delay between meaningful gesture spotting and recognition, achieves accurate and robust results, and makes the system suitable for real-time applications. To demonstrate the interplay of the suggested components and the effectiveness of the gesture spotting and recognition system, an application of gesture-based interaction with alphabets and numbers is implemented. The HMM models are trained by the Baum-Welch (BW) algorithm, while the CRFs are trained using gradient ascent with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization technique. The experiments demonstrate that the proposed HMM- and CRF-based systems are accurate and robust to spatio-temporal variability. In addition, these systems automatically recognize isolated and meaningful hand gestures with superior performance and low computational complexity when applied to several video samples containing complex situations.
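The forward spotting mechanism with an adaptive threshold can be sketched in a simplified form. Here the per-frame scores of the best gesture model and of the non-gesture (threshold) model are assumed to be precomputed by the sliding-window classifiers; in the thesis these would be HMM or CRF likelihoods, and the function name is illustrative:

```python
import numpy as np

def forward_spot(gesture_scores, nongesture_scores):
    """Forward spotting via the frame-wise difference between the best
    gesture-model score and the non-gesture (adaptive threshold) score.
    A negative-to-positive transition marks a gesture start point and a
    positive-to-negative transition marks an end point, so segmentation
    and recognition proceed together with no backward pass or delay."""
    diff = np.asarray(gesture_scores) - np.asarray(nongesture_scores)
    segments, start = [], None
    for t, d in enumerate(diff):
        if start is None and d > 0:       # crossed above threshold: start
            start = t
        elif start is not None and d <= 0:  # fell back below: end
            segments.append((start, t - 1))
            start = None
    if start is not None:                 # gesture still open at stream end
        segments.append((start, len(diff) - 1))
    return segments
```

Because the decision at each frame uses only scores already available, the spotted segment can be handed to the recognizer as soon as its end point is crossed, which is the property that makes the forward scheme suitable for online use.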