Post

Visualizzazione dei post da ottobre, 2018

Insight 4 - Kahan and Neumaier Summation

Immagine
Algorithms for 'online data stream' arithmetic mean Data streams can be denoted as an ordered sequence of points that must be accessed sequentially and can be read only once or a small number of times. Efficient processing over massive data sets has taken an increased importance in the last few decades due to the growth of the data traffic and storage required by even more applications. In particular, monitoring huge and rapidly changing streams of data has emerged as an important data management problem. Some applications are the analysis of network traffic, transaction logs, telephone call records and many more. Since data streams have potentially unbounded size, when the amount of computation memory is limited it may be impossible to produce an exact result. Here the challenge is to produce high quality approximated answers, that is, answers as correct as possible. Let's take the arithmetic mean as an example. Considering a huge data stream and by using the...

Research 4 - Boole's Inequality

Immagine
Boole's inequality, which is also known as the union bound , states that for any finite  or countable  set of events, the probability that at least one of the events happens is no greater than the sum of the probabilities of the individual events. More formally, given a set of events A 1 , A 2 , A 3 , ..., A i the following inequality holds. The mentioned inequality may be proved in several ways, here the proof using induction is provided. Proof For the case n  = 1, we trivially have P ( A i ) ≤ P ( A i ). For the case n , we've to conclude with the inequality statement. Since  P ( A  ∪ B ) = P ( A ) + P ( B ) - P ( A  ∩ B ) and by the associative property of the union operator, we've: By the first axiom of probability, we've that the last term of the equality is greater than (or equal to) 0 and then Which is true for any A i , thus completing the proof. The discussed inequality may be generalised to find both the upper and the lo...

Research 3 - Means and Markov Inequality

Immagine
The are several kinds of mean, each with its applications and properties. In probability and statistics, the most used kind of mean is the arithmetic mean which is defined as the central value of a discrete set of numbers and can be computed by adding all the values and then dividing the result by the total number of values. For instance, the average age is equal to the sum of the ages of every individual in a particular sample, divided by the total number of individuals. Naturally, the sample mean will differ from the population mean but, however, by the law of large numbers with a larger sample size, its mean will be closer to the population mean. The following list summarises the other kind of means which are commonly used in statistics and mathematics. Weighted arithmetic mean is similar to an ordinary arithmetic mean except that, instead of each of the data points contributing equally to the final average, some data points contribute more than others. Geometric m...

Insight 3 - Bayes' Theorem

Before giving the Bayes theorem statement is important to define the conditional probability. Once done, the Bayes theorem is easy to derive. Conditional Probability The conditional probability measures the likelihood of an event given that another event has occurred. For instance, assuming that  A  is the event of interest and that  B consists of a different event which has already occurred. The probability of the event  A  to occur, given the occurrence of B , (written as)  P ( A  | B ), can be computed as detailed by the following equality. P(A | B) = P(A ∩ B) / P(B)        P( B ) > 0                               (1) It is defined as the quotient of the probability of the joint events A  and B  and the probability of B . Now consider P(B | A), which is equals to P(B ∩ A) / P(A). Since P(B ∩ A) =  P(A ∩ B), then the fo...

Insight 2 - Kolmogorov Axioms and Relative Frequency

Immagine
Probability theory is a branch of mathematics concerned with the analysis of phenomena, characterised by randomness or uncertainty, in order to model/predict the behaviour of defined systems. Although there are several different probability interpretations, such as the propensity interpretation or the subjective one, the most used and established probability theory is due to Andrey N. Kolmogorov, a Russian mathematician which combined previous studies on the field and presented his axiom system for probability theory in 1933. This blog post is intended to introduce the reader to the main axioms and rules regarding the Kolmogorov's theory. Let S  denote a sample space with a probability measure P  defined over it, such that the probability of any event E ⊂ S  is given by P(E) . Then, the probability measure obeys the following axioms: AXIOM 1 - The probability of an event is a non-negative real number: P(E)  ∈   ℝ,   P(E) ≥ 0   ...

Research 2 - Conditional Frequency

Before giving the notion of conditional frequency first is important to define what is a contingency table, sometimes called two-way frequency table. A contingency table is a table which displays the (multivariate) frequency distribution of the variables of interest. For instance, consider the following dataset which contains information regarding a basket experiment made on a sample of 33 students. Boys Girls Totals Made 7 (21.2%) 6 (18.2%) 13 (39,4%) Missed 11 (33,3%) 9 (27.3%) 20 (60.6%) Totals 18 (54.5%) 15 (45.5%) 100% Every cell contains a joint relative frequency, which is the ratio of the frequency in a particular category and the total number of data values. As an example, 21.2% is obtained dividing 7 (the number of boys that made a basket) by the total of students, which is 33. Thus 7/33 = 21.2% is the joint relative frequency of boys that made a basket. The numbers in...

Research 1 - Statistics basic notions

Immagine
Statistics is a branch of mathematics which deals with the analysis, interpretation and presentation of data; it's used in several disciplines such as psychology, business, government, computer science and many more. In applying statistics it's convenient to begin with a statistical population to be studied. A statistical population is a set of elements which share some information and are of interest for one or more questions (or experiments). A subset of the population is called statistical  sample  and, a dataset , is a representation of the data that are considered during a statistical investigation. A dataset can be viewed as a table where every column is labeled with a variable name and each row contains the value the variable assume in a particular statistical unit. For instance, consider as the population the students of La Sapienza university and that we would like to analyse their gender, ages and their height. Below is the dataset of the information gathered a...