Why data is important for us
- Data is important because statistics relies on data. By data we mean information that is all around us. Whether or not we are formally doing a statistical analysis, all of us are creating data, contributing to its collection, or collecting it ourselves.
- A simple example is the household accounts data that we keep every day.
- We also contribute data every time we click a button, press a key, or move the mouse; each action generates some data.
- About 50 years ago, data would have meant just numbers and categorical data.
Data Around us
- People talk about social media analytics, multimedia analytics, and text analytics. There is so much data: even a comment on a YouTube video, a video itself, or a photo is data.
- Comments on a product on an e-commerce portal are data.
- So, there is data which is being collected as we speak.
- So much data is being generated, so much is being collected, and so much is waiting to be processed into meaningful information.
Purpose of data
- Turning data into meaningful information is the purpose: whenever we want to do a statistical analysis, we rely heavily on data.
What is data
- Data is simply the facts and figures that are collected; these may be numerical or of any other type.
- Comments on a video, on a product, or on anything else on the internet all contribute to what we call data.
- Data is fact; it is what is there.
- We summarize and analyze this data for presentation and interpretation.
- It is wrong to say, "I want my data to tell me something."
- You do not want the data to say anything. Data is fact; you use it to extract information and to interpret what it is telling you.
Why do we collect data?
- Now consider our cricket data set. Why is this data collected at all? What is its purpose?
- We might ask who has played the highest number of matches.
- That is a valid question to ask. We might also ask who has the highest batting average, or who has taken the highest number of wickets.
- Here I only have the fact that Sachin Tendulkar has played 463 matches. Afterwards I would want to seek data on each of these 463 matches: how he batted in each match and where the match was played, to see whether his in-country batting performance differs from his out-of-country batting performance.
Why do we collect data?
- We collect data because we are interested in knowing about the characteristics of groups.
- These could be groups of people, places, things, or events.
- Notice that we are not always interested only in people.
- For example, I could have a collection of data in which I record a car's model, its number of doors, and whether it is a diesel, petrol, or electric car.
- Here no people or groups of people are involved. I might also collect the mileage of the car and whether it is a sedan or a hatchback.
- Similarly, I could collect data on states, recording the population and the literacy rate of each.
- For the states Andhra Pradesh, Telangana, Assam, West Bengal, and Tamil Nadu, I could note down the population and the literacy rate.
- Here I am interested in knowing about the states, so the groups could be places; they could be anything.
- So we need not restrict ourselves to people.
- We collect data whenever we are interested in some characteristic or attribute, and we seek data to answer questions about these characteristics or attributes.
- For example, I might want to know the temperature in a particular month in Chennai.
- Chennai is a place; again I am not interested in people here. I might also want to know the marks obtained by students, or how many people like a new song.
- These days almost anything that is streamed over the internet has likes and dislikes.
- So, you might want to know how many people like a new song, a new product, or a new video.
- To summarize, we collect data because we are interested in the characteristics of groups of people, places, things, or events.
Where do we collect this data from?
- Either you go and collect or generate the data yourself, or you use data that is already available; published data is available in many places.
- Most governments publish their own data sites. One site you can always look at is data.gov.in, the Open Government Data Platform India.
- This site gives almost all the data that is collected at the government level.
- For example, it has data on drinking water and sanitation, health, the economy, transport, and education.
- So, either you go and collect data or you work with published data.
- If the questions you are seeking to answer need data that has to be generated or collected, you have to go and collect that data.
Unstructured data
- Suppose I ask a person at a retail market what items have been sold.
- He comes up with data of this kind: customer 1 bought Maggi, KitKat, Pepsi, and perhaps a Colgate toothbrush.
- Customer 2 bought Maggi, Coke, and a toothbrush.
- Customer 3 bought tea, Bisleri water, and some cookies.
- This is the data the person sitting at the retail counter is collecting. If it is presented to us in this form, we cannot make much meaning out of it.
- Now imagine this person gives us similar data for 50 customers.
- It is data nevertheless, but it is in an unstructured, unorganized form.
- For the information in data to be useful, we must know the context of the numbers and text it holds.
- When data is scattered with no structure, the information is of very little use; hence, I need to organize the data.
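One way to give such a list some structure is sketched below in Python. The items follow the three customers described above; the note format, the function name `to_rows`, and the one-row-per-purchase layout are assumptions made for illustration, not something the retail counter actually produced.

```python
# Unstructured notes, as they might be jotted down at the retail counter.
raw_notes = [
    "customer 1: Maggi, KitKat, Pepsi, Colgate toothbrush",
    "customer 2: Maggi, Coke, toothbrush",
    "customer 3: tea, Bisleri water, cookies",
]

def to_rows(notes):
    """Turn each free-text note into one row per (customer, item) pair."""
    rows = []
    for note in notes:
        customer, items = note.split(":", 1)
        for item in items.split(","):
            rows.append({"customer": customer.strip(), "item": item.strip()})
    return rows

rows = to_rows(raw_notes)

# With a structure in place, simple questions become easy to answer,
# for example: how many customers bought each item?
item_counts = {}
for row in rows:
    item_counts[row["item"]] = item_counts.get(row["item"], 0) + 1

for item, count in sorted(item_counts.items()):
    print(item, count)
```

Once every purchase sits in its own row with named columns, the same counting step would work unchanged for 50 customers.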
What do I mean by organizing data?
- Here again is a hypothetical data set. Imagine it had been collected by a person standing at the college entrance who, as each student entered, simply wrote down: Anjali, female, board ICSE, marks obtained 98, and so on; then Ramu, male, and so on. Collected this way, the data would again have been unstructured.
- So we give a structure to this data in the form of what we refer to as a data table.
- There are many ways of tabulating your data.
- Another data set covers about 20 patients entering a diagnostic center over a period of time (at 7:30, 8:00, 8:12, and so on), recording the height, gender, blood group, body temperature, and blood pressure of each.
- Again, this gives us tabulated data.
Structured collection of data
- To explain what I mean by a structured collection of data that forms a data set, I first need to explain what a variable is.
Variable and unit
- A variable is something that "varies"; formally, it is a characteristic or attribute that varies across all units.
- In the student data set, name is a variable, and gender is definitely a variable. If I had taken this data from a men's-only college or a girls-only school, gender would not have been a variable; everybody would have been of the same gender.
- The marks obtained is a variable.
- So is the date of birth: not everybody was born on the same day.
- If only the year of birth were recorded, it might not vary a lot, but there would still be some slight amount of variability. The board and the mobile number vary as well.
- So, as we have defined them, name, gender, date of birth, marks, board, and so on are all variables, whereas Anjali, Pradeep, Varsha, and Divya are all cases or observations.
- Similarly, consider the hospital data. Look at the date: the first records were all taken on the same day, so there the date is not varying, but further down it is the third.
- If my data is a subset consisting of only the first eight observations, all of them were taken on the same date.
- So in that subset the date is not varying, whereas the time of entry is varying.
- Height varies, gender varies, blood group varies, and so do blood pressure and body temperature.
- In this case I have not noted down a number for each person, but I can call them person 1, person 2, and so on. Person 1 is recorded on time of arrival, height, gender, weight, blood group, body temperature, and blood pressure. Similarly, each person who enters the system is recorded on each of the variables.
- Similarly, in the players data set, we have the jersey number.
- No two players have the same jersey number, so it varies. The matches played and the role are variables too. Country could also be a variable, but since this data set covers only India, country is not a variable here. Then there are highest score, wickets, and bowling average.
- You see that Gautam Gambhir did not take any wickets, but he has bowled.
- So his value is 0, whereas for Dinesh Karthik or Robin Uthappa there is no data available to give any bowling statistics at all.
- So even when I am collecting data, it is quite possible that some of the variables I am seeking are not available for every unit.
- Non-availability of data is different from data taking the value 0; the difference means a lot.
- The 0 is recorded for a person who has bowled 13 overs, whereas the data is not available for the others because they have not bowled.
- We cannot record those as 0; that would tell a completely different story.
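The difference between a value of 0 and a value that is not available can be sketched in Python, using `None` as an assumed not-available marker. Only the 0 for Gautam Gambhir and the missing entries for Dinesh Karthik and Robin Uthappa come from the discussion above; "Player X" and its wicket count are hypothetical, added only so that the average has something to work with.

```python
# Wickets taken; None marks "not available" (the player has not bowled).
wickets = {
    "Gautam Gambhir": 0,      # has bowled, took no wickets
    "Dinesh Karthik": None,   # no bowling data available
    "Robin Uthappa": None,    # no bowling data available
    "Player X": 4,            # hypothetical player and value
}

def mean_available(values):
    """Average over the units for which the value is available."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known) if known else None

def mean_treating_missing_as_zero(values):
    """What we would get if 'not available' were wrongly recorded as 0."""
    filled = [0 if v is None else v for v in values]
    return sum(filled) / len(filled)

correct = mean_available(wickets.values())
wrong = mean_treating_missing_as_zero(wickets.values())
print(correct)  # average over the two players who have bowled
print(wrong)    # average pulled down by two zeros that were never observed
```

The two results differ, which is the "completely different story" that recording not-available values as 0 would tell.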
What is a variable?
- So, intuitively, a variable is something that varies; formally, it is a characteristic that varies across all units.
- The characteristic may not be available for every unit: as we saw in the cricket data set, the bowling average is not available for certain players.
- We record 0 only if the value is actually 0; there is a difference between a value of 0 and not available.
- Columns represent variables, and for each variable the same type of value is recorded.
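Whether a column actually varies across the units can be checked mechanically. The sketch below, in Python, stores a small table as a list of rows and reports which columns take more than one distinct available value. It imagines the girls-only school case mentioned earlier; Anjali's marks follow the example above, while the other marks, the function name `varying_columns`, and the use of `None` for not available are assumptions.

```python
# One row per case (student), one column per characteristic.
# Anjali's marks follow the example above; the other marks are hypothetical.
students = [
    {"name": "Anjali", "gender": "female", "marks": 98},
    {"name": "Varsha", "gender": "female", "marks": 85},
    {"name": "Divya", "gender": "female", "marks": None},  # marks not available
]

def varying_columns(rows):
    """Return the columns that take more than one distinct available value."""
    result = []
    for column in rows[0]:
        distinct = {row[column] for row in rows if row[column] is not None}
        if len(distinct) > 1:
            result.append(column)
    return result

print(varying_columns(students))
```

In this table gender takes a single value for every case, so it is not reported as a variable, while name and marks are.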
What is the same type of value?
- Again let us go back to our hospital data set.
- In this hospital data set, consider two variables: height and weight.
- Next to height, centimeters is written, and next to weight, kilograms is written.
- Notice that the data collection starts at 7:30 and ends at 10 o'clock on 2nd March.
- Suppose the people recording the data work in shifts of 2 hours: one person starts at 7 and ends at 9, and the next starts at 9 and goes up to 11.
- The person who comes in for the shift at 9 decides to record height in feet. He or she would record a height of about 180 centimeters as 6 feet rather than as 180.
- So after a certain point, the values in your data set would suddenly start appearing as 6.
- One value would be 6, another might be 5 feet 11 inches, another 4 feet 9 inches, and so on.
- Then at 10 o'clock another person comes in and the recording changes yet again.
- So even though all these values measure height, there is no consistency in the units that have been used.
- Even a first glance at such a data set tells us that there is some problem with it.
- Whenever we measure a variable that has units, we have to be consistent about the units we use across all the observations.
- That is what we mean by saying that columns represent variables and, for each variable, the same type of value is recorded for each case.
- When I am collecting data in the format I want, I need to ensure the following: the rows represent cases, the columns represent variables, and for each variable the same type of value is recorded.
- Every observation has its own row, each variable is measured for every observation, and if a variable has units, the units are consistent.
- For every observation, we record the value of each variable.
- If the value of a variable is not available, I put a not-available symbol; that is, I capture the non-availability of that variable for that observation.
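A minimal sketch of how mixed height units could be brought back to a single unit is given below in Python. It assumes each recorded height carries its unit explicitly, which the shift workers in the example above did not do; the record layout and the function name `height_in_cm` are hypothetical. The conversion factors (1 foot = 30.48 cm, 1 inch = 2.54 cm) are standard.

```python
CM_PER_FOOT = 30.48
CM_PER_INCH = 2.54

def height_in_cm(value, unit, inches=0):
    """Convert a recorded height to centimeters.

    unit is "cm" or "ft"; inches is the extra inches for heights in feet.
    """
    if value is None:
        return None  # not available stays not available
    if unit == "cm":
        return float(value)
    if unit == "ft":
        return round(value * CM_PER_FOOT + inches * CM_PER_INCH, 2)
    raise ValueError(f"unknown unit: {unit}")

# Heights as different shift workers might have recorded them
# (value, unit, extra inches); the last height is not available.
recorded = [
    (180, "cm", 0),
    (6, "ft", 0),
    (5, "ft", 11),
    (4, "ft", 9),
    (None, "cm", 0),
]

heights_cm = [height_in_cm(v, u, i) for v, u, i in recorded]
print(heights_cm)
```

After conversion the whole column is in centimeters, and the missing height is still marked as not available rather than being turned into a number.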