Ten-Level Taxonomy of Data: Potential Sources of Learner Insights

Part 3 of Our 4-Part Series on Big Data in Learning

This series is extracted from the CogBooks white paper, “Big Data.” You’ll find a link to the next post at the end of each article. You can also use the menu below to read the articles in whatever order you prefer.

Part 1: Big Data in Learning: The Emerging Value of Online Learning Datasets 
Part 2: Learning Data Is Incompetent: Refocusing Education Measurement
Part 4: Flipped Statistics: The Changing Paradigm of Education Data

Learning data can be harvested at ten different levels. The inverted pyramid below shows this hierarchy. Note that it moves through different categories but, in general, describes a move upwards towards big data, as each level potentially includes those below it.
An inverted pyramid showing a hierarchy of levels from which data can be harvested. The levels, from smallest to largest, are: Brain, Learner, Course Component, Course, Group of Courses, Institution, Group of Institutions, National/International, Web.

1. Data on the Brain

We’ve seen the commercial launch of some primitive toys using brain sensors, but we’ve yet to see brain and situational sensing really hit the world of learning. Learning is wholly about changing the brain, so one would expect brain research, at some point, to accelerate learning through cheap, consumer-grade brain- and body-based technology.

South Korea is developing software and hardware that may profoundly change the way we learn. With the development of an ‘emotional sensor set’ that measures EEG, EKG, and, in total, seven kinds of bio-signals, along with a situational sensor set that measures temperature, acceleration, gyroscopic orientation, and GPS position, researchers there want to literally read our brains and bodies to accelerate learning. There are problems with this approach: it is not yet clear that the EEG and other brain data gathered by sensors measure much more than ‘cognitive noise’ and general increases in attention or stress, and it is hard to causally relate these physiological states to learning beyond the simple reduction of stress. Taken individually, the measures are like simple temperature gauges that go up and down. The promise, however, is that a combination of these variables does the job.
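To make that ‘combination of variables’ idea concrete, here is a purely illustrative Python sketch of fusing several pre-normalized bio-signal channels into a single engagement estimate. The channel names, weights, and 0-to-1 scaling are assumptions invented for the example, not any real sensor API, and real EEG/EKG pipelines involve heavy signal processing not shown here.

```python
# Hypothetical, pre-normalized sensor channels (0.0-1.0 each); the
# names are invented stand-ins for the seven bio-signals described above.
SIGNALS = ["eeg_alpha", "ekg_hr", "gsr", "temp", "accel", "gyro", "gps_moving"]

def engagement_score(sample: dict, weights: dict) -> float:
    """Fuse seven bio-signal channels into one rough engagement estimate.

    Weighting and averaging dampens any single noisy channel (the
    'temperature gauge' problem described above).
    """
    total = sum(weights[s] for s in SIGNALS)
    return sum(sample[s] * weights[s] for s in SIGNALS) / total

sample = {"eeg_alpha": 0.6, "ekg_hr": 0.4, "gsr": 0.7, "temp": 0.5,
          "accel": 0.2, "gyro": 0.3, "gps_moving": 0.0}
weights = {s: 1.0 for s in SIGNALS}  # equal weights for the toy example
print(f"engagement ~= {engagement_score(sample, weights):.2f}")  # 0.39
```

Even as a toy, the sketch shows the design point: no single channel is trusted on its own; the weighted blend is what smooths out the noise.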

2. Data on the Learner

This is perhaps the most fruitful type of data, as it is the foundation for both learners and teachers to improve the speed and efficacy of learning. At the simplest level, one can have conditional branches that take input from the learner and other data sources to branch the course and provide routes and feedback to the learner (and teacher). Beyond this, rule-sets and algorithms can be used to provide much more sophisticated systems that present the content of the learning experience screen by screen. There are many ways in which adaptive learning can be executed. See this white paper from Jim Thompson on ‘Types of Adaptive Learning.’ In adaptive learning systems, the software acts as a sort of satellite navigation, in that it knows who you are, what you know, what you don’t know, where you’re having difficulty, and a host of other useful, learner-specific variables. These variables can be used by the software, the learner, or the teacher to improve the learning journey.
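As a minimal sketch of the ‘conditional branch’ idea, here is a Python toy with an invented two-strikes remediation rule and illustrative screen names; it is not CogBooks’ actual algorithm, just the shape of one.

```python
from dataclasses import dataclass, field

@dataclass
class LearnerState:
    """Toy learner model; real systems track many more variables."""
    mastered: set = field(default_factory=set)
    failed_attempts: dict = field(default_factory=dict)

def next_screen(state: LearnerState, screen: str, correct: bool) -> str:
    """Choose the next screen with simple conditional rules:
    advance on success, remediate after two failures, otherwise retry."""
    if correct:
        state.mastered.add(screen)
        return f"next_after_{screen}"
    state.failed_attempts[screen] = state.failed_attempts.get(screen, 0) + 1
    if state.failed_attempts[screen] >= 2:
        return f"remedial_{screen}"  # route to easier material
    return screen  # one more try at the same screen

state = LearnerState()
print(next_screen(state, "fractions_1", correct=False))  # fractions_1
print(next_screen(state, "fractions_1", correct=False))  # remedial_fractions_1
```

Real adaptive systems replace the hard-coded rule with rule-sets or algorithms over a much richer learner model, but the contract is the same: learner input and state in, next learning experience out.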

3. Data on Course Components

One can look at specific learning-experience components in a course, such as video, use of forums, specific assessment items, and so on. Peter Kese of Viidea is an expert in analytics from recorded lectures, and his results are fascinating.

Gathering data from recorded lectures improves lectures, as one can spot the points at which attention drops and where key images, points, and slides raise attention and keep the learners engaged.

When Andrew Ng, a co-founder of Coursera, looked at the data from his ‘Machine Learning’ MOOC, he noticed that around 2,000 students had all given the same wrong answer: they had inverted two algebraic equations. What was wrong, of course, was the question. This is a simple example of an anomaly in a relatively small but complete data set that can be used to improve a course.

The next stage is to look for weaknesses in the course in a more systematic way, using algorithms designed to look specifically for repeatedly failed test items; a sketch of such a check follows below. At this level, we can pinpoint learner disengagement and weak, even erroneous, test items, leading to course improvement. At a more sophisticated level, in a networked learning solution where learning experiences are presented to the learner, screen by screen, based on algorithms, items can be promoted or demoted within the network.
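Here is a minimal sketch of that kind of check, assuming a hypothetical response log and an invented failure threshold. It flags items failed unusually often and reports the modal wrong answer, which, as the Coursera anecdote shows, often points at the question rather than the learners.

```python
from collections import Counter

# Hypothetical response log: (item_id, answer_given, is_correct).
responses = [
    ("q7", "B", False), ("q7", "B", False), ("q7", "B", False),
    ("q7", "A", True), ("q3", "C", True), ("q3", "D", False),
]

def flag_suspect_items(log, fail_threshold=0.6):
    """Return items whose failure rate exceeds the threshold,
    along with the most common wrong answer for each."""
    by_item = {}
    for item, answer, ok in log:
        stats = by_item.setdefault(item, {"n": 0, "fails": Counter()})
        stats["n"] += 1
        if not ok:
            stats["fails"][answer] += 1
    flagged = {}
    for item, s in by_item.items():
        fail_rate = sum(s["fails"].values()) / s["n"]
        if fail_rate >= fail_threshold:
            flagged[item] = (fail_rate, s["fails"].most_common(1)[0][0])
    return flagged

print(flag_suspect_items(responses))  # {'q7': (0.75, 'B')}
```

In a networked system, the same statistics could drive the promotion or demotion of items, rather than just a report for a human reviewer.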

4. Data on the Course

A course can produce data that also shows weak spots, as well as dropout rates and perhaps indications of the causes of those dropouts. One can gather pre-course data about the nature of the learners (age, gender, ethnicity, geographical location, educational background, employment profile, and existing competencies). During the course, one can capture time taken on tasks, note-taking, when learning takes place, and for how long. Physiological data such as eye-tracking and signals from the brain may also provide useful perspectives. Pre-course or initial diagnostic data can be used to determine what is presented in the course; at a more sophisticated level, it can be used as the course progresses, much as satellite navigation provides continuous data as you drive.
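As a small, hypothetical illustration of diagnostic-driven presentation, a placement step might skip units the learner already demonstrably masters. The syllabus, scores, and mastery cut-off below are invented for the example.

```python
# Hypothetical diagnostic scores per syllabus unit (0.0-1.0);
# unit names and the 0.8 mastery cut-off are invented.
SYLLABUS = ["algebra_basics", "linear_equations", "quadratics"]

def placement(diagnostic: dict, cutoff: float = 0.8) -> list:
    """Return the units to present, skipping those already mastered."""
    return [unit for unit in SYLLABUS if diagnostic.get(unit, 0.0) < cutoff]

print(placement({"algebra_basics": 0.9, "linear_equations": 0.55}))
# ['linear_equations', 'quadratics']
```

The satellite-navigation version of this re-runs the same decision continuously, on data gathered as the course progresses rather than a one-off diagnostic.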

Course output data from summative assessment is also useful. However, the big data approach pushes us away from relying solely on it, as was so often the case in the past. This matters to two audiences: the learners themselves, who get to know what they have and have not achieved, and the tutor, teacher, or trainer, who can use personal data to provide formative assessment, interventions, and advice. In one sales course for a major U.S. retailer, sales staff are given sales training in a 3D simulation which delivers sales scenarios with a wide range of customers and customer needs. Individual competencies are taught, practiced, and tracked, so the actual performance of the learners is measured within the simulation. Sales in the stores where staff received the simulation training were 6% greater than in the control-group stores that received traditional training. This is a good example of fine-grained data being gathered.

5. Data on Groups of Courses

MOOCs, in particular, have raised the stakes in data-driven design and delivery of courses. In truth, the likes of Coursera and edX gather less data about learners than one would imagine, but MOOC mania has accelerated interest in data-driven reflection. The University of Edinburgh has produced a data-heavy report on its six 2013 Coursera MOOCs, taken by over 300,000 learners. The report has good data, tries to separate active learners from window shoppers, and is not short on surprises. It is a rich resource, and a follow-up report is promised. This is in the true spirit of higher education: open, transparent, and looking to innovate and improve. Rather than summarize the report here, we have collected the main findings, which point towards the future development of MOOCs, in our Adaptive MOOCs white paper. That, combined with the useful information on resources expended by the University, makes it an invaluable business planning tool. Lori Breslow, Director of the MIT Teaching and Learning Laboratory, has also looked at massive data generated by MOOC users for clues on how to design the future of learning, drawing on edX’s ‘Circuits and Electronics’ (6.002x), launched in March 2012. That dataset includes IP addresses of 155,000 enrolled students, clickstream data on each of the 230 million interactions students had with the platform, scores on homework assignments, labs, and exams, 96,000 individual posts on a discussion forum, and an end-of-course survey to which over 7,000 students responded.

6. Data on the Institution

At this organizational level, it is vital that institutions gather data that is much more fine-grained than just assessment scores and numbers of students who leave. Many institutions, arguably most, have problems with dropouts, either across the institution or on specific courses. One way to tackle this issue is to gather data to identify deep root causes, as well as spot points at which interventions can be planned.
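As a hypothetical illustration of spotting intervention points, an early-warning rule over weekly engagement data might look like the sketch below. The thresholds and field names are invented, and any real system would need validating against the institution’s own dropout data.

```python
# Hypothetical weekly engagement snapshots per student.
students = {
    "s1": {"logins": [5, 4, 4, 3], "avg_score": 0.72},
    "s2": {"logins": [4, 2, 1, 0], "avg_score": 0.51},
}

def at_risk(record, min_logins=2.0, min_score=0.6) -> bool:
    """Flag a student whose recent activity or scores fall below
    the (illustrative) thresholds."""
    recent = record["logins"][-2:]  # last two weeks of logins
    low_activity = sum(recent) / len(recent) < min_logins
    return low_activity or record["avg_score"] < min_score

flagged = [sid for sid, rec in students.items() if at_risk(rec)]
print(flagged)  # ['s2']
```

Flags like this are prompts for human follow-up, not automatic judgments; root-cause analysis still means asking why the flagged patterns arise.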

7. Data on Groups of Institutions

Perhaps we should be realistic about the word ‘big’ in an educational context, as few organizations other than large multinational private companies will hold truly ‘big’ data. Skillsoft, Blackboard, Laureate, and others may be able to muster massive data sets, but a typical school, college, or university may not. MOOC providers, such as Coursera, are another group with the ability and reach to gather significantly large amounts of data about learners.

8. Data at the National Level

National data is gathered by governments and organizations to diagnose problems and successes and reflect on whether policies are working. This is most often input data, such as numbers of students applying for courses, who those students are, and so on. Then there’s output data, usually measured in terms of exams and certification. This misses much in terms of actual improvement and often leads to an obsession with testing that takes attention away from the more useful data about the processes of learning and teaching.

9. Data at the International Level

At the international level, bodies such as UNESCO and the OECD collect data, such as PISA and PIAAC results, produced to compare countries’ performance. It is not at all clear that this data is as reliable as its authors claim. Within countries, politicians then take these statistics, exaggerate their significance, cherry-pick the comparison countries (Singapore but not Finland), and use them to design and implement policies that can do great harm. PISA comparisons, for example, span huge differences in demographics, socio-economic range, and linguistic diversity across the tested nations. The skews in the data include the selection of one flagship city (Shanghai) to compare against entire nations; immigration effects, such as the number of immigrants, selective immigration, migration towards English-speaking nations, and first-generation language issues; the extra time it takes to learn to read orthographically irregular languages; and selectivity in what the curriculum covers.

10. Data on the Web

Google, Amazon, Wikipedia, YouTube, Facebook, and others gather huge amounts of data from users of their services; this data is then used to improve those services. Indeed, I have argued that Google Search, Google Translate, Wikipedia, Amazon, and other services now play an important pedagogical role in real learning. There are lessons here for education in terms of the importance of data: one should always be looking to gather data on online learning, and Google Analytics is a wonderful tool for doing so.

Read Part Four: "Flipped Statistics: The Changing Paradigm of Education Data."

Give Students Greater Agency and Instructors More Control

CogBooks weaves student agency, instructor empowerment, and curriculum affordability ($39.95 per course) into a comprehensive, adaptive learning platform. Simple to adopt and manage, it is a direct replacement for textbooks. Higher education institutions or instructors can choose CogBooks for a single course or build an entire degree program, such as the Biospine Initiative at Arizona State University. The CogBooks adaptive learning platform has been used by more than 200,000 students worldwide and is proven to reduce dropouts by 90%* while improving student performance by 24%*. Connect with us if you’re interested in learning more, creating a custom course, or developing an entire degree program.

*Data from a consecutive four-year study in Introduction to Biology for Non-Majors at Arizona State University.