Courses/CS 2114/Lab Manual/Data Processing

From A-State Computer Science Wiki
Jump to: navigation, search

Data Processing

Introduction

This lab will explore the use of the algorithms in C++ for processing data. Students are encouraged to build a program using the code provided in the tutorial.

Topics Covered in this Lab:
  • Introduction to Data Processing
  • Standard Template Library (STL)
  • STL algorithms
Questions Answered in this Lab:
  • What is data processing?
  • What is an algorithm in regards to the STL?
  • How are the algorithms provided in the STL used in a program?
Demonstrable Skills Acquired in this Lab:
  • ability to use STL algorithms


Introduction to Data Processing

According to the Google Dictionary, data processing is a series of operations on data, especially by a computer, to retrieve, transform, or classify information.

The various steps or processes of data processing, as shown on Computer Hope include:

  • Validation – ensuring that supplied data is without error, accurate and useful
  • Sorting – arranging items in different sets and sequences
  • Summarization – distilling detailed data to its main points
  • Aggregation – combining all of the different pieces of data
  • Analysis – the actual process of collecting, organizing, analyzing, interpreting, and presenting the data
  • Reporting – a summary of the processed information
  • Classification – separates data into various categories

Previous labs and homeworks have included some data processing (e.g., the calculation of the mean and standard deviation of a series of numbers). This lab will focus on data processing by implementing some of the algorithms provided in the Standard Template Library.


Standard Template Library (STL)

The Standard Template Library (STL) is a library to provide common, generic classes and functions that implement many commonly used algorithms and data structures.

The architecture of STL is largely the creation of Alexander Stepanov. In 1979 he began working out his initial ideas of generic programming and exploring their potential for revolutionizing software development. Although David Musser had developed and advocated some aspects of generic programming already by year 1971, it was limited to a rather specialized area of software development (computer algebra).

Stepanov recognized the full potential for generic programming and persuaded his then-colleagues at General Electric Research and Development (including, primarily, David Musser and Deepak Kapur) that generic programming should be pursued as a comprehensive basis for software development. At the time there was no real support in any programming language for generic programming.

(Historical information found on Wikipedia.org)

The STL contains:

  • Containers - A container is an abstract data structure. Some examples of containers defined in the STL are vectors, maps and queues. We discussed vectors in a previous lab.
  • Iterators - iterators provide a way to traverse and access data stored in containers.
  • Algorithms - The algorithms library contains functions which implement several common algorithms. These algorithms are designed to be used across a range of elements, which can be found in arrays and STL containers. There are algorithms for such actions as sorting, searching, reversing, etc. We will discuss some of these algorithms in this tutorial.


The algorithm Library

The algorithm library functions, for such common activities as searching and sorting, etc., most often work on ranges of elements. These ranges are defined as [first, last) where first is an iterator indicating the first element of the range and so is included in the range, as indicated by the left square bracket `['; last is an iterator indicating one element past the last element of the range and so is excluded from the range, as indicated by the right parenthesis `)'.

The algorithms in the algorithm library are categorized as follows:

  • Nonmodifying algorithms - Nonmodifying algorithms do not change the value of any element, nor do they modify the order in which the elements appear. They can be called for all of the standard STL container objects, as they make use of the simpler forward iterators.
  • Modifying algorithms - Modifying algorithms are designed to alter the value of elements within a container. This can either be done on the container directly or via a copy into another container range. Some algorithms classed as 'modifying' algorithms can change the order of elements (in particular copy_backward()), but most do not.
  • Removal algorithms - Removal algorithms are, by definition, modifying algorithms, but they are designed to remove elements in a container or when copying into another container.
  • Mutating algorithms - Once again, mutating algorithms are modifying algorithms, but they are designed specifically to modify the order of elements (e.g. a random shuffle or rotation).
  • Sorting algorithms - Sorting algorithms are modifying algorithms specifically designed for efficient sorting of elements in a container (or into a range container).
  • Sorted range algorithms - Sorted range algorithms are special sorting algorithms designed to function on a container which is already sorted according to a particular sorting criterion. This allows for greater efficiency.
  • Numeric algorithms - Numeric algorithms are designed to work on numerical data in some fashion. The principal algorithm in this category is accumulate(), which allows mathematical operators to be applied to all elements in a container.

There are many resources on the Internet that provide a list, description and example use of all the algorithms provided in the algorithm library; C++ Reference is one such resource. This tutorial will demonstrate only a smattering of the algorithms from across several of the above categories.

In order to use an algorithm, the algorithm library must be made accessible to the program and, in particular, the algorithm that will be used. Be sure to add the following preprocessor directive to the beginning of any program using an algorithm function. The example using statement shown below indicates that the sort algorithm will be used somewhere in the program.

Verbatimcode.pngCode Illustration
#include<algorithm>
    using std::sort;

(Category names and descriptions from QuantStart)


Nonmodifying Algorithms

Nonmodifying algorithms are algorithms that do not change the data nor the container in which the data resides.

Searching

Searching is one of the most fundamental problems in Computer Science. Searching involves looking through a set of data for a particular value or set of values. Some search algorithms required the data being searched to be sorted while others do not. Consider the following two search algorithms.

The most basic algorithm for searching is the linear search, where the search begins at the first piece of data then moves to the second piece of data and so on until either the search key is found or the end of the data is reached. This algorithm does not required the data to be sorted. This search is implemented in the find() and count() algorithms which are demonstrated below.

A more advance algorithm is the binary search. The binary search requires the data to be sorted. The binary search begins the search in the middle of the data. If the data value being examined is the search key, the search is a success. If the search key is smaller than the data value being examined, the search concentrates on the first half of the data; otherwise, the search concentrates on the second half of the data. The new data range is then searched in the same manner until either the search key is found or there is no more data to search. This search is implemented in the binary_search() algorithm which is demonstrated below.

std::find()
Information Icon.pngSyntax
std::find(first, last, searchValue)

The find() algorithm is one that works on a range of elements. The first parameter is the iterator of the first data element to be examined. The second parameter is the iterator that indicates the end of the data. The third parameter is the value for which to search.

find() begins searching in the first data element indicated via the first parameter. If this data value is not the same as the searchValue, the next next data value is examined. This process is repeated until either the data value being examined is the same as the searchValue or the end of data is reached.

find() returns an iterator. If the search value was found, the iterator that is returned specifies the first occurrence of the search value in the data. If the search value was not found, the iterator is the same as the iterator indicating the end of the data.

Consider the following data.

Verbatimcode.pngCode Illustration
vector<int> dataValues{3, 6, 8, 4, 6, 9, 2, 4, 9, 6};
int searchValue1 = 4;
int searchValue2 = 7;

Using the data and search values defined above,

Verbatimcode.pngCode Illustration
cout << "\nSearching for " << searchValue1 << ":\n";
vector<int>::iterator itOfSearchValue1 = std::find(dataValues.begin(), dataValues.end(), searchValue1);
if (itOfSearchValue1 != dataValues.end())
    cout << searchValue1 << " was found:  " << (*itOfSearchValue1) << endl;
else
    cout << searchValue1 << " was not found.\n";

would produce a message indicating that the search value was found while

Verbatimcode.pngCode Illustration
cout << "\nSearching for " << searchValue2 << ":\n";
vector<int>::iterator itOfSearchValue2 = std::find(dataValues.begin(), dataValues.end(), searchValue2);
if (itOfSearchValue2 != dataValues.end())
    cout << searchValue2 << " was found:  " << (*itOfSearchValue2) << endl;
else
    cout << searchValue2 << " was not found.\n";

would produce a message indicating the the search value was not found.

std::binary_search()
Information Icon.pngSyntax
std::binary_search(first, last, searchValue)

The binary_search() algorithm is one that works on a range of elements. The first parameter is the iterator of the first data element to be examined. The second parameter is the iterator that indicates the end of the data. The third parameter is the value for which to search.

The find() algorithm implements a linear search on data which means that all values are examined, and that the data does not have to be in sorted order. A single linear search on a small amount of unsorted data is not "expensive" time-wise. A single linear search on a large amount of unsorted data is not considered to be all that expensive either; however, repeated linear searches on large amounts of unsorted data can become expensive. The sorting of the data can also be expensive in regards to computing time, but often the time invested in the sorting process is surpassed by the time saved when conducting repeated searches.

As noted above, a linear search begins comparing the first data value with the search value, then the second, then the third, and so on, until the data is either found or the end of the data is reached. With unsorted data, there is no other alternative. With sorted data, a divide-and-conquer approach can be taken by first comparing the middle data value with the search value and using the result of that comparison to determine what data, if any, should be searched next. If the value to be found is smaller than the value in the middle of the data, only the first half of the data needs to be considered next. If the value to be found is bigger than the value in the middle of the data, then only the last half of the data needs to be considered next. Of course, if the value to be found was indeed found, the search is complete. This concept can be applied repeatedly to narrow the search to a smaller and smaller section of the data. This approach requires far fewer comparisons than the linear approach making it the more time efficient algorithm when repeatedly searching large amounts of data.

The binary_search() algorithm implements this approach to searching.

Consider the following data.

Verbatimcode.pngCode Illustration
vector<int> sortedData{3, 4, 6, 8, 9, 11, 15, 16, 17};

Using the data in the sortedData vector and the two search values defined above,

Verbatimcode.pngCode Illustration
cout << "\nSearching for " << searchValue1 << ":\n";
if (std::binary_search(sortedData.begin(), sortedData.end(), searchValue1))
    cout << searchValue1 << " was found.\n";
else
    cout << searchValue1 << " was not found.\n";

would produce a message indicating that the search value was found while

Verbatimcode.pngCode Illustration
cout << "\nSearching for " << searchValue2 << ":\n";
if (std::binary_search(sortedData.begin(), sortedData.end(), searchValue2))
    cout << searchValue2 << " was found.\n";
else
    cout << searchValue2 << " was not found.\n";

would produce a message indicating the the search value was not found.

Note that the find() returns a location, while the binary_search() returns a Boolean result.

Note: There are three other algorithms using the binary search approach; these algorithms, along with binary_search(), although nonmodifying algorithms, are often categorized together under Binary Search on code reference sites such as cplusplus.com or cppreference.com.


🤔   What do you think would happen if the data being searched in the binary_search() was not sorted?

If the container is not sorted, the divide-and-conquer approach would not work, since looking at the middle element would no longer give you any information about where the target might be located.


std::count()
Information Icon.pngSyntax
std::count(first, last, valueToCount)

The count() algorithm is one that works on a range of elements. The first parameter is the iterator of the first data element to be examined. The second parameter is the iterator that indicates the end of the data. The third parameter is the value for which to search.

The count() algorithm will search the container linearly for the specified search value and return the number of times it was found.

The following two examples demonstrate when data is found and when it is not. The code assumes the same data and search values as defined previously.

Verbatimcode.pngCode Illustration
cout << "\nNumber of times " << searchValue1 << " is seen in the data:  "
     << std::count(dataValues.begin(), dataValues.end(), searchValue1) << endl;

The code above should indicate that the value was found 2 times while the code below should indicate that the value was found 0 times.

Verbatimcode.pngCode Illustration
cout << "Number of times " << searchValue2 << " is seen in the data:  "
     << std::count(dataValues.begin(), dataValues.end(), searchValue2) << endl;

Modifying Algorithms

std::copy()

Information Icon.pngSyntax
std::copy(first, last, result)

where first and last specify the range from which a series of values are to be copied and result is the first element of the container into which the values to are to be copied.

Verbatimcode.pngCode Illustration
cout << "\ncopying from one vector to another:\n";
vector<int> copyOfDataValues(dataValues.size());
std::copy(dataValues.begin(), dataValues.end(), copyOfDataValues.begin());

After execution, copyOfDataValues will contain the same data as the dataValues vector. Display the elements of each vector to verify they are the same.

Note that copy() can only place a value in an element that already exists; it cannot create any elements.


🤔  (A) What happens when the vector that will contain a copy of the data is an empty vector?
(B) What happens when the vector that will contain a copy of the data contains fewer elements than the range that is being copied?

(A) and (B): Undefined behavior in both cases: In (A) the destination has no storage allocated, so it will overflow if you copy values into it. In (B), there is a chance that the vector has internally allocated enough space to accept the values, but you have no guarantee of that (unless you checked prior to the copy). Always verify the size or use resize() first!


std::fill()

Information Icon.pngSyntax
std::fill(first, last, fillValue)

The fill() algorithm will fill a range with the specified value. For example,

Verbatimcode.pngCode Illustration
int numberOfPlayers = 2;
vector<int> playerLivesRemaining;
playerLivesRemaining.resize(numberOfPlayers);
std::fill(playerLivesRemaining.begin(), playerLivesRemaining.end(), 3);

would create a vector that would keep track of the number of lives remaining for each player in a game. fill() would start each player with 3 lives.

Note that fill() can only place a value in an element that already exists; it cannot create any elements.


🤔  What happens when the vector to be filled is an empty vector?

There are no elements to fill, so fill() does nothing. This is not an error.


std::reverse()

Information Icon.pngSyntax
std::reverse(first, last)

The reverse() algorithm will, as its name states, reverse the order of the data in the container range specified. Example code follows.

Verbatimcode.pngCode Illustration
std::reverse(dataValues.begin(), dataValues.end());

If the container's data has been sorted in ascending order, the data would then be in descending order.

std::swap()

Information Icon.pngSyntax
std::swap(firstValue, secondValue)
std::swap(firstContainer, secondContainer)

The swap() algorithm will work with two single values or with two containers. When given two single values as parameters, The values in the two parameters are swapped. Upon return from the function, firstValue will hold the value originally in secondValue while secondValue will hold the value originally in firstValue.

When swap() is given two containers as parameters, all of the elements in each are swapped. Upon return from the function, firstContainer will hold the elements originally stored in secondContainer while secondContainer will hold all of the elements that were originally stored in firstContainer.

Verbatimcode.pngCode Illustration
int base = 2, exponent = 10;
std::swap(base,exponent);

The above code would result in base holding the value 10 and exponent holding the value 2.

Verbatimcode.pngCode Illustration
vector<int> dataset1{1, 2, 3}, dataset2{6, 7, 8, 9};
std::swap(dataset1,dataset2);

Display the contents of each container to verify the values in each container were changed.

Sorting

std::sort()

Information Icon.pngSyntax
std::sort(first, last)

The sort() algorithm will change the data in the range specified. The data in the elements within the range will be moved so that the resulting sequence of values are sorted in ascending order. Example code follows.

Verbatimcode.pngCode Illustration
vector<int> dataToSort{12,54,89,76,10,43,45,87,22,55};
std::sort(dataToSort.begin(), dataToSort.end());

Display the data in dataToSort to verify it is now in sorted order.

std::is_sorted()

Information Icon.pngSyntax
std::is_sorted(first, last)

The is_sorted() algorithm will examine a container's data within the range specified and will return true if the data is sorted in ascending order and false if it is not sorted in ascending order.

Verbatimcode.pngCode Illustration
if (std::is_sorted(sortedData.begin(), sortedData.end()))
    cout << "The data is sorted in ascending order.\n";
else
    cout << "The data is not sorted in ascending order.\n";

Using the vectors defined previously, the above code would produce a message indicating that the container's data is in sorted order, while the code below would indicate that the data is not sorted.

Verbatimcode.pngCode Illustration
if (std::is_sorted(dataValues.begin(), dataValues.end()))
    cout << "The data is sorted in ascending order.\n";
else
    cout << "The data is not sorted in ascending order.\n";


Minimum/Maximum

std::min()

Information Icon.pngSyntax
std::min(firstValue, secondValue)

The min() algorithm will return the least of the two parameters. This algorithm works on two single values rather than a range of values. Using the search values specified earlier, consider the following code.

Verbatimcode.pngCode Illustration
int minimumValue = std::min(searchValue1, searchValue2);

After execution, minimumValue would hold the smallest of searchValue1 and searchValue2.

std::max()

Information Icon.pngSyntax
std::max(firstValue, secondValue)

The max() algorithm will return the greater of the two parameters. This algorithm works on two single values rather than a range of values. Using the search values specified earlier, consider the following code.

Verbatimcode.pngCode Illustration
int maximumValue = std::max(searchValue1, searchValue2);

After execution, maximumValue would hold the largest of searchValue1 and searchValue2.

std::min_element()

Information Icon.pngSyntax
std::min_element(first, last)

min_element() will return the smallest value in a range of elements defined by first and last. The data within the range is not necessarily in sorted order. Consider the following.

Verbatimcode.pngCode Illustration
vector<int>::iterator itMinimum = std::min_element(dataValues.begin(), dataValues.end());
cout << "The smallest value in dataValues is " << (*itMinimum) << endl;

Using dataValues as defined above, this code should report that the smallest value is 2.

Notice that the return value of this function is an iterator. The returned iterator will either indicate the smallest value in the range or it will be set to the same as the last parameter if the range specified is empty. To demonstrate this feature, consider the following code.

Verbatimcode.pngCode Illustration
vector<int> dataToRead;     // an empty vector
vector<int>::iterator itMinimum2 = std::min_element(dataToRead.begin(), dataToRead.end());
if (itMinimum2 != dataToRead.end())
    cout << "The smallest value in dataToRead is " << (*itMinimum) << endl;
else
    cout << "dataToRead is an empty vector and so has no minimum value.\n";

std::max_element()

Information Icon.pngSyntax
std::max_element(first, last)

max_element() will return the greatest value in a range of elements defined by first and last. The data within the range is not necessarily in sorted order. Consider the following.

Verbatimcode.pngCode Illustration
vector<int>::iterator itMaximum = std::max_element(dataValues.begin(), dataValues.end());
cout << "The largest value in dataValues is " << (*itMaximum) << endl;

Using dataValues as defined above, this code should report that the greatest value is 9.

Just as min_element() returns an iterator, so too does max_element(). If the range specified is empty, the iterator returned is the same as the last parameter.


Numeric

The numeric library contains common mathematical functions and types. We will review the accumulate() function below. The reader is encouraged to research the other functions found in the numeric library.

std::accumulate()

Information Icon.pngSyntax
std::accumulate(first, last, initialValue)

The accumulate() algorithm will calculate the summation of the data within the specified range. The third parameter sets the initial value of the sum. The accumulate() algorithm is found in the numeric library rather than the algorithm library. Be sure to add the following to the beginning of any program using the accumulate algorithm.

Verbatimcode.pngCode Illustration
#include<numeric>

With the numeric library accessible, consider the following code.

Verbatimcode.pngCode Illustration
vector<double> quarterlySales{10000, 20000, 25000, 30000};
double totalSales = std::accumulate(quarterlySales.begin(), quarterlySales.end(), 0);

After execution, totalSales would hold the value 85000.

Algorithms provided by Containers

Most of the STL containers (such as std::vector) provide some specialized algorithms as well. You should take a look at the following algorithms provided by std::vector:

Nonmodifying:
  at()            : get value of element at index (with bounds checking!)
  front()         : get value of first element
  back()          : get value of last element

Capacity:
  empty()         : returns true if container is empty, false otherwise
  size()          : gets size of container (# of elements stored)
  max_size()      : gets maximum possible number of elements
  shrink_to_fit() : reduce memory useage by fitting to current contents

Modifying:
  clear()         : clears all contents (empties the container)          
  erase()         : erase one element, given an iterator to it          
  insert()        : inserts element before a position (given an iterator)             
  pop_back()      : removes the last element              
  push_back()     : adds a new element at the end             
  resize()        : changes the size of the vector          
[...] There are others not shown here

For more information, visit cplusplus.com or cppreference.com.