# Courses/CS 2114/Lab Manual/Data Processing

## Data Processing

### Introduction

This lab will explore the use of the algorithms in C++ for processing data. Students are encouraged to build a program using the code provided in the tutorial.

Topics Covered in this Lab:

- Introduction to Data Processing
- Standard Template Library (STL)
- STL algorithms
Questions Answered in this Lab:

- What is data processing?
- What is an algorithm in regards to the STL?
- How are the algorithms provided in the STL used in a program?
Demonstrable Skills Acquired in this Lab:

- ability to use STL algorithms

### Introduction to Data Processing

According to the Google Dictionary, data processing is a series of operations on data, especially by a computer, to retrieve, transform, or classify information.

The various steps or processes of data processing, as shown on Computer Hope include:

**Validation**– ensuring that supplied data is without error, accurate and useful**Sorting**– arranging items in different sets and sequences**Summarization**– distilling detailed data to its main points**Aggregation**– combining all of the different pieces of data**Analysis**– the actual process of collecting, organizing, analyzing, interpreting, and presenting the data**Reporting**– a summary of the processed information**Classification**– separates data into various categories

Previous labs and homeworks have included some data processing (e.g., the calculation of the mean and standard deviation of a series of numbers). This lab will focus on data processing by implementing some of the algorithms provided in the Standard Template Library.

### Standard Template Library (STL)

The Standard Template Library (STL) is a library to provide common, generic classes and functions that implement many commonly used algorithms and data structures.

The architecture of STL is largely the creation of Alexander Stepanov. In 1979 he began working out his initial ideas of generic programming and exploring their potential for revolutionizing software development. Although David Musser had developed and advocated some aspects of generic programming already by year 1971, it was limited to a rather specialized area of software development (computer algebra).

Stepanov recognized the full potential for generic programming and persuaded his then-colleagues at General Electric Research and Development (including, primarily, David Musser and Deepak Kapur) that generic programming should be pursued as a comprehensive basis for software development. At the time there was no real support in any programming language for generic programming.

(Historical information found on Wikipedia.org)

The STL contains:

**Containers**- A container is an abstract data structure. Some examples of containers defined in the STL are`vectors`,`maps`and`queues`. We discussed`vectors`in a previous lab.**Iterators**- iterators provide a way to traverse and access data stored in containers.**Algorithms**- The`algorithms`library contains functions which implement several common algorithms. These algorithms are designed to be used across a range of elements, which can be found in arrays and STL containers. There are algorithms for such actions as sorting, searching, reversing, etc. We will discuss some of these algorithms in this tutorial.

### The `algorithm` Library

The `algorithm` library functions, for such common activities as searching and sorting, etc., most often work on ranges of elements. These ranges are defined as `[first, last)` where `first` is an iterator indicating the first element of the range and so is *included* in the range, as indicated by the left square bracket ``['`; `last` is an iterator indicating one element *past* the last element of the range and so is *excluded* from the range, as indicated by the right parenthesis ``)'`.

The algorithms in the `algorithm` library are categorized as follows:

**Nonmodifying algorithms**- Nonmodifying algorithms do not change the value of any element, nor do they modify the order in which the elements appear. They can be called for all of the standard STL container objects, as they make use of the simpler forward iterators.**Modifying algorithms**- Modifying algorithms are designed to alter the value of elements within a container. This can either be done on the container directly or via a copy into another container range. Some algorithms classed as 'modifying' algorithms can change the order of elements (in particular`copy_backward()`), but most do not.**Removal algorithms**- Removal algorithms are, by definition, modifying algorithms, but they are designed to remove elements in a container or when copying into another container.**Mutating algorithms**- Once again, mutating algorithms are modifying algorithms, but they are designed specifically to modify the order of elements (e.g. a random shuffle or rotation).**Sorting algorithms**- Sorting algorithms are modifying algorithms specifically designed for efficient sorting of elements in a container (or into a range container).**Sorted range algorithms**- Sorted range algorithms are special sorting algorithms designed to function on a container which is already sorted according to a particular sorting criterion. This allows for greater efficiency.**Numeric algorithms**- Numeric algorithms are designed to work on numerical data in some fashion. The principal algorithm in this category is`accumulate()`, which allows mathematical operators to be applied to all elements in a container.

There are many resources on the Internet that provide a list, description and example use of all the algorithms provided in the `algorithm` library; C++ Reference is one such resource. This tutorial will demonstrate only a smattering of the algorithms from across several of the above categories.

In order to use an algorithm, the `algorithm` library must be made accessible to the program and, in particular, the algorithm that will be used. Be sure to add the following preprocessor directive to the beginning of any program using an algorithm function. The example `using` statement shown below indicates that the `sort` algorithm will be used somewhere in the program.

| ||

```
#include<algorithm>
using std::sort;
``` |

(Category names and descriptions from QuantStart)

### Nonmodifying Algorithms

Nonmodifying algorithms are algorithms that do not change the data nor the container in which the data resides.

#### Searching

Searching is one of the most fundamental problems in Computer Science. Searching involves looking through a set of data for a particular value or set of values. Some search algorithms required the data being searched to be sorted while others do not. Consider the following two search algorithms.

The most basic algorithm for searching is the linear search, where the search begins at the first piece of data then moves to the second piece of data and so on until either the search key is found or the end of the data is reached. This algorithm does not required the data to be sorted. This search is implemented in the `find()` and `count()` algorithms which are demonstrated below.

A more advance algorithm is the binary search. The binary search requires the data to be sorted. The binary search begins the search in the middle of the data. If the data value being examined is the search key, the search is a success. If the search key is smaller than the data value being examined, the search concentrates on the first half of the data; otherwise, the search concentrates on the second half of the data. The new data range is then searched in the same manner until either the search key is found or there is no more data to search. This search is implemented in the `binary_search()` algorithm which is demonstrated below.

`std::find()`

| ||

```
std::find(first, last, searchValue)
``` |

The `find()` algorithm is one that works on a range of elements. The first parameter is the iterator of the first data element to be examined. The second parameter is the iterator that indicates the end of the data. The third parameter is the value for which to search.

`find()` begins searching in the first data element indicated via the `first` parameter. If this data value is not the same as the `searchValue`, the next next data value is examined. This process is repeated until either the data value being examined is the same as the `searchValue` or the end of data is reached.

`find()` returns an iterator. If the search value was found, the iterator that is returned specifies the first occurrence of the search value in the data. If the search value was not found, the iterator is the same as the iterator indicating the end of the data.

Consider the following data.

| ||

```
vector<int> dataValues{3, 6, 8, 4, 6, 9, 2, 4, 9, 6};
int searchValue1 = 4;
int searchValue2 = 7;
``` |

Using the data and search values defined above,

would produce a message indicating that the search value was found while

| ||

```
cout << "\nSearching for " << searchValue2 << ":\n";
vector<int>::iterator itOfSearchValue2 = std::find(dataValues.begin(), dataValues.end(), searchValue2);
if (itOfSearchValue2 != dataValues.end())
cout << searchValue2 << " was found: " << (*itOfSearchValue2) << endl;
else
cout << searchValue2 << " was not found.\n";
``` |

would produce a message indicating the the search value was not found.

`std::binary_search()`

| ||

```
std::binary_search(first, last, searchValue)
``` |

The `binary_search()` algorithm is one that works on a range of elements. The first parameter is the iterator of the first data element to be examined. The second parameter is the iterator that indicates the end of the data. The third parameter is the value for which to search.

The `find()` algorithm implements a linear search on data which means that all values are examined, and that the data does not have to be in sorted order. A single linear search on a small amount of unsorted data is not "expensive" time-wise. A single linear search on a large amount of unsorted data is not considered to be all that expensive either; however, repeated linear searches on large amounts of unsorted data can become expensive. The sorting of the data can also be expensive in regards to computing time, but often the time invested in the sorting process is surpassed by the time saved when conducting repeated searches.

As noted above, a linear search begins comparing the first data value with the search value, then the second, then the third, and so on, until the data is either found or the end of the data is reached. With unsorted data, there is no other alternative. With sorted data, a divide-and-conquer approach can be taken by first comparing the middle data value with the search value and using the result of that comparison to determine what data, if any, should be searched next. If the value to be found is smaller than the value in the middle of the data, only the first half of the data needs to be considered next. If the value to be found is bigger than the value in the middle of the data, then only the last half of the data needs to be considered next. Of course, if the value to be found was indeed found, the search is complete. This concept can be applied repeatedly to narrow the search to a smaller and smaller section of the data. This approach requires far fewer comparisons than the linear approach making it the more time efficient algorithm when repeatedly searching large amounts of data.

The `binary_search()` algorithm implements this approach to searching.

Consider the following data.

| ||

```
vector<int> sortedData{3, 4, 6, 8, 9, 11, 15, 16, 17};
``` |

Using the data in the `sortedData` vector and the two search values defined above,

| ||

```
cout << "\nSearching for " << searchValue1 << ":\n";
if (std::binary_search(sortedData.begin(), sortedData.end(), searchValue1))
cout << searchValue1 << " was found.\n";
else
cout << searchValue1 << " was not found.\n";
``` |

would produce a message indicating that the search value was found while

| ||

```
cout << "\nSearching for " << searchValue2 << ":\n";
if (std::binary_search(sortedData.begin(), sortedData.end(), searchValue2))
cout << searchValue2 << " was found.\n";
else
cout << searchValue2 << " was not found.\n";
``` |

would produce a message indicating the the search value was not found.

Note that the `find()` returns a location, while the `binary_search()` returns a Boolean result.

Note: There are three other algorithms using the binary search approach; these algorithms, along with `binary_search()`, although nonmodifying algorithms, are often categorized together under *Binary Search* on code reference sites such as cplusplus.com or cppreference.com.

`binary_search()` was not sorted?

If the container is not sorted, the divide-and-conquer approach would not work, since looking at the middle element would no longer give you any information about where the target might be located.

`std::count()`

| ||

```
std::count(first, last, valueToCount)
``` |

The `count()` algorithm is one that works on a range of elements. The first parameter is the iterator of the first data element to be examined. The second parameter is the iterator that indicates the end of the data. The third parameter is the value for which to search.

The `count()` algorithm will search the container linearly for the specified search value and return the number of times it was found.

The following two examples demonstrate when data is found and when it is not. The code assumes the same data and search values as defined previously.

| ||

```
cout << "\nNumber of times " << searchValue1 << " is seen in the data: "
<< std::count(dataValues.begin(), dataValues.end(), searchValue1) << endl;
``` |

The code above should indicate that the value was found 2 times while the code below should indicate that the value was found 0 times.

| ||

```
cout << "Number of times " << searchValue2 << " is seen in the data: "
<< std::count(dataValues.begin(), dataValues.end(), searchValue2) << endl;
``` |

### Modifying Algorithms

`std::copy()`

| ||

```
std::copy(first, last, result)
``` |

where `first` and `last` specify the range from which a series of values are to be copied and `result` is the first element of the container into which the values to are to be copied.

| ||

```
cout << "\ncopying from one vector to another:\n";
vector<int> copyOfDataValues(dataValues.size());
std::copy(dataValues.begin(), dataValues.end(), copyOfDataValues.begin());
``` |

After execution, `copyOfDataValues` will contain the same data as the `dataValues` vector. Display the elements of each vector to verify they are the same.

Note that `copy()` can only place a value in an element that already exists; it cannot create any elements.

(B) What happens when the vector that will contain a copy of the data contains fewer elements than the range that is being copied?

(A) and (B): *Undefined behavior* in both cases: In (A) the destination has no storage allocated, so it will overflow if you copy values into it. In (B), there is a chance that the vector has internally allocated enough space to accept the values, but you have no guarantee of that (unless you checked prior to the copy). Always verify the size or use `resize()` first!

`std::fill()`

| ||

```
std::fill(first, last, fillValue)
``` |

The `fill()` algorithm will fill a range with the specified value. For example,

| ||

```
int numberOfPlayers = 2;
vector<int> playerLivesRemaining;
playerLivesRemaining.resize(numberOfPlayers);
std::fill(playerLivesRemaining.begin(), playerLivesRemaining.end(), 3);
``` |

would create a vector that would keep track of the number of lives remaining for each player in a game. `fill()` would start each player with `3` lives.

Note that `fill()` can only place a value in an element that already exists; it cannot create any elements.

What happens when the vector to be filled is an empty vector?

There are no elements to fill, so `fill()` does nothing. This is not an error.

`std::reverse()`

| ||

```
std::reverse(first, last)
``` |

The `reverse()` algorithm will, as its name states, reverse the order of the data in the container range specified. Example code follows.

| ||

```
std::reverse(dataValues.begin(), dataValues.end());
``` |

If the container's data has been sorted in ascending order, the data would then be in descending order.

`std::swap()`

| ||

```
std::swap(firstValue, secondValue)
std::swap(firstContainer, secondContainer)
``` |

The `swap()` algorithm will work with two single values or with two containers. When given two single values as parameters, The values in the two parameters are swapped. Upon return from the function, `firstValue` will hold the value originally in `secondValue` while `secondValue` will hold the value originally in `firstValue`.

When `swap()` is given two containers as parameters, all of the elements in each are swapped. Upon return from the function, `firstContainer` will hold the elements originally stored in `secondContainer` while `secondContainer` will hold all of the elements that were originally stored in `firstContainer`.

| ||

```
int base = 2, exponent = 10;
std::swap(base,exponent);
``` |

The above code would result in `base` holding the value `10` and `exponent` holding the value `2`.

| ||

```
vector<int> dataset1{1, 2, 3}, dataset2{6, 7, 8, 9};
std::swap(dataset1,dataset2);
``` |

Display the contents of each container to verify the values in each container were changed.

### Sorting

`std::sort()`

| ||

```
std::sort(first, last)
``` |

The `sort()` algorithm will change the data in the range specified. The data in the elements within the range will be moved so that the resulting sequence of values are sorted in ascending order. Example code follows.

| ||

```
vector<int> dataToSort{12,54,89,76,10,43,45,87,22,55};
std::sort(dataToSort.begin(), dataToSort.end());
``` |

Display the data in `dataToSort` to verify it is now in sorted order.

`std::is_sorted()`

| ||

```
std::is_sorted(first, last)
``` |

The `is_sorted()` algorithm will examine a container's data within the range specified and will return `true` if the data is sorted in ascending order and `false` if it is not sorted in ascending order.

| ||

```
if (std::is_sorted(sortedData.begin(), sortedData.end()))
cout << "The data is sorted in ascending order.\n";
else
cout << "The data is not sorted in ascending order.\n";
``` |

Using the vectors defined previously, the above code would produce a message indicating that the container's data is in sorted order, while the code below would indicate that the data is not sorted.

| ||

```
if (std::is_sorted(dataValues.begin(), dataValues.end()))
cout << "The data is sorted in ascending order.\n";
else
cout << "The data is not sorted in ascending order.\n";
``` |

### Minimum/Maximum

`std::min()`

| ||

```
std::min(firstValue, secondValue)
``` |

The `min()` algorithm will return the least of the two parameters. This algorithm works on two single values rather than a range of values. Using the search values specified earlier, consider the following code.

| ||

```
int minimumValue = std::min(searchValue1, searchValue2);
``` |

After execution, `minimumValue` would hold the smallest of `searchValue1` and `searchValue2`.

`std::max()`

| ||

```
std::max(firstValue, secondValue)
``` |

The `max()` algorithm will return the greater of the two parameters. This algorithm works on two single values rather than a range of values. Using the search values specified earlier, consider the following code.

| ||

```
int maximumValue = std::max(searchValue1, searchValue2);
``` |

After execution, `maximumValue` would hold the largest of `searchValue1` and `searchValue2`.

`std::min_element()`

| ||

```
std::min_element(first, last)
``` |

`min_element()` will return the smallest value in a range of elements defined by `first` and `last`. The data within the range is not necessarily in sorted order. Consider the following.

| ||

```
vector<int>::iterator itMinimum = std::min_element(dataValues.begin(), dataValues.end());
cout << "The smallest value in dataValues is " << (*itMinimum) << endl;
``` |

Using `dataValues` as defined above, this code should report that the smallest value is `2`.

Notice that the return value of this function is an iterator. The returned iterator will either indicate the smallest value in the range or it will be set to the same as the `last` parameter if the range specified is empty. To demonstrate this feature, consider the following code.

| ||

```
vector<int> dataToRead; // an empty vector
vector<int>::iterator itMinimum2 = std::min_element(dataToRead.begin(), dataToRead.end());
if (itMinimum2 != dataToRead.end())
cout << "The smallest value in dataToRead is " << (*itMinimum) << endl;
else
cout << "dataToRead is an empty vector and so has no minimum value.\n";
``` |

`std::max_element()`

| ||

```
std::max_element(first, last)
``` |

`max_element()` will return the greatest value in a range of elements defined by `first` and `last`. The data within the range is not necessarily in sorted order. Consider the following.

| ||

```
vector<int>::iterator itMaximum = std::max_element(dataValues.begin(), dataValues.end());
cout << "The largest value in dataValues is " << (*itMaximum) << endl;
``` |

Using `dataValues` as defined above, this code should report that the greatest value is `9`.

Just as `min_element()` returns an iterator, so too does `max_element()`. If the range specified is empty, the iterator returned is the same as the `last` parameter.

### Numeric

The `numeric` library contains common mathematical functions and types. We will review the `accumulate()` function below. The reader is encouraged to research the other functions found in the `numeric` library.

`std::accumulate()`

| ||

```
std::accumulate(first, last, initialValue)
``` |

The `accumulate()` algorithm will calculate the summation of the data within the specified range. The third parameter sets the initial value of the sum. The `accumulate()` algorithm is found in the `numeric` library rather than the `algorithm` library. Be sure to add the following to the beginning of any program using the `accumulate` algorithm.

| ||

```
#include<numeric>
``` |

With the `numeric` library accessible, consider the following code.

| ||

```
vector<double> quarterlySales{10000, 20000, 25000, 30000};
double totalSales = std::accumulate(quarterlySales.begin(), quarterlySales.end(), 0);
``` |

After execution, `totalSales` would hold the value `85000`.

## Algorithms provided by Containers

Most of the STL containers (such as `std::vector`) provide some specialized algorithms as well. You should take a look at the following algorithms provided by `std::vector`:

Nonmodifying: at() : get value of element at index (with bounds checking!) front() : get value of first element back() : get value of last element Capacity: empty() : returns true if container is empty, false otherwise size() : gets size of container (# of elements stored) max_size() : gets maximum possible number of elements shrink_to_fit() : reduce memory useage by fitting to current contents Modifying: clear() : clears all contents (empties the container) erase() : erase one element, given an iterator to it insert() : inserts element before a position (given an iterator) pop_back() : removes the last element push_back() : adds a new element at the end resize() : changes the size of the vector [...] There are others not shown here

For more information, visit cplusplus.com or cppreference.com.