We think you are located in South Africa. Is this correct?

Data Warehousing

image

image CHAPTER OVERVIEW

Unit 1.1 Data collection
Unit 1.2 Data warehousing
Unit 1.3 Data mining
Unit 1.4 Caring for and managing data

image Learning outcomes

At the end of this chapter you should be able to:

  • provide an overview of data collection
  • provide examples of data collection
  • describe data warehousing
  • compare data warehousing with database
  • describe data mining and provide examples
  • describe the processes of data mining
  • describe how data should be cared for and managed.

INTRODUCTION

In this chapter, you will learn about what data collection is and also look at some examples where data collection is used. Several key warehousing concepts will be introduced. This will enable you to compare data warehousing with a database. You will also learn how to collect and mine data. However, before learning any new concepts, let’s revise what we know about databases.

DATABASES IN A NUTSHELL

Computers store data for a number of reasons. viz. Instructions that a computer uses to function, data required in RAM when an application is being used, data in the form of Images and files that the user’s application requires. The structures commonly used to store data are: Files and Databases.

FILES AND DATABASES

Data is saved in the computer’s memory while a user is working in an application. The information that the user will need to access later, is stored in a database or file on a more permanent storage devices.

image
Did you know

Data is unprocessed numbers, facts or signals. In order for humans to use it and for computers to share it, the data needs to be processed and organised into meaningful information. One way in which data is shared amongst people is by using letters and characters.

Databases are a collection of organised data. Databases are the single most important tool used to store data by programmers and web developers. It can, for example, be used to store everything such as, the settings in an application, text on a website, graphics, status updates, personal messages and comments on social networks.

DATABASE MANAGEMENT SOFTWARE (DBMS)

Database management software (DBMS) is the software responsible for managing a database. This includes creating the database and tables and managing data security. Examples of popular database management software include: Microsoft SQL Server, Microsoft Access, MySQL and SQLite.

Activity 1.1 Revision Activity

1.1.1Define data.

1.1.2What does a computer do with the data?

1.1.3List two reasons why computers store data.

1.1.4Define the term, database.

1.1.5List two instances where a database can be used.

1.1.6Explain DBMS and its advantages.

1.1.7List two examples of DBMS.

UNIT
1.1 Data collection

In Grade 11, you created a database and manually added some data to your tables. This is a very ineffective way of adding data to a database and is only suitable for small databases. Instead of using manual entries, most databases use different techniques to capture data automatically. In this section, you will learn about a few of these techniques.

FORMS

A web form, is an online interactive page that allows for user input. This page imitates a physical form that users need to complete. To assist with user input and limit errors, a web form usually contains GUI components such as:

  • checkboxes
  • combo boxes
  • spinners
  • drop down lists
  • text boxes
  • buttons, and so on.

Web designers can also create quite sophisticated web forms for most online activities such as banking, shopping and job or enrolment applications by using programmed objects, functions and methods. These Web forms have changed the way that business is done because it limits paperwork and documentation, and favours online documentation.

TAGS

Electronic tags transmit a radio frequency from the tag to a tag reader, or vica versa. It transmits the information to a computer program associated with the tag. Tags can be used to track or identify different items and are often used in merchandising warehouses, for vehicle tracking, pet tracking, and so on. Let’s look at some examples of how tags are used.

RFID

RFID stands for radio frequency identification and refers to tiny chips (like those in a phone SIM card) that can store a few kilobytes of information. When scanned, this information is read and displayed on a computer. By using the correct software, it could also be added to a database directly.

image
Figure 1.1: An RFID access card

RFID is used by thousands of businesses across the world. For example, RFID can be used to tag:

  • all products stored in a warehouse. When an item is removed from the warehouse, it is automatically scanned and removed from the database.
  • tools stored in a workshop. This allows the company to track which tools are being used, which employees are using them, and when the tools are returned.
  • the tickets of people going to conferences, sports events or concerts. When a registered RFID chip arrives at the gate, the gates open automatically, and the data on the ticket is added to the database.
  • the tickets of people using public transport. For example, the Gautrain uses RFID cards to allow people to board trains. The trip is recorded on a database, and the cost of the trip is subtracted from the user’s balance.
  • all products sold in a shop. When an item is purchased, the RFID tags are scanned and the details about the product are retrieved and added to the bill. The item can then also be removed from the shop’s inventory automatically.

A basic RFID reader will only scan the information stored on RFID chips and make them available to a computer. However, there are many software solutions that will automatically take this information and add it to a database. These applications may also write additional information to the database, like the date and time that the RFID chip was scanned and the location.

E-TOLLS AND RFID

In December 2014, South African National Roads Agency Ltd (SANRAL) opened its electronic toll (or e-toll) system in Gauteng. The goal of the system was to help the government pay for the R20 billion highway improvement project that had been running since 2007.

image
Figure 1.2: An e-toll gantry

Motorists had to purchase an e-tag (the RFID chip read by the toll gantry). The system used a combination of cameras and RFID readers to automatically record the information of any car driving through the gantry. This information is then stored in a database, which is used to automatically generate an invoice at the end of each month.

DIGITAL SENSORS

A digital sensor is an electronic or electrochemical sensor, where data conversion and data transmission are done digitally, for example, temperature, distance, humidity, light. Sensors sense a wide range of different energy forms like movement, electrical signals, radiant energy, pressure, thermal or magnetic energy. For example, wireless sensor tags connect events in physical world, for example, motion in an area, whether a door/window is opened/closed, temperature or humidity exceeding a certain limit, or smartphones or any Web browsers with internet access.

image Activity 1.2

Answer the following questions:

1.2.1Do some research to find out how this e-toll system automatically detects information for:

a.vehicles with an e-tag, and

b.vehicles without an e-tags.

Write a short report on how reliable you think this system is.

Discuss ways in which:

i.the system could work, and

ii.how this system could work well.

1.2.2Can you think of any other situations where RFID chips can be used to automatically create an invoice?

image
QR CODE

INVISIBLE ONLINE DATA COLLECTION

Databases form the heart (or memory) of the internet because they store most of the information shown on websites today.

Databases are especially important on websites where users create the content (like YouTube, Facebook and Wikipedia), as it would be impossible for a data capturer to add this information manually. Instead, these websites are set up so that any information entered by users are automatically stored in different databases. This includes items like status updates, likes, tweets and any videos or pictures the user uploaded. The databases on computers or mobile phones also store the user’s personal information, such as their email address, username and password.

image
Figure 1.3: Social media usernames and passwords are stored in a database

image
Animation

Animation showing a banking transaction. Money is subtracted from one account, then transferred from starting bank to another bank over the internet, but before the destination account the connection is broken. This causes all the transactions to be rolled back.

image

image
Figure 1.4: Credit card payments make use of database transactions

A cookie is a message given to a web browser by a web server. The web browser stores the message in a text file. Each time the browser requests a page from the server the message is sent back to the server. The cookie identifies users and could customise web pages for each person. When a person first enters a website using cookies, they may be asked to fill out a form providing personal information, which is packaged into a cookie and sent to the browser. The next time you go to the same website, your browser will send the cookie to the server.

Since this incredibly detailed information is recorded for billions of internet users, online advertising companies require massive databases with software that can automatically track users and their activity across web pages.

There are hundreds of different systems that have been designed specifically to automatically record information in databases, for example credit card payments, automatic toll gates, cookies and cell phone calls.

In all these examples, software was specifically built to read the information and record it in a database. This software aiso allows the companies to create automatic reports from the database. For example, you may receive a credit card statement or phone bill at the end of each month. This is a report that your service provider’s database generates automatically.

TRANSACTION TRACKING

Transaction data (type of transaction, store where transaction occurred, employee who performed the transaction, customer information, payment information including signatures if captured) is sent from stores to the corporate database. Data is stored on cards, such as credit cards, store cards (for example, Jet or Edgars) and store loyalty cards (for example, Clicks ClubCard, Pick ‘n Pay SmartShopper card). Some transaction tracking benefits include: consumer safety when it comes to purchasing toys, cars, food and medication, enhancing a user’s on-site experience by tracking browsing history and demographic profiles, fraud detection, and so on. The main disadvantage of transaction tracking is the unauthorised use of your personal Information.

LOCATION BASED DATA

Location based data is data that provides information about different things that can be shown on maps. Static location-based data includes data such as maps with roads, street names, shops, offices and other features, but also farming information, government buildings and assets, electric distribution networks with pylons and substations, military information and so on. Non-static information obtains the position of aeroplanes, tracking transport or stolen vehicles, personal positioning, satellites, traffic patterns, rhinos and even pets; all of which are based on input from geographic positioning systems (GPS).

image
Figure 1.5: A city map based on a GIS database

Location based services (LBS) are services that use software applications together with the location-based databases to deliver a service such as finding the best route to a destination or shop, find the location or movement of a stolen vehicle, a lost pet, find the characters in the Pokemon GO game or the places where photographs were taken during a holiday trip.

Smartphones and tablets are also better at location-based computing, which refers to software based on a user’s location. Examples of location-based computing includes:

  • Weather applications tell you what the weather in your specific area will be, based on your location.
  • Food ordering applications can use your location to deliver food directly to you.
  • Car sharing services, such as, Uber use your phone’s location to find the nearest Uber driver, who will pick you up and drop you off at the correct location.

The problem with so much data being stored in databases is how to make sense of it and use it. In the next section, you will see how companies mine databases to improve their decision making.

image Activity 1.3 Data collection

1.3.1Define a web form and provide an example.

1.3.2Provide a brief explanation of what tagging is and give an example of how a tagging system (not mentioned in this book) has been used in South Africa. (You will need to do a bit of research to answer this question.)

1.3.3Choose the answer and write ‘true’ or ‘false’ next to the question number. Correct the statement if it is FALSE. (You may not simply use the word ‘NOT’ to change the statement.)

a.RFID stands for radio frequency identifier and refers to tiny chips that can store a few kilobytes of information.

b.A basic RFID reader will scan all the information stored on the product with an RFID tag and make them available to a computer.

c.Databases are especially important on websites where users create the content, since it would be impossible for a programmer to add this information manually

d.Credit card payments, automatic toll gates and cell phone calls automatically record information in databases.

e.RFIDs automatically identify objects, collect data about them, and enter those data directly into computer systems with little or no human intervention.

f.Databases are not important on websites where users create the content, since it is possible for a data capturer to add this information manually.

g.These websites are set up so that any information entered by users are manually stored in different databases.

h.Invisible data capturing refers to data that is captured online, usually entered by the user.

i.Online data capturing refers to data that is automatically recorded and captured on a database without requiring user input and usually without informing the user.

j.Websites help advertisers to create advertisements that are specific to each user. To do this, advertisers create a database recording information from each user on the website that includes information like what the user searches for, what links they click on and how long they stay on different pages.

1.3.4Whenever an item is purchased from an online shop and the payment is received, several database transactions must occur. Underline what this includes:

  • Creating a customer order in the database.
  • Removing the purchased items from the company’s inventory.
  • Subtracting the amount paid from the amount owed on the customer’s account.
  • Placing an order with a shipping company.

1.3.5Explain how websites use the data they collect to determine which advertisements to run on the website.

1.3.6Which other examples do you know of that are commonly used to add information to a database automatically?

1.3.7On many websites, when an internet transaction is done, the data can be captured without the user’s knowledge or consent.

a.How can this type of data capturing be used in day-to-day life? Give an example of the data that is captured to support your answer.

b.RFID chips can also be used to automatically capture large amounts of data. What does RFID stand for?

c.Give four examples of how RFID cards are used in real life to capture data.

1.3.8What does transaction processing refer to?

1.3.9Why is this it important to group transfers into a single transaction?

1.3.10What is meant by ‘rolled back’ in the following sentence:

‘If a single task fails, the transaction is rolled-back.’

1.3.11Why is a complete roll-back important? Explain it in terms of a bank transfer being cancelled during the final step.

1.3.12Banks and online shops use transaction processing systems. Can you think of any other situations in which transaction processing systems may be used? Make a list.

1.3.13Answer the following questions in your own words.

a.What is location-based data and what is its main goal?

b.Name and shortly describe each of the two types of location-based data.

c.Which technology is generally used for location-based computing?

d.How would location-based computing assist the organisers in gathering information about the participants using mobile devices?

e.What is the name of the system used to determine the position of non-static devices?

f.What two technologies are combined by location-based computing to deliver a service to customers?

g.Give three examples of location-based services.

h.The Computer Olympiad will be managed from a main office and all communications will be sent from this office. On the day of the event the data will be distributed electronically to all institutions that have registered to participate in the Computer Olympiad.

Suggest ONE advantage for the Computer Olympiad if they use location-based computing.

UNIT
1.2 Data warehousing

A warehouse is a place where you can store a large number of items or products.

image
Did you know

Normalisation is the process of refining the structure of a database to minimise redundancy and improve integrity.

WHAT IS DATA WAREHOUSING?

Data warehousing is similar to how factories and distribution companies store their products in a warehouse in real life. It is a technique for storing data from more than one database. Unlike most databases, where the data is normalised and stored in the most efficient manner, data warehousing takes the data from these databases and stores it in a non-normalised way.

Since the data is not normalised, there may be redundancies and it can use additional storage space.

image
Figure 1.6: Data warehousing is used to make data available for analysis

image
Did you know

Data redundancy occurs within a database when the same piece of data is held in two different places.

The goal of a data warehouse is to make the data available and ready for analysis.

It is important to understand that a data warehouse does not contain copies of the original databases, instead, it is a new database that is created to hold structured data from multiple sources. A data warehouse is usually opened to multiple people in different departments of a company. Each of these people can then access the data and use it to create graphs and reports useful to them.

image
Did you know

Data warehousing is a key tool in business intelligence (BI), which refers to several technologies and applications that are used to improve business decision making.

DATA WAREHOUSING VS DATABASE

Data warehouses and databases are similar in that they are both relational data systems. However, they were built to serve different purposes.

  • A data warehouse stores a large quantity of historical data and all the data from the various databases are availabe.
  • A database stores current transactions and enables easy access to specific transactions for ongoing business processes.

image Activity 1.4 Data warehousing

1.4.1In real life, a warehouse is a place where you can store many items or products.

a.What is a data warehouse?

b.How does data warehousing store the data? Explain it in your own words.

c.What are two advantages of denormalised databases?

d.What is the goal of a data warehouse?

e.Does the data warehouse replace the original data base? Give a reason for your answer:

1.4.2Explain in one or two sentences what this diagram means.

image

1.4.3Tabulate two differences between data warehousing and database.

UNIT
1.3 Data mining

WHAT IS DATA MINING?

Data mining is a process used to identify trends and patterns between different sets of data in large databases. Selecting the right data from such large amounts of data (called big data) can help show trends and patterns between data sets, which can improve decision making dramatically. For example:

Marketing: Data mining is used to explore increasingly large databases to improve market segmentation. By extracting data like customer age, gender, tastes etc. and looking for patterns and trends between them, it is possible to guess their behaviour in order to direct personalised loyalty campaigns. Data mining in marketing also predicts which users are likely to unsubscribe from a service, what interests them based on their searches, or what a mailing list should include to achieve a successful reponse rate.

image Activity 1.5 Data mining

You have been given access to three massive data sets containing hundreds of thousands of records. The three data sets are:

  • Sales information from Amazon, including sales information per product, per product category and per country.
  • Number of viewers per televisions show, including information on the time of day and the demographics of the people watching.
  • List of tweets from the past year, including the content of the tweet, the hashtags used and the location the tweet was sent from.

Based on these datasets, complete the following tasks.

1.5.1Give three questions that would you be interested in answering using these data sets.

1.5.2Describe how a business owner could you use these datasets to improve their decision making on the following topics:

a.Women are 50% more likely to purchase your product than men. When should the business advertise their product on TV?

b.An electronics company is not sure if they should make an affordable electric toothbrush, or a high-quality expensive toothbrush.

c.A restaurant owner has noticed that people often talk about food on Twitter. How can he use this information to decide which type of restaurant to open in Johannesburg?

To make sense of the massive amounts of information available, people need to follow a basic process.This is called the data mining process.

DATA MINING PROCESS

To mine a database, you will:

  • extract the relevant data
  • look for patterns in the data
  • discover knowledge from the patterns

This section will look at each of these steps in some detail.

image
Animation

Animation showing how small pieces from different tables are taken and combines to create a new database.

image

EXTRACT THE RELEVANT DATA

Once you have access to a database, you need to extract the data that will be useful to you. Many of these databases store incredible amounts of data (such as the UN database, which contains information for most countries in the world), and you may only need a small section of that data to answer your questions. Rather than filtering a massive database the whole time, you use the full database to create a new database containing only the information relevant to you.

In most situations, data will be extracted from the datasets using SQL. This is a programming language that was created specifically to manipulate and extract data from a database. By using SQL queries, users can specify exactly which data they want to extract. This is done by specifying:

  • Which fields to extract
  • Which data table to use
  • What conditions are used

Only data that meets these requirements will then be extracted to be used for analysis.

LOOK FOR PATTERNS IN THE DATA

When working with large amounts of data, it is impossible to look at each data point on its own. Instead, you need to look for patterns in the data that will help you understand the dataset. As you will see in the later steps, these patterns can result in knowledge, which is used to make better decisions and develop strategies.

For example, you may want to determine whether fewer people use the highway between Pretoria and Johannesburg today than before the Gautrain was built. To do this, you may find Gautrain project planning data for 2014 and N1 toll gate data for 2019 and calculate daily and monthly averages that you can then compare. Based on this analysis, you will know if the Gautrain decreased the traffic on the N1 or not, and importantly, by how much. This information can then be used to decide whether or not to build a system similar to the Gautrain system for Durban or Cape Town, who have similar traffic problems.

image
Figure 1.7: The Gautrain is a high-speed train traveling between Pretoria and Johannesburg
DISCOVER KNOWLEDGE

By identifying these patterns, you have turned an overwhelming amount of disorganised data into a few useful facts. This allows you to understand what happened rather than rely on vague uninformed views of people. Once you have confirmed the situation, informed decisions can be made, or strategies developed.

Example 1.1 Data mining used for Government Social Grants

Social grants are administered by the South African Social Security Agency (SASSA). SASSA is mandated to “ensure the provision of comprehensive social security services against vulnerability and poverty within the constitutional legislative framework”. Most social grants are “means tested”, this assesses the value of assets and income. Only a certain threshold is eligible for a social grant.

To measure the living circumstances of South African households, the Government conducts an annual General Household Survey (GHS). This is a household survey which collects big data on education, health, social development, housing, access to services and facilities, food security, and agriculture.

Using data mining techniques the relevant data that will be useful is extracted. This data is then analysed (finding patterns) and used to obtain information/ knowledge.

This knowledge can then be used to provide the necessary budget for social grants and to know what limit to set the threshold or provide reports like:

  • The number of households receiving at least one form of social grant rose from 29,9% (of the population) in 2003 to 44,3% in 2010, according to the latest General Household Survey report2 from Stats SA. This rose to 45,5%in 2015.
    Further details and analysis on the issue of the growing social welfare net are available from Stats SA’s Poverty Trends in South Africa report3.
image

Example 1.2 Data mining used by Facebook

Facebook (and other Web giants) accumulate all our personal data over time. Increasingly, identifiable data collection is happening in more dimensions than are ever understood by most users. The more data there is in one place, the more data mining can be done. Over time, and in context of other individual data points, it becomes Big Data. Using data integration, it’s then mixed with other data sources that, as end-users, we’ll never be aware.

Companies, individuals and Facebook Itself that have access to this data extract the relevant data that Is useful to them and store it in a smaller database.

Apps, that use data analytics, are used to analyze “friends of friends” comments, textual analysis, online behaviour, and so on, to compile data about us.

This information/knowledge is then used to: determine our current emotional state, correlate how sad or depressed someone might be, suggest possible friends etc.

image Activity 1.6 Data mining

1.6.1The tour company requires the use of a database management system to store details of tours and tour guides. A record of the tours and the agent responsible for each tour is stored in a table called ‘TBL Tours’. The first five records of the table are shown below.

image

This data in the database can be used for data mining.

a.Give examples of three questions that can be analysed using the database?

b.Whom will your answers help? I.e. who will get the advantage of the answers obtained from your questions?

c.What must you do to mine a database?

d.To gain access to a database, which of the following options can you legally apply?

i.You can ask someone that you know to get access for the database.

ii.You can buy access to their database from a company.

iii.You can give money to an employee of the organisation to get you access to the database.

iv.You can hack into a database.

v.You could personally request access to the database.

vi.You could use a scraper or the website’s API to record data directly from the site.

vii.You can use a free database from the internet.

1.6.2Which website does not provide data sets for free.

a.Code for South Africa

b.Gapminder

c.University of Pretoria

d.Reddit comments

1.6.3What must you do after you obtained access to the database?

a.Filter the information

b.Read through the information to find what you are looking for.

c.Look for patterns in the information you obtained.

d.Extract the information, creating a new database that contains only the information relevant to you.

1.6.4The following is an example of a pattern that may emerge from the TBL Tour database.

a.Senzo does every second trip.

b.The starting dates are evenly spaced.

c.The lengths of the tours are all the same.

d.There is an even distribution between the different destinations.

1.6.5The final step in the data mining process is _____

a.Extracting data.

b.Obtaining knowledge

c.Developing a strategy.

d.Accessing a data set

1.6.6Which of the following targets are SMART.

a.By decreasing the price of Gautrain tickets by 70%, you hope to increase the number of passengers by 20% over the next six months.

b.By decreasing the price of Gautrain tickets, you hope to increase the number of passengers by 20%.

c.By decreasing the price of Gautrain tickets by 10%, you hope to increase the number of passengers by 20% over the next six months.

UNIT
1.4 Caring for and managing data

In this unit, you will learn more about the value of data and how data can be protected.

VALUE OF DATA AND METADATA

Data can be valuable for many different reasons. For example, online shopping websites can charge owners a fee for placing an advertisement of their house, used car or gym equipment on the website. However, the only way in which owners will be willing to pay this fee is if the database already contains many other products. To gather the data needed to sell these products, the website’s creator can ask sellers to enter the important data for their products on the website, from where it is added to the database.

For a database to be useful, it needs to record and store valuable and useful data. To decide if the data is valuable to record, you can ask yourself these questions:

  • Will I ever use the data in this field?
  • Will anyone else use the data in this field?
  • What fields do I need specifically for my application?
  • What fields would I need for my application in the future?

Based on these questions, you can determine whether certain types of data are valuable to record and store in your database. Only data that will potentially be used by you (either now, or in the future) is valuable. Data that will not be used increases the size of your database without adding value to it.

Once you have determined whether the data itself is valuable, you need to ensure that the data is of a high quality. The following five characteristics determine the quality of the data.

image

HOW TO PROTECT DATA

Databases need to be protected from several different threats, including incorrect data entry, data corruption, data loss, accidental data deletion, purposeful data deletion and unauthorised access. In the following sections, you will look at eight tools and techniques that can be used to protect databases from these threats:

  • Data validation
  • Data verification
  • Data integrity checks
  • Logging changes
  • Data warehousing
  • Access control
  • Parallel data sets

Since no single tool offers perfect protection against all threats, most large databases use most or all these tools.

DATA VALIDATION

Data validation refers to the process in which you check whether the data is accurate, in the correct format or of the correct type before allowing your database to record it. This ensures that the data in your database is consistent and accurate. It can also be used to prevent certain types of unauthorised access (such as SQL injections).

For example, if you are recording data for a government survey, you can use a digit validation to ensure all people’s ID numbers are exactly 13 digits long. This prevents you from accidentally adding incorrect ID numbers to the database.

DATA VERIFICATION

Data verification is a manual technique that can be used to make sure that the data on a database is correct and accurate. There are two main ways to do this:

  • Full verification which requires that each piece of data that is entered into a database is read and checked by someone. This, however, can be very time consuming.
  • Sample verification, in which a randomly selected sample of data is checked to ensure there are not systematic errors. This can help you to notice errors that occur in all data, but with this method it is possible to miss small mistakes (like typing errors).
DATA INTEGRITY

When we speak about data integrity, we are referring to the reliability, accuracy and how trustworthy data is over its entire lifecycle. This ties in with data security because uncorrupted data (integrity) is considered to be ‘clean data’ that stays unchanged throughout its lifecycle.

Many DBMSs have built-in integrity controls that help to maintain the data integrity. This includes features such as data validation and access control, but also refers to several automatic maintenance processes to prevent data corruption and errors from occurring in a database.

LOGGING CHANGES

Logging is the process of recording any changes made by users to a database. While logging does not directly protect a database from any possible threats, it can discourage people from sabotaging a database and make it easier to solve any problems once they occur.

This is called creating an audit trail. The audit trail records exactly:

  • who made the changes
  • what the user changed
  • when they made the changes.
DATA WAREHOUSING

Data warehousing is a technique used for storing data from more than one database, it is usually stored in a way that is secure, reliable and easy to retrieve. While data warehousing is generally not considered as a security tool, it can help improve data integrity in several ways:

  • Data warehouses make incorrect data entries or data corruption more visible by allowing data analysis.
  • Data warehouses can make data loss more visible, allowing the problem to be fixed.
  • Data warehouses can be used to recover critical data if it is deleted or corrupted.

Data warehouses are also a potential security weakness for databases. Your database could have all the best security features in the world, and it would not matter if hackers could simply access your data warehouse.

ACCESS CONTROL

Access control refers to managing and controlling the parts of a database that users have access to. By limiting the number of people who can change a database, and by limiting what changes each user can make, you can reduce the damage that any single user can do to a database. This becomes especially important when many users have access to your database.

There are three important ways to control access to your data:

  • Passwords ensure that only the owner of a username can log in with that username.
  • User rights determine which tables and fields every username can access, and what changes (if any) the user can make to these tables. This prevents others from gaining access to certain data or making changes they are not authorised to do.
  • Good database security ensures that the data is secure and that outside people cannot find other ways to access the database.

For example, by using access control on a large news website, regular guests may be able to read stories but not change the data. Writers may be able to add new stories to the website but not add new users or delete stories. Finally, one or two administrators may be able to add new users or delete stories.

PARALLEL DATA SETS

Backups are the most important tool to protect databases from data loss and data corruption. To ensure that data has not been corrupted or deleted, the database is checked at intervals against a perfect copy of it, called a parallel data set. If there are differences, it means that data was either corrupted or deleted.

A company, like YouTube, would break the database into smaller pieces and have multiple backups of each piece. In this way you never risk losing the entire database at once, and even if a database crash occurs, the system can automatically load an existing backup while creating new backups.

Database backups should be protected as securely as the database itself. There have been many instances of websites where the database backup, rather than thecan protect data from data corruption, data loss, accidental deletion and purposeful deletion.

image
QR CODE

image Activity 1.7 Caring for and managing data

1.7.1Choose the correct answer.

a.Data validation refers to _____?

A.Built-in integrity controls that help to maintain the data integrity.

B.A technique used for storing data from more than one database in a way that is secure, reliable and easy to retrieve.

C.The process in which you check whether the data is accurate.

D.A manual technique that can be used to make sure that the data on a database is correct and accurate.

b.Verification refers to _____?

A.Built-in integrity controls that help to maintain the data integrity.

B.A technique used for storing data from more than one database in a way that is secure, reliable and easy to retrieve.

C.The process in which you check whether the data is accurate.

D.A manual technique that can be used to make sure that the data on a database is correct and accurate.

c.Warehousing refers to _____?

A.Built-in integrity controls that help to maintain the data integrity.

B.A technique used for storing data from more than one database in a way that is secure, reliable and easy to retrieve.

C.The process in which you check whether the data is accurate.

D.A manual technique that can be used to make sure that the data on a database is correct and accurate.

d.Databases need to be protected from several different threats, including:

A.Data mining

B.Unauthorised access

C.Data redundancy

D.Audit trails

e.What does an audit trail record? [Hint: This question has more than one correct answer.]

A.Who made the changes?

B.Why were the changes made?

C.What did the user change?

D.When did they make the changes?

f.Explain what data corruption is.

A.When a hacker steals money from someone else using personalised data.

B.When someone says something that is not true about someone else on the internet.

C.A criminal action by someone in a position of power using private data.

D.Errors in computer data that occur during writing, reading, storage, transmission, or processing, which introduce unintended changes to the original data.

g.Which of the following threats are prevented by the validation of data?

A.Incorrect data entry

B.Data corruption

C.Accidental deletion

D.Data loss

1.7.2Explain in your own words why data is valuable.

1.7.3List three tools and techniques that can prevent accidental and purposeful data deletion.

1.7.4Explain how parallel data is used to detect and correct data corruption or loss

1.7.5Today, the entire world’s financial system is stored on databases that are connected to the internet. Whether you are an individual with R10 in your bank account or a massive company with billions of Rand in your account, your money is stored on a database connected to the internet. As you can imagine, this data is not only incredibly valuable, but also incredibly important to protect.

Answer the following questions.

a.List six techniques that can be used protect data.

b.Criminal hackers around the world are trying to break into banks’ databases. List two techniques that can be used to protect data from unauthorised access.

c.What is data integrity and why is data integrity critical for a bank’s database?

d.Do you think bank’s make backups of their databases? Give reasons for your answer.

CONSOLIDATION ACTIVITY Chapter 1: Database Management

1.Choose the correct answer.

a.Which of the following scenarios does NOT need a database?

A.Storing the credentials of all Gmail accounts.

B.Backing up all the information on your personal computer.

C.Storing all the webpages of a website.

D.Storing all the information about a business’s inventory.

b.Which of the following techniques is NOT used to protect data in databases?

A.Data encryption

B.Data validation

C.Data verification

D.Access control

c.Integrity control is used to check ______.

A.Unauthorised access, Data loss and Data corruption.

B.Data corruption, Incorrect data entry and Data loss.

C.Incorrect data entry, Data corruption and Purposeful deletion.

D.Purposeful deletion, Accidental deletion and Unauthorised access.

d.Which of the following techniques can be used to recover lost data?

A.Access control

B.Data verification

C.Data validation

D.Data warehousing

e.Which of the following people would financially benefit from data mining?

A.Online advertisers

B.Programmers

C.Database administrators

D.Security personnel

f.When you create a data warehouse you need to _____ the data.

A.Access, Analyse and Use

B.Find, Extract and Store

C.Access, Extract and Load

D.Extract, Transform and Load

g.In location-based data, which of the following is an example of non-static information?

A.The most popular Gautrain routes.

B.The current flight route of an aeroplane.

C.The location of a car.

D.The location of a hotel.

2.Indicate if the following statements are ‘True’ or ‘False’. Choose the answer and write ‘true’ or ‘false’ next to the question number Correct the statement if it is FALSE. Change the underlined word(s) to make the statement TRUE. (You may not simply use the word ‘NOT’ to change the statement.)

a.Full verification is when someone randomly selects a sample of data and manually checks it for systematic errors.

b.Gapminder provides free access to databases of South African matric results.

c.SMART strategies are Specific. Measurable. Achievable, Relevant and Time-bound.

3.Choose a term/concept from COLUMN B that matches a description in COLUMN A. Write only the letter next to the question number (e.g. 1-A).

4.Explain TWO ways RFIDS can be used to improve each of the following industries:

a.Retail stores

b.Train transportation systems.

c.Universities and colleges

d.Medical hospitals

5.Below are THREE development strategies based on different databases. State which of them are SMART strategies. Give a reason for your answer.

image

6.Other than banking, name ONE other industry that can make use of database transaction tracking. Also provide an example of a scenario in which they would use it.

7.Cherry wants to expand her small bakery business, by creating a website so that she can sell her goods online. Cherry will need to make use of a DBMS and she will need to find ways to secure her business’s data.

a.Who would Cherry need to hire to help her keep her create, maintain and secure her database.

b.Mention ONE way location-based technology can help Cherry expand her business.

c.Mention ONE way Cherry can use data mining to improve and expand her business.