Behind AI #2: What Is a Database and Why Is Important in AI
Here's everything you need to know about databases
Hi!
Data is an important resource in AI. Companies store data in different types of databases and today we’ll learn everything about them (in the next article, we’ll focus on a type of database that is specially used in AI, so stay tuned!)
No matter what your job is probably you’ve ever heard of the word database.
Companies out there use different types of databases to store all the information they collected throughout the years. Although all these databases might seem the same, they have some functionalities that make them more suitable for certain situations, so it’s worth learning more about them.
In this article, we’ll see what databases are and the most commonly used in companies.
Table of Contents
1. What is a Database?
2. Types of Databases
- Relational databases
- Non-Relational databases
3. Top databases
What is a Database?
A database is a collection of data typically stored electronically in a computer system and controlled by a database management system (DBMS). The data, the DBMS, and the applications associated with them are referred to as a database system (or just “database”).
The words database management system and database are often used interchangeably, but technically they’re not the same.
To distinguish them consider the case of a social media app that stores different information about its users such as messages, photos, comments, etc. The database stores this big collection of data, but that’s pretty much what it does. If you want to edit, update, or delete data, you need a DBMS that does the talking for you. Some of the most popular DBMS are Oracle, MySQL, SQL Server, and PostgreSQL.
Again, people often refer to both the DBMS and DB simply as “database” but now you know how this actually works.
Another way to think of a database is as a big spreadsheet with many rows and columns. That’s a good comparison, but a database goes beyond that. Both the database and spreadsheet are good for storing information, but they mainly differ in the following aspects:
How the data is stored and manipulated: Databases allow complex data manipulation, while spreadsheets aren't meant for users who need much data manipulation.
Who can access the data: Databases allow multiple users to quickly access and query the data, while spreadsheets were designed only for a single user or a small number of users.
The amount of data that can be stored: Databases are designed to store larger collections of data, while spreadsheets have a limitation.
Last but not least, a database cannot only store data in tables and rows. That’s how relational databases typically work, but there’s also another type of database called non-relational database. This leads us to our next point.
Types of Databases
Databases are typically divided into relational and non-relational databases. Among the top 10 databases, you’ll see both relational and non-relational databases. One type of database is not better than the other, but they suit different needs.
Relational databases
A relational database (aka SQL database), stores data in tables and rows also referred to as records. This type of database links information from different tables through keys.
A key is a unique value in a table that is also known as the “primary key”. When this key is added to a record located in another table, it’s called “foreign key” in this second table. This connection between primary and foreign keys creates a relationship between records within both tables.
Some popular relational database management systems (RDBMS) are Oracle, MySQL, SQL Server, and PostgreSQL.
Here’s a basic schema that shows how a relational database works.
To query data in a RDBMS, we use Structured Querying Language (SQL). With SQL we can create new records, update them, and more. This makes the RDBMS good for apps that need transactional functionality, data mining, and complex reporting.
Non-Relational databases
A non-relational database (aka NoSQL database), stores data without tables, rows, or keys. In other words, a non-relational database stores data in a non-tabular form. This adds some flexibility and helps satisfy specific requirements of the type of data being stored.
You can think of a non-relational database as a collection of documents. A document can contain a lot of detailed information about a customer. Each customer can have different types of information, but they can be stored in the same document.
The ability to process and organize different types of information makes non-relational databases more flexible than relational databases.
There are four popular non-relational types: document data store, column-oriented database, key-value store, and graph database. One of the most popular NoSQL databases is MongoDB.
What are the top databases?
It’s hard to rank the database based on their functionality because they suit different needs and can be more convenient in certain scenarios than in others. That said, it’s possible to rank the database management systems according to their popularity.
In fact, DB-Engine ranks DBMS by their current popularity. To do so, they calculate scores following different parameters.
Here are the top 10 databases by their popularity.
Now let’s see more about them, compare them, and see their pros and cons.
1. Oracle
Oracle Database is a widely used RDBMS across industries. In fact, it has the largest market share of around 30.2% in the RDBMS market.
Oracle Database supports SQL language to interact with the database. It’s considered one of the best databases because it supports all data types involving relational, graph, structured, and unstructured information. In addition to that, Oracle Database is preferred for its flexible standards, scalability, high availability, and strong security.
Pros
It’s highly compatible with different apps and platforms
Helps with scalability
It offers good privacy and security
Cons
The license is expensive
Users might need extensive SQL knowledge to use Oracle Database
Popularity
Google Trends shows more interest in Oracle than in MySQL over the past 5 years. The graph also reveals the same ups and downs for both databases.
2. MySQL
MySQL is one of the most popular databases. It’s open-source so any person or company can use MySQL for free, but if the code needs to be integrated into a commercial application, you need to purchase a license. That said, this database is still worth it for anyone who wants to experiment with a friendly yet powerful database.
MySQL was developed by Oracle and it’s a relational database management system. As explained before, the relation model consists in organizing data in tables with rows and columns, while the relationship between elements follows a logical structure. Companies such as Facebook, Twitter, Wikipedia, and YouTube employ MySQL backends.
Pros
It’s open source: Unlike other options, you don’t have to pay to use most features of MySQL
It’s cross-platform: Runs on Linus, Solaris, and Windows and supports platforms with programming languages such as C, C++, Java, Python, etc.
Reliable data security: MySQL is known for being a secure database management system. This is why so many well-known companies use it in their applications.
It’s easy to use: Anyone can download, install and start to use MySQL in a few minutes.
Cons
It’s not for large-sized data
It doesn’t support SQL check constraints
It doesn’t have a good debugging tool compared to paid databases
It doesn’t handle transactions very efficiently
Popularity
Google Trends shows that the interest in MySQL has slightly decreased over the past 5 years, but has suddenly risen compared to other databases like SQL Server.
3. SQL Server
SQL Server was developed by Microsoft and it’s considered a great RDBMS for both on-premise and cloud environments. It has a Database Engine component that allows storing, processing, and securing data. The database engine is divided into two segments — the relational and storage engine. The first is used to process commands and queries, while the second is used to manage features such as tables, pages, files, indexes, and transactions.
Besides SQL language, SQL Server also includes Transact-SQL (T-SQL), which is Microsoft’s extension to the SQL used to interact with relational databases. SQL Server is a good option for businesses that want to scale the performance, availability, and security seamlessly based on their requirements.
Pros
It has various supported editions (enterprise, standard, express, and developer). The express SQL server edition is free of cost.
It has an online documentation
On-premise and cloud database support
It offers different tools and apps
Cons
Expensive enterprise edition
It’s available for Windows, Linux, and macOS, but the steps to install it on a Mac aren’t as straightforward as on a Windows machine.
Popularity
Google Trends shows more interest over time in SQL Server than in PostgreSQL. In fact, the interest in PostgreSQL hasn’t changed that much over the past 5 years.
4. PostgreSQL
PostgreSQL is known as the world’s most advanced open source object-relational database management system (ORDBMS). Part of this reputation is due to its architecture, reliability, robustness, and extensibility.
PostgreSQL comes with many features that help build apps, protect data integrity and help manage data no matter how big or small the data is. It’s also highly extensible in many areas. To name a few:
Stored functions and procedures
PL/PGSQL, Perl, Python
SQL/JSON path expressions
Additional functionality such as PostGIS (spatial database extender for PostgreSQL)
These extensions help us process data right from PostgreSQL, so we don’t need to find workarounds to implement them.
Pros
It’s extremely programmable: You can extend PostgreSQL thanks to its directory-based operation and dynamic loading
It’s highly extensible
It has a very rich set of indexing options
Cons
Performance: PostgreSQL is sometimes less efficient than other RDBMS such as MySQL (at least for simple intensive reading operations)
It might be difficult to troubleshoot PostgreSQL
Popularity
Google Trends shows similar interest over time in PostgreSQL and MongoDB. That said, we should consider that PostgreSQL was initially released in 1996, while MongoDB was released in 2009.
5. MongoDB
MongoDB is an open-source document database that uses a flexible schema for storing data. Unlike SQL databases that store data in tables of rows and columns, NoSQL database programs like MongoDB use JSON-like documents with optional schemas.
MongoDB is great for those who build internet and business applications and need to evolve and scale quickly. Some of the advantages of MongoDB for developers are the power of document-oriented databases (documents can be retrieved directly in JSON format, which developers find easy to work with), user experience, scalability and transactionality, and its thriving community.
Overall, MongoDB is good if you’re looking for a database that:
Supports rapid iterative development
Enables the scale to high levels of read and write traffic
Stores, manages, and searches data when creating apps
Pros
It offers a flexible schema that it’s not possible to get in a RDBMS
Scalability: MongoDB uses sharding, which allows the database to use horizontal scalability.
It’s free and supports Windows, macOS, and Linux
Cons
High memory usage: The data size in MongoDB is higher than in other databases
Less flexibility with querying: It fails to support joins as a relational database
Very good written refresher on databases. Thank you!