Are you trying to drop duplicates from a pandas DataFrame in Python, only to find that drop_duplicates() does not seem to work? 🤔
Python is a high-level, general-purpose programming language used for a wide variety of applications. It is well-known for its readability and simple syntax, which makes it an ideal language for beginners.
Pandas is a data analysis library for Python that provides high-performance, easy-to-use data structures and tools for data manipulation and analysis. It allows users to manipulate, clean, and analyze data in an intuitive way. It is often used in conjunction with other scientific computing libraries like NumPy and SciPy.
In this article, we will go through different ways to fix the error drop_duplicates not working in pandas. So without further ado, let’s dive deep into the topic.
What is drop_duplicates() used for?
The drop_duplicates() method in Pandas is used to remove rows with duplicate values from the data frame. It takes the subset of columns as an argument and considers them the key to finding duplicate rows. If the same values are present in all the columns in the subset, then it will remove one row and keep the other one. This method also has the option of keeping the first or last occurrence of the duplicate row.
Syntax of df.drop_duplicates()
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
The drop_duplicates() method is used to remove duplicate rows from a DataFrame. It takes three optional parameters:
- subset is used to specify a subset of columns to consider when removing duplicates. By default, all columns are compared.
- keep is used to specify which duplicates to keep ('first', 'last', or False to drop every duplicated row).
- inplace is used to specify whether the changes should be made in place (True) or returned as a new DataFrame (False, the default).
However, sometimes you may encounter issues when using this function, such as it not removing all the expected duplicates or returning an empty DataFrame.
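To make the three parameters concrete, here is a minimal sketch. The column names and values are made up for illustration and are not taken from a real dataset.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical.
df = pd.DataFrame({
    "Name": ["John", "Robert", "John"],
    "Age": [41, 22, 41],
})

# Default call: compares all columns, keeps the first occurrence,
# and returns a new DataFrame (the original is left untouched).
deduped = df.drop_duplicates()

# subset + keep: compare on "Name" only and keep the last match.
last_by_name = df.drop_duplicates(subset=["Name"], keep="last")

print(len(df), len(deduped))        # 3 2
print(last_by_name.index.tolist())  # [1, 2]
```

Note that the original index labels are preserved, which is why the second result starts at index 1.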
What Are The Different Variations of drop_duplicates() in Python?
In Python, the following are three different ways to drop duplicates from a DataFrame.
- Dropping rows that are duplicated across all columns
- Dropping rows that are duplicated in a subset of columns
- Keeping the last duplicate instead of the default first occurrence
Method 1: Dropping Rows From Duplicate Rows
Code
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Robert', 'Martin', 'Henry', 'John', 'Robert'],
    'Age': [41, 22, 24, 28, 41, 22],
    'Designation': ['Manager', 'Consultant', 'Lead', 'Consultant', 'Manager', 'Consultant']
})
df.drop_duplicates(keep='first', inplace=True)
print(df)
Output
     Name  Age Designation
0    John   41     Manager
1  Robert   22  Consultant
2  Martin   24        Lead
3   Henry   28  Consultant
The code above imports the Pandas library as pd and creates a data frame (df) that contains information about the names, ages, and designations of some people. Then, the code uses the drop_duplicates() function to remove duplicate rows from the data frame. The argument keep = ‘first’ specifies that only the first occurrence of the duplicate row should be kept and all other duplicate rows should be dropped. The argument inplace = True specifies that the changes should be made in the same DataFrame.
Method 2: Dropping Rows From a Duplicate Subset of Columns
Code
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Paul', 'George', 'Ringo'],
                   'Age': [30, 25, 45, 35],
                   'Job': ['Engineer', 'Doctor', 'Lawyer', 'Musician']})
df.drop_duplicates(subset=['Name', 'Age'], keep='first', inplace=True)
print(df)
Output
     Name  Age       Job
0    John   30  Engineer
1    Paul   25    Doctor
2  George   45    Lawyer
3   Ringo   35  Musician
The above code imports the Pandas library and creates a DataFrame with three columns and four rows. It then drops any rows that have the same values in the Name and Age columns as an earlier row. The keep='first' argument ensures that the first occurrence of a duplicate is kept while the rest are dropped. In this particular data set, no two rows share the same Name and Age, so nothing is removed and the printed data frame matches the input.
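To see subset actually remove something, here is a small variation on the data above. It is only a sketch with made-up values: rows 0 and 2 share Name and Age but differ in Job.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 match on Name and Age, but not on Job.
df = pd.DataFrame({
    "Name": ["John", "Paul", "John", "Ringo"],
    "Age": [30, 25, 30, 35],
    "Job": ["Engineer", "Doctor", "Manager", "Musician"],
})

# Comparing all columns finds no duplicates, because Job differs.
full = df.drop_duplicates()

# Comparing only Name and Age treats row 2 as a duplicate of row 0.
by_subset = df.drop_duplicates(subset=["Name", "Age"], keep="first")

print(len(full))                 # 4
print(by_subset.index.tolist())  # [0, 1, 3]
```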
Method 3: Keeping The Last Duplicate Instead of The Default First Occurrence
Code
import pandas as pd

df = pd.DataFrame({
    'Name': ['Justin', 'Ross', 'Mary', 'Henry', 'Justin', 'Ross'],
    'Age': [41, 22, 24, 28, 41, 22],
    'Designation': ['Manager', 'Consultant', 'Lead', 'Consultant', 'Manager', 'Consultant']
})
df.drop_duplicates(keep='last', inplace=True)
print(df)
Output
     Name  Age Designation
2    Mary   24        Lead
3   Henry   28  Consultant
4  Justin   41     Manager
5    Ross   22  Consultant
The above code imports the Pandas library as “pd.” This library allows us to create and manipulate data frames, which are data structures used to store data in the form of columns and rows. In the code, we create a data frame with three columns (“Name,” “Age,” and “Designation“) and six rows. We then use the drop_duplicates() function to drop any duplicate rows, keeping the last one. The result is a data frame with four rows and three columns.
Ways to Fix When Pandas Drop_Duplicates() Method Is Not Working
The following are common scenarios in which drop_duplicates() appears not to work, along with a fix for each and, where possible, an example.
1. Check Data Types of Columns Before Dropping Duplicates
Check the data types of the columns you are deduplicating on. If the values in two rows do not have the same type, pandas won't recognize the rows as duplicates.
The drop_duplicates() function considers two rows to be duplicates only if all their compared values are equal. Values that look the same but have different data types are treated as different. For example, the number 1 and the string "1" are considered different. To fix this issue, you can convert the affected column to a single data type using the astype() function before dropping duplicates.
Code
import pandas as pd

df = pd.DataFrame({"S.no": [1, 2, 3],
                   "Name": ["John", "John", "Jack"],
                   "Age": [20, 20, 30],
                   "Location": ["New York", "New York", "London"]})
print(df.dtypes)
# S.no differs in every row, so compare only the remaining columns
df = df.drop_duplicates(subset=["Name", "Age", "Location"])
print(df)
Output
S.no         int64
Name        object
Age          int64
Location    object
dtype: object
   S.no  Name  Age  Location
0     1  John   20  New York
2     3  Jack   30    London
First, we import the Pandas library as pd. Then, we create a DataFrame with four columns and three rows of data. The print(df.dtypes) call shows the data type of each column, which lets us confirm that the values in each column are compared like for like. Because S.no is different in every row, comparing all columns would find no duplicates, so we pass the remaining columns as subset. The second "John" row is then dropped, and the data types of the columns remain unchanged.
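The data type problem itself can be sketched as follows. The id and city columns are hypothetical; the id column deliberately mixes the integer 1 with the string "1", as can happen when data is merged from several sources.

```python
import pandas as pd

# Hypothetical data: "id" mixes an integer 1 and a string "1".
df = pd.DataFrame({
    "id": [1, "1", 2],
    "city": ["London", "London", "Paris"],
})

# The integer 1 and the string "1" are not equal, so nothing is dropped.
before = df.drop_duplicates()
print(len(before))  # 3

# Convert the column to a single type, then drop duplicates again.
df["id"] = df["id"].astype(str)
after = df.drop_duplicates()
print(len(after))   # 2
```

Converting to str is only one option; if the column is meant to be numeric, converting to an integer type would work just as well.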
2. Fill in Missing Values in a Data Frame Before Dropping Duplicates
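Check the data frame for any missing values. The drop_duplicates() function treats missing values (NaN or None) in the same column as equal to each other, so rows that match everywhere else are still recognized as duplicates. Problems usually arise when missing data is represented inconsistently, for example NaN in one row and an empty string or a placeholder such as "N/A" in another. To fix this issue, you can fill the missing values with one consistent value using the fillna() function, or remove the affected rows with the dropna() function, before using drop_duplicates().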
Code
import pandas as pd

data = {'Name': ['John', 'Paul', 'George', 'Ringo', 'John'],
        'Age': [20, 21, 20, 19, 20],
        'Instrument': [None, 'Guitar', 'Guitar', 'Drums', None]}
df = pd.DataFrame(data)
# Assign the result back instead of calling fillna(inplace=True) on a column
df['Instrument'] = df['Instrument'].fillna('Unknown')
df.drop_duplicates(subset=['Name', 'Age', 'Instrument'], keep='first', inplace=True)
print(df)
Output
     Name  Age Instrument
0    John   20    Unknown
1    Paul   21     Guitar
2  George   20     Guitar
3   Ringo   19      Drums
The code above imports the Pandas library as pd. Then, it creates a data frame from the given data. After that, it fills in any missing values in the "Instrument" column with the string "Unknown". Finally, it drops any duplicates based on the "Name", "Age", and "Instrument" columns and keeps only the first occurrence. The final data frame is then printed.
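The following sketch shows how missing values interact with drop_duplicates(). It uses made-up data and assumes that "N/A" is a placeholder someone typed in by hand rather than a real missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two real NaN values and one hand-typed "N/A" placeholder.
df = pd.DataFrame({
    "Name": ["John", "John", "John"],
    "Instrument": [np.nan, np.nan, "N/A"],
})

# The two NaN rows count as duplicates of each other, but "N/A" does not.
before = df.drop_duplicates()
print(len(before))  # 2

# Map both the NaN values and the placeholder to one consistent label.
df["Instrument"] = df["Instrument"].fillna("Unknown").replace("N/A", "Unknown")
after = df.drop_duplicates()
print(len(after))   # 1
```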
3. Drop All Duplicates in Categorical Data With keep=False
If the data frame contains a column with categorical data and you only want the values that occur exactly once, use the parameter keep=False. This drops every row whose value is duplicated instead of keeping one instance of each category.
Code
import pandas as pd

data = {'name': ['John', 'Mary', 'John', 'Alice', 'Alice', 'Mary'],
        'score': [10, 8, 10, 5, 7, 8]}
df = pd.DataFrame(data)
# Store and print the result; the call alone does not modify df
df = df.drop_duplicates(subset="score", keep=False)
print(df)
Output
    name  score
3  Alice      5
4  Alice      7
The code above drops the duplicates in the data frame based on the "score" column. This means that if multiple rows have the same score, all of them are removed from the data frame. The argument keep=False means that no copy of a duplicated row is kept, so only the rows with a unique score remain.
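The difference between keep='first' and keep=False is easy to miss, so here is a short side-by-side sketch using the same data as above.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Mary", "John", "Alice", "Alice", "Mary"],
    "score": [10, 8, 10, 5, 7, 8],
})

# keep="first": one row survives for every distinct score.
one_per_score = df.drop_duplicates(subset="score", keep="first")

# keep=False: only scores that appear exactly once survive.
unique_only = df.drop_duplicates(subset="score", keep=False)

print(one_per_score["score"].tolist())  # [10, 8, 5, 7]
print(unique_only["score"].tolist())    # [5, 7]
```

If the goal is one row per category, keep='first' or keep='last' is usually the better fit.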
4. Use “Inplace=True” to Directly Change DataFrame
Remember that drop_duplicates() returns a new DataFrame and leaves the original unchanged by default. If you call it without using the return value, it will look as if the method is not working. Either assign the result back to a variable (df = df.drop_duplicates()) or pass inplace=True so that the changes are made directly to the DataFrame.
Code
import pandas as pd

data = {'Name': ['John', 'Mary', 'Tim', 'John', 'Tim'],
        'Age': [20, 21, 19, 20, 19],
        'Gender': ['Male', 'Female', 'Male', 'Male', 'Male']}
df = pd.DataFrame(data)
df.drop_duplicates(inplace=True)
print(df)
Output
   Name  Age  Gender
0  John   20    Male
1  Mary   21  Female
2   Tim   19    Male
The statement import pandas as pd allows us to use the Pandas library. The code then creates a DataFrame from a dictionary called data. The drop_duplicates() method is then used to remove any duplicate rows from the DataFrame. This means that if any rows have the same values in all columns, the repeats are removed, leaving only unique rows. The argument inplace=True specifies that the changes should be made in the same data frame.
In short, to fix the "drop_duplicates not working in pandas" problem in this scenario, either pass inplace=True or assign the returned DataFrame back to your variable. With inplace=True, the method modifies the DataFrame directly and returns None.
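The sketch below contrasts the three call styles on a small made-up DataFrame: discarding the result, assigning it back, and using inplace=True.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Mary", "John"], "Age": [20, 21, 20]})

# 1. Result discarded: df still has all three rows.
df.drop_duplicates()
rows_after_discard = len(df)
print(rows_after_discard)  # 3

# 2. Result assigned back: the duplicate is gone.
df = df.drop_duplicates()
print(len(df))             # 2

# 3. inplace=True modifies the DataFrame and returns None,
#    so do not assign the return value to anything you need.
other = pd.DataFrame({"Name": ["Tim", "Tim"], "Age": [19, 19]})
result = other.drop_duplicates(inplace=True)
print(result, len(other))  # None 1
```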
5. Check the Parameters of Drop_Duplicates()
Make sure you have specified the correct subset of columns to consider when identifying duplicates. By default, drop_duplicates() considers all columns, but you can specify a list of columns using the subset parameter.
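One way to check whether the chosen subset is the culprit is to inspect the rows with the related duplicated() method before dropping anything. This is a sketch with made-up data in which a running number column makes every row unique.

```python
import pandas as pd

# Hypothetical data: "S.no" is different in every row.
df = pd.DataFrame({
    "S.no": [1, 2, 3],
    "Name": ["John", "John", "Jack"],
    "Age": [20, 20, 30],
})

# keep=False marks every member of a duplicate group as True.
mask_all = df.duplicated(keep=False)
mask_subset = df.duplicated(subset=["Name", "Age"], keep=False)

print(mask_all.tolist())     # [False, False, False]
print(mask_subset.tolist())  # [True, True, False]
```

If the first mask is all False while the second is not, a column outside the intended key is what keeps the rows apart.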
6. Check for Hidden Characters
Sometimes hidden characters in your data, such as leading or trailing whitespace, cause values that look identical to be considered different. To fix this issue, you can use the str.strip() method on the affected string columns to remove leading and trailing whitespace before calling drop_duplicates().
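A minimal sketch of the whitespace problem, using a hypothetical City column:

```python
import pandas as pd

# Hypothetical data: three spellings of "London" that differ only in whitespace.
df = pd.DataFrame({"City": ["London", "London ", " London", "Paris"]})

# The stray spaces make every value distinct.
before = df.drop_duplicates()
print(len(before))  # 4

# Strip leading and trailing whitespace, then drop duplicates.
df["City"] = df["City"].str.strip()
after = df.drop_duplicates()
print(after["City"].tolist())  # ['London', 'Paris']
```

Differences in letter case would need a separate step such as str.lower(), which is not shown here.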
7. Check the Order of The Rows
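By default, the drop_duplicates() function keeps the first occurrence of each duplicate row and removes the later ones, wherever they appear in the DataFrame. The duplicates do not need to be adjacent to be found. Row order only decides which of the duplicates is kept. If you need a specific row to survive, for example the most recent record, use the sort_values() function to sort the DataFrame first, and then call drop_duplicates() with the matching keep value.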
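The sketch below illustrates how row order affects the result. The user and updated columns are hypothetical, with updated standing in for a timestamp or version counter.

```python
import pandas as pd

# Hypothetical data: each user appears twice, in non-adjacent rows.
df = pd.DataFrame({
    "user": ["a", "b", "a", "b"],
    "updated": [3, 1, 2, 4],
})

# Non-adjacent duplicates are still found; the first of each is kept.
first_seen = df.drop_duplicates(subset="user")
print(first_seen.index.tolist())  # [0, 1]

# Sort so the newest record comes last, then keep the last occurrence.
latest = (
    df.sort_values("updated")
      .drop_duplicates(subset="user", keep="last")
)
print(dict(zip(latest["user"], latest["updated"])))  # {'a': 3, 'b': 4}
```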
Additional Things to Note About the Pandas drop_duplicates() Function
It's also worth noting that you can use the keep parameter of drop_duplicates() to specify which duplicate rows to keep. The keep parameter takes three values:
- first: Keep the first occurrence of each set of duplicates (default)
- last: Keep the last occurrence of each set of duplicates
- False: Drop all duplicates
Here's an example of how to use the keep parameter to keep the last occurrence of each set of duplicates:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [1, 1, 2, 2, 3, 3]})
df = df.drop_duplicates(subset=['A', 'B'], keep='last')
print(df)
This will output the following DataFrame:
   A  B
1  1  1
3  2  2
5  3  3
As you can see, the last occurrence of each set of duplicates has been kept.
In conclusion, the keep parameter of drop_duplicates() allows you to specify which duplicate rows to keep when removing duplicates from a DataFrame. By using this parameter, you can customize the behavior of drop_duplicates() to suit your specific needs.
Conclusion
To summarize the article, the drop_duplicates method in Pandas can be used to remove duplicates from a DataFrame. However, sometimes the method does not work as expected. To fix this, it is important to understand the parameters of the method and to remember that it returns a new DataFrame unless inplace=True is set.
Additionally, it is important to pay attention to the data types in the data frame to ensure that the correct data is being compared. Finally, it is important to check the DataFrame for missing values and hidden characters that can keep duplicate rows from being detected.
By following these steps, you should be able to effectively use the drop_duplicates() function to remove duplicate rows from your DataFrames.
Let’s have a quick recap of the topics discussed in this article.
- What is drop_duplicates() used for in Python?
- Using drop_duplicates() in 3 ways
- Dropping rows from duplicate rows
- Dropping rows from a duplicate subset of columns
- Keeping the last duplicate instead of the default first occurrence
- How to fix the error drop_duplicates not working in pandas?
- Fixing different errors using the drop_duplicates() method
If you've found this article helpful, comment below 👇 and let us know which solutions have helped you solve the problem.