How to Fix drop_duplicates Not Working in Pandas?


Do you want to know how to drop duplicates from a pandas DataFrame but find the drop_duplicates function not working in pandas when developing with Python? 🤔

Python is a high-level, general-purpose programming language used for a wide variety of applications. It is well-known for its readability and simple syntax, which makes it an ideal language for beginners.

Pandas is a data analysis library for Python that provides high-performance, easy-to-use data structures, and tools for data manipulation and analysis. It allows users to manipulate, clean, and analyze data in an intuitive way. It is often used in conjunction with other scientific computing libraries like NumPy and SciPy.

In this article, we will go through different ways to fix drop_duplicates not working in pandas. So without further ado, let’s dive deep into the topic.

 

 

What is drop_duplicates() used for?

The drop_duplicates() method in Pandas removes duplicate rows from a DataFrame. By default it compares all columns, but it can also take a subset of columns and use only those as the key for finding duplicate rows. If two or more rows hold the same values in every column being compared, one of them is kept and the others are removed. The method also lets you choose whether the first or the last occurrence of a duplicate row is kept.

 

Syntax of df.drop_duplicates() 

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

 

The drop_duplicates() method takes three commonly used optional parameters:

  1. subset specifies the columns to consider when identifying duplicates. By default, all columns are used.
  2. keep specifies which duplicate to keep: ‘first’ (the default), ‘last’, or False to drop every duplicated row.
  3. inplace specifies whether the DataFrame is modified directly (True) or a new DataFrame is returned (False, the default).

However, sometimes you may encounter issues when using this function, such as it not removing all the expected duplicates or returning an empty DataFrame.

 

 

What Are The Different Variations of drop_duplicates() in Python?

In Pandas, the following are three different ways to drop duplicates from a DataFrame.

  1. Dropping rows that are duplicated across all columns
  2. Dropping rows that are duplicated in a subset of columns
  3. Keeping the last duplicate instead of the default first occurrence

 

Method 1: Dropping Rows From Duplicate Rows

Code

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Robert', 'Martin', 'Henry', 'John', 'Robert'],
                   'Age': [41, 22, 24, 28, 41, 22],
                   'Designation': ['Manager', 'Consultant', 'Lead', 'Consultant',
                                   'Manager', 'Consultant']})

# Keep the first occurrence of each duplicated row and modify df directly
df.drop_duplicates(keep='first', inplace=True)

print(df)

 

Output

  Name    Age   Designation

0 John    41     Manager

1 Robert  22     Consultant

2 Martin  24     Lead

3 Henry   28     Consultant

 

The code above imports the Pandas library as pd and creates a data frame (df) that contains information about the names, ages, and designations of some people. Then, the code uses the drop_duplicates() function to remove duplicate rows from the data frame. The argument keep = ‘first’ specifies that only the first occurrence of the duplicate row should be kept and all other duplicate rows should be dropped. The argument inplace = True specifies that the changes should be made in the same DataFrame.

 

 

Method 2:  Dropping Rows From a Duplicate Subset of Columns

Code

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Paul', 'George', 'Ringo'],
                   'Age': [30, 25, 45, 35],
                   'Job': ['Engineer', 'Doctor', 'Lawyer', 'Musician']})

# Compare rows using only the Name and Age columns
df.drop_duplicates(subset=['Name', 'Age'], keep='first', inplace=True)

print(df)

 

Output

 Name     Age    Job

0 John    30     Engineer 

1 Paul    25     Doctor 

2 George  45     Lawyer 

3 Ringo   35     Musician

 

The above code imports the Pandas library and creates a DataFrame with three columns and four rows. It then drops any rows that have the same values in both the Name and Age columns. The keep=’first’ argument ensures that the first occurrence of a duplicate is kept while the rest are dropped. In this particular data no two rows share the same Name and Age, so all four rows remain when the DataFrame is printed.
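To see the subset parameter actually remove a row, here is a minimal sketch with made-up data in which John appears twice with the same age but a different job. The column names and values are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share Name and Age but differ in Job
df = pd.DataFrame({
    "Name": ["John", "Paul", "John", "Ringo"],
    "Age": [30, 25, 30, 35],
    "Job": ["Engineer", "Doctor", "Teacher", "Musician"],
})

# Comparing all columns finds no duplicates because the Job values differ
all_columns = df.drop_duplicates()

# Comparing only Name and Age treats rows 0 and 2 as duplicates
by_name_age = df.drop_duplicates(subset=["Name", "Age"], keep="first")

print(len(all_columns), len(by_name_age))
print(by_name_age)
```

With the default of comparing every column, the two John rows count as different rows. Restricting the comparison to Name and Age is what makes the second John row a duplicate.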

 

 

Method 3:  Keeping The Last Duplicate Instead of The Default First Occurrence

Code

import pandas as pd

df = pd.DataFrame({'Name': ['Justin', 'Ross', 'Mary', 'Henry', 'Justin', 'Ross'],
                   'Age': [41, 22, 24, 28, 41, 22],
                   'Designation': ['Manager', 'Consultant', 'Lead', 'Consultant',
                                   'Manager', 'Consultant']})

# Keep the last occurrence of each duplicated row
df.drop_duplicates(keep='last', inplace=True)

print(df)

 

Output

     Name  Age    Designation

2    Mary   24    Lead

3   Henry   28    Consultant

4   Justin  41    Manager

5   Ross    22    Consultant

 

The above code imports the Pandas library as “pd.” This library allows us to create and manipulate data frames, which are data structures used to store data in the form of columns and rows. In the code, we create a data frame with three columns (“Name,”  “Age,”  and “Designation“) and six rows. We then use the drop_duplicates() function to drop any duplicate rows, keeping the last one. The result is a data frame with four rows and three columns.

 

 

Ways to Fix When Pandas Drop_Duplicates() Method Is Not Working

The following are different scenarios in which the drop_duplicates() method of Pandas may not behave as expected, along with an example of how to handle each one.

 

 

1. Check Data Types of Columns Before Dropping Duplicates

Check the data types of the columns you are trying to drop the duplicates from. If they are not the same, then the DataFrame won’t recognize them as duplicates.

The drop_duplicates() function considers two rows to be duplicates if all their values are the same. If a column mixes data types, two values may be treated as different even though they look the same when printed. For example, the number 1 and the string “1” are considered different. To fix this issue, you can convert the column to a single data type using the astype() function.

 

Code

import pandas as pd

df = pd.DataFrame({"S.no": [1, 2, 3],
                   "Name": ["John", "John", "Jack"],
                   "Age": [20, 20, 30],
                   "Location": ["New York", "New York", "London"]})

# Check the data type of every column first
print(df.dtypes)

# S.no is unique in every row, so compare only the remaining columns
df = df.drop_duplicates(subset=["Name", "Age", "Location"])

print(df)

 

Output

S.no         int64
Name        object
Age          int64
Location    object
dtype: object

   S.no  Name  Age  Location
0     1  John   20  New York
2     3  Jack   30    London

First, we import the Pandas library as pd. Then, we create a DataFrame with four columns and three rows of data. Printing df.dtypes shows the data type of each column, which confirms that every column holds a single, consistent type. Because the S.no column is different in every row, comparing all columns would never find a duplicate, so we pass the other three columns as the subset. The second John row is then dropped, and the data types of the remaining columns are unchanged.
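The mixed-type case described in this section can be sketched as follows. The ID column and its values are hypothetical; the point is only that the integer 1 and the string "1" are not equal until the column is converted with astype().

```python
import pandas as pd

# Hypothetical data: the ID column mixes the integer 1 with the string "1"
df = pd.DataFrame({"ID": [1, "1", 2], "Name": ["John", "John", "Jack"]})

# A column holding mixed Python types is reported as object
print(df.dtypes)

# The first two rows look identical when printed but compare as different
before = df.drop_duplicates()

# Convert the column to one type, then drop duplicates again
df["ID"] = df["ID"].astype(int)
after = df.drop_duplicates()

print(len(before), len(after))
```

Note that astype(int) raises an error if a value cannot be converted, so on real data it may be worth inspecting the column before converting it.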

 

 

2. Fill in Missing Values in a Data Frame Before Dropping Duplicates

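Check the data frame for any missing values, since they can make drop_duplicates() behave differently from what you expect. Pandas treats missing values in the same column as equal to each other when it looks for duplicates, so NaN on its own rarely blocks the method. Problems usually appear when missing data is recorded inconsistently, for example as NaN in one row and as an empty string or “N/A” in another, because those values are not equal.

To fix this issue, fill the missing values with one consistent placeholder using fillna() before calling drop_duplicates(). If rows with missing values are not needed at all, you can remove them with dropna() instead.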

 

Code

import pandas as pd

data = {'Name': ['John', 'Paul', 'George', 'Ringo', 'John'],
        'Age': [20, 21, 20, 19, 20],
        'Instrument': [None, 'Guitar', 'Guitar', 'Drums', None]}

df = pd.DataFrame(data)

# Replace missing values with a single placeholder and assign the result back
df['Instrument'] = df['Instrument'].fillna('Unknown')

df.drop_duplicates(subset=['Name', 'Age', 'Instrument'], keep='first', inplace=True)

print(df)

 

Output

     Name  Age Instrument
0    John   20    Unknown
1    Paul   21     Guitar
2  George   20     Guitar
3   Ringo   19      Drums

 

The code above imports the Pandas library as pd and creates a data frame from the given data. It then fills any missing values in the “Instrument” column with the string “Unknown.” Finally, it drops any duplicates based on the “Name,” “Age,” and “Instrument” columns and keeps only the first occurrence. The final data frame is then printed.
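Missing data is not always stored as None or NaN. The sketch below assumes a hypothetical column in which the same gap was recorded three different ways (None, an empty string, and the text "N/A") and normalizes all of them to one placeholder before dropping duplicates.

```python
import pandas as pd

# Hypothetical data: three different ways of recording a missing instrument
df = pd.DataFrame({
    "Name": ["John", "John", "John", "Paul"],
    "Instrument": [None, "", "N/A", "Guitar"],
})

# None, "" and "N/A" are different values, so no row is dropped here
rows_before = len(df.drop_duplicates())

# Map every placeholder to the same value, then fill real missing values
df["Instrument"] = (
    df["Instrument"]
    .replace({"": "Unknown", "N/A": "Unknown"})
    .fillna("Unknown")
)

deduped = df.drop_duplicates()
print(rows_before, len(deduped))
```

Which strings count as "missing" depends on your data source, so the list of placeholders here is only an example.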

 

 

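3. Drop All Duplicates in Categorical Data With keep=False

If the data frame contains a column with categorical data and you only want the values that occur exactly once, try using the parameter keep=False. Unlike ‘first’ and ‘last’, which keep one row from each group of duplicates, keep=False removes every row whose value appears more than once.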

 

Code

import pandas as pd

data = {'name': ['John', 'Mary', 'John', 'Alice', 'Alice', 'Mary'],
        'score': [10, 8, 10, 5, 7, 8]}

df = pd.DataFrame(data)

# keep=False drops every row whose score appears more than once
print(df.drop_duplicates(subset="score", keep=False))

 

Output

   name  score 

3  Alice   5 

4  Alice   7

 

The code above drops the duplicates in the data frame based on the “score” column. If multiple rows share the same score, all of them are removed from the data frame. The argument keep=False means that no copy of a duplicated row is kept, so only the rows with the unique scores 5 and 7 remain.

 

 

4. Use “Inplace=True” to Directly Change DataFrame

By default, drop_duplicates() returns a new DataFrame and leaves the original unchanged. If you call it without saving the result, it looks as if the method did nothing. Using the inplace=True parameter ensures that the changes are made directly to the DataFrame and not to a copy of it.

 

Code

import pandas as pd 

data = {'Name':['John', 'Mary', 'Tim', 'John', 'Tim'],

 'Age':[20, 21, 19, 20, 19], 

'Gender':['Male', 'Female', 'Male', 'Male', 'Male']}

df = pd.DataFrame(data) 

df.drop_duplicates(inplace=True) 

print(df)

 

Output 

   Name  Age   Gender 

0  John   20   Male 

1  Mary   21   Female 

2  Tim    19   Male

 

The line import pandas as pd allows us to use the Pandas library. The code then creates a DataFrame from a dictionary called data. The drop_duplicates() method is then used to remove any duplicate rows from the DataFrame. This means that if several rows have the same values in all columns, only the first of them is kept. The argument inplace=True specifies that the changes should be made in the same data frame.

To fix the “drop_duplicates not working in pandas” problem in this scenario, either pass inplace=True or assign the returned DataFrame back to a variable, for example df = df.drop_duplicates(). Both approaches leave df without the duplicate rows.
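The difference between discarding the result and keeping it can be sketched with a small, made-up DataFrame. Assigning the result back to df is shown here as an alternative to inplace=True.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Mary", "John"], "Age": [20, 21, 20]})

# The returned DataFrame is thrown away, so df still has all three rows
df.drop_duplicates()
rows_after_discarded_call = len(df)

# Assigning the result back replaces df with the de-duplicated version
df = df.drop_duplicates()
rows_after_assignment = len(df)

print(rows_after_discarded_call, rows_after_assignment)
```

Assigning the result back also makes it easy to chain further calls such as reset_index(drop=True) after drop_duplicates().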

 

 

5. Check the Parameters of Drop_Duplicates()

Make sure you have specified the correct subset of columns to consider when identifying duplicates. By default, drop_duplicates() considers all columns, but you can specify a list of columns using the subset parameter.
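One way to check which columns to pass as the subset is to inspect the rows with duplicated(), which accepts the same subset and keep parameters. The sketch below uses hypothetical order data in which a unique OrderID hides otherwise identical rows.

```python
import pandas as pd

# Hypothetical data: OrderID is unique, the other columns repeat
df = pd.DataFrame({
    "OrderID": [101, 102, 103],
    "Customer": ["Ann", "Ann", "Bob"],
    "City": ["Paris", "Paris", "Rome"],
})

# With all columns compared, no row is flagged as a duplicate
mask_all = df.duplicated()

# With only Customer and City compared, both Ann rows are flagged
# (keep=False marks every member of a duplicate group)
mask_subset = df.duplicated(subset=["Customer", "City"], keep=False)

print(df[mask_subset])
```

Once the flagged rows look right, the same subset list can be passed to drop_duplicates().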

 

 

6. Check for Hidden Characters

Sometimes there may be hidden characters in your data, such as leading or trailing spaces, that cause otherwise identical values to be considered different. To fix this issue, you can use the str.strip() method on the affected text columns to remove leading and trailing whitespace before calling drop_duplicates().
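A minimal sketch of the whitespace problem, using a made-up City column in which the same name is stored with and without surrounding spaces:

```python
import pandas as pd

# Hypothetical data: the same city with stray spaces around it
df = pd.DataFrame({"City": ["London", "London ", " London"], "Visits": [1, 1, 1]})

# The three strings are different, so every row is kept
raw = df.drop_duplicates()

# Strip leading and trailing whitespace from the text column
df["City"] = df["City"].str.strip()
clean = df.drop_duplicates()

print(len(raw), len(clean))
```

str.strip() only handles whitespace at the start and end of a value. Other differences, such as inconsistent capitalization, would need their own cleanup step.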

 

 

7. Check the Order of The Rows

The drop_duplicates() function finds duplicates anywhere in the DataFrame, whether or not the rows are next to each other, but the order of the rows decides which copy survives. With keep=’first’ the earliest row in each group of duplicates is kept, and with keep=’last’ the latest one is kept. If the wrong row is being kept, you can use the sort_values() function to sort the DataFrame by the relevant columns before using drop_duplicates().
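Sorting before dropping duplicates can be sketched as follows. The example assumes hypothetical score data and the goal of keeping each person's highest score.

```python
import pandas as pd

# Hypothetical data: each name appears twice with different scores
df = pd.DataFrame({
    "Name": ["John", "Mary", "John", "Mary"],
    "Score": [70, 95, 88, 60],
})

# Without sorting, the first row seen for each name is kept
unsorted = df.drop_duplicates(subset="Name")

# Sort so the highest score comes first, then keep the first row per name
best = df.sort_values("Score", ascending=False).drop_duplicates(subset="Name")

print(unsorted)
print(best)
```

The sort column and direction are the design choice here: sorting ascending and using keep='last' would select the same rows.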

 

 

Additional Things to Note About the Pandas drop_duplicates() Function

It’s also worth noting that you can use the keep parameter of drop_duplicates() to specify which duplicate rows to keep. The keep parameter takes three values:

  • first: Keep the first occurrence of each set of duplicates (default)
  • last: Keep the last occurrence of each set of duplicates
  • False: Drop all duplicates

Here’s an example of how to use the keep parameter to keep the last occurrence of each set of duplicates:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'B': [1, 1, 2, 2, 3, 3]})

df = df.drop_duplicates(subset=['A', 'B'], keep='last')

print(df)

 

This will output the following DataFrame:

   A  B
1  1  1
3  2  2
5  3  3

 

As you can see, the last occurrence of each set of duplicates has been kept.

In conclusion, the keep parameter of drop_duplicates() allows you to specify which duplicate rows to keep when removing duplicates from a DataFrame. By using this parameter, you can customize the behavior of drop_duplicates() to suit your specific needs.

 

Conclusion

To summarize the article, the drop_duplicates method in Pandas can be used to remove duplicates from a DataFrame. However, sometimes the method does not work as expected. To fix this, it is important to understand the parameters of the method and to make sure the result is either assigned back to a variable or applied with inplace=True.

Additionally, it is important to pay attention to the data types in the data frame to ensure that the correct data is being compared. Finally, it is important to check the DataFrame for any null values or hidden characters that can cause the drop_duplicates method to give unexpected results.

By following these steps, you should be able to effectively use the drop_duplicates() function to remove duplicate rows from your DataFrames.

Let’s have a quick recap of the topics discussed in this article.

  1. What is drop_duplicates() used for in Python?
  2. Using Drop Duplicates In 3 Ways
  3. Dropping rows from duplicate rows
  4. Dropping rows from a duplicate subset of columns
  5. How to fix the error drop_duplicates not working in pandas?
  6. Keeping the last duplicate instead of the default first occurrence
  7. Fixing Different Errors Using Drop-Duplicates() Method

If you’ve found this article helpful, comment below 👇 and let us know which solutions have helped you solve the problem.

 
