How to reduce duplicate elements in a Pandas DataFrame using Python

I am working with a DataFrame that looks like this:

A                       B           C       D       E   F   G   H
ctg.s1.000000F_arrow    CDS gene    21215   22825   0       .   DAFEIOHN_00017
ctg.s1.000000F_arrow    CDS gene    21215   22825   0       .   DAFEIOHN_00017
ctg.s1.000000F_arrow    CDS gene    64501   66033   0   -   .   DAFEIOHN_00049
ctg.s1.000000F_arrow    CDS gene    70234   78846   0       .   DAFEIOHN_00053
ctg.s1.000000F_arrow    CDS gene    103455  106526  0       .   DAFEIOHN_00074
ctg.s1.000000F_arrow    CDS gene    161029  161712  0       .   DAFEIOHN_00132
ctg.s1.000000F_arrow    CDS gene    170711  171520  0       .   DAFEIOHN_00142
ctg.s1.000000F_arrow    CDS gene    203959  204450  0   -   .   DAFEIOHN_00174
ctg.s1.000000F_arrow    CDS gene    211381  212196  0       .   DAFEIOHN_00184
ctg.s1.000000F_arrow    CDS gene    236673  238499  0       .   DAFEIOHN_00209
ctg.s1.000000F_arrow    CDS gene    533077  533850  0       .   DAFEIOHN_00475
ctg.s1.000000F_arrow    CDS gene    533995  535194  0       .   DAFEIOHN_00572
ctg.s1.000000F_arrow    CDS gene    641146  643083  0       .   DAFEIOHN_00572

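As you can see, column H contains duplicated elements, for example DAFEIOHN_00017 and DAFEIOHN_00572. I would like to modify this DataFrame to get something like this: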

A                       B           C       D       E   F   G   H                I
ctg.s1.000000F_arrow    CDS gene    21215   22825   0       .   DAFEIOHN_00017   2
ctg.s1.000000F_arrow    CDS gene    64501   66033   0   -   .   DAFEIOHN_00049   1
ctg.s1.000000F_arrow    CDS gene    70234   78846   0       .   DAFEIOHN_00053   1
ctg.s1.000000F_arrow    CDS gene    103455  106526  0       .   DAFEIOHN_00074   1
ctg.s1.000000F_arrow    CDS gene    161029  161712  0       .   DAFEIOHN_00132   1
ctg.s1.000000F_arrow    CDS gene    170711  171520  0       .   DAFEIOHN_00142   1
ctg.s1.000000F_arrow    CDS gene    203959  204450  0   -   .   DAFEIOHN_00174   1
ctg.s1.000000F_arrow    CDS gene    211381  212196  0       .   DAFEIOHN_00184   1
ctg.s1.000000F_arrow    CDS gene    236673  238499  0       .   DAFEIOHN_00209   1
ctg.s1.000000F_arrow    CDS gene    533077  533850  0       .   DAFEIOHN_00475   1
ctg.s1.000000F_arrow    CDS gene    533995  535194  0       .   DAFEIOHN_00572   2

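In the second DataFrame, each duplicated element appears only once, and there is a new column I that gives the number of times each element occurs in column H.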

How can I do this?

Thank you.

A uj5u.com user replied:

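You can use drop_duplicates to remove the rows that repeat a value in a particular column (by default it keeps the first occurrence), and use assign to create a new column from groupby('H') combined with transform('count'), which gives the number of occurrences of each unique value in H: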

df = df.drop_duplicates(subset='H').assign(I=df.groupby('H')['H'].transform('count'))

Output:

>>> df
                       A         B       C       D  E  F  G               H  I
0   ctg.s1.000000F_arrow  CDS-gene   21215   22825  0     .  DAFEIOHN_00017  2
2   ctg.s1.000000F_arrow  CDS-gene   64501   66033  0  -  .  DAFEIOHN_00049  1
3   ctg.s1.000000F_arrow  CDS-gene   70234   78846  0     .  DAFEIOHN_00053  1
4   ctg.s1.000000F_arrow  CDS-gene  103455  106526  0     .  DAFEIOHN_00074  1
5   ctg.s1.000000F_arrow  CDS-gene  161029  161712  0     .  DAFEIOHN_00132  1
6   ctg.s1.000000F_arrow  CDS-gene  170711  171520  0     .  DAFEIOHN_00142  1
7   ctg.s1.000000F_arrow  CDS-gene  203959  204450  0  -  .  DAFEIOHN_00174  1
8   ctg.s1.000000F_arrow  CDS-gene  211381  212196  0     .  DAFEIOHN_00184  1
9   ctg.s1.000000F_arrow  CDS-gene  236673  238499  0     .  DAFEIOHN_00209  1
10  ctg.s1.000000F_arrow  CDS-gene  533077  533850  0     .  DAFEIOHN_00475  1
11  ctg.s1.000000F_arrow  CDS-gene  533995  535194  0     .  DAFEIOHN_00572  2
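To try the one-liner without the full table, here is a minimal sketch on a cut-down frame. The sample frame is an assumption made for illustration: it keeps only columns C, D and H and five of the rows from the question.

```python
import pandas as pd

# Hypothetical miniature of the question's table (columns C, D and H only).
df = pd.DataFrame({
    "C": [21215, 21215, 64501, 533995, 641146],
    "D": [22825, 22825, 66033, 535194, 643083],
    "H": ["DAFEIOHN_00017", "DAFEIOHN_00017", "DAFEIOHN_00049",
          "DAFEIOHN_00572", "DAFEIOHN_00572"],
})

# For every row, count how many rows share its H value.
# transform keeps the original index, so the result lines up with df.
counts = df.groupby("H")["H"].transform("count")

# Keep the first row for each H value; assign aligns counts by index label,
# so each surviving row receives the count computed on the full frame.
result = df.drop_duplicates(subset="H").assign(I=counts)
print(result)
```

The counts are computed on the original frame before the duplicates are dropped, which is why the surviving rows still report 2 where a value was repeated.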

Another uj5u.com user replied:

We can use a groupby like this and count the elements:

df.groupby('H').count()
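Note that groupby('H').count() returns one row per H value holding the non-null count of every other column, rather than the original rows plus a single count column. A possible way to get from there to the layout asked for in the question is sketched below; the sample frame and the use of size() with merge are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical miniature of the question's table (columns C, D and H only).
df = pd.DataFrame({
    "C": [21215, 21215, 64501, 533995, 641146],
    "D": [22825, 22825, 66033, 535194, 643083],
    "H": ["DAFEIOHN_00017", "DAFEIOHN_00017", "DAFEIOHN_00049",
          "DAFEIOHN_00572", "DAFEIOHN_00572"],
})

# count() gives one row per H value with a non-null count for each column.
per_column = df.groupby("H").count()

# size() gives a single row count per group; reset_index names that column I.
occurrences = df.groupby("H", sort=False).size().reset_index(name="I")

# Attach the counts to the first row of each H value.
result = df.drop_duplicates(subset="H").merge(occurrences, on="H")
print(result)
```

Unlike the assign approach, merge returns a fresh 0-based index, so the original row labels are not preserved.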
